lamp.data.bytesegmentencoding

Greedy contraction of consecutive n-grams
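
As a rough illustration of what greedy contraction can look like, the sketch below walks a byte array and, at each position, replaces the longest segment found in a table of (segment, token) pairs with its token. It is a minimal sketch under stated assumptions, not the package's implementation: `GreedySegmentSketch`, `encodeGreedy`, and `decodeGreedy` are hypothetical names, and the longest-match-first rule is an assumption about how "greedy" is meant here. Only the table shape `Vector[(Vector[Byte], Char)]` and the unknown token/byte fallbacks are taken from the signatures on this page.

```scala
// Hypothetical helper, not part of lamp: greedy longest-match segment encoding.
object GreedySegmentSketch {

  // Replaces each matched byte segment with its token, longest segment first.
  // Bytes that start no known segment are emitted as `unknownToken`.
  def encodeGreedy(
      corpus: Array[Byte],
      table: Vector[(Vector[Byte], Char)],
      unknownToken: Char
  ): Array[Char] = {
    val lookup: Map[Vector[Byte], Char] = table.toMap
    val maxLen = if (table.isEmpty) 1 else table.map(_._1.length).max
    val out = Array.newBuilder[Char]
    var i = 0
    while (i < corpus.length) {
      // Try the longest candidate segment first, then shorter ones.
      var len = math.min(maxLen, corpus.length - i)
      var matched = false
      while (len >= 1 && !matched) {
        val segment = corpus.slice(i, i + len).toVector
        lookup.get(segment) match {
          case Some(token) =>
            out += token
            i += len
            matched = true
          case None =>
            len -= 1
        }
      }
      if (!matched) {
        // No segment starts here: emit the unknown token and skip one byte.
        out += unknownToken
        i += 1
      }
    }
    out.result()
  }

  // Expands each token back to its segment; unseen tokens become `unknownByte`.
  def decodeGreedy(
      encoded: Array[Char],
      table: Vector[(Vector[Byte], Char)],
      unknownByte: Byte
  ): Array[Byte] = {
    val reverse: Map[Char, Vector[Byte]] = table.map(_.swap).toMap
    encoded.flatMap(c => reverse.getOrElse(c, Vector(unknownByte)))
  }
}
```

With a table that maps the bytes of "ab" to token 1 and "a" to token 2, the input "abac" is encoded in this sketch as token 1, token 2, and the unknown token, because no segment in the table starts with "c".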


Type members

Classlikes

case class ByteSegmentCodec(trained: Vector[(Vector[Byte], Char)], unknownToken: Char, unknownByte: Byte) extends Codec


Supertypes
trait Serializable
trait Product
trait Equals
trait Codec
class Object
trait Matchable
class Any
case class ByteSegmentCodecFactory(vocabularyMin: Char, vocabularyMax: Char, maxMergedSegmentLength: Int, unknownToken: Char, unknownByte: Byte) extends CodecFactory[ByteSegmentCodec]


Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Value members

Concrete methods

def decode(encoded: Array[Char], encoding: Vector[(Vector[Byte], Char)], unknown: Byte): Array[Byte]
def encode(corpus: Array[Byte], encoding: Vector[(Vector[Byte], Char)], unknownToken: Char): Array[Char]
def saveEncodingToFile(file: File, encoding: Vector[(Vector[Byte], Char)], unknownToken: Char, unknownByte: Byte): Unit
def train(corpus: Array[Byte], vocabularyMin: Char, vocabularyMax: Char, maxMergedSegmentLength: Int): Vector[(Vector[Byte], Char)]

Trains BPE encoding

Char is used here as an unsigned 16-bit integer, so each token is a value in the range 0 to 65535.
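
The note about Char can be made concrete with a short sketch. On the JVM, Char is the only unsigned 16-bit primitive, which is presumably why it serves as the token type here. The names `CharAsUnsignedSketch`, `toToken`, and `toId` below are hypothetical and are not part of lamp; they only show how an integer token id maps to and from a Char.

```scala
// Hypothetical helper, not part of lamp: Char treated as an unsigned 16-bit id.
object CharAsUnsignedSketch {

  // Bounds of the Char range when read as an integer.
  val smallest: Int = Char.MinValue.toInt
  val largest: Int = Char.MaxValue.toInt

  // Converts an integer id to a Char token, rejecting ids outside the range.
  def toToken(id: Int): Char = {
    require(id >= smallest && id <= largest, s"token id out of range: $id")
    id.toChar
  }

  // Reads a Char token back as a non-negative integer id.
  def toId(token: Char): Int = token.toInt

  // For contrast: Short is signed, so the same 16 bits read back as negative.
  val asSignedShort: Int = 40000.toShort.toInt
}
```

An id such as 40000 survives the round trip through Char unchanged, whereas the same value stored in a Short would read back as a negative number.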
