lamp.data.bytesegmentencoding
package lamp.data.bytesegmentencoding
Greedy contraction of consecutive n-grams
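As an illustration of what a greedy contraction over an encoding table can look like, here is a minimal self-contained sketch. It is not the library's implementation: the object `GreedyEncodeSketch` and the helper `greedyEncode` are hypothetical names, and the longest-match-first order is an assumption, since this page does not document the actual matching order. Only the shape of the table, `Vector[(Vector[Byte], Char)]`, is taken from the signatures on this page.

```scala
object GreedyEncodeSketch {
  // Same shape as the `encoding` parameter on this page: byte segment -> token.
  type Encoding = Vector[(Vector[Byte], Char)]

  /** Hypothetical helper, not the library's code: at each position emit the
    * token of the longest matching segment, otherwise the unknown token.
    */
  def greedyEncode(
      corpus: Array[Byte],
      encoding: Encoding,
      unknownToken: Char
  ): Array[Char] = {
    val table: Map[Vector[Byte], Char] = encoding.toMap
    val maxLen = if (encoding.isEmpty) 0 else encoding.map(_._1.length).max
    val out = Array.newBuilder[Char]
    var i = 0
    while (i < corpus.length) {
      // Try the longest candidate first, then progressively shorter ones.
      var len = math.min(maxLen, corpus.length - i)
      var found = false
      while (len > 0 && !found) {
        val segment = corpus.slice(i, i + len).toVector
        table.get(segment) match {
          case Some(token) =>
            out += token
            i += len
            found = true
          case None =>
            len -= 1
        }
      }
      // No segment starts here: emit the unknown token and skip one byte.
      if (!found) {
        out += unknownToken
        i += 1
      }
    }
    out.result()
  }
}
```

With the segments "ab" mapped to token 1 and "a" mapped to token 2, the ASCII input "abac" is contracted to token 1, token 2, and then the unknown token for the unmatched "c".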
Members list
Type members
Classlikes
case class ByteSegmentCodec(trained: Vector[(Vector[Byte], Char)], unknownToken: Char, unknownByte: Byte) extends Codec
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, trait Codec, class Object, trait Matchable, class Any
case class ByteSegmentCodecFactory(vocabularyMin: Char, vocabularyMax: Char, maxMergedSegmentLength: Int, unknownToken: Char, unknownByte: Byte) extends CodecFactory[ByteSegmentCodec]
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, trait CodecFactory[ByteSegmentCodec], class Object, trait Matchable, class Any
Value members
Concrete methods
def decode(encoded: Array[Char], encoding: Vector[(Vector[Byte], Char)], unknown: Byte): Array[Byte]
def encode(corpus: Array[Byte], encoding: Vector[(Vector[Byte], Char)], unknownToken: Char): Array[Char]
def saveEncodingToFile(file: File, encoding: Vector[(Vector[Byte], Char)], unknownToken: Char, unknownByte: Byte): Unit
def train(corpus: Array[Byte], vocabularyMin: Char, vocabularyMax: Char, maxMergedSegmentLength: Int): Vector[(Vector[Byte], Char)]
Trains BPE encoding
Char is used here as an unsigned 16-bit integer.
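The following usage sketch chains `train`, `encode` and `decode` using only the signatures listed above. It assumes the lamp library is on the classpath. The object name `ByteSegmentUsageSketch`, the sample corpus, and every argument value (vocabulary range, segment length, unknown token and byte) are arbitrary illustrative choices, not recommended settings. The page does not describe the exact semantics of `vocabularyMin` and `vocabularyMax`, so no output is shown.

```scala
import lamp.data.bytesegmentencoding

object ByteSegmentUsageSketch {
  def main(args: Array[String]): Unit = {
    val corpus: Array[Byte] = "abababcabc".getBytes("UTF-8")

    // Illustrative values only; this page does not document recommended settings.
    val vocabularyMin: Char = 1.toChar
    val vocabularyMax: Char = 255.toChar
    val unknownToken: Char = 0.toChar
    val unknownByte: Byte = '?'.toByte

    // Learn a table of (byte segment, token) pairs from the corpus.
    val encoding: Vector[(Vector[Byte], Char)] =
      bytesegmentencoding.train(
        corpus,
        vocabularyMin,
        vocabularyMax,
        maxMergedSegmentLength = 3
      )

    // Apply the learned table, then map the tokens back to bytes.
    val tokens: Array[Char] =
      bytesegmentencoding.encode(corpus, encoding, unknownToken)
    val decoded: Array[Byte] =
      bytesegmentencoding.decode(tokens, encoding, unknownByte)

    println(
      s"segments=${encoding.size} tokens=${tokens.length} " +
        s"inputBytes=${corpus.length} decodedBytes=${decoded.length}"
    )
  }
}
```

Because tokens are `Char` values treated as unsigned 16-bit integers, print them with `token.toInt` rather than as characters when inspecting results.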