Class SimpleDetector

java.lang.Object
com.yahoo.language.simple.SimpleDetector
All Implemented Interfaces:
Detector

public class SimpleDetector extends Object implements Detector
Includes functionality for determining the language code from a text sample or from its encoding. There are two ways to guess a String's language code: by encoding and by character set. If the encoding is available, it is a very good indication of the language. If it is not, the actual characters in the string can be used to make an educated guess. Recall that a String in Java is Unicode, so we can simply look at the Unicode blocks of the characters in the string. Unfortunately, this is not 100% foolproof. As far as the author has been able to determine, Korean characters do not overlap with Japanese or Chinese characters, so their presence is a good indication of Korean. If a string contains phonetic Japanese (kana), that is a good indication of Japanese. However, Japanese and Chinese share many of the same character blocks, so if there are no definitive signs of Japanese, the String is assumed to be Chinese.
Author:
Rich Pito, bjorncs
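The Unicode-block heuristic in the class description can be illustrated with a short, standalone sketch. This is not the SimpleDetector source: the class name BlockLanguageGuess, the method guess, the string labels it returns, and the exact set of blocks checked are all assumptions made for illustration, using only java.lang.Character.UnicodeBlock from the JDK.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical illustration of the heuristic described above; not the Vespa implementation.
final class BlockLanguageGuess {

    /** Returns "ko", "ja", "zh" or "unknown" (illustrative labels, not the Language enum). */
    static String guess(String input) {
        boolean sawKana = false;
        boolean sawIdeograph = false;
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            UnicodeBlock block = UnicodeBlock.of(codePoint);
            if (block == UnicodeBlock.HANGUL_SYLLABLES
                    || block == UnicodeBlock.HANGUL_JAMO
                    || block == UnicodeBlock.HANGUL_COMPATIBILITY_JAMO) {
                return "ko"; // Hangul does not overlap with Japanese or Chinese blocks
            }
            if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
                sawKana = true; // phonetic Japanese is a definitive sign of Japanese
            } else if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                sawIdeograph = true; // shared by Japanese and Chinese
            }
        }
        if (sawKana) return "ja";
        if (sawIdeograph) return "zh"; // no definitive sign of Japanese: assume Chinese
        return "unknown";
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(guess("\uD55C\uAD6D\uC5B4"));       // Hangul sample
        System.out.println(guess("\u65E5\u672C\u8A9E\u306E")); // ideographs plus hiragana
        System.out.println(guess("\u4E2D\u6587"));             // ideographs only
    }
}
```

The sketch returns as soon as it sees Hangul, while kana is only recorded and checked at the end, so a string mixing ideographs and kana is still classified as Japanese regardless of character order.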
  • Constructor Details

    • SimpleDetector

      public SimpleDetector()
  • Method Details

    • detect

      public Detection detect(byte[] input, int offset, int length, Hint hint)
      Description copied from interface: Detector
      Detects language and encoding of the supplied byte array, possibly using a language/encoding hint.
      Specified by:
      detect in interface Detector
      Parameters:
      input - the buffer that is to be inspected
      offset - the offset to detect from
      length - the number of bytes to inspect, starting at offset
      hint - a hint to the detector, or null for no hint
      Returns:
      a Detection with the detected language and encoding (never null)
    • detect

      public Detection detect(ByteBuffer input, Hint hint)
      Description copied from interface: Detector
      Detects language and encoding of the supplied ByteBuffer, possibly using a language/encoding hint.
      Specified by:
      detect in interface Detector
      Parameters:
      input - the buffer that is to be inspected, from its current position to its limit
      hint - a hint to the detector, or null for no hint
      Returns:
      a Detection with the detected language and encoding (never null)
    • detect

      public Detection detect(String input, Hint hint)
      Description copied from interface: Detector
      Detects language of the supplied String, possibly using a language hint.
      Specified by:
      detect in interface Detector
      Parameters:
      input - the string that is to be inspected
      hint - a hint to the detector, or null for no hint
      Returns:
      a Detection with the detected language (never null)
    • guessLanguage

      public Language guessLanguage(byte[] buf, int offset, int length)
    • guessLanguage

      public Language guessLanguage(String input)
    • guessEncoding

      public String guessEncoding(byte[] input)
    • guessEncoding

      public String guessEncoding(byte[] input, int offset, int length)
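
The byte[] overloads above take an offset and a length, while the ByteBuffer overload of detect inspects the buffer from its current position to its limit. The following sketch shows how the two conventions line up using the standard ByteBuffer.wrap call. Whether SimpleDetector converts between them this way is not stated in this page; the class BufferWindow and the method window are hypothetical names used only for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical helper; shows how (offset, length) maps onto a buffer's position and limit.
final class BufferWindow {

    /** Wraps the given range so that position == offset and limit == offset + length. */
    static ByteBuffer window(byte[] input, int offset, int length) {
        return ByteBuffer.wrap(input, offset, length);
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30, 40, 50, 60};
        ByteBuffer view = window(data, 2, 3);
        // The inspected region is data[2..4], i.e. position 2 up to (not including) limit 5.
        System.out.println(view.position() + " " + view.limit() + " " + view.remaining());
    }
}
```

The wrapped buffer shares the backing array rather than copying it, so a caller holding a byte[] range can obtain the position-to-limit view without allocating a new array.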