Class FieldMatchMetricsComputer


  • public final class FieldMatchMetricsComputer
    extends java.lang.Object

    Calculates a set of metrics capturing the degree of agreement between a query and a field string. The algorithm attempts to capture the property of text that tokens very close to each other are usually part of the same semantic structure, while tokens farther apart are much more loosely related. It locates alternative regions of the field containing multiple query tokens (segments), analyzes these segments in more detail, and chooses the ones producing the best overall set of match metrics (subject to certain resource constraints).

    Such segments are found by looking at query terms in sequence from left to right and finding matches in the field. All alternative segment start points are explored, and the segmentation achieving the best overall string match metric score is preferred. Dynamic programming is used to avoid redoing work on segmentations.

    When a segment start point is found, subsequent tokens from the query are searched for in the field from this starting point in "semantic order". This search order can be defined independently of the algorithm. The current order searches up to proximityLimit tokens ahead first, then the same distance backwards (so if you need to go two steps backwards in the field from the segment starting point, the real distance is -2, but the "semantic distance" is proximityLimit+2).
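
    The mapping from real distance to semantic distance described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the class name SemanticDistanceSketch and the method semanticDistance are hypothetical, the proximityLimit value of 10 is chosen only for the example, and distances beyond proximityLimit in either direction are assumed to be out of scope.

```java
final class SemanticDistanceSketch {

    /**
     * Maps a real field distance, relative to the segment start point, to a semantic distance.
     * Assumption: only distances within proximityLimit in either direction are handled.
     */
    public static int semanticDistance(int realDistance, int proximityLimit) {
        if (Math.abs(realDistance) > proximityLimit)
            throw new IllegalArgumentException("Outside the range covered by this sketch: " + realDistance);
        if (realDistance >= 0)
            return realDistance;              // forward positions keep their real distance
        return proximityLimit - realDistance; // backward: -2 becomes proximityLimit + 2
    }

    public static void main(String[] args) {
        int proximityLimit = 10; // illustrative value only, not a documented default
        System.out.println(semanticDistance(3, proximityLimit));  // prints 3
        System.out.println(semanticDistance(-2, proximityLimit)); // prints 12
    }
}
```

    Under this ordering every forward position within the limit sorts before every backward position, which matches the "ahead first, then backwards" search described above.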

    The actual metrics are calculated during execution of this algorithm by the FieldMatchMetrics class, which receives events emitted by the algorithm. Any set of metrics derivable from these events is computable using this algorithm.

    Terminology:

    • Sequence - A set of adjacent matched tokens in the field
    • Segment - A field area containing matches to a continuous section of the query
    • Gap - A chunk of adjacent tokens inside a segment separating two matched tokens
    • Semantic distance - A non-continuous distance between token positions j in the field, where the discontinuity is meant to capture the semantic similarity between the query and those tokens.

    Notation: A position index in the query is denoted i. A position index in the field is denoted j.
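
    To make the terminology concrete, the sketch below finds the lengths of runs of adjacent matched field tokens. It is a deliberate simplification and not the segmentation algorithm of this class: a field token counts as matched if it occurs anywhere in the query, query order is ignored, and both strings are tokenized by splitting on single spaces. The names SequenceSketch and sequenceLengths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SequenceSketch {

    /** Returns the lengths of maximal runs of adjacent field tokens that also occur in the query. */
    public static List<Integer> sequenceLengths(String queryString, String fieldString) {
        Set<String> queryTokens = new HashSet<>(Arrays.asList(queryString.split(" ")));
        String[] fieldTokens = fieldString.split(" ");
        List<Integer> lengths = new ArrayList<>();
        int run = 0;
        for (int j = 0; j < fieldTokens.length; j++) { // j: position index in the field
            if (queryTokens.contains(fieldTokens[j])) {
                run++;                                 // extend the current run of matches
            } else if (run > 0) {
                lengths.add(run);                      // an unmatched token ends the run
                run = 0;
            }
        }
        if (run > 0) lengths.add(run);                 // close a run that reaches the field end
        return lengths;
    }

    public static void main(String[] args) {
        // "a b" forms a run of length 2, "x x" is a gap of two tokens, "c" is a run of length 1
        System.out.println(sequenceLengths("a b c", "a b x x c")); // prints [2, 1]
    }
}
```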

    This class is not thread-safe, but an instance can be reused across queries by a single thread.
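
    One common way to respect this constraint is to keep one instance per thread. The sketch below shows the pattern with java.lang.ThreadLocal. StandInComputer is a hypothetical placeholder with mutable internal state, used only so the example is self-contained; in real code the thread-local would hold a FieldMatchMetricsComputer instead.

```java
final class PerThreadReuseSketch {

    /** Hypothetical stand-in for a computer that keeps mutable state between steps of a call. */
    static final class StandInComputer {
        private final StringBuilder scratch = new StringBuilder(); // mutable: unsafe to share across threads

        public int compute(String query, String field) {
            scratch.setLength(0);                 // reset state left over from the previous query
            scratch.append(query).append(field);
            return scratch.length();
        }
    }

    // Each thread lazily creates its own instance and reuses it for all of its queries.
    private static final ThreadLocal<StandInComputer> COMPUTER =
            ThreadLocal.withInitial(StandInComputer::new);

    /** Returns the instance owned by the calling thread. */
    public static StandInComputer current() {
        return COMPUTER.get();
    }

    public static int compute(String query, String field) {
        return current().compute(query, field);
    }
}
```

    Whether a thread-local, an object pool, or a fresh instance per request fits best depends on how the calling threads are managed.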

    Author:
    bratseth
    • Constructor Detail

      • FieldMatchMetricsComputer

        public FieldMatchMetricsComputer()
        Creates a feature computer using default settings.
      • FieldMatchMetricsComputer

        public FieldMatchMetricsComputer(FieldMatchMetricsParameters parameters)
        Creates a feature computer with the given parameters. The parameters are frozen if they are not frozen already, which may cause this constructor to throw validation exceptions.
    • Method Detail

      • compute

        public FieldMatchMetrics compute(java.lang.String queryString,
                                         java.lang.String fieldString)
        Computes the string match metrics from a query and field string.
      • compute

        public FieldMatchMetrics compute(Query query,
                                         java.lang.String fieldString)
        Computes the string match metrics from a query and field string.
      • compute

        public FieldMatchMetrics compute(Query query,
                                         java.lang.String fieldString,
                                         boolean collectTrace)
        Computes the string match metrics from a query and field string.
        Parameters:
        query - the query to compute over
        fieldString - the field value to compute over - tokenized by splitting on space
        collectTrace - true to accumulate trace information in the trace returned with the metrics
      • compute

        public FieldMatchMetrics compute(Query query,
                                         Field field,
                                         boolean collectTrace)
        Computes the string match metrics from a query and field.
        Parameters:
        query - the query to compute over
        field - the field value to compute over
        collectTrace - true to accumulate trace information in the trace returned with the metrics
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object