Class DataFileStatistics

Object
io.delta.kernel.statistics.DataFileStatistics

public class DataFileStatistics extends Object
Encapsulates statistics for a data file in a Delta Lake table and provides methods to serialize those stats to JSON with basic physical-type validation. Note that connectors (e.g. Spark, Flink) are responsible for ensuring the correctness of collected stats, including any necessary string truncation, prior to constructing this class.
  • Field Details

    • MICROSECONDS_PER_SECOND

      public static final int MICROSECONDS_PER_SECOND
    • NANOSECONDS_PER_MICROSECOND

      public static final int NANOSECONDS_PER_MICROSECOND
  • Constructor Details

    • DataFileStatistics

      public DataFileStatistics(long numRecords, Map<Column,Literal> minValues, Map<Column,Literal> maxValues, Map<Column,Long> nullCount)
      Creates a new instance of DataFileStatistics. The minValues, maxValues, and nullCount arguments are all required. This class is primarily used to serialize stats to JSON with type checking when constructing file actions; it is NOT used during data skipping. Consequently, the column names in minValues, maxValues, and nullCount should be those of the physical data schema reflected in the Parquet files, NOT those of the logical schema.
      Parameters:
      numRecords - Number of records in the data file.
      minValues - Map of column to its minimum value in the data file. If the column contains only nulls in the data file, its value will be null or the column will be absent from the map.
      maxValues - Map of column to its maximum value in the data file. If the column contains only nulls in the data file, its value will be null or the column will be absent from the map.
      nullCount - Map of column to number of nulls in the data file.
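      The shape of these arguments can be illustrated with a minimal, self-contained sketch of how a connector might collect them before constructing this class. It is not Kernel code: the class ColumnStatsSketch and its addColumn method are hypothetical, and plain String column names and Integer values stand in for Column and Literal. It assumes the convention described above, under which an all-null column is simply left out of the min/max maps while its null count is still recorded.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Delta Kernel: String names and Integer
// values stand in for Column and Literal.
final class ColumnStatsSketch {
  final long numRecords;
  final Map<String, Integer> minValues = new LinkedHashMap<>();
  final Map<String, Integer> maxValues = new LinkedHashMap<>();
  final Map<String, Long> nullCount = new LinkedHashMap<>();

  ColumnStatsSketch(long numRecords) {
    this.numRecords = numRecords;
  }

  /** Records min, max, and null count for one column, keyed by its physical name. */
  void addColumn(String physicalName, List<Integer> values) {
    long nulls = 0;
    Integer min = null;
    Integer max = null;
    for (Integer v : values) {
      if (v == null) {
        nulls++;
        continue;
      }
      if (min == null || v < min) min = v;
      if (max == null || v > max) max = v;
    }
    // The null count is always recorded.
    nullCount.put(physicalName, nulls);
    // An all-null column gets no entry in the min/max maps.
    if (min != null) {
      minValues.put(physicalName, min);
      maxValues.put(physicalName, max);
    }
  }
}
```

      For example, calling addColumn("col-1", Arrays.asList(5, null, 2)) records a minimum of 2, a maximum of 5, and a null count of 1 for that column.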
  • Method Details

    • getNumRecords

      public long getNumRecords()
      Get the number of records in the data file.
      Returns:
      Number of records in the data file.
    • getMinValues

      public Map<Column,Literal> getMinValues()
      Get the minimum values of the columns in the data file. The map may contain statistics for only a subset of columns in the data file.
      Returns:
      Map of column to its minimum value in the data file.
    • getMaxValues

      public Map<Column,Literal> getMaxValues()
      Get the maximum values of the columns in the data file. The map may contain statistics for only a subset of columns in the data file.
      Returns:
      Map of column to its maximum value in the data file.
    • getNullCount

      public Map<Column,Long> getNullCount()
      Get the number of nulls of columns in the data file. The map may contain statistics for only a subset of columns in the data file.
      Returns:
      Map of column to number of nulls in the data file.
    • serializeAsJson

      public String serializeAsJson(StructType physicalSchema)
      Serializes the statistics as a JSON string.

      Example: For nested column structures:

       Input:
         minValues = {
           new Column(new String[]{"a", "b", "c"}) mapped to Literal.ofInt(10),
           new Column("d") mapped to Literal.ofString("value")
         }
      
       Output JSON:
         {
           "minValues": {
             "a": {
               "b": {
                 "c": 10
               }
             },
             "d": "value"
           }
         }
       
      Parameters:
      physicalSchema - the optional physical schema. If provided, all min/max values and null counts will be included and validated against their physical types. If null, only numRecords will be serialized without validation.
      Returns:
      a JSON representation of the statistics.
      Throws:
      KernelException - if physicalSchema is provided and there is a type mismatch between the Literal values and the expected types in the schema, or if an unsupported data type is found.
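      The nesting shown in the example above can be sketched with the standard library alone. This is an illustration of the idea, not the Kernel implementation: NestedStatsSketch, nest, toJson, and quote are hypothetical names, column paths are modelled as non-empty List<String> keys instead of Column, and values are plain Java objects instead of Literal. No physical-type validation is performed, and only backslashes and double quotes are escaped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not the Kernel implementation.
final class NestedStatsSketch {

  /** Folds path-keyed values into nested maps: [a, b, c] -> 10 becomes {a: {b: {c: 10}}}. */
  @SuppressWarnings("unchecked")
  static Map<String, Object> nest(Map<List<String>, Object> flat) {
    Map<String, Object> root = new LinkedHashMap<>();
    for (Map.Entry<List<String>, Object> e : flat.entrySet()) {
      List<String> path = e.getKey(); // assumed non-empty
      Map<String, Object> node = root;
      // Walk (or create) one nested map per leading path element.
      for (int i = 0; i < path.size() - 1; i++) {
        node = (Map<String, Object>)
            node.computeIfAbsent(path.get(i), k -> new LinkedHashMap<String, Object>());
      }
      // The last path element holds the value itself.
      node.put(path.get(path.size() - 1), e.getValue());
    }
    return root;
  }

  /** Renders nested maps, strings, and other values (via String.valueOf) as compact JSON. */
  static String toJson(Object value) {
    if (value instanceof Map) {
      StringBuilder sb = new StringBuilder("{");
      String sep = "";
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        sb.append(sep)
            .append(quote(String.valueOf(e.getKey())))
            .append(':')
            .append(toJson(e.getValue()));
        sep = ",";
      }
      return sb.append('}').toString();
    }
    if (value instanceof String) {
      return quote((String) value);
    }
    return String.valueOf(value);
  }

  /** Minimal escaping: backslash and double quote only. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }
}
```

      Nesting the two entries from the example (path a, b, c with value 10 and path d with value "value") under a "minValues" key and passing the result to toJson yields the same structure as the output JSON shown above, in compact form.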
    • equals

      public boolean equals(Object o)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • toString

      public String toString()
      Overrides:
      toString in class Object