Package io.delta.kernel.statistics
Class DataFileStatistics

java.lang.Object
    io.delta.kernel.statistics.DataFileStatistics
Encapsulates statistics for a data file in a Delta Lake table and provides methods to serialize
those stats to JSON with basic physical-type validation. Note that connectors (e.g. Spark, Flink)
are responsible for ensuring the correctness of collected stats, including any necessary string
truncation, prior to constructing this class.
Field Summary

Fields
Modifier and Type    Field
static final int     MICROSECONDS_PER_SECOND
static final int     NANOSECONDS_PER_MICROSECOND
Constructor Summary

Constructors
DataFileStatistics(long numRecords, Map<Column, Literal> minValues, Map<Column, Literal> maxValues, Map<Column, Long> nullCount)
    Create a new instance of DataFileStatistics.
Method Summary
Modifier and TypeMethodDescriptionboolean
Get the maximum values of the columns in the data file.Get the minimum values of the columns in the data file.Get the number of nulls of columns in the data file.long
Get the number of records in the data file.int
hashCode()
serializeAsJson
(StructType physicalSchema) Serializes the statistics as a JSON string.toString()
-
Field Details
MICROSECONDS_PER_SECOND

public static final int MICROSECONDS_PER_SECOND

See Also:
    Constant Field Values

NANOSECONDS_PER_MICROSECOND

public static final int NANOSECONDS_PER_MICROSECOND

See Also:
    Constant Field Values
Constructor Details
DataFileStatistics

public DataFileStatistics(long numRecords,
                          Map<Column, Literal> minValues,
                          Map<Column, Literal> maxValues,
                          Map<Column, Long> nullCount)

Create a new instance of DataFileStatistics. The minValues, maxValues, and nullCount are all
required fields. This class is primarily used to serialize stats to JSON with type checking when
constructing file actions and is NOT used during data skipping. As such, the column names in
minValues, maxValues, and nullCount should be those of the physical data schema as reflected in
the Parquet files, NOT the logical schema. A construction sketch follows the parameter list below.

Parameters:
    numRecords - Number of records in the data file.
    minValues - Map of column to its minimum value in the data file. If the data file has all
        nulls for the column, the value will be null or not present in the map.
    maxValues - Map of column to its maximum value in the data file. If the data file has all
        nulls for the column, the value will be null or not present in the map.
    nullCount - Map of column to the number of nulls in the data file.
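Below is a minimal construction sketch under a few assumptions: Column and Literal are taken from
io.delta.kernel.expressions (their usual location in the Kernel API), and the column names and
values are purely illustrative.

    import java.util.HashMap;
    import java.util.Map;

    import io.delta.kernel.expressions.Column;
    import io.delta.kernel.expressions.Literal;
    import io.delta.kernel.statistics.DataFileStatistics;

    public class StatsExample {
      public static DataFileStatistics buildStats() {
        // Column names must follow the physical (Parquet) schema, not the logical schema.
        Column id = new Column("id");                                  // top-level column "id"
        Column city = new Column(new String[] {"address", "city"});   // nested column "address.city"

        Map<Column, Literal> minValues = new HashMap<>();
        minValues.put(id, Literal.ofInt(1));
        minValues.put(city, Literal.ofString("Amsterdam"));

        Map<Column, Literal> maxValues = new HashMap<>();
        maxValues.put(id, Literal.ofInt(100));
        maxValues.put(city, Literal.ofString("Zurich"));

        Map<Column, Long> nullCount = new HashMap<>();
        nullCount.put(id, 0L);
        nullCount.put(city, 3L);

        // All three stats maps are required; 100L is the record count for this file.
        return new DataFileStatistics(100L, minValues, maxValues, nullCount);
      }
    }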
Method Details
getNumRecords

public long getNumRecords()

Get the number of records in the data file.

Returns:
    Number of records in the data file.
getMinValues

public Map<Column, Literal> getMinValues()

Get the minimum values of the columns in the data file. The map may contain statistics for only a
subset of the columns in the data file.

Returns:
    Map of column to its minimum value in the data file.
getMaxValues

public Map<Column, Literal> getMaxValues()

Get the maximum values of the columns in the data file. The map may contain statistics for only a
subset of the columns in the data file.

Returns:
    Map of column to its maximum value in the data file.
getNullCount

public Map<Column, Long> getNullCount()

Get the number of nulls of columns in the data file. The map may contain statistics for only a
subset of the columns in the data file.

Returns:
    Map of column to the number of nulls in the data file.
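Since each of these maps may omit columns, callers typically guard against absent keys. A brief
lookup sketch, reusing the hypothetical buildStats() helper from the constructor example above and
assuming Column instances can be used as map keys (they are the declared key type of these maps):

    DataFileStatistics stats = buildStats();
    long numRecords = stats.getNumRecords();

    Column city = new Column(new String[] {"address", "city"});
    Literal minCity = stats.getMinValues().get(city);   // may be null if no stat was collected
    Long cityNulls = stats.getNullCount().get(city);    // may also be absent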
serializeAsJson

public String serializeAsJson(StructType physicalSchema)

Serializes the statistics as a JSON string.

Example for nested column structures (a usage sketch follows the Throws note below):

    Input:
        minValues = {
            new Column(new String[] {"a", "b", "c"}) mapped to Literal.ofInt(10),
            new Column("d") mapped to Literal.ofString("value")
        }

    Output JSON:
        {
            "minValues": {
                "a": { "b": { "c": 10 } },
                "d": "value"
            }
        }

Parameters:
    physicalSchema - the optional physical schema. If provided, all min/max values and null counts
        will be included and validated against their physical types. If null, only numRecords will
        be serialized, without validation.

Returns:
    a JSON representation of the statistics.

Throws:
    KernelException - if physicalSchema is provided and there is a type mismatch between the
        Literal values and the expected types in the schema, or if an unsupported data type is
        found.
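A usage sketch, assuming Kernel's schema types (StructType, IntegerType, StringType) live in
io.delta.kernel.types with their builder-style add(...) methods, and reusing the hypothetical
buildStats() helper from the constructor example:

    import io.delta.kernel.types.IntegerType;
    import io.delta.kernel.types.StringType;
    import io.delta.kernel.types.StructType;

    // Physical schema matching the stats columns: id (int) and address.city (string).
    StructType physicalSchema = new StructType()
        .add("id", IntegerType.INTEGER)
        .add("address", new StructType().add("city", StringType.STRING));

    DataFileStatistics stats = buildStats();

    // Full stats, validated against the physical types before serialization.
    String fullJson = stats.serializeAsJson(physicalSchema);

    // Passing null serializes only numRecords and performs no type validation.
    String numRecordsOnly = stats.serializeAsJson(null);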
equals

public boolean equals(Object o)

hashCode

public int hashCode()

toString

public String toString()