Stats
Attributes
- Supertypes
  class Object, trait Matchable, class Any
- Self type
  Stats.type
Extensions
Computes comprehensive categorical summaries for all columns in the dataset.
This method processes an Iterator of NamedTuples and calculates descriptive statistics for each column, focusing on categorical/text data patterns. All column types are supported, with values converted to their string representation for analysis. The method is particularly useful for understanding the distribution and variety of non-numeric data.
The method handles missing values gracefully:
- None values in Option types are excluded from calculations
- Empty strings are ignored
- Columns with all missing values will show appropriate default values
Attributes
- Returns
  A list of named tuples, one per column, containing:
  - name: Column name (String)
  - uniqueEntries: Count of unique non-missing values in the column
  - mostFrequent: The most frequently occurring value (as Option[String]), or None if all values are missing
  - frequency: Number of times the most frequent value appears
  - sample: A comma-separated sample of up to 5 unique values, randomly selected and truncated to 75 characters with "..." if needed
- See also
  nonNumericSummary for the Iterable version that doesn't consume the collection
- Note
  This method consumes the iterator. If you need to preserve the data, convert it to a collection first and use the Iterable version.
- Example

    val data = Iterator(
      (name = "Alice", city = "New York", department = "Engineering"),
      (name = "Bob", city = "Boston", department = "Sales"),
      (name = "Charlie", city = "New York", department = "Engineering")
    )
    val summary = data.nonNumericSummary
    // Returns categorical statistics for all columns
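
The consumption note reflects how any Scala Iterator behaves. The following is a minimal sketch using only the standard library; the data and the names rows, it, and kept are illustrative, plain tuples stand in for named tuples, and no method from this page is called. It shows that an iterator has nothing left after one full pass, while a List built from the same rows can be traversed repeatedly.

```scala
// Illustrative data; plain tuples stand in for the named tuples used above.
val rows = List(("Alice", "New York"), ("Bob", "Boston"), ("Charlie", "New York"))

// An Iterator can be traversed only once.
val it = rows.iterator
val firstPass = it.toList      // reads every element and exhausts `it`
val exhausted = !it.hasNext    // nothing is left for a second pass

// Materialising the rows into a List keeps them available for reuse.
val kept = rows.iterator.toList
val cities = kept.map(_._2)
val names = kept.map(_._1)     // a second traversal of the same data works
```

Materialising costs memory proportional to the dataset, so it is a trade-off against the single-pass Iterator version.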
Computes comprehensive statistical summaries for all numeric columns in the dataset.
This method processes an Iterator of NamedTuples and calculates descriptive statistics for each column that contains numeric data (Int, Long, Double, or Option-wrapped versions). Non-numeric columns are ignored. The computation uses T-Digest for efficient quantile estimation, making it suitable for large datasets.
The method handles missing values gracefully:
- None values in Option types are excluded from calculations
- Columns with all missing values will show appropriate default values
- Type inference works even when the first few values are missing
Attributes
- Returns
  A list of named tuples, one per numeric column, containing:
  - name: Column name (String)
  - typ: Detected data type ("Int", "Long", "Double", or "Unknown")
  - mean: Arithmetic mean of non-missing values
  - min: Minimum value
  - 0.25: 25th percentile (first quartile)
  - median: 50th percentile (median)
  - 0.75: 75th percentile (third quartile)
  - max: Maximum value
- See also
  numericSummary for the Iterable version that doesn't consume the collection
- Note
  This method consumes the iterator. If you need to preserve the data, convert it to a collection first and use the Iterable version.
- Example

    // salary is Option-wrapped in every row so the column has a single type
    val data = Iterator(
      (name = "Alice", age = 25, salary = Some(50000.0)),
      (name = "Bob", age = 30, salary = Some(60000.0)),
      (name = "Charlie", age = 35, salary = None)
    )
    val stats = data.numericSummary
    // Returns statistics for 'age' and 'salary' columns only
    // 'name' column is ignored as it's non-numeric
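
To make the returned fields concrete, here is a minimal sketch that computes the same kinds of statistics for a single Option-wrapped column by hand. It is not the library's implementation: it sorts the values and interpolates exact quantiles rather than using a T-Digest, so its percentiles may differ slightly from the estimates numericSummary reports. The names ColumnStats, quantile, and describe are hypothetical.

```scala
// Hypothetical helper types and functions; not part of the documented API.
final case class ColumnStats(
  mean: Double, min: Double, q25: Double, median: Double, q75: Double, max: Double
)

// Exact quantile of an already sorted, non-empty vector,
// using linear interpolation between the two nearest ranks.
def quantile(sorted: Vector[Double], p: Double): Double = {
  val pos = p * (sorted.length - 1)
  val lo = math.floor(pos).toInt
  val hi = math.ceil(pos).toInt
  sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
}

// None values are dropped before any statistic is computed.
// Returning None for an all-missing column is an assumption of this sketch.
def describe(values: Seq[Option[Double]]): Option[ColumnStats] = {
  val present = values.flatten.toVector.sorted
  if (present.isEmpty) None
  else Some(ColumnStats(
    mean = present.sum / present.length,
    min = present.head,
    q25 = quantile(present, 0.25),
    median = quantile(present, 0.5),
    q75 = quantile(present, 0.75),
    max = present.last
  ))
}

val salary: Seq[Option[Double]] = Seq(Some(50000.0), Some(60000.0), None)
val salaryStats = describe(salary)
```

Sorting needs the whole column in memory, which is the cost a streaming structure such as a T-Digest avoids on large datasets.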
Computes comprehensive categorical summaries for all columns in the dataset.
This method processes an Iterable of NamedTuples and calculates descriptive statistics for each column, focusing on categorical/text data patterns. All column types are supported, with values converted to their string representation for analysis. The method is particularly useful for understanding the distribution and variety of non-numeric data.
The method handles missing values gracefully:
- None values in Option types are excluded from calculations
- Empty strings are ignored
- Columns with all missing values will show appropriate default values
Attributes
- Returns
  A list of named tuples, one per column, containing:
  - name: Column name (String)
  - uniqueEntries: Count of unique non-missing values in the column
  - mostFrequent: The most frequently occurring value (as Option[String]), or None if all values are missing
  - frequency: Number of times the most frequent value appears
  - sample: A comma-separated sample of up to 5 unique values, randomly selected and truncated to 75 characters with "..." if needed
- See also
  numericSummary for numeric column analysis
- Note
  Unlike the Iterator version, this method does not consume the collection, allowing for multiple analyses on the same dataset.
- Example

    val data = List(
      (name = "Alice", city = "New York", department = "Engineering"),
      (name = "Bob", city = "Boston", department = "Sales"),
      (name = "Charlie", city = "New York", department = "Engineering"),
      (name = "Diana", city = "Chicago", department = "Marketing")
    )
    val summary = data.nonNumericSummary
    // Results might look like:
    // name: uniqueEntries=4, mostFrequent=None, frequency=1, sample="Alice, Bob, Charlie, Diana"
    // city: uniqueEntries=3, mostFrequent=Some("New York"), frequency=2, sample="New York, Boston, Chicago"
    // department: uniqueEntries=3, mostFrequent=Some("Engineering"), frequency=2, sample="Engineering, Sales, Marketing"

    val dataWithOptions = List(
      (id = 1, category = Some("A"), notes = None),
      (id = 2, category = Some("B"), notes = Some("Important")),
      (id = 3, category = Some("A"), notes = None),
      (id = 4, category = None, notes = Some("Review"))
    )
    val summary = dataWithOptions.nonNumericSummary
    // The category column will show uniqueEntries=2 (A, B), excluding None values
    // The notes column will show uniqueEntries=2 (Important, Review), excluding None values
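
The missing-value rules above can be illustrated for a single column with a short standard-library sketch. It is an approximation under stated assumptions, not the library's code: CategoricalStats and describeCategorical are hypothetical names, the defaults for an all-missing column (0, None, 0) are a guess at what "appropriate default values" means, and ties for the most frequent value are broken arbitrarily. The sample output above reports mostFrequent=None for a column whose values are all distinct; this sketch does not attempt to reproduce that behaviour.

```scala
// Hypothetical result type and helper; not part of the documented API.
final case class CategoricalStats(
  uniqueEntries: Int, mostFrequent: Option[String], frequency: Int
)

def describeCategorical(values: Seq[Option[Any]]): CategoricalStats = {
  // Drop None, convert to strings, then drop empty strings.
  val present = values.flatten.map(_.toString).filter(_.nonEmpty)
  // Count how often each distinct string occurs.
  val counts = present.groupBy(identity).view.mapValues(_.size).toMap
  if (counts.isEmpty) CategoricalStats(0, None, 0) // assumed defaults
  else {
    val (value, n) = counts.maxBy(_._2) // ties are resolved arbitrarily
    CategoricalStats(counts.size, Some(value), n)
  }
}

val category = Seq(Some("A"), Some("B"), Some("A"), None)
val categoryStats = describeCategorical(category)
```

Converting every value with toString is what lets one routine handle columns of any type, at the cost of treating numbers as opaque labels.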
Computes comprehensive statistical summaries for all numeric columns in the dataset.
This method processes an Iterable of NamedTuples and calculates descriptive statistics for each column that contains numeric data (Int, Long, Double, or Option-wrapped versions). Non-numeric columns are ignored. The computation uses T-Digest for efficient quantile estimation, making it suitable for large datasets.
The method handles missing values gracefully:
- None values in Option types are excluded from calculations
- Columns with all missing values will show appropriate default values
- Type inference works even when the first few values are missing
Attributes
- Returns
  A list of named tuples, one per numeric column, containing:
  - name: Column name (String)
  - typ: Detected data type ("Int", "Long", "Double", or "Unknown")
  - mean: Arithmetic mean of non-missing values
  - min: Minimum value
  - 0.25: 25th percentile (first quartile)
  - median: 50th percentile (median)
  - 0.75: 75th percentile (third quartile)
  - max: Maximum value
- See also
  numericSummary for the Iterator version
- Note
  Unlike the Iterator version, this method does not consume the collection, allowing for multiple statistical computations on the same dataset.
- Example

    // salary is Option-wrapped in every row so the column has a single type
    val data = List(
      (name = "Alice", age = 25, salary = Some(50000.0)),
      (name = "Bob", age = 30, salary = Some(60000.0)),
      (name = "Charlie", age = 35, salary = None)
    )
    val stats = data.numericSummary
    // Returns statistics for 'age' and 'salary' columns only
    // 'name' column is ignored as it's non-numeric

    // Can be called multiple times since data is not consumed
    val moreStats = data.numericSummary
Computes comprehensive statistical summaries for both numeric and non-numeric columns.
This method processes an Iterable of NamedTuples and calculates both numeric and categorical statistics for all columns in the dataset. It automatically separates columns into numeric (Int, Long, Double, or Option-wrapped versions) and non-numeric types, providing comprehensive analysis in a single call.
Attributes
- Returns
  A named tuple containing:
  - numeric: List of numeric column statistics (same as numericSummary)
  - nonNumeric: List of categorical column statistics (same as nonNumericSummary)
- See also
  numericSummary for numeric-only analysis
  nonNumericSummary for categorical-only analysis
  summary for the Iterator version
- Note
  Unlike the Iterator version, this method does not consume the collection, allowing for multiple analyses on the same dataset.
- Example

    // salary is Option-wrapped in every row so the column has a single type
    val data = List(
      (name = "Alice", age = 25, city = "New York", salary = Some(50000.0)),
      (name = "Bob", age = 30, city = "Boston", salary = Some(60000.0)),
      (name = "Charlie", age = 35, city = "New York", salary = None)
    )
    val summary = data.summary
    // summary.numeric contains statistics for 'age' and 'salary' columns
    // summary.nonNumeric contains statistics for 'name' and 'city' columns

    // Can be called multiple times since data is not consumed
    val moreSummary = data.summary
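
The split into numeric and non-numeric columns can be pictured with the following sketch, which classifies each column by its first non-missing value at runtime. This is only one possible approach and an assumption on my part; the library may well decide from the static column types instead. The names isNumeric, columnIsNumeric, and columns are hypothetical, and a Map of column name to values stands in for the named-tuple rows.

```scala
// Hypothetical helpers; the library's actual classification may differ.
def isNumeric(v: Any): Boolean = v match {
  case _: Int | _: Long | _: Double => true
  case Some(inner)                  => isNumeric(inner) // Option-wrapped value
  case _                            => false
}

// Skip leading None values, then classify by the first value that is present.
// A column with no present values is treated as non-numeric in this sketch.
def columnIsNumeric(values: Seq[Any]): Boolean =
  values.find(_ != None).exists(isNumeric)

// Column-oriented stand-in for the rows in the example above.
val columns: Map[String, Seq[Any]] = Map(
  "name"   -> Seq("Alice", "Bob", "Charlie"),
  "age"    -> Seq(25, 30, 35),
  "city"   -> Seq("New York", "Boston", "New York"),
  "salary" -> Seq(None, Some(60000.0), Some(50000.0))
)

val (numericCols, nonNumericCols) =
  columns.partition { case (_, values) => columnIsNumeric(values) }
```

Here the salary column starts with a missing value and is still classified as numeric, which mirrors the statement that type inference works even when the first few values are missing.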