Stats

io.github.quafadas.scautable.Stats
object Stats

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
Stats.type

Members list

Type members

Types

type NonNumericStatsContext[T] = (uniqueValues: Set[String], counts: Map[String, Int])
type StatsContext[T] = (typ: String, sum: Double, count: Int, mean: Double, digest: TDigest)

Extensions

extension [K <: Tuple, V <: Tuple](nt: Iterator[NamedTuple[K, V]])
inline def nonNumericSummary: List[(name: String, uniqueEntries: Int, mostFrequent: String, frequency: Int, sample: String)]

Computes comprehensive categorical summaries for all columns in the dataset.

This method processes an Iterator of NamedTuples and calculates descriptive statistics for each column focusing on categorical/text data patterns. All column types are supported, with values converted to their string representation for analysis. The method is particularly useful for understanding the distribution and variety of non-numeric data.

The method handles missing values gracefully:

  • None values in Option types are excluded from calculations
  • Empty strings are ignored
  • Columns with all missing values will show appropriate default values

Attributes

Returns

A list of named tuples, one per column, containing:

  • name: Column name (String)
  • uniqueEntries: Count of unique non-missing values in the column
  • mostFrequent: The most frequently occurring value (as Option[String]), or None if all values are missing
  • frequency: Number of times the most frequent value appears
  • sample: A comma-separated sample of up to 5 unique values, randomly selected and truncated to 75 characters with "..." if needed
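The fields above can be illustrated with a minimal sketch that works on a single column of optional strings. This is not the library's implementation: `CategoricalSummary` and `summariseColumn` are hypothetical names, the sample takes the first five distinct values instead of a random selection, and it assumes the 75-character limit includes the trailing "...".

```scala
// Hypothetical helper, not part of scautable: summarises one column of optional strings.
final case class CategoricalSummary(
    uniqueEntries: Int,
    mostFrequent: Option[String],
    frequency: Int,
    sample: String
)

def summariseColumn(values: Seq[Option[String]]): CategoricalSummary =
  // Drop None and empty strings, mirroring the missing-value rules described above.
  val present = values.flatten.filter(_.nonEmpty)
  // Count occurrences of each distinct value.
  val counts = present.groupMapReduce(identity)(_ => 1)(_ + _)
  // Most frequent value, if any value is present at all.
  val top = counts.maxByOption(_._2)
  // Up to 5 distinct values in first-seen order (assumption: not random), capped at 75 characters.
  val joined = present.distinct.take(5).mkString(", ")
  val sample = if joined.length > 75 then joined.take(72) + "..." else joined
  CategoricalSummary(counts.size, top.map(_._1), top.map(_._2).getOrElse(0), sample)
```

A column with no usable values yields zero unique entries, no most frequent value, and an empty sample in this sketch; the defaults the library reports may differ.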

See also

nonNumericSummary for the Iterable version that doesn't consume the collection

Note

This method consumes the iterator. To keep the data for further use, convert it to a collection first and call the Iterable version.
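The single-pass behaviour is a property of Scala iterators in general. The sketch below uses only the standard library (no scautable calls) to show why materialising the data first allows repeated traversal; the value names are illustrative.

```scala
// Plain standard-library illustration; no scautable calls are involved.
val rows = Iterator("Alice", "Bob", "Charlie")

// Materialise the iterator once so the data can be traversed repeatedly.
val saved = rows.toList

// The iterator is now exhausted, while the list can be read as often as needed.
val iteratorIsEmpty = !rows.hasNext
val firstPass = saved.size
val secondPass = saved.size
```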

Example
val data = Iterator(
 (name = "Alice", city = "New York", department = "Engineering"),
 (name = "Bob", city = "Boston", department = "Sales"),
 (name = "Charlie", city = "New York", department = "Engineering")
)
val summary = data.nonNumericSummary
// Returns categorical statistics for all columns
inline def numericSummary: List[(name: String, typ: String, mean: Double, min: Double, 0.25: Double, median: Double, 0.75: Double, max: Double)]

Computes comprehensive statistical summaries for all numeric columns in the dataset.

This method processes an Iterator of NamedTuples and calculates descriptive statistics for each column that contains numeric data (Int, Long, Double, or Option-wrapped versions). Non-numeric columns are ignored. The computation uses T-Digest for efficient quantile estimation, making it suitable for large datasets.

The method handles missing values gracefully:

  • None values in Option types are excluded from calculations
  • Columns with all missing values will show appropriate default values
  • Type inference works even when the first few values are missing

Attributes

Returns

A list of named tuples, one per numeric column, containing:

  • name: Column name (String)
  • typ: Detected data type ("Int", "Long", "Double", or "Unknown")
  • mean: Arithmetic mean of non-missing values
  • min: Minimum value
  • 0.25: 25th percentile (first quartile)
  • median: 50th percentile (median)
  • 0.75: 75th percentile (third quartile)
  • max: Maximum value
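To make the returned statistics concrete, here is a minimal sketch for a single column of optional doubles. It is not the library's implementation: it computes exact quantiles by sorting with linear interpolation, whereas the method above is described as estimating them with a T-Digest, so results on large datasets may differ slightly. `NumericSummary` and `summariseNumeric` are hypothetical names, and returning None for an all-missing column is an assumption made for the sketch.

```scala
// Hypothetical helper, not part of scautable. Quantiles are exact (sort-based),
// standing in for the T-Digest estimates the library is described as using.
final case class NumericSummary(
    mean: Double, min: Double, q25: Double, median: Double, q75: Double, max: Double
)

def summariseNumeric(values: Seq[Option[Double]]): Option[NumericSummary] =
  // None values are excluded before any statistic is computed.
  val sorted = values.flatten.sorted(using Ordering.Double.TotalOrdering)
  if sorted.isEmpty then None
  else
    // Quantile by linear interpolation between the two nearest ranks.
    def quantile(q: Double): Double =
      val pos = q * (sorted.length - 1)
      val lo = math.floor(pos).toInt
      val hi = math.ceil(pos).toInt
      sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
    Some(NumericSummary(
      sorted.sum / sorted.length, sorted.head,
      quantile(0.25), quantile(0.5), quantile(0.75), sorted.last
    ))
```

Sorting needs the whole column in memory, which is the cost a streaming sketch such as T-Digest avoids.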

See also

numericSummary for the Iterable version that doesn't consume the collection

Note

This method consumes the iterator. To keep the data for further use, convert it to a collection first and call the Iterable version.

Example
val data = Iterator(
 (name = "Alice", age = 25, salary = Some(50000.0)),
 (name = "Bob", age = 30, salary = Some(60000.0)),
 (name = "Charlie", age = 35, salary = None)
)
val stats = data.numericSummary
// Returns statistics for 'age' and 'salary' columns only
// 'name' column is ignored as it's non-numeric
extension [K <: Tuple, V <: Tuple](nt: Iterable[NamedTuple[K, V]])
inline def describe: Unit
inline def nonNumericSummary: List[(name: String, uniqueEntries: Int, mostFrequent: String, frequency: Int, sample: String)]

Computes comprehensive categorical summaries for all columns in the dataset.

This method processes an Iterable of NamedTuples and calculates descriptive statistics for each column focusing on categorical/text data patterns. All column types are supported, with values converted to their string representation for analysis. The method is particularly useful for understanding the distribution and variety of non-numeric data.

The method handles missing values gracefully:

  • None values in Option types are excluded from calculations
  • Empty strings are ignored
  • Columns with all missing values will show appropriate default values

Attributes

Returns

A list of named tuples, one per column, containing:

  • name: Column name (String)
  • uniqueEntries: Count of unique non-missing values in the column
  • mostFrequent: The most frequently occurring value (as Option[String]), or None if all values are missing
  • frequency: Number of times the most frequent value appears
  • sample: A comma-separated sample of up to 5 unique values, randomly selected and truncated to 75 characters with "..." if needed

See also

numericSummary for numeric column analysis

Note

Unlike the Iterator version, this method does not consume the collection, allowing for multiple analyses on the same dataset.

Example
val data = List(
 (name = "Alice", city = "New York", department = "Engineering"),
 (name = "Bob", city = "Boston", department = "Sales"),
 (name = "Charlie", city = "New York", department = "Engineering"),
 (name = "Diana", city = "Chicago", department = "Marketing")
)
val summary = data.nonNumericSummary
// Results might look like:
// name: uniqueEntries=4, mostFrequent=None, frequency=1, sample="Alice, Bob, Charlie, Diana"
// city: uniqueEntries=3, mostFrequent=Some("New York"), frequency=2, sample="New York, Boston, Chicago"
// department: uniqueEntries=3, mostFrequent=Some("Engineering"), frequency=2, sample="Engineering, Sales, Marketing"
val dataWithOptions = List(
 (id = 1, category = Some("A"), notes = None),
 (id = 2, category = Some("B"), notes = Some("Important")),
 (id = 3, category = Some("A"), notes = None),
 (id = 4, category = None, notes = Some("Review"))
)
val optionSummary = dataWithOptions.nonNumericSummary
// The category column will show uniqueEntries=2 (A, B), excluding None values
// The notes column will show uniqueEntries=2 (Important, Review), excluding None values
inline def numericSummary: List[(name: String, typ: String, mean: Double, min: Double, 0.25: Double, median: Double, 0.75: Double, max: Double)]

Computes comprehensive statistical summaries for all numeric columns in the dataset.

This method processes an Iterable of NamedTuples and calculates descriptive statistics for each column that contains numeric data (Int, Long, Double, or Option-wrapped versions). Non-numeric columns are ignored. The computation uses T-Digest for efficient quantile estimation, making it suitable for large datasets.

The method handles missing values gracefully:

  • None values in Option types are excluded from calculations
  • Columns with all missing values will show appropriate default values
  • Type inference works even when the first few values are missing

Attributes

Returns

A list of named tuples, one per numeric column, containing:

  • name: Column name (String)
  • typ: Detected data type ("Int", "Long", "Double", or "Unknown")
  • mean: Arithmetic mean of non-missing values
  • min: Minimum value
  • 0.25: 25th percentile (first quartile)
  • median: 50th percentile (median)
  • 0.75: 75th percentile (third quartile)
  • max: Maximum value

See also

numericSummary for the Iterator version

Note

Unlike the Iterator version, this method does not consume the collection, allowing for multiple statistical computations on the same dataset.

Example
val data = List(
 (name = "Alice", age = 25, salary = Some(50000.0)),
 (name = "Bob", age = 30, salary = Some(60000.0)),
 (name = "Charlie", age = 35, salary = None)
)
val stats = data.numericSummary
// Returns statistics for 'age' and 'salary' columns only
// 'name' column is ignored as it's non-numeric
// Can be called multiple times since data is not consumed
val moreStats = data.numericSummary
inline def summary: (numeric: List[(name: String, typ: String, mean: Double, min: Double, 0.25: Double, median: Double, 0.75: Double, max: Double)], nonNumeric: List[(name: String, uniqueEntries: Int, mostFrequent: String, frequency: Int, sample: String)])

Computes comprehensive statistical summaries for both numeric and non-numeric columns.

This method processes an Iterable of NamedTuples and calculates both numeric and categorical statistics for all columns in the dataset. It automatically separates columns into numeric (Int, Long, Double, or Option-wrapped versions) and non-numeric types, providing comprehensive analysis in a single call.

Attributes

Returns

A named tuple containing:

  • numeric: List of numeric column statistics (same as numericSummary)
  • nonNumeric: List of categorical column statistics (same as nonNumericSummary)
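The split into numeric and non-numeric columns can be pictured with the following sketch. It is an illustration only: it classifies values at runtime, while the library may well decide from the column types at compile time, and `isNumeric` and `splitColumns` are hypothetical names that do not come from scautable.

```scala
// Hypothetical classification helper, not scautable's implementation.
def isNumeric(value: Any): Boolean = value match
  case _: Int | _: Long | _: Double => true
  case Some(inner)                  => isNumeric(inner)
  case _                            => false

// Assumption: a column is numeric when it has at least one present value
// and every present value is an Int, Long, Double, or an Option wrapping one.
def splitColumns(columns: Map[String, Seq[Any]]): (Set[String], Set[String]) =
  val (numeric, nonNumeric) = columns.partition { case (_, values) =>
    val present = values.filter(_ != None)
    present.nonEmpty && present.forall(isNumeric)
  }
  (numeric.keySet, nonNumeric.keySet)
```

For columns `name`, `age`, and an Option-wrapped `salary`, this places `age` and `salary` in the first set and `name` in the second.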

See also

numericSummary for numeric-only analysis

nonNumericSummary for categorical-only analysis

summary for the Iterator version

Note

Unlike the Iterator version, this method does not consume the collection, allowing for multiple analyses on the same dataset.

Example
val data = List(
 (name = "Alice", age = 25, city = "New York", salary = Some(50000.0)),
 (name = "Bob", age = 30, city = "Boston", salary = Some(60000.0)),
 (name = "Charlie", age = 35, city = "New York", salary = None)
)
val summary = data.summary
// summary.numeric contains statistics for 'age' and 'salary' columns
// summary.nonNumeric contains statistics for 'name' and 'city' columns
// Can be called multiple times since data is not consumed
val moreSummary = data.summary