Package com.ebiznext.comet.schema.model

package model

Linear Supertypes: AnyRef, Any

Type Members

  1. case class AssertionCall(comment: String, name: String, paramValues: List[String], sql: String) extends Product with Serializable

  2. case class AssertionCalls(assertions: Map[String, String]) extends Product with Serializable

  3. case class AssertionDefinition(fullName: String, name: String, params: List[String], sql: String) extends Product with Serializable

  4. case class AssertionDefinitions(assertions: Map[String, String]) extends Product with Serializable
  5. case class Attribute(name: String, type: String = "string", array: Option[Boolean] = None, required: Boolean = true, privacy: PrivacyLevel = PrivacyLevel.None, comment: Option[String] = None, rename: Option[String] = None, metricType: Option[MetricType] = None, attributes: Option[List[Attribute]] = None, position: Option[Position] = None, default: Option[String] = None, tags: Option[Set[String]] = None, trim: Option[Trim] = None, script: Option[String] = None) extends LazyLogging with Product with Serializable

    A field in the schema. For struct fields, the field "attributes" contains all sub-attributes.

    name

    : Attribute name as defined in the source dataset and as received in the file

    array

    : Is it an array?

    required

    : Should this attribute always be present in the source?

    privacy

    : Should a privacy transformation be applied to this attribute at ingestion time?

    comment

    : Free text describing the attribute

    rename

    : If present, the attribute is renamed with this name

    metricType

    : If present, the kind of statistic that should be computed for this field

    attributes

    : List of sub-attributes (valid for JSON and XML files only)

    position

    : Valid only when the file format is POSITION

    default

    : Default value for this attribute when it is not present

    tags

    : Tags associated with this attribute

    trim

    : Should the attribute value be trimmed?

    script

    : Scripted field: SQL expression applied to the renamed column
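
    For illustration only, the sketch below declares a renamed attribute hashed at ingestion time; all names are invented, and it assumes the constructors shown on this page (PrivacyLevel and the semantic type "string" wrap plain string values).

      // Hypothetical sketch: a renamed, hashed attribute (names are illustrative).
      val email = Attribute(
        name = "email_address",                // name as received in the source file
        `type` = "string",                     // semantic type, must be defined in Types
        required = true,
        privacy = PrivacyLevel("SHA256"),      // assuming PrivacyLevel wraps the algorithm name
        rename = Some("email"),                // column name in the ingested dataset
        comment = Some("Customer e-mail, hashed at ingestion time")
      )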

  6. case class AutoJobDesc(name: String, tasks: List[AutoTaskDesc], area: Option[StorageArea] = None, format: Option[String], coalesce: Option[Boolean], udf: Option[String] = None, views: Option[Map[String, String]] = None, engine: Option[Engine] = None) extends Product with Serializable

    A job is a set of transform tasks executed using the specified engine.

    tasks

    : List of transform tasks to execute

    area

    : Area where the data is located. When using the BigQuery engine, the area corresponds to the dataset name we will be working on in this job. When using the Spark engine, this is the folder where the data should be stored. Default value is "business".

    format

    : Output file format when using the Spark engine. Ignored for BigQuery. Default value is "parquet".

    coalesce

    : When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

    udf

    : Register UDFs written in this JVM class when using the Spark engine. Register UDFs stored at this location when using the BigQuery engine.

    views

    : Create temporary views, where the key is the view name and the value is the SQL request corresponding to this view, using the syntax supported by the SQL engine.

    engine

    : SPARK or BQ. Default value is SPARK.
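
    A minimal sketch of a job definition, assuming the case classes shown on this page; the SQL, names and wrapped values are illustrative (see AutoTaskDesc below for the task fields).

      // Hypothetical sketch of a single-task job.
      val job = AutoJobDesc(
        name = "sales-kpi",
        tasks = List(
          AutoTaskDesc(
            sql = "SELECT seller_id, SUM(amount) AS total FROM sales.orders GROUP BY seller_id",
            domain = "business",
            dataset = "seller_kpi",
            write = WriteMode("OVERWRITE")     // assuming WriteMode wraps the mode name
          )
        ),
        area = None,                           // defaults to the "business" area
        format = Some("parquet"),              // Spark output format, ignored by BigQuery
        coalesce = Some(false),
        engine = Some(Engine("SPARK"))         // assuming Engine wraps the engine name
      )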

  7. case class AutoTaskDesc(sql: String, domain: String, dataset: String, write: WriteMode, partition: Option[List[String]] = None, presql: Option[List[String]] = None, postsql: Option[List[String]] = None, area: Option[StorageArea] = None, sink: Option[Sink] = None, rls: Option[List[RowLevelSecurity]] = None, assertions: Option[Map[String, String]] = None) extends Product with Serializable

    Task executed in the context of a job. Each task is executed in its own session.

    sql

    : Main SQL request to execute (do not forget to prefix table names with the database name to avoid conflicts)

    domain

    : Output domain in the output area (will be the database name in Hive or the dataset in BigQuery)

    dataset

    : Dataset name in the output area (will be the table name in Hive and BigQuery)

    write

    : Append to or overwrite existing data

    partition

    : List of columns used for partitioning the output

    presql

    : List of SQL requests to execute before the main SQL request is run

    postsql

    : List of SQL requests to execute after the main SQL request is run

    area

    : Target area where the domain / dataset will be stored

    sink

    : Where to sink the data

    rls

    : Row level security policy to apply to the output data
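
    As a sketch (tables and columns invented), a task that partitions its output and runs a preparatory statement could look like this:

      // Hypothetical sketch of a transform task.
      val dailySales = AutoTaskDesc(
        sql = "SELECT order_date, SUM(amount) AS total FROM sales.orders GROUP BY order_date",
        domain = "business",                   // target database / BigQuery dataset
        dataset = "daily_sales",               // target table
        write = WriteMode("APPEND"),
        partition = Some(List("order_date")),  // partition the output on this column
        presql = Some(List("CREATE DATABASE IF NOT EXISTS business"))
      )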

  8. final case class BigQuerySink(name: Option[String] = None, location: Option[String] = None, timestamp: Option[String] = None, clustering: Option[Seq[String]] = None, days: Option[Int] = None, requirePartitionFilter: Option[Boolean] = None, options: Option[Map[String, String]] = None) extends Sink with Product with Serializable

    When the sink *type* field is set to BQ, the options below should be provided.

    location

    : Database location (EU, US, ...)

    Annotations
    @JsonTypeName()
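
    The sketch below shows a possible BigQuery sink configuration; column names are invented and the meaning of the "days" option is an assumption (partition expiration).

      // Hypothetical sketch of a BigQuery sink.
      val bqSink = BigQuerySink(
        location = Some("EU"),                     // dataset location
        timestamp = Some("order_date"),            // column used to partition the target table
        clustering = Some(Seq("seller_id")),       // clustering columns
        days = Some(30),                           // assumed partition expiration in days
        requirePartitionFilter = Some(true)
      )
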
  9. case class CometArrayType(fields: CometStructType) extends CometDataType with Product with Serializable

  10. trait CometDataType extends AnyRef

  11. case class CometSimpleType(simpleType: DataType, attribute: Attribute, tpe: Type) extends CometDataType with Product with Serializable

  12. case class CometStructField(sparkField: StructField, attribute: Attribute, tpe: Type) extends CometDataType with Product with Serializable

  13. case class CometStructType(fields: Array[CometStructField]) extends CometDataType with Product with Serializable
  14. case class Domain(name: String, directory: String, metadata: Option[Metadata] = None, schemas: List[Schema] = Nil, comment: Option[String] = None, extensions: Option[List[String]] = None, ack: Option[String] = None) extends Product with Serializable

    Let's say you are willing to import customers and orders from your Sales system. Sales is therefore the domain and customer & order are your datasets. In a DBMS, a domain would be implemented by a DBMS schema and a dataset by a DBMS table. In BigQuery, the domain name would be the BigQuery dataset name and the dataset would be implemented by a BigQuery table.

    name

    : Domain name. Make sure you use a name that may be used as a folder name on the target storage.

    • When using HDFS or Cloud Storage, files once ingested are stored in a sub-directory named after the domain name.
    • When used with BigQuery, files are ingested and stored in tables under a dataset named after the domain name.

    directory

    : Folder on the local filesystem where incoming files are stored. Typically, this folder is scanned periodically to move the datasets to the cluster for ingestion. Files located in this folder are moved to the pending folder for ingestion by the "import" command.

    metadata

    : Default schema metadata. This metadata is applied to the schemas defined in this domain. Metadata properties may be redefined at the schema level. See the Metadata entity for more details.

    schemas

    : List of schemas, one for each dataset in this domain. A domain usually contains multiple schemas, each schema defining how the contents of the input file should be parsed. See Schema for more details.

    comment

    : Domain description (free text)

    extensions

    : Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions will be moved to the pending folder.

    ack

    : Ack extension used for each file, ".ack" if not specified. Files are moved to the pending folder only once a file with the same name as the source file and with this extension is present. To move a file without requiring an ack file to be present, explicitly set this property to the empty string "".
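
    To make the Sales example concrete, a minimal sketch of a domain declaration could look like the following; the path and extensions are illustrative.

      // Hypothetical sketch of the "sales" domain described above.
      val sales = Domain(
        name = "sales",                          // also the target folder / BigQuery dataset name
        directory = "/mnt/incoming/sales",       // scanned by the "import" command
        comment = Some("Customers and orders exported from the Sales system"),
        extensions = Some(List("csv", "json")),  // only these files are moved to the pending folder
        ack = Some("")                           // empty string: do not wait for an ack file
      )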

  15. sealed case class Engine(value: String) extends Product with Serializable

    Execution engine used to run transform jobs.

    value

    : SPARK or BQ

    Annotations
    @JsonSerialize() @JsonDeserialize()
  16. class EngineDeserializer extends JsonDeserializer[Engine]

  17. final class EngineSerializer extends JsonSerializer[Engine]

  18. case class Env(env: Map[String, String]) extends Product with Serializable
  19. final case class EsSink(name: Option[String] = None, id: Option[String] = None, timestamp: Option[String] = None, options: Option[Map[String, String]] = None) extends Sink with Product with Serializable

    When the sink *type* field is set to ES, the options below should be provided. Elasticsearch options are specified in the application.conf file.

    Annotations
    @JsonTypeName()
  20. sealed case class Format(value: String) extends Product with Serializable

    Recognized file type format. This will select the correct parser.

    value

    : SIMPLE_JSON, JSON or DSV. SIMPLE_JSON is made of single-level attributes of simple types (no array, map or sub-object).

    Annotations
    @JsonSerialize() @JsonDeserialize()
  21. class FormatDeserializer extends JsonDeserializer[Format]

  22. sealed case class IndexMapping(value: String) extends Product with Serializable

    Elasticsearch mapping type to use for this attribute when the dataset is sinked to Elasticsearch.

    value

    : The Elasticsearch mapping type name

    Annotations
    @JsonSerialize() @JsonDeserialize()
  23. class IndexMappingDeserializer extends JsonDeserializer[IndexMapping]

  24. case class JDBCSource(config: String, schemas: List[Map[String, List[String]]]) extends Product with Serializable
  25. final case class JdbcSink(name: Option[String] = None, connection: String, partitions: Option[Int] = None, batchsize: Option[Int] = None, options: Option[Map[String, String]] = None) extends Sink with Product with Serializable

    When the sink *type* field is set to JDBC, the options below should be provided.

    Annotations
    @JsonTypeName()
  26. case class MergeOptions(key: List[String], delete: Option[String] = None, timestamp: Option[String] = None, queryFilter: Option[String] = None) extends Product with Serializable

    How datasets are merged.

    key

    : List of attributes used to join the existing dataset with the incoming one. Use renamed columns here.

    delete

    : Optional valid SQL condition on the incoming dataset. Use renamed columns here.

    timestamp

    : Timestamp column used to identify the last version; if not specified, the currently ingested row is considered the last one.
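
    A sketch of a merge configuration, assuming the semantics described above; the column names are invented and the exact delete behaviour should be checked against the project documentation.

      // Hypothetical sketch: keep the latest version of each customer.
      val merge = MergeOptions(
        key = List("customer_id"),             // join existing and incoming rows on this key
        timestamp = Some("updated_at"),        // the most recent row wins
        delete = Some("status = 'CLOSED'")     // SQL condition evaluated on the incoming dataset
      )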

  27. case class Metadata(mode: Option[Mode] = None, format: Option[Format] = None, encoding: Option[String] = None, multiline: Option[Boolean] = None, array: Option[Boolean] = None, withHeader: Option[Boolean] = None, separator: Option[String] = None, quote: Option[String] = None, escape: Option[String] = None, write: Option[WriteMode] = None, partition: Option[Partition] = None, sink: Option[Sink] = None, ignore: Option[String] = None, clustering: Option[Seq[String]] = None, xml: Option[Map[String, String]] = None) extends Product with Serializable

    Specify schema properties. These properties may be specified at the schema or domain level. Any property not specified at the schema level is taken from the one specified at the domain level, or else the default value is used.

    mode

    : FILE mode by default. FILE and STREAM are the two accepted values. FILE is currently the only supported mode.

    format

    : DSV by default. Supported file formats are:

    • DSV : Delimiter-separated values file. The delimiter value is specified in the "separator" field.
    • POSITION : FIXED format file where values are located at an exact position in each line.
    • SIMPLE_JSON : For optimisation purposes, we differentiate JSON with top-level values from JSON with deep-level fields. SIMPLE_JSON are JSON files with top-level fields only.
    • JSON : Deep JSON file. Use only when your JSON documents contain sub-documents, otherwise prefer SIMPLE_JSON since it is much faster.
    • XML : XML files

    encoding

    : UTF-8 if not specified

    multiline

    : Are JSON objects on a single line or on multiple lines? Single by default. false means single. false also means faster.

    array

    : Is the JSON stored as a single object array? false by default. This means that by default we have one JSON document per line.

    withHeader

    : Does the dataset have a header? true by default.

    separator

    : The values delimiter, ';' by default. The value may be a multi-character string starting from Spark 3.

    quote

    : The string quote char, '"' by default

    escape

    : Escaping char, '\' by default

    write

    : Write mode, APPEND by default

    partition

    : Partition columns, no partitioning by default

    sink

    : Should the dataset be indexed in Elasticsearch after ingestion?

    ignore

    : Pattern to ignore or UDF to apply to ignore some lines

    clustering

    : List of attributes to use for clustering

    xml

    : com.databricks.spark.xml options to use (e.g. rowTag)
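
    For illustration, metadata for a pipe-delimited file without a header could be written as below; Format and WriteMode simply wrap the names listed on this page.

      // Hypothetical sketch: pipe-delimited file, no header, appended on ingestion.
      val csvMeta = Metadata(
        format = Some(Format("DSV")),
        withHeader = Some(false),
        separator = Some("|"),
        quote = Some("\""),
        escape = Some("\\"),
        write = Some(WriteMode("APPEND"))
      )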

  28. sealed case class MetricType(value: String) extends Product with Serializable

    This attribute property lets us know what statistics should be computed for this field when analyze is active.

    value

    : DISCRETE, CONTINUOUS, TEXT or NONE

    Annotations
    @JsonSerialize() @JsonDeserialize()
  29. class MetricTypeDeserializer extends JsonDeserializer[MetricType]

  30. sealed case class Mode(value: String) extends Product with Serializable

    Big versus fast data ingestion. Are we ingesting a file or a message stream?

    value

    : FILE or STREAM

    Annotations
    @JsonSerialize() @JsonDeserialize()
  31. class ModeDeserializer extends JsonDeserializer[Mode]

  32. final class ModeSerializer extends JsonSerializer[Mode]

  33. final case class NoneSink(name: Option[String] = None, options: Option[Map[String, String]] = None) extends Sink with Product with Serializable

    Annotations
    @JsonTypeName()
  34. case class Partition(sampling: Option[Double], attributes: Option[List[String]]) extends Product with Serializable

    sampling

    : 0.0 means no sampling, > 0 && < 1 means sample the dataset, >= 1 means an absolute number of partitions.

    attributes

    : Attributes used to partition the dataset.

    Annotations
    @JsonDeserialize()
  35. class PartitionDeserializer extends JsonDeserializer[Partition]

  36. case class Position(first: Int, last: Int) extends Product with Serializable

    Position of the attribute in the line when the file format is POSITION (fixed-width).

    first

    : First char position

    last

    : Last char position
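
    As a sketch, an attribute of a POSITION file could carry its character range as follows; whether positions are 0- or 1-based is not specified here and is an assumption.

      // Hypothetical sketch: the customer code occupies the first ten characters of each line.
      val customerCode = Attribute(
        name = "customer_code",
        position = Some(Position(first = 0, last = 9))   // assuming 0-based, inclusive positions
      )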

  37. sealed abstract case class PrimitiveType extends Product with Serializable

    Spark supported primitive types. These are the only valid raw types. Dataframe columns are converted to these types before the dataset is ingested.

    Annotations
    @JsonSerialize() @JsonDeserialize()
  38. class PrimitiveTypeDeserializer extends JsonDeserializer[PrimitiveType]

  39. sealed case class PrivacyLevel(value: String) extends Product with Serializable

    How the attribute should be transformed at ingestion time.

    value

    : Algorithm to use: NONE, HIDE, MD5, SHA1, SHA256, SHA512, AES

    Annotations
    @JsonSerialize() @JsonDeserialize()
  40. class PrivacyLevelDeserializer extends JsonDeserializer[PrivacyLevel]

  41. case class RowLevelSecurity(name: String, predicate: String = "TRUE", grants: Set[String]) extends Product with Serializable

    User / group and service account rights on a subset of the table.

    name

    : This row level security's unique name

    predicate

    : The condition that goes into the WHERE clause and limits the visible rows

    grants

    : Users / groups / service accounts to which this security level is applied, e.g. user:[email protected],group:[email protected],serviceAccount:[email protected]
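
    A sketch of a row level security declaration, using an invented predicate and illustrative principals:

      // Hypothetical sketch: only French rows are visible to the sales group.
      val frenchRowsOnly = RowLevelSecurity(
        name = "french-rows-only",
        predicate = "country = 'FR'",          // appended to the WHERE clause
        grants = Set(
          "group:[email protected]",     // illustrative principals
          "user:[email protected]"
        )
      )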

  42. case class Schema(name: String, pattern: Pattern, attributes: List[Attribute], metadata: Option[Metadata], merge: Option[MergeOptions], comment: Option[String], presql: Option[List[String]], postsql: Option[List[String]] = None, tags: Option[Set[String]] = None, rls: Option[List[RowLevelSecurity]] = None, assertions: Option[Map[String, String]] = None) extends Product with Serializable

    Dataset schema.

    name

    : Schema name, must be unique among all the schemas belonging to the same domain. Will become the Hive table name on premise or the BigQuery table name on GCP.

    pattern

    : Filename pattern to which this schema must be applied. This instructs the framework to use this schema to parse any file whose filename matches this pattern.

    attributes

    : Attributes parsing rules. See :ref:attribute_concept

    metadata

    : Dataset metadata. See :ref:metadata_concept

    comment

    : Free text

    presql

    : Reserved for future use.

    postsql

    : We use this attribute to execute SQL queries before writing the final dataFrame after ingestion

    tags

    : Set of string tags to attach to this schema

    rls

    : Experimental. Row level security applied to this schema. See :ref:rowlevelsecurity_concept
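
    A sketch of a schema tying together the Attribute and Metadata entries above; the filename pattern, attribute names and semantic types are invented.

      import java.util.regex.Pattern

      // Hypothetical sketch of an "orders" schema.
      val orders = Schema(
        name = "orders",                                  // Hive / BigQuery table name
        pattern = Pattern.compile("orders-.*\\.csv"),     // files matching this pattern use this schema
        attributes = List(
          Attribute(name = "order_id"),
          Attribute(name = "amount", `type` = "double")   // semantic type assumed to exist in Types
        ),
        metadata = None,                                  // inherit the domain-level metadata
        merge = None,
        comment = Some("One order per line"),
        presql = None
      )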

  43. sealed abstract class Sink extends AnyRef

    Once ingested, files may be sinked to BigQuery, Elasticsearch or any JDBC-compliant database.

    Annotations
    @JsonTypeInfo() @JsonSubTypes()
  44. sealed case class SinkType(value: String) extends Product with Serializable

    Recognized sink type. This will select where the ingested data is written.

    value

    : One of the supported sinks: NONE, FS, JDBC, BQ, ES

    Annotations
    @JsonSerialize() @JsonDeserialize()
  45. class SinkTypeDeserializer extends JsonDeserializer[SinkType]

  46. sealed case class Stage(value: String) extends Product with Serializable

    Stage in the ingestion workflow at which metrics are computed.

    value

    : UNIT or GLOBAL

    Annotations
    @JsonSerialize() @JsonDeserialize()
  47. class StageDeserializer extends JsonDeserializer[Stage]

  48. sealed case class Trim(value: String) extends Product with Serializable

    How the attribute value should be trimmed at ingestion time.

    value

    : Trimming strategy to apply to the attribute value

    Annotations
    @JsonSerialize() @JsonDeserialize()
  49. class TrimDeserializer extends JsonDeserializer[Trim]

  50. case class Type(name: String, pattern: String, primitiveType: PrimitiveType = PrimitiveType.string, zone: Option[String] = None, sample: Option[String] = None, comment: Option[String] = None, indexMapping: Option[IndexMapping] = None) extends Product with Serializable

    Semantic type.

    name

    : Type name

    pattern

    : Pattern used to check that the input data matches this type

    primitiveType

    : Spark column type of the attribute
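
    As a sketch, a semantic type for e-mail addresses could be defined as follows; the regular expression is illustrative.

      // Hypothetical sketch of an "email" semantic type.
      val emailType = Type(
        name = "email",
        pattern = "[^@ ]+@[^@ ]+\\.[^@ ]+",    // input values must match this regular expression
        primitiveType = PrimitiveType.string,
        comment = Some("E-mail address")
      )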

  51. case class Types(types: List[Type]) extends Product with Serializable

    List of globally defined types

    types

    : Type list

  52. sealed case class UserType(value: String) extends Product with Serializable

    Recognized file type format. This will select the correct parser.

    value

    : SIMPLE_JSON, JSON or DSV. SIMPLE_JSON is made of single-level attributes of simple types (no array, map or sub-object).

    Annotations
    @JsonSerialize() @JsonDeserialize()
  53. class UserTypeDeserializer extends JsonDeserializer[UserType]

  54. case class Views(views: Map[String, String] = Map.empty) extends Product with Serializable

    This tag appears in files and allows importing views and assertion definitions into the current file.

  55. class WriteDeserializer extends JsonDeserializer[WriteMode]

  56. sealed case class WriteMode(value: String) extends Product with Serializable

    During ingestion, should the data be appended to the existing data or replace it? See Spark SaveMode for more options.

    value

    : OVERWRITE / APPEND / ERROR_IF_EXISTS / IGNORE

    Annotations
    @JsonSerialize() @JsonDeserialize()

Value Members

  1. object AssertionCall extends Serializable

  2. object AssertionDefinition extends Serializable

  3. object Engine extends Serializable

  4. object Format extends Serializable

  5. object IndexMapping extends Serializable

  6. object Metadata extends Serializable

  7. object MetricType extends Serializable

  8. object Mode extends Serializable

  9. object PrimitiveType extends Serializable

  10. object PrivacyLevel extends Serializable
  11. object Rejection

    Contains classes used to describe rejected records. Rejected records are stored in a parquet file in the rejected area. A rejected row contains:

    • the list of columns and, for each column, whether it has been accepted or not. A row is rejected if at least one of its columns is rejected.
  12. object RowLevelSecurity extends Serializable

  13. object Schema extends Serializable

  14. object Sink

  15. object SinkType extends Serializable

  16. object Stage extends Serializable

  17. object Trim extends Serializable

  18. object UserType extends Serializable

  19. object Views extends Serializable

  20. object WriteMode extends Serializable

  21. package atlas

  22. def combine(errors1: Either[List[String], Boolean], errors2: Either[List[String], Boolean]*): Either[List[String], Boolean]

  23. def duplicates(values: List[String], errorMessage: String): Either[List[String], Boolean]

    Utility to extract duplicates and their number of occurrences.

    values

    : List of strings

    errorMessage

    : Error message that should contain placeholders for the value (%s) and the number of occurrences (%d)

    returns

    : One formatted error message per duplicate (with its number of occurrences) on the Left, or a Boolean on the Right when there are no duplicates
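
    A usage sketch, under the assumption that duplicates are reported on the Left of the returned Either:

      // Hypothetical sketch: check that attribute names are unique.
      val result = duplicates(
        List("customer_id", "amount", "customer_id"),
        "attribute %s is defined %d times"
      )
      // Under that assumption, result would be Left(List("attribute customer_id is defined 2 times")).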
