A field in the schema. For struct fields, the field "attributes" contains all sub-attributes.
: Attribute name as defined in the source dataset
: Is the attribute an array?
: Should this attribute always be present in the source?
: Should a privacy transformation be applied to this attribute at ingestion time?
: Free text describing the attribute
: If present, the attribute is renamed to this name
: If present, the kind of statistic to compute for this field
: List of sub-attributes
: Valid only when the file format is POSITION
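The attribute properties above can be sketched as a small data model. This is a hypothetical illustration: the field names mirror the descriptions, not the project's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the attribute model described above; field names
# are assumptions mirroring the doc, not the actual API.
@dataclass
class Attribute:
    name: str                            # name as defined in the source dataset
    array: bool = False                  # is it an array?
    required: bool = True                # must it always be present in the source?
    privacy: bool = False                # apply a privacy transformation at ingestion?
    comment: Optional[str] = None        # free-text description
    rename: Optional[str] = None         # if present, the attribute is renamed to this
    metric: Optional[str] = None         # statistic to compute for this field
    attributes: list = field(default_factory=list)  # sub-attributes (struct fields)
    position: Optional[tuple] = None     # (first, last) char, POSITION format only

    def final_name(self) -> str:
        """Effective column name once an optional rename is applied."""
        return self.rename or self.name
```

Downstream options such as merge keys then refer to `final_name()`, which is why the renamed column must be used there.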
Job logical name
List of business tasks to execute
Task executed in the context of a job
: SQL request to execute (do not forget to prefix table names with the database name)
Output domain in Business Area (Will be the Database name in Hive or Dataset in BigQuery)
Dataset Name in Business Area (Will be the Table name in Hive & BigQuery)
Append to or overwrite existing dataset
Target Area where domain / dataset will be stored
Let's say you want to import customers and orders from your Sales system. Sales is therefore the domain, and customer & order are your datasets.
: Domain name
: Folder on the local filesystem where incoming files are stored. This folder is scanned regularly and the datasets found there are moved to the cluster
: Default schema metadata.
: List of schema for each dataset in this domain
: Free text
: Recognized filename extensions; json, csv, dsv and psv are recognized by default
: Ack extension used for each file
Recognized file type format. This will select the correct parser
: SIMPLE_JSON, JSON or DSV. Simple JSON is made of single-level attributes of simple types (no array, map or sub-object)
This attribute property lets us know what statistics should be computed for this field when analyze is active.
: DISCRETE, CONTINUOUS, TEXT or NONE
Recognized file type format. This will select the correct parser
: SIMPLE_JSON, JSON or DSV. Simple JSON is made of single-level attributes of simple types (no array, map or sub-object)
How datasets are merged
: List of attributes used to join the existing and incoming datasets. Use renamed columns here.
: Optional valid SQL condition on the incoming dataset. Use renamed columns here.
: Timestamp column used to identify the last version; if not specified, the currently ingested row is considered the last
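The merge semantics above can be sketched on plain Python dicts instead of DataFrames. The option names are assumptions mirroring the descriptions; the real implementation operates on Spark datasets.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed option names mirroring the merge description above.
@dataclass
class MergeOptions:
    key: list                        # join attributes (use renamed columns)
    timestamp: Optional[str] = None  # column identifying the last version

def merge(existing, incoming, options):
    """Keep, per key, the row with the greatest timestamp; without a
    timestamp column, the currently ingested (incoming) row wins."""
    key_of = lambda row: tuple(row[k] for k in options.key)
    result = {key_of(r): r for r in existing}
    for row in incoming:
        k = key_of(row)
        if options.timestamp and k in result:
            # String comparison is fine for ISO-formatted timestamps.
            if row[options.timestamp] >= result[k][options.timestamp]:
                result[k] = row
        else:
            result[k] = row
    return list(result.values())
```

Note that when no timestamp column is configured, replaying an old file would silently overwrite newer rows, which is why the timestamp option matters for out-of-order ingestion.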
Specify schema properties. These properties may be specified at the schema or domain level. Any property not specified at the schema level is taken from the domain level, or else the default value is used.
: FILE mode by default
: DSV by default
: UTF-8 by default
: Are JSON objects on a single line or multiple lines? Single by default (false); single-line is also faster
: Is the JSON stored as a single object array? false by default
: Does the dataset have a header? true by default
: The column separator, ';' by default
: The string quote char, '"' by default
: The escape char, '\' by default
: Write mode, APPEND by default
: Partition columns, no partitioning by default
: Should the dataset be indexed in Elasticsearch after ingestion?
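The schema-over-domain fallback described above can be sketched as follows. The real metadata carries many more properties; the ones shown, and their names, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults used only when neither the schema nor the domain sets a property.
DEFAULTS = {"with_header": True, "separator": ";", "quote": '"', "escape": "\\"}

@dataclass
class Metadata:
    with_header: Optional[bool] = None
    separator: Optional[str] = None
    quote: Optional[str] = None
    escape: Optional[str] = None

    def merged(self, domain: "Metadata") -> dict:
        """Schema-level values win; unset ones fall back to the domain
        level, then to the defaults."""
        out = {}
        for prop, default in DEFAULTS.items():
            schema_value = getattr(self, prop)
            domain_value = getattr(domain, prop)
            out[prop] = schema_value if schema_value is not None else (
                domain_value if domain_value is not None else default)
        return out
```

Using `Optional` fields (rather than storing defaults directly) is what makes "not specified" distinguishable from "explicitly set to the default".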
This attribute property lets us know what statistics should be computed for this field when analyze is active.
: DISCRETE, CONTINUOUS, TEXT or NONE
Big versus Fast data ingestion. Are we ingesting a file or a message stream?
: FILE or STREAM
: 0.0 means no sampling; > 0 and < 1 means sample the dataset; >= 1 means an absolute number of partitions.
: Attributes used to partition the dataset.
Where the attribute is located in a fixed-position (POSITION) file
: First char position
: Last char position
Spark supported primitive types. These are the only valid raw types. Dataframe columns are converted to these types before the dataset is ingested.
How the attribute should be transformed at ingestion time
: Algorithm to use: NONE, HIDE, MD5, SHA1, SHA256, SHA512, AES
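The hash-based algorithms in this list can be sketched with the standard library's `hashlib`. This is an illustration of the listed options, not the project's implementation: HIDE blanks the value, NONE leaves it untouched, and AES (encryption rather than hashing) is omitted for brevity.

```python
import hashlib

# Sketch of the privacy algorithms listed above; AES is omitted.
def apply_privacy(value: str, algo: str) -> str:
    if algo == "NONE":
        return value
    if algo == "HIDE":
        return ""
    digests = {"MD5": "md5", "SHA1": "sha1", "SHA256": "sha256", "SHA512": "sha512"}
    if algo in digests:
        return hashlib.new(digests[algo], value.encode("utf-8")).hexdigest()
    raise ValueError(f"Unknown privacy algorithm: {algo}")
```

Hashing is one-way but deterministic, so equal source values still produce equal hashed values; joins on a hashed key keep working, unlike with HIDE.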
Dataset Schema
: Schema name, must be unique within the domain. Will become the Hive table name
: Filename pattern to which this schema must be applied
: Dataset columns
: Dataset metadata
: Free text
: SQL code executed before the file is ingested
: SQL code executed right after the file has been ingested
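Routing an incoming file to its schema via the filename pattern can be sketched as below. The `Schema` fields shown are assumptions based on the descriptions above.

```python
import re
from dataclasses import dataclass

# Assumed minimal schema shape, mirroring the description above.
@dataclass
class Schema:
    name: str      # unique within the domain, becomes the Hive table name
    pattern: str   # regex the incoming filename must fully match

def schema_for(filename: str, schemas):
    """Return the first schema whose pattern matches the whole filename."""
    return next((s for s in schemas if re.fullmatch(s.pattern, filename)), None)
```

Files matching no schema would typically be rejected rather than silently dropped, so patterns should be written to be mutually exclusive within a domain.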
Big versus Fast data ingestion. Are we ingesting a file or a message stream?
: FILE or STREAM
Semantic Type
: Type name
: Pattern used to check that the input data matches
: Spark column type of the attribute
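A semantic type, as described above, pairs a validation pattern with a target primitive type. A minimal sketch, with assumed field names:

```python
import re
from dataclasses import dataclass

# Assumed shape of a semantic type, mirroring the description above.
@dataclass
class Type:
    name: str            # type name
    pattern: str         # regex the input data must match
    primitive_type: str  # Spark column type (string, long, date, ...)

def is_valid(value: str, tpe: Type) -> bool:
    """True when the raw input value matches the type's pattern."""
    return re.fullmatch(tpe.pattern, value) is not None
```

Validation against the pattern happens before conversion to the primitive type, so rows carrying malformed values can be routed to the rejected area instead of failing the whole load.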
List of globally defined types
: Type list
During ingestion, should the data be appended to the previous data or replace it? See Spark SaveMode for more options.
: OVERWRITE / APPEND / ERROR_IF_EXISTS / IGNORE.
Contains classes used to describe rejected records. Rejected records are stored as Parquet files in the rejected area. A rejected row contains
Utility to extract duplicates and their number of occurrences
: List of strings
: Error message that should contain placeholders for the value (%s) and the number of occurrences (%d)
List of tuples containing, for each duplicate, the number of occurrences
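The duplicate-extraction utility described above amounts to counting values, keeping those seen more than once, and rendering the %s/%d error template. A sketch (the function name and default template are assumptions):

```python
from collections import Counter

# Sketch of the duplicate-extraction utility described above.
def duplicates(values, error_template="%s is duplicated %d times"):
    """Return one rendered error message per value that occurs more than once."""
    counts = Counter(values)
    return [error_template % (v, n) for v, n in counts.items() if n > 1]
```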