A field in the schema.
A field in the schema. For struct fields, the field "attributes" contains all sub-attributes.
: Attribute name as defined in the source dataset and as received in the file
: Is it an array?
: Should this attribute always be present in the source
: Should a privacy transformation be applied to this attribute at ingestion time
: free text for attribute description
: If present, the attribute is renamed with this name
: If present, what kind of stat should be computed for this field
: List of sub-attributes (valid for JSON and XML files only)
: Valid only when file format is POSITION
: Default value for this attribute when it is not present.
: Tags associated with this attribute
: Should we trim the attribute value?
: Scripted field: SQL request on the renamed column
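For illustration, here is a minimal sketch of how such an attribute could be declared in a YAML schema file. The key names (name, type, array, required, privacy, rename, metricType, default, comment) are assumptions drawn from the descriptions above, not an authoritative reference.

```yaml
# Hypothetical attribute declaration; key names are assumed, not authoritative
- name: "email"              # attribute name as received in the file
  type: "string"             # semantic or primitive type
  array: false               # is it an array?
  required: true             # must always be present in the source
  privacy: "MD5"             # privacy transformation applied at ingestion time
  rename: "customer_email"   # rename the attribute at ingestion time
  metricType: "DISCRETE"     # statistic to compute when analyze is active
  default: "[email protected]"     # value used when the attribute is absent
  comment: "Customer email address"
```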
A job is a set of transform tasks executed using the specified engine.
List of transform tasks to execute
Area where the data is located. When using the BigQuery engine, the area corresponds to the dataset name we will be working on in this job. When using the Spark engine, this is the folder where the data should be stored. Default value is "business"
Output file format when using the Spark engine. Ignored for BigQuery. Default value is "parquet"
When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.
: Register UDFs written in this JVM class when using the Spark engine; register UDFs stored at this location when using the BigQuery engine
: Create temporary views from a map where the key is the view name and the value is the SQL request corresponding to this view, using the syntax supported by the SQL engine.
: SPARK or BQ. Default value is SPARK.
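As a rough sketch, a transform job file could look like the YAML below. The key names (name, area, format, coalesce, udf, views, engine, tasks) are assumptions based on the properties described above.

```yaml
# Hypothetical transform job definition; key names are assumed
name: "sales_kpi"
area: "business"              # BigQuery dataset name or Spark folder, "business" by default
format: "parquet"             # Spark output format, ignored for BigQuery
coalesce: true                # coalesce the output to a single file
udf: "com.example.MyUdfs"     # JVM class (Spark) or location (BigQuery) of the UDFs to register
engine: "SPARK"               # SPARK or BQ
views:
  recent_orders: "SELECT * FROM sales.orders WHERE order_date >= '2021-01-01'"
tasks:
  - sql: "SELECT customer_id, COUNT(*) AS order_count FROM recent_orders GROUP BY customer_id"
    domain: "business"
    dataset: "order_counts"
    write: "OVERWRITE"
```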
Task executed in the context of a job.
Task executed in the context of a job. Each task is executed in its own session.
Main SQL request to execute (do not forget to prefix table names with the database name to avoid conflicts)
Output domain in output Area (Will be the Database name in Hive or Dataset in BigQuery)
Dataset Name in output Area (Will be the Table name in Hive & BigQuery)
Append to or overwrite existing data
List of columns used for partitioning the output.
List of SQL requests to execute before the main SQL request is run
List of SQL requests to execute after the main SQL request is run
Target Area where domain / dataset will be stored.
Where to sink the data
Row level security policy to apply to the output data.
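A single task entry might then look like the following sketch; again, the key names (sql, domain, dataset, write, partition, presql, postsql, area, sink) are assumed from the field descriptions above.

```yaml
# Hypothetical transform task; key names are assumed
- sql: "SELECT c.id, o.amount FROM sales.customers c JOIN sales.orders o ON c.id = o.customer_id"
  domain: "business"             # Hive database / BigQuery dataset
  dataset: "customer_orders"     # Hive / BigQuery table
  write: "APPEND"                # append to or overwrite existing data
  partition: ["order_date"]      # columns used to partition the output
  presql:
    - "DELETE FROM business.customer_orders WHERE order_date = CURRENT_DATE()"
  postsql:
    - "SELECT COUNT(*) FROM business.customer_orders"
  area: "business"               # target area for the domain / dataset
  sink:
    type: "BQ"
```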
When the sink *type* field is set to BQ, the options below should be provided.
: Database location (EU, US, ...)
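A hedged sketch of such a sink block, assuming type and location as the key names:

```yaml
# Hypothetical BigQuery sink; key names are assumed
sink:
  type: "BQ"
  location: "EU"   # database location (EU, US, ...)
```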
Let's say you want to import customers and orders from your Sales system.
Let's say you want to import customers and orders from your Sales system. Sales is therefore the domain and customer & order are your datasets. In a DBMS, a domain would be implemented by a DBMS schema and a dataset by a DBMS table. In BigQuery, the domain name would be the BigQuery dataset name and the dataset would be implemented by a BigQuery table.
Domain name. Make sure you use a name that may be used as a folder name on the target storage.
: Folder on the local filesystem where incoming files are stored. Typically, this folder will be scanned periodically to move the dataset to the cluster for ingestion. Files located in this folder are moved to the pending folder for ingestion by the "import" command.
: Default Schema metadata. This metadata is applied to the schemas defined in this domain. Metadata properties may be redefined at the schema level. See Metadata Entity for more details.
: List of schemas for each dataset in this domain. A domain usually contains multiple schemas, each schema defining how the contents of the input file should be parsed. See Schema for more details.
: Domain Description (free text)
: Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions will be moved to the pending folder.
: Ack extension used for each file. ".ack" if not specified. Files are moved to the pending folder only once a file with the same name as the source file and with this extension is present. To move a file without requiring an ack file to be present, explicitly set this property to the empty string value "".
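Putting the properties above together, a domain definition might be sketched as follows. The key names (name, directory, extensions, ack, comment, metadata, schemas) are assumptions, not guaranteed to match the actual configuration keys.

```yaml
# Hypothetical domain definition; key names are assumed
name: "sales"
directory: "/mnt/incoming/sales"     # local folder scanned by the "import" command
extensions: ["csv", "json", "psv"]   # only files with these extensions are moved to pending
ack: ""                              # no ack file required
comment: "Sales system exports"
metadata:                            # default metadata applied to all schemas of this domain
  format: "DSV"
  withHeader: true
  separator: ";"
schemas:
  - name: "customers"
    pattern: "customers-.*\\.psv"
```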
Big versus Fast data ingestion.
Big versus Fast data ingestion. Are we ingesting a file or a message stream?
: FILE or STREAM
When the sink *type* field is set to ES, the options below should be provided.
When the sink *type* field is set to ES, the options below should be provided. Elasticsearch options are specified in the application.conf file.
Recognized file type format.
Recognized file type format. This will select the correct parser
: SIMPLE_JSON, JSON or DSV. SIMPLE_JSON is made of single-level attributes of simple types (no array, map or sub-object)
This attribute property lets us know what statistics should be computed for this field when analyze is active.
: DISCRETE or CONTINUOUS or TEXT or NONE
When the sink *type* field is set to JDBC, the options below should be provided.
How datasets are merged
List of attributes used to join the existing and incoming datasets. Use renamed columns here.
Optional valid SQL condition on the incoming dataset. Use renamed columns here.
Timestamp column used to identify the latest version; if not specified, the currently ingested row is considered the latest
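A hedged sketch of a merge block, assuming key, delete and timestamp as the property names:

```yaml
# Hypothetical merge configuration; key names are assumed
merge:
  key: ["customer_id"]          # renamed columns used to join existing and incoming datasets
  delete: "status = 'CLOSED'"   # optional SQL condition evaluated on the incoming dataset
  timestamp: "updated_at"       # column used to identify the latest version of a row
```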
Specify Schema properties.
Specify Schema properties. These properties may be specified at the schema or domain level. Any property not specified at the schema level is taken from the one specified at the domain level, or else the default value is returned.
: FILE mode by default. FILE and STREAM are the two accepted values. FILE is currently the only supported mode.
: DSV by default. Supported file formats are DSV, POSITION, SIMPLE_JSON, JSON and XML.
: UTF-8 if not specified.
: Are JSON objects on a single line or multiple lines? Single line by default. false means single line; false also means faster parsing
: Is the JSON stored as a single object array? false by default. This means that by default we have one JSON document per line.
: Does the dataset have a header? true by default
: The values delimiter, ';' by default. The value may be a multichar string starting from Spark 3
: The String quote char, '"' by default
: escaping char '\' by default
: Write mode, APPEND by default
: Partition columns, no partitioning by default
: Should the dataset be indexed in Elasticsearch after ingestion?
: Pattern to ignore or UDF to apply to ignore some lines
: List of attributes to use for clustering
: com.databricks.spark.xml options to use (e.g. rowTag)
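The sketch below shows how these metadata properties might be grouped, either at the domain or at the schema level. All key names are assumptions derived from the descriptions above.

```yaml
# Hypothetical metadata block; key names are assumed
metadata:
  mode: "FILE"           # FILE or STREAM
  format: "DSV"
  encoding: "UTF-8"
  withHeader: true       # does the dataset have a header?
  separator: "|"         # values delimiter
  quote: "\""            # string quote char
  escape: "\\"           # escaping char
  write: "APPEND"        # write mode
  partition: ["year", "month"]   # partition columns
  clustering: ["customer_id"]    # attributes used for clustering
```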
This attribute property lets us know what statistics should be computed for this field when analyze is active.
: DISCRETE or CONTINUOUS or TEXT or NONE
Big versus Fast data ingestion.
Big versus Fast data ingestion. Are we ingesting a file or a message stream?
: FILE or STREAM
: 0.0 means no sampling; a value > 0 and < 1 means sample the dataset; a value >= 1 is the absolute number of partitions.
: Attributes used to partition the dataset.
How should the attribute be transformed at ingestion time?
: First char position
: last char position
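For POSITION files, each attribute typically carries the character range holding its value. A minimal sketch, assuming first and last as key names:

```yaml
# Hypothetical positional attribute; key names are assumed
- name: "zip_code"
  type: "string"
  position:
    first: 10   # first char position
    last: 14    # last char position
```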
Spark supported primitive types.
Spark supported primitive types. These are the only valid raw types. DataFrame columns are converted to these types before the dataset is ingested
How should the attribute be transformed at ingestion time?
Algorithm to use: NONE, HIDE, MD5, SHA1, SHA256, SHA512, AES
User / Group and Service account rights on a subset of the table.
: This Row Level Security unique name
: The condition that goes into the WHERE clause and limits the visible rows.
: User / groups / service accounts to which this security level is applied, e.g. user:[email protected],group:[email protected],serviceAccount:[email protected]
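A hedged sketch of a row level security policy, assuming name, predicate and grants as key names (the grant format follows the example above):

```yaml
# Hypothetical row level security policy; key names are assumed
rls:
  - name: "sales_france"
    predicate: "country = 'FR'"     # goes into the WHERE clause
    grants:
      - "user:[email protected]"
      - "group:[email protected]"
```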
Dataset Schema
: Schema name, must be unique among all the schemas belonging to the same domain. Will become the Hive table name on premise or the BigQuery table name on GCP.
: Filename pattern to which this schema must be applied. This instructs the framework to use this schema to parse any file with a filename that matches this pattern.
: Attributes parsing rules.
See :ref:attribute_concept
: Dataset metadata
See :ref:metadata_concept
: free text
: Reserved for future use.
: We use this attribute to execute SQL queries before writing the final DataFrame after ingestion
: Set of strings to attach to this Schema
: Experimental. Row level security to apply to this schema.
See :ref:rowlevelsecurity_concept
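A schema entry might therefore be sketched as below; the key names (name, pattern, comment, metadata, attributes, tags) are assumptions based on the field descriptions above.

```yaml
# Hypothetical schema definition; key names are assumed
- name: "orders"                   # Hive / BigQuery table name
  pattern: "orders-.*\\.csv"       # filenames this schema applies to
  comment: "Orders received from the Sales system"
  metadata:
    format: "DSV"
    separator: ","
    withHeader: true
  attributes:
    - name: "order_id"
      type: "string"
      required: true
    - name: "amount"
      type: "double"
      required: false
  tags: ["sales"]
```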
Once ingested, files may be sinked to BigQuery, Elasticsearch or any JDBC compliant database.
Recognized sink type.
Recognized sink type. This will select the correct sink.
: NONE, FS, JDBC, BQ, ES. One of the possible supported sinks
Big versus Fast data ingestion.
Big versus Fast data ingestion. Are we ingesting a file or a message stream?
: FILE or STREAM
Semantic Type
: Type name
: Pattern used to check that the input data matches the pattern
: Spark Column Type of the attribute
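A hedged sketch of a semantic type declaration inside a global types file, assuming name, pattern and primitiveType as key names:

```yaml
# Hypothetical semantic type definition; key names are assumed
types:
  - name: "email"
    pattern: "[^@]+@[^@]+\\.[^@]+"   # input values must match this regex
    primitiveType: "string"          # underlying Spark column type
```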
List of globally defined types
: Type list
Recognized file type format.
Recognized file type format. This will select the correct parser
: SIMPLE_JSON, JSON or DSV. SIMPLE_JSON is made of single-level attributes of simple types (no array, map or sub-object)
This tag appears in files and allows importing view and assertion definitions into the current file.
During ingestion, should the data be appended to the previous data or should it replace the existing data? See Spark SaveMode for more options.
: OVERWRITE / APPEND / ERROR_IF_EXISTS / IGNORE.
Contains classes used to describe rejected records.
Contains classes used to describe rejected records. Rejected records are stored in a Parquet file in the rejected area. A rejected row contains:
Utility to extract duplicates and their number of occurrences
: List of strings
: Error message that should contain placeholders for the value (%s) and the number of occurrences (%d)
List of tuples containing, for each duplicate, the number of occurrences