Helper Utilities for reading/writing data from/to different data sources.
Spark utilities for handling REST API connections.
Library of Spark functions that implement the different abinitio functions used in abinitio workflows.
Utility class with assorted UDFs that take care of miscellaneous tasks.
Column Dependency Calculator.
Converts a 1-digit Julian year to a 4-digit Julian year.
date in Julian "YJJJ" format
date in "yyyyMMdd" format
a date in "YYYYJJJ" format
beginning of input should have Cyy
Appends trailer data to every file in the data directory. A single trailer file in the pathOutputTrailer directory should correspond to a single data file in the pathOutputData directory. If a trailer for a given file does not exist, the file is moved as-is to the output directory.
Input data files directory
Input trailer files directory
Output concatenated files directory
Hadoop configuration (preferably sparkSession.sparkContext.hadoopConfiguration)
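A minimal sketch of this helper, assuming the name appendTrailer and plain Hadoop FileSystem semantics (both are assumptions, not the library's actual implementation):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
    import org.apache.hadoop.io.IOUtils

    // For each data file, append the matching trailer file's bytes and write
    // the result to the output directory; move the file as-is if no trailer.
    def appendTrailer(pathInputData: String, pathInputTrailer: String,
                      pathOutputConcatenated: String, configuration: Configuration): Unit = {
      val fs = FileSystem.get(configuration)
      val trailerDir = new Path(pathInputTrailer)
      val outDir = new Path(pathOutputConcatenated)
      fs.listStatus(new Path(pathInputData)).filter(_.isFile).foreach { status =>
        val name = status.getPath.getName
        val trailer = new Path(trailerDir, name)
        val out = new Path(outDir, name)
        if (fs.exists(trailer)) {
          val outStream = fs.create(out)
          for (src <- Seq(status.getPath, trailer)) {
            val in = fs.open(src)
            IOUtils.copyBytes(in, outStream, configuration, false)
            in.close()
          }
          outStream.close()
        } else {
          FileUtil.copy(fs, status.getPath, fs, out, false, configuration)
        }
      }
    }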
Function to take a variable number of values and create an array column out of them.
input value
variable number of input values.
an array column.
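A minimal sketch of such a varargs-to-array helper (the name arrayColumn is an assumption):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{array, lit}

    // Builds a single array column from a leading value plus varargs.
    def arrayColumn(value: Column, values: Column*): Column =
      array((value +: values): _*)

    // Usage: df.withColumn("arr", arrayColumn(lit(1), lit(2), lit(3)))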
UDF to find and return the element of the arr sequence at the passed index. If no element is found, null is returned.
Spark UDF that makes a single blocking REST API call to a given URL. The result of this UDF is always produced: it contains a proper error if the call failed at any stage, and it never interrupts job execution (unless called with an invalid signature).
The default timeout can be configured through the spark.network.timeout Spark configuration option.
Parameters:
Response - a struct with the following fields:
headers: the response headers as name: value pairs (e.g. [Server: akka-http/10.1.10, Date: Tue, 07 Sep 2021 18:11:47 GMT])
Taken from upstream Spark.
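A minimal sketch of such a UDF; the (status, body, error) result shape is an assumption (the real struct also carries headers), and failures are captured rather than thrown:

    import org.apache.spark.sql.functions.udf
    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    // Blocking GET that never throws: errors (including 4xx/5xx, which make
    // getInputStream throw) end up in the "error" field.
    val restApiGet = udf { (urlStr: String) =>
      try {
        val conn = new URL(urlStr).openConnection().asInstanceOf[HttpURLConnection]
        conn.setConnectTimeout(120000) // in practice, derive from spark.network.timeout
        conn.setReadTimeout(120000)
        val status = conn.getResponseCode
        val body = Source.fromInputStream(conn.getInputStream).mkString
        conn.disconnect()
        (status, body, null: String)
      } catch {
        case e: Exception => (-1, null: String, e.getMessage)
      }
    }
    // spark.udf.register("rest_api_get", restApiGet)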
Function to add a new typecasted column to the input dataframe. The newly added column is a typecasted version of the passed column. The typecast operation is supported for string, boolean, byte, short, int, long, float, double, decimal, date and timestamp types.
spark session
input dataframe
input column to be typecasted
datatype to cast column to.
column name to be added in dataframe.
new dataframe with new typecasted column.
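A minimal sketch of this helper (the name castColumn is an assumption):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    // Cast an existing column to the requested type and append it under a new name.
    def castColumn(spark: SparkSession, df: DataFrame, inputColumn: String,
                   dataType: String, outputColumn: String): DataFrame =
      df.withColumn(outputColumn, col(inputColumn).cast(dataType))

    // Usage: castColumn(spark, df, "amount", "decimal(18,2)", "amount_dec")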
Method to get data from multiple source paths and combine it into a single destination path.
multiple source paths from which to merge the data.
destination path to combine all data to.
flag to compress final output file into gzip format
Method to get data from multiple part files in the source directory and combine it into a single file.
spark session
file extension. e.g.: ".csv", ".txt"
write mode in spark. Can be overwrite, append, error and ignore
input directory containing part files
output single file path
flag to delete temp source directory
flag to exclude headers from file
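A minimal sketch of the merge step, assuming Hadoop FileSystem streaming (the name mergePartFiles and the "part-" prefix filter are assumptions):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils
    import org.apache.spark.sql.SparkSession

    // Stream every part file in the source directory into one output file,
    // in listing order.
    def mergePartFiles(spark: SparkSession, sourceDir: String, outputFile: String): Unit = {
      val conf = spark.sparkContext.hadoopConfiguration
      val fs = FileSystem.get(conf)
      val parts = fs.listStatus(new Path(sourceDir))
        .filter(s => s.isFile && s.getPath.getName.startsWith("part-"))
        .sortBy(_.getPath.getName)
      val out = fs.create(new Path(outputFile))
      parts.foreach { part =>
        val in = fs.open(part.getPath)
        IOUtils.copyBytes(in, out, conf, false)
        in.close()
      }
      out.close()
    }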
Method used for abinitio's reinterpret_as function: reads the necessary bytes from the input data's byte array and converts them into a struct as specified by the typeInfo sequence.
TypeInfo can have multiple entries, each of which can be either decimal or string type. Depending on the argument passed to decimal or string, bytes are read from the input byte array: if the argument is an integer, that many bytes are read; if it is a string delimiter, bytes are read from the current position until the delimiter is found in the input byte array.
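A minimal sketch of that byte-consumption rule; the TypeInfo encoding below is hypothetical and assumes a single-byte character set:

    // Each field is either fixed-width (read n bytes) or delimited
    // (read until the delimiter is found).
    sealed trait TypeInfo
    case class FixedWidth(n: Int) extends TypeInfo
    case class Delimited(delimiter: String) extends TypeInfo

    def reinterpretAs(bytes: Array[Byte], typeInfo: Seq[TypeInfo]): Seq[String] = {
      var pos = 0
      typeInfo.map {
        case FixedWidth(n) =>
          val field = new String(bytes.slice(pos, pos + n))
          pos += n
          field
        case Delimited(d) =>
          val rest = new String(bytes.drop(pos))
          val idx = rest.indexOf(d)
          val field = if (idx >= 0) rest.substring(0, idx) else rest
          pos += field.getBytes.length + d.getBytes.length
          field
      }
    }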
Method to read values from inputData and create a dataframe with column name columnName and column type columnType for the values in inputData, separated by delimiter.
Function registers 4 different UDFs with the Spark registry: UDFs for the lookup_match, lookup_count, lookup_row and lookup functions. This function stores the data of the input dataframe in a broadcast variable, then uses this broadcast variable in the different lookup functions.
lookup: returns the first matching row for the given input keys.
lookup_count: returns the count of all matching rows for the given input keys.
lookup_match: returns 0 if there is no matching row and 1 if there are matching rows for the given input keys.
lookup_row: returns all matching rows for the given input keys.
The registered lookup functions accept up to 10 keys as input.
UDF Name
input dataframe
spark session
columns to be used as keys in lookup functions.
schema of entire row which will be stored for each matching key.
registered UDF definitions for the lookup functions. These UDFs return different results depending on the lookup function used.
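A minimal sketch of the broadcast-then-register pattern for a single string key (the real helper supports up to 10 keys and also registers lookup and lookup_row over the stored rows; all names here are assumptions):

    import org.apache.spark.sql.{DataFrame, Row, SparkSession}

    def createLookup(name: String, df: DataFrame, spark: SparkSession,
                     keyCol: String, rowCols: String*): Unit = {
      // Collect the lookup data once and index it by key on the driver.
      val rows = df.select((keyCol +: rowCols).map(df.col): _*).collect()
      val index: Map[String, Seq[Row]] =
        rows.groupBy(_.getAs[String](keyCol)).map { case (k, v) => k -> v.toSeq }
      val bcast = spark.sparkContext.broadcast(index)
      // Executors consult the broadcast copy instead of re-reading the data.
      spark.udf.register(name + "_count",
        (key: String) => bcast.value.getOrElse(key, Seq.empty).size)
      spark.udf.register(name + "_match",
        (key: String) => if (bcast.value.contains(key)) 1 else 0)
    }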
Method to create a UDF which looks for the passed input double in the input dataframe. This function first loads the dataframe's data into a broadcast variable, then defines a UDF which looks for the input double value in the broadcast data. If the input double lies between the passed col1 and col2 values, the corresponding row is added to the returned result; otherwise null is returned for the current row.
created UDF name
input dataframe
spark session
column whose value to be considered as minimum in comparison.
column whose value to be considered as maximum in comparison.
remaining column names to be part of result.
the registered UDF, which returns the matching rows for each row of the dataframe on which the range UDF is called.
Returns the internal representation of a date resulting from adding (or subtracting) a number of months to the specified date.
in yyyy-MM-dd format
Computes number of days between two specified dates in "yyyyMMdd" format
later input date
earlier input date
number of days between laterDate and earlierDate or null if either one is null
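A minimal, null-safe sketch of this computation (the name dateDifferenceDays is an assumption):

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter
    import java.time.temporal.ChronoUnit

    // Days between two "yyyyMMdd" strings; null if either input is null.
    def dateDifferenceDays(laterDate: String, earlierDate: String): Integer =
      if (laterDate == null || earlierDate == null) null
      else {
        val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
        ChronoUnit.DAYS.between(LocalDate.parse(earlierDate, fmt),
                                LocalDate.parse(laterDate, fmt)).toInt
      }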
Returns the internal representation of a timestamp resulting from adding (or subtracting) a number of months to the specified timestamp.
timestamp in yyyy-MM-dd HH:mm:ss.SSSS format
Returns the number of hours between two specified dates in standard format yyyy-MM-dd HH:mm:ss.SSSS.
Returns the number of minutes between two specified dates in standard format yyyy-MM-dd HH:mm:ss.SSSS.
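A minimal sketch covering both the hours and minutes variants, assuming the documented timestamp format:

    import java.time.LocalDateTime
    import java.time.format.DateTimeFormatter
    import java.time.temporal.ChronoUnit

    val tsFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSS")

    def datetimeDifferenceHours(later: String, earlier: String): Long =
      ChronoUnit.HOURS.between(LocalDateTime.parse(earlier, tsFmt),
                               LocalDateTime.parse(later, tsFmt))

    def datetimeDifferenceMinutes(later: String, earlier: String): Long =
      ChronoUnit.MINUTES.between(LocalDateTime.parse(earlier, tsFmt),
                                 LocalDateTime.parse(later, tsFmt))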
Method uses a Java regex to identify decimal numbers in the input string. The decimal number can take the following forms: 1. a simple integral number, e.g. 013334848, identified by regex; 2. a decimal number with an explicit decimal point, e.g. 123456.90, identified by a combination of the [0-9]+(\$decimal_point_char)[0-9]+ and (0\$decimal_point_char)[0-9]+ regexes.
After extracting the decimal number, the code checks whether its length exceeds the len parameter. If it does, the extracted decimal number is returned as-is. Otherwise the number is first left-padded with char_to_pad_with to make its length equal to len, and the minus sign (-), if any, is moved to the leftmost position.
input string.
length of characters.
character to left pad with. default value is "0"
A string that specifies the character that represents the decimal point.
a decimal string of the specified length or longer, left-padded with a specified character as needed and trimmed of leading zeros.
Method uses a Java regex to identify decimal numbers in the input string. The decimal number can take the following forms: 1. a simple integral number, e.g. 013334848, identified by a combination of the [1-9][0-9]*[0-9] and [1-9]+ regexes; 2. a decimal number with an explicit decimal point, e.g. 123456.90, identified by a combination of the [1-9][0-9]*(\$decimal_point_char)[0-9]+ and (0\$decimal_point_char)[0-9]*[0-9] regexes.
After extracting the decimal number, the code checks whether its length exceeds the len parameter. If it does, the extracted decimal number is returned as-is. Otherwise the number is first left-padded with char_to_pad_with to make its length equal to len, and the minus sign (-), if any, is moved to the leftmost position.
input string.
length of characters.
character to left pad with. default value is "0"
A string that specifies the character that represents the decimal point.
a decimal string of the specified length or longer, left-padded with a specified character as needed and trimmed of leading zeros.
Function returns a value rounded down to right_digits digits to the right of the decimal point.
Returns a number rounded up to a specified number of places to the right of the decimal point.
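A minimal sketch of the pair, assuming "down" means toward zero and "up" means away from zero:

    import scala.math.BigDecimal.RoundingMode

    def roundDown(value: BigDecimal, rightDigits: Int): BigDecimal =
      value.setScale(rightDigits, RoundingMode.DOWN)

    def roundUp(value: BigDecimal, rightDigits: Int): BigDecimal =
      value.setScale(rightDigits, RoundingMode.UP)

    // roundDown(BigDecimal("3.459"), 2) == 3.45
    // roundUp(BigDecimal("3.451"), 2)  == 3.46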
Function uses a Java regex to identify decimal numbers in the input string. The decimal number can take the following forms: 1. a simple integral number, e.g. 013334848, identified by a combination of the [1-9][0-9 ]*[0-9] and [1-9]+ regexes; 2. a decimal number with an explicit decimal point, e.g. 123456.90, identified by a combination of the [1-9][0-9]*(\$decimal_point_char)[0-9 ]+ and (0\$decimal_point_char)[0-9 ]*[0-9] regexes.
After extracting the decimal number, the code looks for a minus sign before the extracted number in the input and prepends it to the decimal number if found. Finally, all whitespace in the resulting decimal number is replaced with the empty string.
input string
A string that specifies the character that represents the decimal point.
a decimal from a string that has been trimmed of leading zeros and non-numeric characters.
UDF to get a record of type decode_datetime_type. The record has all its fields populated with the corresponding entries of the input date/timestamp.
The returned record has the following schema:
integer(8) year; integer(8) month; integer(8) day; integer(8) hour; integer(8) minute; integer(8) second; integer(8) microsecond;
Note: supported input time formats are yyyy-MM-dd HH:mm:ss.SSSSSS, yyyy-MM-dd HH:mm:ss, and yyyy-MM-dd only. Additional handling is done to support timestamps returned by now() function calls.
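A minimal sketch of the decode step for the supported formats (the name decodeDatetime and the tuple result are assumptions):

    import java.time.LocalDateTime
    import java.time.format.DateTimeFormatter

    // Returns (year, month, day, hour, minute, second, microsecond).
    def decodeDatetime(ts: String): (Int, Int, Int, Int, Int, Int, Int) = {
      // Normalize the shorter supported formats to the full one.
      val normalized = ts match {
        case s if s.length == 10 => s + " 00:00:00.000000"
        case s if s.length == 19 => s + ".000000"
        case s => s
      }
      val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
      val t = LocalDateTime.parse(normalized, fmt)
      (t.getYear, t.getMonthValue, t.getDayOfMonth,
       t.getHour, t.getMinute, t.getSecond, t.getNano / 1000)
    }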
Function to drop the passed columns from the input dataframe.
spark session
input dataframe.
list of columns to be dropped from dataframe.
new dataframe with dropped columns.
Returns integer values specifying days relative to January 1, 1900. This function returns the internal representation of a date given the year, month, and day. For example, encode_date returns the internal representation of the date specified by year 1998, month 5, and day 18: encode_date(1998, 5, 18) = 35931.
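A minimal sketch of this encoding (days since the 1900-01-01 epoch):

    import java.time.LocalDate
    import java.time.temporal.ChronoUnit

    def encodeDate(year: Int, month: Int, day: Int): Long =
      ChronoUnit.DAYS.between(LocalDate.of(1900, 1, 1), LocalDate.of(year, month, day))

    // encodeDate(1998, 5, 18) == 35931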
Returns true if the string column ends with the given suffix.
Method to return the result of evaluating a string expression in the context of a specified input column. The input column can be a struct-type record, a simple column, an array type, etc. The expression can be a reference to a nested column inside the input column, or any expression that requires values from the input column for its evaluation.
Note: the current implementation only supports the scenario where the input column is of struct type and expr is a simple dot-separated column reference into the input struct.
UDF to get file information for passed input file path.
Method to identify and return the first non-null expression.
Method to create a dataframe with a single column containing an increasing sequence id from start to end.
Function to create a sequence array between two passed numbers.
starting point of generated sequence
terminating point of generated sequence.
column containing sequence of integers.
UDF to generate a column with a sequence of integers between the passed start and end columns.
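A minimal sketch for both sequence helpers, using Spark's built-in sequence function (available since Spark 2.4):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, sequence}

    // Inclusive sequence of integers between two columns.
    def generateSequence(start: Column, end: Column): Column = sequence(start, end)

    // Usage: df.withColumn("ids", generateSequence(col("start"), col("end")))
    // Single-column id dataframe (spark.range is end-exclusive):
    //   spark.range(start, end + 1).toDF("id")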
UDF to get the last byte from the byte array of the input data.
Method to get an empty dataframe with the below abinitio log schema.
record string("|") node, timestamp, component, subcomponent, event_type; string("|\n") event_text; end
Computes the number of days in February for a given year.
year whose number of days in February needs to be calculated
number of days
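A minimal sketch, using the Gregorian leap-year rule:

    def daysInFebruary(year: Int): Int =
      if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) 29 else 28

    // daysInFebruary(2000) == 29; daysInFebruary(1900) == 28; daysInFebruary(2024) == 29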
Method to get the field at a specific position from a struct column.
UDF to get the integer comprising the last 4 bytes of the input data's byte array.
UDF to get the long comprising the last 8 bytes of the input data's byte array.
UDF to get the short comprising the last 2 bytes of the input data's byte array.
let string("") get_partial_drug_details = get_partial_drug_details_Udf(v_prod_id,v_gpi_no,v_filled_dt,v_mel_thrgh_dt,v_mpa_thrgh_dt,lookup_row("lkp_cag_ndc_gpi",v_cag_sk),for (j, j < count): lookup_row("lkp_ndc_gpi_list",first_defined(lookup_row("lkp_cag_ndc_gpi",v_cag_sk)[j].drug_name_list,'-'))); out :: (string("")) string_concat((string(""))(decimal(""))get_partial_drug_details,'|',(string(""))(decimal("")) v_gf_flg); out :: get_drug_details(v_cag_sk, v_prod_id, v_gpi_no, v_filled_dt, v_mel_thrgh_dt, v_mpa_thrgh_dt,v_prior_auth_nbr)= begin let string("") pattern=first_defined(lookup("lkp_cag_ndc_gpi",v_cag_sk).idntfn_gf,'-'); let string("") flag=first_defined(lookup("lkp_cag_ndc_gpi",v_cag_sk).include_exclude_gf,'-'); let int count = lookup_count("lkp_cag_ndc_gpi",v_cag_sk); let int ndc_count = 0; let int rule_cnt = 0; let int ndc_rule_cnt = 0; let rec_vec v_rec = allocate_with_nulls(); let ndc_rec_vec v_ndc_rec = allocate_with_nulls(); let int drug_match = 0; let string(1) drug_match_ind = '0'; let int claim_flg = 0; let int mbr_flg = 0; let int exact_flg = 0; let int c_valid_rec = 0; let int m_valid_rec = 0; let int v_prior_auth_flg = 0; let int exact_inc_flg = 0; let int exact_exc_flg = 0; let string("") v_rec_drug_name_list = "-"; let int v_gf_flg=0; v_gf_flg= if(flag=='I' and string_index(v_prior_auth_nbr,pattern)!=0)1 else if(flag=='E' and string_index(v_prior_auth_nbr,pattern)!=0)0 else if(flag=='I' and string_index(v_prior_auth_nbr,pattern)==0)0 else if(flag=='E' and string_index(v_prior_auth_nbr,pattern)==0)1 else if(flag=='-' or pattern=='-')2 else 2; while (count > rule_cnt) begin v_rec = lookup_next("lkp_cag_ndc_gpi"); v_rec_drug_name_list = first_defined(v_rec.drug_name_list,'-'); if(string_upcase(v_rec.drug_type) member [vector 'GPI LIST','NDC LIST'] && !is_null(v_rec.drug_name_list)) begin ndc_count = lookup_count("lkp_ndc_gpi_list",v_rec_drug_name_list); ndc_rule_cnt=0; while (ndc_count > ndc_rule_cnt) begin v_ndc_rec = lookup_next("lkp_ndc_gpi_list"); drug_match = if(is_null(v_rec.drug_type) || is_blank(v_rec.drug_type) || first_defined(v_rec.drug_type,'-') == '-' || (string_upcase(v_rec.drug_type) == 'GPI LIST' && (starts_with(string_downcase(v_gpi_no), string_downcase(v_ndc_rec.ndc_gpi)) == 1 || v_ndc_rec.list_name == '-')) || (string_upcase(v_rec.drug_type) == 'NDC LIST' && (starts_with(string_downcase(v_prod_id), string_downcase(v_ndc_rec.ndc_gpi)) == 1 || v_ndc_rec.list_name == '-'))) 1 else 0;
let string("") get_partial_drug_details = get_partial_drug_details_Udf(v_prod_id,v_gpi_no,v_filled_dt,v_mel_thrgh_dt,v_mpa_thrgh_dt,lookup_row("lkp_cag_ndc_gpi",v_cag_sk),for (j, j < count): lookup_row("lkp_ndc_gpi_list",first_defined(lookup_row("lkp_cag_ndc_gpi",v_cag_sk)[j].drug_name_list,'-'))); out :: (string("")) string_concat((string(""))(decimal(""))get_partial_drug_details,'|',(string(""))(decimal("")) v_gf_flg); out :: get_drug_details(v_cag_sk, v_prod_id, v_gpi_no, v_filled_dt, v_mel_thrgh_dt, v_mpa_thrgh_dt,v_prior_auth_nbr)= begin let string("") pattern=first_defined(lookup("lkp_cag_ndc_gpi",v_cag_sk).idntfn_gf,'-'); let string("") flag=first_defined(lookup("lkp_cag_ndc_gpi",v_cag_sk).include_exclude_gf,'-'); let int count = lookup_count("lkp_cag_ndc_gpi",v_cag_sk); let int ndc_count = 0; let int rule_cnt = 0; let int ndc_rule_cnt = 0; let rec_vec v_rec = allocate_with_nulls(); let ndc_rec_vec v_ndc_rec = allocate_with_nulls(); let int drug_match = 0; let string(1) drug_match_ind = '0'; let int claim_flg = 0; let int mbr_flg = 0; let int exact_flg = 0; let int c_valid_rec = 0; let int m_valid_rec = 0; let int v_prior_auth_flg = 0; let int exact_inc_flg = 0; let int exact_exc_flg = 0; let string("") v_rec_drug_name_list = "-"; let int v_gf_flg=0; v_gf_flg= if(flag=='I' and string_index(v_prior_auth_nbr,pattern)!=0)1 else if(flag=='E' and string_index(v_prior_auth_nbr,pattern)!=0)0 else if(flag=='I' and string_index(v_prior_auth_nbr,pattern)==0)0 else if(flag=='E' and string_index(v_prior_auth_nbr,pattern)==0)1 else if(flag=='-' or pattern=='-')2 else 2; while (count > rule_cnt) begin v_rec = lookup_next("lkp_cag_ndc_gpi"); v_rec_drug_name_list = first_defined(v_rec.drug_name_list,'-'); if(string_upcase(v_rec.drug_type) member [vector 'GPI LIST','NDC LIST'] && !is_null(v_rec.drug_name_list)) begin ndc_count = lookup_count("lkp_ndc_gpi_list",v_rec_drug_name_list); ndc_rule_cnt=0; while (ndc_count > ndc_rule_cnt) begin v_ndc_rec = lookup_next("lkp_ndc_gpi_list"); drug_match = if(is_null(v_rec.drug_type) || is_blank(v_rec.drug_type) || first_defined(v_rec.drug_type,'-') == '-' || (string_upcase(v_rec.drug_type) == 'GPI LIST' && (starts_with(string_downcase(v_gpi_no), string_downcase(v_ndc_rec.ndc_gpi)) == 1 || v_ndc_rec.list_name == '-')) || (string_upcase(v_rec.drug_type) == 'NDC LIST' && (starts_with(string_downcase(v_prod_id), string_downcase(v_ndc_rec.ndc_gpi)) == 1 || v_ndc_rec.list_name == '-'))) 1 else 0;
claim_flg = if (date_difference_days((date("YYYYMMDD")) (datetime("YYYYMMDD")) now(), (date("YYYYMMDD")) v_filled_dt) <= (int) (decimal("")) v_rec.lookback_days) 1 else 0; mbr_flg = if (date_difference_days((date("YYYYMMDD")) v_mel_thrgh_dt, (date("YYYYMMDD")) (datetime("YYYYMMDD")) now()) >= (int) (decimal("")) v_rec.mel_lookforward) 1 else 0;
c_valid_rec = if (v_rec.include_exclude == 'I' && exact_inc_flg == 0 && exact_exc_flg == 0) if (drug_match == 0) 0 else if (drug_match == 1 && claim_flg == 1) 1 else 0 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0 && exact_inc_flg == 1) if (drug_match == 1 && claim_flg == 1) 0 else 1 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0) if (drug_match == 1 && claim_flg == 1) 0 else if (drug_match == 1 && claim_flg == 0 && (rule_cnt != 0 && c_valid_rec == 1)) 1 else if (drug_match == 0 && claim_flg == 1 && ((rule_cnt != 0 && c_valid_rec == 1) || rule_cnt == 0)) 1 else 0 else c_valid_rec; m_valid_rec = if (v_rec.include_exclude == 'I' && exact_inc_flg == 0 && exact_exc_flg == 0) if (drug_match == 0) 0 else if (drug_match == 1 && mbr_flg == 1) 1 else 0 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0 && exact_inc_flg == 1) if (drug_match == 1 && mbr_flg == 1) 0 else 1 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0) if (drug_match == 1 && mbr_flg == 1) 0 else if (drug_match == 1 && mbr_flg == 0 && (rule_cnt != 0 && m_valid_rec == 1)) 1 else if (drug_match == 0 && mbr_flg == 1 && ((rule_cnt != 0 && m_valid_rec == 1) || rule_cnt == 0)) 1 else 0 else m_valid_rec; exact_inc_flg = if (v_rec.include_exclude == 'I' && drug_match == 1) 1 else exact_inc_flg; exact_exc_flg = if (v_rec.include_exclude == 'E' && drug_match == 1) 1 else exact_exc_flg; v_prior_auth_flg = if ((v_prior_auth_flg == 0 && date_difference_days((date("YYYYMMDD")) v_mpa_thrgh_dt, (date("YYYYMMDD")) (datetime("YYYYMMDD")) now()) == (int) (decimal("")) v_rec.pa_lookforward) || v_prior_auth_flg == 1) 1 else 0; ndc_rule_cnt = ndc_rule_cnt + 1; end end else begin drug_match = if(is_null(v_rec.drug_type) || is_blank(v_rec.drug_type) || first_defined(v_rec.drug_type,'-') == '-' || (string_upcase(v_rec.drug_type) member [vector 'GPI LIST', 'NDC LIST'] && v_rec.drug_name_list == '-') || (string_upcase(v_rec.drug_type) == 'GPI' && (starts_with(string_downcase(v_gpi_no), string_downcase(v_rec.drug_name_list)) == 1 || v_rec.drug_name_list == '-')) || (string_upcase(v_rec.drug_type) == 'NDC' && (starts_with(string_downcase(v_prod_id), string_downcase(v_rec.drug_name_list)) == 1 || v_rec.drug_name_list == '-'))) 1 else 0;
claim_flg = if (date_difference_days((date("YYYYMMDD")) (datetime("YYYYMMDD")) now(), (date("YYYYMMDD")) v_filled_dt) <= (int) (decimal("")) v_rec.lookback_days) 1 else 0;
mbr_flg = if (date_difference_days((date("YYYYMMDD")) v_mel_thrgh_dt, (date("YYYYMMDD")) (datetime("YYYYMMDD")) now()) >= (int) (decimal("")) v_rec.mel_lookforward) 1 else 0;
c_valid_rec = if (v_rec.include_exclude == 'I' && exact_inc_flg == 0 && exact_exc_flg == 0) if (drug_match == 0) 0 else if (drug_match == 1 && claim_flg == 1) 1 else 0 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0 && exact_inc_flg == 1) if (drug_match == 1 && claim_flg == 1) 0 else 1 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0) if (drug_match == 1 && claim_flg == 1) 0 else if (drug_match == 1 && claim_flg == 0 && (rule_cnt != 0 && c_valid_rec == 1)) 1 else if (drug_match == 0 && claim_flg == 1 && ((rule_cnt != 0 && c_valid_rec == 1) || rule_cnt == 0)) 1 else 0 else c_valid_rec; m_valid_rec = if (v_rec.include_exclude == 'I' && exact_inc_flg == 0 && exact_exc_flg == 0) if (drug_match == 0) 0 else if (drug_match == 1 && mbr_flg == 1) 1 else 0 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0 && exact_inc_flg == 1) if (drug_match == 1 && mbr_flg == 1) 0 else 1 else if (v_rec.include_exclude == 'E' && exact_exc_flg == 0) if (drug_match == 1 && mbr_flg == 1) 0 else if (drug_match == 1 && mbr_flg == 0 && (rule_cnt != 0 && m_valid_rec == 1)) 1 else if (drug_match == 0 && mbr_flg == 1 && ((rule_cnt != 0 && m_valid_rec == 1) || rule_cnt == 0)) 1 else 0 else m_valid_rec; exact_inc_flg = if (v_rec.include_exclude == 'I' && drug_match == 1) 1 else exact_inc_flg; exact_exc_flg = if (v_rec.include_exclude == 'E' && drug_match == 1) 1 else exact_exc_flg; v_prior_auth_flg = if ((v_prior_auth_flg == 0 && date_difference_days((date("YYYYMMDD")) v_mpa_thrgh_dt, (date("YYYYMMDD")) (datetime("YYYYMMDD")) now()) == (int) (decimal("")) v_rec.pa_lookforward) || v_prior_auth_flg == 1) 1 else 0; end rule_cnt = rule_cnt + 1; end
out :: (string("")) string_concat((string(""))(decimal(""))c_valid_rec,'|',(string(""))(decimal("")) m_valid_rec,'|',(string(""))(decimal("")) v_prior_auth_flg,'|',(string(""))(decimal("")) v_gf_flg); end;
Method to check if the current column is null or has an empty value.
Checks if a string is ASCII.
column to be checked
true if the input string is ASCII, otherwise false
Method to identify whether the input string is blank.
input string.
returns 1 if the given string contains only blank characters or is a zero-length string, otherwise returns 0
Tests whether an object is composed of all binary-zero bytes. This function returns: 1. 1 if obj contains only binary-zero bytes or is a zero-length string; 2. 0 if obj contains any non-zero bytes; 3. NULL if obj is NULL.
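A minimal, null-safe sketch of this check (the name isBzero is an assumption):

    def isBzero(obj: Array[Byte]): Integer =
      if (obj == null) null
      else if (obj.forall(_ == 0)) 1   // also true for zero-length input
      else 0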
Checks if an input string contains only ASCII characters and digits.
string to be checked
true if the input string contains only ASCII characters and digits, or null if the input is null
Method to identify whether the passed input column is a valid expression after typecasting to the passed dataType. While typecasting, if len is present, this function also ensures that the maximum length of the input column after the typecast operation does not exceed len.
input column expression to be validated.
datatype to which the input column expression must be typecasted. If the datatype is a string, it is treated as a timestamp format. If it is a list of strings, it is treated as the current timestamp format and the new timestamp format to which the input column needs to be typecasted.
max length of input column after typecasting it to dataType.
0 if input column is not valid after typecasting or 1 if it is valid.
Validates a date against an input format.
A pattern such as yyyy-MM-dd, yyyy-MM-dd HH:mm:ss.SSSS, or dd.MM.yyyy
Input date to be validated
true if the input date is valid otherwise false
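A minimal sketch of strict date validation (the name isValidDate is an assumption):

    import java.text.SimpleDateFormat

    // Non-lenient parse; any failure means the date is invalid.
    def isValidDate(format: String, date: String): Boolean =
      try {
        val fmt = new SimpleDateFormat(format)
        fmt.setLenient(false)
        fmt.parse(date)
        true
      } catch {
        case _: Exception => false
      }

    // isValidDate("yyyy-MM-dd", "2021-02-30") == false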
By default returns only the first matching record
Returns the last matching record
Boolean Column
UDF to return a flag for each character indicating whether it is present in the input string.
Method to create array of size "size" containing seedVal as each entry
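A minimal sketch using Spark's built-in array_repeat (available since Spark 2.4):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{array_repeat, lit}

    def seededArray(seedVal: Column, size: Int): Column = array_repeat(seedVal, size)

    // Usage: df.withColumn("arr", seededArray(lit(0), 5))  // [0, 0, 0, 0, 0]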
UDF to get multifile information for the passed input file path.
UDF for murmur hash generation for any column type
Method to get current timestamp.
current timestamp in YYYYMMddHHmmssSSSSSS format.
UDF to group an input decimal into multiple groups separated by separator.
Returns the first string in a target string that matches a regular expression.
UDF wrapper over the re_index function.
Returns the first string in a target string that matches a regular expression.
Replaces all substrings in a target string that match a specified regular expression.
A string that the function searches for a substring that matches pattern_expr.
regular expression
replacement string
Number of characters, from the beginning of str, to skip before searching.
the resulting string, in which all substrings matching the specified regular expression are replaced.
Replaces only the first regex-matching occurrence in the target string.
A string that the function searches for a substring that matches pattern_expr.
regular expression
replacement string
the resulting string, in which the first substring matching the specified regular expression is replaced.
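A minimal sketch of both replace helpers (names are assumptions; offset skips characters before searching):

    def reReplace(target: String, pattern: String, replacement: String, offset: Int = 0): String = {
      val head = target.substring(0, offset)
      head + target.substring(offset).replaceAll(pattern, replacement)
    }

    def reReplaceFirst(target: String, pattern: String, replacement: String): String =
      target.replaceFirst(pattern, replacement)

    // reReplace("a1b2c3", "[0-9]", "#")      == "a#b#c#"
    // reReplaceFirst("a1b2c3", "[0-9]", "#") == "a#b2c3"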
UDF to split the input string via a pattern string and remove all empty substrings.
Method to read data from a hive table.
spark session
hive database
hive table.
hive table partition to read data from, if provided.
dataframe with data read from Hive Table.
Reads a full hive table partition, by reading every subpartition separately and performing a union on all the final DataFrames
This function is meant as a temporary workaround for the Hive metastore crashing when too many partitions are queried at the same time.
spark session
hive database name
hive table name
top-level partition's key
top-level partition's value
A complete DataFrame with the selected hive table partition
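A minimal sketch of the subpartition-by-subpartition read (names are assumptions; SHOW PARTITIONS returns specs like "key=value/subkey=subvalue"):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def readHivePartition(spark: SparkSession, db: String, table: String,
                          key: String, value: String): DataFrame = {
      val subPartitions = spark
        .sql(s"SHOW PARTITIONS $db.$table PARTITION ($key='$value')")
        .collect()
        .map(_.getString(0))
      subPartitions
        .map { spec =>
          // Turn "k1=v1/k2=v2" into "k1='v1' AND k2='v2'".
          val predicate = spec.split("/").map { kv =>
            val Array(k, v) = kv.split("=", 2)
            s"$k='$v'"
          }.mkString(" AND ")
          spark.sql(s"SELECT * FROM $db.$table WHERE $predicate")
        }
        .reduce(_ union _)
    }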
Method removes any non-digit characters from the specified string column.
input String Column
Cleaned string column or null
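A minimal sketch with regexp_replace:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.regexp_replace

    // Strip everything that is not a digit.
    def removeNonDigits(input: Column): Column = regexp_replace(input, "[^0-9]", "")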
Method to replace string columns having an empty value with null.
Function to add a new column to the passed dataframe. The newly added column's value is decided by the presence of the inputCol value in the array comprised of value and values. If the inputCol value is found, the replaceWith value is added in the new column; otherwise the inputCol value is added.
spark session.
input dataframe.
name of new column to be added.
column name whose value is searched.
value with which to replace searched value if found.
element to be combined in array column
all values to be combined in array column for searching purpose.
dataframe with new column with column name outputCol
Function to add a new column to the passed dataframe. The newly added column's value is decided by the presence of the inputCol value in the array comprised of value, values, and null. If the inputCol value is found, the replaceWith value is added in the new column; otherwise the inputCol value is added.
spark session.
input dataframe.
name of new column to be added.
column name whose value is searched.
value with which to replace searched value if found.
element to be combined in array column
all values to be combined in array column for searching purpose.
dataframe with new column with column name outputCol
Function to add a new column to the passed dataframe. The newly added column's value is decided by the presence of the inputCol value in the array comprised of value, values, and null. If the inputCol value is found, null is added in the new column; otherwise the inputCol value is added.
spark session.
input dataframe.
name of new Column to be added.
column name whose value is searched.
element to be combined in array column.
all values to be combined in array column for searching purpose.
dataframe with new column with column name outputCol
UDF to find str in the input sequence toBeReplaced and return replace if found; otherwise str is returned.
UDF to find str in the input sequence toBeReplaced and return null if found; otherwise str is returned.
Adds an explicit sign to the number, e.g. 2 -> +2; -004 -> -004; 0 -> +0.
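A minimal sketch matching those examples (the name addExplicitSign is an assumption):

    def addExplicitSign(num: String): String =
      if (num == null || num.startsWith("-") || num.startsWith("+")) num
      else "+" + num

    // addExplicitSign("2") == "+2"; addExplicitSign("-004") == "-004"; addExplicitSign("0") == "+0"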
Function to split the column colName in the input dataframe into multiple columns using the split pattern. If a prefix is provided, each newly generated column is named prefix followed by a column number; otherwise the original column name is used.
spark session.
input dataframe.
column in dataframe which needs to be split into multiple columns.
regex with which column in input dataframe will be split into multiple columns.
column prefix to be used with all newly generated columns.
new dataframe whose new column values are generated by splitting the original column colName.
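A minimal sketch for a fixed number of output columns (names and the explicit n parameter are assumptions):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, split}

    // Split colName into n columns named <prefix>1 .. <prefix>n.
    def splitIntoColumns(df: DataFrame, colName: String, pattern: String,
                         prefix: String, n: Int): DataFrame = {
      val parts = split(col(colName), pattern)
      (1 to n).foldLeft(df)((d, i) => d.withColumn(s"$prefix$i", parts.getItem(i - 1)))
    }

    // splitIntoColumns(df, "full_name", " ", "name_", 2)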
UDF to break the input string into multiple strings via a delimiter. The number of strings after the split is adjusted to the passed width parameter: if there are fewer strings, empty strings are added; if there are more, the first width entries are kept and the rest are discarded.
Returns true if the string column starts with the given prefix.
Method to return the character code of the character at position index in the inputStr string.
input string
location of character to get code.
integer column.
This implementation is incorrect.
Converts a string from one character set to another, replacing inconvertible characters with a specified string.
Method which returns the string of characters present in both input strings, in the order they appear in the first string.
Compares two input strings, then returns characters that appear in one string but not in the other.
UDF to find the index of seekStr in inputStr. The returned index is 1-based.
UDF to find the index of seekStr in inputStr from the offset index onwards. The returned string position is 1-based.
Method which returns true if the input string contains all alphabetic characters, or false otherwise.
Method which returns true if the input string contains all numeric characters, or false otherwise.
Concatenates the elements of a column using the delimiter.
Method to test whether a string matches a specified pattern. This function returns 1 if the input string matches the pattern, and 0 if it does not. As in the abinitio version, the % character in the pattern matches zero or more characters and the _ character matches a single character.
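A minimal sketch of that matching rule, translating the %/_ pattern into a Java regex (the name stringLike is an assumption):

    def stringLike(input: String, pattern: String): Int = {
      val regex = pattern.map {
        case '%' => ".*"  // zero or more characters
        case '_' => "."   // exactly one character
        case c   => java.util.regex.Pattern.quote(c.toString)
      }.mkString
      if (input.matches(regex)) 1 else 0
    }

    // stringLike("hello", "he%") == 1; stringLike("hello", "h_llo") == 1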
Left-pads the input string column with pad_char to a length of len. If the length of the input column is more than len, the input column is returned unmodified.
Function trims the string and then left-pads it with the given character up to the given length. If the length of the trimmed string is equal to or greater than the given length, the input string is returned.
input string
length in number of characters.
A character used to pad input string to length len.
string of a specified length, trimmed of leading and trailing blanks and left-padded with a given character.
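A minimal sketch of the trim-then-pad behavior with Spark's built-ins (the name stringLvLpad is an assumption):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{lpad, trim}

    def stringLvLpad(input: Column, len: Int, padChar: String): Column =
      lpad(trim(input), len, padChar)

    // stringLvLpad(col("code"), 8, "0")  // "  123  " -> "00000123"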
Function pads the input on the right with the character char_to_pad_with to make the string len characters long. If str is already len or more characters long, the function returns the input unmodified.
Function trims the string and then pads it on the right with the given character up to the given length. If the length of the trimmed string is equal to or greater than the given length, the input string is returned.
input string
length in number of characters.
A character used to pad input string to length len.
string of a specified length, trimmed of leading and trailing blanks and right-padded with a given character.
Function to replace an occurrence of seekStr with newStr in the input string, after offset characters from the first character.
input string on which to perform replace operation.
string to be replaced in input string.
string to be used instead of seekStr in input string.
number of characters to skip from the beginning of the input string before performing the replace operation.
modified string where seekStr is replaced with newStr in input string.
Returns the index of the first character of the last occurrence of a seek string within another input string. The returned index is 1-based.
UDF to find the index of seekStr in inputStr, searching from the end of inputStr and skipping offset characters from the end. The offset is the number of characters, from the end of str, to skip before searching. The returned string position is 1-based.
UDF to split input string via delimiter string.
UDF to split the input string via a delimiter string and remove all empty substrings.
Method to find a substring of the input string.
string on which to find substring.
1-based starting position to find the substring from.
total length of substring to be found.
substring of input string
UDF to return the nth element from the end of the passed array of elements. If the input sequence has fewer than n elements, the first element is returned.
UDF to take the Nth element from the beginning. If the input sequence has fewer than N elements, an exception is thrown.
UDF to count the number of characters in inputStr which are present in charFlag.
Method to convert
Method to return an integer value representing the number of days from "1-1-1990" to today.
integer value
UDF to return a string in the native character set made up of bytes from the given map. Each byte of the result is the value of map indexed by the character code of the corresponding byte of the input string str. The function returns NULL if any argument is NULL.
UDF to truncate the microseconds part of a timestamp. This is needed because abinitio and Spark have some incompatibility in the microseconds part of the timestamp format.
Method to take the union of all passed dataframes.
list of dataframes to union.
union of all passed input dataframes.
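A minimal sketch (schemas of all dataframes must match):

    import org.apache.spark.sql.DataFrame

    def unionAll(dfs: Seq[DataFrame]): DataFrame = dfs.reduce(_ union _)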
Method to write the data passed in the dataframe in a specific file format.
dataframe containing data.
path to write data to.
spark session.
underlying data source specific properties.
file format in which to persist data. Supported file formats are csv, text, json, parquet, orc
columns to be used for partitioning.
used to bucket the output by the given columns. If specified, the output is laid out on the file-system similar to Hive's bucketing scheme.
number of buckets to be used.
columns on which to order data while persisting.
table name for persisting data.
database name for persisting data.
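A minimal sketch of the path-based write (names are assumptions; the bucketBy/sortBy/saveAsTable variant for table writes is omitted here):

    import org.apache.spark.sql.DataFrame

    def writeDataFrame(df: DataFrame, path: String, props: Map[String, String],
                       format: String, partitionColumns: Seq[String]): Unit =
      df.write
        .format(format)                      // csv, text, json, parquet, orc
        .options(props)                      // data source specific properties
        .partitionBy(partitionColumns: _*)
        .save(path)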
UDF to write logging parameters to the log port.
Converts yyyyMMdd to YYYYJJJ
date in yyyyMMdd format
a date converted to YYYYJJJ
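A minimal sketch: the Julian form is the 4-digit year followed by the 3-digit day of year (the name dateToJulian is an assumption):

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    def dateToJulian(date: String): String = {
      val d = LocalDate.parse(date, DateTimeFormatter.ofPattern("yyyyMMdd"))
      f"${d.getYear}%04d${d.getDayOfYear}%03d"
    }

    // dateToJulian("19980518") == "1998138"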
Method to zip two arrays, the first containing event_type and the second containing event_text.