A "bulk import" is a process involving a large amount of data to write in one or more connected tables, usually starting from a file.
Writing data means inserting records in table or updating them.
Typical problems to deal with when importing data are:
how to speed up the import process (e.g. loading 1M rows from a file in a few minutes and not in days or weeks)
updating already existing records, where a specific record is identified not by its PK but by a set of fields that can be used to identify a single record (e.g. COMPANY_ID + CODEs instead of a PK based on a single numeric field like PROG_ID)
validating input data in terms of:
NOT NULL constraints
type validation (e.g. an integer, a decimal number, a date expressed as yyyy-MM-dd, a date+time expressed as yyyy-MM-dd HH:mm)
length validation (e.g. a text no longer than N characters, an integer having no more than N digits, a decimal number having no more than N/M integer/decimal digits)
allowed values (compared with an enumeration of allowed values, like "Y" or "N")
unique keys, i.e. a set of fields defined in the border table which represent a unique record (e.g. FIRST_NAME + LAST_NAME or EMAIL for a customer border table).
validating input data with respect to other tables, i.e. through a foreign key (e.g. a CODE passed in input is validated against another table where CODE is one of the fields, in order to check it out and then fetch the corresponding PROG_ID primary key)
The best way to manage all these issues is by splitting them into a sequence of steps:
loading a CSV file into a border table "x_csv". This table must have a simple structure: a CLOB field hosting the whole CSV row, a numeric PK (e.g. ROW_ID), and other fields to partition data per tenant (e.g. COMPANY_ID). The aim of this table is to set the file aside and work only at the database level, which is far faster than working with files. A grid can be defined to show and edit the CLOB content, split into many columns, one for each CSV column. Editing is helpful to fix input data through the grid, in case of validation errors. Additional fields can be added:
ERRORS - text field to store a list of errors (e.g. a JSON list of errors)
CONTAINS_ERRORS - char field used to store an error condition (Y/N)
validating data in terms of mandatory, type, length, enumeration and unique key constraints (the last validation requires filling in a second border table, where the uniqueness constraint can be checked easily); a second table is therefore needed in order to host each field separately; consequently, a mapping is needed between the CLOB field in the first border table and the N fields in the second border table. During this step, all the validations reported above are performed. In case of errors, these are stored in the first border table: fixes can be applied in the first border table and then the validation can be re-executed.
validating data in terms of foreign keys, by using destination tables to check out values (FKs) - destination tables are needed to check out a CODE and get the corresponding PK (e.g. PROG_ID); the second border table must therefore contain not only CODE fields but also the FK fields (e.g. PROG_ID). In order to speed up this process, a limited number of UPDATE statements must be carried out.
loading data from the border table to destination tables - the second border table contains all data required to fill in any destination table: only INSERT or UPDATE operations are now needed, apart from the calculation of the PK for a destination table, in case this is based on a UUID or a counter (computed through a counter table). In order to speed up the calculation of counters, a single UPDATE must be carried out: in this way, a lock is acquired only once and only one update operation is performed.
The steps reported above can be executed using a few utility methods:
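At a glance, the whole import reduces to the following sequence of calls; settings1..settings4 are placeholders for the settings objects described in the corresponding steps below:
utils.readCsvAndLoadTable(settings1); // step 1: CSV file -> first border table (x_csv)
var json = utils.mapClobFieldToTable(settings2); // step 2: first border table -> second border table + validations
utils.validateCode(settings3); // step 3: one call per CODE field, fills in the FK fields in the second border table
utils.copyDataFromBorderTable(settings4); // step 4: one call per destination table (INSERT or UPDATE)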
1. loading a CSV file to a border table "x_csv"
utils.readCsvAndLoadTable(settings);
where settings is a javascript object containing the attributes described at the end of this section (step 1 attributes)
Example of x_csv border table and javascript:
TABLE:
create table BT01_ITEMS_CSV(
COMPANY_ID CHAR(5) NOT NULL,
ROW_ID NUMERIC(12) NOT NULL,
ROW_CONTENT MEDIUMTEXT,
ERRORS VARCHAR(4000),
CONTAINS_ERRORS CHAR(1),
PRIMARY KEY(COMPANY_ID,ROW_ID)
);
JAVASCRIPT:
var settings = {
srcDirId: 9,
srcFileName: "art.csv",
charSet:"UTF-8",
idAsNum:"ROW_ID",
idAsUuid: null,
additionalFields: {
COMPANY_ID: "00000",
ERRORS: "",
CONTAINS_ERRORS: "N"
},
headers: false,
commitSize: 10000,
clobFieldName: "ROW_CONTENT",
datasourceId: null,
tableName: "BT01_ITEMS_CSV"
};
utils.readCsvAndLoadTable(settings);
// Time spent to load 1M rows from a CSV file: 93474ms
2. validating data in terms of mandatory, type, length, enumeration, key uniqueness, by loading a second border table
var json = utils.mapClobFieldToTable(settings);
where settings is a javascript object containing the attributes described at the end of this section (step 2 attributes)
Important note: in order to make this validation process fast, it is essential to create an index on the fields of the "unique key" passed as an argument to this method. DO NOT create a unique index, only a plain one!
Example of the second border table and javascript:
TABLE:
create table BT02_ITEMS_DATA(
COMPANY_ID CHAR(5) NOT NULL,
ROW_ID NUMERIC(12) NOT NULL,
ITEM_CODE VARCHAR(20),
DESCRIPTION VARCHAR(255),
BARCODE VARCHAR(255),
VAT_CODE VARCHAR(20),
FK_ART01 NUMERIC(12),
FK_ART02 NUMERIC(12),
FK_ART03 NUMERIC(12),
PRICE DECIMAL(18,5),
SELL_NR NUMERIC(18),
CREATE_DATE DATE,
LAST_UPDATE DATETIME,
ERRORS VARCHAR(4000),
CONTAINS_ERRORS CHAR(1),
PRIMARY KEY(COMPANY_ID,ROW_ID)
);
JAVASCRIPT:
var settings = {
srcDatasourceId: null,
srcTableName: "BT01_ITEMS_CSV",
destDatasourceId: null,
destTableName: "BT02_ITEMS_DATA",
clobFieldName: "ROW_CONTENT",
separator: ";",
errorFieldName: "ERRORS",
containsErrorsField: "CONTAINS_ERRORS",
asIsFields: {
COMPANY_ID: "COMPANY_ID",
ROW_ID: "ROW_ID",
},
defaultFields: {
ERRORS: "",
CONTAINS_ERRORS: "N",
},
mappedFields: [
{
fieldName: "ITEM_CODE",
fieldType: "TEXT",
fieldLength: 20,
mandatory: true
},
{
fieldName: "DESCRIPTION",
fieldType: "TEXT",
fieldLength: 255,
mandatory: true
},
{
fieldName: "BARCODE",
fieldType: "TEXT",
fieldLength: 255,
mandatory: true
},
{
fieldName: "VAT_CODE",
fieldType: "TEXT",
fieldLength: 20,
mandatory: false
},
{
fieldName: "PRICE",
fieldType: "DEC",
fieldInt: 8,
fieldDec: 2,
decSeparator: ".",
mandatory: true
},
{
fieldName: "SELL_NR",
fieldType: "INT",
fieldInt: 8,
mandatory: false
},
{
fieldName: "CREATE_DATE",
fieldType: "DATE",
dateFormat: "yyyy-MM-dd",
mandatory: false
},
{
fieldName: "LAST_UPDATE",
fieldType: "DATETIME",
dateFormat: "yyyy-MM-dd HH:mm",
mandatory: false
},
{
fieldName: "FLAG",
fieldType: "TEXT",
fieldLength: 1,
enumeration: "Y,N",
mandatory: false,
mapped: false // optional: if set to false, this field will not be considered when inserting a row in the border table
}
],
commitSize: 10000
};
var json = utils.mapClobFieldToTable(settings);
// Time spent to map 1M rows of CSV content stored in a CLOB field of a table to another table: 112075ms
After invoking the mapClobFieldToTable function, the second border table is filled in with all valid rows coming from the first border table. All invalid records (i.e. those not passing the validation checks described above) are not written to the second border table; the validation error is reported in the "errorFieldName" field defined in the first table and the "containsErrorsField" field in the same table is also set to "Y", to quickly identify the presence of an error for a specific row.
By and large, multiple validation errors can be fired for the same record (e.g. a mandatory constraint violated for field A and a too long value for field B of the same record). The "errorFieldName" field contains all the errors found, as a JSON array of objects, each one having the following structure:
{
fieldName: "...", // the field name violating the validation
errorType: "...", // the validation error code; it can be:
// "CANNOT BE EMPTY","INVALID TYPE",
// "INVALID LENGTH","INVALID DATA","DUPLICATED KEY","ROW TOO SHORT"
errorMessage: "...", // a long error message explaining the reason of the validation error (in english)
data: "..." // the field value violating the validation checking
}
This content can be helpful to provide a more detailed explanation of why a CSV line has been rejected, and it can easily be shown in a grid panel.
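For example, a rejected row violating two constraints might have an ERRORS content like the following (field names and messages are purely illustrative):
[
  {
    "fieldName": "ITEM_CODE",
    "errorType": "CANNOT BE EMPTY",
    "errorMessage": "The field ITEM_CODE is mandatory but no value has been provided",
    "data": ""
  },
  {
    "fieldName": "PRICE",
    "errorType": "INVALID TYPE",
    "errorMessage": "The value is not a valid decimal number",
    "data": "12,AB"
  }
]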
Note: this method returns a JSON string containing the following attributes:
success: true|false - indicating whether the method has completed correctly or with errors
errorMessage: "..." - optional; it is filled in only if success = false and it describes the error raised within the mapClobFieldToTable method
processedRecords - number of processed records
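A minimal sketch of how the returned JSON string could be consumed, assuming only the three attributes listed above and that JSON.parse is available in the scripting environment:
var json = utils.mapClobFieldToTable(settings);
var result = JSON.parse(json);
if (result.success) {
  // all rows have been processed; invalid ones are flagged in the first border table
  var processed = result.processedRecords;
} else {
  // the method itself failed: errorMessage describes the reason
  var reason = result.errorMessage;
}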
3. validating data in terms of foreign keys, by using destination tables to check out values (FKs)
utils.validateCode(settings); // fills in the FK fields in the second border table, by validating CODEs against destination tables; this method must be called once for each CODE field
where settings is a javascript object containing the attributes described at the end of this section (step 3 attributes)
TABLES:
create table ART02_BARCODES(
PROG_ID NUMERIC(12) NOT NULL,
COMPANY_ID CHAR(5) NOT NULL,
BARCODE VARCHAR(255),
PRIMARY KEY(PROG_ID)
) ;
create table ART03_VATS(
PROG_ID NUMERIC(12) NOT NULL,
COMPANY_ID CHAR(5) NOT NULL,
VAT_CODE VARCHAR(20),
VAT NUMERIC(5),
PRIMARY KEY(PROG_ID)
) ;
JAVASCRIPT:
var settings = {
srcDatasourceId: null,
fromTable: "BT02_ITEMS_data",
fromTableFilter: "company_id='00000'",
fromRowId: "ROW_ID",
destDatasourceId: null,
toTable: "ART03_VATS",
toTableFilter: "company_id='00000'",
errorFieldName: "ERRORS",
containsErrorsField: "CONTAINS_ERRORS",
fromFieldCode: "VAT_CODE",
toFieldCode: "VAT_CODE",
fromFK: "fk_art03",
toFK: "prog_id",
commitSize: 10000
};
utils.validateCode(settings); // 2112ms + 20709ms
// Time spent to decode all records for field VAT_CODE in table BT02_ITEMS_data: 21586ms
var settings = {
srcDatasourceId: null,
fromTable: "BT02_ITEMS_data",
fromTableFilter: "company_id='00000'",
fromRowId: "ROW_ID",
destDatasourceId: null,
toTable: "ART02_BARCODES",
toTableFilter: "company_id='00000'",
errorFieldName: "ERRORS",
containsErrorsField: "CONTAINS_ERRORS",
fromFieldCode: "BARCODE",
toFieldCode: "BARCODE",
fromFK: "fk_art02",
toFK: "prog_id",
commitSize: 10000
};
utils.validateCode(settings);
// Time spent to decode all records for field BARCODE in table BT02_ITEMS_data: 31563ms
Important note: in order to make this FK retrieval process fast, it is essential to create an index on the list of fields used in "toTableFilter" + "toFieldCode".
Important note: in case the border table and the destination table to check are located in different schemas and the border table contains more than a thousand distinct codes to validate, you need to grant access to the destination table from the border table schema. A typical SQL instruction to grant access would be:
GRANT SELECT, SHOW VIEW ON destinationtableschema.DEST_TABLE_NAME TO 'bordertableuser';
That grant is required only in case of a large number of distinct codes to validate, because the validateCode method first checks the number of distinct codes: if the number is lower than a thousand, these codes are read from the destination table, cached and used to fill in each border table FK field; if the distinct codes are more than a thousand, a single UPDATE operation on the border table is performed, combined with a SELECT on the destination table: in such a scenario, a grant between the two schemas is needed when the tables belong to different schemas.
4. loading data from the border table to destination tables
utils.copyDataFromBorderTable(settings); // load a destination table from the second border table; this method must be called once for each destination table
where settings is a javascript object containing the attributes described at the end of this section (step 4 attributes)
Example of javascript to use to update data from the second border table to the destination table: PROG_ID must first be retrieved (through validateCode) and then used to update data.
var settings = {
fromTable: "BT02_ITEMS_data",
fromTableFilter: "company_id='00000'",
fromRowId: "ROW_ID",
toTable: "ART01_ITEMS",
toTableFilter: "company_id='00000'",
// errorFieldName: "ERRORS",
// containsErrorsField: "CONTAINS_ERRORS",
fromFieldCode: "ITEM_CODE",
toFieldCode: "ITEM_CODE",
fromFK: "fk_art01",
toFK: "prog_id",
commitSize: 10000
};
utils.validateCode(settings);
// Time spent to decode all records for field ITEM_CODE in table BT02_ITEMS_data: 18810ms
var settings = {
fromTable: "BT02_ITEMS_data",
fromTableFilter: "company_id='00000'",
fromFK: "FK_ART01",
// errorFieldName: "ERRORS",
// containsErrorsField: "CONTAINS_ERRORS",
operation: "UPDATE",
toTable: "ART01_ITEMS",
toTableFilter: "PROG_ID=?",
commitSize: 10000,
defaultFields: {},
defaultInLineFields: {},
mappingFields: {
COMPANY_ID: "COMPANY_ID",
ITEM_CODE: "ITEM_CODE",
DESCRIPTION: "DESCRIPTION",
CREATE_DATE: "CREATE_DATE",
LAST_UPDATE: "LAST_UPDATE",
FK_ART02: "FK_ART02",
FK_ART03: "FK_ART03"
}
};
utils.copyDataFromBorderTable(settings);
// Time spent to copy data from table BT02_ITEMS_data to table ART01_ITEMS: 130631ms
Attributes of the settings object for utils.readCsvAndLoadTable (step 1):
srcDirId
id of the directory containing the source CSV file to read
srcFileName
name of the CSV file to read
charSet
char set used to save the CSV file (if not specified, it is assumed to be "UTF-8")
idAsNum
optional field name (having numeric type) which must be created in the border table x_csv and used to host the numeric PK, generated automatically by Platform when importing the CSV
idAsUuid
optional field name (having string type) which must be created in the border table x_csv and used to host a text UUID PK, generated automatically by Platform when importing the CSV
additionalFields
javascript object containing additional fields included in the border table, which must be filled in and not coming from the CSV file (e.g. COMPANY_ID)
headers
flag true/false indicating whether the CSV file has a first row with the headers (in which case data is read starting from the second row)
commitSize
number of records written between two commits (e.g. 1000); in case of a very large data set, committing after each block of records is essential
clobFieldName
a CLOB type field name included in the border table x_csv which is used to host the whole row
datasourceId
optional datasource id (can be set to null to access the default schema) used to refer to the border table
tableName
border table name
expectedFields
optional number: it is the number of expected fields for each line; if "expectedFields" and "separator" have been filled in and a line contains fewer fields and its last field starts with a double quote ("), then the next line is read as well and concatenated to it (and so on, until the right number of fields per line is reached)
separator
optional string: it is the field separator; if "expectedFields" and "separator" have been filled in and a line contains fewer fields and its last field starts with a double quote ("), then the next line is read as well and concatenated to it (and so on, until the right number of fields per line is reached)
newline
optional flag (true/false; if not specified it is set to true); if set to true, the new lines found in a multi-line text field are kept; if set to false, new lines are ignored and the field text is stored on a single line
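A hypothetical readCsvAndLoadTable settings fragment using expectedFields, separator and newline, for a CSV whose quoted text fields can span multiple lines (values are illustrative):
var settings = {
  srcDirId: 9,
  srcFileName: "art.csv",
  charSet: "UTF-8",
  idAsNum: "ROW_ID",
  additionalFields: { COMPANY_ID: "00000", ERRORS: "", CONTAINS_ERRORS: "N" },
  headers: true, // skip the first row containing the column names
  expectedFields: 10, // each logical CSV line must contain 10 fields
  separator: ";", // used together with expectedFields to rebuild fields split across lines
  newline: true, // keep the new lines found inside multi-line text fields
  commitSize: 10000,
  clobFieldName: "ROW_CONTENT",
  datasourceId: null,
  tableName: "BT01_ITEMS_CSV"
};
utils.readCsvAndLoadTable(settings);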
Attributes of the settings object for utils.mapClobFieldToTable (step 2):
srcDatasourceId
datasource id used to access the first border table (can be null: default schema)
srcTableName
x_csv first border table name (e.g. BT01_ITEMS_CSV)
srcFilter
optional SQL filter (added to the WHERE clause) applied when reading rows from the first border table
destDatasourceId
datasource id used to access the second border table (can be null: default schema)
destTableName
x_data second border table name (e.g. BT02_ITEMS_DATA)
clobFieldName
optional; field name in the first border table, having CLOB type and used to store a whole CSV row
separator
CSV field separator, e.g. ";" or ","
errorFieldName
field name in the FIRST border table where validation errors fired when loading the second border table are saved
commitSize
number of records written between two commits (e.g. 1000); in case of a very large data set, committing after each block of records is essential
containsErrorsField
field name in the FIRST border table used as flag Y/N, set to Y in case of validation errors fired when loading the second border table
asIsFields
javascript object containing the mapping between fields in the first border table and the second one
defaultFields
javascript object containing fields+values to fill in the second border table
mappedFields
javascript Array of javascript objects, each describing the mapping between a column of the CLOB content and a field in the second border table
uniqueKeys
optional: if specified, it is a String representing the list of fields defined in this border table which represent a unique key (e.g. "FIRST_NAME,LAST_NAME" or "EMAIL"). Fields must be separated by a comma (this property is NOT a javascript array)
srcRowId
optional; required if uniqueKeys has been specified: it represents the field name in the first border table, used to identify the record to update in case of duplicated key errors
destFilter
SQL condition to add to the checking query for duplicates
deleteFilter
optional SQL filter, used to DELETE data in the second border table before starting the writing process; if set to null, no DELETE is performed
commitSize
block of records to save in the second border table, after which a commit is performed
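A hypothetical fragment showing how the optional uniqueKeys, srcRowId, destFilter and deleteFilter attributes could be added to the step 2 settings shown above (values are illustrative):
var settings = {
  // ... same attributes as in the step 2 example above ...
  uniqueKeys: "ITEM_CODE", // comma separated list of fields representing the unique key
  srcRowId: "ROW_ID", // row id in the first border table, used to report duplicated key errors
  destFilter: "COMPANY_ID='00000'", // added to the query checking for duplicates
  deleteFilter: "COMPANY_ID='00000'" // empty the second border table for this tenant before writing
};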
Attributes of the settings object for utils.validateCode (step 3):
srcDatasourceId
datasource id used to access the second border table (can be null: default schema)
fromTable
x_data second border table name (e.g. BT02_ITEMS_DATA)
fromTableFilter
SQL filter (to add to WHERE clause) to filter input data (e.g. COMPANY='00000')
fromRowId
field name in the border table containing the numeric ID and used to update a single record in the border table in case of validation errors
fromFieldCode
field name in the border table containing the code to validate against the application table;
in case of an application table having a unique key composed of multiple fields, you can specify here a list of fields, expressed as a String, where fields are separated by a comma
fromFK
field name hosting the PK coming from the application table just validated by code (e.g. FK_ART02)
destDatasourceId
datasource id used to access the application table containing data to check out (can be null: default schema)
toTable
application table name (e.g. ART03_VATS)
toTableFilter
SQL filter (to add to WHERE clause) to filter the application table data (e.g. COMPANY='00000')
toFieldCode
field name in the application table containing the code to check out, in order to validate the one in the border table;
in case of an application table having a unique key composed of multiple fields, you can specify here a list of fields, expressed as a String, where fields are separated by a comma
toFK
field name in the application table representing the PK to copy to the corresponding field in the border table (fromFK)
errorFieldName
field name in the "from" border table where validation errors fired during this FK validation are saved
commitSize
number of records written between two commits (e.g. 1000); in case of a very large data set, committing after each block of records is essential
containsErrorsField
field name in the "from" border table used as a Y/N flag, set to Y in case of validation errors fired during this FK validation
useJoin
optional: if not set, it defaults to false; when set to true, the validation query combines the "from" table and the "to" table with a JOIN, while the default setting uses an inner query. PAY ATTENTION: do not use the JOIN if the two tables are not in the same schema and, in case it is set to true, it is up to you to ALWAYS include the table prefix in the WHERE conditions (fromTableFilter and toTableFilter)
forceOneUpdate
optional: if not set, it defaults to false; in that case, a COUNT(*) query is executed in order to calculate the number of rows to update: if it is less than 1000, multiple updates are performed, otherwise a single bulk update is performed. In case of a very complex table to update, even with fewer than 1000 records it may be better to execute one bulk update only: in such a scenario, you can set this flag to true; when set to true, no COUNT(*) query is performed and the single bulk update is always executed, independently of the number of rows to validate/update.
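A hypothetical validateCode settings sketch using a multi-field code together with useJoin and forceOneUpdate (table and field names are illustrative):
var settings = {
  srcDatasourceId: null,
  fromTable: "BT02_ITEMS_DATA",
  fromTableFilter: "BT02_ITEMS_DATA.COMPANY_ID='00000'", // table prefix is mandatory when useJoin = true
  fromRowId: "ROW_ID",
  destDatasourceId: null,
  toTable: "ART03_VATS",
  toTableFilter: "ART03_VATS.COMPANY_ID='00000'", // table prefix is mandatory when useJoin = true
  errorFieldName: "ERRORS",
  containsErrorsField: "CONTAINS_ERRORS",
  fromFieldCode: "COMPANY_ID,VAT_CODE", // multi-field code: comma separated list of fields
  toFieldCode: "COMPANY_ID,VAT_CODE",
  fromFK: "FK_ART03",
  toFK: "PROG_ID",
  useJoin: true, // both tables are in the same schema, so a JOIN can be used
  forceOneUpdate: true, // skip the COUNT(*) check and always run the single bulk UPDATE
  commitSize: 10000
};
utils.validateCode(settings);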
Attributes of the settings object for utils.copyDataFromBorderTable (step 4):
srcDatasourceId
datasource id used to access the second border table (can be null: default schema)
fromTable
second border table
fromTableFilter
SQL filter applied when reading records from the border table (e.g. COMPANY_ID = '00000')
operation
can be INSERT or UPDATE.
In case of INSERT, it is mandatory to fill in other attributes, like idAsNum/idAsUuid and the progIdXxx attributes described below.
In case of UPDATE, the FK field in the border table representing the PK in the destination table must already have been retrieved. If it has not, use the validateCode method to fetch it.
destDatasourceId
datasource id used to access the destination table (can be null: default schema)
toTable
destination table name
idAsNum
optional field name (having numeric type) which must be created in the destination table and in the border table x_data and used to host the numeric PK, generated automatically by Platform before copying data to the destination table, starting from an application table used for counters.
progIdValueField
field in the application table for counters, used to maintain the current counter value (e.g. CURRENT_VALUE)
progIdIncField
field in the application table for counters, used to maintain the value to use to increment the counter (e.g. INCREMENT_VALUE)
progIdTable
application table used for counters
progIdWhere
SQL filter to apply when updating/selecting the single record in the application table for counters
idAsUuid
optional field name (having string type) which must be created in the border table/destination table and used to host a text UUID PK, generated automatically by Platform when copying data to the destination table
defaultFields
javascript object containing additional fields to fill in the destination table (e.g. CREATE_DATE)
defaultInLineFields
javascript object used ONLY for UPDATEs, containing additional fields to fill in the destination table (e.g. ROW_VERSION+1 for the field ROW_VERSION)
mappingFields
javascript object containing field names in the destination table and their mapping with the input data coming from the border table. Field types must be identical.
Example:
{
destField1: "borderTableFieldA",
...
}
commitSize
number of records written between two commits (e.g. 1000); in case of a very large data set, committing after each block of records is essential
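As a final sketch, a hypothetical copyDataFromBorderTable settings object for an INSERT operation whose PK is generated from a counter table; the counter table ART00_COUNTERS, its fields and its filter are assumptions, not objects defined in the examples above:
var settings = {
  srcDatasourceId: null,
  fromTable: "BT02_ITEMS_DATA",
  fromTableFilter: "COMPANY_ID='00000'",
  operation: "INSERT",
  destDatasourceId: null,
  toTable: "ART01_ITEMS",
  idAsNum: "PROG_ID", // PK generated from the counter; assumed to exist both in ART01_ITEMS and in the border table
  progIdTable: "ART00_COUNTERS", // hypothetical application table hosting the counters
  progIdValueField: "CURRENT_VALUE", // current counter value
  progIdIncField: "INCREMENT_VALUE", // increment to apply to the counter
  progIdWhere: "TABLE_NAME='ART01_ITEMS'", // hypothetical filter identifying the counter row to use
  defaultFields: {},
  defaultInLineFields: {},
  mappingFields: {
    COMPANY_ID: "COMPANY_ID",
    ITEM_CODE: "ITEM_CODE",
    DESCRIPTION: "DESCRIPTION",
    PRICE: "PRICE",
    FK_ART02: "FK_ART02",
    FK_ART03: "FK_ART03"
  },
  commitSize: 10000
};
utils.copyDataFromBorderTable(settings);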