# Configuration
A number of configurations can be made and customised within Data Caterer to control what gets run and where any metadata gets saved.
These configurations are defined within your Java or Scala class via `configuration`, or, for the YAML file setup, within the `application.conf` file as seen here.
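As a rough orientation, a Scala plan class wires these builders together along the following lines. This is a minimal sketch: the package name, the `csv` connection builder and the `execute` overload are assumptions based on the Data Caterer examples and may differ between versions.

```scala
// Sketch of where `configuration` sits in a Scala plan class.
// Treat the import and the `execute` call as assumptions, not the definitive API.
import io.github.datacatering.datacaterer.api.PlanRun

class MyPlanRun extends PlanRun {
  // Configuration builder, as used throughout this page
  val conf = configuration
    .enableCount(true)
    .generatedReportsFolderPath("/opt/app/report")

  // `csv` is one of the connection builders provided by PlanRun
  val myCsvTask = csv("my_csv", "/opt/app/data/csv")

  execute(conf, myCsvTask)
}
```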
## Flags
Flags are used to control which processes are executed when you run Data Caterer.
| Config | Default | Paid | Description |
|---|---|---|---|
| `enableGenerateData` | true | N | Enable/disable data generation |
| `enableCount` | true | N | Count the number of records generated. Can be disabled to improve performance |
| `enableFailOnError` | true | N | Whilst saving generated data, if there is an error, stop any further data from being generated |
| `enableSaveReports` | true | N | Enable/disable HTML reports summarising the data generated, metadata of the data generated (if `enableSinkMetadata` is enabled) and validation results (if `enableValidation` is enabled). Sample here |
| `enableSinkMetadata` | true | N | Run data profiling for the generated data. Shown in the HTML report if `enableSaveReports` is enabled |
| `enableValidation` | false | N | Run validations as described in the plan. Results can be viewed in the logs or in the HTML report if `enableSaveReports` is enabled. Sample here |
| `enableGeneratePlanAndTasks` | false | Y | Enable/disable plan and task auto-generation based off data source connections |
| `enableRecordTracking` | false | Y | Enable/disable tracking of which data records have been generated for any data source |
| `enableDeleteGeneratedRecords` | false | Y | Delete all generated records, based off record tracking (requires `enableRecordTracking` to have been set to true) |
| `enableGenerateValidations` | false | Y | If enabled, generate validations based on the data sources defined |
Java:

```java
configuration()
    .enableGenerateData(true)
    .enableCount(true)
    .enableFailOnError(true)
    .enableSaveReports(true)
    .enableSinkMetadata(true)
    .enableValidation(false)
    .enableGeneratePlanAndTasks(false)
    .enableRecordTracking(false)
    .enableDeleteGeneratedRecords(false)
    .enableGenerateValidations(false);
```
Scala:

```scala
configuration
  .enableGenerateData(true)
  .enableCount(true)
  .enableFailOnError(true)
  .enableSaveReports(true)
  .enableSinkMetadata(true)
  .enableValidation(false)
  .enableGeneratePlanAndTasks(false)
  .enableRecordTracking(false)
  .enableDeleteGeneratedRecords(false)
  .enableGenerateValidations(false)
```
application.conf:

```hocon
flags {
  enableCount = false
  enableCount = ${?ENABLE_COUNT}
  enableGenerateData = true
  enableGenerateData = ${?ENABLE_GENERATE_DATA}
  enableFailOnError = true
  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}
  enableGeneratePlanAndTasks = false
  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}
  enableRecordTracking = false
  enableRecordTracking = ${?ENABLE_RECORD_TRACKING}
  enableDeleteGeneratedRecords = false
  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}
  enableGenerateValidations = false
  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}
}
```
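One concrete combination of these flags is a cleanup-only run: skip data generation entirely and delete previously generated records instead. A minimal Scala sketch, assuming record tracking was enabled when the data was originally generated:

```scala
// Sketch: no new data is generated; records written by a previous run are
// deleted using the record-tracking files from that run
// (enableRecordTracking must have been true when the data was generated).
configuration
  .enableGenerateData(false)
  .enableRecordTracking(true)
  .enableDeleteGeneratedRecords(true)
```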
## Folders
Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.
These folder pathways can be defined as cloud storage pathways (e.g. `s3a://my-bucket/task`); see the sketch at the end of this section.
| Config | Default | Paid | Description |
|---|---|---|---|
| `planFilePath` | /opt/app/plan/customer-create-plan.yaml | N | Plan file path to use when generating and/or validating data |
| `taskFolderPath` | /opt/app/task | N | Task folder path that contains all the task files (can have nested directories) |
| `validationFolderPath` | /opt/app/validation | N | Validation folder path that contains all the validation files (can have nested directories) |
| `generatedReportsFolderPath` | /opt/app/report | N | Where HTML reports get generated, containing information about the data generated along with any validations performed |
| `generatedPlanAndTaskFolderPath` | /tmp | Y | Folder path where generated plan and task files will be saved |
| `recordTrackingFolderPath` | /opt/app/record-tracking | Y | Where record tracking Parquet files get saved |
Java:

```java
configuration()
    .planFilePath("/opt/app/custom/plan/postgres-plan.yaml")
    .taskFolderPath("/opt/app/custom/task")
    .validationFolderPath("/opt/app/custom/validation")
    .generatedReportsFolderPath("/opt/app/custom/report")
    .generatedPlanAndTaskFolderPath("/opt/app/custom/generated")
    .recordTrackingFolderPath("/opt/app/custom/record-tracking");
```
Scala:

```scala
configuration
  .planFilePath("/opt/app/custom/plan/postgres-plan.yaml")
  .taskFolderPath("/opt/app/custom/task")
  .validationFolderPath("/opt/app/custom/validation")
  .generatedReportsFolderPath("/opt/app/custom/report")
  .generatedPlanAndTaskFolderPath("/opt/app/custom/generated")
  .recordTrackingFolderPath("/opt/app/custom/record-tracking")
```
application.conf:

```hocon
folders {
  planFilePath = "/opt/app/custom/plan/postgres-plan.yaml"
  planFilePath = ${?PLAN_FILE_PATH}
  taskFolderPath = "/opt/app/custom/task"
  taskFolderPath = ${?TASK_FOLDER_PATH}
  validationFolderPath = "/opt/app/custom/validation"
  validationFolderPath = ${?VALIDATION_FOLDER_PATH}
  generatedReportsFolderPath = "/opt/app/custom/report"
  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}
  generatedPlanAndTaskFolderPath = "/opt/app/custom/generated"
  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}
  recordTrackingFolderPath = "/opt/app/custom/record-tracking"
  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}
}
```
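Since these pathways accept cloud storage locations, pointing reports and record tracking at an S3 bucket could look like the sketch below. The bucket name is a placeholder, and the `fs.s3a` keys are standard Hadoop/Spark options (not Data Caterer-specific) passed through the runtime config:

```scala
// Sketch: write reports and record-tracking files to S3 (placeholder bucket).
// The fs.s3a settings are standard Hadoop options forwarded to Spark.
configuration
  .generatedReportsFolderPath("s3a://my-bucket/report")
  .recordTrackingFolderPath("s3a://my-bucket/record-tracking")
  .runtimeConfig(Map(
    "spark.hadoop.fs.s3a.access.key" -> sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""),
    "spark.hadoop.fs.s3a.secret.key" -> sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", "")
  ))
```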
## Metadata
When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated by two processes: 1) when `enableGeneratePlanAndTasks` is enabled, or 2) when `enableSinkMetadata` is enabled.
During the generation of the plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large, as data profiling is an expensive task. Similarly, analysing the generated data can be expensive if the number of records generated is large.
| Config | Default | Paid | Description |
|---|---|---|---|
| `numRecordsFromDataSource` | 10000 | Y | Number of records read in from the data source that could be used for data profiling |
| `numRecordsForAnalysis` | 10000 | Y | Number of records used for data profiling, taken from the records gathered in `numRecordsFromDataSource` |
| `oneOfMinCount` | 1000 | Y | Minimum number of records required before considering whether a field can be of type `oneOf` |
| `oneOfDistinctCountVsCountThreshold` | 0.2 | Y | Threshold ratio to determine if a field is of type `oneOf` (e.g. a field called `status` that only contains `open` or `closed`: distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2, therefore marked as `oneOf`) |
| `numGeneratedSamples` | 10 | N | Number of sample records to take from the generated data. Shown in the HTML report |
Java:

```java
configuration()
    .numRecordsFromDataSourceForDataProfiling(10000)
    .numRecordsForAnalysisForDataProfiling(10000)
    .oneOfMinCount(1000)
    .oneOfDistinctCountVsCountThreshold(0.2)
    .numGeneratedSamples(10);
```
Scala:

```scala
configuration
  .numRecordsFromDataSourceForDataProfiling(10000)
  .numRecordsForAnalysisForDataProfiling(10000)
  .oneOfMinCount(1000)
  .oneOfDistinctCountVsCountThreshold(0.2)
  .numGeneratedSamples(10)
```
application.conf:

```hocon
metadata {
  numRecordsFromDataSource = 10000
  numRecordsForAnalysis = 10000
  oneOfMinCount = 1000
  oneOfDistinctCountVsCountThreshold = 0.2
  numGeneratedSamples = 10
}
```
## Generation
When generating data, you may have some limitations, such as limited CPU or memory, a large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch. For example, generating 1,000,000 records with `numRecordsPerBatch = 100000` produces 10 batches.
| Config | Default | Paid | Description |
|---|---|---|---|
| `numRecordsPerBatch` | 100000 | N | Number of records, across all data sources, to generate per batch |
| `numRecordsPerStep` | | N | Overrides the count defined in each step with this value, if defined (i.e. if set to 1000, then 1000 records will be generated for each step) |
Java:

```java
configuration()
    .numRecordsPerBatch(100000)
    .numRecordsPerStep(1000);
```
Scala:

```scala
configuration
  .numRecordsPerBatch(100000)
  .numRecordsPerStep(1000)
```
application.conf:

```hocon
generation {
  numRecordsPerBatch = 100000
  numRecordsPerStep = 1000
}
```
## Runtime
Given that Data Caterer uses Spark as the underlying framework for data processing, you can configure the Spark job to your specifications via `configuration`, as seen here.
Java:

```java
configuration()
    .master("local[*]")
    .runtimeConfig(Map.of("spark.driver.cores", "5"))
    .addRuntimeConfig("spark.driver.memory", "10g");
```
Scala:

```scala
configuration
  .master("local[*]")
  .runtimeConfig(Map("spark.driver.cores" -> "5"))
  .addRuntimeConfig("spark.driver.memory" -> "10g")
```
application.conf:

```hocon
runtime {
  master = "local[*]"
  master = ${?DATA_CATERER_MASTER}
  config {
    "spark.driver.cores" = "5"
    "spark.driver.memory" = "10g"
  }
}
```
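For larger generation runs, the same mechanism can pass any standard Spark setting, for example to give the driver more memory or to reduce shuffle partitions for small datasets. A sketch with illustrative values; the keys are standard Spark options rather than Data Caterer-specific ones:

```scala
// Sketch: tune the embedded Spark job for a bigger run on a single machine.
configuration
  .master("local[4]")
  .runtimeConfig(Map(
    "spark.driver.memory" -> "4g",          // standard Spark option
    "spark.sql.shuffle.partitions" -> "10"  // fewer partitions suit small data volumes
  ))
```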