Configuration

Data Caterer exposes a number of configuration options that control which processes run and where metadata is saved.

These options are set from within your Java or Scala class via `configuration` or, for a YAML file setup, in the application.conf file as seen here.

Flags

Flags are used to control which processes are executed when you run Data Caterer.

Config	Default	Paid	Description
`enableGenerateData`	true	N	Enable/disable data generation
`enableCount`	true	N	Count the number of records generated. Can be disabled to improve performance
`enableFailOnError`	true	N	Whilst saving generated data, if there is an error, it will stop any further data from being generated
`enableSaveReports`	true	N	Enable/disable HTML reports summarising data generated, metadata of data generated (if `enableSinkMetadata` is enabled) and validation results (if `enableValidation` is enabled). Sample here
`enableSinkMetadata`	true	N	Run data profiling for the generated data. Shown in HTML reports if `enableSaveReports` is enabled
`enableValidation`	false	N	Run validations as described in plan. Results can be viewed from logs or from HTML report if `enableSaveReports` is enabled. Sample here
`enableGeneratePlanAndTasks`	false	Y	Enable/disable plan and task auto generation based off data source connections
`enableRecordTracking`	false	Y	Enable/disable tracking of which data records have been generated for any data source
`enableDeleteGeneratedRecords`	false	Y	Delete all generated records based off record tracking (if `enableRecordTracking` has been set to true)
`enableGenerateValidations`	false	Y	If enabled, it will generate validations based on the data sources defined.

Examples for Java, Scala and application.conf, in that order:

configuration()
  .enableGenerateData(true)
  .enableCount(true)
  .enableFailOnError(true)
  .enableSaveReports(true)
  .enableSinkMetadata(true)
  .enableValidation(false)
  .enableGeneratePlanAndTasks(false)
  .enableRecordTracking(false)
  .enableDeleteGeneratedRecords(false)
  .enableGenerateValidations(false);

configuration
  .enableGenerateData(true)
  .enableCount(true)
  .enableFailOnError(true)
  .enableSaveReports(true)
  .enableSinkMetadata(true)
  .enableValidation(false)
  .enableGeneratePlanAndTasks(false)
  .enableRecordTracking(false)
  .enableDeleteGeneratedRecords(false)
  .enableGenerateValidations(false)

flags {
  enableCount = true
  enableCount = ${?ENABLE_COUNT}
  enableGenerateData = true
  enableGenerateData = ${?ENABLE_GENERATE_DATA}
  enableFailOnError = true
  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}
  enableGeneratePlanAndTasks = false
  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}
  enableRecordTracking = false
  enableRecordTracking = ${?ENABLE_RECORD_TRACKING}
  enableDeleteGeneratedRecords = false
  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}
  enableGenerateValidations = false
  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}
}
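
In the application.conf block above, each flag is written twice: once with a literal value and once as `${?ENV_VAR}`. In HOCON, the `?` makes the substitution optional, so the second line only takes effect when the environment variable is set. The sketch below mimics that fallback in plain Java. The class and method names are hypothetical and are not part of Data Caterer or of any configuration library.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, for illustration only. It mimics how a pair of lines such as
//   enableGenerateData = true
//   enableGenerateData = ${?ENABLE_GENERATE_DATA}
// resolves: the literal is kept unless the environment variable is set.
class OptionalOverride {

    // Returns the environment value when the variable is set, otherwise the default.
    static String resolve(String defaultValue, String envName, Map<String, String> env) {
        return Optional.ofNullable(env.get(envName)).orElse(defaultValue);
    }

    public static void main(String[] args) {
        // Variable not set: the value written in the file is kept.
        System.out.println(resolve("true", "ENABLE_GENERATE_DATA", Map.of()));
        // Variable set: the later, optional substitution wins.
        System.out.println(resolve("true", "ENABLE_GENERATE_DATA",
                Map.of("ENABLE_GENERATE_DATA", "false")));
    }
}
```

In practice this means a flag can be toggled per run by exporting the matching environment variable, without editing the file.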

Folders

Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.

These folder paths can also point to cloud storage (e.g. s3a://my-bucket/task).

Config	Default	Paid	Description
`planFilePath`	/opt/app/plan/customer-create-plan.yaml	N	Plan file path to use when generating and/or validating data
`taskFolderPath`	/opt/app/task	N	Task folder path that contains all the task files (can have nested directories)
`validationFolderPath`	/opt/app/validation	N	Validation folder path that contains all the validation files (can have nested directories)
`generatedReportsFolderPath`	/opt/app/report	N	Where HTML reports get generated that contain information about data generated along with any validations performed
`generatedPlanAndTaskFolderPath`	/tmp	Y	Folder path where generated plan and task files will be saved
`recordTrackingFolderPath`	/opt/app/record-tracking	Y	Where record tracking parquet files get saved

Examples for Java, Scala and application.conf, in that order:

configuration()
  .planFilePath("/opt/app/custom/plan/postgres-plan.yaml")
  .taskFolderPath("/opt/app/custom/task")
  .validationFolderPath("/opt/app/custom/validation")
  .generatedReportsFolderPath("/opt/app/custom/report")
  .generatedPlanAndTaskFolderPath("/opt/app/custom/generated")
  .recordTrackingFolderPath("/opt/app/custom/record-tracking");

configuration
  .planFilePath("/opt/app/custom/plan/postgres-plan.yaml")
  .taskFolderPath("/opt/app/custom/task")
  .validationFolderPath("/opt/app/custom/validation")
  .generatedReportsFolderPath("/opt/app/custom/report")
  .generatedPlanAndTaskFolderPath("/opt/app/custom/generated")
  .recordTrackingFolderPath("/opt/app/custom/record-tracking")

folders {
  planFilePath = "/opt/app/custom/plan/postgres-plan.yaml"
  planFilePath = ${?PLAN_FILE_PATH}
  taskFolderPath = "/opt/app/custom/task"
  taskFolderPath = ${?TASK_FOLDER_PATH}
  validationFolderPath = "/opt/app/custom/validation"
  validationFolderPath = ${?VALIDATION_FOLDER_PATH}
  generatedReportsFolderPath = "/opt/app/custom/report"
  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}
  generatedPlanAndTaskFolderPath = "/opt/app/custom/generated"
  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}
  recordTrackingFolderPath = "/opt/app/custom/record-tracking"
  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}
}

Metadata

When metadata gets generated, some configurations can be altered to help with performance or accuracy. Metadata is generated by two processes: 1) plan and task generation, if `enableGeneratePlanAndTasks` is enabled, and 2) profiling of the generated data, if `enableSinkMetadata` is enabled.

During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. Data profiling is an expensive task, so you may face performance issues if the number of records in the data source is large. Similarly, analysing the generated data can be expensive if the number of records generated is large.

Config	Default	Paid	Description
`numRecordsFromDataSource`	10000	Y	Number of records read in from the data source that could be used for data profiling
`numRecordsForAnalysis`	10000	Y	Number of records used for data profiling from the records gathered in `numRecordsFromDataSource`
`oneOfMinCount`	1000	Y	Minimum number of records required before considering if a field can be of type `oneOf`
`oneOfDistinctCountVsCountThreshold`	0.2	Y	Threshold ratio to determine if a field is of type `oneOf` (e.g. a field called `status` that only contains `open` or `closed`. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as `oneOf`)
`numGeneratedSamples`	10	N	Number of sample records from generated data to take. Shown in HTML report

Examples for Java, Scala and application.conf, in that order:

configuration()
  .numRecordsFromDataSourceForDataProfiling(10000)
  .numRecordsForAnalysisForDataProfiling(10000)
  .oneOfMinCount(1000)
  .oneOfDistinctCountVsCountThreshold(0.2)
  .numGeneratedSamples(10);

configuration
  .numRecordsFromDataSourceForDataProfiling(10000)
  .numRecordsForAnalysisForDataProfiling(10000)
  .oneOfMinCount(1000)
  .oneOfDistinctCountVsCountThreshold(0.2)
  .numGeneratedSamples(10)

metadata {
  numRecordsFromDataSource = 10000
  numRecordsForAnalysis = 10000
  oneOfMinCount = 1000
  oneOfDistinctCountVsCountThreshold = 0.2
  numGeneratedSamples = 10
}
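
To make the interaction between `oneOfMinCount` and `oneOfDistinctCountVsCountThreshold` concrete, here is a minimal sketch of the check the table describes. It is not Data Caterer's implementation: the class and method names are hypothetical, and it assumes that the minimum count is checked first and that a ratio equal to the threshold still qualifies (as the `status` example in the table suggests).

```java
// Hypothetical sketch of the oneOf decision, for illustration only.
class OneOfCheck {

    // Assumption: a field qualifies when enough records were profiled and the
    // ratio of distinct values to total records is at or below the threshold.
    static boolean looksLikeOneOf(long distinctCount, long totalCount,
                                  long oneOfMinCount, double threshold) {
        if (totalCount <= 0 || totalCount < oneOfMinCount) {
            return false; // too few records to decide
        }
        double ratio = (double) distinctCount / totalCount;
        return ratio <= threshold;
    }

    public static void main(String[] args) {
        // status field with 2 distinct values across 1000 records: ratio 0.002
        System.out.println(looksLikeOneOf(2, 1000, 1000, 0.2));   // true
        // near-unique field: ratio 0.9 exceeds the threshold
        System.out.println(looksLikeOneOf(900, 1000, 1000, 0.2)); // false
        // only 10 records: below oneOfMinCount, so not considered
        System.out.println(looksLikeOneOf(2, 10, 1000, 0.2));     // false
    }
}
```

Under these assumptions, raising `oneOfMinCount` makes the profiler more conservative, while raising the threshold lets fields with more distinct values be treated as `oneOf`.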

Generation

When generating data, you may have some limitations such as limited CPU or memory, large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.

Config	Default	Paid	Description
`numRecordsPerBatch`	100000	N	Number of records across all data sources to generate per batch
`numRecordsPerStep`		N	Overrides the count defined in each step with this value if defined (e.g. if set to 1000, for each step, 1000 records will be generated)

Examples for Java, Scala and application.conf, in that order:

configuration()
  .numRecordsPerBatch(100000)
  .numRecordsPerStep(1000);

configuration
  .numRecordsPerBatch(100000)
  .numRecordsPerStep(1000)

generation {
  numRecordsPerBatch = 100000
  numRecordsPerStep = 1000
}
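
As a rough illustration of how these two settings interact, the sketch below estimates how many batches a run would need. It assumes the total record count is simply the sum of the step counts (or the number of steps times `numRecordsPerStep` when that override is set) and that batches are filled up to `numRecordsPerBatch`. The actual scheduling inside Data Caterer may differ, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.OptionalLong;

// Hypothetical sketch, for illustration only.
class BatchPlan {

    // Total records to generate: the per-step override applies to every step if present.
    static long totalRecords(long[] stepCounts, OptionalLong numRecordsPerStep) {
        return numRecordsPerStep.isPresent()
                ? stepCounts.length * numRecordsPerStep.getAsLong()
                : Arrays.stream(stepCounts).sum();
    }

    // Number of batches needed, rounding up so a partial final batch is counted.
    static long numBatches(long totalRecords, long numRecordsPerBatch) {
        if (numRecordsPerBatch <= 0) {
            throw new IllegalArgumentException("numRecordsPerBatch must be positive");
        }
        return (totalRecords + numRecordsPerBatch - 1) / numRecordsPerBatch;
    }

    public static void main(String[] args) {
        long[] stepCounts = {50_000, 200_000, 10_000};

        // No override: 260000 records at 100000 per batch
        long total = totalRecords(stepCounts, OptionalLong.empty());
        System.out.println(numBatches(total, 100_000)); // 3

        // Override of 1000 per step: 3000 records fit in a single batch
        long overridden = totalRecords(stepCounts, OptionalLong.of(1000));
        System.out.println(numBatches(overridden, 100_000)); // 1
    }
}
```

Lowering `numRecordsPerBatch` increases the number of batches and reduces how much data is held and written at once, which is the lever described above for constrained CPU, memory or fragile data sources.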

Runtime

Because Data Caterer uses Spark as the base framework for data processing, you can tune the Spark job to your requirements via configuration as seen here.

Examples for Java, Scala and application.conf, in that order:

configuration()
  .master("local[*]")
  .runtimeConfig(Map.of("spark.driver.cores", "5"))
  .addRuntimeConfig("spark.driver.memory", "10g");

configuration
  .master("local[*]")
  .runtimeConfig(Map("spark.driver.cores" -> "5"))
  .addRuntimeConfig("spark.driver.memory" -> "10g")

runtime {
  master = "local[*]"
  master = ${?DATA_CATERER_MASTER}
  config {
    "spark.driver.cores" = "5"
    "spark.driver.memory" = "10g"
  }
}