Guides
Below is a list of guides you can follow to create data generation for your use case.
For any of the paid tier guides, you can use the trial version of the app to try it out. Details on how to get the trial can be found here.
Scenarios
- First Data Generation - If you are new, this is the place to start
- Multiple Records Per Column Value - Generate multiple records per set of column values
- Foreign Keys Across Data Sources - Generate matching values across generated data sets
- Data Validations - Run data validations after generating data
- Auto Generate From Data Connection - Automatically generate data just by defining data sources
- Delete Generated Data - Delete the generated data whilst leaving other data
- Generate Batch and Event Data - Generate matching batch and event data
Data Sources
- Files (CSV, JSON, ORC, Parquet) - Generate data for popular file formats
- Postgres - JDBC Postgres tables
- Cassandra - Cassandra tables
- Kafka - Kafka topics
- Solace - Solace messages
- Marquez - Generate data based on metadata in Marquez
- OpenMetadata - Generate data based on metadata in OpenMetadata
- HTTP - HTTP requests
- Files (Fixed width) - (Soon to document) A variant of CSV but with no separator
- MySQL - (Soon to document) JDBC MySQL tables
YAML Files
Base Concept
The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (i.e. tables in a database, topics in Kafka). Tasks can define one or more steps.
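As a minimal sketch of how a plan ties tasks to data sources (the task names, data source names, and exact key names below are illustrative assumptions, not the definitive schema; see the sample plans for the real structure):

```yaml
# Hypothetical plan: runs two tasks against two data sources.
# Key names here are illustrative only.
name: "customer_create_plan"
description: "Create customer data in Postgres and Cassandra"
tasks:
  - name: "postgres_customer_accounts"   # matches a task definition by name
    dataSourceName: "postgres"
  - name: "cassandra_customer_status"
    dataSourceName: "cassandra"
```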
Plan
Foreign Keys
Define foreign keys across data sources in your plan to ensure generated data can match.

- Link to associated task 1
- Link to associated task 2
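As an illustrative sketch (the option names and the "dataSource.step.column" reference style are assumptions, not confirmed syntax), a foreign key in the plan might link a column in one data source to columns in others so their generated values match:

```yaml
# Hypothetical foreign key definition inside a plan.
# References are illustrative; check the sample plans for exact syntax.
sinkOptions:
  foreignKeys:
    "postgres.accounts.account_id":             # source column
      - "cassandra.account_status.account_id"   # generated values will match the source
```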
Task
| Data Source Type | Data Source | Sample Task | Notes |
|---|---|---|---|
| Database | Postgres | Sample | |
| Database | MySQL | Sample | |
| Database | Cassandra | Sample | |
| File | CSV | Sample | |
| File | JSON | Sample | Contains nested schemas and use of SQL for generated values |
| File | Parquet | Sample | Partition by year column |
| Kafka | Kafka | Sample | Specific base schema to be used, define headers, key, value, etc. |
| JMS | Solace | Sample | JSON formatted message |
| HTTP | PUT | Sample | JSON formatted PUT body |
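To give a feel for the shape of a task (again a hedged sketch; the exact keys come from the sample tasks linked above), a CSV task with a single step might look like:

```yaml
# Hypothetical CSV task with one step; key names are illustrative.
name: "csv_accounts_task"
steps:
  - name: "accounts"          # one step per sub data source (file, table, topic)
    type: "csv"
    options:
      path: "/tmp/generated/accounts.csv"
    schema:
      fields:
        - name: "account_id"
          type: "string"
        - name: "created_date"
          type: "date"
```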
Configuration
Docker-compose
To see how it runs against different data sources, you can run it using docker-compose and set DATA_SOURCE as shown below:
```shell
./gradlew build
cd docker
DATA_SOURCE=postgres docker-compose up -d datacaterer
```
You can set DATA_SOURCE to one of the following:
- postgres
- mysql
- cassandra
- solace
- kafka
- http
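For example, to run against Kafka instead of Postgres: `DATA_SOURCE=kafka docker-compose up -d datacaterer`.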