Comparison to similar tools
I have tried to include all the companies found in the list from the Mostly AI blog post, using only information that is publicly available.
The companies/products not shown below have at least one of the following:
- a website with insufficient information about the technology side of data generation/validation
- little or no documentation
- no free, no sign-up version of their app to use
Data Generation
Tool | Description | Cost | Pros | Cons |
---|---|---|---|---|
Clearbox AI | Python-based data generation tool via ML | Unclear | | Batch data only; No data clean up; Limited/no documentation |
Curiosity Software | Platform solution for test data management | Unclear | | No quick start; No SDK; Many components that may not be required; No event generation support |
DataCebo Synthetic Data Vault | Python-based data generation tool via ML | Unclear | | No data connection support; No data clean up; No foreign key support |
Datafaker | Realistic data generation library | Free | | No data connection support; No data clean up; No validation; No foreign key support |
DBLDatagen | Python-based data generation tool | Free | | Limited support if issues arise; Code required; No data clean up; No data validation |
Gatling | HTTP API load testing tool | Free (open source); Gatling Enterprise is usage based, starting from €89 per month (1 user, 6.25 hours of testing) | | Only supports HTTP, JMS and JDBC; No data clean up; Data feeders not based off metadata |
Gretel | Python-based data generation tool via ML | Usage based, starts from $295 per month, $2.20 per credit, assumed USD | | Batch data only; No relationships between data sources; Only simple foreign key relations defined; No data clean up; Charge by usage |
Howso | Python-based data generation tool via ML | Unclear | | No support for data sources; No data validation; No data clean up |
Mostly AI | Python-based data generation tool via ML | Usage based; Enterprise (1 user, 100 columns, 100K rows) $3,100 per month, assumed USD | | Charge by usage; Batch data only; No data clean up; Confusing use of 'smart select' for multiple foreign keys; Limited custom column generation logic; Multiple deployment components; No SDK |
Octopize | Python-based data generation tool via ML | Unclear | | Input data source is only CSV; Multiple manual steps before starting; Quickstart is not a quickstart; Documentation lacks code examples |
Synthesized | Python-based data generation tool via ML | Unclear | | Unclear what the SDK and TDK are; Charge by usage; No report of what was generated; No relationships between data sources |
Tonic | Platform solution for generating data | Unclear | | Batch data only; Multiple deployment components; No relationships between data sources; No data validation; No data clean up; No SDK (only API); Difficult to embed complex business logic |
YData | Python-based data generation tool via ML; platform solution as well | Unclear | | No data connection support; Batch data only; No data clean up; Separate data generation and data validation; No foreign key support |
Use of ML models
You may notice that the majority of data generators use machine learning (ML) models that learn from your existing datasets to generate new data. Below are some pros and cons of this approach.
Pros
- Simple setup
- Ability to reproduce complex logic
- Flexible to accept all types of data
Cons
- Model training can take a long time
- Generation logic is a black box
- ML models need to be maintained, stored and updated
- Restriction on input data lengths
- May not maintain referential integrity
- Requires a deeper understanding of ML models for fine-tuning
- Accuracy may be worse than non-ML models
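The referential integrity concern listed above can be checked mechanically on any generated dataset. Below is a minimal sketch in plain Python, not taken from any of the tools listed. The record layout (`account_id`, `txn_id`) and the function names are hypothetical and chosen only for illustration. It generates child records by drawing keys from the parent records, then reports any child keys that have no matching parent.

```python
import random


def generate_accounts(n, seed=0):
    """Generate n parent records, each with a unique account_id (hypothetical schema)."""
    rng = random.Random(seed)
    return [
        {"account_id": f"ACC{i:04d}", "balance": round(rng.uniform(0, 1000), 2)}
        for i in range(n)
    ]


def generate_transactions(accounts, n, seed=0):
    """Generate n child records whose account_id is drawn from the parent records."""
    rng = random.Random(seed)
    ids = [a["account_id"] for a in accounts]
    return [
        {"txn_id": i, "account_id": rng.choice(ids), "amount": round(rng.uniform(1, 100), 2)}
        for i in range(n)
    ]


def orphaned_keys(parents, children, key):
    """Return the sorted child key values that have no matching parent record."""
    parent_keys = {p[key] for p in parents}
    return sorted({c[key] for c in children} - parent_keys)


accounts = generate_accounts(5)
transactions = generate_transactions(accounts, 20)
# Keys were drawn from the parents, so nothing is orphaned
print(orphaned_keys(accounts, transactions, "account_id"))  # []

# A child record produced independently of the parents breaks integrity
transactions.append({"txn_id": 20, "account_id": "ACC9999", "amount": 10.0})
print(orphaned_keys(accounts, transactions, "account_id"))  # ['ACC9999']
```

A check like `orphaned_keys` can be run against the output of any generator, ML-based or not, to confirm that foreign keys in one dataset still point at existing records in another.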