Create datasets for machine learning (ML) model training and data science use cases.

Synthesized Scientific Data Kit (SDK) is a comprehensive framework for generative modelling of structured data (tabular, time-series and event-based data). The SDK helps you create compliant, statistics-preserving data snapshots for BI/Analytics and ML/AI applications. Right-size your data with AI-supported data transformations.

With the SDK, you can:

  • Improve data quality: benefit from an uplift of up to 15% in ML/AI model performance with data rebalancing, data imputation, and high-quality synthetic data generation. The SDK helps increase revenue in use cases such as conversion, fraud, revenue recovery, and more.
  • Enable fast data access and lower data acquisition cost: extract data insights faster for BI/Analytics and train your ML/AI models faster. Increase developer productivity and speed-to-market.
  • Ensure data privacy and data compliance: codify complex data privacy requirements into concrete data transformations. Ensure compliance when using sensitive data in cloud initiatives. Migrate your data pipelines and workflows to the cloud faster.
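
To make the data rebalancing mentioned above more concrete, here is a minimal sketch of one simple rebalancing technique, random oversampling of the minority class. It uses only the Python standard library and does not call the Synthesized SDK; the function name `oversample_minority` and the sample `transactions` records are hypothetical and exist only for illustration. The SDK's own rebalancing may work quite differently (for example, by generating new synthetic rows rather than duplicating existing ones).

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class.

    Hypothetical helper for illustration; not part of the Synthesized SDK.
    """
    if not rows:
        return []
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # Every class is topped up to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Illustrative, made-up data: 9 legitimate transactions and 1 fraudulent one.
transactions = [{"amount": 10 * i, "is_fraud": 0} for i in range(1, 10)]
transactions.append({"amount": 990, "is_fraud": 1})

balanced = oversample_minority(transactions, "is_fraud")
counts = Counter(row["is_fraud"] for row in balanced)
```

After the call, both classes in `balanced` contain the same number of rows. Plain duplication adds no new information, which is one reason synthetic data generation is often preferred for rebalancing.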

Synthesized SDK is available on the marketplace as a Jupyter Notebook Server with the SDK pre-installed, providing an easy platform to start working with generative modelling. The Jupyter Notebook environment lets users create Python notebooks to load and process datasets, train a Synthesizer on the data, generate training/test data, and save the generated data to a desired destination.
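
The notebook workflow described above (load a dataset, train a synthesizer, generate data, save the result) can be sketched as follows. This is an illustrative outline only: `ToySynthesizer`, its `learn` and `synthesize` methods, the helper functions, and the sample CSV are hypothetical stand-ins written with the Python standard library, not the Synthesized SDK's API. In a real notebook, the SDK's own Synthesizer would take the place of the toy class.

```python
import csv
import io
import random


def load_rows(csv_text):
    """Step 1: load a dataset (here from an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


class ToySynthesizer:
    """Hypothetical stand-in for a real synthesizer.

    It samples each column independently from the values seen during
    training, so it keeps per-column frequencies but ignores
    relationships between columns.
    """

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._columns = {}

    def learn(self, rows):
        """Step 2: 'train' by recording the observed values per column."""
        if not rows:
            raise ValueError("cannot learn from an empty dataset")
        self._columns = {name: [row[name] for row in rows] for name in rows[0]}

    def synthesize(self, num_rows):
        """Step 3: generate new rows by sampling each column."""
        return [
            {name: self._rng.choice(values) for name, values in self._columns.items()}
            for _ in range(num_rows)
        ]


def save_rows(rows, handle):
    """Step 4: save generated rows as CSV to any writable text handle."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data for illustration.
SOURCE_CSV = "age,segment\n34,retail\n51,business\n27,retail\n45,business\n"

rows = load_rows(SOURCE_CSV)
synth = ToySynthesizer(seed=42)
synth.learn(rows)
synthetic_rows = synth.synthesize(num_rows=10)

out = io.StringIO()  # swap for open("synthetic.csv", "w", newline="") to write a file
save_rows(synthetic_rows, out)
synthetic_csv = out.getvalue()
```

The four steps map one-to-one onto the workflow in the paragraph above. The toy class is deliberately simplistic; a generative model would also aim to preserve correlations between columns.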

In the Bring Your Own License (BYOL) version, users must already have a license key and supply it either at container creation or inside the notebooks.


Target industries:

  • Financial Services
  • Insurance
  • Healthcare
  • Pharmaceuticals
  • Government and Public Services

Key Benefits:

  • Increase market value of existing data
  • Improve model performance by 4-15%
  • Shorten model time to value from hours/days to minutes
  • Increase developer productivity by 20%+

Key Features:

  • Data rebalancing
  • Data snapshots
  • Synthetic data generation
  • Data anonymisation
  • Declarative Python DSL

Supported data types:

  • Tabular data
  • Time-series data
  • Event-based data


Use case on improving data quality and model performance:

Start a 30-day free trial now.