Cloud1 Data Quality Analysis: 4-week Implementation

Cloud1 Oy

To execute a deep data quality analysis of a defined set of data assets, validating them for ingestion, modelling, reporting and automated data quality monitoring

The costs of bad data quality are extensive. Almost one third of data analysts spend more than 40% of their time checking and validating the data used for analysis. Data workers waste almost half of their time finding data, hunting down and fixing issues, and searching for confirming sources for data they do not trust. By one estimate, 20-30 percent of operational expenses result from bad data quality.

Cloud1 Data Quality Analysis is an agile, deep-dive data quality analysis project carried out in close collaboration with the business. We combine both business and IT demands and goals for data quality. During the project we familiarize your organization with data quality and strengthen the knowledge and know-how of the project team. The project's goal is to find issues in data assets before they compromise ML models or data analysis projects, or, in the worst case, surface in production environments.

Project structure

  1. Kick-off and project scope refinement (Selecting a compact but encompassing set of data assets for the analysis)
  2. Identifying and selecting data sets
  3. Defining expected behavior and rules for selected data sets
  4. Technical work - Data set integrity testing
  5. Evaluation of the data sets' suitability based on their integrity status
  6. Technical work - Business rule conversion and data quality rules compilation
  7. Evaluation of defined data behavior and business rules based on data quality analysis and rule results
  8. Creation and presentation of an end report listing the data quality issues found, presenting a baseline of the data assets' quality status, and proposing a set of automated data quality validations to be added
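To illustrate steps 3 and 6 above, a business rule can be converted into an executable data quality rule as a named predicate run against records. This is a hypothetical, minimal sketch for illustration only, not the actual Cloud1 Data Quality Accelerator API:

```python
# Hypothetical sketch of business-rule-to-data-quality-rule conversion.
# Names and structure are illustrative; they are not the Cloud1
# Data Quality Accelerator API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class QualityRule:
    name: str                          # human-readable rule name
    check: Callable[[dict], bool]      # predicate applied to each record

def run_rules(records: list[dict], rules: list[QualityRule]) -> dict[str, int]:
    """Return the number of failing records per rule."""
    failures = {rule.name: 0 for rule in rules}
    for record in records:
        for rule in rules:
            if not rule.check(record):
                failures[rule.name] += 1
    return failures

# Example: the business rule "every order must have a positive amount
# and a known customer id" converted into two executable checks.
rules = [
    QualityRule("amount_positive", lambda r: r.get("amount", 0) > 0),
    QualityRule("customer_id_present", lambda r: bool(r.get("customer_id"))),
]

orders = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "", "amount": -5.0},   # fails both rules
]
print(run_rules(orders, rules))  # {'amount_positive': 1, 'customer_id_present': 1}
```

The rule results give the per-rule failure counts that feed the evaluation in step 7 and the baseline reported in step 8.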

The offering includes a process model, standardized tools and evaluation processes for data analysis, and use of the Cloud1 Data Quality Accelerator to prove out quality check automation. Tools used include:
  • Azure Data Lake, Azure Databricks and Power BI as a platform for the work
  • Cloud1 Data Quality Accelerator (Python library for data quality rule execution automation and monitoring)
  • Standardized Databricks notebooks utilizing the Pandas Profiling Python library for data integrity testing
  • Power BI templates providing a transparent, easy-to-understand view of data issues
  • Standardized model for end reports, either as a separate document or an Azure DevOps Wiki page
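The kind of integrity testing the notebooks perform can be sketched with plain pandas. This is a simplified stand-in (the `integrity_summary` helper and the sample data are invented for illustration), not the standardized notebooks themselves, which produce full profiling reports:

```python
# Minimal sketch of data set integrity testing with pandas --
# a simplified stand-in for the profiling notebooks described above.
import pandas as pd

def integrity_summary(df: pd.DataFrame) -> dict:
    """Collect basic integrity metrics: row count, duplicates, nulls, dtypes."""
    return {
        "row_count": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "null_counts": df.isna().sum().to_dict(),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }

df = pd.DataFrame({
    "customer_id": ["C1", "C2", None, "C2"],
    "amount": [120.0, 55.5, 10.0, 55.5],
})
summary = integrity_summary(df)
print(summary["duplicate_rows"])  # 1 (the repeated C2 row)
print(summary["null_counts"])     # {'customer_id': 1, 'amount': 0}
```

Metrics like these establish each data set's integrity status (step 4) before the suitability evaluation (step 5).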

The project produces, for the selected scope: an evaluation of the data quality level, a proposal of operational actions and tools, and finally a concrete plan to enhance and monitor data quality.