# Welcome to data_check
data_check is a simple data validation tool. In its most basic form, it executes SQL queries and compares the results against CSV or Excel files. But it offers more advanced features:

## Features
- CSV checks: compare SQL queries against CSV files
- Excel support: use Excel (xlsx) files instead of CSV
- multiple environments (databases) in the configuration file
- populate tables from CSV or Excel files
- execute any SQL files on a database
- more complex pipelines
- run any script/command (via pipelines)
- simplified checks for empty datasets and full table comparison
- lookups to reuse the same data in multiple queries
- test data generation
## Database support
data_check is tested with these databases:
- PostgreSQL
- MySQL
- SQLite
- Oracle
- Microsoft SQL Server
Partially supported:
- DuckDB
- Databricks
Other databases supported by SQLAlchemy might also work.
## Why?
data_check tries to solve a simple problem in data domains: during development you write multiple SQL queries to validate your data and compare the results manually. With data_check you can organize these data tests and run them automatically, for example in a CI pipeline.
## How to get started
First install data_check. You can then try data_check with the sample project.
To create a new project, create a folder containing a data_check.yml for the configuration and put your test files in the checks folder.
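For example, assuming data_check is published on PyPI under the name data-check (adjust to your environment):

```bash
# install data_check
pip install data-check

# create a new project folder
mkdir my_data_tests && cd my_data_tests

# add data_check.yml and a checks/ folder (see below), then run all checks
data_check
```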
## Project layout
data_check has a simple layout for projects: a single configuration file and a folder with the test files. You can also organize the test files in subfolders.
```
data_check.yml      # The configuration file
checks/             # Default folder for data tests
    some_test.sql       # SQL file with the query to run against the database
    some_test.csv       # CSV file with the expected result
    other_test.sql      # SQL file with another test
    other_test.xlsx     # Expected result for other_test.sql in an Excel file
    empty_result.sql    # SQL file with a result set that is expected to be empty
    empty_result.empty  # empty file marking empty_result.sql as an empty-result check
    subfolder/          # Tests can be nested in subfolders
```
## Configuration
data_check uses the data_check.yml file in the current folder for configuration. This is a simple YAML file:
```yaml
default_connection: con

connections:
  con: connection-string
```
Under connections you can define multiple named connection strings. default_connection is the name of the connection that data_check uses when no additional arguments are given. You can also use environment variables in a connection string to keep the credentials outside data_check.yml (e.g. postgresql://postgres:${POSTGRES_PWD}@db:5432).
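For example, a configuration with two environments might look like this (the connection names and SQLAlchemy URLs are illustrative):

```yaml
default_connection: dev

connections:
  dev: sqlite+pysqlite:///local_test.db
  prod: postgresql://postgres:${POSTGRES_PWD}@db:5432/postgres
```

Running data_check without additional arguments uses dev; another connection can be selected with an additional argument on the command line.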
## Test files
The test files live in the checks folder and its subfolders. data_check runs the query from each SQL file in these folders and compares the result with the corresponding CSV file. The CSV file must be named exactly like the SQL file, differing only in the file extension.
Instead of writing the CSV files manually, you can also generate them from the SQL files.
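As an illustrative sketch (the table and column names are made up), checks/some_test.sql might contain:

```sql
-- the query whose result should match the expected CSV
SELECT id, name
FROM customers
WHERE active = 1
```

and the expected result would live next to it in checks/some_test.csv:

```csv
id,name
1,Alice
2,Bob
```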
## How it works
data_check uses pandas and SQLAlchemy internally. Each SQL file is executed and its result loaded into a DataFrame; the corresponding CSV file is parsed into a DataFrame as well. Both DataFrames are then merged with a full outer join. If any rows differ, the check fails.
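Conceptually, the comparison resembles this minimal pandas sketch (not data_check's actual code; the real implementation also handles type coercion, column ordering and detailed reporting):

```python
import pandas as pd
from sqlalchemy import create_engine

# illustrative connection string and file names
engine = create_engine("sqlite+pysqlite:///local_test.db")

# run the query from the SQL file and load the result into a DataFrame
sql_df = pd.read_sql("SELECT id, name FROM customers WHERE active = 1", engine)

# parse the expected result from the CSV file
csv_df = pd.read_csv("checks/some_test.csv")

# full outer join on all shared columns; the indicator column records
# whether a row came from the query, the CSV, or both
merged = sql_df.merge(csv_df, how="outer", indicator=True)

# rows that exist on only one side are differences: the check fails
diff = merged[merged["_merge"] != "both"]
print("check passed" if diff.empty else diff)
```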