Test data
data_check fake fake_config.yml
is used to generate test data for a table from a configuration file.
The data type for each column is inferred from the table and can be overridden in the configuration.
See Usage for command line options.
Test data generation is done using Faker.
Example
The minimal configuration only names the table for test data generation:
table: main.simple_table
When running data_check fake fake_config.yml
the column definition from the table is read from the database and a CSV file main.simple_table.csv is generated with some data for the table (100 rows by default). The CSV file can be used to load the data back into the table.
A more complete configuration looks like this and is described in the following:
table: main.simple_table
business_key: # the key that should not change between iterations; must not be null
  - bkey
rows: 200 # how many rows to generate
iterations: # generate data with same business_key with some variation
  count: 5 # how many iterations to generate
columns:
  bkey:
    faker: iban
  date_col:
    add_values: # also use these values
      - 1900-01-01
      - 9999-12-31
  col2:
    faker: name # faker provider method, if not correctly inferred
  col3:
    from_query: select colx from main.other_table # use values from the query
    next: inc # "algorithm" for next iteration
  col4:
    next: random
    values: # use these values randomly
      - 1
      - 2
      - null
Test data configuration
The configuration is a YAML file for a single table. The top level elements are:
table
table tells data_check for which table to generate the data.
Example:
table: main.simple_table
business_key
business_key is a list of columns that are unique and do not change between iterations.
Example:
business_key:
- column_1
- column_2
iterations
iterations configures how many iterations will be generated. Each iteration is a single CSV file with the same business keys. Each column can define a next algorithm that controls how its data changes between iterations.
Example:
iterations:
  count: 5
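The relationship between iterations and business keys can be illustrated with a short Python sketch. This is not data_check's implementation: `build_iterations` and `next_funcs` are hypothetical names, and the sketch assumes that `count` includes the first generated data set.

```python
def build_iterations(first_rows, business_key, count, next_funcs):
    """Return a list of `count` row sets, one per iteration.

    Business key columns are copied unchanged. Other columns change only
    if they have an entry in next_funcs, a hypothetical map from column
    name to a function that computes the next value.
    """
    iterations = [first_rows]
    for _ in range(count - 1):
        previous = iterations[-1]
        current = []
        for row in previous:
            new_row = dict(row)
            for column, step in next_funcs.items():
                if column not in business_key:
                    new_row[column] = step(row[column])
            current.append(new_row)
        iterations.append(current)
    return iterations


rows = [
    {"bkey": "A1", "col3": 10, "col5": "x"},
    {"bkey": "B2", "col3": 20, "col5": "y"},
]
result = build_iterations(rows, ["bkey"], count=5, next_funcs={"col3": lambda v: v + 1})
```

Each element of `result` stands for the rows of one CSV file. Only `col3` changes, because it is the only column with a next function.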
columns
columns defines a configuration for each column. If a column is not listed here, a default configuration will be used based on the data type.
Example:
columns:
  column_1:
    faker: iban
  column_2:
    from_query: select colx from main.other_table
column configuration
Each column can have multiple configurations:
faker
faker defines the provider used to generate the data. If not given, a default provider is used based on the data type of the column.
faker: name
| data type | default provider |
|---|---|
| decimal | pydecimal |
| varchar | pystr |
| date | date_between |
| datetime | date_time_between |
faker_args
faker_args defines a map of arguments that are passed to the provider. Each provider can define different arguments.
Example:
faker: date_between
faker_args:
  start_date: 1900-01-01
  end_date: 2030-12-31
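The following Python sketch shows one way the provider lookup and faker_args could fit together: the provider name comes from faker or, if absent, from the default provider table, and faker_args is passed as keyword arguments. `StubProviders` is a hypothetical stand-in for a Faker instance so that the example runs without Faker installed, and `generate_value` is an illustrative name, not part of data_check.

```python
import datetime
import random

# Default providers per data type, copied from the default provider table.
DEFAULT_PROVIDERS = {
    "decimal": "pydecimal",
    "varchar": "pystr",
    "date": "date_between",
    "datetime": "date_time_between",
}


class StubProviders:
    """Hypothetical stand-in for a Faker instance; only date_between is stubbed."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def date_between(self, start_date, end_date):
        # Pick a random day in the inclusive range [start_date, end_date].
        span = (end_date - start_date).days
        return start_date + datetime.timedelta(days=self._rng.randint(0, span))


def generate_value(providers, data_type, column_config):
    # Use the configured provider, else the default for the data type.
    provider_name = column_config.get("faker") or DEFAULT_PROVIDERS[data_type]
    provider = getattr(providers, provider_name)
    # faker_args is passed to the provider as keyword arguments.
    return provider(**column_config.get("faker_args", {}))


config = {
    "faker_args": {
        "start_date": datetime.date(1900, 1, 1),
        "end_date": datetime.date(2030, 12, 31),
    }
}
value = generate_value(StubProviders(), "date", config)
```

Here no faker entry is configured, so the `date` column falls back to `date_between` and receives the two dates as arguments.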
from_query
from_query defines a SQL query. The values of the query are used randomly to generate the data. If from_query is given, faker is ignored.
from_query: select column from other_table
values
values defines a list of values that are used randomly to generate the data. If values is given, from_query and faker are ignored.
values:
- 1
- 2
- null
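The precedence between values, from_query and faker described above can be summarized in a few lines of Python. This is only a sketch of the documented rule; `pick_source` is a hypothetical helper name, and the column configuration is assumed to be a plain dict parsed from the YAML file.

```python
def pick_source(column_config):
    """Return which setting supplies the base values for a column."""
    if "values" in column_config:
        return "values"  # values wins over from_query and faker
    if "from_query" in column_config:
        return "from_query"  # from_query wins over faker
    return "faker"  # configured provider or the data type default


both = {"values": [1, 2, None], "from_query": "select colx from main.other_table", "faker": "name"}
source = pick_source(both)
```

With all three settings present, `source` is `"values"`; removing values would select from_query, and an empty configuration falls back to faker.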
add_values
add_values adds specific values that are used in addition to the generated data. add_values can be combined with faker, from_query and values. Each value in add_values has the same probability of occurring as all the other generated values combined.
add_values:
- 9999-12-31
- 1900-01-01
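One possible reading of the probability rule is that the generator draws uniformly from one slot for "any generated value" plus one slot per add_values entry. The Python sketch below simulates that reading; it is an assumption about the behavior, not data_check's code, and `pick_with_add_values` is a hypothetical name.

```python
import random


def pick_with_add_values(generate, add_values, rng):
    # One slot for "any generated value" plus one slot per add_values entry.
    slot = rng.randrange(len(add_values) + 1)
    if slot == 0:
        return generate()
    return add_values[slot - 1]


rng = random.Random(42)
add_values = ["9999-12-31", "1900-01-01"]
sample = [
    pick_with_add_values(lambda: "generated", add_values, rng)
    for _ in range(30000)
]
# Share of each outcome; under this reading all three are close to one third.
shares = {v: sample.count(v) / len(sample) for v in ["generated", *add_values]}
```

With two add_values entries, each entry and the generated values as a group would each account for roughly a third of the rows under this interpretation.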
next
next defines an algorithm for the iterator. If next is not given then the column is not changed between iterations.
next: inc
Possible values for next:
- inc: increment by 1 (day for date)
- dec: decrement by 1 (day for date)
- random: generate a random value from the configured faker/values
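The three algorithms can be sketched in Python as follows. This is a minimal illustration of the listed behavior, not the actual implementation; `next_value` and its `generate` parameter are hypothetical names.

```python
import datetime


def next_value(value, algorithm=None, generate=None):
    """Compute a column's value for the next iteration."""
    # Dates move by one day, other values by 1.
    step = datetime.timedelta(days=1) if isinstance(value, datetime.date) else 1
    if algorithm == "inc":
        return value + step
    if algorithm == "dec":
        return value - step
    if algorithm == "random":
        # Draw a fresh value from the configured faker/values source.
        return generate()
    return value  # no next configured: the column stays unchanged


incremented = next_value(5, "inc")
next_day = next_value(datetime.date(2024, 2, 28), "inc")
previous_day = next_value(datetime.date(2024, 1, 1), "dec")
unchanged = next_value("abc")
```

For `random`, the caller passes a function that produces a new value, for example a draw from the configured values list.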