Test data

data_check fake fake_config.yml generates test data for a table from a configuration file. The data type of each column is inferred from the table and can be overridden in the configuration.

See Usage for command line options.

Test data generation is done using Faker.

Example

The minimal configuration only names the table for test data generation:

table: main.simple_table

When running data_check fake fake_config.yml, the column definitions of the table are read from the database and a CSV file main.simple_table.csv is generated with data for the table (100 rows by default). The CSV file can be used to load the data back into the table.

A more complete configuration looks like this; its elements are described in the sections below:

table: main.simple_table
business_key: # the key that should not change between iterations; must not be null
  - bkey
rows: 200  # how many rows to generate

iterations:  # generate data with same business_key with some variation
  count: 5  # how many iterations to generate

columns:
  bkey:
    faker: iban

  date_col:
    add_values:  # also use these values
      - 1900-01-01
      - 9999-12-31
  col2:
    faker: name # faker provider method, if not correctly inferred
  col3:
    from_query: select colx from main.other_table  # use values from the query
    next: inc  # "algorithm" for next iteration
  col4:
    next: random
    values:  # use these values randomly
      - 1
      - 2
      - null

Test data configuration

The configuration is a YAML file for a single table. The top level elements are:

table

table tells data_check for which table to generate the data.

Example:

table: main.simple_table

business_key

business_key is a list of columns that are unique and do not change between iterations.

Example:

business_key:
  - column_1
  - column_2

iterations

iterations configures how many iterations will be generated. Each iteration is a single CSV file with the same business keys. Each column can define a next algorithm that determines how its data changes between iterations.

Example:

iterations:
  count: 5
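The relationship between iterations and business keys can be sketched in Python. This is only an illustration and not data_check's implementation: the function generate_iterations, its change callback and the sample rows are hypothetical, and the sketch assumes that count is the total number of data sets produced.

```python
def generate_iterations(base_rows, business_key, count, change):
    """Return `count` lists of rows; business key columns are never changed.

    Hypothetical helper for illustration, not part of data_check.
    """
    iterations = [base_rows]
    for _ in range(count - 1):
        previous = iterations[-1]
        current = [
            {
                # Keep business key columns, let `change` alter the others.
                col: (val if col in business_key else change(col, val))
                for col, val in row.items()
            }
            for row in previous
        ]
        iterations.append(current)
    return iterations


base = [{"bkey": "A", "col3": 1}, {"bkey": "B", "col3": 10}]
result = generate_iterations(base, ["bkey"], 5, lambda col, val: val + 1)

for number, rows in enumerate(result, start=1):
    print(number, rows)
```

Every iteration contains the same bkey values, while col3 changes from one iteration to the next.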

columns

columns defines a configuration for each column. If a column is not listed here, a default configuration will be used based on the data type.

Example:

columns:
  column_1:
    faker: iban
  column_2:
    from_query: select colx from main.other_table

column configuration

Each column can combine several of the following settings:

faker

faker defines the provider used to generate the data. If not given, a default provider is used based on the data type of the column.

faker: name

Default providers by data type:

data type   default provider
decimal     pydecimal
varchar     pystr
date        date_between
datetime    date_time_between

faker_args

faker_args defines a map of arguments that are passed to the provider. Each provider accepts different arguments.

Example:

faker: date_between
faker_args:
  start_date: 1900-01-01
  end_date: 2030-12-31

from_query

from_query defines a SQL query. Values are picked at random from the query result to generate the data. If from_query is given, faker is ignored.

from_query: select column from other_table

values

values defines a list of values that are picked at random to generate the data. If values is given, from_query and faker are ignored.

values:
  - 1
  - 2
  - null
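The precedence between values, from_query and faker described above can be summarized in a few lines of Python. The function select_source is a hypothetical helper written for this illustration and is not part of data_check.

```python
def select_source(column_config):
    """Return the setting that supplies the base values for a column.

    Hypothetical helper: values wins over from_query, which wins over faker.
    """
    for key in ("values", "from_query", "faker"):
        if key in column_config:
            return key
    # Nothing configured: the default provider for the data type is used.
    return "default provider"


print(select_source({"faker": "name", "values": [1, 2, None]}))
print(select_source({"faker": "name", "from_query": "select colx from main.other_table"}))
print(select_source({}))
```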

add_values

add_values adds specific values that are used in addition to the generated data. add_values can be combined with faker, from_query and values. Each value in add_values has the same probability of occurring as all the other generated values combined.

add_values:
  - 9999-12-31
  - 1900-01-01
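One way to read the probability rule is that a value is drawn uniformly from a list that contains every entry of add_values plus a single slot for "any normally generated value". The sketch below implements this reading only; it is an assumption about the behavior, and pick_value is a hypothetical function, not data_check's code.

```python
import random
from collections import Counter


def pick_value(generate, add_values, rng):
    """Pick one add_values entry or a generated value, each slot equally likely."""
    index = rng.randrange(len(add_values) + 1)
    if index == len(add_values):
        # The extra slot stands for all normally generated values together.
        return generate()
    return add_values[index]


rng = random.Random(0)  # fixed seed so that the run is repeatable
add_values = ["9999-12-31", "1900-01-01"]
draws = 30000
counts = Counter(
    pick_value(lambda: "generated", add_values, rng) for _ in range(draws)
)

for value, count in counts.items():
    print(value, round(count / draws, 3))
```

Under this reading, two entries in add_values each make up roughly a third of the column, and the remaining third comes from the generator.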

next

next defines an algorithm for the iterator. If next is not given then the column is not changed between iterations.

next: inc

Possible values for next:

  • inc: increment by 1 (day for date)
  • dec: decrement by 1 (day for date)
  • random: generate a random value from the configured faker/values
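The three algorithms can be sketched in Python as follows. The function next_value and its generate callback are hypothetical names for this illustration; how data_check handles null values or other data types is not covered.

```python
from datetime import date, timedelta


def next_value(value, algorithm=None, generate=None):
    """Return the value of a column for the next iteration (illustrative only)."""
    if algorithm is None:
        # No next configured: the column does not change between iterations.
        return value
    if algorithm == "random":
        # Draw a fresh value from the configured faker/values source.
        return generate()
    if algorithm not in ("inc", "dec"):
        raise ValueError(f"unknown next algorithm: {algorithm}")
    step = 1 if algorithm == "inc" else -1
    if isinstance(value, date):
        # Dates move by one day.
        return value + timedelta(days=step)
    return value + step


print(next_value(1, "inc"))
print(next_value(date(2024, 12, 31), "inc"))
print(next_value(date(2024, 3, 1), "dec"))
print(next_value("unchanged"))
```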