One project every month - making stuff better ...

Creating complete synthetic test data sets from production data

How easy could it be to generate a complete set of test data from existing tables? To make sure that data looks like production but in no way is a copy or scramble of production data. Instead it is synthetically created, to look like what your production data looks like. Same types, same data domain, same coherence between values, same data distribution and also maintaining the relationships (foreign keys) between all the tables simulating.

Create your Oracle test data using JSON and PLSQL

For a long time the only way to create a generator using the testdata_ninja package, was using a pure text based syntax. It was using special characters and was maybe a bit obscure and not user friendly. The reason for this was that the databases I needed this package on, was running Oracle 10 and 11. So I needed the package to be compatible with those versions, and JSON is not supported natively. Looking at the requirements, it would be a good fit for JSON though. So recently when I needed to run this on Oracle version 12, I had the chance to finally add JSON support. It makes it a lot more easy to write the generator definitions, as you can see in the below examples.

Predictable random test data using plsql

One of the problems in using random generated test data, is that it is random. Because sometimes what you need is test data that is the same every time you generate it. This entry will look at how you can achieve this using the seed functionality of the dbms_random package.

Random null value in plsql generated data

I have added another small feature to the testdata library. The ability to create random data is good, but sometimes you also need to have "gaps" in that data. So I made it possible to randomly create null instead of a value for any generated field. Simply surround the generator function in parentheses and add how big a percentage the chance are of a null value after the parentheses.

Recursive test data using plsql and the testdata_ninja package

Just the other day I was running a test of a system at work. I have already run several tests against this system, so I already have a test data generator to generate clients in the system. This time though I had to generate some orders within the system. So a really quick way to do that is to reuse the existing generator for the client id reference (a foreign key in the real system) from the orders data rows. This way I don't have to create an actual client table. I can just reference the plsql test data generator directly in my definition of my order generator.

PLSQL test data from cursor to table or from cursor to csv

When you are creating test data it is usually either to create a quick test table for a piece of code, or because you need to simulate data extraction to another system. That is why any generator created with testdata_ninja includes methods to do that quickly. Every generator created includes a to_table method and a to_csv table function.

Even more test data generated using the testdata_ninja plsql package

Last entry to the testdata_ninja blog roll, was about how to create iterators that were unique when you create test data rows. In this entry I will show how to use back references to other column values to use them as input to other row generator functions. One example could be that you were creating a table of employees and you wanted the name to be more realistic. So we could have a country column in the row as well and use that as an input to create the name, so it looks like a name from that country.

Generating randomized incremental values

In the previous entries, I explained the basics of testdata_ninja and then I explained how to create foreign key relations between your test data and existing tables. This time I will show how you can create data that are random but always increasing, either to create time patterns or to create unique id columns for primary keys. I will extend the format that I have already used in the previous entries, so the code will be familiar. If we look at the output of the test_days generator we have created, it currently looks like this:

SQL> select * from table(tdg_test_days.test_days);

NAME                                               BIRTHDAY            MY_PK_REF
-------------------------------------------------- ------------------ ----------
Arianna Turner                                     02-MAR-93 00:24:58          1
Sofie Dam                                          17-MAY-91 17:21:09          1
Camila King                                        27-OCT-62 18:16:56         10
Malik Henriksen                                    24-NOV-65 06:23:51         10
.... [more rows] ....
There is no clear column with possible unique values, so I will create a new column where the data generator is an "incrementor". Like all the other different data fields, it is in the 3 field definition of the block that we define how to generate the data. So far the format looks like this:

So let us add a column in the beginning, that increments randomly but is unique nonetheless:

and here is the explanation of the fields, which are separated by the "~" character:
  • incrementor type (numiterate for numbers and datiterate for dates)
  • start from (number or date to start increment from)
  • increment range (for numbers min and max separated by the "¤" character and dates interval type (seconds, minutes etc), min and max separated by the "¤" character for example: seconds¤1¤5)
So if we now take a look at the full format in the code, it would look like this:

  the_format varchar2(4000) := 'people_pk#number#^numiterate~1~3¤5@name#varchar2(50)#person_random.r_name@birthday#date#person_random.r_birthday@my_pk_ref#number#£t_people¤person_num_pk¤simple¤5';
  testdata_ninja.generator_create('test_days', the_format);
And would create a result like the following:

1	Sigurd Andreasen	20-NOV-63 12:33:49	1
6	Ella Madsen	30-JAN-80 07:09:39	1
9	Alexandra White	16-JUL-53 16:32:07	1
13	Benjamin Holst	08-APR-88 13:39:38	1
17	Lea Jepsen	24-MAY-88 17:21:01	1
21	Eli Morgan	07-AUG-86 15:54:27	10
26	Oskar Bruun	20-JUL-69 22:40:56	10
31	Jens Friis	26-APR-63 16:32:21	10
35	Camila Hall	05-DEC-78 23:30:19	10

Test data and foreign keys

In my last entry on testdata_ninja I showed the basics on how to create your own custom test data generators.

Once you have created the simple ones you probably want to extend it a little bit, and create test tables that has relations. So you want to create random data, but random data that actually can join to another table. The way to do this, is with a special data generator construct in the testdata_ninja format. If you read the old entry, you can see that testdata_ninja format is a series of columns seperated by the "@" sign. Each column has at least 3 parts (divided by the "#" character), where the third is the data generator.

So in the previous example I wrote about we used functions from the random_ninja package. You can however, reference existing tables instead. First we want to create a master table that we can reference in our new generator. We will use the basic data generators that are already part of the testdata_ninja package to create this:

create table t_people as select * from table(testdata_generator.people(100));
This statement will create a table with a hundred rows in. This table is what we will use to create our other test data table. One of the rows is called person_num_pk, and that is the column with the relation to our other table. The way we define that is with a special character as the first character in our generator field. So in my last entry I defined the test table with the following format:

So let us say that I wanted to add another column here, that linked to my primary key in the new table I just created. If I wanted to do that I simply create a third column definition like the following:

Notice the "£" sign. This is what specifies this generator as a reference to another table and not a function. The fields in the reference generator (divided with the "¤" character) are the following:

  • table name that we are referencing (t_people)
  • column in that table we are referencing (person_num_pk)
  • distribution rule (simple or range)
  • distribution count. If simple just one number, and every ref will this many rows in the new generator or range and you specify 2,5 then every reference will have between 2 and 5 rows in the new generator
So to create this new table we would simply run:

  the_format varchar2(4000) := 'name#varchar2(50)#person_random.r_name@birthday#date#person_random.r_birthday@my_pk_ref#number#£t_people¤person_num_pk¤simple¤5';
  testdata_ninja.generator_create('test_days', the_format);
and we can now select from the generator and see that each value in the my_pk_ref column will be a real reference back to the t_people table and there will be 5 "child" rows per real row from the t_people table:

  , b.birthday
  t_people a
  , (select name, birthday, my_pk_ref from table(tdg_test_days.test_days)) b
  a.person_num_pk = b.my_pk_ref;

FIRST_NAME                                                                                           BIRTHDAY         
---------------------------------------------------------------------------------------------------- ------------------
Parker                                                                                               14-SEP-64 13:35:56
Parker                                                                                               18-APR-62 05:02:35
Parker                                                                                               05-SEP-93 12:35:50
Parker                                                                                               16-AUG-82 21:19:14
Parker                                                                                               11-MAR-85 15:51:54
Bryson                                                                                               03-SEP-84 19:56:20
Bryson                                                                                               08-JUL-96 14:33:07
Bryson                                                                                               30-DEC-62 06:28:59
.......... [more rows]

Creating real test data - Fast and Efficiently

When I build RANDOM_NINJA I knew already one of things I wanted to use that library for. Building good valid and life-like test data has and is always a really big problem. The base for good test data is good and valid looking random data. Without that, most tests are not really valid, as data can be clustered wrongly, indexes act strange because data does not mimic real life and relations and constraints are hard to define or validate on data that is created using most available randomizing packages. That is why it was important to me that RANDOM_NINJA would be able to create random data from as many different domains as possible. As of today it can create more than 185 different data points from as many as 14 different data domains.

But having good random data is only half of it. You still need something that can define and create those tables. You also need something that can still maintain relations between those test tables, and make sure that foreign key distributions are real as well. So I created TESTDATA_NINJA. This package has some generic generators to create simple tables of people, populations which are true according to UN demographics data, CDR records and credit card transactions. The real function in this package is the custom generation procedure. This procedure can parse a string representation of what your data looks like and from that create pipelined functiones that can create thousands of test rows extremely fast and efficiently.

The testdata_ninja.generator_create procedure takes 2 arguments. The name of the new generator and the format of the test data. Below is a short description on how the format looks like.

Demographically correct test data

One of the main reasons behind finishing versions 1.5.0 of RANDOM_NINJA, was to be able to add the localization for different countries so that I could create test data sets, that are demographically correct.

So after adding that, the TESTDATA_NINJA package, can now create statistically correct data sets for 3 countries:

  • United States
  • China
  • Denmark

The following ratio of data will be correct according to UN Statistics and CIA World Book:

  • Age group - Age ratios will be divided in 0-14, 15-64 and 65+
  • Female/Male - Gender rations will be divided within the different age groups, according to statistics

Identification numbers and birthdays will also be in statistically correct ratios and in valid formats.

As an example I created 3 test tables, by using the population generator from the testdata_ninja package:

create table t_us as select * from table(testdata_ninja.population('US', 0.00001));
create table t_dk as select * from table(testdata_ninja.population('DK', 0.00001));
create table t_cn as select * from table(testdata_ninja.population('CN', 0.00001));

Building Facebook or Germany ....

Whether it is for work, or we are just playing with a new idea back home, there comes a time when we need to create some test data for whatever it is we are testing. The biggest problem with creating test data, is creating some data, that actually looks like real data. Because we want "real" data, if we really want to test how our application works or performs. Things like index selectivity, histograms and much more inside oracle will all behave differently depending on what kind of data there are in your tables. So creating life like data, is important.

Luckily I have previously created a base library that generates random data within a lot of different data domains. The RANDOM_NINJA package can create all that data, but because of the required flexibility in that library, it might not be user friendly, if you just want to create some tables fast to test on. So I decided to create a library that is based on the random_ninja library but binds it together in a little more easy way to create data.

The library is based on generators. All generators are pipelined functions, making it easy to use in create table statements. All generators by default will return 10 rows, but that can be controlled by setting the testdata_ninja.g_default_generator_rows parameter.

In the future there will be more control options for the generators, apart from just the rowcount. The current available generators (and their fields) are:

  • people - Generate a table of people with basic information.
    • Country - Country of the person.
    • Identification - The ID number of the person. Will be a valid ID based on the Country.
    • First name - First name. Will be country and gender specific if data is available in random_ninja.
    • Middle name - Possible middle name. Will be country and gender specific if data is available in random_ninja.
    • Last name - Last name. Will be country and gender specific if data is available in random_ninja.
    • Birthdate - Birthdate of the person.
    • Gender
  • users - Generate a table of website users.
    • Username - Username of website user.
    • Email - User email address.
    • Address1 - First part of address. Always present.
    • Address2 - Second part of address. Randomly present.
    • Zipcode - Zipcode of the address.
    • State - State of the address.
    • Creditcard - Type of credit card.
    • Creditcard number - Creditcard number.
    • Creditcard expiry - Creditcard expiry.
    • Password - The hashed password.
  • cdr - Generate a table of cdr records.
    • Orig imsi - IMSI number of the originating caller phone.
    • Orig isdn - Phone number of originating caller.
    • Orig imei - IMEI of originating caller.
    • Call type - Call type.
    • Call type service - Call service type.
    • Call start latitude - Latitude of the originator at the start of the call.
    • Call start longtitude - Longtitude of the orginator at the start of the call.
    • Call date - Date of the call.
    • Call duration - Call length in seconds.
    • Dest imsi - IMSI number of the destination caller phone.
    • Dest isdn - Phone number of destination phone.
    • Dest imei - IMEI of destination.
    • Network operator - Network operator code.
  • articles - Create a table of articles.
    • Author - The author of the article.
    • Written - Date article is written.
    • Headline - Headline.
    • Lead paragraph - Leading paragraph.
    • Main article - The main part of the article.
    • End paragraph - Final articel paragrpah.