codemonth.dk

One project every month - making stuff better ...

Creating real test data - Fast and Efficiently

When I build RANDOM_NINJA I knew already one of things I wanted to use that library for. Building good valid and life-like test data has and is always a really big problem. The base for good test data is good and valid looking random data. Without that, most tests are not really valid, as data can be clustered wrongly, indexes act strange because data does not mimic real life and relations and constraints are hard to define or validate on data that is created using most available randomizing packages. That is why it was important to me that RANDOM_NINJA would be able to create random data from as many different domains as possible. As of today it can create more than 185 different data points from as many as 14 different data domains.

But having good random data is only half of it. You still need something that can define and create those tables. You also need something that can still maintain relations between those test tables, and make sure that foreign key distributions are real as well. So I created TESTDATA_NINJA. This package has some generic generators to create simple tables of people, populations which are true according to UN demographics data, CDR records and credit card transactions. The real function in this package is the custom generation procedure. This procedure can parse a string representation of what your data looks like and from that create pipelined functiones that can create thousands of test rows extremely fast and efficiently.

The testdata_ninja.generator_create procedure takes 2 arguments. The name of the new generator and the format of the test data. Below is a short description on how the format looks like.

The format consists of blocks of row columns. Each separated by the "@" character. Each row block consists of at least 3 sections. Each section is divided by the "#" character. The 3 sections are as follows:

  • field name
  • data type
  • data generator

So if you wanted to create a generator that gave you 2 columns for each row, and the first column was a name and the second column was a birth date, you would write the format like this:


name#varchar2(50)#person_random.r_name@birthday#date#person_random.r_birthday
Let us say we want to call our generator for test_days, so the way to create the generator would be like this:

declare
  the_format varchar2(4000) := 'name#varchar2(50)#person_random.r_name@birthday#date#person_random.r_birthday';
begin
  testdata_ninja.generator_create('test_days', the_format);
end;
/
Once that has run, a package called tdg_test_days with a pipelined table function called test_days. So we can get our test data rows like this:

select * from table(tdg_test_days.test_days);
By default the compiled generator will create 100 rows, but this can of course be controlled as an input parameter like this:

select * from table(tdg_test_days.test_days(500));
This example is very simple. So what if we wanted to add some arguments to functions from RANDOM_NINJA, like age group to the random birthday date.?

We can add arguments to data generators if we want to. So say we only wanted birthdays for people older than 64, we can add the senior argument to that function:


declare
  the_format varchar2(4000) := 'name#varchar2(50)#person_random.r_name@birthday#date#person_random.r_birthday#''senior''';
begin
  testdata_ninja.generator_create('test_days', the_format);
end;
/
So recreating the package and now all birthdays will be from people older than 64 years old. The select is the same:

select * from table(tdg_test_days.test_days);
So it is very easy to create test data. In my next entry I will show more advanced usage of the package, such as creating real foreign key data, creating special incremental data and using the nested possibilities of pipelined functions.

Tagged in : RANDOM_NINJA