codemonth.dk

One project every month - making stuff better ...

Creating synthetic and random test data from an existing table

We have all been there at some point. Either we need to run a test but can't bring production data outside of the production network or we need to produce a test case for a support organization that are not allowed to view production data. What to do?

We could scramble the data, but sometimes that obscures problems if data is the problem. We could extract the DDL of the table(s) and try and create the data ourselves, but that usually takes a long time.

I know, because I have been in that situation many times. Which is one of the reasons I started creating both the RANDOM_NINJA package to create "life-like" synthetic data, and the TESTDATA_NINJA package to make it easy to create test data generators on the fly. Until now your only choice to create those generators were using a metadata language to describe your dataset. This is good for situations where you perhaps do not have an existing data model, but need to create some test data quickly. However it is not the fastest way to create test data from existing tables. So I have added a procedure to do just that.

The usage is very simple, so I will demonstrate it using the classic EMP/DEPT demo tables. So let us create the DEMO tables:


create table dept(
  deptno number(2,0),
  dname  varchar2(200),
  loc    varchar2(200),
  constraint pk_dept primary key (deptno)
);

insert into dept
values(10, 'ACCOUNTING', 'NEW YORK');
insert into dept
values(20, 'RESEARCH', 'DALLAS');
insert into dept
values(30, 'SALES', 'CHICAGO');
insert into dept
values(40, 'OPERATIONS', 'BOSTON');

create table emp(
  empno    number(4,0),
  ename    varchar2(150),
  job      varchar2(50),
  mgr      number(4,0),
  hiredate date,
  sal      number(7,2),
  comm     number(7,2),
  deptno   number(2,0),
  constraint pk_emp primary key (empno),
  constraint fk_deptno foreign key (deptno) references dept (deptno)
);

insert into emp
values(
 7839, 'KING', 'PRESIDENT', null,
 to_date('17-11-1981','dd-mm-yyyy'),
 5000, null, 10
);
insert into emp
values(
 7698, 'BLAKE', 'MANAGER', 7839,
 to_date('1-5-1981','dd-mm-yyyy'),
 2850, null, 30
);
insert into emp
values(
 7782, 'CLARK', 'MANAGER', 7839,
 to_date('9-6-1981','dd-mm-yyyy'),
 2450, null, 10
);
insert into emp
values(
 7566, 'JONES', 'MANAGER', 7839,
 to_date('2-4-1981','dd-mm-yyyy'),
 2975, null, 20
);
insert into emp
values(
 7788, 'SCOTT', 'ANALYST', 7566,
 to_date('13-JUL-87','dd-mm-rr') - 85,
 3000, null, 20
);
insert into emp
values(
 7902, 'FORD', 'ANALYST', 7566,
 to_date('3-12-1981','dd-mm-yyyy'),
 3000, null, 20
);
insert into emp
values(
 7369, 'SMITH', 'CLERK', 7902,
 to_date('17-12-1980','dd-mm-yyyy'),
 800, null, 20
);
insert into emp
values(
 7499, 'ALLEN', 'SALESMAN', 7698,
 to_date('20-2-1981','dd-mm-yyyy'),
 1600, 300, 30
);
insert into emp
values(
 7521, 'WARD', 'SALESMAN', 7698,
 to_date('22-2-1981','dd-mm-yyyy'),
 1250, 500, 30
);
insert into emp
values(
 7654, 'MARTIN', 'SALESMAN', 7698,
 to_date('28-9-1981','dd-mm-yyyy'),
 1250, 1400, 30
);
insert into emp
values(
 7844, 'TURNER', 'SALESMAN', 7698,
 to_date('8-9-1981','dd-mm-yyyy'),
 1500, 0, 30
);
insert into emp
values(
 7876, 'ADAMS', 'CLERK', 7788,
 to_date('13-JUL-87', 'dd-mm-rr') - 51,
 1100, null, 20
);
insert into emp
values(
 7900, 'JAMES', 'CLERK', 7698,
 to_date('3-12-1981','dd-mm-yyyy'),
 950, null, 30
);
insert into emp
values(
 7934, 'MILLER', 'CLERK', 7782,
 to_date('23-1-1982','dd-mm-yyyy'),
 1300, null, 10
);

commit;

begin
dbms_stats.gather_table_stats('TESTDATA_NINJA', 'DEPT');
dbms_stats.gather_table_stats('TESTDATA_NINJA', 'EMP');
end;
/
So a select from the EMP table would look like this:

SQL> select * from emp;

EMPNO ENAME      JOB               MGR HIREDATE                  SAL       COMM     DEPTNO
---------- ---------- ---------- ---------- ------------------ ---------- ---------- ----------
 7839 KING       PRESIDENT             17-NOV-81 00:00:00       5000                    10
 7698 BLAKE      MANAGER          7839 01-MAY-81 00:00:00       2850                    30
 7782 CLARK      MANAGER          7839 09-JUN-81 00:00:00       2450                    10
 7566 JONES      MANAGER          7839 02-APR-81 00:00:00       2975                    20
 7788 SCOTT      ANALYST          7566 19-APR-87 00:00:00       3000                    20
 7902 FORD       ANALYST          7566 03-DEC-81 00:00:00       3000                    20
 7369 SMITH      CLERK            7902 17-DEC-80 00:00:00        800                    20
 7499 ALLEN      SALESMAN         7698 20-FEB-81 00:00:00       1600        300         30
 7521 WARD       SALESMAN         7698 22-FEB-81 00:00:00       1250        500         30
 7654 MARTIN     SALESMAN         7698 28-SEP-81 00:00:00       1250       1400         30
 7844 TURNER     SALESMAN         7698 08-SEP-81 00:00:00       1500          0         30
 7876 ADAMS      CLERK            7788 23-MAY-87 00:00:00       1100                    20
 7900 JAMES      CLERK            7698 03-DEC-81 00:00:00        950                    30
 7934 MILLER     CLERK            7782 23-JAN-82 00:00:00       1300                    10

14 rows selected

Now let us imagine that this was highly confidential customer data, and that we needed to create a replica table output, so we could do some new development in the development environment. Maybe you could use DBMS_RANDOM to create random text strings to create new rows, but that makes it difficult to create "synthetic" numbers and names/words that mimic the real data. It is also difficult to directly get dates from using just DBMS_RANDOM. A quick attempt would probably look a little bit like this:


SQL> select
  2  round(dbms_random.value(1000,9999)) as empno
  3  , dbms_random.string('a', 10) as ename
  4  , dbms_random.string('a', 10) as job
  5  , round(dbms_random.value(1000,9999)) as mgr
  6  , sysdate - round(dbms_random.value(40,400)) as hiredate
  7  , round(dbms_random.value(500,5000)) as sal
  8  , round(dbms_random.value(0,400)) as comm
  9  , round(dbms_random.value(10,30)) as deptno
 10  from dual;

 EMPNO ENAME      JOB               MGR HIREDATE                  SAL       COMM     DEPTNO
---------- ---------- ---------- ---------- ------------------ ---------- ---------- ----------
  9693 CrPesZKtlP erPoCVhywV       4869 06-DEC-17 07:35:53       3307        109         29

The row might look ok at first, but name and job are garbage (which makes it harder to relate to as developer) and what if I need to create 5 rows or 10 rows? And this is just a simple table. Luckily the testdata_ninja package can make this a lot easier:


SQL> begin
  2  testdata_ninja.generator_create(generator_name => 'my_emps_t', generator_table => 'EMP');
  3  end;
  4  /
PL/SQL procedure successfully completed.

That is it. It is that easy! Now we have a pipelined function ready to use to generate all the data we need:


SQL> select * from table(tdg_my_emps_t.my_emps_t(5));

EMPNO ENAME                JOB                         MGR HIREDATE                  SAL       COMM     DEPTNO
---------- -------------------- -------------------- ---------- ------------------ ---------- ---------- ----------
  7369 Gregersen            Plumber                    7876 19-MAR-84 13:31:13     604622                    16
  7370 Ibsen                Security Specialist        7849 25-JUN-85 15:47:47      26305        365         14
  7371 Nelson               Attorney                   7883 24-MAR-85 22:56:25      69708                    18
  7372 Mortensen            Police Officer             7808 13-APR-85 19:43:34     651051       1277         22
  7373 Roberts              Inspector                  7617 23-APR-87 05:36:42      62108                    17

There is a few things to notice here.

  1. We did not have to name the columns, that was done automatically.
  2. The package automatically infers some of the data as names or titles, so the data "looks real".
  3. ID columns automatically follows the same structure as the original table.
  4. The package automatically recognizes that the COMM column include NULL values
  5. HIREDATE values will follow the same pattern as the actual dates in the table
All of this, we get from just one line of PLSQL. The package is ready to be exported and moved to a test environment, without any dependencies on the actual production objects or tables.

The part of the code that guesses/infers/learns the data domain of the individual columns is rather crude at the moment, but I am currently building a much more intelligent version, that covers a lot more. More on that in another blog post.

Tagged in : RANDOM_NINJA