codemonth.dk

One project every month - making stuff better ...

Creating complete synthetic test data sets from production data

How easy could it be to generate a complete set of test data from existing tables? To make sure that data looks like production but in no way is a copy or scramble of production data. Instead it is synthetically created, to look like what your production data looks like. Same types, same data domain, same coherence between values, same data distribution and also maintaining the relationships (foreign keys) between all the tables simulating.

For a quick first demonstration I will use the classic Oracle demo tables DEPT and EMP. You can find the DEPT and EMP script from lots of sites, but I used this one from Mike Hichwa on LiveSQL. So now we have those 2 tables, there are a couple of pre-requisites for testdata_ninja to work:

Before we can let TESTDATA_NINJA do its job, we need to make sure that the involved tables have up-to-date statistics created/generated. This is because TESTDATA_NINJA uses a lot of the information gathered during statistics update, for things like null distribution, uniqueness of values, low-high values and a lot more.

So we run the following step just be sure:


begin
    dbms_stats.gather_table_stats(user, 'DEPT');
    dbms_stats.gather_table_stats(user, 'EMP');
end;
/
When statistics has been run, then we can create the testdata set, that we can move to any other database to create similar looking synthetically created data. It is as simple as running one command:

begin
    testdata_set_generator.generator_set_create('MY_SET', 'EMP');
end;
/
So here we are saying we want a dataset called MY_TEST_SET and it should be based on the table EMP and by default it will include all tables dependent on this table (as well as any potential parent tables, but only one level up). The result from this command is a central package called TDG_MAIN_MY_SET to create the testdata set:


SQL> desc tdg_main_my_set;

PROCEDURE BUILD_SET
Argument Name                  Type                    In/Out Default?
------------------------------ ----------------------- ------ --------
SET_SIZE                       NUMBER                  IN
PREDICTABLE_KEY                VARCHAR2                IN
OVERWRITE                      PL/SQL BOOLEAN          IN

SQL>
and then it automatically create package for each of the tables needed to build the dataset for the EMP table. In our case that means one package to create the DEPT data and one package to create the EMP data:

SQL> select object_name from user_objects
  2  where object_type = 'PACKAGE BODY'
  3  and object_name like '%MY_SET%';

OBJECT_NAME
--------------------------------------------------------------------------------
TDG_MAIN_MY_SET
TDG_MY_SET_DEPT
TDG_MY_SET_EMP

SQL>
So whenever we want to create a new dataset for those tables, we can move those packages to the required database (only other requirement is RANDOM_NINJA), and we can easily re-create all the tables, with constraints, with the right data distribution, with simulated data domain and with the right data dependencies. So let us export these packages from this schema and import them into an empty database:

D:\..\..\testdata_ninja> sqlplus testdata_ninja/testdata_ninja@code

SQL*Plus: Release 18.0.0.0.0 Production on Sat Jun 9 22:55:48 2018
Version 18.2.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

Last Successful login time: Sat Jun 09 2018 22:48:25 +08:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> desc emp;
ERROR:
ORA-04043: object emp does not exist


SQL> desc dept;
ERROR:
ORA-04043: object dept does not exist


SQL> desc tdg_my_set_emp
FUNCTION MY_SET_EMP RETURNS MY_SET_EMP_TAB
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 GENERATOR_COUNT                NUMBER                  IN     DEFAULT
 PREDICTABLE_KEY                VARCHAR2                IN     DEFAULT
 DIST_DEPTNO                    VARCHAR2                IN     DEFAULT
FUNCTION TO_CSV RETURNS CSV_TAB
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 GENERATOR_COUNT                NUMBER                  IN     DEFAULT
 DIST_DEPTNO                    VARCHAR2                IN     DEFAULT
 DELIMITER                      VARCHAR2                IN     DEFAULT
 OPTIONAL_ENCLOSE               VARCHAR2                IN     DEFAULT
 DATE_FORMAT                    VARCHAR2                IN     DEFAULT
 INCLUDE_HEADER                 NUMBER                  IN     DEFAULT
 CUSTOM_HEADER                  VARCHAR2                IN     DEFAULT
 INCLUDE_FOOTER                 NUMBER                  IN     DEFAULT
 CUSTOM_FOOTER                  VARCHAR2                IN     DEFAULT
PROCEDURE TO_TABLE
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TABLE_NAME                     VARCHAR2                IN     DEFAULT
 GENERATOR_COUNT                NUMBER                  IN     DEFAULT
 PREDICTABLE_KEY                VARCHAR2                IN     DEFAULT
 DIST_DEPTNO                    VARCHAR2                IN     DEFAULT
 ADD_FOREIGN_KEYS               BOOLEAN                 IN     DEFAULT
 OVERWRITE                      BOOLEAN                 IN     DEFAULT

SQL> desc tdg_my_set_dept
FUNCTION MY_SET_DEPT RETURNS MY_SET_DEPT_TAB
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 GENERATOR_COUNT                NUMBER                  IN     DEFAULT
 PREDICTABLE_KEY                VARCHAR2                IN     DEFAULT
FUNCTION TO_CSV RETURNS CSV_TAB
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 GENERATOR_COUNT                NUMBER                  IN     DEFAULT
 DELIMITER                      VARCHAR2                IN     DEFAULT
 OPTIONAL_ENCLOSE               VARCHAR2                IN     DEFAULT
 DATE_FORMAT                    VARCHAR2                IN     DEFAULT
 INCLUDE_HEADER                 NUMBER                  IN     DEFAULT
 CUSTOM_HEADER                  VARCHAR2                IN     DEFAULT
 INCLUDE_FOOTER                 NUMBER                  IN     DEFAULT
 CUSTOM_FOOTER                  VARCHAR2                IN     DEFAULT
PROCEDURE TO_TABLE
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 TABLE_NAME                     VARCHAR2                IN     DEFAULT
 GENERATOR_COUNT                NUMBER                  IN     DEFAULT
 PREDICTABLE_KEY                VARCHAR2                IN     DEFAULT
 ADD_FOREIGN_KEYS               BOOLEAN                 IN     DEFAULT
 OVERWRITE                      BOOLEAN                 IN     DEFAULT

SQL>

as you can see our emp and dept table does not exist in this schema, but we have the packages imported. So all we need to do now is call the main package and say we want to create the dataset:


SQL> begin
  2     tdg_main_my_set.build_set;
  3  end;
  4  /

PL/SQL procedure successfully completed.

SQL> desc dept
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 DEPTNO                                    NOT NULL NUMBER
 DNAME                                              VARCHAR2(14)
 LOC                                                VARCHAR2(50)

SQL> desc emp
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 EMPNO                                              NUMBER
 ENAME                                              VARCHAR2(50)
 JOB                                                VARCHAR2(45)
 MGR                                                NUMBER
 HIREDATE                                           DATE
 SAL                                                NUMBER
 COMM                                               NUMBER
 DEPTNO                                             NUMBER

SQL>
and as we can see there is plenty of data now in the tables:

SQL> select * from dept where loc like 'P%';

    DEPTNO DNAME          LOC
---------- -------------- --------------------------------------------------
       104 hcmjiqng       Plymouth
       234 npiudhoif      Provo
       254 qskgljnva      Parkersburg
       564 bhpevlg        Pueblo

SQL> select empno, deptno, job, hiredate from emp
  2  where rownum <= 10;

     EMPNO     DEPTNO JOB                                           HIREDATE
---------- ---------- --------------------------------------------- ---------
      7246       1034 Landscaper                                    21-MAY-84
      7258       1034 Police Officer                                30-DEC-81
      7283       1034 Green Keeper                                  08-SEP-80
      7317        154 Artist                                        19-MAY-85
      7325        154 Associate                                     12-AUG-82
      7347        154 Consultant                                    05-JUL-81
      7357        184 Masonry Worker                                19-FEB-81
      7373        184 Plumber                                       27-JUN-82
      7398        184 Business Analyst                              02-AUG-81
      7436        194 Financial Advisor                             30-SEP-85

10 rows selected.

SQL>
So that is how easy you can build complete datasets of your database to use either for development or SIT/UAT without the need to use your production data. But to use it for any test, we need to make sure that we can create the same data every single time we create the dataset. This we can do by using the predictable_key argument in the build_set procedure. The functionality is already described in an earlier blog post: Predictable random test data using plsql.

Tagged in : RANDOM_NINJA