Analyse

Model the analysis

Info

All the modelling described in this tutorial can be viewed with comments in the Test folder of the Profile Templates (TestAnalyzeProfile.php).

Get an overview

To get an overview of the data in the database, you can use the command pseudify:debug:table_schema.
Of course, you can also use any other tool of your choice.

$ pseudify pseudify:debug:table_schema

wh_log
------

 -------------------- --------- --------------------------------------------------------------------------------------------------------- 
  column               type      data example                                                                                             
 -------------------- --------- --------------------------------------------------------------------------------------------------------- 
  id                   integer   6                                                                                                        
  log_type             string    foo                                                                                                      
  log_data             blob      613a323a7b693a303b733a33383a223466623a313434373a646566623a396434373a613265303a613336613a313064333a66...  
  log_message          text      {"message":"foo text \"ronaldo15\", another \"mcclure.ofelia@example.com\""}                             
  ip                   string    4fb:1447:defb:9d47:a2e0:a36a:10d3:fd98                                                                   
 -------------------- --------- --------------------------------------------------------------------------------------------------------- 

wh_meta_data
------------

 --------------------- --------- --------------------------------------------------------------------------------------------------------- 
  column                type      data example                                                                                             
 --------------------- --------- --------------------------------------------------------------------------------------------------------- 
  id                    integer   5                                                                                                        
  meta_data             blob      1f8b08000000000000036592dd6ea33010855f65657159116ca0818922f52fca6ea5d52a4bab46bd89066c821b302c769246...  
 --------------------- --------- --------------------------------------------------------------------------------------------------------- 

wh_user
-------

 -------------------- --------- ---------------------------------------------------------------------------------------------- 
  column               type      data example                                                                                  
 -------------------- --------- ---------------------------------------------------------------------------------------------- 
  id                   integer   5                                                                                             
  username             string    howell.damien                                                                                 
  password             string    $argon2i$v=19$m=8,t=1,p=1$ZldmOWd2TDJRb3FTNVpGNA$ORIwp6yekRx02mqM4WCTVhllgXpUpuFJZ1MmbYwAMXs  
  first_name           string    Mckayla                                                                                       
  last_name            string    Stoltenberg                                                                                   
  email                string    cassin.bernadette@example.net                                                                 
  city                 string    South Wilfordland                                                                             
 -------------------- --------- ---------------------------------------------------------------------------------------------- 

wh_user_session
---------------

 ------------------- --------- -------------------------------------------------------------------- 
  column              type      data example                                                        
 ------------------- --------- -------------------------------------------------------------------- 
  id                  integer   5                                                                   
  session_data        blob      a:1:{s:7:"last_ip";s:38:"4fb:1447:defb:9d47:a2e0:a36a:10d3:fd98";}  
  session_data_json   text      {"data":{"last_ip":"4fb:1447:defb:9d47:a2e0:a36a:10d3:fd98"}}
 ------------------- --------- --------------------------------------------------------------------

The command outputs all tables of the database one after the other and lists their columns.
The column column contains the name of the database column.
The column type contains the human-readable name of the data type of the database column.
The column data example contains the longest data record that can be found in the database in this column. After 100 characters, the data will be truncated.

Search for personal data that you want to pseudonymise.
Search for names, user names, passwords, addresses, email addresses, IP addresses, telephone numbers, ID numbers such as insurance numbers, profile data such as height or weight, etc.

Info

If you need suggestions, read the chapter "What to pseudonymise?".

It is best to note the columns with directly visible personal data, i.e. the columns that contain data in plain text and not those with more complex data structures such as JSON (e.g. the column wh_log.log_message) or those in which the data is available in encoded form (e.g. the column wh_log.log_data).
In the example, the preferred columns would be:

  • wh_log.ip
  • wh_user.username
  • wh_user.password
  • wh_user.first_name
  • wh_user.last_name
  • wh_user.email
  • wh_user.city

Model an "Analyze Profile"

Create a "Profile"

In the folder src/Profiles create a PHP file with any name.
In the example, the file is called TestAnalyzeProfile.php.
The file will have the following content:

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        return $tableDefinition;
    }
}

The getIdentifier() method must return a unique identifier of your profile and should only consist of letters, numbers or the characters - and _ and must not contain any spaces.

After creating the profile, the cache must be cleared.

$ pseudify cache:clear

The command pseudify pseudify:debug:analyse test-profile already gives you information about your profile.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 ------- -------- --------------- ----------------- 
  Table   column   data decoders   data collectors  
 ------- -------- --------------- ----------------- 

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------- 
  Table             column                     data decoders   special data decoders  
 ----------------- -------------------------- --------------- ----------------------- 
  wh_log            id (integer)               Scalar          no further processing  
  wh_log            log_type (string)          Scalar          no further processing  
  wh_log            log_data (blob)            Scalar          no further processing  
  wh_log            log_message (text)         Scalar          no further processing  
  wh_log            ip (string)                Scalar          no further processing  
  wh_meta_data      id (integer)               Scalar          no further processing  
  wh_meta_data      meta_data (blob)           Scalar          no further processing  
  wh_user           id (integer)               Scalar          no further processing  
  wh_user           username (string)          Scalar          no further processing  
  wh_user           password (string)          Scalar          no further processing  
  wh_user           first_name (string)        Scalar          no further processing  
  wh_user           last_name (string)         Scalar          no further processing  
  wh_user           email (string)             Scalar          no further processing  
  wh_user           city (string)              Scalar          no further processing  
  wh_user_session   id (integer)               Scalar          no further processing  
  wh_user_session   session_data (blob)        Scalar          no further processing  
  wh_user_session   session_data_json (text)   Scalar          no further processing  
 ----------------- -------------------------- --------------- ----------------------- 

Define source data

Info

The "Analyze Profile" is used to determine in which "unlit corners" of the database other personal data are hiding.
We therefore use the personal data already known to us, which we identified in the first step, to find them in the rest of the database.

We have identified personal data in the following columns:

  • wh_log.ip
  • wh_user.username
  • wh_user.password
  • wh_user.first_name
  • wh_user.last_name
  • wh_user.email
  • wh_user.city

You must now tell pseudify that you want to use the data in these columns as source data.
To do this, you extend the getTableDefinition() method in the profile.

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_log', columns: [
                'ip',
            ])
            ->addSourceTable(table: 'wh_user', columns: [
                'username',
                'password',
                'first_name',
                'last_name',
                'email',
                'city',
            ])
        ;

        return $tableDefinition;
    }

With the method addSourceTable() you tell pseudify in which database table and in which database columns the source data should be collected.
Pseudify will then automatically search for occurrences of the source data in all other database columns of the database tables and output them.
Previously, the output of the command pseudify:debug:analyse test-profile contained all database tables and all database columns under Search data in this tables.
Now only the database tables and their database columns that were not defined as source data using addSourceTable() are listed there.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 --------- --------------------- --------------- ----------------------- 
  Table     column                data decoders   data collectors        
 --------- --------------------- --------------- ----------------------- 
  wh_log    ip (string)           Scalar          default (scalar data)  
  wh_user   username (string)     Scalar          default (scalar data)  
  wh_user   password (string)     Scalar          default (scalar data)  
  wh_user   first_name (string)   Scalar          default (scalar data)  
  wh_user   last_name (string)    Scalar          default (scalar data)  
  wh_user   email (string)        Scalar          default (scalar data)  
  wh_user   city (string)         Scalar          default (scalar data)  
 --------- --------------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------- 
  Table             column                     data decoders   special data decoders  
 ----------------- -------------------------- --------------- ----------------------- 
  wh_log            id (integer)               Scalar          no further processing  
  wh_log            log_type (string)          Scalar          no further processing  
  wh_log            log_data (blob)            Scalar          no further processing  
  wh_log            log_message (text)         Scalar          no further processing  
  wh_meta_data      id (integer)               Scalar          no further processing  
  wh_meta_data      meta_data (blob)           Scalar          no further processing  
  wh_user           id (integer)               Scalar          no further processing  
  wh_user_session   id (integer)               Scalar          no further processing  
  wh_user_session   session_data (blob)        Scalar          no further processing  
  wh_user_session   session_data_json (text)   Scalar          no further processing  
 ----------------- -------------------------- --------------- -----------------------
Encoded data as source data

It happens that data in database columns are in encoded form.
This means that the encoded plaintext must be decoded during the analysis in order to be able to use it as source data.
Similar to what is described under "Search encoded data", the database columns of the source data can also be decoded.

The method SourceColumn::create() can be given a name of a built-in decoder with the parameter dataType.

Note

As described in "Search multiple encoded data", the ChainedEncoder can also be used here to decode multiple-encoded data.

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\SourceColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_test_table', columns: [
                SourceColumn::create(identifier: 'wh_test_column', dataType: SourceColumn::DATA_TYPE_HEX),
            ])
        ;

        return $tableDefinition;
    }
}

You will now see under Collect search data from these tables that the name Hex is listed under data decoders in the database column session_data_json.
This signals to you that the data will be decoded using the HexEncoder.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 ----------------- -------------------------- --------------- ---------------------- 
  Table             column                     data decoders   data collectors     
 ----------------- -------------------------- --------------- ----------------------
  wh_test_table     wh_test_column (text)      Hex             default (scalar data) 
 ----------------- -------------------------- --------------- ---------------------- 

Search data in this tables
--------------------------

 ----------------- --------------------- --------------- ----------------------- 
  Table             column                data decoders   special data decoders  
 ----------------- --------------------- --------------- ----------------------- 
  wh_log            id (integer)          Scalar          no further processing  
  wh_log            log_type (string)     Scalar          no further processing  
  wh_log            log_data (blob)       Scalar          no further processing  
  wh_log            log_message (text)    Scalar          no further processing  
  wh_log            ip (string)           Scalar          no further processing  
  wh_meta_data      id (integer)          Scalar          no further processing  
  wh_meta_data      meta_data (blob)      Scalar          no further processing  
  wh_user           id (integer)          Scalar          no further processing  
  wh_user           username (string)     Scalar          no further processing  
  wh_user           password (string)     Scalar          no further processing  
  wh_user           first_name (string)   Scalar          no further processing  
  wh_user           last_name (string)    Scalar          no further processing  
  wh_user           email (string)        Scalar          no further processing  
  wh_user           city (string)         Scalar          no further processing  
  wh_user_session   id (integer)          Scalar          no further processing  
  wh_user_session   session_data (blob)   Scalar          no further processing  
 ----------------- --------------------- --------------- -----------------------

Alternatively, ->setEncoder(encoder: new HexEncoder()) can be used:

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\SourceColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_test_table', columns: [
                SourceColumn::create(identifier: 'wh_test_column')->setEncoder(encoder: new HexEncoder()),
            ])
        ;

        return $tableDefinition;
    }
}

Without further definition, pseudify will search for the source data in all database tables and their columns that have not been defined as source data using addSourceTable() or addColumn().
The search can be optimised so that the analysis does not take an unnecessarily long time.
The aim is usually to search only "text" (strings).

Exclude data types

You can exclude columns with certain data types from the search to shorten the search time.
For example, in most cases it does not make sense to search database columns of the type `integer'.
Data types can be excluded for certain tables or for all tables.
As soon as data types are excluded at the table level, the globally excluded data types are not additionally excluded for this table.

Info

You can find the names of the data types in the source code of the Doctrine project, e.g. string, integer, datetime etc: string, integer, datetime etc.

Info

There is the constant TableDefinition::COMMON_EXCLUED_TARGET_COLUMN_TYPES, which contains all data types that do not normally have to be scanned.

Exclude data types at table level

To exclude all columns with the data type integer in the table wh_meta_data from the search, you must extend the method getTableDefinition() in the profile:

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            // ...
            ->addTargetTable(table: 'wh_meta_data', excludeColumnTypes: [
                'integer'
            ])
        ;

        return $tableDefinition;
    }

The method addTargetTable() tells the automatic table configuration that you want to configure the table wh_meta_data specifically.
In the parameter excludeColumnTypes you can pass an array of data types to be excluded from the search.

Exclude data types globally

To globally exclude all columns with the data type integer from the search in all tables, you must extend the method getTableDefinition() in the profile:

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_log', columns: [
                'ip',
            ])
            ->addSourceTable(table: 'wh_user', columns: [
                'username',
                'password',
                'first_name',
                'last_name',
                'email',
                'city',
            ])

            ->excludeTargetColumnTypes(columnTypes: [
                'integer'
            ])
        ;

        return $tableDefinition;
    }

The method excludeTargetColumnTypes() tells the automatic table configuration,
that in all tables (which have no special exclusions defined) all columns of the data type integer are to be excluded from the search.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 --------- --------------------- --------------- ----------------------- 
  Table     column                data decoders   data collectors        
 --------- --------------------- --------------- ----------------------- 
  wh_log    ip (string)           Scalar          default (scalar data)  
  wh_user   username (string)     Scalar          default (scalar data)  
  wh_user   password (string)     Scalar          default (scalar data)  
  wh_user   first_name (string)   Scalar          default (scalar data)  
  wh_user   last_name (string)    Scalar          default (scalar data)  
  wh_user   email (string)        Scalar          default (scalar data)  
  wh_user   city (string)         Scalar          default (scalar data)  
 --------- --------------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------- 
  Table             column                     data decoders   special data decoders  
 ----------------- -------------------------- --------------- ----------------------- 
  wh_log            log_type (string)          Scalar          no further processing  
  wh_log            log_data (blob)            Scalar          no further processing  
  wh_log            log_message (text)         Scalar          no further processing  
  wh_meta_data      meta_data (blob)           Scalar          no further processing  
  wh_user_session   session_data (blob)        Scalar          no further processing  
  wh_user_session   session_data_json (text)   Scalar          no further processing  
 ----------------- -------------------------- --------------- -----------------------

You will now see under Search data in this tables that all integer columns have disappeared.

As a rule, it is a good idea to integrate the following line in the profile in order to globally exclude all data types for which it does not make sense to search:

->excludeTargetColumnTypes(columnTypes: TableDefinition::COMMON_EXCLUED_TARGET_COLUMN_TYPES)
Exclude database columns

The automatic table configuration will always exclude database columns from the search first on the basis of the data type.
In addition, you can define in the profile at table level that database columns are to be excluded from the search based on their name.
To do this, you must extend the getTableDefinition() method in the profile:

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_log', columns: [
                'ip',
            ])
            ->addSourceTable(table: 'wh_user', columns: [
                'username',
                'password',
                'first_name',
                'last_name',
                'email',
                'city',
            ])

            ->excludeTargetColumnTypes(columnTypes: TableDefinition::COMMON_EXCLUED_TARGET_COLUMN_TYPES)

            ->addTargetTable(table: 'wh_log', excludeColumns: [
                'log_message',
            ])
        ;

        return $tableDefinition;
    }

The method addTargetTable() tells the automatic table configuration that you want to configure the table wh_log specifically.
In the parameter excludeColumns you can pass an array of column names which should be excluded from the search.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 --------- --------------------- --------------- ----------------------- 
  Table     column                data decoders   data collectors        
 --------- --------------------- --------------- ----------------------- 
  wh_log    ip (string)           Scalar          default (scalar data)  
  wh_user   username (string)     Scalar          default (scalar data)  
  wh_user   password (string)     Scalar          default (scalar data)  
  wh_user   first_name (string)   Scalar          default (scalar data)  
  wh_user   last_name (string)    Scalar          default (scalar data)  
  wh_user   email (string)        Scalar          default (scalar data)  
  wh_user   city (string)         Scalar          default (scalar data)  
 --------- --------------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------- 
  Table             column                     data decoders   special data decoders  
 ----------------- -------------------------- --------------- ----------------------- 
  wh_log            log_type (string)          Scalar          no further processing  
  wh_log            log_data (blob)            Scalar          no further processing  
  wh_meta_data      meta_data (blob)           Scalar          no further processing  
  wh_user_session   session_data (blob)        Scalar          no further processing  
  wh_user_session   session_data_json (text)   Scalar          no further processing  
 ----------------- -------------------------- --------------- -----------------------

You now see under Search data in these tables that the column log_message of the table wh_log has disappeared.

Exclude tables

You can exclude whole tables from the search to shorten the search time.
This can be done with the method excludeTargetTables().

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_log', columns: [
                'ip',
            ])

            ->excludeTargetColumnTypes(columnTypes: TableDefinition::COMMON_EXCLUED_TARGET_COLUMN_TYPES)

            ->excludeTargetTables(tables: [
                'wh_user',
            ])
        ;

        return $tableDefinition;
    }

As you can see, the table wh_user is no longer listed under Search data in these tables.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 -------- ------------- --------------- ----------------------- 
  Table    column        data decoders   data collectors        
 -------- ------------- --------------- ----------------------- 
  wh_log   ip (string)   Scalar          default (scalar data)  
 -------- ------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 ----------------- ---------------------------- --------------- ----------------------- 
  Table             column                       data decoders   special data decoders  
 ----------------- ---------------------------- --------------- ----------------------- 
  wh_log            log_type (string)            Scalar          no further processing  
  wh_log            log_data (blob)              Scalar          no further processing  
  wh_log            log_data_plaintext (blob)    Scalar          no further processing  
  wh_log            log_message (text)           Scalar          no further processing  
  wh_meta_data      meta_data (blob)             Scalar          no further processing  
  wh_meta_data      meta_data_plaintext (blob)   Scalar          no further processing  
  wh_user_session   session_data (blob)          Scalar          no further processing  
 ----------------- ---------------------------- --------------- -----------------------

Regular expressions can be used in the table names to be excluded, e.g.: wh_user.*.
This makes it possible, for example, to exclude several tables with one expression:

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_log', columns: [
                'ip',
            ])

            ->excludeTargetColumnTypes(columnTypes: TableDefinition::COMMON_EXCLUED_TARGET_COLUMN_TYPES)

            ->excludeTargetTables(tables: [
                'wh_user.*',
            ])
        ;

        return $tableDefinition;
    }

As you can see, the tables wh_user and the table wh_user_session are no longer listed under Search data in these tables.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 -------- ------------- --------------- ----------------------- 
  Table    column        data decoders   data collectors        
 -------- ------------- --------------- ----------------------- 
  wh_log   ip (string)   Scalar          default (scalar data)  
 -------- ------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 -------------- ---------------------------- --------------- ----------------------- 
  Table          column                       data decoders   special data decoders  
 -------------- ---------------------------- --------------- ----------------------- 
  wh_log         log_type (string)            Scalar          no further processing  
  wh_log         log_data (blob)              Scalar          no further processing  
  wh_log         log_data_plaintext (blob)    Scalar          no further processing  
  wh_log         log_message (text)           Scalar          no further processing  
  wh_meta_data   meta_data (blob)             Scalar          no further processing  
  wh_meta_data   meta_data_plaintext (blob)   Scalar          no further processing  
 -------------- ---------------------------- --------------- ----------------------- 

Search encoded data

It happens that data in database columns are in encoded form.
This means that the encoded plaintext must be decoded during the analysis.
In our example, the database column log_data of the table wh_log and the database column meta_data of the table wh_meta_data contain encoded data.
You have to find out how this data is encoded by looking at the source code or the documentation of the application that uses the database.

In our example, the data in the database column log_data (with log_type = bar) is encoded as follows.

Database data:

613a323a7b693a303b733a31353a223133322e3138382e3234312e313535223b733a343a2275736572223b4f3a383a22737464436c617373223a353a7b733a383a22757365724e616d65223b733a373a22637972696c3036223b733a383a226c6173744e616d65223b733a383a22486f6d656e69636b223b733a353a22656d61696c223b733a32313a22636c696e746f6e3434406578616d706c652e6e6574223b733a323a226964223b693a39313b733a343a2275736572223b523a333b7d7d

Encoding through the application:

$plaintext = 'a:2:{i:0;s:15:"132.188.241.155";s:4:"user";O:8:"stdClass":5:{s:8:"userName";s:7:"cyril06";s:8:"lastName";s:8:"Homenick";s:5:"email";s:21:"clinton44@example.net";s:2:"id";i:91;s:4:"user";R:3;}}';
$logData = bin2hex($plaintext);

In order for pseudify to search the data ($plaintext), the data must first be converted from hexadecimal representation to binary format.
For this purpose, the data type (parameter dataType) can be passed to the definition of a database column (TargetColumn::create()).

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetTable;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            // ...
            ->addTargetTable(table: TargetTable::create(identifier: 'wh_log',
                columns: [
                    TargetColumn::create(identifier: 'log_data', dataType: TargetColumn::DATA_TYPE_HEX),
                ]
            ))
        ;

        return $tableDefinition;
    }
}

The method TargetColumn::create() can be passed with the parameter dataType a name of a built-in decoder.
This is equivalent to: ->setEncoder(encoder: new HexEncoder()).

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Processor\Encoder\HexEncoder;
use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetTable;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            // ...
            ->addTargetTable(table: TargetTable::create(identifier: 'wh_log',
                columns: [
                    TargetColumn::create(identifier: 'log_data')->setEncoder(encoder: new HexEncoder()),
                ]
            ))
        ;

        return $tableDefinition;
    }
}

When searching the database column log_data, pseudify will then always process the database column data using the decode() method of the HexEncoder and then search the result.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 --------- --------------------- --------------- ----------------------- 
  Table     column                data decoders   data collectors        
 --------- --------------------- --------------- ----------------------- 
  wh_log    ip (string)           Scalar          default (scalar data)  
  wh_user   username (string)     Scalar          default (scalar data)  
  wh_user   password (string)     Scalar          default (scalar data)  
  wh_user   first_name (string)   Scalar          default (scalar data)  
  wh_user   last_name (string)    Scalar          default (scalar data)  
  wh_user   email (string)        Scalar          default (scalar data)  
  wh_user   city (string)         Scalar          default (scalar data)  
 --------- --------------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------- 
  Table             column                     data decoders   special data decoders  
 ----------------- -------------------------- --------------- ----------------------- 
  wh_log            log_data (blob)            Hex             no further processing  
  wh_log            log_type (string)          Scalar          no further processing  
  wh_log            log_message (text)         Scalar          no further processing  
  wh_meta_data      meta_data (blob)           Scalar          no further processing  
  wh_user_session   session_data (blob)        Scalar          no further processing  
  wh_user_session   session_data_json (text)   Scalar          no further processing  
 ----------------- -------------------------- --------------- -----------------------

You will now see under Search data in this tables that the name Hex is listed under data decoders in the database column log_data.
This signals to you that the data is being decoded using the HexEncoder.

Search multiple encoded data

It happens that data in database columns are stored in multiple encoded form.
In our example, the data of the database column meta_data are encoded like this:

$plaintext = 'a:3:{s:4:"key1";a:9:{s:2:"id";i:5;s:8:"username";s:13:"howell.damien";s:8:"password";s:92:"$argon2i$v=19$m=8,t=1,p=1$ZldmOWd2TDJRb3FTNVpGNA$ORIwp6yekRx02mqM4WCTVhllgXpUpuFJZ1MmbYwAMXs";s:18:"password_hash_type";s:8:"argon2id";s:18:"password_plaintext";s:13:"nF5;06?nsS/nE";s:10:"first_name";s:7:"Mckayla";s:9:"last_name";s:11:"Stoltenberg";s:5:"email";s:24:"conn.abigale@example.net";s:4:"city";s:11:"Dorothyfort";}s:4:"key2";a:2:{s:2:"id";i:3;s:12:"session_data";s:41:"a:1:{s:7:"last_ip";s:13:"244.166.32.78";}";}s:4:"key3";a:1:{s:4:"key4";s:12:"139.81.0.139";}}';
$meta_data = bin2hex(gzencode($plaintext, 5, ZLIB_ENCODING_GZIP));

In order for pseudify to search the data ($plaintext), the data must first be converted from hexadecimal representation to binary format and then the binary data must still be decompressed from ZLIB format.
To perform multiple decoding, the ChainedEncoder can be used.
With the ChainedEncoder, several decoders can be configured, which then decode the data in sequence.

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Processor\Encoder\ChainedEncoder;
use Waldhacker\Pseudify\Core\Processor\Encoder\GzEncodeEncoder;
use Waldhacker\Pseudify\Core\Processor\Encoder\HexEncoder;
use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetTable;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            // ...
            ->addTargetTable(table: TargetTable::create(identifier: 'wh_meta_data',
                columns: [
                    TargetColumn::create(identifier: 'meta_data')->setEncoder(encoder: new ChainedEncoder(encoders: [
                        new HexEncoder(),
                        new GzEncodeEncoder(defaultContext: [
                            GzEncodeEncoder::ENCODE_LEVEL => 5,
                            GzEncodeEncoder::ENCODE_ENCODING => ZLIB_ENCODING_GZIP,
                        ]),
                    ])),
                ]
            ))
        ;

        return $tableDefinition;
    }
}

When searching the database column meta_data of the table wh_meta_data, pseudify will then first process the data of the database column using the method decode() of the HexEncoder
and then by the decode() method of the GzEncodeEncoder and then search the result.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 --------- --------------------- --------------- ----------------------- 
  Table     column                data decoders   data collectors        
 --------- --------------------- --------------- ----------------------- 
  wh_log    ip (string)           Scalar          default (scalar data)  
  wh_user   username (string)     Scalar          default (scalar data)  
  wh_user   password (string)     Scalar          default (scalar data)  
  wh_user   first_name (string)   Scalar          default (scalar data)  
  wh_user   last_name (string)    Scalar          default (scalar data)  
  wh_user   email (string)        Scalar          default (scalar data)  
  wh_user   city (string)         Scalar          default (scalar data)  
 --------- --------------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------- 
  Table             column                     data decoders   special data decoders  
 ----------------- -------------------------- --------------- ----------------------- 
  wh_meta_data      meta_data (blob)           Hex>GzEncode    no further processing  
  wh_log            log_type (string)          Scalar          no further processing  
  wh_log            log_data (blob)            Scalar          no further processing  
  wh_log            log_message (text)         Scalar          no further processing  
  wh_user_session   session_data (blob)        Scalar          no further processing  
  wh_user_session   session_data_json (text)   Scalar          no further processing  
 ----------------- -------------------------- --------------- -----------------------

You will now see under Search data in this tables that under data decoders of the database column wh_meta_data the names Hex>GzEncode are listed.
This signals to you that the data will first be decoded using the HexEncoder and then using the GzEncodeEncoder.

Search differently encoded data

It happens that data in database columns are stored in differently encoded form.
Based on conditions, applications store the data in different forms.

In our example, the data of the database column log_data are encoded as follows if the database column log_type contains the value bar.

Database data:

613a323a7b693a303b733a31353a223133322e3138382e3234312e313535223b733a343a2275736572223b4f3a383a22737464436c617373223a353a7b733a383a22757365724e616d65223b733a373a22637972696c3036223b733a383a226c6173744e616d65223b733a383a22486f6d656e69636b223b733a353a22656d61696c223b733a32313a22636c696e746f6e3434406578616d706c652e6e6574223b733a323a226964223b693a39313b733a343a2275736572223b523a333b7d7d

Encoding through the application:

$plaintext = 'a:2:{i:0;s:15:"132.188.241.155";s:4:"user";O:8:"stdClass":5:{s:8:"userName";s:7:"cyril06";s:8:"lastName";s:8:"Homenick";s:5:"email";s:21:"clinton44@example.net";s:2:"id";i:91;s:4:"user";R:3;}}';
$logData = bin2hex($plaintext);

In order for pseudify to search the data ($plaintext), the data must first be converted from hexadecimal representation to binary format.

The data of the database column log_data are encoded as follows if the database column log_type contains the value foo.

Database data:

65794a3163325679546d46745a534936496e4a76626d46735a4738784e534973496d567459576c73496a6f6962574e6a624856795a5335765a6d5673615746415a586868625842735a53356a623230694c434a7359584e30546d46745a534936496b746c5a577870626d63694c434a7063434936496a457a4d6a45364e54646d597a6f304e6a42694f6d51305a4441365a44677a5a6a706a4d6a41774f6a52694f6d5978597a676966513d3d

Encoding through the application:

$plaintext = '{"userName":"ronaldo15","email":"mcclure.ofelia@example.com","lastName":"Keeling","ip":"1321:57fc:460b:d4d0:d83f:c200:4b:f1c8"}';
$logData = bin2hex(base64_encode($logDataPlaintext));

In order for pseudify to search the data ($plaintext), the data must first be converted from hexadecimal representation to binary format and then decoded in Base64 format.

In both cases (log_type == foo and log_type == bar) the data can first be converted from hexadecimal representation to binary format.
If the database column contains log_type == foo, the data must then additionally be base64 decoded.
This can be modelled as follows:

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Processor\Encoder\Base64Encoder;
use Waldhacker\Pseudify\Core\Processor\Processing\Analyze\TargetDataDecoderContext;
use Waldhacker\Pseudify\Core\Processor\Processing\DataProcessing;
use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetTable;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            // ...
            ->addTargetTable(table: TargetTable::create(identifier: 'wh_log',
                columns: [
                    TargetColumn::create(identifier: 'log_data', dataType: TargetColumn::DATA_TYPE_HEX)
                        ->addDataProcessing(dataProcessing: new DataProcessing(identifier: 'decode conditional log data',
                            processor: function (TargetDataDecoderContext $context): void {
                                $row = $context->getDatebaseRow();
                                if ('foo' !== $row['log_type']) {
                                    return;
                                }
                                $data = $context->getDecodedData();

                                $encoder = new Base64Encoder();
                                $logData = $encoder->decode(data: $data);

                                $context->setDecodedData(decodedData: $logData);
                            }
                        )),
                ]
            ))
        ;

        return $tableDefinition;
    }
}

With the method addDataProcessing(), further manual data transformations can be programmed in addition to the decoding of the data.
The DataProcessings are executed after the decoding of the data.
Any number of DataProcessings can be defined, which are processed one after the other.

A DataProcessing consists of a unique identification per database column (parameter identifier) and
an anonymous function (parameter processor).
The anonymous function is called with a parameter context of the type TargetDataDecoderContext.
The TargetDataDecoderContext can be used to obtain various information about the data set to be processed:

  • $context->getRawData(): The original data of the database column.
  • $context->getDecodedData(): The data of the database column after decoding
  • $context->getDatebaseRow(): Contains the original data of all database columns of the row being processed

With the method setDecodedData() manually processed data can be passed to pseudify.
This manually processed data is then searched by the analysis.

In our example, we use the value of the column log_type to determine whether the data needs to be further decoded using base64.
If the value of log_type is not foo, nothing further is processed by the return statement.
If the value of log_type is foo, the data is decoded by the Base64Encoder() and written back to pseudify by the setDecodedData() method.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 --------- --------------------- --------------- ----------------------- 
  Table     column                data decoders   data collectors        
 --------- --------------------- --------------- ----------------------- 
  wh_log    ip (string)           Scalar          default (scalar data)  
  wh_user   username (string)     Scalar          default (scalar data)  
  wh_user   password (string)     Scalar          default (scalar data)  
  wh_user   first_name (string)   Scalar          default (scalar data)  
  wh_user   last_name (string)    Scalar          default (scalar data)  
  wh_user   email (string)        Scalar          default (scalar data)  
  wh_user   city (string)         Scalar          default (scalar data)  
 --------- --------------------- --------------- ----------------------- 

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------------- 
  Table             column                     data decoders   special data decoders        
 ----------------- -------------------------- --------------- ----------------------------- 
  wh_log            log_data (blob)            Hex             decode conditional log data  
  wh_log            log_type (string)          Scalar          no further processing        
  wh_log            log_message (text)         Scalar          no further processing        
  wh_meta_data      meta_data (blob)           Scalar          no further processing        
  wh_user_session   session_data (blob)        Scalar          no further processing        
  wh_user_session   session_data_json (text)   Scalar          no further processing        
 ----------------- -------------------------- --------------- -----------------------------

You will now see under Search data in this tables that the name Hex is listed under data decoders in the wh_log database column.
This signals to you that the data will first be decoded using the HexEncoder.
Under special data decoders the DataProcessing is listed with the identification decode conditional log data.
This signals to you that after decoding the data, it will also be processed using the specified DataProcessing.

Normalize JSON Data

If the data to be searched is in JSON format in the database,
it should be normalised to make it fully searchable by pseudify.
For example, UTF-8 characters are masked in JSON format, so for example an Ö in JSON format is masked by the string \u00d6.

Example dataset:

"{"oldRecord":{"bodytext":"<p>In 2023 sind folgende \u00d6ffentlichkeitsaktionen geplant:<\/p>"}}"

Assuming pseudify is to search for occurrences of the word Öffentlichkeitsaktionen, pseudify will not find this in the sample dataset due to the masking.
To normalise the JSON string and make it look like this:

"{"oldRecord":{"bodytext":"<p>In 2023 sind folgende Öffentlichkeitsaktionen geplant:</p>"}}"

the DataProcessing named normalisedJsonString() exists.
The addition of this DataProcessing by using

->addDataProcessing(dataProcessing: TargetDataDecoderPreset::normalizedJsonString())

to a database column containing JSON data structures, normalises the JSON string and makes it searchable for pseudify.

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Processor\Processing\Analyze\TargetDataDecoderPreset;
use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TargetTable;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            // ...
            ->addTargetTable(table: TargetTable::create(identifier: 'wh_log',
                columns: [
                    TargetColumn::create(identifier: 'log_message')->addDataProcessing(dataProcessing: TargetDataDecoderPreset::normalizedJsonString()),
                ]
            ))
        ;

        return $tableDefinition;
    }
}

Define non-scalar source data

Sometimes it is necessary to define data from complex data structures as source data.
As an example, we want to use data from the database column session_data_json of the table wh_user_session to use as source data.
session_data_json contains a string in JSON format. In this there is a property called data consisting of an array with the property last_ip which we want to use as source data.

{"data": {"last_ip":"107.66.23.195"}}

You can pass the method SourceColumn::create() with the parameter dataType a name of a built-in decoder.

Note

As described in "Search multiple encoded data", the ChainedEncoder can also be used here to decode multiple-encoded data.

The method addDataProcessing() can now be used to define which data is to be extracted from the decoded data structure in order to use it as source data.
The DataProcessings are executed after the decoding of the data.
Any number of DataProcessings can be defined, which are processed one after the other.

A DataProcessing consists of a unique identification per database column (parameter identifier) and an anonymous function (parameter processor).
The anonymous function is called with a parameter context of the type SourceDataCollectorContext.
The SourceDataCollectorContext can be used to obtain various information about the data set to be processed:

  • $context->getRawData(): The original data of the database column.
  • $context->getDecodedData(): The data of the database column after decoding
  • $context->getDatebaseRow(): Contains the original data of all database columns of the row being processed

The addCollectedData() method can be used to pass the extracted data to pseudify as source data.
The method addCollectedData() can be used any number of times to pass any number of source data to pseudify.
The method addCollectedData() can be passed either a string or a one-dimensional array. If an array is passed, all scalar data in it is extracted and passed to pseudify as source data.

Info

If no DataProcessing is defined, the standard DataProcessing SourceDataCollectorPreset::scalarData() is automatically used.
This only collects the data from a database column if the content contains more than 2 characters.

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Processor\Processing\Analyze\SourceDataCollectorContext;
use Waldhacker\Pseudify\Core\Processor\Processing\DataProcessing;
use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\SourceColumn;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_user_session', columns: [
                SourceColumn::create(identifier: 'session_data_json', dataType: SourceColumn::DATA_TYPE_JSON)
                    ->addDataProcessing(dataProcessing: new DataProcessing(identifier: 'extract ip address',
                        processor: function (SourceDataCollectorContext $context): void {
                            $data = $context->getDecodedData();
                            $context->addCollectedData(data: $data['data']['last_ip']);
                        }
                    )),
            ])
        ;

        return $tableDefinition;
    }
}

You will now see under Collect search data from these tables that the name Json is listed under data decoders in the database column session_data_json.
This signals to you that the data will be decoded using the JsonEncoder.
Under data collectors the DataProcessing is listed with the identification extract ip address.
This signals to you that after decoding the data, it will also be collected using the specified DataProcessing.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 ----------------- -------------------------- --------------- -------------------- 
  Table             column                     data decoders   data collectors     
 ----------------- -------------------------- --------------- -------------------- 
  wh_user_session   session_data_json (text)   Json            extract ip address  
 ----------------- -------------------------- --------------- -------------------- 

Search data in this tables
--------------------------

 ----------------- --------------------- --------------- ----------------------- 
  Table             column                data decoders   special data decoders  
 ----------------- --------------------- --------------- ----------------------- 
  wh_log            id (integer)          Scalar          no further processing  
  wh_log            log_type (string)     Scalar          no further processing  
  wh_log            log_data (blob)       Scalar          no further processing  
  wh_log            log_message (text)    Scalar          no further processing  
  wh_log            ip (string)           Scalar          no further processing  
  wh_meta_data      id (integer)          Scalar          no further processing  
  wh_meta_data      meta_data (blob)      Scalar          no further processing  
  wh_user           id (integer)          Scalar          no further processing  
  wh_user           username (string)     Scalar          no further processing  
  wh_user           password (string)     Scalar          no further processing  
  wh_user           first_name (string)   Scalar          no further processing  
  wh_user           last_name (string)    Scalar          no further processing  
  wh_user           email (string)        Scalar          no further processing  
  wh_user           city (string)         Scalar          no further processing  
  wh_user_session   id (integer)          Scalar          no further processing  
  wh_user_session   session_data (blob)   Scalar          no further processing  
 ----------------- --------------------- --------------- -----------------------

Define custom source data

It is possible to define user-defined source data that does not refer to database columns.
With the method addSourceString() strings can be defined as source data.

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            ->addSourceTable(table: 'wh_log', columns: [
                'ip',
            ])
            ->addSourceTable(table: 'wh_user', columns: [
                'username',
                'password',
                'first_name',
                'last_name',
                'email',
                'city',
            ])

            ->addSourceString(string: 'example.com')
            ->addSourceString(string: 'regex:(?:[0-9]{1,3}\.){3}[0-9]{1,3}')

            // ...
        ;

        return $tableDefinition;
    }
}

You will now see the user-defined strings that are searched for in the database under 'Search for this strings'.
As an alternative to static values, it is possible to use regular expressions for the search.
A regular expression must be identified by the prefix regex: and follow the PCRE regex syntax.
For example, regex:(?:[0-9]{1,3}\.){3}[0-9]{1,3} can be used to search for IPv4 addresses.

$ pseudify pseudify:debug:analyze test-profile

Analyzer profile "test-profile"
===============================

Basis configuration
-------------------

 ----------------------------------------------- ------- 
  Key                                             Value  
 ----------------------------------------------- ------- 
  Shown characters before and after the finding   10     
 ----------------------------------------------- ------- 

Collect search data from this tables
------------------------------------

 --------- --------------------- --------------- ----------------------- 
  Table     column                data decoders   data collectors        
 --------- --------------------- --------------- ----------------------- 
  wh_log    ip (string)           Scalar          default (scalar data)  
  wh_user   username (string)     Scalar          default (scalar data)  
  wh_user   password (string)     Scalar          default (scalar data)  
  wh_user   first_name (string)   Scalar          default (scalar data)  
  wh_user   last_name (string)    Scalar          default (scalar data)  
  wh_user   email (string)        Scalar          default (scalar data)  
  wh_user   city (string)         Scalar          default (scalar data)  
 --------- --------------------- --------------- ----------------------- 

Search for this strings
-----------------------

 ------------------------------------- 
  String                               
 ------------------------------------- 
  example.com                          
  regex:(?:[0-9]{1,3}\.){3}[0-9]{1,3}  
 -------------------------------------

Search data in this tables
--------------------------

 ----------------- -------------------------- --------------- ----------------------- 
  Table             column                     data decoders   special data decoders  
 ----------------- -------------------------- --------------- ----------------------- 
  wh_log            log_type (string)          Scalar          no further processing  
  wh_log            log_data (blob)            Scalar          no further processing  
  wh_log            log_message (text)         Scalar          no further processing  
  wh_meta_data      meta_data (blob)           Scalar          no further processing  
  wh_user_session   session_data (blob)        Scalar          no further processing  
  wh_user_session   session_data_json (text)   Scalar          no further processing  
 ----------------- -------------------------- --------------- -----------------------

Execute an "Analyze Profile"

An "Analyze Profile" can be executed with the command pseudify:analyse <profile-name>.

$ pseudify pseudify:analyze test-profile

 1224/1224 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% < 1 sec/< 1 sec 4.0 MiB

summary
=======

 ----------------------------------- ---------------------------------------------------------------------------------------------- ------------------------------ 
  source                              data                                                                                           seems to be in                
 ----------------------------------- ---------------------------------------------------------------------------------------------- ------------------------------ 
  __custom__.__custom__               132.188.241.155                                                                                wh_log.ip                     
  __custom__.__custom__               155.215.67.191                                                                                 wh_log.ip                     
  __custom__.__custom__               243.202.241.67                                                                                 wh_log.ip                     
  __custom__.__custom__               132.188.241.155                                                                                wh_log.log_data               
  __custom__.__custom__               155.215.67.191                                                                                 wh_log.log_data               
  __custom__.__custom__               243.202.241.67                                                                                 wh_log.log_data               
  __custom__.__custom__               example.com                                                                                    wh_log.log_data               
  __custom__.__custom__               example.com                                                                                    wh_log.log_message            
  __custom__.__custom__               139.81.0.139                                                                                   wh_meta_data.meta_data        
  __custom__.__custom__               187.135.239.239                                                                                wh_meta_data.meta_data        
  __custom__.__custom__               197.110.248.18                                                                                 wh_meta_data.meta_data        
  __custom__.__custom__               20.1.58.149                                                                                    wh_meta_data.meta_data        
  __custom__.__custom__               239.27.57.12                                                                                   wh_meta_data.meta_data        
  __custom__.__custom__               244.166.32.78                                                                                  wh_meta_data.meta_data        
  __custom__.__custom__               83.243.216.115                                                                                 wh_meta_data.meta_data        
  __custom__.__custom__               example.com                                                                                    wh_meta_data.meta_data        
  __custom__.__custom__               107.66.23.195                                                                                  wh_user_session.session_data  
  __custom__.__custom__               197.110.248.18                                                                                 wh_user_session.session_data  
  __custom__.__custom__               244.166.32.78                                                                                  wh_user_session.session_data  
  wh_user.city                        Dorothyfort                                                                                    wh_meta_data.meta_data        
  wh_user.city                        North Elenamouth                                                                               wh_meta_data.meta_data        
  wh_user.city                        South Wilfordland                                                                              wh_meta_data.meta_data        
  wh_user.email                       mcclure.ofelia@example.com                                                                     wh_log.log_data               
  wh_user.email                       mcclure.ofelia@example.com                                                                     wh_log.log_message            
  wh_user.email                       cassin.bernadette@example.net                                                                  wh_meta_data.meta_data        
  wh_user.email                       conn.abigale@example.net                                                                       wh_meta_data.meta_data        
  wh_user.email                       mcclure.ofelia@example.com                                                                     wh_meta_data.meta_data        
  wh_user.first_name                  Donato                                                                                         wh_meta_data.meta_data        
  wh_user.first_name                  Maybell                                                                                        wh_meta_data.meta_data        
  wh_user.first_name                  Mckayla                                                                                        wh_meta_data.meta_data        
  wh_user.last_name                   Keeling                                                                                        wh_log.log_data               
  wh_user.last_name                   Anderson                                                                                       wh_meta_data.meta_data        
  wh_user.last_name                   Keeling                                                                                        wh_meta_data.meta_data        
  wh_user.last_name                   Stoltenberg                                                                                    wh_meta_data.meta_data        
  wh_user.password                    $argon2i$v=19$m=8,t=1,p=1$QXNXbTRMZWxmenBRUzdwZQ$i6hntUDLa3ZFqmCG4FM0iPrpMp6d4D8XfrNBtyDmV9U   wh_meta_data.meta_data        
  wh_user.password                    $argon2i$v=19$m=8,t=1,p=1$SUJJeWZGSGEwS2h2TEw5Ug$kCQm4/5DqnjXc/3SiXwimtTBvbDO9H0Ru1f5hkQvE/Q   wh_meta_data.meta_data        
  wh_user.password                    $argon2i$v=19$m=8,t=1,p=1$ZldmOWd2TDJRb3FTNVpGNA$ORIwp6yekRx02mqM4WCTVhllgXpUpuFJZ1MmbYwAMXs   wh_meta_data.meta_data        
  wh_user.username                    georgiana59                                                                                    wh_log.log_data               
  wh_user.username                    georgiana59                                                                                    wh_log.log_message            
  wh_user.username                    georgiana59                                                                                    wh_meta_data.meta_data        
  wh_user.username                    howell.damien                                                                                  wh_meta_data.meta_data        
  wh_user.username                    hpagac                                                                                         wh_meta_data.meta_data        
  wh_user_session.session_data_json   1321:57fc:460b:d4d0:d83f:c200:4b:f1c8                                                          wh_log.ip                     
  wh_user_session.session_data_json   4fb:1447:defb:9d47:a2e0:a36a:10d3:fd98                                                         wh_log.ip                     
  wh_user_session.session_data_json   1321:57fc:460b:d4d0:d83f:c200:4b:f1c8                                                          wh_log.log_data               
  wh_user_session.session_data_json   4fb:1447:defb:9d47:a2e0:a36a:10d3:fd98                                                         wh_log.log_data               
  wh_user_session.session_data_json   1321:57fc:460b:d4d0:d83f:c200:4b:f1c8                                                          wh_meta_data.meta_data        
  wh_user_session.session_data_json   197.110.248.18                                                                                 wh_meta_data.meta_data        
  wh_user_session.session_data_json   244.166.32.78                                                                                  wh_meta_data.meta_data        
  wh_user_session.session_data_json   107.66.23.195                                                                                  wh_user_session.session_data  
  wh_user_session.session_data_json   1321:57fc:460b:d4d0:d83f:c200:4b:f1c8                                                          wh_user_session.session_data  
  wh_user_session.session_data_json   197.110.248.18                                                                                 wh_user_session.session_data  
  wh_user_session.session_data_json   244.166.32.78                                                                                  wh_user_session.session_data  
  wh_user_session.session_data_json   4fb:1447:defb:9d47:a2e0:a36a:10d3:fd98                                                         wh_user_session.session_data  
 ----------------------------------- ---------------------------------------------------------------------------------------------- ------------------------------

Note

Depending on the size of the database, the analysis can be finished after seconds or only after hours.
Since analyses are usually only carried out infrequently, e.g. to model pseudonymisation with the collected information, we have decided that a somewhat longer runtime of an analysis is justifiable.

The first line of the analysis indicates how many data have already been analysed and how many are analysed in total (1148/1148).
This is followed by a progress bar and a percentage indication of the progress.
After that, the runtime and the estimated total time of the analysis are output.
Finally, the maximum memory consumption so far is output.

The summary of the analysis finally lists which source data (column data) from which source database column (column source) can be found in which database columns (column seems to be in).
If there is a __custom__.__custom__ in the source column, this means that the source data does not come from a database column, but was defined using addSourceString().
If you were not previously aware that certain source data can be found in a database column under seems to be in, then you can now take a closer look at these database columns and include them in the modelling of the pseudonymisation.

Info

If there are many database tables and columns, the output of the analysis can become very long and may not fit into the buffer of your terminal.
In this case, it is worth writing the output to a file.

pseudify --no-ansi pseudify:debug:analyze test-profile > analysis.log

Output extended information

For debugging or refining the analysis profile, it may be useful to see what data pseudify found in the database data.
To do this, the command pseudify:analyse can be called with the parameter --verbose:

pseudify pseudify:analyze <profil-name> --verbose

Now the source data is listed (wh_log.ip (1321:57fc:460b:d4d0:d83f:c200:4b:f1c8)) and the location (wh_meta_data.meta_data (...ip";s:37:"1321:57fc:460b:d4d0:d83f:c200:4b:f1c8";}";}s:4:...))

The number of characters that are output before and after the location can be defined with the setTargetDataFrameCuttingLength() method.
By default, 10 characters are output before and after a target.
If the value is set to 0, nothing is cut off before and after the target and the complete database content is output.

<?php

namespace Waldhacker\Pseudify\Profiles;

use Waldhacker\Pseudify\Core\Profile\Analyze\ProfileInterface;
use Waldhacker\Pseudify\Core\Profile\Model\Analyze\TableDefinition;

class TestAnalyzeProfile implements ProfileInterface
{
    public function getIdentifier(): string
    {
        return 'test-profile';
    }

    public function getTableDefinition(): TableDefinition
    {
        $tableDefinition = new TableDefinition(identifier: $this->getIdentifier());

        $tableDefinition
            // ...
            ->setTargetDataFrameCuttingLength(length: 15);

        return $tableDefinition;
    }
}