Preprocessing of Data

This feature ensures that sensitive data is processed on the client side before being transmitted to the data lake, protecting it from exposure to unauthorized external entities.

For the best Machine Learning (ML) results in automatic agent process identification, keep this feature turned off (the default setting). With preprocessing off, ML can analyze more data, which improves automatic process analytics. Enable this feature only if you must hash or mask the data in the data lake database.

  1. On your client system, go to %appdata%\Nice_Systems\CXDiscovery and open the CXDClientConfig.json file.

  2. To enable this feature, add the following configuration section under DataCollection in the CXDClientConfig.json file, with the PreProcessing "enabled" property set to true.

    "PreProcessing":
    {
      "enabled": "true",
      "emailToken": "_email_",
      "phoneNumberToken": "_phone_",
      "dateTimeToken": "_datetime_",
      "IPToken": "_ip_",
      "FirstNameToken": "_firstname_",
      "SurnameToken": "_surname_",
      "AirportToken": "_airport_",
      "CityToken": "_city_",
      "CountryToken": "_country_",
      "alphanumericToken": "_alphanumeric_",
      "numericToken": "_num_",
      "customCharacterTokens": "",
      "maxKeepingInvalidMessageDays": "30"
    }

    For this feature to function properly, ensure that the DataCollection property is enabled in the CXDClientConfig.json file. All attributes are optional; any attribute you omit uses its default value.

    The token attributes let you customize the placeholder that replaces each type of sensitive or structured data.

    For example:

    Input text: John Doe, whose email is john.doe@example.com and phone number is +1-555-1234, lives in New York, USA.

    After generalization:

    Input text: _name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_.

  3. Masking the data: All data except the tokens specified above is masked. For example:

    Before Masking:

    Input text: _name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_.

    After Masking:

    Masked Input text: _name_, CCCC CCCCC CC _email_ CCC CCCCC CCCCC CC _phone_number_, CCCCC CC _city_, _country_

  4. Hashing the data: All data except the tokens specified above is hashed and assigned to these fields:

    • controlIdentifierHashed

    • textHashed

    • processTitleHashed

    For example:

    Before Hashing:

    _name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_

    After Hashing:

    _name_d883601c7ec91e2457a40e870755151d96019c7f60e1c7de8efec2a0bbd0db53_email_e1e4485e020ef85beab9c356edcf1846d40d2d12b5ad14f8e502eae357a6ce3a_phone_number_98698d73a9b09f7c6fe3cda57f1034f9be5c630765cba3df38579262275b26a0_city_315f5bdb76d078c43b8ac0064e4a01646a5f0b8f9e4e0fbb2a6a6b9e826dd4b8_country_
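The masking and hashing steps above can be sketched in Python. This is an illustration only, not the product's implementation. The token list is copied from the worked example, and the helper names `split_on_tokens`, `mask`, and `hash_segments` are hypothetical. Two further assumptions: masking replaces each letter one for one with `C` and leaves spaces and punctuation unchanged, and hashing uses SHA-256 (the sample hashes above are 64 hexadecimal characters long, which is consistent with SHA-256 but does not confirm it). The product's actual output may differ.

```python
import hashlib
import re

# Hypothetical token set, copied from the worked example above.
TOKENS = ["_name_", "_email_", "_phone_number_", "_city_", "_country_"]

# Try longer tokens first so a long token is never split by a shorter one.
TOKEN_RE = re.compile(
    "(" + "|".join(re.escape(t) for t in sorted(TOKENS, key=len, reverse=True)) + ")"
)


def split_on_tokens(text):
    """Return (segment, is_token) pairs that cover the whole text in order."""
    parts = TOKEN_RE.split(text)
    # With one capturing group, re.split places the tokens at odd indexes.
    return [(part, index % 2 == 1) for index, part in enumerate(parts) if part]


def mask(text, mask_char="C"):
    """Replace every letter outside a token with mask_char (assumed rule)."""
    out = []
    for segment, is_token in split_on_tokens(text):
        out.append(segment if is_token else re.sub(r"[A-Za-z]", mask_char, segment))
    return "".join(out)


def hash_segments(text):
    """Replace every run of text between tokens with its SHA-256 hex digest (assumed hash)."""
    out = []
    for segment, is_token in split_on_tokens(text):
        if is_token:
            out.append(segment)
        else:
            out.append(hashlib.sha256(segment.encode("utf-8")).hexdigest())
    return "".join(out)


example = "_name_, whose email is _email_"
print(mask(example))           # _name_, CCCCC CCCCC CC _email_
print(hash_segments(example))  # _name_<64 hex characters>_email_
```

In both functions the tokens pass through untouched, so the generalized placeholders stay readable while the surrounding text is hidden.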

The table below describes the configuration attributes.

| Attribute | Description |
| --- | --- |
| enabled | This attribute determines whether the preprocessing feature is enabled or disabled. By default, it is set to false (disabled). When enabled, it replaces the sensitive data fields mentioned below in processTitle, url, textMasked, and controlIdentifier with the configured tokens. |
| emailToken | This is the placeholder for email addresses in messages. Default value: _email_ |
| phoneNumberToken | This is the placeholder for phone numbers in messages. Default value: _phone_ |
| dateTimeToken | This is the placeholder for date and time values. Default value: _datetime_ |
| IPToken | This is the placeholder for IP addresses. Default value: _ip_ |
| FirstNameToken | This is the placeholder for first names in messages. Default value: _firstname_ |
| SurnameToken | This is the placeholder for surnames (last names) in messages. Default value: _surname_ |
| AirportToken | This is the placeholder for airport names or codes. Default value: _airport_ |
| CityToken | This is the placeholder for city names. Default value: _city_ |
| CountryToken | This is the placeholder for country names. Default value: _country_ |
| alphanumericToken | This is the placeholder for alphanumeric strings. Default value: _alphanumeric_ |
| numericToken | This is the placeholder for numeric values. Default value: _num_ |
| customCharacterTokens | This specifies special characters you want to replace with an asterisk (*) during preprocessing. Default value: empty |
| maxKeepingInvalidMessageDays | This indicates the maximum number of days to keep invalid messages. Default value: 30 (days) |
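
Because every attribute is optional, the effective settings are the defaults from the table above overlaid with whatever the PreProcessing section specifies. The following Python sketch shows that overlay. It assumes that DataCollection is a top-level key of CXDClientConfig.json and that values are stored as strings, as in the sample configuration. The helper name `effective_preprocessing` is hypothetical, and the client itself may resolve defaults differently.

```python
import json

# Default values as listed in the attribute table.
DEFAULTS = {
    "enabled": "false",
    "emailToken": "_email_",
    "phoneNumberToken": "_phone_",
    "dateTimeToken": "_datetime_",
    "IPToken": "_ip_",
    "FirstNameToken": "_firstname_",
    "SurnameToken": "_surname_",
    "AirportToken": "_airport_",
    "CityToken": "_city_",
    "CountryToken": "_country_",
    "alphanumericToken": "_alphanumeric_",
    "numericToken": "_num_",
    "customCharacterTokens": "",
    "maxKeepingInvalidMessageDays": "30",
}


def effective_preprocessing(config_text):
    """Overlay the PreProcessing section of a config document on the defaults."""
    config = json.loads(config_text)
    # Assumption: PreProcessing is nested directly under a top-level DataCollection key.
    section = config.get("DataCollection", {}).get("PreProcessing", {})
    return {**DEFAULTS, **section}


sample = '{"DataCollection": {"PreProcessing": {"enabled": "true", "numericToken": "_n_"}}}'
settings = effective_preprocessing(sample)
print(settings["enabled"])       # true
print(settings["numericToken"])  # _n_
print(settings["emailToken"])    # _email_
```

Only the two attributes present in the sample override their defaults; every other attribute keeps the value from the table.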

Limitations

  • Name Recognition Scope: Only exact matches of names (including first names, surnames, airports, cities, and countries) in the database will be recognized and tokenized. Variations or misspellings will not be detected.

  • False Positives in Name Detection: Due to the high volume of names, some non-name words may be incorrectly identified as names.

  • Overlapping Name Categories: Certain names can belong to multiple categories (e.g., "Georgia" and "Chad" can be first names, surnames, or country names). Tokenization depends on the category in the database and follows this sequence: First name > Surname > Airport name > City name > Country name.
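
The exact-match and precedence rules in these limitations can be illustrated with a small lookup in Python. This is a sketch under stated assumptions: the name sets are tiny hypothetical stand-ins for the product's name database, matching is done per whitespace-separated word with a case-sensitive comparison, and the function names `token_for` and `tokenize` are invented for the example.

```python
# Categories in the documented precedence order, paired with their default tokens.
PRECEDENCE = [
    ("first_name", "_firstname_"),
    ("surname", "_surname_"),
    ("airport", "_airport_"),
    ("city", "_city_"),
    ("country", "_country_"),
]

# Hypothetical sample data; the real name database is not described in this document.
NAME_DB = {
    "first_name": {"Georgia", "Chad", "John"},
    "surname": {"Doe", "Chad"},
    "airport": {"Heathrow"},
    "city": {"Paris"},
    "country": {"Georgia", "Chad", "France"},
}


def token_for(word):
    """Return the token of the first category that contains the word, else None."""
    for category, token in PRECEDENCE:
        if word in NAME_DB[category]:
            return token
    return None


def tokenize(text):
    """Replace each recognized word with its token and keep all other words."""
    return " ".join(token_for(word) or word for word in text.split())


print(token_for("Georgia"))  # _firstname_
print(token_for("georgia"))  # None
print(tokenize("John Doe flew from Paris to France"))
# _firstname_ _surname_ flew from _city_ to _country_
```

With this sample data, "Georgia" is tokenized as a first name even though it is also listed as a country, and the lowercase variant "georgia" is not recognized at all.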