Preprocessing of Data

This feature ensures that sensitive data is processed on the client side before being transmitted to the data lake, protecting it from exposure to unauthorized external entities.

For the best Machine Learning (ML) results in automatic agent process identification, keep this feature turned off (default setting). This allows the ML to analyze more data, resulting in better automatic process analytics. Enable this feature only if you must hash or mask the data in the data lake database.

On your client system, go to %appdata%\Nice_Systems\CXDiscovery and open the CXDClientConfig.json file.
To enable this feature, add the following configuration section under DataCollection and set the PreProcessing "enabled" property to true.

"PreProcessing": {
    "enabled": "true",
    "emailToken": "_email_",
    "phoneNumberToken": "_phone_",
    "dateTimeToken": "_datetime_",
    "IPToken": "_ip_",
    "FirstNameToken": "_firstname_",
    "SurnameToken": "_surname_",
    "AirportToken": "_airport_",
    "CityToken": "_city_",
    "CountryToken": "_country_",
    "alphanumericToken": "_alphanumeric_",
    "numericToken": "_num_",
    "customCharacterTokens": "",
    "maxKeepingInvalidMessageDays": "30"
}
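If you prefer to script this change rather than edit the file by hand, the following is a minimal Python sketch. It assumes that CXDClientConfig.json is plain JSON and that DataCollection is a top-level object in it; the function names enable_preprocessing and update_config_file are hypothetical and not part of the product.

```python
import copy
import json
from pathlib import Path


def enable_preprocessing(config: dict, **overrides: str) -> dict:
    """Return a copy of `config` with DataCollection.PreProcessing enabled."""
    updated = copy.deepcopy(config)
    # Assumption: DataCollection is a top-level object in CXDClientConfig.json.
    data_collection = updated.setdefault("DataCollection", {})
    section = data_collection.setdefault("PreProcessing", {})
    section["enabled"] = "true"  # string value, matching the sample section
    section.update(overrides)    # optional token overrides, e.g. emailToken
    return updated


def update_config_file(path: Path) -> None:
    """Read the JSON file, enable preprocessing, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(
        json.dumps(enable_preprocessing(config), indent=2), encoding="utf-8"
    )
```

Back up the original file before calling update_config_file on it, and check the result against your own installation, since the surrounding file layout is assumed here.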

For this feature to function properly, ensure that the DataCollection property is enabled in the CXDClientConfig.json file. All attributes are optional and use their default values if not specified.

Each attribute allows you to customize how sensitive or structured data is represented or managed in your client configuration.

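For example (the token names shown here, such as _name_ and _phone_number_, are illustrative; actual output uses the tokens configured in the PreProcessing section):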

Input text: John Doe, whose email is john.doe@example.com and phone number is +1-555-1234, lives in New York, USA.

After generalization:

Input text: _name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_.
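To make the generalization step more concrete, here is a minimal Python sketch that replaces email addresses and phone numbers with tokens. The regular expressions and the generalize function are assumptions chosen for illustration; this document does not describe the detection rules the client actually uses.

```python
import re

# Hypothetical patterns, for illustration only; the product's rules may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\-\s]{5,}\d")


def generalize(text: str, email_token: str = "_email_",
               phone_token: str = "_phone_") -> str:
    """Replace email addresses and phone numbers with the given tokens."""
    text = EMAIL_RE.sub(email_token, text)
    text = PHONE_RE.sub(phone_token, text)
    return text
```

The token arguments default to the emailToken and phoneNumberToken defaults; pass different values to mirror a customized configuration.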
Masking the data: All data except the tokens specified above is masked. For example:

Before Masking:

Input text: _name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_.

After Masking:

Masked Input text: _name_, CCCC CCCCC CC _email_ CCC CCCCC CCCCC CC _phone_number_, CCCCC CC _city_, _country_
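The masking idea can be sketched in Python as follows. This sketch assumes that every letter outside a token is replaced by the character C and that digits, spaces, and punctuation are left alone; the exact masking rules are not specified in this document, so the output of the real client may differ. The mask_text function and its parameters are hypothetical.

```python
import re


def mask_text(text: str, tokens: list[str], mask_char: str = "C") -> str:
    """Mask every letter outside the given tokens with `mask_char`."""
    def mask(segment: str) -> str:
        # Assumption: only letters are masked; other characters are kept.
        return re.sub(r"[A-Za-z]", mask_char, segment)

    if not tokens:
        return mask(text)
    # Try longer tokens first so that overlapping tokens match fully.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(mask(text[last:match.start()]))
        parts.append(match.group())  # keep the token unchanged
        last = match.end()
    parts.append(mask(text[last:]))
    return "".join(parts)
```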
Hashing the data: All data except the tokens specified above is hashed and assigned to the following fields:
- controlIdentifierHashed
- textHashed
- processTitleHashed
For example:

Before Hashing:

_name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_

After Hashing:

_name_d883601c7ec91e2457a40e870755151d96019c7f60e1c7de8efec2a0bbd0db53_email_e1e4485e020ef85beab9c356edcf1846d40d2d12b5ad14f8e502eae357a6ce3a_phone_number_98698d73a9b09f7c6fe3cda57f1034f9be5c630765cba3df38579262275b26a0_city_315f5bdb76d078c43b8ac0064e4a01646a5f0b8f9e4e0fbb2a6a6b9e826dd4b8_country_
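The structure of the hashed output (tokens kept, text between them replaced by a digest) can be sketched in Python as follows. The use of SHA-256 over UTF-8 text is an assumption based only on the 64-character hexadecimal digests in the example; the actual algorithm, encoding, and any salt are not stated in this document, so this sketch is not expected to reproduce the digests shown above. The hash_text function is hypothetical.

```python
import hashlib
import re


def hash_text(text: str, tokens: list[str]) -> str:
    """Hash every non-empty segment between tokens and keep the tokens."""
    def digest(segment: str) -> str:
        # Assumption: unsalted SHA-256 over the UTF-8 bytes of the segment.
        return hashlib.sha256(segment.encode("utf-8")).hexdigest()

    if not tokens:
        return digest(text) if text else ""
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    parts, last = [], 0
    for match in pattern.finditer(text):
        segment = text[last:match.start()]
        if segment:
            parts.append(digest(segment))
        parts.append(match.group())  # tokens are not hashed
        last = match.end()
    tail = text[last:]
    if tail:
        parts.append(digest(tail))
    return "".join(parts)
```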

The table below describes the configuration attributes.

Attribute	Description
enabled	This attribute determines whether the preprocessing feature is enabled or disabled. By default, it is set to false (disabled). When enabled, sensitive data of the types listed below is replaced with the configured tokens in processTitle, url, textMasked, and controlIdentifier.
emailToken	This is the placeholder for email addresses in messages. Default value: _email_
phoneNumberToken	This is the placeholder for phone numbers in messages. Default value: _phone_
dateTimeToken	This is the placeholder for date and time values. Default value: _datetime_
IPToken	This is the placeholder for IP addresses. Default value: _ip_
FirstNameToken	This is the placeholder for first names in messages. Default value: _firstname_
SurnameToken	This is the placeholder for surnames (last names) in messages. Default value: _surname_
AirportToken	This is the placeholder for airport names or codes. Default value: _airport_
CityToken	This is the placeholder for city names. Default value: _city_
CountryToken	This is the placeholder for country names. Default value: _country_
alphanumericToken	This is the placeholder for alphanumeric strings. Default value: _alphanumeric_
numericToken	This is the placeholder for numeric values. Default value: _num_
customCharacterTokens	This specifies special characters you want to replace with an asterisk (*) during preprocessing. The default value is empty.
maxKeepingInvalidMessageDays	This indicates the maximum number of days to keep invalid messages. Default value: 30 (days)

Limitations

Name Recognition Scope: Only exact matches of names (including first names, surnames, airports, cities, and countries) in the database will be recognized and tokenized. Variations or misspellings will not be detected.
False Positives in Name Detection: Due to the high volume of names, some non-name words may be incorrectly identified as names.
Overlapping Name Categories: Certain names can belong to multiple categories (e.g., "Georgia" and "Chad" can be first names, surnames, or country names). Tokenization depends on the category in the database and follows this sequence: First name > Surname > Airport name > City name > Country name.
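
The precedence described above can be sketched in Python as a lookup that checks each category in order and stops at the first exact match. The in-memory database, its category keys, and the tokenize_name function are hypothetical stand-ins for the product's name database; whether matching is case-sensitive is also an assumption.

```python
# Categories in the documented order of precedence, with their default tokens.
CATEGORY_ORDER = [
    ("first_name", "_firstname_"),
    ("surname", "_surname_"),
    ("airport", "_airport_"),
    ("city", "_city_"),
    ("country", "_country_"),
]


def tokenize_name(word: str, database: dict[str, set[str]]) -> str:
    """Return the token of the first category that contains `word` exactly."""
    for category, token in CATEGORY_ORDER:
        # Exact match only: variations and misspellings are not detected.
        if word in database.get(category, set()):
            return token
    return word  # not in any category, so left unchanged
```

With a database in which "Georgia" is listed both as a first name and as a country, the lookup returns _firstname_, because first names are checked before countries.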