Preprocessing of Data
This feature ensures that sensitive data is processed on the client side before being transmitted to the data lake, protecting it from exposure to unauthorized external entities.
For the best Machine Learning (ML) results in automatic agent process identification, keep this feature turned off (default setting). This allows the ML to analyze more data, resulting in better automatic process analytics. Enable this feature only if you must hash or mask the data in the data lake database.
-
On your client system, go to %appdata%\Nice_Systems\CXDiscovery and open the CXDClientConfig.json file.
-
To enable this feature, add the following configuration section under DataCollection in the CXDClientConfig.json file. Set the PreProcessing "enabled" property to true to enable preprocessing
"PreProcessing":
{
"enabled": "true",
"emailToken": "_email_",
"phoneNumberToken": "_phone_",
"dateTimeToken": "_datetime_",
"IPToken": "_ip_",
"FirstNameToken": "_firstname_",
"SurnameToken": "_surname_",
"AirportToken": "_airport_",
"CityToken": "_city_",
"CountryToken": "_country_",
"alphanumericToken": "_alphanumeric_",
"numericToken": "_num_",
"customCharacterTokens": "",
"maxKeepingInvalidMessageDays": "30"
}
For this feature to function properly, ensure that DataCollection property is enabled in the CXDClientConfig.json file. All attributes are optional and will use their default values if not specified.
Each attribute allows you to customize how sensitive or structured data is represented or managed in your client configuration.For Example:
Input text: John Doe, whose email is john.doe@example.com and phone number is +1-555-1234, lives in New York, USA.
After generalization:
Input text: _name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_.
-
Masking the data: All data except the above specified keywords will be masked. For example:
Before Masking:
Input text: _name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_.
After Masking:
Masked Input text: _name_, CCCC CCCCC CC _email_ CCC CCCCC CCCCC CC _phone_number_, CCCCC CC _city_, _country_
-
Hashing the data: All data except the above specified keywords will be hashed and assigned to these fields:
-
controlIdentifierHashed
-
textHashed
-
processTitleHashed
For example:
Before Hashing:
_name_, whose email is _email_ and phone number is _phone_number_, lives in _city_, _country_
After Hashing:
_name_d883601c7ec91e2457a40e870755151d96019c7f60e1c7de8efec2a0bbd0db53_email_e1e4485e020ef85beab9c356edcf1846d40d2d12b5ad14f8e502eae357a6ce3a_phone_number_98698d73a9b09f7c6fe3cda57f1034f9be5c630765cba3df38579262275b26a0_city_315f5bdb76d078c43b8ac0064e4a01646a5f0b8f9e4e0fbb2a6a6b9e826dd4b8_country_
-
The table below describes the configuration attributes.
Attribute |
Description |
---|---|
enabled |
This attribute determines whether the pre-processing feature is enabled or disabled. |
emailToken |
This is the placeholder for email addresses in messages. Default value: _email_ |
phoneNumberToken |
This is the placeholder for phone numbers in messages. Default value: _phone_ |
dateTimeToken |
This is the placeholder for date and time values. Default value: _datetime_ |
IPToken |
This is the placeholder for IP addresses. Default value: _ip_ |
FirstNameToken |
This is the placeholder for first names in messages. Default value: _firstname_ |
SurnameToken |
This is the placeholder for surnames (last names) in messages. Default value: _surname_ |
AirportToken |
This is the placeholder for airport names or codes. Default value: _airport_ |
CityToken |
This is the placeholder for city names. Default value: _city_ |
CountryToken |
This is the placeholder for country names. Default value: _country_ |
alphanumericToken |
This is the placeholder for alphanumeric strings. Default value: _alphanumeric_ |
numericToken |
This is the placeholder for numeric values. Default value: _num_ |
customCharacterTokens | This specifies special characters you want to replace with an asterisk (*) during preprocessing. The default value is empty. |
maxKeepingInvalidMessageDays |
This indicates the maximum number of days to keep invalid messages. Default value: 30 (days) |
Limitations
-
Name Recognition Scope: Only exact matches of names (including first names, surnames, airports, cities, and countries) in the database will be recognized and tokenized. Variations or misspellings will not be detected.
-
False Positives in Name Detection: Due to the high volume of names, some non-name words may be incorrectly identified as names.
-
Overlapping Name Categories: Certain names can belong to multiple categories (e.g., "Georgia" and "Chad" can be first names, surnames, or country names). Tokenization depends on the category in the database and follows this sequence: First name > Surname > Airport name > City name > Country name.