dlp regexp behaves different on pre-made

Hello,

==Summary==
We are experiencing different DLP behavior for complex RegEx between two installations.

==System==
Version: ciphermail-virtual-appliance-2.10.0-3.
  1. Ubuntu pre-made virtual appliance (on my laptop)
  2. Red Hat & CentOS gateway package (on a test server)

==Configuration==
DLP: several triggers with "Must Encrypt"
Settings: Encrypt Mode "No Encryption"
Settings: DLP Patterns added

==Example==
We want to search a message for [any text][four numbers][any text]
So we try this RegEx: *.\d{4}.*

This works perfectly on the Ubuntu VA, but it encrypts EVERY message on CentOS.
Everything is back to normal when we disable the complex RegEx on CentOS.

We also tried something a little simpler, like: [0-9][0-9][0-9][0-9]
Ubuntu version is fine, CentOS version encrypts every message.

==DLP Trigger Comparison==
Ubuntu version:
  - Single words work as expected
  - Mail header works as expected
  - Complex *.\d{4}.* works as expected

CentOS version:
  - Single words work as expected
  - Mail header works as expected
  - Complex *.\d{4}.* works DIFFERENTLY

Does anyone have experience with this situation?

Is our installation perhaps incorrect?

It's quite likely that a message contains 4 digits. Could it be that the
mail sent via the CentOS gateway is sent with some other mail app than
the mail sent via the virtual appliance?

We will look at this tomorrow, but I'm quite sure it is a default
installation as described in the CipherMail guide.

The DLP text extractor also extracts header values. So for example a
date header will also be extracted. Since almost all mails contain a
date header, almost any mail will contain 4 digits.
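
As a rough illustration of the point above (a sketch in Python, whose `re` module only stands in for the gateway's own regex engine; the sample header value is made up): a typical Date header already carries a four-digit year, so an unanchored four-digit pattern matches it.

```python
import re

# Hypothetical header value; a standard Date header includes a 4-digit year.
date_header = "Thu, 03 Dec 2015 12:24:00 +0100"

four_digits = re.compile(r"\d{4}")

# search() looks for the pattern anywhere in the string.
match = four_digits.search(date_header)
print(match.group(0) if match else None)
```

The first run of four digits found here is the year; the time zone offset would match as well.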

That's true. The original pattern is 8 digits (to simulate a Dutch personal ID)
but I get the point. What I don't understand (yet) is why it works on the
Ubuntu version when my testing method and messages are the same on Ubuntu
and CentOS.

If you have the "raw" MIME content, you can see what text the DLP
engine sees during scanning by uploading the MIME message to the "extract
text" tool (Admin -> other -> extract text). The "extract text" tool
will return the normalized text.
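
If the raw MIME files are at hand, a quick offline check can hint at which headers contain a digit run before uploading anything. This is only a sketch under assumptions: it uses Python's standard `email` parser, the helper name `headers_matching` and the sample message are made up, and it does not reproduce the gateway's text normalization, so the "extract text" tool remains the authoritative view.

```python
import re
from email import message_from_string


def headers_matching(raw_mime, pattern):
    """Return (header name, value) pairs whose value contains a match."""
    msg = message_from_string(raw_mime)
    regex = re.compile(pattern)
    return [(name, value) for name, value in msg.items() if regex.search(value)]


# Made-up sample message, not taken from the thread.
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Thu, 03 Dec 2015 12:24:00 +0100\n"
    "Message-ID: <123456789@example.com>\n"
    "Subject: test\n"
    "\n"
    "Hello\n"
)

hits = headers_matching(sample, r"\d{8}")
for name, value in hits:
    print(name, "->", value)
```

Running the same check on a message from each installation would show whether one mail client or server adds a header with a long digit run that the other does not.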

So we try this RegEx: *.\d{4}.*

If you want to trigger on 4 digits, you should use \d{4}, i.e., skip
the .* part. The .* is not needed and will make scanning slower. The reg
exp is not required to match the complete text, i.e., .* is in effect
implicitly added to any reg exp.
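
A small sketch of that behaviour, using Python's `re.search` as a stand-in for the gateway's "find anywhere" matching (an assumption about the engine based on the description above; the sample text is made up):

```python
import re

text = "Invoice 1234 attached"

# Unanchored search: the pattern may match anywhere in the text.
bare = re.search(r"\d{4}", text)

# Wrapping the pattern in .* flags the same text, but the engine has to
# consume and backtrack over the surrounding characters as well.
wrapped = re.search(r".*\d{4}.*", text)

print(bool(bare), bool(wrapped))
print(bare.group(0))
print(wrapped.group(0))
```

Both patterns trigger on the same text; the only difference is that the wrapped version drags the whole line into the match.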

BTW I made a typo in the mail, it was of course .* but I think you
saw that :) Okay, we will test without the wildcards. But when
I used 9 times [0-9] without wildcards, all messages were encrypted.
But again, that could be because there are 9 digits in the headers...
Hmm, anyway, thanks for your support, we will try some more tests.

Kind regards,

CipherMail support


Cheers,

Raymond Bakker | Integration Consultant

T +31 (0)10 288 1600
M +31 (0)6 2222 5515
E raymond.bakker(a)vanadgroup.com

VANAD Enovation
Rivium Westlaan 1
2909 LD Capelle aan den IJssel
The Netherlands


Are you sure that the messages sent via the Ubuntu version are exactly
the same as the messages sent via the CentOS version? If, for example, the
message sent via the Ubuntu system is sent by Zimbra but the message
sent via the CentOS version is sent via Exchange, then it's kind of
comparing apples and oranges. It might be that one mail client (server?)
adds certain headers with 8 digits and the other mail client (server?) does not.

Kind regards,

CipherMail support

--
CipherMail email encryption

Email encryption with support for S/MIME, OpenPGP, PDF encryption and
secure webmail pull.

Twitter: http://twitter.com/CipherMail

To make false positives less likely, it might help if you
require that the number of digits is exactly 8 for a match. Because
with your original reg exp, digit sequences of 8 or more would trigger.

The following reg exp only triggers on digit sequences of exactly 8 digits:

\b\d{8}\b

Note: \b matches a word boundary.
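
To see the difference, here is a minimal sketch in Python (its `re` module handles \b and \d{8} the same way for these plain ASCII samples, but it is only a stand-in for the gateway's engine; the sample strings are made up):

```python
import re

loose = re.compile(r"\d{8}")      # 8 digits anywhere, also inside longer runs
exact = re.compile(r"\b\d{8}\b")  # a run of exactly 8 digits between boundaries

samples = ["id 12345678 ok", "ref 1234567890", "short 1234567"]

results = {s: (bool(loose.search(s)), bool(exact.search(s))) for s in samples}
for s, (loose_hit, exact_hit) in results.items():
    print(f"{s!r}: loose={loose_hit} exact={exact_hit}")
```

Keep in mind that letters and underscores also count as word characters, so 8 digits glued directly to a letter (for example "abc12345678") would not match the \b version.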

Kind regards,

CipherMail support
