MPA spool queue filling up under high message load

rickyboone · January 30, 2024, 11:10pm

I have a CipherMail instance where, under high message throughput, the MPA spool queue gets backed up. As the documentation confirms, this in turn causes Postfix to queue up. Are there email processing parameters within CipherMail that influence how quickly messages are processed, such as the number of threads or the level of concurrency?

When observing the MPA spool queue process messages, the timing appeared to be fairly slow per message (spot checking looks to be around 15 seconds per message in the MPA log). Logs seem to indicate that there are multiple spool threads, but I don’t know if this is a parameter that can be tuned, or if it is based on something about the system (number of CPUs, etc.).

(Adding for additional context)
System resource usage is fairly low, running at only around 4-6% CPU. Memory usage seemed to increase slightly, but without causing heavy swapping. Low IO wait, low load averages, etc.

The high throughput situation is related to legitimate outbound email, clearly automated as part of a business process given the number of emails generated in a short amount of time. That said, I suspect the issue I observed is a symptom of an underlying configuration issue that could be causing all emails to be processed relatively slowly. Because the queuing issue had not come up before, it never got the attention it has now.

Not sure if anyone has observed this behavior or is aware of anything to look for.

martijn · January 31, 2024, 1:12pm

An email goes through a number of states (processors) while it is being handled. Which states it goes through depends on the settings. The MPA queue shows the current state for an email. The handling speed depends on a number of factors, like the size of the email, the IO speed of the storage device, the number of CPUs, etc.

The MPA log shows the flow of an email. It reports the states and the time the email entered each state. To find all log lines for a specific email, copy the MailID value (each incoming email gets a unique MailID) and filter on it. You can then see how long each state took.
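As a rough illustration of the filtering described above, here is a minimal Python sketch that collects the lines for one MailID and reports how long each state lasted. The log line layout, the regular expression and the function name `state_durations` are all assumptions made for this example. The real MPA log format is not shown in this thread, so the pattern would need adapting.

```python
import re
from datetime import datetime

# ASSUMPTION: hypothetical log line layout, NOT the real MPA log format, e.g.
#   "2024-01-31 14:30:05 MailID: abc123 state: incoming"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"MailID: (?P<mail_id>\S+).*?state: (?P<state>\S+)"
)


def state_durations(lines, mail_id):
    """Return [(state, seconds_in_state)] for one MailID, in log order.

    The last state seen has no following entry, so it gets no duration.
    """
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("mail_id") == mail_id:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group("state")))
    # Time spent in a state = gap until the next state was entered.
    return [
        (state, (next_ts - ts).total_seconds())
        for (ts, state), (next_ts, _) in zip(events, events[1:])
    ]
```

Feeding it the lines of the log file and a MailID gives a per-state breakdown, which should make it easier to see whether one particular state accounts for most of the total time.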

In a normal setup, handling should be quite fast. In order to investigate the issue, some additional information is required. For example if an external database is used, it might be that the connection to the database is slow.

What is the average email size?
How many emails are sent per second?
Are you using an external database?
How much memory does the system have?
Is the memory reserved or shared (ballooning)?

rickyboone · January 31, 2024, 2:37pm

The database was definitely my first concern, and yes, it is currently remote from the server. I’m not sure of any other stats that could help identify whether that is the root cause other than round-trip time, which is unfortunately a bit high due to the physical distance between the cluster and the gateway. Other metrics for the database server showed fairly low utilization while the issue was occurring.

Yep, the MPA MailID was what I was using to track the start and end time for total processing to determine average timing. I just grabbed a sample from right now, and the timing is similar at 15 seconds from incoming to finished.

Average mail size (related to the high throughput scenario) is around 15.5kB, though based on what I am seeing the timing doesn’t seem to change whether the messages are smaller or larger.
Unfortunately, by the time I was involved, the process had finished and I was stuck with what was in the queue. Based on the Postfix logs, it looks to be around 1 per second.
Yes, which is unfortunately a good distance away at around 25ms rtt.
4GB
For the gateway, not reserved, but not currently ballooning (fully allocated and not under contention). For the database server, fully reserved (8GB).

martijn · February 1, 2024, 7:15am

Can you connect to the remote database from the gateway’s command line with the mysql command? If so can you run a few random queries and report the result including the time it takes to run them?

Example queries:

use djigzo
select count(*) from cm_certificates;
select count(*) from cm_users;
select count(*) from cm_mail_repository;

rickyboone · February 2, 2024, 7:20pm

Using the mysql cli client from the gateway:

MariaDB [ciphermail_gwdb]> select count(*) from cm_certificates;
+----------+
| count(*) |
+----------+
|      148 |
+----------+
1 row in set (0.03 sec)

MariaDB [ciphermail_gwdb]> select count(*) from cm_users;
+----------+
| count(*) |
+----------+
|        4 |
+----------+
1 row in set (0.03 sec)

MariaDB [ciphermail_gwdb]> select count(*) from cm_mail_repository;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.03 sec)

The same, but directly on the MariaDB host:

MariaDB [ciphermail_gwdb]> select count(*) from cm_certificates;
+----------+
| count(*) |
+----------+
|      148 |
+----------+
1 row in set (0.000 sec)

MariaDB [ciphermail_gwdb]> select count(*) from cm_users;
+----------+
| count(*) |
+----------+
|        4 |
+----------+
1 row in set (0.001 sec)

MariaDB [ciphermail_gwdb]> select count(*) from cm_mail_repository;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.000 sec)
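To put the numbers from this thread side by side, here is a small back-of-the-envelope sketch in Python. It only asks how many strictly sequential database round trips would be needed to account for the observed 15 seconds per message, using the roughly 0.03 sec remote query time and the 25ms RTT reported above. It assumes that queries run one after another and that latency dominates. How many queries CipherMail actually issues per message is not known here, so this is a what-if estimate, not a diagnosis.

```python
def implied_round_trips(per_message_s, per_round_trip_s):
    """Sequential round trips needed to fill per_message_s seconds.

    ASSUMPTION: queries are strictly sequential and latency-bound.
    """
    return per_message_s / per_round_trip_s


per_message_s = 15.0     # observed processing time per message (from the MPA log)
remote_query_s = 0.03    # time per query from the gateway's mysql client
network_rtt_s = 0.025    # reported round-trip time to the database

print(round(implied_round_trips(per_message_s, remote_query_s)))  # 500
print(round(implied_round_trips(per_message_s, network_rtt_s)))   # 600
```

In other words, database latency alone would only explain the full 15 seconds if each message triggered on the order of 500 to 600 sequential round trips. Comparing that figure with the per-state timings in the MPA log may help show whether the remote database is a plausible cause.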
