[JAMES-3784] Ease mail repository / event dead letter operation - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: master
Fix Version/s: 3.8.0
Component/s: mailbox, MailStore & MailRepository
Labels: None

Description

Mailing list thread

https://www.mail-archive.com/server-dev@james.apache.org/msg72012.html

Context

James does mostly 2 kinds of processing:

Mail processing: when a mail is received over SMTP, it is enqueued and then the mailet processing is executed. Mailets/matchers are called against it and a series of decisions can be made: store this mail in a user mailbox, forward it, bounce it, ignore it, etc.
Event processing: once actions are taken in a user mailbox, an event is emitted to the event bus. Listeners are then called to "decorate" features of the mailbox manager in a non-invasive way: ElasticSearch indexing, quota management, JMAP projections and state, IMAP/JMAP notifications...

Of course, each of these kinds of processing can fail, and error management is applied.

Regarding mail processing, a failed mail is stored in the /var/mail/error repository.

To detect incidents:

ERROR logs during processing
WebAdmin calls show a non-zero /var/mail/error repository size

To fix this incident:

Explicit admin action is required, and if needed a reprocessing can be attempted (webadmin)

Regarding event processing, listener execution is retried several times with a delay. If it keeps failing, the event is eventually stored in dead letter.
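The retry-then-dead-letter behaviour described here can be sketched as follows. This is only an illustration and not James code: the function name `deliver_with_retries`, its parameters, the fixed delay, and the in-memory `dead_letters` list are all assumptions made for the example.

```python
import time
from typing import Any, Callable, List, Tuple


def deliver_with_retries(
    listener: Callable[[Any], None],
    event: Any,
    dead_letters: List[Tuple[Any, str]],
    max_retries: int = 3,
    delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run listener(event); retry on failure, then park the event in dead letter.

    Returns True if the listener eventually succeeded, False otherwise.
    All names here are hypothetical and only illustrate the flow.
    """
    attempts = 1 + max_retries  # the first try plus the retries
    last_error = ""
    for attempt in range(attempts):
        try:
            listener(event)
            return True
        except Exception as error:  # a real system would narrow this
            last_error = repr(error)
            if attempt < attempts - 1:
                sleep(delay_seconds)  # wait before the next attempt
    # Every attempt failed: keep the event so it can be redelivered later.
    dead_letters.append((event, last_error))
    return False
```

An event that lands in `dead_letters` stays there until something redelivers it, which is exactly the step that requires an admin action today.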

To detect incidents:

ERROR logs during processing
WebAdmin reports a non-zero size for dead letter
A health check, which eventually emits a recurring WARNING log that cannot be missed

To fix this incident:

Explicit admin action is required, and if needed a redelivery can be attempted (webadmin)

Problem statement

Most users miss this critical part of error management in James.

Actions are never taken and problems pile up.

While major incidents with thousands of problems would understandably benefit from admin intervention, I would like small incidents to recover on their own without human intervention.

In practice, none of my clients (me included) managed to set up a reliable action plan regarding processing failures. Problems could be tackled months after they arose, needlessly escalating into major issues.

Proposed solution

Implement a healthcheck that verifies var/mail/error is empty
An upper bound on redelivery/reprocessing exposed by webadmin

The goal of this limit is to prevent unbounded processing that could consume unbounded resources. Auto-healing could be budgeted for (e.g. 10 mails/min).

A human intervention is still needed in some cases:

Massive outages that require a full redelivery/reprocessing
Bugs that cause recurring failure.

The goal is to have auto-healing in place, given those tasks are called with CRONs.

CRONs remove the need for extra James-based developments that would add complexity.
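The health check proposed above (verify that var/mail/error is empty) could be sketched as below. This is a minimal illustration, not the James health check API: `HealthResult`, `check_error_repository`, and the "healthy"/"degraded" status names are assumptions, and the repository size is passed in rather than read from a real repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthResult:
    status: str  # "healthy" or "degraded" (hypothetical status names)
    cause: str   # human-readable explanation, empty when healthy


def check_error_repository(size: int, path: str = "var/mail/error") -> HealthResult:
    """Report 'degraded' as soon as the error repository holds at least one mail."""
    if size < 0:
        raise ValueError("repository size cannot be negative")
    if size == 0:
        return HealthResult("healthy", "")
    return HealthResult("degraded", f"{size} mail(s) waiting in {path}")
```

A non-empty repository is reported as degraded rather than as a hard failure, on the assumption that a few failed mails should raise attention without marking the whole server as down.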

Proposed changes

Add a `limit` parameter to reprocessing/redelivery.

If specified, it limits the number of elements reprocessed/redelivered. If unspecified, the number of processed elements is unbounded (as is the case today).
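The intended semantics of `limit` can be illustrated with a small sketch. It is not the webadmin implementation: the `reprocess` function and its `handle` callback are hypothetical, and the items stand in for mails or events.

```python
from itertools import islice
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def reprocess(
    items: Iterable[T],
    handle: Callable[[T], None],
    limit: Optional[int] = None,
) -> int:
    """Handle at most `limit` items; with limit=None, handle all of them.

    Returns the number of items actually handled.
    """
    if limit is not None and limit < 0:
        raise ValueError("limit must be non-negative")
    count = 0
    # islice(items, None) places no bound on the iteration.
    for item in islice(items, limit):
        handle(item)
        count += 1
    return count
```

Calling `reprocess(mails, handle, limit=10)` touches at most ten items per run, while omitting `limit` keeps the current unbounded behaviour.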

Endpoints to modify:

We also need:

to update webadmin documentation accordingly.
to recommend a CRON of e.g. 10 redeliveries/reprocessings per minute in our operation guides. (https://james.apache.org/server/manage-guice-distributed-james.html + https://james.staged.apache.org/james-distributed-app/3.7.0/operate/guide.html)
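To get a feel for what such a budget means in practice, the time a bounded CRON needs to clear a backlog can be estimated as below. The function name and its parameters are hypothetical, and the estimate assumes every run succeeds and no new failures arrive in the meantime.

```python
def minutes_to_drain(backlog: int, per_run: int = 10, runs_per_minute: int = 1) -> int:
    """Minutes needed for a bounded CRON to work through `backlog` failed items."""
    if backlog < 0 or per_run <= 0 or runs_per_minute <= 0:
        raise ValueError("backlog must be >= 0 and rates must be > 0")
    per_minute = per_run * runs_per_minute
    # Ceiling division using integers only.
    return -(-backlog // per_minute)
```

With the suggested 10 items per minute, a small incident of 25 failed mails clears in 3 minutes, while a backlog of 5,000 would take 500 minutes, which is the kind of case where a manual, unbounded run remains the better tool.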

Issue Links

links to

GitHub Pull Request #1062

GitHub Pull Request #1122

People

Assignee: Unassigned

Reporter: Benoit Tellier

Votes: 0

Watchers: 1

Dates

Created: 22/Jun/22 04:53

Updated: 20/Aug/22 06:15

Resolved: 20/Aug/22 06:15

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2.5h