Package io.mats3
Class MatsInitiator.MatsMessageSendException
java.lang.Object
  java.lang.Throwable
    java.lang.Exception
      io.mats3.MatsInitiator.MatsMessageSendException
All Implemented Interfaces:
  Serializable
Enclosing interface:
  MatsInitiator
Will be thrown by the MatsInitiator.initiate(InitiateLambda) method if Mats fails to send the messages after the
MatsInitiator.InitiateLambda has been run, any external resource (typically a DB) has been committed, and then some
situation occurs that makes it impossible to send out the messages. (Some developers might recognize this as the
"VERY BAD!"-initiation situation.)
This is a rare but unfortunate situation which is hard to guard completely against, in particular in the
"Best Effort 1-Phase Commit" paradigm that the current Mats implementations run on. What it means is that if
you e.g. in the initiate-lambda did some "job allocation" logic on a table in a database, and based on that
allocation sent out e.g. 5 messages, the job allocation will now have happened, but the messages have
not actually been sent. The result is that in the database you will see those jobs as processed
(semantically "started processing"), but in reality the downstream endpoints never started working on them, since
the messages were never actually sent out.
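As a minimal sketch of catching this exception at the initiation point - the endpoint IDs and the JobDto are
hypothetical, and the fluent MatsInitiate calls (traceId/from/to/send) follow the MatsInitiator API - the
compensating action is left as a comment:

import io.mats3.MatsInitiator;
import io.mats3.MatsInitiator.MatsBackendException;
import io.mats3.MatsInitiator.MatsMessageSendException;

public class JobAllocationInitiator {
    // Hypothetical DTO for the outgoing message.
    static class JobDto {
        long jobId;
        JobDto(long jobId) { this.jobId = jobId; }
    }

    static void allocateAndSend(MatsInitiator matsInitiator) throws MatsBackendException {
        try {
            matsInitiator.initiate(init -> {
                // The "job allocation" database logic runs here, inside the same
                // transactional demarcation as the message sending.
                init.traceId("JobAllocation[" + System.currentTimeMillis() + "]")
                        .from("JobService.allocate")
                        .to("JobService.processJob")
                        .send(new JobDto(42L));
            });
        }
        catch (MatsMessageSendException e) {
            // "VERY BAD": the database commit went through, but the messages were
            // not sent. Attempt a compensating transaction, e.g. de-allocate the
            // jobs - which may itself fail if the infrastructure is down.
        }
    }
}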
This situation can to a degree be alleviated if you catch this exception, and then use a compensating
transaction to de-allocate the jobs in the database again. However, since bad things often happen in
clusters, you might not be able to do the de-allocation either (due to the database becoming inaccessible at the
same instant - e.g. the reason that the messages could not be sent was that the network cable became unplugged, or
that this node actually lost power at that instant). A way to at least detect when this happens is to employ a
state machine in the job allocation logic: First pick jobs for this node by setting the state column of
job-entries whose state is "UNPROCESSED" to some status like "ALLOCATED" (along with a column recording which node
allocated them (i.e. the "hostname" of this node) and a column for the timestamp of when they were allocated). In the
initiator, you pick the jobs that were allocated to this node, set the status to "SENT", and send the outgoing
messages. Finally, in the terminator endpoint (which you specify in the initiation), you set the status to
"DONE". Then you add a health check: Assuming that under normal conditions such jobs should always be processed in
seconds, you make a health check that scans the table for rows which have been in the "ALLOCATED" or "SENT"
status for e.g. 15 minutes: Such rows are very suspicious, and should be checked up on by humans. Sitting in
"ALLOCATED" status would imply that the node that allocated the job went down (and has not (yet) come back up)
before it managed to initiate the messages, while sitting in "SENT" would imply that the message flow had started, but
did not get through its processing: Either that message flow sits in a downstream Dead Letter Queue due to some
error, or you ended up in the situation explained here: The database commit went through, but the messages were
not sent.
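As a concrete illustration of the state machine outlined above, here is a rough JDBC sketch, assuming a
hypothetical "jobs" table with "status", "allocated_by" and "allocated_at" columns; the transitions are
UNPROCESSED -> ALLOCATED -> SENT -> DONE:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class JobStateMachine {
    // 1) Allocate unprocessed jobs to this node, recording node and timestamp.
    static void allocateJobs(Connection con, String thisNode) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE jobs SET status = 'ALLOCATED', allocated_by = ?,"
                        + " allocated_at = CURRENT_TIMESTAMP"
                        + " WHERE status = 'UNPROCESSED'")) {
            ps.setString(1, thisNode);
            ps.executeUpdate();
        }
    }

    // 2) In the initiation: flip this node's allocated jobs to SENT, and send the
    //    outgoing messages within the same initiation.
    static void markSent(Connection con, String thisNode) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE jobs SET status = 'SENT'"
                        + " WHERE status = 'ALLOCATED' AND allocated_by = ?")) {
            ps.setString(1, thisNode);
            ps.executeUpdate();
        }
    }

    // 3) In the terminator endpoint: mark the job as DONE.
    static void markDone(Connection con, long jobId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE jobs SET status = 'DONE' WHERE job_id = ?")) {
            ps.setLong(1, jobId);
            ps.executeUpdate();
        }
    }
}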
Please note that this should, in a somewhat stable operations environment, happen extremely seldom: What needs to
occur for this to happen is that in the sliver of time between the commit of the database and the commit of the
message broker, this node crashes, the network is lost, or the message broker goes down. Given that a check for
broker liveness is performed right before the database commit, that time span is very tight. But to make the
most robust systems that can monitor themselves, you should consider employing state machine handling as
outlined above. You might never see that health check trip, but now you can at least sleep without thinking about
that 1-billion-dollar order that was never processed.
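A corresponding health check could be a simple query for stale rows, again against the hypothetical "jobs"
table (note that the interval syntax varies between databases):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StuckJobsHealthCheck {
    // Rows sitting in ALLOCATED or SENT for more than 15 minutes are suspicious,
    // and should be checked up on by humans.
    static boolean hasStuckJobs(Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT COUNT(*) FROM jobs"
                        + " WHERE status IN ('ALLOCATED', 'SENT')"
                        + " AND allocated_at < CURRENT_TIMESTAMP - INTERVAL '15' MINUTE");
                ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getLong(1) > 0;
        }
    }
}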
PS: Best Effort 1PC: Two transactions are opened: one for the message broker, and one for the database. The
business logic, and possibly database reads and changes, are performed. The database is committed first, as that
has many more failure scenarios than the message system, e.g. data or code problems giving integrity constraint
violations, and spurious stuff like MS SQL's deadlock victim, etc. Then the message queue is committed, as the
only reason for the message broker not to handle a commit is basically that you've had infrastructure problems
like connectivity issues, or that the broker has crashed.
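Schematically, the commit ordering looks like the following sketch using JDBC and JMS directly - this is an
illustration of the principle, not the actual Mats implementation:

import java.sql.Connection;
import java.sql.SQLException;
import javax.jms.JMSException;
import javax.jms.Session;

public class BestEffort1PcSketch {
    static void doTransactionalWork(Connection dbCon, Session jmsSession)
            throws SQLException, JMSException {
        // Both "transactions" are open: the JDBC Connection (autoCommit=false) and
        // the transacted JMS Session, on which outgoing messages are staged.

        // ... business logic: database reads/changes, messages sent on jmsSession ...

        // Commit the database first - it has the most failure scenarios
        // (integrity constraint violations, deadlock victim, etc.).
        dbCon.commit();

        // Then commit the message broker. A crash or connectivity loss exactly here
        // is the MatsMessageSendException situation: DB committed, messages not sent.
        jmsSession.commit();
    }
}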
Notice that it has been decided not to let this exception extend MatsInitiator.MatsBackendException, even though
it is definitely a backend problem. The reason is that in all situations where MatsBackendException is raised, the
other resources have not yet been committed, as opposed to situations where this MatsMessageSendException is
raised. Luckily, in this day and age, we have multi-exception catch blocks if you want to handle both the same way.
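For example, a multi-catch handling both exceptions the same way could look like this sketch (the initiation
body is elided):

import io.mats3.MatsInitiator;
import io.mats3.MatsInitiator.MatsBackendException;
import io.mats3.MatsInitiator.MatsMessageSendException;

public class InitiateWithMultiCatch {
    static void initiateSafely(MatsInitiator matsInitiator) {
        try {
            matsInitiator.initiate(init -> {
                // ... initiation logic ...
            });
        }
        catch (MatsBackendException | MatsMessageSendException e) {
            // Handle both the same way, e.g. log and alert - but remember the
            // semantic difference: with MatsBackendException nothing was committed,
            // while with MatsMessageSendException the database commit went through.
        }
    }
}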
Constructor Summary

Constructors:
  MatsMessageSendException(String message)
  MatsMessageSendException(String message, Throwable cause)
Method Summary

Methods inherited from class java.lang.Throwable:
  addSuppressed, fillInStackTrace, getCause, getLocalizedMessage, getMessage,
  getStackTrace, getSuppressed, initCause, printStackTrace, setStackTrace, toString
Constructor Details

MatsMessageSendException
  public MatsMessageSendException(String message)

MatsMessageSendException
  public MatsMessageSendException(String message, Throwable cause)