Package io.mats3

Class MatsInitiator.MatsMessageSendException

java.lang.Object
java.lang.Throwable
java.lang.Exception
io.mats3.MatsInitiator.MatsMessageSendException
All Implemented Interfaces:
Serializable
Enclosing interface:
MatsInitiator

public static class MatsInitiator.MatsMessageSendException extends Exception
Will be thrown by the MatsInitiator.initiate(InitiateLambda)-method if Mats fails to send the messages after the MatsInitiator.InitiateLambda has been run, any external resource (typically DB) has been committed, and then some situation occurs that makes it impossible to send out messages. (Some developers might recognize this as the "VERY BAD!-initiation" situation).

This is a rare but unfortunate situation, which is hard to guard completely against, in particular in the "Best Effort 1-Phase Commit" paradigm that the current Mats implementations run on. What it means is that if you e.g. in the initiate-lambda did some "job allocation" logic on a table in a database, and based on that allocation sent out e.g. 5 messages, the job allocation will now have happened, but the messages have not actually been sent. The result is that in the database, you will see those jobs as processed (semantically "started processing"), but in reality the downstream endpoints never started working on them, since the messages were not actually sent out.
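
To illustrate, here is a minimal sketch of such an initiation with the "VERY BAD!" case handled in a catch block. The jobDao, Job and ProcessJobDto types, the HOSTNAME constant and the endpoint ids are hypothetical, used for illustration only; only matsInitiator and the init-calls are the Mats API:

    try {
        matsInitiator.initiate(init -> {
            // Assuming the DAO uses a DataSource that takes part in the Mats-managed transaction:
            // the DB work below is committed first, then the staged messages are committed on the broker.
            List<Job> jobs = jobDao.allocateUnprocessedJobs(HOSTNAME); // hypothetical DAO
            for (Job job : jobs) {
                jobDao.setStatusSent(job.getId()); // hypothetical: "ALLOCATED" -> "SENT"
                init.traceId("ProcessJob[id=" + job.getId() + "]")
                        .from("JobAllocationService.allocateAndSend")
                        .to("JobService.processJob")
                        .send(new ProcessJobDto(job.getId()));
            }
        });
    }
    catch (MatsBackendException e) {
        // Nothing was committed - neither the database changes nor the messages.
    }
    catch (MatsMessageSendException e) {
        // "VERY BAD!": The database commit went through (jobs are marked "SENT"), but the
        // messages were NOT sent. Attempt a compensating transaction, and/or alert humans.
    }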

This situation can to a degree be alleviated if you catch this exception, and then use a compensating transaction to de-allocate the jobs in the database again. However, since bad things often happen in clusters, you might not be able to do the de-allocation either (due to the database becoming inaccessible at the same instant - e.g. the reason that the messages could not be sent was that the network cable became unplugged, or that this node actually lost power at that instant).

A way to at least catch when this happens is to employ a state machine in the job allocation logic: First pick jobs for this node by setting the state column of job-entries whose state is "UNPROCESSED" to some status like "ALLOCATED" (along with a column for which node allocated them (i.e. the "hostname" of this node) and a column for the timestamp of when they were allocated). In the initiator, you pick the jobs that were allocated to this node, set the status to "SENT", and send the outgoing messages. Finally, in the terminator endpoint (which you specify in the initiation), you set the status to "DONE". Then you add a health check: Assuming that under normal conditions such jobs should always be processed within seconds, the health check scans the table for rows which have been in the "ALLOCATED" or "SENT" status for e.g. 15 minutes: Such rows are very suspicious, and should be checked up on by humans. Sitting in "ALLOCATED" status would imply that the node that allocated the job went down (and has not (yet) come back up) before it managed to initiate the messages, while sitting in "SENT" would imply that the message flow had started, but not gotten through the processing: Either that message flow sits in a downstream Dead Letter Queue due to some error, or you ended up in the situation explained here: The database commit went through, but the messages were not sent.
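
A sketch of such a health check, using plain JDBC (java.sql and java.time imports assumed) - the "jobs" table and its "status" and "allocated_at" columns are hypothetical names for the scheme described above:

    // Find rows that have been sitting in "ALLOCATED" or "SENT" for more than 15 minutes.
    private static final String STUCK_JOBS_SQL = "SELECT id FROM jobs"
            + " WHERE status IN ('ALLOCATED', 'SENT') AND allocated_at < ?";

    boolean jobTableLooksHealthy(Connection con) throws SQLException {
        Timestamp cutoff = Timestamp.from(Instant.now().minus(Duration.ofMinutes(15)));
        try (PreparedStatement ps = con.prepareStatement(STUCK_JOBS_SQL)) {
            ps.setTimestamp(1, cutoff);
            try (ResultSet rs = ps.executeQuery()) {
                // Any hit is suspicious: "ALLOCATED" implies the allocating node died before
                // initiating, "SENT" implies a flow stuck on a DLQ - or this very situation.
                return !rs.next();
            }
        }
    }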

Please note that this should, in a somewhat stable operations environment, happen extremely seldom: What needs to occur for this to happen is that in the sliver of time between the commit of the database and the commit of the message broker, this node crashes, the network is lost, or the message broker goes down. Given that a check for broker liveness is performed right before the database commit, that time span is very tight. But to make the most robust systems that can monitor themselves, you should consider employing state machine handling as outlined above. You might never see that health check trip, but now you can at least sleep without thinking about that 1 billion dollar order that was never processed.

PS: Best effort 1PC: Two transactions are opened: one for the message broker, and one for the database. The business logic, and possibly database reads and changes, are performed. The database is committed first, as it has many more failure scenarios than the message system, e.g. data or code problems giving integrity constraint violations, and spurious stuff like being picked as MS SQL's deadlock victim. Then the message queue is committed, as the only reason for the message broker to not handle a commit is basically that you have infrastructure problems like connectivity issues, or that the broker has crashed.
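
A conceptual sketch of that commit ordering, using a raw JDBC Connection and a transacted JMS Session just to show the sequence - the actual Mats transaction managers are of course more elaborate than this:

    void commitBestEffort1PC(Connection con, Session jmsSession, Runnable businessLogic)
            throws SQLException, JMSException {
        // Assumes con.setAutoCommit(false) and a transacted JMS Session, i.e. both
        // "transactions" are already open when the business logic runs.
        businessLogic.run(); // DB reads/writes, plus staging of outgoing messages on the session.
        con.commit();        // 1. Database first: most failure scenarios surface here, and the
                             //    staged messages can still be rolled back on the broker.
        jmsSession.commit(); // 2. Broker last: if this fails, the DB is already committed - the
                             //    MatsMessageSendException ("VERY BAD!") situation.
    }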

Notice that it has been decided to not let this exception extend MatsInitiator.MatsBackendException, even though it is definitely a backend problem. The reason is that in all situations where MatsBackendException is raised, the other resources have not been committed yet, as opposed to situations where this MatsMessageSendException is raised. Luckily, in this day and age, we have multi-exception catch blocks if you want to handle both the same way.
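
For instance, if both cases are to be handled the same way (the log instance is assumed):

    try {
        matsInitiator.initiate(init -> { /* ... */ });
    }
    catch (MatsBackendException | MatsMessageSendException e) {
        // Handles both "nothing committed" and "DB committed, but messages not sent" alike.
        log.error("Could not initiate Mats flow.", e);
    }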

Constructor Details

    • MatsMessageSendException

      public MatsMessageSendException(String message)
    • MatsMessageSendException

      public MatsMessageSendException(String message, Throwable cause)