Sagas in distributed systems

Achieving complete isolation between transactions is relatively expensive in distributed systems. The system either has to maintain locks for each transaction, potentially blocking other concurrent transactions from making progress, or abort some transactions to maintain safety, which wastes the work already done.

Can we handle transactions in a better way?

Long-lived transactions

Long-lived transactions (LLTs) are transactions that by their nature last much longer, on the order of hours or even days instead of milliseconds. This can happen because the transaction processes a large amount of data, requires human input to proceed, or needs to communicate with slow third-party systems.

Examples of LLTs

  • Batch jobs that calculate reports over big datasets
  • Claims at an insurance company, containing various stages that require human input
  • An online order of a product that spans several days from order to delivery

As a result, running these transactions with the usual concurrency control mechanisms degrades performance significantly, since they hold resources for long periods of time without operating on them.

Sometimes, long-lived transactions do not really require full isolation between each other, but they still need to be atomic, so that consistency is maintained under partial failures. Thus, researchers came up with a new concept: the saga.

Saga

A saga is a sequence of transactions T1, T2, …, TN that can be interleaved with other transactions. However, it’s guaranteed that either all of the transactions will succeed, or none of them will, maintaining the atomicity guarantee. Each transaction Ti is associated with a so-called compensating transaction Ci, which is executed in case a rollback is needed.
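To make the idea concrete, here is a minimal sketch in Python of a saga runner. It is an illustration, not a reference implementation: the names `run_saga`, `reserve_item`, `charge_card` and the in-memory `state` dictionary are all hypothetical, and the sketch assumes that a failed step leaves no partial effects of its own.

```python
from typing import Callable, List, Tuple

# Each step pairs a transaction Ti with its compensating transaction Ci.
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            print(f"{name} failed ({exc}); compensating")
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True

# Hypothetical order saga backed by an in-memory dictionary.
state = {"stock": 1, "charged": 0}

def reserve_item() -> None:
    if state["stock"] == 0:
        raise RuntimeError("out of stock")
    state["stock"] -= 1

def release_item() -> None:
    state["stock"] += 1

def charge_card() -> None:
    # Assumption for the example: the card is always declined.
    raise RuntimeError("insufficient funds")

def refund_card() -> None:
    state["charged"] = 0

ok = run_saga([
    ("reserve_item", reserve_item, release_item),
    ("charge_card", charge_card, refund_card),
])
```

Here `charge_card` fails, so the runner executes `release_item` and the stock goes back to its original value. Between the reservation and the compensation, other transactions could still observe the reduced stock, which is exactly the isolation trade-off a saga makes.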

Benefits

The concept of saga transactions can be really useful in distributed systems. Distributed transactions are generally hard to get right and can only be achieved by making compromises on performance and availability.

There are cases where we can use a saga transaction instead of a distributed transaction. This will satisfy all of our business requirements while keeping our systems loosely coupled and achieving good availability and performance.

A weird scenario

Think about the scenario of two concurrent orders A and B, where A has reserved the last item from the warehouse. As a result of this, order B fails at the first step and is rejected because of zero inventory. Later on, order A also fails at the second step because the customer’s card does not have enough money. Then, the associated compensating transaction runs, returning the reserved item to the warehouse.

This means that order B was rejected even though it could have been processed normally, because the item became available again once order A was rolled back. In this case the violation of isolation does not have severe consequences. However, in other cases the consequences might be more serious, e.g. customers being charged without receiving a product.
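The interleaving described in this scenario can be replayed with a few lines of Python. This is only a sketch: the `inventory` dictionary and the `reserve` and `release` helpers are made up for the example.

```python
inventory = {"widget": 1}
log = []

def reserve(order: str) -> bool:
    """First saga step: take one item if any is left."""
    if inventory["widget"] == 0:
        log.append(f"{order}: rejected, no inventory")
        return False
    inventory["widget"] -= 1
    log.append(f"{order}: item reserved")
    return True

def release(order: str) -> None:
    """Compensating transaction for reserve()."""
    inventory["widget"] += 1
    log.append(f"{order}: item returned")

a_reserved = reserve("A")   # A takes the last item
b_reserved = reserve("B")   # B sees A's uncommitted reservation and is rejected
release("A")                # A's payment fails later, so A is compensated

for line in log:
    print(line)
```

At the end the item is back in the warehouse, yet order B has already been rejected based on a state that was later undone.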

We can handle such cases by providing isolation at the application layer, using techniques such as the following.

Semantic lock

The use of a semantic lock essentially signals that some data items are currently in process and should be treated differently or not accessed at all. The final transaction of a saga takes care of releasing this lock and resetting the data to its normal state.
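A semantic lock can be as simple as a field on the record itself. The sketch below assumes a hypothetical `InventoryItem` with a `locked_by` field, and hypothetical `reserve` and `finish` helpers; a real system would store this flag in the database next to the data.

```python
from dataclasses import dataclass
from typing import Optional

class ItemLocked(Exception):
    """Raised when another saga holds the semantic lock on an item."""

@dataclass
class InventoryItem:
    quantity: int
    locked_by: Optional[str] = None  # semantic lock: id of the saga using this item

def reserve(item: InventoryItem, saga_id: str) -> None:
    # Check the lock first, so a concurrent saga is told to wait
    # instead of being rejected on data that may still be rolled back.
    if item.locked_by not in (None, saga_id):
        raise ItemLocked(f"item is in use by saga {item.locked_by}")
    if item.quantity == 0:
        raise ValueError("out of stock")
    item.locked_by = saga_id
    item.quantity -= 1

def finish(item: InventoryItem, saga_id: str, succeeded: bool) -> None:
    """Final step of the saga: undo the reservation if needed, then release the lock."""
    if item.locked_by != saga_id:
        return
    if not succeeded:
        item.quantity += 1
    item.locked_by = None

item = InventoryItem(quantity=1)
reserve(item, "A")
try:
    reserve(item, "B")
    b_outcome = "reserved"
except ItemLocked:
    b_outcome = "retry later"
finish(item, "A", succeeded=False)  # A's payment failed
reserve(item, "B")                  # B retries and now succeeds
```

With the lock in place, order B is asked to retry instead of being rejected outright, so it can still be fulfilled after order A rolls back.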

Commutative updates

Commutative updates have the same effect regardless of the order in which they are executed. Designing saga steps around them can help mitigate cases that are otherwise susceptible to lost updates.
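As a small illustration, compare updates expressed as deltas ("add 50") with updates expressed as absolute values ("set to 150"). The helper names and the numbers are made up for the example.

```python
from itertools import permutations

def apply_deltas(balance: int, deltas) -> int:
    """Commutative: each update adds or subtracts an amount."""
    for delta in deltas:
        balance += delta
    return balance

def apply_sets(balance: int, values) -> int:
    """Not commutative: each update overwrites the balance."""
    for value in values:
        balance = value
    return balance

# Try every possible ordering and collect the distinct outcomes.
delta_results = {apply_deltas(100, order) for order in permutations([50, -20, 5])}
set_results = {apply_sets(100, order) for order in permutations([150, 80])}

print(delta_results)  # one outcome, whatever the order
print(set_results)    # outcome depends on which write came last
```

Every ordering of the deltas ends at the same balance, while the overwrites produce a different result depending on which one ran last.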

Re-ordering the structure of the saga

Re-order the saga structure so that a transaction called a pivot transaction delineates a boundary between transactions that can fail and those that can’t.

In this way, transactions that can’t fail, but could lead to serious problems if rolled back due to failures of other transactions, can be moved after the pivot transaction.

An example of this is a transaction that increases the balance of an account. It could have serious consequences if another concurrent saga reads the increased balance and the increase is then rolled back. Moving this transaction after the pivot transaction means that it will never be rolled back, since all the transactions after the pivot transaction can only succeed.
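One way to picture this structure is a runner that treats the three groups of steps differently. The sketch below is an assumption-laden illustration: `run_saga_with_pivot`, the `ledger` dictionary, and the payment steps are hypothetical, and the steps after the pivot are assumed to be idempotent so they can be retried safely.

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]

def run_saga_with_pivot(
    compensatable: List[Tuple[Action, Action]],  # (Ti, Ci) pairs before the pivot
    pivot: Action,                               # the go/no-go step
    retriable: List[Action],                     # steps after the pivot, never rolled back
    max_retries: int = 3,
) -> bool:
    undo_stack: List[Action] = []
    try:
        for action, compensate in compensatable:
            action()
            undo_stack.append(compensate)
        pivot()
    except Exception:
        # Failure at or before the pivot: compensate in reverse order.
        for compensate in reversed(undo_stack):
            compensate()
        return False
    # Past the pivot there is no rollback, only retries.
    for action in retriable:
        for _ in range(max_retries):
            try:
                action()
                break
            except Exception:
                continue
        else:
            raise RuntimeError("step kept failing; needs manual intervention")
    return True

ledger = {"held": 0, "seller_balance": 0}
attempts = {"credit": 0}

def hold_funds() -> None:
    ledger["held"] += 30

def release_funds() -> None:
    ledger["held"] -= 30

def capture_payment() -> None:
    pass  # pivot; assumed to succeed in this example

def credit_seller() -> None:
    attempts["credit"] += 1
    if attempts["credit"] == 1:
        raise ConnectionError("transient failure")  # simulated one-off failure
    ledger["seller_balance"] += 30

ok = run_saga_with_pivot([(hold_funds, release_funds)], capture_payment, [credit_seller])
```

The balance increase (`credit_seller`) runs only after the pivot, so a concurrent saga can never read a credit that is later undone; a transient failure is handled by retrying rather than compensating.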

As I said earlier, complete isolation between transactions is relatively expensive, and the saga is one way to handle this. Do share your thoughts on it.