Distributed Transactions and Eventual Consistency in AWS Microservices

Traditionally, when building monolithic applications, database transactions were part of the package. You had a single database, often relational. When a request came in, you wrapped the business logic in a database transaction.

For example, when placing an order, multiple steps occurred within this single transaction:

Deducting item inventory.
Processing the customer's payment.
Generating an invoice.
Writing a ledger entry.

Because they were wrapped in a database transaction, if any of these actions failed, the database automatically rolled back every write and failed gracefully. This is the superpower of ACID guarantees:

Atomicity: Every statement in a transaction is treated as a single unit. Either all succeed, or all fail.
Consistency: A transaction transitions the database from one valid state to another, maintaining all schema constraints and rules.
Isolation: Concurrent execution of transactions leaves the database in the same state as if they were executed sequentially.
Durability: Once a transaction commits, its writes are guaranteed to persist, even in the event of a system crash.

In a monolith, ACID guarantees that you never charge a payment for a failed order, and that half-finished states are never exposed to concurrent queries. The database does the heavy lifting so you don't have to.

Eventually, as applications scale and traffic spikes, the monolithic database becomes a bottleneck point. To scale horizontally and maintain team velocity, you break the monolith into independent microservices.

But there is a catch: Every microservice must own its own database.

What was once a single database transaction is now split across multiple independent service databases. The payment database is separate from the order database, which is separate from the inventory database. You cannot run a standard SQL transaction across network boundaries and heterogeneous databases.

If the payment service successfully charges a customer, but the order placement fails downstream (e.g., due to an out-of-stock item), there is no database-level rollback.

As Pat Helland famously noted:

"Distributed transactions across autonomous services don't work at the internet scale."

Strong Consistency vs. Eventual Consistency

The shift from a monolith to microservices requires a shift in how we think about data consistency.

Strong Consistency (ACID)

In a strongly consistent system, any read operation immediately following a write is guaranteed to return the updated data. To achieve this in a distributed system, you must lock records across databases during a transaction.

The Cost: Locking resources across network boundaries introduces latency, risks deadlocks, and severely degrades availability. If one database or network link in the chain is slow or offline, the entire transaction blocks.

Eventual Consistency (BASE)

Instead of forcing the system to be consistent at all times, eventual consistency accepts that the system can be temporarily inconsistent, with the guarantee that it will eventually converge to a consistent state.

BASE Guarantees: Basically Available, Soft state (the state can change over time without user interaction), and Eventually consistent.
The Benefit: High throughput, fault tolerance, and loose coupling. Services do not wait for each other to commit; they commit locally and propagate state asynchronously.

Why Two-Phase Commit (2PC) Fails in Microservices

Historically, distributed database systems used the Two-Phase Commit (2PC) protocol to achieve strong consistency across multiple nodes.

Two-Phase Commit (2PC) pattern for Microservices.

In 2PC, a coordinator node asks all participating databases if they are ready to commit (Phase 1). If all respond "yes," the coordinator instructs them to commit (Phase 2). If any node says "no" or fails to respond, the entire transaction is rolled back.

While theoretically sound, 2PC is an anti-pattern for modern microservices architectures:

Synchronous Blocking: Resources (database rows) remain locked throughout both phases. This kills performance and limits scale.
Single Point of Failure: If the coordinator crashes midway through Phase 2, participants are left in a blocked state, unsure whether to commit or abort, holding locks indefinitely.
Lack of Service Autonomy: 2PC requires participation at the database engine level. In microservices, services should hide their databases behind APIs. Exposing raw transaction coordinators violates the core principle of microservice encapsulation.

The Modern Industry Standard: The Saga Pattern

The Saga Pattern is the industry standard for managing distributed transactions without locking resources. A Saga is a sequence of local transactions. Each local transaction updates the database within a single service and publishes an event or message to trigger the next step.

If a step fails, the Saga must execute compensating transactions in reverse order to undo the changes made by the previous steps.

Unlike database rollbacks, compensating transactions are business-level undo actions (e.g., refunding a credit card instead of deleting a database row). Because changes are committed locally before the entire workflow finishes, steps must be idempotent so that retries do not cause side effects (like charging a customer twice).

Choreography vs. Orchestration

There are two primary ways to coordinate a Saga: Choreography and Orchestration.

Choreography (Decentralized)

In a choreographed saga, there is no central coordinator. Services listen to events published by other services and take action based on those events.

Choreography implementation of the Saga pattern.

Pros: Easy to understand for simple workflows; no single coordinator point of failure; services remain loosely coupled.
Cons: As the system grows, it becomes difficult to understand which services subscribe to which events. Debugging, tracing, and monitoring become complex. You risk circular event loops.

Orchestration (Centralized)

This is the implementation most companies use today to coordinate Saga workflows. In an orchestrated saga, a central orchestrator acts as the "brain," explicitly directing participants on which local transactions to execute and when.

Orchestration implementation of the Saga pattern.

Pros: Single place to define, visualize, and monitor the workflow. Clear separation of concerns; the domain services don't need to know about downstream steps or compensation logic.
Cons: Risk of over-centralizing logic into the orchestrator, turning services into simple CRUD layers. The orchestrator itself must be highly available and resilient.

Choreography vs. Orchestration Decision Matrix

Dimension	Choreography	Orchestration
Complexity	Simple for small workflows, highly complex at scale.	Initial setup overhead, but scales cleanly for complex flows.
Coupling	Loose coupling at the API level, but tight coupling to event schemas.	Central coordinator couples steps, but domain services remain simple.
Visibility	Difficult to trace the overall state of a single customer journey.	Full observability and real-time state tracking out of the box.
Error Handling	Hard to implement coordinated retries and complex error branches.	Built-in retry policies, catch-all errors, and fallback routing.

AWS Recommended Architecture: Step Functions Orchestration

On AWS, the industry-standard recommendation for implementing an orchestrated Saga is AWS Step Functions.

Step Functions is a serverless orchestration service that lets you build resilient state machines. It integrates directly with AWS services (like Lambda, DynamoDB, ECS, and SQS) and external systems.

Saga Pattern on AWS

A Typical E-Commerce Saga with AWS Step Functions

Below is an architecture diagram of how Step Functions orchestrates a distributed transaction across microservices:

Why AWS Step Functions is Ideal for Sagas

Declarative State Machines: You define workflows in Amazon States Language (ASL) or visually. The visual graph serves as both documentation and a live monitoring console.
Built-in Error Handling and Retries: If an external API (like Stripe) is down, you can configure Step Functions to automatically retry with exponential backoff and jitter, without writing retry logic in your Lambda code.
Compensation Routing: Using Catch and Retry blocks, you can route failures to compensation steps. If "Reserve Stock" fails, the state machine automatically catches the exception and routes it to the "Refund Payment" state.
Long-Running Executions: Standard Workflows can run for up to a year, allowing human-in-the-loop steps (e.g., manual fraud checks or manager approvals).
Standard vs. Express Workflows:
- Standard Workflows: Best for long-running, audit-heavy processes. They guarantee exactly-once execution execution and maintain detailed histories of every execution step.
- Express Workflows: Best for high-volume, short-duration (under 5 minutes) transaction processing. They support at-least-once execution and are significantly cheaper, making them ideal for high-throughput microservice orchestrations.

Making Sagas Reliable: The Transactional Outbox Pattern

Even with a robust orchestrator like AWS Step Functions, you face the Dual Write Problem.

Consider this scenario: The Order Service updates its database to mark an order as PAID. Now, it must notify downstream services by publishing an event to Amazon EventBridge or SNS.

These are two different network operations on two different systems. If the database write succeeds, but the event publisher fails (due to network blips or client crashes), downstream services will never react. If you reverse the order (publish the event first, write to DB second), the database write might fail, leading to phantom events.

This violates consistency. We cannot use a simple compensating transaction here because a downstream service might read the inconsistent database state before the compensation runs, leading to data drift:

T1: Order Service DB writes state "PAID"
T2: Shipping Service reads DB state and ships the item
T3: Event Bridge publication fails
T4: Timeout forces compensation, changing DB state to "CANCELLED"

Result: Item is shipped, but the order is cancelled in the system.

The Solution: The Transactional Outbox Pattern

Instead of writing to the database and publishing an event as two separate actions, you execute them together inside a single local database transaction.

The Transactional Outbox pattern for making Sagas reliable.

The Local Write: The Order Service writes the order record to the Orders table, and simultaneously writes an event payload to an Outbox table within the same transaction.
The Event Relay: An independent agent or poller watches the Outbox table, reads new messages, and publishes them to the message broker. Once published, it marks the messages as sent or deletes them.

AWS Native Outbox Implementations

Implementing this on AWS is highly streamlined depending on your database choice:

1. DynamoDB + DynamoDB Streams

DynamoDB Streams serves as a built-in outbox table.

When a microservice writes to a DynamoDB table, a stream record is generated automatically.
An AWS Lambda function reads from this stream (with at-least-once delivery guarantees) and publishes the event to Amazon EventBridge or SNS.
It's great because you do not need to create or clean up a separate outbox table. DynamoDB handles the stream buffer natively.

2. Aurora (Relational DB) + AWS Lambda / Debezium

This is the path if you are using Amazon Aurora PostgreSQL/MySQL.

Use a local transaction to insert into your business tables and an outbox table.
Use a tool like Debezium (running on AWS Fargate) to capture database binary logs (CDC - Change Data Capture) and stream them to Kafka/EventBridge.
Alternatively, use Aurora PostgreSQL's built-in aws_lambda extension to trigger a Lambda function directly from database triggers (though CDC is generally preferred for high-throughput scaling).

How an advanced multi-regional microservices architecture on AWS with outbox pattern might look like:

Advanced multi-regional microservices architecture on AWS with Step Functions Saga orchestration.

Summary: Designing for Resilience

Distributed transactions are difficult, and eventual consistency is a mind shift. Before adopting these patterns, ask if you truly need them:

The Golden Rule: If you can redesign your service boundaries to keep transactional data within a single database, do so. Do not build microservices just to run distributed transactions.

However, when scale demands division, follow the AWS-recommended path:

Use AWS Step Functions to orchestrate complex sagas, relying on built-in retries, catch blocks, and state tracking.
Implement compensating actions for steps that require rollback, ensuring every action in your workflow is idempotent.
Eliminate dual writes by routing events through the Transactional Outbox Pattern using DynamoDB Streams or CDC.

By designing around eventual consistency, your microservices remain isolated, scalable, and resilient to failure at internet scale.

Distributed Transactions and Eventual Consistency in AWS Microservices