Payment Gateway (Stripe) System Design with Microsoft Engineer

Posted by : on Dec 2, 2025

Introduction
Domain Background
Requirements
- Functional Requirements
- Non-Functional Requirements
Satisfy Functional Requirements
Satisfy Non-Functional Requirements
Conclusion

Introduction

Designing a payment gateway is similar to designing a bank system: the two share many problems and complexities. The goal of the system is to let businesses earn money from goods/services by processing payments for them.

We are going to design the full payment lifecycle, from the customer submitting card details to the money arriving at the merchant.

Note that this system design requires a bit of domain knowledge about how payments, card networks and banks work - you can find a very short section about it below.

Recorded mock for Stripe System Design can be found here.

Domain Background

As I mentioned, for this system design you will need a bit of background knowledge about payment processing, card networks and banks.

Stripe is not a bank. It is an intermediary that helps move money from customer cards to merchant bank accounts.

Why do people use Stripe instead of implementing their own payment processing? Because storing and processing card details and payments, and integrating with Card Networks and banks, requires acquiring many licenses, which is both time-consuming and very expensive. In addition, you will pay huge fines if you don’t comply with the required security standards (for example, PCI DSS for card data).

Terminology

Merchant’s Customer - the end customer who wants to buy goods/services from the merchant.

Merchant - our direct customer that sells some goods or services.

Stripe - us. It acts as both the payment gateway and the payment processor.

Acquiring bank - Stripe’s (receiving) bank

Card network - Card payments processor, e.g. Visa, Mastercard, etc.

Issuing bank - the customer’s bank. It issued the card the customer is paying with.

Payout - transferring money from Stripe’s bank to the merchant’s bank.

Ledger - a financial record system that tracks all transactions as paired entries: a debit and a credit (e.g., acc1 -$100; acc2 +$100). This two-sided format ensures every movement of money is balanced and traceable. It’s a legal requirement for payment processors, and all entries together should sum to zero.
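To make the ledger definition concrete, here is a minimal sketch of double-entry bookkeeping. The names (`LedgerEntry`, `record_transfer`, `is_balanced`) and the sign convention (negative = debit, positive = credit) are assumptions made for illustration, not Stripe’s actual model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    transaction_id: str
    account: str
    amount: Decimal  # assumed convention: negative = debit, positive = credit


def record_transfer(ledger, transaction_id, source, destination, amount):
    """Append a balanced pair of entries: debit the source, credit the destination."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    ledger.append(LedgerEntry(transaction_id, source, -amount))
    ledger.append(LedgerEntry(transaction_id, destination, amount))


def is_balanced(ledger):
    """All entries together must sum to zero."""
    return sum((entry.amount for entry in ledger), Decimal("0")) == 0


# The example from the text: acc1 -$100; acc2 +$100.
ledger = []
record_transfer(ledger, "tx-1", "acc1", "acc2", Decimal("100.00"))
```

Because entries are only ever appended in pairs, a non-zero total signals a bug or a lost write, which is what makes the ledger useful for audits.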

Flow

In general, the flow looks like this:

The customer wants to pay with a card, so the merchant creates a Payment (payment intent) in Stripe.
Stripe sends the request to the Card Network (Visa, Mastercard), as we are working with card details, not bank accounts.
The Card Network transfers money from the customer’s bank (issuing bank) to Stripe’s bank (acquiring bank).
Stripe records the transaction in the internal ledger and charges a fee for the money transfer.
Stripe transfers funds from its bank to the merchant’s bank account (payout).

The money transfer from the customer to Stripe’s bank is performed in 3 steps:

authorization - a transaction that locks the money on the customer’s account, performs fraud checks, etc.
capture - an idempotent request that confirms the authorized transaction. It means the money must be moved from sender to receiver.
settlement - the final stage, when the bank actually moves the money from one account to another.

You can think about it as follows:

authorization is an operation inside a SQL transaction
capture is the transaction’s commit
settlement is the replication of this commit to a follower node.

Reconciliation

Every few days, banks compare their ledger logs with each other to detect inconsistencies and roll back or roll forward some of the transactions. This is needed because banks communicate with each other asynchronously, and it is the only way to keep the state consistent.

Stripe, as a payment processor, should perform regular reconciliation as well, especially for payouts, to keep the internal ledger’s transactions in sync with the merchants’ banks.
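As a rough illustration of what a reconciliation pass does, the sketch below compares two sets of payout records and reports the differences. The record shape (payout id mapped to an amount in cents) and the function name `reconcile` are assumptions; real bank statements are far messier.

```python
def reconcile(internal, bank):
    """Compare payout records (payout id -> amount in cents) from two sources.

    Returns three sorted lists: ids we recorded but the bank did not,
    ids the bank reported but we did not record, and ids whose amounts differ.
    """
    internal_ids, bank_ids = set(internal), set(bank)
    missing_at_bank = sorted(internal_ids - bank_ids)
    missing_internally = sorted(bank_ids - internal_ids)
    amount_mismatch = sorted(
        payout_id
        for payout_id in internal_ids & bank_ids
        if internal[payout_id] != bank[payout_id]
    )
    return missing_at_bank, missing_internally, amount_mismatch


internal_ledger = {"po_1": 10_000, "po_2": 2_500, "po_3": 700}
bank_statement = {"po_1": 10_000, "po_2": 2_400, "po_4": 900}
result = reconcile(internal_ledger, bank_statement)
```

Each of the three groups needs a different follow-up: retry the payout, record a missing ledger entry, or investigate the amount.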

Requirements

As always, we are following our plan:

define functional and non-functional requirements
satisfy functional requirements
satisfy non-functional requirements
follow ups

Functional Requirements

Merchant is able to register itself and its webhook
Customers are able to pay for merchant’s goods/services
Merchant receives webhooks about its payment
Merchant receives money for goods/services in its bank account via the Payout mechanism
Ensure no double charge for the same goods/services
Audit - all transactions are stored in Ledger (legal requirement)

Non-Functional Requirements

High Traffic:
- 500M payment requests per day => 5787 rps on average => 10k rps at peak
- 5B ledger events (change of transaction/payment state) per day => 57870 rps on average => 100k rps at peak
- 2M merchants
CAP, availability vs consistency:
- as highly available as possible
- Strong consistency (linearizability) for payment processing (no double charge)
- Eventual consistency for merchant-related data and ledger data
- No data loss for ledger entries
Correctness:
- debit and credit add up to zero
- merchant receives webhooks with an “at least once” guarantee, with retries for up to 30 days.
Spikes: uneven load per season (Black Friday) and per merchant
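The traffic numbers above follow from dividing the daily volume by the number of seconds in a day. A small sketch of that arithmetic (the function name `average_rps` is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(events_per_day: int) -> float:
    """Average requests per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


payment_rps = average_rps(500_000_000)    # about 5,787
ledger_rps = average_rps(5_000_000_000)   # about 57,870

# Ratio between the stated peak targets and the averages (roughly 1.7x).
payment_peak_factor = 10_000 / payment_rps
ledger_peak_factor = 100_000 / ledger_rps
```

In other words, the peak targets assume that the busiest moments are a bit under twice the daily average.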

Satisfy Functional Requirements

Let’s first design a system for a couple of users on a single machine, just to make sure it works and satisfies the functional requirements, then follow up with the non-functional requirements.

Merchant is able to register itself and its webhook

We can trivially satisfy this with a simple API and storage: one endpoint that receives merchant-related information and stores it. In the real world there will be different types of data, e.g. webhook details, goods, logos, names, payment methods, etc. For security reasons we add API keys/secrets for the API endpoints, plus a signature for webhook validation.
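One common way to implement the webhook signature is an HMAC over the payload with a secret shared with the merchant. The sketch below is an assumed scheme for illustration (timestamp plus payload, SHA-256, 5 minute tolerance), not a description of Stripe’s real signature format.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, timestamp: int, payload: bytes) -> str:
    """Sender side: sign "<timestamp>.<payload>" with the merchant's webhook secret."""
    message = str(timestamp).encode() + b"." + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signature(secret, timestamp, payload, signature, now, tolerance_seconds=300):
    """Merchant side: reject stale events, then compare signatures in constant time."""
    if abs(now - timestamp) > tolerance_seconds:
        return False  # too old (or from the future): possible replay
    expected = sign_payload(secret, timestamp, payload)
    return hmac.compare_digest(expected, signature)


# Hypothetical values for illustration.
secret = b"merchant-webhook-secret"
payload = b'{"payment_id": "pay_1", "status": "authorized"}'
signature = sign_payload(secret, 1_000, payload)
```

Including the timestamp in the signed message lets the merchant reject replays of old, otherwise valid events.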

Customers are able to pay for merchant’s goods/services

There is a sequence of steps that need to be performed to move money from the customer to the merchant’s bank.

Step 1. The customer opens the payment details page, expressing willingness to pay. This action sends a request to our Payment API.

Step 2. The API creates a payment entity with status “new” and merchant/goods related information. It will be used for deduplication later.

Step 3. The customer initiates the payment by filling in card details and pressing “Pay”. The client should encrypt the data and send it along with the payment id created earlier.

Step 4. The Payment API sends an “Authorization” request to the Card Network with the customer’s card details. This step might include an interactive flow that prompts the customer to enter a code or confirm in an app. That flow works through redirects, similar to OAuth.

Step 5. The API saves the result of the Authorization transaction with “auth_id” to the database and updates the status to “authorized”. The authorization id serves as the deduplication key on the Card Network side (see later).

Step 6. The Payment API sends a “Capture” request to the Card Network to move the authorized money. The capture can be initiated by the merchant, by a background job, or straight away within the same request.

Step 7. The server saves the “Captured” status to the payment in storage.

Step 8. Later on, the payout to merchants happens via a background job that sends a batch of payments to the merchant’s bank.

During steps 1 through 7, two extra actions are performed on every step:

sending an event to the merchant’s webhook
storing the transaction/event in Audit storage

Merchant receives webhooks about its payment

As shown in the diagram, a webhook is sent to the merchant’s API on every payment update or event.

Merchant receives money for goods/services

At step 8, the money is transferred from Stripe’s account to the merchant’s account via payout. Periodically we collect all successful payments and transfer them to the merchant’s bank in batches (with retries and reconciliation).
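A minimal sketch of how the payout job could assemble a batch: group captured payments that have not been paid out yet by merchant and sum the net amounts. The field names (`amount_cents`, `fee_cents`, `payout_id`) and the sample values are hypothetical.

```python
from collections import defaultdict


def build_payouts(payments):
    """Sum net amounts (in cents) of captured, not yet paid out payments per merchant."""
    totals = defaultdict(int)
    for payment in payments:
        if payment["status"] == "captured" and payment["payout_id"] is None:
            totals[payment["merchant_id"]] += payment["amount_cents"] - payment["fee_cents"]
    return dict(totals)


payments = [
    {"id": "pay_1", "merchant_id": "m_1", "status": "captured",
     "amount_cents": 10_000, "fee_cents": 300, "payout_id": None},
    {"id": "pay_2", "merchant_id": "m_1", "status": "captured",
     "amount_cents": 5_000, "fee_cents": 150, "payout_id": None},
    # Not captured yet, so it is not eligible for payout.
    {"id": "pay_3", "merchant_id": "m_2", "status": "authorized",
     "amount_cents": 7_000, "fee_cents": 210, "payout_id": None},
    # Already included in an earlier payout.
    {"id": "pay_4", "merchant_id": "m_2", "status": "captured",
     "amount_cents": 2_000, "fee_cents": 60, "payout_id": "po_9"},
]
batches = build_payouts(payments)
```

Marking each payment with the payout it was included in is what prevents the next run of the job from paying it out twice.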

Ensure no double charge for the same goods/services

The two-step card payment (auth, capture) helps a lot here. On our side, since we have unique payment ids, a double charge can happen only in case of concurrent requests or retries from the client.

We will use optimistic concurrency with compare-and-swap (CAS) for that. When the customer clicks “Pay” (POST /payment/:id), we first set the status of the payment to “authorizing”:

UPDATE payments SET status = 'authorizing' WHERE id = $id AND status = 'new'

Only in case of success (rows affected > 0) do we proceed to sending the “Auth” request to the Card Network. This way, if 2 concurrent threads call this endpoint for the same payment id (the client is retrying, or there is a concurrency issue), only one of them will proceed with actually authorizing and locking the customer’s money.

When we get the response from the Card Network, we update the payment again:

UPDATE payments SET status = 'authorized', auth_id = $auth_id WHERE id = $id AND status = 'authorizing'

Capturing is idempotent on the Card Network side, so we can safely send multiple “capture” requests for the same auth_id.
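The compare-and-swap updates above can be tried out end to end with an in-memory SQLite database. This is only a sketch: the table layout and the helper names (`try_start_authorization`, `mark_authorized`) are assumptions, and a production system would use its own database and driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL, auth_id TEXT)"
)
conn.execute("INSERT INTO payments (id, status) VALUES ('pay_1', 'new')")
conn.commit()


def try_start_authorization(conn, payment_id):
    """CAS: move 'new' -> 'authorizing'. Only one caller can ever succeed."""
    cursor = conn.execute(
        "UPDATE payments SET status = 'authorizing' WHERE id = ? AND status = 'new'",
        (payment_id,),
    )
    conn.commit()
    return cursor.rowcount == 1


def mark_authorized(conn, payment_id, auth_id):
    """CAS: move 'authorizing' -> 'authorized' and remember the network's auth id."""
    cursor = conn.execute(
        "UPDATE payments SET status = 'authorized', auth_id = ? "
        "WHERE id = ? AND status = 'authorizing'",
        (auth_id, payment_id),
    )
    conn.commit()
    return cursor.rowcount == 1


first = try_start_authorization(conn, "pay_1")   # the first request wins
second = try_start_authorization(conn, "pay_1")  # a retry or concurrent request loses
```

Only the caller that sees `True` should contact the Card Network; everyone else returns the current payment status to the client.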

Audit - all transactions are stored in Ledger

As shown in the diagram, we write a ledger entry to Audit Storage on every payment update or event.

Drawbacks

The current architecture will work on the happy path and under low traffic, but it has a lot of flaws:

How do we handle payout failures? What if the bank acknowledged a payout but the background job died before saving it to the database? How, in general, do we ensure the bank has the same payouts as our internal storage?
How do we guarantee webhook delivery to merchants if the merchant API is down for a day?
What do we do if the API died before sending transaction data to Audit storage? In that case we violate the law by not having full ledger logs.
Scalability and availability issues are still there. We cannot handle more than a few thousand payments per second, and we cannot fail over if a component stops working.

Don’t worry, all of these points are covered in the section below.

Satisfy Non-Functional Requirements

Now that our system works for a couple of users on a single machine, let’s make it Stripe’s size.

Introduce Kafka for side effects

Before diving deep into the non-functional requirements, I would like to add one pattern that will make our life easier. You can see that when a customer pays, we have a lot of side effects:

sending a webhook to the merchant
saving the payment update to the audit log
there might be more, e.g. notifications, some derived storages, etc.

These operations are independent of each other, and each of them can fail or be delayed at any time.

Therefore let’s introduce a well-known pattern: a message broker with multiple workers per side effect.
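The key property of this pattern is that every side effect consumes the same stream of events at its own pace. The toy in-memory log below only illustrates that idea with per-group offsets; it is not Kafka’s API, and the class and group names are made up.

```python
class EventLog:
    """Append-only log where every consumer group tracks its own offset."""

    def __init__(self):
        self.events = []
        self.offsets = {}  # group name -> index of the next unprocessed event

    def publish(self, event):
        self.events.append(event)

    def poll(self, group):
        """Return all events this group has not committed yet."""
        return self.events[self.offsets.get(group, 0):]

    def commit(self, group, count):
        """Mark `count` more events as processed for this group."""
        self.offsets[group] = self.offsets.get(group, 0) + count


log = EventLog()
log.publish({"payment_id": "pay_1", "status": "authorized"})
log.publish({"payment_id": "pay_1", "status": "captured"})

# The webhook workers are healthy and process both events...
webhook_batch = log.poll("webhook-workers")
log.commit("webhook-workers", len(webhook_batch))

# ...while the audit workers are lagging and still see everything.
audit_backlog = log.poll("audit-workers")
```

A slow or failing audit worker does not block webhooks, and it can catch up later because its events are still in the log.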

Correctness

To guarantee at-least-once delivery to a merchant’s webhook we need to store information about each webhook delivery, as workers can die at any time, the merchant API might be down for hours or days, etc.

Since we cannot lose data for our Audit requirements, we use the same mechanism there as well, with “at least once” semantics (basically retrying indefinitely).
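A sketch of the stored delivery state and retry scheduling could look as follows. The 30-day window comes from the requirements; the exponential backoff, its base delay and cap, and all names (`WebhookDelivery`, `attempt_delivery`) are assumptions for illustration.

```python
from dataclasses import dataclass

RETRY_WINDOW_SECONDS = 30 * 24 * 60 * 60  # 30 days, from the requirements
BASE_DELAY_SECONDS = 60                   # assumed first retry delay
MAX_DELAY_SECONDS = 6 * 60 * 60           # assumed cap between attempts


@dataclass
class WebhookDelivery:
    event_id: str
    created_at: int
    attempts: int = 0
    status: str = "pending"  # pending | delivered | expired
    next_attempt_at: int = 0


def attempt_delivery(delivery, send, now):
    """Try once. On failure, schedule the next attempt with capped exponential backoff.

    `send` is a callable that returns True on success; persisting `delivery`
    after each call is left to the caller (the Retry/Job Storage).
    """
    delivery.attempts += 1
    try:
        delivered = send(delivery.event_id)
    except Exception:
        delivered = False
    if delivered:
        delivery.status = "delivered"
    elif now - delivery.created_at >= RETRY_WINDOW_SECONDS:
        delivery.status = "expired"
    else:
        delay = min(BASE_DELAY_SECONDS * 2 ** (delivery.attempts - 1), MAX_DELAY_SECONDS)
        delivery.next_attempt_at = now + delay
    return delivery
```

Because a worker can die after the merchant received the event but before the state was saved, the merchant may see the same event more than once and should deduplicate by event id.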

High Traffic & Scalability

A typical strategy for scaling traffic is to distribute the load as evenly as possible between the machines. It applies to both storage and compute.

Payment API

Since it is stateless, it scales trivially by adding more instances in multiple regions/geos behind load balancers.

Payment Storage

Since our load is mostly write-heavy, we apply sharding with consistent/rendezvous hashing for rebalancing. The shard key is the payment id, which distributes records evenly; sharding by merchant id instead would let very large merchants create hotspots and overload single instances.

However, you might say that this creates a scatter/gather load on our shards for the merchant’s dashboard, and you would be right - this is a scalability bottleneck.

Another problem: if we shard by payment id, a specific merchant’s data is spread across shards, so the coordinator/client cannot route a merchant query to a single shard.

The best approach here is to use derived data: sync the payments data to another storage that is optimized for querying and aggregation and is sharded by merchant id. Let’s name this storage “Merchant Storage”, and move all merchant-related data there, along with the payments needed for the merchant dashboard.

Merchant Storage

As discussed above, we shard it by merchant id because we want to retrieve merchant-related data. For rebalancing we use the well-known consistent/rendezvous hashing. Merchants that are too hot can be isolated and handled separately, as in the “celebrity problem”.
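Rendezvous hashing, mentioned for both storages, fits in a few lines: every shard gets a score for a given key, and the key goes to the shard with the highest score. The sketch below uses SHA-256 for scoring and made-up shard names; both are assumptions.

```python
import hashlib


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random score for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_shard(key: str, nodes):
    """Rendezvous (highest random weight) hashing: the node with the top score owns the key."""
    return max(nodes, key=lambda node: _score(node, key))


nodes = ["shard-a", "shard-b", "shard-c", "shard-d"]
keys = [f"pay_{i}" for i in range(1000)]
before = {key: pick_shard(key, nodes) for key in keys}

# Remove one shard: only the keys it owned should move.
remaining = [node for node in nodes if node != "shard-d"]
after = {key: pick_shard(key, remaining) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

When a shard is removed, only the keys it owned are reassigned, which is the rebalancing property we want from the hashing scheme.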

Audit Storage

We don’t know how this database is going to be queried, so it is not easy to come up with a universal sharding strategy here. Let’s shard by payment id, same as our Payment Storage. This is enough to show that it will scale well. If the intention is to query it by ranges or run aggregations, a more advanced sharding strategy is needed.

PaymentEvents Queue

Kafka is trivially sharded (partitioned) by payment id, which preserves ordering within the same payment.
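The ordering guarantee comes from the fact that all events with the same key land in the same partition. The sketch below imitates key-based partitioning with CRC32 as a stand-in hash; Kafka clients use their own hash function, so the partition numbers here are purely illustrative.

```python
import zlib


def partition_for(payment_id: str, num_partitions: int) -> int:
    """Map a payment id to a partition; the same id always maps to the same partition."""
    return zlib.crc32(payment_id.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 8  # assumed partition count

events = [
    ("pay_1", "authorizing"),
    ("pay_2", "authorizing"),
    ("pay_1", "authorized"),
    ("pay_1", "captured"),
]

# Each partition is an ordered log; events are appended in arrival order.
partitions = {}
for payment_id, status in events:
    partition = partition_for(payment_id, NUM_PARTITIONS)
    partitions.setdefault(partition, []).append((payment_id, status))
```

A single worker reading one partition therefore sees the status changes of any given payment in the order they were produced.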

Workers, background job

The workers scale trivially together with the Kafka partitions, as they are stateless compute.

Retry/Job Storage

The storage can be trivially sharded by job id as well. Background jobs might query jobs by the “type” field across multiple shards, so we add an index on that field.