Payment Gateway Aggregator: How Stripe Routes Through 15+ Gateways in Milliseconds to Make Your ₹175 Payment
How Razorpay handles millions of payments per hour, even if their own gateway is down.

A few months ago, I was learning how UPI and online payments work. I realized that what we integrate into e-commerce applications are not actually payment gateways but payment aggregators. These aggregators have over 40 payment gateways under them. Your payment aggregator finds the best gateway for your transaction in under 50 ms, and even if that gateway fails, it retries or fails over, redirecting your request to another gateway within 200 ms. Your customers won't even notice that their payment switched between two or three gateways.
Note: To understand this better, I built a simulated payment orchestration system using NestJS, Kafka, RabbitMQ, Redis, Postgres, and Docker. It does not process real money, but it models the routing, retries, metrics, experiments, and failure-handling patterns used in real systems.
Let's break down things one by one:
What is the difference between a payment gateway, an aggregator, and a processor?
A payment processor operates behind the scenes, handling the actual financial transaction authorisation and settlement with banks and card networks.
A payment gateway is the customer-facing technology layer that securely captures and transmits payment data from your checkout interface to processing systems.
A payment aggregator is a service that bundles multiple gateways under it and gives merchants a single interface: near-100% uptime payments with almost zero paperwork.
Why Can't You Just Use One Payment Gateway?
Earlier, I thought payments worked like this: Your app talks with Razorpay, Razorpay talks with your bank, money moves. Done.
But reality is much messier.
In India, a typical payment company integrates with 10-15 different gateways: Razorpay, Juspay, PayU, Paytm, PhonePe Switch, HDFC, SBI, Cashfree, CCAvenue, BillDesk, and more.
Why so many? Because each gateway has different strengths.
Gateway Strengths
Gateway A might have: 97% success rate for HDFC UPI, 82% for Axis Bank cards, ~800ms latency, ₹2 per transaction.
Gateway B might have: 89% success for HDFC UPI, 95% for Axis Bank cards, ~1200ms latency, ₹1.5 per transaction.
If you're processing an SBI credit card payment, SBI's own gateway can usually handle it best; but if that gateway is down for maintenance, maybe HDFC's gateway can handle it better than the others. How do you decide this in real time, at scale? That's the core of the aggregator's job.
What happens when we click "Pay ₹330" on Amazon or Swiggy
Let's walk through how payment gateway aggregators work internally, according to my research -
Step 1: Idempotency check & fraud and risk scoring (5-10 ms)
When a payment request reaches the aggregator's server, it first performs an idempotency check: if this is a duplicate of a request that was already processed, it returns the previous result from cache.
The aggregator also analyses the origin of the request and the credentials sent with it for risk scoring (for a personal project, we can skip this step).
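The idempotency check above can be sketched in a few lines. This is a minimal, hypothetical illustration using an in-memory Map as a stand-in for Redis; all names here are my own, not from the repo.

```typescript
// Idempotency check sketch: a Map stands in for Redis (SET with TTL in production).
type PaymentResult = { status: string; transactionId: string };

const idempotencyCache = new Map<string, PaymentResult>();

function checkIdempotency(key: string): PaymentResult | null {
  // If we've already processed this key, return the cached result
  // instead of charging the customer again.
  return idempotencyCache.get(key) ?? null;
}

function recordResult(key: string, result: PaymentResult): void {
  // In Redis this would be a SET with an expiry (e.g. 24h).
  idempotencyCache.set(key, result);
}
```

In a real deployment the cache lives in Redis so all orchestrator instances share it, and keys expire after a retention window.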
Step 2: Scoring available gateways (5 ms)
After the idempotency check, we don't send the request to a random gateway or apply round-robin.
We score every available gateway on multiple factors and find the best fit for that payment.
Scoring Factors
Success Rate (35% weight): In the last 30 minutes, how many RuPay card payments from SBI accounts succeeded on this gateway?
Latency (25% weight): What's the P95 latency? 95% of requests complete faster than this time.
Method Affinity (15% weight): Is this gateway good at UPI vs Cards vs Netbanking?
Bank Affinity (12% weight): An SBI credit card through SBI's gateway tends to have higher success rates.
Amount Fit (8% weight): Is this gateway optimized for small (₹100-500) or large (₹10,000+) payments?
Time-of-Day Penalty (5% weight): Does historical data show this gateway struggles at 12 PM or weekends?
All of this gets combined into a single score. The gateway with the highest score usually wins and gets tried first (not always, since we don't want to exhaust the best gateway).
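The weighted combination can be sketched as a simple function. The weights mirror the article; the input metric names and the latency normalization are my own assumptions.

```typescript
// Weighted gateway scoring sketch. Each metric is pre-normalized to 0..1.
interface GatewayMetrics {
  successRate: number;      // 0..1, for this (method, bank) segment
  p95LatencyMs: number;     // raw P95 latency
  methodAffinity: number;   // 0..1, how good this gateway is at UPI/cards/netbanking
  bankAffinity: number;     // 0..1, e.g. SBI card through SBI's gateway
  amountFit: number;        // 0..1, fit for this ticket size
  timeOfDayPenalty: number; // 0..1, higher = historically worse at this hour
}

function scoreGateway(m: GatewayMetrics): number {
  // Normalize latency so ~0 ms → 1.0 and the 3 s timeout → 0.0.
  const latencyScore = Math.max(0, 1 - m.p95LatencyMs / 3000);
  return (
    0.35 * m.successRate +
    0.25 * latencyScore +
    0.15 * m.methodAffinity +
    0.12 * m.bankAffinity +
    0.08 * m.amountFit +
    0.05 * (1 - m.timeOfDayPenalty)
  );
}
```

With the Gateway A/B numbers from earlier (97% success at ~800 ms vs 89% at ~1200 ms, other factors equal), A scores higher and gets tried first.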
Step 3: The Circuit Breaker Check (1 ms)
Before actually calling the gateway, there's a critical safety check: Is this gateway even healthy?
Circuit breakers track three states:
CLOSED (Healthy): Everything's normal. Success rate above 70%, timeout rate low. All requests go through.
OPEN (Unhealthy): Gateway is having issues. Maybe success rate dropped to 35%, or 10 requests in a row timed out. Don't try this gateway—route to backup immediately.
HALF-OPEN (Recovery): After 30 seconds of being OPEN, send 10% of traffic as probes. If the probes succeed, the state becomes CLOSED. If they fail, stay OPEN.
Circuit Breaker Example
01:00 PM - SBI gateway processing normally (CLOSED)
01:15 PM - Bank maintenance starts, success rate drops to 38%
01:16 PM - Circuit breaker trips to OPEN
01:16-01:46 PM - All payments route to backup gateways (HDFC, Razorpay)
01:46 PM - Circuit enters HALF-OPEN, sends 10% probe traffic
01:48 PM - Probes successful (bank maintenance over)
01:48 PM - Circuit returns to CLOSED
The user never feels anything, even when half of our gateways are under maintenance.
Step 4: The Actual Gateway Call (500-3000 ms)
When we make the actual call to the gateway, three kinds of response can come back:
Success (HTTP 200/201): Best case, money moved, everyone happy, just send a success message to user.
Clear Failure (e.g. Insufficient balance, card blocked, user's bank server down): Error occurred due to a known reason. Can return error to user or retry another gateway if required.
Timeout (No response after 3 seconds): This is the real nightmare. You don't know if the gateway received your request and processed it, failed, or never received it. Retry immediately and you might double-charge the user. Don't retry and you might lose a legitimate transaction. In this case mark it as PENDING_VERIFICATION and push in a queue (RabbitMQ) and process later.
Step 5: The Retry Decision (3-4 ms)
When a gateway call fails, the system classifies every error:
Retriable (e.g. NETWORK_ERROR, HTTP 500): Try a different gateway.
Timeout-like: Don't retry! Queue for async verification. Mark as PENDING_VERIFICATION, enqueue for status check worker.
User errors (e.g. INSUFFICIENT_FUNDS): Don't retry, show error to user.
Example: Attempt 1 → Razorpay returns NETWORK_ERROR.
Attempt 2 → PhonePe times out; return "Payment processing..." to the user. A background worker later calls PhonePe's status API and updates the payment. This prevents double-charging while handling ambiguous states.
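The classifier from Step 5 is essentially a mapping from error code to action. A minimal sketch, with illustrative error codes (the real repo's codes may differ):

```typescript
// Retry decision sketch: classify each gateway error into one of three actions.
type RetryAction = "RETRY_OTHER_GATEWAY" | "VERIFY_ASYNC" | "FAIL_TO_USER";

function classifyError(code: string): RetryAction {
  switch (code) {
    case "NETWORK_ERROR":
    case "HTTP_500":
      return "RETRY_OTHER_GATEWAY"; // safe to retry: the charge never happened
    case "TIMEOUT":
      return "VERIFY_ASYNC";        // ambiguous: never retry blindly, verify later
    case "INSUFFICIENT_FUNDS":
    case "CARD_BLOCKED":
      return "FAIL_TO_USER";        // retrying another gateway can't fix the user's bank
    default:
      return "VERIFY_ASYNC";        // unknown errors are treated as ambiguous
  }
}
```

Treating unknown errors as ambiguous (rather than retriable) is the safe default: a wrong "retry" risks a double charge, a wrong "verify" only delays confirmation.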
Step 6: The Latency Budget
Users won't wait forever (we can't try all 10-15 gateways). The system has a latency budget - typically 8-10 seconds total for all retry attempts. When the budget is running low, it strategically skips slow gateways and tries faster ones (even if their success rate is lower). That way you fit the maximum number of attempts into the same time, which increases the payment's chance of success.
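Budget-aware selection can be sketched as a filter before the score comparison. Names and structure here are assumptions, not the repo's API:

```typescript
// Latency-budget sketch: drop gateways whose expected P95 latency
// no longer fits in the remaining budget, then pick the best of the rest.
interface Candidate { name: string; p95LatencyMs: number; score: number }

function pickNextGateway(
  candidates: Candidate[],
  remainingBudgetMs: number
): Candidate | null {
  const affordable = candidates.filter(c => c.p95LatencyMs <= remainingBudgetMs);
  if (affordable.length === 0) return null; // budget exhausted: stop retrying
  // Among affordable gateways, still prefer the highest score.
  return affordable.reduce((best, c) => (c.score > best.score ? c : best));
}
```

Note the trade-off this encodes: late in the budget, a fast mediocre gateway beats a slow excellent one, because an attempt that can't finish is worth zero.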
Step 7: The Metrics Pipeline- How the System Learns
Payment Attempt → Save to DB → Publish to Event Stream → Metrics Worker. The worker updates sliding-window metrics (1 min, 5 min, 15 min, 60 min), stores hot metrics in Redis and historical data in PostgreSQL. The system primarily uses the 30- or 60-minute window for routing: recent enough to be relevant, stable enough not to overreact to noise.
What Gets Tracked
Success rate, timeout rate, latency percentiles (P50, P95, P99), error distribution, and total volume - per (gateway, payment_method, bank). Data updates every second.
- P95, for example, means the time within which 95% of payments complete.
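A sliding-window success rate and P95 can be computed like this. A plain array stands in for the Redis structure the metrics worker would use; the function names are illustrative.

```typescript
// Sliding-window metrics sketch: success rate and P95 over the last windowMs.
interface Sample { timestampMs: number; success: boolean; latencyMs: number }

function windowStats(samples: Sample[], nowMs: number, windowMs: number) {
  // Keep only samples inside the window (Redis would use a sorted set + ZREMRANGEBYSCORE).
  const recent = samples.filter(s => nowMs - s.timestampMs <= windowMs);
  const successes = recent.filter(s => s.success).length;
  const latencies = recent.map(s => s.latencyMs).sort((a, b) => a - b);
  // P95 via the nearest-rank method: the latency 95% of requests finish under.
  const p95 =
    latencies[Math.min(latencies.length - 1, Math.ceil(0.95 * latencies.length) - 1)] ?? 0;
  return { successRate: recent.length ? successes / recent.length : 0, p95 };
}
```

Production systems usually keep approximate percentiles (histograms or t-digests) instead of raw samples, but the idea is the same.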
Problem: If we always choose the best gateway, we may end up exhausting it very quickly.
- We need the system to also explore other gateways - such as those ranked 2nd or 4th - giving them opportunities to improve their scores so they aren't neglected. Otherwise, a gateway left unused for 60 minutes would have no recent data and fall out of the ranking, which exploration prevents.
Solution: A/B Testing and Thompson Sampling
We'll run experiments (e.g. 85% control, 15% treatment) with deterministic assignment by customer_id, then statistical testing and auto-stop guardrails if treatment harms users.
Thompson Sampling (multi-armed bandit): For each gateway, maintain a Beta (alpha, beta) belief about success rate. When routing, sample a random success probability from each distribution and pick the highest. After the payment, update alpha on success, beta on failure. Gateways with little data get explored.
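Thompson Sampling is short enough to sketch directly. Since alpha and beta stay integers here (they start at 1 and increment by 1), a Beta draw can be built from Gamma draws, each a sum of exponentials; the names are my own.

```typescript
// Thompson Sampling sketch: sample a success probability from each
// gateway's Beta(alpha, beta) belief and pick the highest draw.
interface Arm { name: string; alpha: number; beta: number }

function gammaInt(k: number): number {
  // Gamma(k, 1) for integer k: sum of k Exp(1) draws.
  let sum = 0;
  for (let i = 0; i < k; i++) sum += -Math.log(1 - Math.random());
  return sum;
}

function sampleBeta(alpha: number, beta: number): number {
  const x = gammaInt(alpha);
  return x / (x + gammaInt(beta)); // standard Beta-from-Gamma construction
}

function pickArm(arms: Arm[]): Arm {
  let best = arms[0];
  let bestDraw = -1;
  for (const arm of arms) {
    const draw = sampleBeta(arm.alpha, arm.beta);
    if (draw > bestDraw) { bestDraw = draw; best = arm; }
  }
  return best;
}

function update(arm: Arm, success: boolean): void {
  if (success) arm.alpha += 1; else arm.beta += 1;
}
```

A gateway with little data has a wide Beta distribution, so it occasionally draws a high sample and gets traffic - exploration falls out of the math for free.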
The Architecture
Request → Check Idempotency → Build Payment Context → Resolve Experiments → Score All Gateways → Apply Bandit Reordering (if enabled) → Retry Loop: for each gateway check circuit breaker, call API, update circuit breaker, classify error, check latency budget → Persist payment, routing decision, attempts, outbox event → Background: metrics worker, verification worker, experiment analyzer, webhooks.
What This Achieves
Reliability: Failover in <500ms, circuit breakers prevent cascade failures.
Performance: 10ms routing decisions, intelligent latency budgeting.
Safety: Idempotency, timeout handling.
Intelligence: Learns which gateways work best; A/B tests and bandits for low data exploration.
The Hard Parts
Circuit breaker thresholds are highly sensitive; too aggressive or too conservative both hurt.
Metrics need context: segment by (gateway, method, bank), not just gateway.
Cold start: New gateways have no data; use a bootstrap period (e.g. 10% forced traffic for first hour).
Idempotency edge cases: Same key, different amount → hash request body and reject if mismatch.
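That last edge case can be sketched concretely: hash the request body on first use of a key, and reject any reuse with a different payload. A hypothetical sketch using Node's built-in crypto module:

```typescript
// Idempotency edge-case sketch: same key + different body → reject.
import { createHash } from "node:crypto";

const keyHashes = new Map<string, string>(); // key → body hash (Redis in production)

function acceptRequest(idempotencyKey: string, body: object): boolean {
  const hash = createHash("sha256").update(JSON.stringify(body)).digest("hex");
  const seen = keyHashes.get(idempotencyKey);
  if (seen === undefined) {
    keyHashes.set(idempotencyKey, hash);
    return true;            // first use of this key
  }
  return seen === hash;     // replay with the same body is fine; a mismatch is not
}
```

(JSON.stringify is order-sensitive, so a production version would canonicalize the body before hashing.)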
How I built it:
In my implementation, I split the system into three NestJS microservices:
payment-orchestrator: payment creation, gateway scoring, retry logic, circuit breaker, outbox, and verification flow.
metrics-service: consumes Kafka events and maintains Redis hot metrics plus PostgreSQL history.
experiment-service: handles A/B assignment, experiment outcomes, and Thompson Sampling bandit state.
Kafka is used for payment attempt/completion events.
RabbitMQ is used for async verification and webhook-style background work.
Redis stores hot routing metrics, idempotency, circuit breaker state, and experiment assignment cache.
PostgreSQL stores durable payment, attempt, routing, metric, and experiment data.
Want to try it yourself?
Here is a repo with a practical implementation:
Github Repo Link: https://github.com/Devendraxp/payment-gateway
Tech stack: NestJS, Postgres, TypeORM, Redis, Docker, RabbitMQ, Kafka
A docker-compose file is available in the GitHub repo.
Follow me: Devendra Jat | LinkedIn

