
Introduction
In an earlier post I covered how to trim Strapi REST payloads using populate.on. Query tuning helps, but the underlying problem remains: every read hits the database, and Strapi is the bottleneck under load.
This post covers what happened when I put Redis (via AWS ElastiCache) in front of Strapi — the load test, the results, the release plan, and the trade-offs I accepted.
The Setup
I benchmarked two environments under identical conditions to isolate the effect of Redis caching.
Test conditions:
- Tool: k6-style load test
- Virtual users: 100
- Duration: 1 minute
- Profile: Fixed concurrency
- Endpoints under test: four high-volume Strapi endpoints powering the frontend
Two environments:
- QA without Redis — vanilla Strapi, hitting the database for every request
- Pre-prod with Redis — same Strapi, ElastiCache (Redis) in front
Before: Strapi Without Redis
| Endpoint | Total Requests | Req/s | Avg (ms) | p90 (ms) | p95 (ms) | p99 (ms) | Error % |
|---|---|---|---|---|---|---|---|
| Endpoint A | 549 | 8.19 | 2,695 | 8,785 | 13,418 | 19,060 | 0.18 |
| Endpoint B | 533 | 7.95 | 813 | 1,838 | 2,115 | 2,686 | 0.19 |
| Endpoint C | 495 | 7.38 | 2,549 | 6,585 | 7,420 | 9,263 | 0 |
| Endpoint D | 473 | 7.06 | 1,639 | 3,847 | 4,895 | 5,516 | 0 |
The numbers tell the story. Average response times of 813–2,695 ms for content that almost never changes. One endpoint’s p99 was creeping toward 20 seconds. Throughput plateaued at 7–8 req/s per endpoint.
After: Strapi With Redis (ElastiCache)
| Endpoint | Total Requests | Req/s | Avg (ms) | p90 (ms) | p95 (ms) | p99 (ms) | Error % |
|---|---|---|---|---|---|---|---|
| Endpoint A | 2,658 | 39.64 | 452 | 64 | 1,162 | 13,397 | 1.69 |
| Endpoint B | 2,655 | 39.59 | 66 | 86 | 92 | 126 | 0.45 |
| Endpoint C | 2,651 | 39.53 | 66 | 77 | 86 | 120 | 0.41 |
| Endpoint D | 2,646 | 39.46 | 66 | 57 | 68 | 115 | 0.38 |
The Comparison
| Metric | Without Redis | With Redis | Change |
|---|---|---|---|
| Requests/sec | ~7–8 | ~39–40 | ~5× throughput |
| Average response time | 813–2,695 ms | 66–452 ms | ~83–97% reduction |
| p99 latency | 2,686–19,060 ms | 115–13,397 ms* | ~95–99% lower on B–D; ~30% lower on A |
| Error rate | 0–0.19% | 0.38–1.69% | small uptick |
*The Endpoint A p99 outlier is discussed below.
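The Change column can be reproduced from the two result tables. Below is a small sketch of that arithmetic using the Req/s and Avg figures reported above; the type and helper names are illustrative only and not part of any load-testing tool.

```typescript
// Figures copied from the two result tables above (Req/s and Avg ms).
interface Row {
  endpoint: string;
  reqPerSec: number;
  avgMs: number;
}

const withoutRedis: Row[] = [
  { endpoint: "A", reqPerSec: 8.19, avgMs: 2695 },
  { endpoint: "B", reqPerSec: 7.95, avgMs: 813 },
  { endpoint: "C", reqPerSec: 7.38, avgMs: 2549 },
  { endpoint: "D", reqPerSec: 7.06, avgMs: 1639 },
];

const withRedis: Row[] = [
  { endpoint: "A", reqPerSec: 39.64, avgMs: 452 },
  { endpoint: "B", reqPerSec: 39.59, avgMs: 66 },
  { endpoint: "C", reqPerSec: 39.53, avgMs: 66 },
  { endpoint: "D", reqPerSec: 39.46, avgMs: 66 },
];

// Throughput multiplier: after / before.
function speedup(before: number, after: number): number {
  return after / before;
}

// Percentage reduction: (before - after) / before * 100.
function reductionPct(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// One comparison row per endpoint, rounded to one decimal place.
const comparison = withoutRedis.map((before, i) => {
  const after = withRedis[i];
  return {
    endpoint: before.endpoint,
    throughputX: Number(speedup(before.reqPerSec, after.reqPerSec).toFixed(1)),
    avgReductionPct: Number(reductionPct(before.avgMs, after.avgMs).toFixed(1)),
  };
});

console.log(comparison);
```

With these inputs the per-endpoint speedup works out to roughly 4.8× to 5.6×, and the average-latency reduction to roughly 83% (Endpoint A) through 97% (Endpoint C).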
What jumps out
- ~5× throughput across the board. Same hardware, same Strapi, just Redis in front. The DB stopped being the limiter.
- Latency collapse. Three of four endpoints settled at a flat 66 ms average with Redis — at that point I was measuring Redis + network, not Strapi.
- p90/p95 stability. The non-Redis case had wide tails (p95 of 2.1–13.4 seconds). With Redis, p90/p95 sit between 57 and 92 ms on the three well-cached endpoints (B, C, and D). Tail latency is where Redis really earns its keep.
- Modest error increase (still < 2%). A small price to pay; the likely causes are cache-miss races and edge cases on first writes during the test window.
The Endpoint A outlier
Endpoint A (the heaviest of the four, and frequently invalidated by editor writes) still had a high p99 with Redis (13,397 ms) and the highest error rate (1.69%). Two likely causes:
- Cache miss → DB fallback under contention. When the cache key expires during a test burst, every concurrent request that hits the miss races to the DB.
- Higher write/invalidation churn. This endpoint is touched more often by editors, so the cache is invalidated more frequently.
Mitigations on the roadmap: stale-while-revalidate semantics on this key, slightly longer TTL with manual invalidation on publish, and request-coalescing at the cache layer so only one request goes to the DB on a miss while others wait on the result.
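Of the three mitigations, request coalescing is the easiest to illustrate in isolation. The following is a minimal sketch, not the implementation used in this project: `SingleFlight`, `loadFromDb`, and the `"endpoint-a"` key are hypothetical names, and the map lives in process memory, so it only coalesces requests within a single Strapi pod.

```typescript
// Hypothetical helper, not a Strapi or Redis API: coalesces concurrent
// loads for the same key so only one loader call is in flight at a time.
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  run(key: string, loader: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing !== undefined) return existing; // join the in-flight load

    const promise = loader();
    this.inFlight.set(key, promise);

    // Free the slot on success or failure so a later miss can retry.
    const clear = (): void => {
      if (this.inFlight.get(key) === promise) {
        this.inFlight.delete(key);
      }
    };
    promise.then(clear, clear);
    return promise;
  }
}

// Usage: two concurrent misses on the same key trigger one DB read.
let dbCalls = 0;
const loadFromDb = async (): Promise<string> => {
  dbCalls += 1; // stand-in for the expensive Strapi/DB query
  return "payload";
};

const flights = new SingleFlight<string>();
const first = flights.run("endpoint-a", loadFromDb);
const second = flights.run("endpoint-a", loadFromDb);

Promise.all([first, second]).then(([a, b]) => {
  console.log(a === b, dbCalls); // prints: true 1
});
```

Coalescing across several pods would need a shared lock or similar coordination in Redis itself, which this sketch does not attempt.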
The Quietly Huge Result: Infra Cost
This is the number that surprised me most:
Without Redis: the Strapi pod spiked to over 1 GB memory and Kubernetes auto-scaled to 4 pods to absorb the load.
With Redis: the same load was handled with < 450 MB memory and just 2 pods — a >50% reduction in compute footprint.
Caching isn’t only a latency story; it’s a resource-utilisation story. Half the pods, less than half the memory, for the same traffic. At cloud-bill-paying scale that compounds quickly.
How I Released It
I didn’t flip Redis on in production blind. The rollout had two stages — QA first, then production — each with explicit acceptance criteria.
QA Release Requirements
- ElastiCache server — used the existing provisioned instance.
- Strapi service replica — deployed a replica of the current Strapi service to QA.
- Strapi DB replica — for consistent test data.
- Dedicated QA server — Strapi base URL updated to point to the QA env.
Environment Variable Changes
Four new variables, added at runtime:
REDIS_PASSWORD=<>
REDIS_HOST=<>
REDIS_PORT=<>
REDIS_USERNAME=<>
One existing variable updated to include the Redis host:
ALLOWED_CORS_ORIGIN=<>
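A sketch of how the four Redis variables might be read and validated at startup. It assumes all four are required and that the port must be a valid TCP port; `loadRedisConfig` and the `RedisConfig` shape are hypothetical, and the example values are placeholders rather than real settings.

```typescript
// Hypothetical config shape; fields mirror the four variables above.
interface RedisConfig {
  host: string;
  port: number;
  username: string;
  password: string;
}

type Env = Record<string, string | undefined>;

const REQUIRED_VARS = [
  "REDIS_HOST",
  "REDIS_PORT",
  "REDIS_USERNAME",
  "REDIS_PASSWORD",
] as const;

// Fails fast when a variable is missing or the port is not a valid TCP port,
// so a misconfigured pod stops at boot instead of failing on the first request.
function loadRedisConfig(env: Env): RedisConfig {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing Redis environment variables: ${missing.join(", ")}`);
  }

  const port = Number(env.REDIS_PORT);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`REDIS_PORT is not a valid port: ${env.REDIS_PORT}`);
  }

  return {
    host: env.REDIS_HOST as string,
    port,
    username: env.REDIS_USERNAME as string,
    password: env.REDIS_PASSWORD as string,
  };
}

// Placeholder values for illustration only.
const exampleConfig = loadRedisConfig({
  REDIS_HOST: "redis.example.internal",
  REDIS_PORT: "6379",
  REDIS_USERNAME: "strapi",
  REDIS_PASSWORD: "change-me",
});

console.log(exampleConfig.host, exampleConfig.port);
```

In a Node process the function would typically be called with `process.env`; taking the environment as a parameter keeps it easy to test.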
Acceptance Criteria — QA
The QA release was only signed off if all four conditions held:
- Response time: API p50/p90 reduced by at least 50% vs. current production
- Memory: Strapi pod memory ≤ current production usage
- Redis performance benchmarks: stable throughput, latency, hit rate, memory usage under load
- Cache invalidation tests:
  - On update
  - On deletion
  - On TTL expiry
(All four cleared comfortably — by the time QA wrapped, response time was down 90%+ on the high-volume endpoints.)
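The first two criteria (response time and memory) are simple enough to encode as an automated gate. The sketch below is one hypothetical way to do that; `Snapshot` and `checkRelease` are invented names, and the example inputs borrow Endpoint B's p90 figures and the approximate memory bounds from the load test purely for illustration, not as the production baseline. The Redis benchmark and invalidation criteria are not covered.

```typescript
// Hypothetical metric snapshot for one environment.
interface Snapshot {
  latencyMs: Record<string, number>; // keyed by percentile name, e.g. "p90"
  podMemoryMb: number;
}

// Returns the list of failed criteria; an empty list means the gate passes.
function checkRelease(
  baseline: Snapshot,
  candidate: Snapshot,
  minReduction: number = 0.5,
): string[] {
  const failures: string[] = [];

  for (const name of Object.keys(baseline.latencyMs)) {
    const before = baseline.latencyMs[name];
    const after = candidate.latencyMs[name];
    if (after === undefined) {
      failures.push(`${name}: no candidate measurement`);
      continue;
    }
    const reduction = (before - after) / before;
    if (reduction < minReduction) {
      failures.push(
        `${name}: reduced by ${(reduction * 100).toFixed(1)}%, need at least ${minReduction * 100}%`,
      );
    }
  }

  if (candidate.podMemoryMb > baseline.podMemoryMb) {
    failures.push(
      `memory: ${candidate.podMemoryMb} MB exceeds baseline ${baseline.podMemoryMb} MB`,
    );
  }
  return failures;
}

// Illustrative inputs only (Endpoint B p90 and approximate memory bounds).
const baselineSnapshot: Snapshot = { latencyMs: { p90: 1838 }, podMemoryMb: 1024 };
const candidateSnapshot: Snapshot = { latencyMs: { p90: 86 }, podMemoryMb: 450 };

const gateFailures = checkRelease(baselineSnapshot, candidateSnapshot);
console.log(gateFailures.length === 0 ? "gate passed" : gateFailures);
```

Returning a list of named failures rather than a single boolean makes the sign-off record say which criterion blocked a release.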
Production Release
Same env variables, same acceptance criteria:
- Response time reduced by ≥ 50% vs. existing prod
- Strapi pod memory ≤ current prod usage
- Redis (ElastiCache) demonstrates stable throughput, latency, and hit-rate under prod load
The QA pre-flight gave me confidence. The production rollout went without surprises.
Cache Invalidation: The Part That Always Bites
Caching is easy. Invalidation is the part with footguns.
For CMS content like this, three invalidation paths matter:
1. On update (editor publishes a change)
The clean fix: lifecycle hooks in Strapi. On afterUpdate for a given collection, delete the relevant cache keys. This is tag-based invalidation in spirit — content has a tag, the tag’s keys get evicted on write.
2. On deletion
Same pattern as update. afterDelete → evict keys. Skipping this leaves zombie content in the response.
3. On TTL expiry
Time-based fallback for everything you forgot to invalidate manually. TTLs vary by endpoint — short (seconds–minutes) for editorial content, longer (minutes–hours) for footer/navigation/legal-style data.
The trap I’ve seen most often: only using TTLs. It looks correct because content eventually updates, but editors get a bad experience (“why isn’t my edit live yet?”), and you’ll get the worst of both worlds — stale content and unnecessary DB load.
Lifecycle hooks first, TTL as a safety net.
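The three paths can be sketched together. This is a minimal illustration under stated assumptions, not the code from this project: `TaggedCache` is an in-memory stand-in for Redis so the example runs without a server, the `"article"` tag and `"articles:list"` key are hypothetical, and the hook object only mirrors the `afterUpdate` / `afterDelete` naming of Strapi lifecycle hooks.

```typescript
interface CacheEntry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for Redis with TTLs and tag-based eviction.
class TaggedCache {
  private entries = new Map<string, CacheEntry>();
  private keysByTag = new Map<string, Set<string>>();
  private now: () => number;

  // `now` is injectable so TTL behaviour can be exercised without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  set(key: string, value: string, ttlMs: number, tags: string[]): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
    for (const tag of tags) {
      const keys = this.keysByTag.get(tag) ?? new Set<string>();
      keys.add(key);
      this.keysByTag.set(tag, keys);
    }
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // path 3: TTL expiry as the safety net
      return undefined;
    }
    return entry.value;
  }

  // Paths 1 and 2: evict every key registered under the tag.
  invalidateTag(tag: string): void {
    const keys = this.keysByTag.get(tag);
    if (keys === undefined) return;
    keys.forEach((key) => this.entries.delete(key));
    this.keysByTag.delete(tag);
  }
}

const contentCache = new TaggedCache();

// Hook names follow Strapi's lifecycle convention; the bodies are assumptions.
const articleLifecycles = {
  afterUpdate(_event?: unknown): void {
    contentCache.invalidateTag("article");
  },
  afterDelete(_event?: unknown): void {
    contentCache.invalidateTag("article");
  },
};

// Usage: a cached listing disappears as soon as an editor publishes a change.
contentCache.set("articles:list", "[cached listing]", 60_000, ["article"]);
articleLifecycles.afterUpdate();
console.log(contentCache.get("articles:list")); // prints: undefined
```

In a real Strapi project the hook object would be exported from the content type's lifecycles file and the cache calls would go to Redis; the eviction-on-write plus TTL-on-read structure stays the same.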
Trade-offs I Accepted
- Slightly higher error rate (< 2%) under heavy load. Worth the throughput and latency gains.
- A new piece of stateful infrastructure (ElastiCache) to monitor. Standard tooling — CloudWatch metrics on hit rate, evictions, memory, CPU.
- An extra failure mode on reads. If Redis is briefly unavailable, the system degrades to DB reads. The pattern: never let Redis errors propagate to users; log and bypass.
- Cache key discipline. Once you have a cache, you have a place for keys to drift. I standardised key naming early so invalidation logic doesn’t go stale.
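The last two trade-offs lend themselves to a short sketch: a read-through that fails open when the cache errors, plus one function that owns the key format. Everything named here (`CacheClient`, `buildCacheKey`, `readThrough`, the `strapi:<collection>:<params>` convention) is a hypothetical illustration rather than the key scheme or client used in this project.

```typescript
// Minimal cache interface; a real Redis client would sit behind it.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

type Logger = (message: string, error: unknown) => void;

// Hypothetical key convention: strapi:<collection>:<sorted query params>.
// Sorting makes the key independent of parameter order.
function buildCacheKey(collection: string, query: Record<string, string>): string {
  const params = Object.keys(query)
    .sort()
    .map((name) => `${name}=${query[name]}`)
    .join("&");
  return `strapi:${collection}:${params}`;
}

// Read-through that fails open: cache errors are logged and bypassed,
// and the request is served from the database loader instead.
async function readThrough(
  cache: CacheClient,
  key: string,
  ttlSeconds: number,
  loadFromDb: () => Promise<string>,
  log: Logger = (message, error) => console.warn(message, error),
): Promise<string> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return hit;
  } catch (error) {
    log(`cache read failed for ${key}; bypassing`, error);
  }

  const fresh = await loadFromDb();
  try {
    await cache.set(key, fresh, ttlSeconds);
  } catch (error) {
    log(`cache write failed for ${key}; serving uncached`, error);
  }
  return fresh;
}

// Usage with a cache that is "down": the caller still gets DB content.
const unavailableCache: CacheClient = {
  get: async () => {
    throw new Error("connection refused");
  },
  set: async () => {
    throw new Error("connection refused");
  },
};

const articlesKey = buildCacheKey("articles", { page: "1", locale: "en" });
console.log(articlesKey); // prints: strapi:articles:locale=en&page=1
```

Calling `readThrough(unavailableCache, articlesKey, 60, loader)` resolves with the loader's result and logs two warnings instead of throwing. Keeping key construction in one function means the invalidation code and the read path cannot drift apart on naming.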
Key Takeaways
- Redis in front of a CMS is one of the highest-ROI infra changes you can make. ~5× throughput, ~95% latency reduction, 50%+ pod/memory savings — same code, same DB.
- Tail latency is the headline. The averages are nice; the p90/p95 collapsing from seconds to ~70 ms is what users actually feel.
- Caching is a cost story, not just a latency story. Half the pods at the same load adds up fast on cloud bills.
- Lifecycle hooks beat TTLs for editorial content. TTLs are a safety net, not a strategy.
- Roll out with explicit acceptance criteria. “Latency improves by ≥ 50%” is something you can sign off; “feels faster” isn’t.
- Hunt down outliers individually. The heaviest endpoint needed its own treatment — most caching wins are uniform, but the awkward 5% deserves its own plan.
Conclusion
Putting Redis in front of Strapi was the single largest performance win I shipped on the CMS layer — measurable in throughput (5×), in latency (95% reduction on most endpoints), and in infrastructure (half the pods, half the memory). The release was deliberately staged with QA acceptance criteria before production, which kept it boring — exactly what you want from a stateful infra change.
The general lesson: CMS content is read-heavy and tolerates seconds of staleness — that’s an exact match for Redis. If you’re running Strapi (or any headless CMS) at scale without a cache layer, the upside isn’t 10% better — it’s the difference between fighting your infrastructure and forgetting it’s there.