January 16, 2025 · 14 min read

12 SaaS Architecture Decisions That Scale (Or Don't)

The technical choices that determine whether your SaaS can handle 10x growth or falls apart trying.

Valyou, Principal Engineer


I've seen SaaS products handle 100x growth without breaking a sweat. I've also seen products crumble at 2x their launch traffic. The difference usually comes down to a handful of architectural decisions made early. Decisions that seemed reasonable at the time but created ceilings that couldn't be raised without massive rewrites.

Here are the 12 decisions that matter most, and how to make them well.

1. Database: SQL vs NoSQL vs Both

The decision: Do you use PostgreSQL/MySQL, MongoDB/DynamoDB, or a combination?

What scales: It depends on your access patterns, not on what's trendy.

  • SQL (PostgreSQL, MySQL): Scales well for relational data with complex queries. ACID compliance matters for financial data, user management, and anything where consistency is critical. With proper indexing and read replicas, PostgreSQL handles millions of users.
  • NoSQL (MongoDB, DynamoDB): Scales horizontally for simple key-value or document lookups. Better for unstructured data, high write throughput, and when you need to scale reads/writes independently.
  • The hybrid approach: Many successful SaaS products use PostgreSQL for core business data (users, subscriptions, transactions) and DynamoDB/Redis for high-volume operational data (analytics events, session data, caching).

The trap: Choosing NoSQL because "it scales better" without understanding your access patterns. Then discovering you need joins and transactions, and building them poorly in application code.

My recommendation: Start with PostgreSQL. It handles far more scale than most people realize. Add specialized databases when you have specific access patterns that demand them, not before.

2. Multi-tenancy: Shared vs Isolated

The decision: Do all customers share one database, get separate schemas, or get completely isolated infrastructure?

Options:

  • Shared database, shared schema: All tenants in the same tables, distinguished by a tenant_id column
  • Shared database, separate schemas: Each tenant gets their own schema in the same database
  • Separate databases: Each tenant gets their own database instance
  • Separate infrastructure: Each tenant gets dedicated compute and storage

What scales:

Shared schema scales operationally (one deployment, one migration) but creates noisy-neighbor problems and makes it hard to isolate large tenants later.

Separate infrastructure scales revenue (enterprise customers pay for isolation) but multiplies operational complexity.

The trap: Starting with full isolation because "enterprises need it" and then drowning in operational overhead when you have 500 small customers who each need their own deployment managed.

My recommendation: Start with shared database, shared schema with a well-designed tenant_id pattern. Build your queries and indexes to filter by tenant_id efficiently from day one. Add isolation options (separate schema, separate database) as an enterprise upsell when customers actually demand it and will pay for the operational overhead.
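One way to make that tenant_id pattern hard to forget is to funnel every query through a helper that injects the filter. A minimal sketch (the `QuerySpec` shape and table names here are hypothetical, not a real ORM API):

```typescript
// Sketch of a tenant-scoping helper: every query is forced to carry a
// tenant filter, so a missing "WHERE tenant_id = ?" can't leak data
// across customers. The QuerySpec shape is illustrative.

type TenantId = string;

interface QuerySpec {
  table: string;
  where: Record<string, unknown>;
}

// Wraps a raw query spec so the tenant filter is always present.
function scopeToTenant(tenantId: TenantId, spec: QuerySpec): QuerySpec {
  return {
    table: spec.table,
    where: { ...spec.where, tenant_id: tenantId },
  };
}

const scoped = scopeToTenant("acme", {
  table: "projects",
  where: { archived: false },
});
// scoped.where now includes both archived: false and tenant_id: "acme"
```

Pair this with composite indexes that lead with tenant_id (e.g. on (tenant_id, created_at)) so the per-tenant filter stays cheap as tables grow.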

3. Authentication: Build vs Buy

The decision: Do you build authentication yourself or use Auth0/Clerk/Cognito?

What scales:

Building auth yourself scales control. You own every aspect of the user experience and can customize deeply.

Buying auth scales engineering time. You focus on your core product instead of maintaining session management, MFA, SSO integrations, and security patches.

The trap: Building auth yourself and then spending 6 months implementing SAML SSO when an enterprise customer demands it, instead of shipping features.

My recommendation: Buy auth for everything except the most unusual requirements. Auth0, Clerk, and Cognito are mature, secure, and handle the enterprise requirements (SSO, SCIM, audit logs) that you'll eventually need. The cost is trivial compared to the engineering time of building and maintaining auth infrastructure.

The exception: If auth IS your product (identity provider, security tool), obviously build it.

4. API Design: REST vs GraphQL vs tRPC

The decision: What's your primary API paradigm?

What scales:

  • REST: Scales simplicity. Everyone understands it. Caching works naturally with HTTP semantics. Debugging is straightforward.
  • GraphQL: Scales client flexibility. Frontend teams can query exactly what they need without backend changes. But requires careful schema design and query complexity limits.
  • tRPC: Scales type safety for TypeScript teams. End-to-end types from database to frontend. But locks you into TypeScript ecosystem.

The trap: Choosing GraphQL because it's modern, then spending months preventing N+1 queries, implementing rate limiting that accounts for query complexity, and debugging why the server is slow when clients request deeply nested data.

My recommendation: REST for most SaaS products. It's battle-tested, well-understood, and the tooling is mature. If you have a complex product with many different client needs (mobile, web, third-party integrations), consider GraphQL, but invest heavily in schema design and query cost analysis upfront. tRPC is great for internal APIs in TypeScript monorepos.

5. Background Jobs: Queue Architecture

The decision: How do you handle work that shouldn't block HTTP requests?

Options:

  • Synchronous (just do it in the request)
  • Simple queue (Redis + Bull, SQS)
  • Complex workflow (Temporal, Step Functions)

What scales:

Any queue is better than no queue for operations like sending emails, processing uploads, generating reports, or syncing with third-party services.

Simple queues (Bull, SQS) scale well for straightforward job processing with retries.

Workflow engines (Temporal) scale complex multi-step processes with long-running operations, human approvals, and sophisticated error handling.

The trap: Building a "simple" job queue in your database with a status column. It works until it doesn't, and then you're debugging race conditions at 3am during a traffic spike.

My recommendation: Start with a real queue from day one. Bull (Redis-based) or SQS. The overhead is minimal and the reliability improvement is massive. Move to Temporal when you have genuinely complex workflows (multi-step onboarding, order fulfillment, approval chains) that need to survive process restarts and have sophisticated retry/compensation logic.
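To see what a real queue buys you, here's an in-memory sketch of the retry loop Bull or SQS runs for you. Everything that makes queues actually reliable (persistence, visibility timeouts, delayed requeues, dead-letter queues) is deliberately omitted:

```typescript
// In-memory sketch of retry handling for a background job. A real queue
// would persist the job, delay each retry with exponential backoff, and
// route exhausted jobs to a dead-letter queue; none of that is shown here.

type JobResult = { ok: boolean; attempts: number };

async function runWithRetries<T>(
  payload: T,
  handler: (p: T) => Promise<void>,
  maxAttempts = 3,
): Promise<JobResult> {
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts++;
    try {
      await handler(payload);
      return { ok: true, attempts };
    } catch {
      // A real queue would requeue with backoff instead of looping immediately.
    }
  }
  return { ok: false, attempts }; // would be sent to a dead-letter queue
}
```

The point isn't the loop itself; it's that a managed queue gives you this behavior plus durability, so a process crash mid-job doesn't lose the work.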

6. Caching Strategy: Where and What

The decision: What do you cache, where, and for how long?

Layers:

  • Application-level (in-memory)
  • Distributed cache (Redis, Memcached)
  • CDN (Cloudflare, CloudFront)
  • Database query cache

What scales:

Aggressive CDN caching scales static assets and public pages trivially.

Redis caching scales expensive database queries and computed data.

Application caching scales repeated operations within a request.

The trap: Adding caching without a cache invalidation strategy. Your users see stale data for hours because you cached aggressively but forgot to invalidate on updates.

My recommendation: Start with CDN caching for static assets (set long cache times with content hashes in filenames). Add Redis for expensive queries and computed data early. It's cheap and the performance improvement is immediate. Design your cache keys and invalidation patterns before you need them, not after you're debugging stale data bugs.

Rule of thumb: Cache computed data (aggregations, reports), not source data. Invalidate on write, not on timeout when possible.
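One invalidate-on-write pattern worth knowing is versioned cache keys: embed a per-entity version in the key, and bump the version on every write so stale entries become unreachable without an explicit delete. A sketch, with a plain Map standing in for Redis and illustrative key names:

```typescript
// Versioned cache keys: bumping the version on write orphans every stale
// entry (it expires out of the cache naturally) instead of requiring an
// explicit delete per key. The Map stands in for Redis.

const cache = new Map<string, unknown>();
const versions = new Map<string, number>();

// Keys embed the entity's current version, e.g. "report:42:v0".
function cacheKey(entity: string, id: string): string {
  const v = versions.get(`${entity}:${id}`) ?? 0;
  return `${entity}:${id}:v${v}`;
}

// Called from the write path: bump the version so old keys go stale.
function invalidate(entity: string, id: string): void {
  const k = `${entity}:${id}`;
  versions.set(k, (versions.get(k) ?? 0) + 1);
}

cache.set(cacheKey("report", "42"), { total: 100 }); // cached under v0
invalidate("report", "42");                          // a write happened
const stale = cache.get(cacheKey("report", "42"));   // undefined: forces recompute
```

This trades a little cache memory (orphaned entries linger until they expire) for never serving a stale read, which is usually the right trade for computed data.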

7. File Storage: Where Do Files Live?

The decision: How do you handle user uploads and generated files?

Options:

  • Local filesystem
  • Object storage (S3, GCS, R2)
  • Specialized services (Cloudinary, Uploadcare)

What scales:

Object storage (S3) scales infinitely at low cost. It's the default answer for almost everything.

Specialized services scale developer time when you need image transformations, video processing, or complex upload handling.

The trap: Storing files in your database as BLOBs. Works for tiny volumes, becomes a nightmare for backups, performance, and cost at scale.

My recommendation: S3 or Cloudflare R2 from day one. Generate signed URLs for direct upload from clients, store the object key in your database. Use a CDN in front for serving. For images that need transformations (thumbnails, responsive sizes), either use a service like Cloudinary or run an image transformation layer (imgproxy) in front of your object storage.
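The signed-URL idea is simple enough to show in miniature. This is not S3's actual SigV4 presigning (use the AWS SDK for that); it's a conceptual sketch with a made-up host and secret, showing why the client can upload directly without the server proxying bytes:

```typescript
import { createHmac } from "node:crypto";

// Conceptual sketch of a signed upload URL: the server signs the object key
// plus an expiry, so storage can verify the client was authorized without a
// round-trip. NOT S3's real algorithm; host and secret are illustrative.
// (A production verifier would also use a constant-time comparison.)

const SECRET = "demo-secret";

function signUploadUrl(objectKey: string, expiresAtMs: number): string {
  const payload = `${objectKey}:${expiresAtMs}`;
  const sig = createHmac("sha256", SECRET).update(payload).digest("hex");
  return `https://uploads.example.com/${objectKey}?expires=${expiresAtMs}&sig=${sig}`;
}

function verifySignature(
  objectKey: string,
  expiresAtMs: number,
  sig: string,
  nowMs: number,
): boolean {
  if (nowMs > expiresAtMs) return false; // link expired
  const expected = createHmac("sha256", SECRET)
    .update(`${objectKey}:${expiresAtMs}`)
    .digest("hex");
  return sig === expected;
}
```

Your application stores only the object key; the bytes never touch your app servers.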

8. Search: Database vs Dedicated Engine

The decision: Do you search using database queries or a dedicated search engine?

Options:

  • Database LIKE queries
  • PostgreSQL full-text search
  • Elasticsearch/OpenSearch
  • Algolia/Typesense

What scales:

Database LIKE queries don't scale past trivial data volumes or simple exact matches.

PostgreSQL full-text search scales surprisingly well for moderate search needs with proper indexing.

Dedicated search engines (Elasticsearch, Algolia) scale to complex search requirements with faceting, fuzzy matching, and relevance tuning.

The trap: Adding Elasticsearch for a simple search that PostgreSQL could handle, then maintaining a complex sync pipeline and debugging index inconsistencies.

My recommendation: Start with PostgreSQL full-text search. It handles more than people expect with proper tsvector columns and indexes. Move to Elasticsearch/Algolia when you need features PostgreSQL can't provide: faceted navigation, complex relevance scoring, typo tolerance, instant search with highlighting.

9. Real-time: Polling vs WebSockets vs SSE

The decision: How do you push updates to clients?

Options:

  • Polling (client asks every N seconds)
  • Long polling (client asks, server holds until update)
  • WebSockets (persistent bidirectional connection)
  • Server-Sent Events (persistent unidirectional connection)

What scales:

Polling scales simplicity. It works everywhere and is easy to implement.

WebSockets scale interactivity for truly real-time features (chat, collaboration, gaming).

SSE scales one-way updates (notifications, dashboards, live feeds) with simpler infrastructure than WebSockets.

The trap: Building WebSocket infrastructure for a dashboard that could refresh every 30 seconds with polling. You're maintaining connection state, handling reconnection, and scaling a stateful service for something that didn't need it.

My recommendation: Start with polling unless you genuinely need sub-second updates. For real-time dashboards and notifications, SSE is simpler than WebSockets and sufficient for most cases. Reserve WebSockets for features that require bidirectional communication (chat, collaborative editing, multiplayer). Consider managed services (Pusher, Ably) for WebSocket infrastructure since the operational overhead of running your own WebSocket servers at scale is significant.
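Plain polling gets even cheaper if you make the interval adaptive: poll fast while there's activity, back off exponentially when the data is quiet. A small sketch (the min/max constants are illustrative):

```typescript
// Adaptive polling interval: reset to the fast rate when an update arrived,
// otherwise double the delay up to a ceiling. For many dashboards this is
// all the "real-time" you need. The 2s/60s bounds are illustrative.

function nextPollDelayMs(
  currentDelayMs: number,
  sawUpdate: boolean,
  minMs = 2_000,
  maxMs = 60_000,
): number {
  if (sawUpdate) return minMs;                // activity: poll quickly again
  return Math.min(currentDelayMs * 2, maxMs); // quiet: back off exponentially
}
```

A client would call this after each fetch and schedule the next one with setTimeout, which keeps idle tabs from hammering your API.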

10. Deployment: Containers vs Serverless

The decision: Do you deploy containers (ECS, Kubernetes) or serverless functions (Lambda, Vercel)?

What scales:

Containers scale control and predictability. You know exactly what's running, can optimize cold start, and handle long-running processes naturally.

Serverless scales operational simplicity and cost efficiency for variable workloads. Zero servers to manage, pay only for execution.

The trap: Going serverless for everything and then fighting cold starts, function timeout limits, and complex deployment coordination when you need long-running processes or persistent connections.

My recommendation: Start with containers (ECS Fargate, Cloud Run, Railway) for your main application. Use serverless for truly event-driven workloads like webhook handlers, scheduled jobs, and image processing. The hybrid approach gives you the best of both: predictable performance for your core app, elastic scaling for spiky workloads.

11. Monitoring: What Do You Measure?

The decision: What instrumentation and observability do you build in?

Layers:

  • Logs (structured, searchable)
  • Metrics (request rates, latencies, error rates)
  • Traces (request flow through services)
  • Alerts (when things break)

What scales:

Structured logging scales debugging. When something breaks at 3am, you can search logs effectively instead of grepping through text files.

Metrics scale understanding. You can see trends, set alerts, and understand system behavior over time.

Distributed tracing scales complexity. When a request touches multiple services, you can follow its path.

The trap: Building observability after you're in production and something is broken. You're flying blind when you need visibility most.

My recommendation: Start with structured JSON logging (pino, winston) and basic metrics (Prometheus, DataDog) from day one. Add tracing when you have multiple services that communicate. Set up alerts for error rates and latency percentiles early, not for every metric, just the ones that indicate user impact. Instrument your most critical paths thoroughly: signup, payment, core features.
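"Structured logging" just means one JSON object per line with a consistent envelope, so your log search can filter on fields instead of grepping strings. A minimal sketch of the shape pino and winston produce (they add levels, transports, and redaction on top; the field names here are just a convention):

```typescript
// Minimal structured-logging sketch: one JSON object per line with a fixed
// envelope (ts, level, msg) plus arbitrary context fields. pino/winston do
// this for real, with far more; the field names are a convention, not a spec.

type Level = "info" | "warn" | "error";

function logEvent(
  level: Level,
  msg: string,
  fields: Record<string, unknown> = {},
): string {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    level,
    msg,
    ...fields,
  });
  console.log(line);
  return line;
}

logEvent("error", "payment failed", {
  tenant_id: "acme",
  order_id: "o_123",
  latency_ms: 412,
});
```

The payoff at 3am: "show me all `level=error` with `tenant_id=acme` in the last hour" is one query, not a regex.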

12. Feature Flags: When and How

The decision: How do you control feature rollout?

Options:

  • Code branches (just ship it)
  • Environment variables (per-environment flags)
  • Feature flag service (LaunchDarkly, Statsig, PostHog)

What scales:

Feature flag services scale risk management. You can roll out features to 1% of users, monitor impact, and disable features without deploying.

Environment variables scale simplicity for small teams with low deployment risk.

The trap: Building an elaborate homegrown feature flag system with database-backed toggle storage, complex targeting rules, and a UI when a managed service would cost $50/month and work better.

My recommendation: Use a feature flag service (LaunchDarkly, Statsig, PostHog) from the start. The cost is trivial, the value is massive: percentage rollouts, user targeting, instant kill switches, experiment integration. You'll use it more than you expect, and the alternative (coordinated big-bang releases) is increasingly risky as your user base grows.
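The core trick behind percentage rollouts is deterministic bucketing: hash the user id (salted by the flag name) into 0–99, so each user is consistently in or out across requests. Flag services do this plus targeting rules and instant overrides; this sketch shows only the bucketing, with an illustrative hash:

```typescript
// Deterministic percentage rollout: hash "flag:userId" into a bucket 0..99.
// The same user always lands in the same bucket for a given flag, so their
// experience is stable as you ramp from 1% to 100%. FNV-1a is illustrative;
// any stable hash works.

function bucketOf(userId: string, flag: string): number {
  let h = 0x811c9dc5; // FNV-1a offset basis
  for (const ch of `${flag}:${userId}`) {
    h ^= ch.codePointAt(0)!;
    h = Math.imul(h, 0x01000193); // FNV prime
  }
  return (h >>> 0) % 100;
}

function isEnabled(userId: string, flag: string, rolloutPercent: number): boolean {
  return bucketOf(userId, flag) < rolloutPercent;
}
```

Salting by flag name matters: it decorrelates rollouts, so the same 5% of users aren't the guinea pigs for every experiment.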


The Meta-Decision: When to Decide

The hardest part of architecture isn't making decisions. It's knowing when to make them.

Decide early: Database choice, multi-tenancy model, authentication approach. These are hard to change later and affect everything.

Decide when you need it: Search engine, real-time strategy, complex workflow orchestration. Don't build infrastructure for problems you don't have yet.

Keep options open: Design interfaces that hide implementation details. If your code talks to an "event bus" abstraction, you can swap implementations later. If it's tightly coupled to SQS APIs, you're locked in.
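The "event bus abstraction" point is just dependency inversion. A sketch of what that looks like in practice (names illustrative; an SQS-backed class would implement the same interface using the AWS SDK):

```typescript
// Callers depend on the EventBus interface, never on a transport. Swapping
// SQS for SNS, Kafka, or an in-memory bus is then a one-file change.
// All names here are illustrative.

interface EventBus {
  publish(topic: string, payload: unknown): Promise<void>;
}

// In-memory implementation for tests and local dev; a hypothetical
// SqsEventBus would implement the same interface over the AWS SDK.
class InMemoryEventBus implements EventBus {
  public published: Array<{ topic: string; payload: unknown }> = [];
  async publish(topic: string, payload: unknown): Promise<void> {
    this.published.push({ topic, payload });
  }
}

// Business logic sees only the interface.
async function onSignup(bus: EventBus, userId: string): Promise<void> {
  await bus.publish("user.signed_up", { userId });
}
```

As a bonus, the in-memory implementation makes the business logic trivially testable without any queue infrastructure running.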

The best architecture is one that handles your current scale reliably while leaving doors open for the next order of magnitude. You don't need to solve for 100x scale on day one, but you should avoid decisions that make 10x scale a rewrite.


Building a SaaS and facing these decisions? [Let's talk architecture](/contact).
