How Auto Scaling Fits into the Bigger Picture

January 16, 2026

So far, we have looked at two core building blocks of a scalable system.

First, we learned that a load balancer is intentionally passive. It doesn’t create servers or decide when to scale. It simply routes traffic to a set of known, healthy instances.

Then, we explored the role of the control plane. The control plane owns the system’s desired state. It decides which instances should exist and registers valid backends for a service, which the load balancer then uses for routing.

With these pieces in place, the next question naturally follows: how does the system decide when to add or remove servers?


What Auto-Scaling Really Is

Auto-scaling is not about scaling traffic. It is about reconciling desired state with observed reality.

Auto-scaling lives squarely inside the control plane.

It:

  • Observes signals
  • Makes decisions
  • Changes the system state
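As a rough sketch, that loop might look like this in code. Every name and threshold here is illustrative, not taken from any real autoscaler:

```python
# Minimal sketch of the observe -> decide -> act loop. All names and
# thresholds are illustrative.

def observe(metrics_source):
    """Collect a service-level observation (here, just average CPU)."""
    return {"avg_cpu": metrics_source()}

def decide(signals, current_size, cpu_high=0.80, cpu_low=0.30):
    """Turn observations into a target size; never touch servers here."""
    if signals["avg_cpu"] > cpu_high:
        return current_size + 1
    if signals["avg_cpu"] < cpu_low and current_size > 1:
        return current_size - 1
    return current_size

def reconcile_once(metrics_source, desired_state):
    """One pass of the loop: observe, decide, record a new desired state."""
    signals = observe(metrics_source)
    desired_state["size"] = decide(signals, desired_state["size"])
    return desired_state

state = {"size": 4}
reconcile_once(lambda: 0.92, state)   # sustained high CPU observed
print(state["size"])                  # 5: one more instance requested, not created
```

Note that `decide` only returns a number; nothing in this loop creates a server. That is the shape of the whole post in miniature.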

The Problem Auto-Scaling Solves

Any system or service faces two opposing risks:

  • Too few servers: This could result in overload and failures.
  • Too many servers: This could result in wasted resources.

Auto-scaling runs continuously to balance these two risks, repeatedly answering one question:

What should the size of this service be right now?


How Auto-Scaling Acts

Auto-scaling acts based on signals that reflect how a service behaves over time.

Examples:

  • CPU usage
  • Memory pressure
  • Request latency
  • Queue depth
  • Error rates

Signals vs Decisions

One of the most important ideas in auto-scaling is the separation between signals and decisions.

Signals are observations. Decisions are actions.

Auto-scaling works by carefully keeping these two apart.

These signals are observed and configured at the service level. The decision to add or remove servers for a service is derived from those observations.


Understanding Signals

Auto-scaling doesn’t make decisions randomly. It relies on signals: observations that describe how a service behaves over time.

Not all signals come from the same place. In practice, auto-scaling systems tend to rely on three broad categories: metrics, schedules, and predictions. Each one exists to handle a different kind of workload pattern.

1. Metrics-Based Signals (Reacting to What’s Happening Now): This is the most intuitive form of auto-scaling and usually the first one people encounter.

Here, the system watches how busy the service is and reacts when pressure builds up.

Common examples of these signals include:

  • CPU usage staying high
  • Memory steadily filling up
  • Requests taking longer to complete
  • Queues growing faster than they are being drained

A simple mental model could be a restaurant kitchen. If orders start piling up and cooks are constantly busy, that’s a signal that more help is needed.

In simple terms, sustained CPU usage above a certain threshold often indicates the need for more capacity.

Metrics-based scaling works best when traffic is unpredictable. It doesn’t assume anything about the future; it simply responds to sustained pressure in the present.
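Here is a toy version of a sustained-threshold check, assuming an illustrative 80% CPU threshold and a five-sample window. The point is that one dip (or one spike) inside the window is enough to keep the signal from firing:

```python
from collections import deque

# Sketch of a "sustained CPU above threshold" signal: it fires only when
# every sample in a sliding window exceeds the threshold, so a brief
# spike does not trigger a decision. Threshold and window are made up.

class SustainedCpuSignal:
    def __init__(self, threshold=0.80, window=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def record(self, cpu):
        self.samples.append(cpu)

    def firing(self):
        full = len(self.samples) == self.samples.maxlen
        return full and all(s > self.threshold for s in self.samples)

signal = SustainedCpuSignal()
for cpu in [0.95, 0.40, 0.90, 0.91, 0.92]:   # one dip breaks the run
    signal.record(cpu)
print(signal.firing())   # False: pressure was not sustained
```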

2. Scheduled Signals (Reacting to What You Already Know): Some traffic patterns aren’t surprising at all.

For example:

  • Traffic increases every weekday morning
  • Load drops sharply at night
  • A weekly batch job always runs at the same time

In these cases, waiting for metrics to rise is unnecessary. The system already knows when load is coming.

Scheduled signals allow capacity to be adjusted ahead of time.

Think of this like opening extra checkout counters before a known rush hour, instead of waiting for lines to form.

This approach works well when traffic is predictable and repeatable.

For example, administrators can define specific times and dates for scaling actions to occur (e.g., scale up to 10 instances every Monday-Friday at 9:00 AM, scale down to 2 instances at 6:00 PM).
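A minimal sketch of that kind of schedule, using the example sizes and times above. The data layout is invented for illustration, not how any real scheduler stores its rules:

```python
from datetime import datetime

# Schedule-based desired size, assuming the weekday business-hours
# pattern from the text: 10 instances on weekdays 9:00-18:00,
# a baseline of 2 instances otherwise.

SCHEDULE = [
    # (weekdays_only, start_hour, end_hour, desired_size)
    (True, 9, 18, 10),
]
BASELINE = 2

def desired_size_at(now: datetime) -> int:
    for weekdays_only, start, end, size in SCHEDULE:
        day_matches = (not weekdays_only) or now.weekday() < 5
        if day_matches and start <= now.hour < end:
            return size
    return BASELINE

print(desired_size_at(datetime(2026, 1, 19, 10, 0)))  # Monday 10:00 -> 10
print(desired_size_at(datetime(2026, 1, 17, 10, 0)))  # Saturday -> 2
```

Notice that no metric is consulted at all; the calendar is the signal.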

3. Predictive Signals (Reacting to What’s Likely to Happen): Predictive signals sit between metrics and schedules.

Instead of reacting only to the present or relying on fixed times, the system looks at historical behavior and asks:

Based on what usually happens, what is likely to happen next?

For example:

  • Traffic usually ramps up gradually before a major event
  • Certain days show repeating growth patterns
  • New instances take time to warm up

By learning these patterns, the system can start scaling before pressure becomes visible in metrics.

A simple analogy is preparing food before guests arrive because past experience tells you when they usually show up.

Predictive signals are especially useful when starting new instances is slow and reacting late would cause user-visible delays.
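As a toy predictor, assume a simple linear extrapolation over recent load and a made-up per-instance capacity. Real predictive scalers use far more sophisticated models, but the shape is the same: forecast first, provision ahead of the pressure:

```python
import math

# Toy predictive signal: extrapolate the next interval's load from the
# recent trend, then size the fleet for the forecast now, because new
# instances take time to warm up. The model and numbers are illustrative.

def forecast_next(history):
    """Extrapolate one step ahead using the average recent step size."""
    steps = [b - a for a, b in zip(history, history[1:])]
    trend = sum(steps) / len(steps)
    return history[-1] + trend

def instances_for(load, per_instance_capacity=100):
    return max(1, math.ceil(load / per_instance_capacity))

history = [200, 260, 320, 380]        # requests/s ramping before an event
predicted = forecast_next(history)    # 440.0
print(instances_for(predicted))       # 5: scale now, before metrics catch up
```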


How an Instance Failure Becomes a Signal

Let’s trace how a failing instance indirectly becomes a signal, as a concrete example of the auto-scaling flow.

1. Through Reduced Capacity

When an instance stops serving traffic:

  • Fewer instances of the service handle the same workload
  • Remaining instances experience higher load

This shows up as:

  • Increased CPU or memory usage
  • Higher latency
  • Longer queues

Now the service-level signals change.

2. Through Error Trends

If failures are widespread, for example multiple instances losing DB connectivity, error rates increase across the service.

Auto-scaling doesn’t see why errors happen. It sees:

  • Sustained degradation
  • Correlated failures

That pattern becomes a signal.

3. Through Backlog Growth

If traffic continues but capacity drops:

  • Requests pile up
  • Queues grow
  • Latency increases

Again, this is observed over time, not per request.

Instance failure
      ↓
Capacity reduction
      ↓
Service-level pressure
      ↓
Observed signals
      ↓
Auto-scaling decision

Each of these failure scenarios produces a signal only after its impact is sustained for some period of time. Remember: a brief spike or a single failing instance will not result in signal creation.

The failure itself is not the signal. The impact of the failure is.
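A tiny numeric illustration of the first path, with made-up traffic numbers. The autoscaler never sees "instance X died"; it sees per-instance load rising because the same traffic is now spread over fewer healthy instances:

```python
# Illustrative numbers only: the same total traffic over fewer instances
# raises per-instance load, and THAT rise is what the signals capture.

def per_instance_load(total_rps, healthy_instances):
    return total_rps / healthy_instances

TOTAL_RPS = 1200
before = per_instance_load(TOTAL_RPS, 6)   # 200.0 rps per instance
after = per_instance_load(TOTAL_RPS, 4)    # two instances fail: 300.0 each

print(before, after)
```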


What Auto-Scaling Does Not Do

  • React to a single failing instance
  • Inspect readiness endpoints
  • Diagnose root causes
  • Replace unhealthy instances directly

These responsibilities belong to other parts of the control plane.


How Auto-Scaling Updates the Control Plane

Auto-scaling does not add or remove servers directly. Instead, it works by updating the desired state managed by the control plane.

This distinction is important.

Auto-scaling observes signals and decides whether the current size of a service is too small, too large, or appropriate. When it decides that a change is needed, it expresses that decision as a change in intent, not as an immediate action.

From Observation to Intent

Once sustained signals indicate pressure or underutilization, auto-scaling updates the control plane with a new desired service size.

For example:

  • This service should run with 6 instances instead of 4
  • This service can safely shrink from 10 instances to 7

At this point, no servers have been created or removed yet. Only the target state has changed.
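A sketch of that distinction, using a plain dict to stand in for whatever store the control plane actually uses. The only write auto-scaling performs is to the target size; the actual fleet is untouched:

```python
# "Intent, not action": the scaling decision is recorded as a change to
# the desired state. Nothing is created or destroyed in this step.

desired_state = {"service": "checkout", "size": 4}
actual_instances = ["i-1", "i-2", "i-3", "i-4"]

def request_resize(state, new_size):
    """Auto-scaling's only write: a new target size."""
    state["size"] = new_size

request_resize(desired_state, 6)
print(desired_state["size"], len(actual_instances))  # 6 4: a gap, no new servers yet
```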

The Control Plane Takes Over

After the desired state is updated:

  • The control plane compares the desired number of instances with the actual number
  • If there is a gap, lifecycle controllers take action
  • New instances are created or excess ones are removed

Auto-scaling does not care how this happens. It only cares that the declared state is eventually reached.
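A minimal sketch of that reconciliation step. The list-of-ids fleet model is illustrative; the point is that closing the gap between desired and actual is the control plane's job, not the autoscaler's:

```python
import itertools

# Reconciliation: compare desired and actual counts, then ask the
# lifecycle layer to close the gap. Names and the list model are made up.

_ids = itertools.count(1)

def reconcile(desired_size, instances):
    """Create or terminate instances until actual matches desired."""
    while len(instances) < desired_size:
        instances.append(f"i-{next(_ids)}")   # lifecycle: create
    while len(instances) > desired_size:
        instances.pop()                       # lifecycle: terminate
    return instances

fleet = reconcile(3, [])
print(fleet)               # three freshly created instances
fleet = reconcile(2, fleet)
print(len(fleet))          # 2
```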

How Load Balancers Fit In

As instances are added or removed:

  • New instances are registered as valid backends
  • Unhealthy or removed instances are deregistered
  • Load balancers update their local routing tables automatically

At no point does auto-scaling talk to the load balancer directly.

Auto-scaling changes what the system should look like. The control plane handles how the system gets there.

This separation is what keeps large systems predictable under load.

Service Signals
(CPU, latency, errors)
        │
        ▼
 Auto-Scaling Logic
        │
 updates desired size
        ▼
   Control Plane
(desired service state)
        │
 reconciles state
        ▼
 Instance Lifecycle
(create / terminate)
        │
 registers backends
        ▼
  Load Balancer
(routes traffic to the replicas of the service)

Summary

In this post, we explored how auto-scaling fits into the larger system design picture.

Auto-scaling operates within the control plane as a feedback loop that observes service-level signals and updates the desired size of a service over time. The control plane then reconciles this desired state by creating or removing instances, while load balancers passively adapt to these changes.

Understanding auto-scaling as a state-driven process, rather than a reactive traffic mechanism, helps clarify why scalable systems remain stable even as load and failures fluctuate.


What’s Next

This post focused on how auto-scaling works. In upcoming posts, I plan to cover:

  • How Ingress Controllers Route Traffic to Different Services — and How They Differ from Load Balancers

Each post builds on the previous one, starting from fundamentals and gradually moving toward more complex system design concepts.


More posts in this series coming soon.