[Architecture/Software] Caching
Caching is the practice of replicating data from an origin datastore into separate storage so it can be served faster. It is an important technique for reducing data response latency and improving performance for both the server that provides data and the clients that consume it.
The storage used to keep replicated origin data is called a cache store. When a server receives a data lookup request, it first checks the cache store: if the requested data is present, the cached value is returned; otherwise the server fetches the origin data and writes it into the cache store. Caching does not only mean a byte-for-byte copy of data from an external database — it also includes saving newly generated data or values composed from multiple sources into a separate store for faster access.
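A minimal sketch of this lookup flow (often called the cache-aside pattern), assuming generic `cache` and `database` clients; the method names and TTL are illustrative, not a specific library's API:

```python
# Cache-aside read path: check the cache store first, fall back to the
# origin datastore on a miss, then populate the cache for later lookups.
def get_product(product_id, cache, database, ttl_seconds=300):
    key = f"product:{product_id}"

    value = cache.get(key)                      # 1. look up the cache store
    if value is not None:
        return value                            # cache hit

    value = database.find_product(product_id)   # 2. cache miss: query the origin
    if value is not None:
        cache.set(key, value, ttl_seconds)      # 3. cache population
    return value
```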
When a requested value exists in the cache store, we call that a cache hit; when it does not, it is a cache miss. The hit rate is the fraction of lookups served from the cache (hits divided by total lookups), so it depends on whether the cached items are the ones that actually receive lookup traffic. A high hit rate means the cache holds the items that are frequently needed; a high miss rate means the cache holds many items that are rarely requested.
Populating the cache — sometimes called cache population or cache computation — is the act of retrieving or generating data and storing it in the cache store. Examples include fetching origin data and storing it directly in cache, performing additional computations before caching a processed result, or generating new data and persisting it in cache.
A cache store can be a physically separate storage on the network or an in-memory area inside an application. When a caching store is placed in its own layer of the system architecture (physically separated on the network), we refer to that as a caching layer.
Local caching vs global caching
Caching can be classified by where cached data is located:
- Local caching
- Global caching
Local caching stores cached data inside a single application without introducing a separate caching layer. The application may cache origin data fetched from a database either as-is or after some transformation. Local caching avoids additional network requests to a caching layer, so it has no network overhead and provides very fast access to cached items. It is suitable when the system consists of a single server, when lookup traffic is low, or when the cached data size is relatively small.
Local caching can be split into client-side and server-side local caching. Client-side local caching means the client (for example a browser or native mobile app) keeps frequently used data locally to reduce repeated requests for the same data — examples include static asset caching in browsers or mobile apps. Server-side local caching means an origin server keeps frequently served responses in memory so it can serve repeated requests quickly instead of recomputing or re-fetching them from a backing database. In microservice architectures a service can act both as a client (requesting data from another service) and as a server (providing data), so both client-side and server-side local caching can be applied.
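As a sketch of server-side local caching, Python's built-in `functools.lru_cache` keeps results in process memory; the `load_user_from_db` function is a stand-in for the real origin lookup:

```python
from functools import lru_cache

def load_user_from_db(user_id: int) -> dict:
    # Stand-in for the real origin lookup (e.g. a SQL query).
    return {"id": user_id, "name": f"user-{user_id}"}

# Server-side local cache: results live in this process's memory, so repeated
# requests for the same user are served without touching the origin datastore.
@lru_cache(maxsize=1024)
def get_user(user_id: int) -> dict:
    return load_user_from_db(user_id)
```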
A drawback of local caching is that cached data is lost when the application restarts or is taken down for maintenance. Clients will then re-request data and servers will re-populate cache items, which may cause a surge of origin datastore queries.
Global caching uses a separate caching layer — one or more dedicated caching servers — to store cache data. From a client’s perspective, any caching server that provides cached values is part of global caching. When the caching layer contains multiple caching servers, we call this distributed caching. In distributed caching environments cached data must be synchronized across caching servers so clients observe consistent results.
In a distributed system where a load balancer spreads client requests across multiple application instances, requests from a single client may not always go to the same server (unless sticky sessions are enabled). If cached data is replicated across multiple caching servers and those replicas are not synchronized, clients may observe inconsistent data.
Using a global caching layer introduces additional network hops and overhead compared with local caching, so access latency is generally higher than purely in-process caches.
Caching lifecycle and strategies
Typical operations on cached data include:
- Cache population (writing cache entries)
- Cache expiration
- Cache eviction
- Cache invalidation
- Cache refresh (re-population or recomputation)
Cache population means storing new cache entries in the cache store. On a cache miss the server either fetches origin data or generates it internally and then writes the result into the cache.
Expiration means an entry is removed from the cache after a configured time-to-live (TTL). Expiration is usually performed automatically by the cache store. Cache entries can also be removed manually or reconfigured with a different TTL; if a stale entry remains in the store and is read, the system should detect expiry and repopulate the cache appropriately.
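A minimal sketch of TTL-based expiration using a plain in-memory dictionary; dedicated cache stores such as Redis perform this expiry automatically, so the class below is only illustrative:

```python
import time

class TTLCache:
    """Tiny in-memory cache in which every entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}                        # key -> (value, expires_at)

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None                         # miss: never cached
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]                # expired: drop the stale entry
            return None                         # caller should repopulate
        return value
```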
Eviction is the removal of entries from a cache to free space when the cache capacity is limited. Eviction can be performed manually or automatically according to a policy. Common eviction policies include:
- LRU (least recently used): evict the items that have been used least recently
- LFU (least frequently used): evict the items that are used least frequently
- FIFO (first in, first out): evict the oldest entries by insertion order
You can also implement a custom eviction algorithm suited to your workload; a minimal LRU sketch follows below.
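The LRU sketch below is built on Python's `collections.OrderedDict`: once capacity is exceeded, the least recently used entry is evicted.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store = OrderedDict()             # least recently used first

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)            # mark as most recently used
        return self._store[key]

    def set(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict the LRU entry
```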
Invalidation occurs when the origin data changes and the corresponding cache entries become obsolete. An invalidated cache entry no longer represents the up-to-date origin data; such entries should be marked invalid or removed so future requests will fetch fresh data. If the origin data change frequency is known and predictable, TTL-based expiration may suffice and explicit invalidation might not be required.
Invalidation methods are broadly classified into two types:
- Manual (explicit) invalidation: the actor that updates origin data also directly invalidates cache entries
- Automatic (event-driven) invalidation: change events are detected and cache entries are invalidated by a separate process
When cache entries are invalidated, subsequent requests will miss in cache and trigger cache population for fresh values.
Difference between eviction and invalidation:
- Eviction is about managing limited cache storage (space) and removing items according to a policy to improve hit rate while controlling space usage.
- Invalidation is about preventing stale data from being served when origin data changes; it ensures data consistency.
A refresh updates an existing cached entry to the latest value (repopulation or recomputation). When an entry is invalidated, the server may refresh it by fetching or computing the new value and storing it in the cache.
Cache invalidation patterns
Invalidation mechanisms vary in complexity depending on the requirements and the caching layer design.
Invalidation itself does not guarantee an immediate repopulation of cache with fresh data — it may simply mark entries invalid or remove them, and repopulation can occur later.
Two main invalidation approaches were described above:
- Manual (explicit) invalidation: the origin updater explicitly invalidates the cache entry when origin data changes
- Automatic (event-driven) invalidation: a separate cache management service detects origin changes and invalidates cache entries
Examples of automatic invalidation include:
- Polling: the cache server periodically queries the database to detect changes and updates the cache when changes are found
- Event-driven: origin data change events are published to a message queue; cache servers subscribe to those messages and update or invalidate cached entries (see the sketch after this list)
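A sketch of the event-driven approach using Redis pub/sub, assuming the `redis-py` client; the channel name, key format, and `save_product_to_db` stub are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
CHANNEL = "cache-invalidation"                  # illustrative channel name

def save_product_to_db(product_id: int, new_data: dict) -> None:
    pass                                        # stand-in for the real origin write

def update_product(product_id: int, new_data: dict) -> None:
    save_product_to_db(product_id, new_data)    # 1. update the origin datastore
    r.publish(CHANNEL, f"product:{product_id}") # 2. announce which key changed

def run_invalidation_subscriber(local_cache: dict) -> None:
    # Each cache server or application instance runs a subscriber like this.
    pubsub = r.pubsub()
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        key = message["data"].decode()
        local_cache.pop(key, None)              # drop the stale local entry
        r.delete(key)                           # remove it from the shared cache too
```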
In distributed systems, invalidation becomes more complex. Consider a system with:
- A caching layer consisting of a replicated cache cluster
- An application layer where a load balancer distributes requests across multiple application instances; each instance may also keep local caches
A typical flow is:
- A single application instance updates origin data and publishes an event to a message queue indicating the related cache entries should be invalidated.
- The message queue propagates the event to all subscribers immediately. All application instances or a dedicated cache management service receive the message.
- A cache management application (or the subscribers) instructs the primary cache node to delete the entry rather than the origin updater deleting it directly.
- Replica invalidation: the cache cluster detects the primary deletion and synchronizes the invalidation to replica nodes.
- Local caches in all application instances invalidate their local entries upon receiving the event.
- Fresh population: when an application receives a request for the invalidated key, the first request triggers a cache miss; that request fetches origin data and repopulates the cache store.
- To prevent thundering herds during repopulation, a distributed lock mechanism can be used (see the sketch below): while one application populates the cache, the others wait or follow a controlled policy, which prevents multiple applications from hitting the origin datastore simultaneously.
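A sketch of the lock-guarded repopulation step, assuming `redis-py` and a string-serializable value; `SET` with `NX` and an expiry acts as the distributed lock, and `fetch_from_origin` is an illustrative callable:

```python
import time
import uuid
import redis

r = redis.Redis()

def repopulate_with_lock(key: str, fetch_from_origin, ttl: int = 300):
    lock_key = f"lock:{key}"
    token = str(uuid.uuid4())

    # Only one instance wins this SET (equivalent to SET lock_key token NX EX 30).
    if r.set(lock_key, token, nx=True, ex=30):
        try:
            value = fetch_from_origin()         # single trip to the origin datastore
            r.set(key, value, ex=ttl)           # repopulate the shared cache
            return value
        finally:
            # Best-effort release: only delete the lock if we still own it.
            if r.get(lock_key) == token.encode():
                r.delete(lock_key)

    # Everyone else waits briefly and reads what the lock holder wrote.
    for _ in range(50):
        value = r.get(key)
        if value is not None:
            return value
        time.sleep(0.1)
    return None                                 # fallback: serve stale data or an error
```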
Distributed caching
Distributed caching stores and shares cache entries across multiple physically separated nodes. It helps distribute load when lookup frequency or data size is large, and can place data closer to clients to reduce latency.
Because multiple servers provide data, distributed caching can offer higher availability, scalability, and performance compared to a single cache node. However, it must ensure data consistency so clients receive coherent responses even when they hit different servers.
One option is a distributed in-memory cache, which keeps cached data in memory across multiple nodes; this is appropriate when the working set fits in memory and low-latency access is required.
Cache stampede
When a server checks a local cache and finds no entry, it fetches or computes the origin value and then stores it in the local cache. If many requests for the same missing or expired key arrive concurrently, multiple application instances or threads may attempt to populate the same cache entry at the same time, which can overload the origin datastore or the cache store. This phenomenon is called a cache stampede or cache miss storm.
Stampede scenario summary:
- The key is absent from cache or has expired.
- A large number of requests arrive at once for that key.
- All requests miss in cache.
- Multiple servers/threads attempt to read from the origin datastore concurrently to populate the cache.
- The origin datastore receives significantly increased load, which may cause higher latency or failures.
- The returned data is written back to the cache store by each writer, causing heavy write traffic.
To avoid redundant concurrent population work, several techniques exist:
- Synchronize population so only a single thread or process performs the cache population (locking).
- Probabilistically reduce the number of population attempts (probabilistic early expiration/recomputation).
- Use asynchronous population instead of synchronous blocking population.
The first method uses synchronization primitives so that a single thread performs the population work. When a cache miss occurs, a thread attempts to acquire a lock for that cache key. The lock holder performs the population; other threads either wait for the lock to be released or follow a fallback policy (a per-key lock sketch follows this list). Options for waiting threads include:
- Block until the lock is released. When the lock is released, waiting threads read the refreshed cache and return consistent results. Blocking reduces throughput while the lock is held, but ensures consistency.
- Return an empty or failure response instead of waiting. This preserves throughput but reduces availability and consistency.
- Return stale data (previous cached value) while the new value is being generated. This preserves throughput but sacrifices consistency.
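A single-process sketch of the lock-based approach that combines two of these policies (only the lock holder repopulates; waiters serve stale data when available, otherwise block); `origin_load` and the cache layout are illustrative:

```python
import threading
import time

_cache = {}                                     # key -> (value, expires_at)
_key_locks = {}                                 # key -> threading.Lock
_locks_guard = threading.Lock()

def _lock_for(key):
    with _locks_guard:
        return _key_locks.setdefault(key, threading.Lock())

def get_with_lock(key, origin_load, ttl=60.0):
    entry = _cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                         # fresh cache hit

    lock = _lock_for(key)
    if lock.acquire(blocking=False):            # only one thread repopulates
        try:
            value = origin_load(key)
            _cache[key] = (value, time.time() + ttl)
            return value
        finally:
            lock.release()

    if entry is not None:
        return entry[0]                         # serve stale data without waiting

    with lock:                                  # no stale value: block until refreshed
        pass
    refreshed = _cache.get(key)
    return refreshed[0] if refreshed else None
```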
The second approach uses probabilistic early expiration or probabilistic recomputation: when the TTL of an item is approaching, a requesting thread can decide — using randomness influenced by remaining TTL — whether to recompute the value early. Items closer to expiry have a higher probability of being recomputed. This spreads recomputation over time and reduces simultaneous recomputation risk, at the cost of occasionally serving slightly stale data.
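A sketch of probabilistic early recomputation in the spirit of the stampede-prevention paper listed in the references (the XFetch formulation); `delta` is the observed recomputation time, `beta` tunes how aggressively entries are refreshed early, and the cache layout is illustrative:

```python
import math
import random
import time

def should_recompute(expires_at: float, delta: float, beta: float = 1.0) -> bool:
    """Decide probabilistically whether to refresh an entry before it expires.

    delta is how long the last recomputation took; the closer we are to
    expires_at (and the larger delta), the more likely an early refresh is.
    """
    rand = random.random() or 1e-16             # guard against log(0)
    return time.time() - delta * beta * math.log(rand) >= expires_at

def get_with_early_refresh(key, cache, recompute, ttl=60.0, beta=1.0):
    entry = cache.get(key)                      # (value, delta, expires_at) or None
    if entry is not None:
        value, delta, expires_at = entry
        if not should_recompute(expires_at, delta, beta):
            return value                        # serve the cached value as usual

    start = time.time()
    value = recompute(key)                      # fetch or compute the fresh value
    delta = time.time() - start                 # remember how long that took
    cache[key] = (value, delta, time.time() + ttl)
    return value
```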
Caching layer performance and redundancy
To improve performance and availability, caching is often deployed redundantly across multiple tiers: for example, a system may combine local and global caching strategies.
Key-related issues
In key-value caches two important problems can arise which affect performance and stability:
- Big key: a single key whose associated value is very large (e.g., a large binary blob or a huge collection). The threshold for a “big key” depends on system resources and latency requirements.
- Hot key: a single key that receives disproportionately high request volume compared to other keys (e.g., a trending item or a popular feed). The definition of a hot key depends on system load and capacity.
Big key issues occur when a single key stores too much data (large strings or large collections). On single-threaded cache servers (such as Redis in its default configuration), operations on a big key can block other operations and severely degrade performance. In cluster deployments the impact is localized to the node holding the big key but can still create resource imbalance and bottlenecks.
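As a sketch of detecting big keys, assuming the `redis-py` client and an arbitrary 512 KB threshold, keys can be sampled with SCAN and measured with MEMORY USAGE:

```python
import redis

r = redis.Redis()
BIG_KEY_THRESHOLD_BYTES = 512 * 1024            # arbitrary example threshold

def find_big_keys():
    """Sample the keyspace and report keys with a large memory footprint."""
    big_keys = []
    for key in r.scan_iter(count=1000):         # incremental, non-blocking scan
        size = r.memory_usage(key)              # MEMORY USAGE <key>, in bytes
        if size is not None and size > BIG_KEY_THRESHOLD_BYTES:
            big_keys.append((key.decode(), size))
    return big_keys
```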
Hot keys increase resource utilization because a high rate of operations on a single key can monopolize CPU or I/O. On single-threaded stores this can block other requests; on multi-threaded stores concurrency control (for writes) may introduce contention. Hot keys can also cause network congestion when traffic concentrates on specific cluster nodes.
(Figure: Nginx cache cluster)
Binary (byte-array) data
When caching images or video content you can either cache the raw binary byte array or cache the resource URL. Caching the byte array returns the binary data directly to the client, which then interprets it according to its media type.
When you cache a resource URL, the client receives the URL and makes a subsequent request to fetch the binary data. It is common to cache transformed content rather than the original raw bytes — for example, resized images, compressed or transcoded media — and store these transformed artifacts in memory or on disk.
If a proxy server serves images or videos on behalf of the origin (for example to handle CORS), the proxy may return byte arrays directly to clients. Caching binary responses consumes memory/disk resources, so it is important to apply appropriate transformations or compression to optimize resource usage.
References
- https://hazelcast.com/glossary/caching/
- https://hazelcast.org/compare-with-redis/
- https://redis.com/blog/three-ways-to-maintain-cache-consistency/
- https://redis.com/glossary/distributed-caching/
- https://aws.amazon.com/ko/what-is/cross-origin-resource-sharing/
- https://dl.acm.org/doi/abs/10.14778/2757807.2757813
- https://engineering.linecorp.com/ko/blog/atomic-cache-stampede-redis-lua-script
- https://redis.com/blog/caches-promises-locks/
- https://www.ehcache.org/documentation/2.8/apis/cache-eviction-algorithms.html
- https://www.mikejohnson.dev/
- https://www.alibabacloud.com/blog/a-detailed-explanation-of-the-detection-and-processing-of-bigkey-and-hotkey-in-redis_598143