
4. Buffer Pool Manager


TL;DR — A disk-oriented DBMS stores data on disk but must operate on it in memory. The buffer pool manager handles this by caching pages in fixed-size frames, tracking metadata like dirty flags and pin counters, and choosing which pages to evict using policies like LRU, CLOCK, LRU-K, or ARC. The OS’s own page cache (mmap) can’t do this well enough — the DBMS needs full control over flushing order, prefetching, and scheduling. The OS is not your friend.

Disk-Oriented Architecture

In the previous post, we tackled how the DBMS represents a database in files on disk. Now we turn to the next problem: how the DBMS manages its memory and moves data back and forth from disk.

We are focusing on disk-oriented database management systems — systems whose primary storage lives on persistent media like HDDs or SSDs. In the Von Neumann architecture, data must be in memory before the CPU can operate on it. This creates a fundamental tension: databases are often larger than available memory, yet every query needs data in RAM to execute. Any DBMS must be able to efficiently move data back and forth between disk and memory if it wants to operate on large amounts of data.

That mechanism is the buffer pool manager. At a high level, it moves physical pages of data back and forth between buffers in main memory and persistent storage. It also behaves as a cache — keeping frequently used pages in memory for faster access and evicting cold pages back to storage.

The key illusion is this: from the execution engine’s perspective, the entire database appears to reside in memory. The engine just requests a page by its ID, gets back a pointer to a memory location, and never worries about whether the page was already cached or had to be fetched from disk. The DBMS only requires valid pointers to memory locations to perform its operations.

Disk-oriented DBMS architecture
The execution engine requests Page #2 from the buffer pool. If not cached, the buffer pool fetches it from the database file on disk and returns a pointer.

There are two dimensions the DBMS optimizes for:

  • Spatial Control — where pages are physically located on disk. The goal is to keep pages that are used together as physically close as possible, enabling prefetching and sequential I/O.
  • Temporal Control — when pages are read into memory and when they are written back to disk. The goal is to minimize stalls from having to read data from disk.

This post focuses primarily on temporal control. Let’s start by understanding how the buffer pool itself is organized.

Buffer Pool Organization

The buffer pool is an in-memory cache organized as an array of fixed-size frames. When the DBMS requests a page, the buffer pool manager first checks whether it is already stored in a frame. If not, the page is copied from disk into a free frame. The buffer pool operates as a write-back cache: dirty pages are buffered in memory and not written to disk immediately on mutation. This contrasts with a write-through cache, where every change propagates to disk right away.

Buffer pool organization with frames and on-disk file
The buffer pool is an array of fixed-size frames. Pages from the on-disk file are copied into frames as needed. Dirty pages stay in memory until explicitly flushed.
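To make the fetch path concrete, here is a minimal Python sketch. All names (BufferPool, fetch_page, the disk dict standing in for the database file) are illustrative, not taken from any real system, and eviction is deliberately left out since replacement policies come later in the post.

```python
PAGE_SIZE = 4096

class BufferPool:
    def __init__(self, num_frames, disk):
        self.frames = [bytearray(PAGE_SIZE) for _ in range(num_frames)]
        self.page_table = {}                 # page_id -> frame index
        self.free_list = list(range(num_frames))
        self.disk = disk                     # page_id -> bytes, stands in for the file

    def fetch_page(self, page_id):
        """Return a buffer holding the page, reading from 'disk' on a miss."""
        if page_id in self.page_table:       # hit: already cached in a frame
            return self.frames[self.page_table[page_id]]
        if not self.free_list:
            raise MemoryError("no free frame; a replacement policy is needed here")
        frame_idx = self.free_list.pop()
        self.frames[frame_idx][:] = self.disk[page_id]   # copy the page into the frame
        self.page_table[page_id] = frame_idx
        return self.frames[frame_idx]

disk = {2: b"x" * PAGE_SIZE}
pool = BufferPool(num_frames=4, disk=disk)
page = pool.fetch_page(2)        # miss: copied from "disk" into a frame
same = pool.fetch_page(2)        # hit: the same frame is returned
```

Note that the caller only ever sees a pointer to memory, exactly the illusion described above.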

The DBMS needs this buffer pool memory for many different things — just as most programs need memory for different types of data structures:

  • Tuple Storage and Indexes
  • Sorting and Join Buffers
  • Query and Dictionary Caches
  • Maintenance and Log Buffers

Depending on the implementation, some of these may not be backed by the buffer pool at all and can simply use malloc.

A page directory is maintained on disk — it maps page IDs to their physical locations in database files. All changes to the page directory must be recorded on disk to allow the DBMS to find pages on restart. It is often kept in memory too, to minimize latency for page accesses since it has to be read before accessing any physical page.

Now that we know how pages get into the buffer pool, the next question is: how does the buffer pool keep track of what’s already there?

Buffer Pool Metadata

The buffer pool must maintain certain metadata in order to be used efficiently and correctly. The page table is an in-memory hash table that keeps track of pages currently in memory. It maps page IDs to frame locations in the buffer pool. Since the order of pages in the buffer pool does not necessarily reflect the order on the disk, this extra indirection layer allows identification of page locations in the pool.

Buffer pool page table with dirty flags and pin counters
The page table maps page IDs to buffer pool frames and tracks per-page metadata: dirty flag, pin/reference counter, and access tracking info.

Each page table entry tracks additional metadata:

  • Dirty Flag — set by a thread whenever it modifies a page. This indicates to the storage manager that the page must be written back to disk before eviction.
  • Pin / Reference Counter — tracks how many threads are currently accessing the page (either reading or modifying it). A thread must increment the counter before accessing the page; while a page’s pin count is greater than zero, the storage manager is not allowed to evict it. Pinning does not prevent other transactions from accessing the page concurrently. If the buffer pool is full and no unpinned pages remain to evict, the DBMS must raise an out-of-memory error.
  • Access Tracking Information — records which transactions or threads are accessing the page.
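The pin/unpin protocol maps naturally onto a scoped guard. The sketch below uses illustrative names (FrameMeta, PageGuard); real systems pack this metadata into frame headers rather than Python objects.

```python
class FrameMeta:
    def __init__(self):
        self.dirty = False
        self.pin_count = 0

class PageGuard:
    """Pins a page's frame for the duration of one access."""
    def __init__(self, meta):
        self.meta = meta

    def __enter__(self):
        self.meta.pin_count += 1      # increment before touching the page
        return self.meta

    def __exit__(self, *exc):
        self.meta.pin_count -= 1      # unpin once the operation finishes

meta = FrameMeta()
with PageGuard(meta) as m:
    m.dirty = True                    # any modification sets the dirty flag
    # pin_count == 1 here: the replacement policy must skip this frame
can_evict = meta.pin_count == 0       # True again, but the dirty flag forces a flush first
```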

Page Table vs. Page Directory

These two terms are easy to confuse, so let’s clarify:

  • Page Directory — maps page IDs to locations in the database files. Must be stored on disk and persisted across restarts, since it is how the DBMS finds pages in the physical files.
  • Page Table — maps page IDs to frames in the buffer pool. In-memory only; it is how the DBMS finds cached pages.

Locks vs. Latches

Before discussing concurrency in the buffer pool, we need to make a distinction between two protection primitives. This distinction matters because the page table is a shared data structure accessed by many threads.

  • Locks — higher-level, logical primitives that protect the contents of a database (tuples, tables, databases) from other transactions. Held for the duration of a transaction. Database systems can expose to the user which locks are being held as queries are run. Need to be able to roll back changes.
  • Latches — low-level primitives that protect the DBMS’s internal data structures (hash tables, regions of memory) from other threads. Held only for the duration of a single operation. This is often implemented with simple language primitives like mutexes and/or conditional variables. Do not need to be able to roll back changes.

The page table, for example, is protected by latches — not locks.

Why Not Use the OS?

At this point you might wonder: why build all of this custom machinery? The DBMS could instead memory-map (mmap) the database file into its address space and let the OS handle paging transparently. The OS would move file pages in and out of memory, and the DBMS could even use multiple threads against the mmap-ed file in the hope of hiding page-fault stalls.

Problems with mmap for database storage
Using mmap lets the OS manage page faults transparently, but introduces four critical problems: uncontrolled dirty page flushing, I/O stalls, SIGBUS errors, and TLB shootdowns.

The problems with mmap are severe:

  1. Transaction Safety — the OS can flush dirty pages at any time. The DBMS has no way to prevent the OS from writing a dirty page to disk before the transaction commits, violating write-ahead logging.
  2. I/O Stalls — the DBMS doesn’t know which pages are in memory. The OS will stall a thread on a page fault with no warning.
  3. Error Handling — difficult to validate pages. Any access can cause a SIGBUS signal that the DBMS must handle.
  4. Performance Issues — OS data structure contention and TLB shootdowns degrade performance.

There are partial workarounds for some of these problems (madvise, mlock, msync), but coaxing the OS into correct behavior with these syscalls is just as onerous as managing memory yourself.

The DBMS almost always wants to control things itself and can do a better job than the OS at:

  • Flushing dirty pages to disk in the correct order
  • Specialized prefetching
  • Better buffer replacement policies
  • Thread/process scheduling

The OS is not your friend.

Now that we’ve established why the DBMS needs its own buffer pool, the next question is critical: when the buffer pool is full and we need to bring in a new page, which page do we evict?

Buffer Replacement Policies

When the DBMS needs to free up a frame to make room for a new page, it must decide which page to evict from the buffer pool. This decision is governed by a replacement policy. The implementation goals are correctness, accuracy, speed, and low metadata overhead. Remember: a pinned page can never be evicted.

Least Recently Used (LRU)

The simplest approach, dating back to 1965: maintain a timestamp of when each page was last accessed. When eviction is needed, pick the page with the oldest timestamp. Pages can be kept in a sorted queue to reduce the search time on eviction. However, keeping the data structure sorted and storing a full timestamp per page imposes prohibitive overhead.

LRU replacement policy step-by-step
LRU maintains a sorted list of pages by last-access timestamp. On eviction, the page at the oldest end of the list is removed.
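A minimal LRU sketch, assuming Python's OrderedDict in place of the explicit sorted timestamp queue (the ordering itself encodes recency, so each access is O(1)):

```python
from collections import OrderedDict

class LRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()       # least recently used entry first

    def access(self, page_id):
        """Touch a page; return the evicted page ID, or None."""
        evicted = None
        if page_id in self.pages:
            self.pages.move_to_end(page_id)              # now most recently used
        else:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=False)   # oldest goes out
            self.pages[page_id] = True
        return evicted

lru = LRU(capacity=3)
for p in [1, 2, 3]:
    lru.access(p)
lru.access(1)                # page 1 becomes most recently used
victim = lru.access(4)       # pool is full: evicts page 2, the least recent
```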

CLOCK

CLOCK (1969) is an approximation of LRU that avoids the cost of maintaining per-page timestamps. Instead, each page gets a single reference bit. When a page is accessed, its bit is set to 1.

Pages are organized in a circular buffer with a “clock hand”. When an eviction is needed, the hand sweeps around:

  • If the current page’s bit is 1 → set it to 0 and move on.
  • If the current page’s bit is 0 → evict it.

The clock hand remembers its position between evictions. This gives us an approximation of “least recently used” without the overhead of sorting timestamps — a page that was recently accessed will have its bit set to 1, and the clock hand will give it a second chance before evicting it.

CLOCK replacement policy step-by-step
The CLOCK algorithm sweeps a circular buffer. Page 1 is accessed (ref=1). On eviction, the hand resets page 1's bit to 0, skips pages with ref=0, and evicts page 2 — the first page found with ref=0.
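The sweep logic fits in a few lines. This sketch (illustrative names only) tracks one reference bit per frame and a hand that persists between evictions:

```python
class Clock:
    def __init__(self, capacity):
        self.ref = [0] * capacity            # one reference bit per frame
        self.capacity = capacity
        self.hand = 0                        # remembered between evictions

    def access(self, frame):
        self.ref[frame] = 1                  # set the bit on every access

    def evict(self):
        """Sweep until a frame with ref bit 0 is found; return its index."""
        while True:
            if self.ref[self.hand] == 1:
                self.ref[self.hand] = 0      # second chance: clear and move on
                self.hand = (self.hand + 1) % self.capacity
            else:
                victim = self.hand           # bit is 0: evict this frame
                self.hand = (self.hand + 1) % self.capacity
                return victim

clock = Clock(capacity=4)
clock.access(0)                  # frame 0 was recently used (ref=1)
victim = clock.evict()           # hand clears frame 0's bit, then evicts frame 1
```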

The Problem: Sequential Flooding

LRU and CLOCK seem reasonable, but they have a critical weakness. Both are susceptible to sequential flooding, where the buffer pool’s contents are polluted due to a sequential scan. Since sequential scans read many pages quickly, the buffer pool fills up with pages that are read once and then never again. Meanwhile, pages from other queries — pages that might actually be needed — get evicted because they have “older” timestamps.

Sequential flooding polluting the buffer pool
Q1 caches page0 for a point query. Q2 then performs a sequential scan, evicting page0 to make room for pages it reads once and discards. When Q3 needs page0 again, it must re-fetch from disk.

In OLAP workloads, the most recently used page is often the best page to evict — exactly the opposite of what LRU assumes.

LRU also does not account for the frequency of access. For example, if we have a page that is frequently accessed over time, we do not want to evict it just because it wasn’t accessed recently. Least-Frequently Used (LFU) addresses this by maintaining access counts and evicting pages with the lowest count. But LFU introduces its own problems: logarithmic implementation complexity relative to cache size, and it ignores time — accumulating stale pages with high frequency counts that may no longer be relevant.

So we need something better. We need a policy that considers both recency and frequency of access. That’s where LRU-K comes in.

LRU-K

LRU-K (1993) tracks the history of the last K accesses to each page as timestamps, and computes the interval between subsequent accesses. This history is used to predict the next time a page is going to be accessed. The page with the largest backward K-distance (time since the Kth most recent access) is evicted first. Pages with fewer than K accesses have infinite backward distance and are evicted preferentially.

LRU-K replacement policy with backward distance computation
LRU-2 tracks the last 2 access timestamps per page. At eviction time, the backward K-distance is computed: pages with only one access have ∞ distance and are evicted first. Ties are broken by oldest access time.
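The backward K-distance computation can be sketched as follows (K=2, illustrative names; a real implementation would also bound the history table and handle the ghost cache):

```python
from collections import defaultdict, deque

K = 2
INF = float("inf")

class LRUK:
    def __init__(self):
        self.history = defaultdict(lambda: deque(maxlen=K))  # last K timestamps
        self.now = 0

    def access(self, page_id):
        self.now += 1
        self.history[page_id].append(self.now)

    def victim(self):
        # Largest backward K-distance loses. Pages with fewer than K accesses
        # have infinite distance; ties are broken by oldest access time.
        def key(pid):
            hist = self.history[pid]
            if len(hist) < K:
                return (INF, -hist[0])
            return (self.now - hist[0], 0)   # time since Kth most recent access
        return max(self.history, key=key)

cache = LRUK()
for pid in ["a", "b", "a", "c", "b"]:
    cache.access(pid)
# a: t=1,3   b: t=2,5   c: t=4 (only one access -> infinite distance)
v = cache.victim()    # "c" is evicted first
```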

LRU-K also maintains an in-memory ghost cache of recently evicted pages to prevent thrashing — if a recently evicted page is requested again, the ghost cache preserves its history so it doesn’t start from scratch. However, this again has a higher storage overhead.

MySQL’s Approximate LRU-K

In practice, full LRU-K can be expensive. MySQL uses a practical approximation of LRU-2: a single linked list with two entry points — an old sublist and a young sublist. New pages are always inserted at the head of the old list. If a page in the old list is accessed again, it is promoted to the head of the young list. Evictions always come from the tail of the old list.

This is elegant because it captures the key insight of LRU-2 — distinguishing between “accessed once” and “accessed more than once” — without maintaining full timestamp histories.

MySQL approximate LRU-K with old and young sublists
MySQL's buffer pool uses a split linked list. New pages enter the old sublist. Repeated access promotes them to the young sublist. Evictions always come from the old sublist's tail.
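The two-sublist idea can be sketched with a pair of ordered dicts. This is a simplified illustration of the scheme described above, not InnoDB's actual code; InnoDB additionally uses a time threshold before promotion.

```python
from collections import OrderedDict

class SplitLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.old = OrderedDict()         # new pages land here; tail is evicted
        self.young = OrderedDict()       # pages accessed more than once

    def access(self, page_id):
        """Touch a page; return the evicted page ID, or None."""
        evicted = None
        if page_id in self.young:
            self.young.move_to_end(page_id)
        elif page_id in self.old:
            del self.old[page_id]
            self.young[page_id] = True   # promoted on second access
        else:
            if len(self.old) + len(self.young) >= self.capacity:
                evicted, _ = self.old.popitem(last=False)   # old list's tail
            self.old[page_id] = True
        return evicted

pool = SplitLRU(capacity=3)
pool.access("scan1"); pool.access("hot")
pool.access("hot")                       # hot page promoted to the young list
pool.access("scan2")
victim = pool.access("scan3")            # evicts "scan1" from the old list
```

Scan pages read once never leave the old sublist, so a sequential scan cannot push hot pages out of the young sublist.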

ARC: Adaptive Replacement Cache

LRU-K is better than plain LRU, but the ideal balance between recency and frequency depends on the workload — and workloads change over time. ARC (2003), developed by IBM Research, solves this by adapting dynamically. It supports both LRU and LFU by dynamically adjusting the sizes of two lists based on workload access patterns.

ARC maintains four lists and an adaptive parameter:

  • T1 (Recent List) — pages accessed once recently
  • T2 (Frequent List) — pages accessed at least twice
  • B1 (Recent Ghost List) — history of pages recently evicted from T1 (i.e., recency misses)
  • B2 (Frequent Ghost List) — history of pages recently evicted from T2 (i.e., frequency misses)
  • Target Size Parameter (p) — adaptively adjusts how much to favor recency (T1) vs. frequency (T2)

ARC adaptive replacement cache with T1, T2, B1, B2 lists
ARC maintains four lists and adapts the split between recency (T1) and frequency (T2) based on ghost list hits. A hit in B1 increases p (favor recency); a hit in B2 decreases p (favor frequency).

A page can only exist in one list at a time. The DBMS checks both lists upon lookup. The lookup protocol works as follows:

  • Page found in T1 → promote to T2 (it’s now “frequent”)
  • Page found in T2 → stays in T2, reinforcing its frequent status
  • Cache miss, found in B1 → increase p (favor recency), move page into T2 (since it’s now accessed again)
  • Cache miss, found in B2 → decrease p (favor frequency), move page into T2
  • Cache miss, not in any list → evict from appropriate list, insert new page into T1

The ghost lists are the key innovation: they let ARC learn from its mistakes. If pages evicted from the recency list keep coming back (B1 hits), ARC grows T1. If pages evicted from the frequency list keep coming back (B2 hits), ARC grows T2. The system continuously adapts without any manual tuning.
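A full ARC implementation is lengthy, but the adaptation step itself is tiny. The sketch below shows only how ghost-list hits move the target size p and how p decides which list the next eviction comes from; it is a highly simplified illustration, not a complete ARC.

```python
class ArcAdapter:
    def __init__(self, capacity):
        self.capacity = capacity
        self.p = capacity // 2           # target size of T1 (the recency list)

    def on_ghost_hit(self, ghost_list):
        if ghost_list == "B1":           # recency miss: T1 was too small
            self.p = min(self.capacity, self.p + 1)
        elif ghost_list == "B2":         # frequency miss: T2 was too small
            self.p = max(0, self.p - 1)

    def evict_from(self, t1_size):
        # Evict from T1 while it exceeds its target size, otherwise from T2.
        return "T1" if t1_size > self.p else "T2"

arc = ArcAdapter(capacity=8)
arc.on_ghost_hit("B1")           # a recently evicted recency page came back
arc.on_ghost_hit("B1")           # grow T1's target again
where = arc.evict_from(t1_size=7)   # T1 is over target, so evict from T1
```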

ARC is implemented in IBM DB2 and ZFS. PostgreSQL briefly shipped ARC but replaced it with a clock-sweep variant to avoid IBM’s patent.

Better Policies: Localization and Priority Hints

Beyond choosing a good eviction algorithm, two additional optimizations help reduce buffer pool pollution:

Localization — the DBMS chooses which pages to evict on a per-query basis. This minimizes the pollution of the buffer pool from each query. For example, Postgres assigns a limited number of buffer pool pages to a sequential scan query and uses them as a circular ring buffer.

Priority Hints — the DBMS knows the context of each page during query execution and can hint to the buffer pool whether a page is important. For instance, the root page of a B-tree index is accessed by nearly every query and should be protected from eviction, while leaf pages traversed during an insert are unlikely to be needed again soon.

Dirty Pages

Now let’s consider what happens when we actually evict a page. Not all evictions are equal — it depends on whether the page has been modified.

Fast path vs slow path for dirty page eviction
Evicting a clean page is instant (fast path — just drop it). Evicting a dirty page requires writing it back to disk first (slow path).

  • Fast Path — the page is not dirty, so the DBMS can simply “drop” it from the buffer pool.
  • Slow Path — the page is dirty, so the DBMS must write the changes back to disk before eviction to ensure data synchronization between memory and disk.

There is a trade-off between fast evictions and the cost of writing back dirty pages that will never be read again. To avoid always hitting the slow path at eviction time, the DBMS uses background writing (also called page cleaning or buffer flushing). A background thread periodically walks through the page table and preemptively writes dirty pages to disk. When a dirty page is safely flushed, the DBMS can either evict the page immediately or just reset its dirty flag so it can be dropped quickly later.

One caveat: the system must be careful not to write dirty pages before their log records are written — this is the write-ahead logging constraint covered in a later post.
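A single background-writer pass can be sketched like this (illustrative names; the write_to_disk callback stands in for real I/O, and the WAL check is reduced to a comment):

```python
def background_writer_pass(page_table, write_to_disk):
    """Flush dirty, unpinned pages and clear their dirty flags."""
    flushed = []
    for page_id, meta in page_table.items():
        if meta["dirty"] and meta["pin_count"] == 0:
            write_to_disk(page_id)       # WAL: log records must hit disk first
            meta["dirty"] = False        # future eviction takes the fast path
            flushed.append(page_id)
    return flushed

page_table = {
    1: {"dirty": True,  "pin_count": 0},   # flushable
    2: {"dirty": True,  "pin_count": 2},   # pinned: must be skipped
    3: {"dirty": False, "pin_count": 0},   # clean: nothing to do
}
written = []
flushed = background_writer_pass(page_table, written.append)
```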

Disk I/O Scheduling

The OS and hardware try to maximize disk bandwidth by reordering and batching I/O requests, but they do not know which requests are more important than others. For this reason, many DBMSs — Oracle, Vertica, and MySQL among them — recommend switching Linux to the deadline or noop (FIFO) I/O scheduler so the kernel does less reordering of its own.

The DBMS maintains its own internal queue(s) to track page read/write requests from the entire system, computing priorities based on several factors:

  • Sequential vs. Random I/O
  • Critical Path Task vs. Background Task
  • Table vs. Index vs. Log vs. Ephemeral Data
  • Transaction Information
  • User-based SLAs

The OS doesn’t know these things and is going to get in the way of our beautiful DBMS.
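A DBMS-side I/O queue is essentially a priority queue over requests. In this sketch the priority values are made up for illustration; a real system would compute them from the factors listed above.

```python
import heapq

CRITICAL, BACKGROUND = 0, 1              # lower number = served first

class IOQueue:
    def __init__(self):
        self.heap = []
        self.seq = 0                     # tie-breaker preserves FIFO order

    def submit(self, priority, page_id):
        heapq.heappush(self.heap, (priority, self.seq, page_id))
        self.seq += 1

    def next_request(self):
        return heapq.heappop(self.heap)[2]

q = IOQueue()
q.submit(BACKGROUND, page_id=10)         # e.g. a background flush
q.submit(CRITICAL, page_id=42)           # e.g. a page on a query's critical path
first = q.next_request()                 # the critical request jumps the queue
```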

OS Page Cache and O_DIRECT

Most disk operations go through the OS API. Unless explicitly told otherwise, the OS maintains its own filesystem cache (aka page cache, buffer cache). This creates a problem: the same page exists in both the DBMS’s buffer pool and the OS’s page cache — redundant copies with potentially different eviction policies.

O_DIRECT bypasses the OS page cache
Without O_DIRECT, pages are cached in both the DBMS buffer pool and the OS page cache. Most DBMSs use O_DIRECT to bypass the OS cache entirely.

Most DBMSs use Direct I/O via the O_DIRECT flag to bypass the OS’s cache entirely. This avoids redundant copies of pages, conflicting eviction policies, and loss of control over file I/O. PostgreSQL is a notable exception — it uses the OS’s page cache.

A subtle gotcha: fsync by default has silent errors. If the DBMS calls fsync and it fails with EIO, Linux marks the dirty pages as clean. If the DBMS calls fsync again, Linux reports success — even though the data was never actually written to disk. Since the DBMS thought the OS was its friend, it assumed the write was successful…

Again, the OS is not your friend.

Buffer Pool Optimizations

We’ve built up the core buffer pool: frames, page table, metadata tracking, replacement policies, dirty page management, and disk I/O scheduling. But there are several more optimizations that real systems use to squeeze out better performance.

Multiple Buffer Pools

A DBMS doesn’t always have a single buffer pool for the entire system. It can partition memory across multiple pools — multiple buffer pool instances, per-database buffer pools, or per-page-type buffer pools. Partitioning memory across multiple pools helps reduce latch contention on the page table and improves locality by avoiding contention on LRU tracking metadata.

Two approaches for routing requests to the right pool:

  • Object ID — embed an object identifier in record IDs and maintain a mapping from objects to specific buffer pools.
  • Hashing — hash the page ID to select which buffer pool to access.
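Hash-based routing is a one-liner in spirit. In this sketch each dict stands in for one pool's page table; the point is only that a page ID deterministically maps to one pool, spreading latch contention.

```python
NUM_POOLS = 4
pools = [dict() for _ in range(NUM_POOLS)]   # each dict stands in for one pool

def pool_for(page_id):
    """Route a page ID to its buffer pool instance."""
    return pools[hash(page_id) % NUM_POOLS]

pool_for(123)["cached"] = True
assert pool_for(123) is pool_for(123)        # same page always hits the same pool
```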

Pre-Fetching

So far, the buffer pool has been reactive — it fetches a page when the execution engine requests it. But the DBMS can also prefetch pages based on the query plan rather than waiting for each page fault.

For a sequential scan, when the query starts processing page 0, the buffer pool can proactively fetch pages 1, 2, and 3 in the background. By the time the query finishes page 0 and moves to page 1, it’s already in memory — no disk I/O wait.

For an index scan, the DBMS can do something even smarter: examine the index structure and prefetch the specific leaf pages it knows it will need, even if they aren’t contiguous on disk.

Some DBMSs even prefetch pages to fill empty frames at start-up.

Sequential and index scan prefetching
During a sequential scan, the buffer pool prefetches upcoming pages before they are requested. During an index scan, the DBMS uses the index structure to predict which pages to prefetch.

Scan Sharing

Pre-fetching optimizes a single query’s I/O. But what about when multiple queries need to scan the same table? Without scan sharing, each query starts its own scan from page 0, duplicating all the I/O.

Scan sharing (also called synchronized scans) allows multiple queries to attach to a single cursor scanning a table. If Q1 is already scanning table A when Q2 arrives wanting to scan the same table, Q2 can attach to Q1’s cursor and piggyback on pages Q1 is already reading. When Q1 finishes, Q2 continues from where Q1 left off and then wraps around to pick up the pages it missed at the beginning.

Scan sharing between two queries
Q2 arrives while Q1 is scanning pages 0–5. Instead of starting from page 0, Q2 attaches to Q1's cursor at page 3. After Q1 finishes, Q2 wraps back to read pages 0–2.

This is different from result caching — each query still computes its own result, they just share the I/O. Fully supported in DB2, MSSQL, Teradata, and Postgres. Oracle only supports cursor sharing for identical queries.
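The attach-and-wrap-around page order can be sketched in a few lines (illustrative function, ignoring concurrency and cursor bookkeeping):

```python
def shared_scan_order(num_pages, attach_at):
    """Page order seen by a query that attaches while the shared cursor
    is at page 'attach_at': ride along to the end, then wrap around."""
    return list(range(attach_at, num_pages)) + list(range(0, attach_at))

# Q1 scans pages 0..5 normally; Q2 arrives while Q1's cursor is at page 3.
q2_order = shared_scan_order(num_pages=6, attach_at=3)
```

Every page is still read exactly once on Q2's behalf, but pages 3 through 5 cost no extra I/O because Q1 was fetching them anyway.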

Buffer Pool Bypass

There’s one more optimization for the case where scan sharing doesn’t help: what if a query is doing a sequential scan over a huge table and we know those pages won’t be needed by anyone else?

For sequential scans that read large contiguous ranges of pages, the DBMS can skip the buffer pool entirely. The fetched pages are stored in memory local to the running query and discarded when done. This avoids polluting the buffer pool with pages that won’t be needed by other queries. Called “Light Scans” in Informix. This can also be used for temporary data like sorting and joins.

Conclusion

The DBMS can almost always manage memory better than the OS. By leveraging semantics from the query plan, it makes better decisions about evictions, allocations, and prefetching. The key takeaway: don’t let the OS manage your database’s memory — it doesn’t understand your workload.