Virtual Memory Allocator

--------------------------------------------------------------------------------

Table of Contents

Overview 
Arenas 
Spans 
Imported Memory (aka vmem sourcing) 
Constrained Allocations 
The vmem Quantum 
Quantum Caching 
Fragmentation Considerations 
Relationship to Kernel Memory Allocator 
Segments 
Segment Lists and Markers 
Allocation Hashing 
Freelist Management 
vmem Locking 
vmem Population 
Auditing 
vmem Interfaces 
rmalloc (9F) Compatibility 
Overview
We divide the kernel address space into a number of logically distinct pieces, or *arenas*: text, data, heap, stack, and so on. Within these arenas we often subdivide further; for example, we use heap addresses not only for the kernel heap (kmem_alloc() space), but also for DVMA, bp_mapin(), /dev/kmem, and even some device mappings like the TOD chip. The kernel address space, therefore, is most accurately described as a tree of arenas in which each node of the tree *imports* some subset of its parent. The virtual memory allocator manages these arenas and supports their natural hierarchical structure. 

Back to Top

Arenas
An arena is nothing more than a set of integers. These integers most commonly represent virtual addresses, but in fact they can represent anything at all. For example, we could use an arena containing the integers minpid through maxpid to allocate process IDs. vmem_create() and vmem_destroy() create and destroy vmem arenas (data type vmem_t). 

Back to Top

Spans
We represent the integers in an arena as a collection of *spans*, or contiguous ranges of integers. For example, the kernel heap consists of just one span: [kernelheap, ekernelheap). [Note: the mathematical convention for describing numeric intervals is to use brackets when the end of the interval is included, parentheses when it is excluded. Therefore, the integers 4, 5, 6, 7, 8 could be represented in any of the following four ways: [4, 8], [4, 9), (3, 8], (3, 9). For our purposes, the notation [4, 9) is the most natural because we can readily see both the starting address (4) and the size (9 - 4).] 

Spans can be added to an arena in two ways: explicitly, by vmem_add(), or implicitly, by import (described below). 

Back to Top

Imported Memory (AKA vmem sourcing)
As mentioned in the overview, some arenas are logical subsets of other arenas. For example, kmem_va_arena (a virtual address cache that satisfies most kmem_slab_create() requests) is just a subset of heap_arena (the kernel heap) that provides caching for the most common slab sizes. When kmem_VA_arena runs out of virtual memory, it *imports* more from the heap; we say that heap_arena is the *vmem source* for kmem_VA_arena. vmem_create() allows you to specify any existing vmem arena as the source for your new arena. Topologically, since every arena is a child of at most one source, the set of all arenas forms a collection of trees. 

Back to Top

Constrained Allocations
Some vmem clients are quite picky about the kind of address they want. For example, the DVMA code may need an address that is at a particular phase with respect to some alignment (to get good cache coloring), or that lies within certain limits (the addressable range of a device), or that doesn't cross some boundary (a DMA counter restriction) - or all of the above. The vmem_xalloc() interface, described below, allows the client to specify any or all of these constraints.

Back to Top

The vmem Quantum
Every arena has a notion of 'quantum', specified at vmem_create() time, which defines the minimum unit of memory the arena deals with. Most commonly the quantum is either 1 or PAGESIZE, but any power of 2 is legal. All vmem allocations are guaranteed to be quantum-aligned. 

Back to Top

Quantum Caching
A vmem arena may be so 'hot' (frequently used) that the scalability of vmem allocation is a significant concern. We address this by allowing the most common allocation sizes to be 'fronted' by the kernel memory allocator, which provides low-latency per-cpu caching. To get kmem caching for all allocations up to size S, simply pass S as the qcache_max parameter to vmem_create(). The vmem allocator will create kmem caches of size 1 * quantum, 2 * quantum, ... up to S, so that any vmem_alloc() of size <= S will be handled by a kmem cache. 

At present, the implementation limits S to (VMEM_NQCACHE_MAX * quantum).

Back to Top

Fragmentation Considerations
Any memory allocator is vulnerable to fragmentation if presented with a sufficiently hostile workload. One way to prevent fragmentation is to make all allocation requests the same size. This isn't practical in general, but if you use quantum caching, it happens as a side-effect because all quantum caches for a given arena have the same slab size - specifically, it will be the next power of 2 at or above 3 * qcache_max. For example, if you cache up to 5 * quantum, then all slabs will be of size 16 * quantum (5 * 3 = 15, round to 16). The kernel memory allocator will divide these slabs differently for the different cache sizes, but the load presented to vmem_xalloc() will always be the same: 16 * quantum. Quantum caching has proven very effective against fragmentation of segkp, which typically issues 2-page through 5-page requests (for thread stacks).

Back to Top

Relationship to Kernel Memory Allocator
Every kmem cache has a vmem arena as its memory source. The kernel memory allocator uses vmem_alloc() and vmem_free() to grow and shrink its caches. Thus vmem arenas replace the old notion of kmem backends.

Back to Top

Segments
An arena's spans define the total address range under management. Individual spans are subdivided into *segments*, each of which is either allocated or free. A segment, like a span, is a contiguous range of integers. Each allocated segment [addr, addr + size) represents exactly one vmem_alloc(size) that returned addr. Free segments represent the space between allocated segments. If two free segments are adjacent, we coalesce them into one larger segment; that is, if segments [a, b) and [b, c) are both free, we merge them into a single segment [a, c). The segments within a span are linked together in increasing-address order so we can easily determine whether coalescing is possible. 

Segments never cross span boundaries. When all segments within an imported span become free, we return the span to its source. 

Back to Top

Segment Lists and Markers
The segment structure (vmem_seg_t) contains two doubly-linked lists.

The arena list (vs_anext/Vs_aprev) links all segments in the arena. In addition to the allocated and free segments, the arena contains special marker segments at span boundaries. Span markers simplify coalescing and importing logic by making it easy to tell both when we're at a span boundary (so we don't coalesce across it), and when a span is completely free (its neighbors will both be span markers).

The next-of-kin list (Vs_knext/Vs_kprev) links segments of the same type: 

For allocated segments, Vs_knext is the hash chain linkage. 
For free segments, Vs_knext is the freelist linkage. 
For span marker segments, Vs_knext is the next span marker. 
Back to Top

Allocation Hashing
We maintain a hash table of all allocated segments, hashed by address. This allows vmem_free() to discover the target segment in constant time. vmem_update() periodically resizes hash tables to keep hash chains short.

Back to Top

Freelist Management
We maintain power-of-2 freelists for free segments; for example, free segments of size >= 2^n reside in vmp->vm_freelist[n]. To ensure constant-time allocation, vmem_xalloc() looks not in the first freelist that *might* satisfy the allocation, but in the first freelist that *definitely* satisfies the allocation (unless VM_BESTFIT is specified, or all larger freelists are empty). For example, a 1000-byte allocation will be satisfied not from the 512..1023-byte freelist, whose members *might* contains a 1000-byte segment, but from a 1024-byte or larger freelist, the first member of which will *definitely* satisfy the allocation. This ensures that vmem_xalloc() works in constant time. 

We maintain a bit map to determine quickly which freelists are non-empty. vmp->vm_freemap & (1 << n) is non-zero iff vmp->vm_freelist[n] is non-empty. 

The different freelists are linked together into one large freelist, with the freelist heads serving as markers. Freelist markers simplify the maintenance of vm_freemap by making it easy to tell when we're taking the last member of a freelist (both of its neighbors will be markers). 

Back to Top

vmem Locking
For simplicity, all arena state is protected by a per-arena lock. If you need scalability, use quantum caching.

Back to Top

vmem Population
Any internal vmem routine that might need to allocate new segment structures must prepare in advance by calling vmem_populate(nsegneeded). The idea is to preallocate enough vmem_seg_t's that we don't have to drop the lock protecting the arena, which makes the code much simpler.

Back to Top

Auditing
If KMF_AUDIT is set in kmem_flags, we audit vmem allocations as well. Since virtual addresses cannot be scribbled on, there is no equivalent in vmem to redzone checking, deadbeef, or other kmem debugging features. Moreover, we do not audit frees because segment coalescing destroys the association between an address and its segment structure. Auditing is thus intended primarily to keep track of who's consuming the arena. Debugging support could certainly be extended in the future if it proves necessary, but we do so much live checking via the allocation hash table that even non-DEBUG systems get quite a bit of sanity checking already. 

Back to Top

vmem Interfaces
vmp = vmem_create(name, base, size, quantum, afunc, ffunc, source, qcache_max, flag);

Create a vmem arena with the requested parameters:  
name arena's name 
base, size initial span [base, base + size) 
quantum minimum alignment for all transactions on the arena 
afunc allocation routine to import memory 
ffunc free routine to return imported memory 
source source of imported memory 
qcache_max maximum allocation size to be 'fronted' by kmem  
caches   
flag VM_SLEEP or VM_NOSLEEP 

vmem_destroy(vmp); 

Destroy arena vmp. 

addr = vmem_alloc(vmp, size, flag); Allocate size bytes. addr = vmem_xalloc(vmp, size, align, phase, nocross, minaddr, maxaddr, flag); 

Allocate size bytes at offset phase from an align boundary such that [addr, addr + size) is a subset of [minaddr, maxaddr) that does not straddle a nocross-aligned boundary. 

vmem_free(vmp, vaddr, size); 

Free size bytes at vaddr. 

vmem_xfree(vmp, vaddr, size); 

Free size bytes at vaddr, where vaddr is a constrained allocation.

void *vmem_add(vmp, vaddr, size, flag); 

Add the span [vaddr, vaddr + size) to arena vmp. 

Returns vaddr on success, NULL on failure (VM_NOSLEEP case only).

int vmem_contains(vmp, vaddr, size); 

Determine whether vmp contains the range [vaddr, vaddr + size). 

void vmem_walk(vmem_t *vmp, int typemask, 

void (*func)(void *, void *, size_t), void *arg); 

Walk the vmp arena, applying func to each segment matching typemask. 

Note that func is called with vmp->vm_lock held. 

This guarantees a consistent snapshot, at the price of requiring that func not block. 

size_t vmem_size(vmem_t *vmp, int typemask); 

Return the total amount of memory whose type matches typemask. 

Thus:

- typemask VMEM_ALLOC yields total memory allocated (in use).

- typemask VMEM_FREE yields total memory free (available). 

- typemask (VMEM_ALLOC | VMEM_FREE) yields total arena size. 

Back to Top

rmalloc(9F) Compatibility
rmalloc(9F) and friends still exist, but are now just simple wrappers around more general vmem services.