Memory Management · 2015-12-01

【Linux Memory Management】The SLUB Allocator (6)

Previous articles analyzed the SLUB allocator's initialization, cache creation, object allocation, and object freeing; to wrap up, this one analyzes how the SLUB allocator destroys a slab cache.

The entry point for destroying a slab cache is kmem_cache_destroy(), implemented as:

【file:/mm/slab_common.c】
void kmem_cache_destroy(struct kmem_cache *s)
{
    /* Destroy all the children caches if we aren't a memcg cache */
    kmem_cache_destroy_memcg_children(s);

    get_online_cpus();
    mutex_lock(&slab_mutex);
    s->refcount--;
    if (!s->refcount) {
        list_del(&s->list);

        if (!__kmem_cache_shutdown(s)) {
            memcg_unregister_cache(s);
            mutex_unlock(&slab_mutex);
            if (s->flags & SLAB_DESTROY_BY_RCU)
                rcu_barrier();

            memcg_free_cache_params(s);
            kfree(s->name);
            kmem_cache_free(kmem_cache, s);
        } else {
            list_add(&s->list, &slab_caches);
            mutex_unlock(&slab_mutex);
            printk(KERN_ERR "kmem_cache_destroy %s: Slab cache still has objects\n",
                s->name);
            dump_stack();
        }
    } else {
        mutex_unlock(&slab_mutex);
    }
    put_online_cpus();
}

In this function, kmem_cache_destroy_memcg_children() destroys the memcg child caches associated with this cache. get_online_cpus() locks cpu_online_map against CPU hotplug and is paired with the put_online_cpus() at the end. The following mutex_lock() acquires slab_mutex, the mutex protecting the global slab state. The cache's reference count refcount is then decremented; if if (!s->refcount) is true afterwards, i.e. the count dropped to zero, no other slab cache is aliased to this one and its kmem_cache structure may be torn down. Otherwise another cache still aliases it and depends on it, so the function merely unlocks slab_mutex, releases the hotplug lock via put_online_cpus(), and returns.
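
The aliasing that keeps refcount above 1 is easy to reproduce from a module: with slab merging enabled, two caches created with compatible object size, alignment, and flags, and without a constructor, may end up sharing a single kmem_cache. A minimal sketch under that assumption (the demo_a/demo_b names and the 128-byte object size are invented for illustration):

#include <linux/module.h>
#include <linux/slab.h>

static struct kmem_cache *cache_a, *cache_b;

static int __init alias_demo_init(void)
{
    /* Two caches with identical geometry and no constructor: with
     * slab merging enabled, SLUB may return the same kmem_cache for
     * both calls, merely bumping s->refcount the second time. */
    cache_a = kmem_cache_create("demo_a", 128, 0, 0, NULL);
    if (!cache_a)
        return -ENOMEM;
    cache_b = kmem_cache_create("demo_b", 128, 0, 0, NULL);
    if (!cache_b) {
        kmem_cache_destroy(cache_a);
        return -ENOMEM;
    }
    return 0;
}

static void __exit alias_demo_exit(void)
{
    /* If the caches were merged, the first destroy only decrements
     * the refcount; the second drops it to 0 and goes through the
     * full __kmem_cache_shutdown() teardown path. */
    kmem_cache_destroy(cache_b);
    kmem_cache_destroy(cache_a);
}

module_init(alias_demo_init);
module_exit(alias_demo_exit);
MODULE_LICENSE("GPL");

Whether the merge actually happens depends on the kernel configuration (booting with slub_nomerge or enabling certain debug flags disables it), hence the hedging above; if it does, only the second kmem_cache_destroy() performs real teardown.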

In the branch where if (!s->refcount) is true, list_del() first removes this kmem_cache from the global slab_caches list, and __kmem_cache_shutdown() then tears down its management structures. On success __kmem_cache_shutdown() returns 0, making if (!__kmem_cache_shutdown(s)) true: memcg_unregister_cache() unregisters the cache from memcg; if the cache was created with SLAB_DESTROY_BY_RCU, rcu_barrier() first waits for pending RCU callbacks; memcg_free_cache_params() frees the memcg_params allocated at creation time; and kfree() plus kmem_cache_free() release the cache's name string and the kmem_cache structure itself. If __kmem_cache_shutdown() fails, the cache is hooked back onto the slab_caches list and an error plus stack dump is logged.

With that, the destruction of the slab cache is complete.

The core of kmem_cache_destroy() is __kmem_cache_shutdown(); let's dig into its implementation:

【file:/mm/slub.c】
int __kmem_cache_shutdown(struct kmem_cache *s)
{
    int rc = kmem_cache_close(s);

    if (!rc) {
        /*
         * We do the same lock strategy around sysfs_slab_add, see
         * __kmem_cache_create. Because this is pretty much the last
         * operation we do and the lock will be released shortly after
         * that in slab_common.c, we could just move sysfs_slab_remove
         * to a later point in common code. We should do that when we
         * have a common sysfs framework for all allocators.
         */
        mutex_unlock(&slab_mutex);
        sysfs_slab_remove(s);
        mutex_lock(&slab_mutex);
    }

    return rc;
}

This function delegates the real work to kmem_cache_close(), which releases all resources held by the cache; if that succeeds, the if branch removes the cache's sysfs entry via sysfs_slab_remove(), temporarily dropping slab_mutex around the call for the locking reasons spelled out in the comment.

Now look at the implementation of kmem_cache_close():

【file:/mm/slub.c】
/*
 * Release all resources used by a slab cache.
 */
static inline int kmem_cache_close(struct kmem_cache *s)
{
    int node;

    flush_all(s);
    /* Attempt to free all objects */
    for_each_node_state(node, N_NORMAL_MEMORY) {
        struct kmem_cache_node *n = get_node(s, node);

        free_partial(s, n);
        if (n->nr_partial || slabs_node(s, node))
            return 1;
    }
    free_percpu(s->cpu_slab);
    free_kmem_cache_nodes(s);
    return 0;
}

This function first calls flush_all() to release every CPU's local slabs, i.e. the slabs managed by each kmem_cache_cpu. It then walks the memory nodes with for_each_node_state(), obtains each node's kmem_cache_node via get_node(), and frees the slabs on the node's partial list with free_partial(); if any node still holds objects afterwards (n->nr_partial or slabs_node() non-zero), it returns 1 and the shutdown is aborted. Finally, free_percpu() returns the per-CPU kmem_cache_cpu structures to the system, and free_kmem_cache_nodes() frees each node's kmem_cache_node management structure.
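
The s->cpu_slab freed by free_percpu() above comes from the kernel's percpu allocator. As a minimal, self-contained sketch of that API (demo_counter and demo_percpu_sum() are invented for illustration):

#include <linux/percpu.h>
#include <linux/smp.h>

static int __percpu *demo_counter;

static int demo_percpu_sum(void)
{
    int cpu, total = 0;

    /* One zero-initialized int per possible CPU, analogous to the
     * kmem_cache_cpu instances behind s->cpu_slab. */
    demo_counter = alloc_percpu(int);
    if (!demo_counter)
        return -ENOMEM;

    /* per_cpu_ptr() resolves the slot belonging to a given CPU,
     * exactly as __flush_cpu_slab() does for s->cpu_slab. */
    for_each_online_cpu(cpu)
        total += *per_cpu_ptr(demo_counter, cpu);

    /* Counterpart of free_percpu(s->cpu_slab) in kmem_cache_close(). */
    free_percpu(demo_counter);
    return total;
}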

Finally, let's analyze the more involved flush_all():

【file:/mm/slub.c】
static void flush_all(struct kmem_cache *s)
{
    on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1, GFP_ATOMIC);
}

This looks like a mere wrapper around on_each_cpu_cond(), and indeed on_each_cpu_cond() performs no freeing itself: it iterates over the CPUs and invokes the predicate passed in, has_cpu_slab(), to test whether a given processor holds any of this cache's resources; for each CPU that does, it dispatches flush_cpu_slab() to that processor to release them.
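
This flush pattern is not specific to SLUB. A hedged sketch of a caller draining its own per-CPU state through the same API (all demo_* names are invented; the function signature matches the kernel version quoted below):

#include <linux/smp.h>
#include <linux/gfp.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(int, demo_pending);

/* Predicate: called with preemption disabled; returning true means
 * "send this CPU an IPI". Mirrors has_cpu_slab(). */
static bool demo_has_work(int cpu, void *info)
{
    return per_cpu(demo_pending, cpu) != 0;
}

/* IPI handler: runs on the target CPU with interrupts disabled, so it
 * must be fast and non-blocking. Mirrors flush_cpu_slab(). */
static void demo_drain(void *info)
{
    this_cpu_write(demo_pending, 0);
}

static void demo_drain_all(void)
{
    /* wait=1: do not return until every IPI'd CPU has finished,
     * just as flush_all() requires before tearing a cache down. */
    on_each_cpu_cond(demo_has_work, demo_drain, NULL, 1, GFP_ATOMIC);
}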

As usual, let's look at the implementation of on_each_cpu_cond() in detail:

【file:/kernel/smp.c】
/*
 * on_each_cpu_cond(): Call a function on each processor for which
 * the supplied function cond_func returns true, optionally waiting
 * for all the required CPUs to finish. This may include the local
 * processor.
 * @cond_func:	A callback function that is passed a cpu id and
 *		the info parameter. The function is called
 *		with preemption disabled. The function should
 *		return a boolean value indicating whether to IPI
 *		the specified CPU.
 * @func:	The function to run on all applicable CPUs.
 *		This must be fast and non-blocking.
 * @info:	An arbitrary pointer to pass to both functions.
 * @wait:	If true, wait (atomically) until function has
 *		completed on other CPUs.
 * @gfp_flags:	GFP flags to use when allocating the cpumask
 *		used internally by the function.
 *
 * The function might sleep if the GFP flags indicate a non-atomic
 * allocation is allowed.
 *
 * Preemption is disabled to protect against CPUs going offline but not online.
 * CPUs going online during the call will not be seen or sent an IPI.
 *
 * You must not call this function with disabled interrupts or
 * from a hardware interrupt handler or from a bottom half handler.
 */
void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
            smp_call_func_t func, void *info, bool wait,
            gfp_t gfp_flags)
{
    cpumask_var_t cpus;
    int cpu, ret;

    might_sleep_if(gfp_flags & __GFP_WAIT);

    if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
        preempt_disable();
        for_each_online_cpu(cpu)
            if (cond_func(cpu, info))
                cpumask_set_cpu(cpu, cpus);
        on_each_cpu_mask(cpus, func, info, wait);
        preempt_enable();
        free_cpumask_var(cpus);
    } else {
        /*
         * No free cpumask, bother. No matter, we'll
         * just have to IPI them one by one.
         */
        preempt_disable();
        for_each_online_cpu(cpu)
            if (cond_func(cpu, info)) {
                ret = smp_call_function_single(cpu, func,
                                info, wait);
                WARN_ON_ONCE(ret);
            }
        preempt_enable();
    }
}

Among the parameters, cond_func is a callback that receives a CPU id plus the caller's info pointer and decides whether that CPU needs to be interrupted to run func; info is handed to both cond_func and func; wait is a bool selecting whether to wait for func to complete on all targeted CPUs (true means wait); and gfp_flags governs the allocation of the cpumask used internally.

With the parameters understood, look at the implementation. might_sleep_if() first notes that the call may sleep if the GFP flags allow a blocking allocation; zalloc_cpumask_var() then allocates a zeroed cpumask. If the allocation succeeds, preempt_disable() disables preemption (so no CPU can go offline under us), for_each_online_cpu() walks the online CPUs, and every CPU for which cond_func() (here has_cpu_slab()) returns true is marked in the mask with cpumask_set_cpu(). on_each_cpu_mask() then interrupts each marked CPU to run func() (here flush_cpu_slab()), after which preemption is re-enabled and the cpumask freed. If zalloc_cpumask_var() cannot allocate the mask, the fallback path interrupts the qualifying CPUs one by one via smp_call_function_single(); the end result is the same, so it is not analyzed separately.

Correspondingly, here is the implementation of has_cpu_slab(), the predicate passed to on_each_cpu_cond():

【file:/mm/slub.c】
static bool has_cpu_slab(int cpu, void *info)
{
    struct kmem_cache *s = info;
    struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

    return c->page || c->partial;
}

As seen, this predicate checks whether the given CPU holds any slab of this cache, either an active slab (c->page) or per-CPU partial slabs (c->partial). If so it returns true, meaning that CPU must be interrupted to release its local slabs.

As for flush_cpu_slab(), the other callback passed to on_each_cpu_cond():

【file:/mm/slub.c】
static void flush_cpu_slab(void *d)
{
    struct kmem_cache *s = d;

    __flush_cpu_slab(s, smp_processor_id());
}

This function simply wraps __flush_cpu_slab(), implemented as:

【file:/mm/slub.c】
/*
 * Flush cpu slab.
 *
 * Called from IPI handler with interrupts disabled.
 */
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
    struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

    if (likely(c)) {
        if (c->page)
            flush_slab(s, c);

        unfreeze_partials(s, c);
    }
}

The implementation is straightforward: it releases the slabs held by one CPU. It fetches that CPU's kmem_cache_cpu management structure; if the CPU has an active slab (c->page), flush_slab() releases it, and unfreeze_partials() then releases the slabs on the CPU's partial list.

And the implementation of flush_slab():

【file:/mm/slub.c】
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
    stat(s, CPUSLAB_FLUSH);
    deactivate_slab(s, c->page, c->freelist);

    c->tid = next_tid(c->tid);
    c->page = NULL;
    c->freelist = NULL;
}

After bumping the CPUSLAB_FLUSH statistic, it deactivates the CPU's active slab via deactivate_slab(), handing the slab (and the CPU's remaining free objects) back to its node; it then advances the transaction id with next_tid() and clears the CPU's page and freelist pointers. The implementation of deactivate_slab():

【file:/mm/slub.c】
/*
 * Remove the cpu slab
 */
static void deactivate_slab(struct kmem_cache *s, struct page *page,
                void *freelist)
{
    enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
    struct kmem_cache_node *n = get_node(s, page_to_nid(page));
    int lock = 0;
    enum slab_modes l = M_NONE, m = M_NONE;
    void *nextfree;
    int tail = DEACTIVATE_TO_HEAD;
    struct page new;
    struct page old;

    if (page->freelist) {
        stat(s, DEACTIVATE_REMOTE_FREES);
        tail = DEACTIVATE_TO_TAIL;
    }

    /*
     * Stage one: Free all available per cpu objects back
     * to the page freelist while it is still frozen. Leave the
     * last one.
     *
     * There is no need to take the list->lock because the page
     * is still frozen.
     */
    while (freelist && (nextfree = get_freepointer(s, freelist))) {
        void *prior;
        unsigned long counters;

        do {
            prior = page->freelist;
            counters = page->counters;
            set_freepointer(s, freelist, prior);
            new.counters = counters;
            new.inuse--;
            VM_BUG_ON(!new.frozen);

        } while (!__cmpxchg_double_slab(s, page,
            prior, counters,
            freelist, new.counters,
            "drain percpu freelist"));

        freelist = nextfree;
    }

    /*
     * Stage two: Ensure that the page is unfrozen while the
     * list presence reflects the actual number of objects
     * during unfreeze.
     *
     * We setup the list membership and then perform a cmpxchg
     * with the count. If there is a mismatch then the page
     * is not unfrozen but the page is on the wrong list.
     *
     * Then we restart the process which may have to remove
     * the page from the list that we just put it on again
     * because the number of objects in the slab may have
     * changed.
     */
redo:

    old.freelist = page->freelist;
    old.counters = page->counters;
    VM_BUG_ON(!old.frozen);

    /* Determine target state of the slab */
    new.counters = old.counters;
    if (freelist) {
        new.inuse--;
        set_freepointer(s, freelist, old.freelist);
        new.freelist = freelist;
    } else
        new.freelist = old.freelist;

    new.frozen = 0;

    if (!new.inuse && n->nr_partial > s->min_partial)
        m = M_FREE;
    else if (new.freelist) {
        m = M_PARTIAL;
        if (!lock) {
            lock = 1;
            /*
             * Taking the spinlock removes the possibility
             * that acquire_slab() will see a slab page that
             * is frozen
             */
            spin_lock(&n->list_lock);
        }
    } else {
        m = M_FULL;
        if (kmem_cache_debug(s) && !lock) {
            lock = 1;
            /*
             * This also ensures that the scanning of full
             * slabs from diagnostic functions will not see
             * any frozen slabs.
             */
            spin_lock(&n->list_lock);
        }
    }

    if (l != m) {

        if (l == M_PARTIAL)

            remove_partial(n, page);

        else if (l == M_FULL)

            remove_full(s, n, page);

        if (m == M_PARTIAL) {

            add_partial(n, page, tail);
            stat(s, tail);

        } else if (m == M_FULL) {

            stat(s, DEACTIVATE_FULL);
            add_full(s, n, page);

        }
    }

    l = m;
    if (!__cmpxchg_double_slab(s, page,
                old.freelist, old.counters,
                new.freelist, new.counters,
                "unfreezing slab"))
        goto redo;

    if (lock)
        spin_unlock(&n->list_lock);

    if (m == M_FREE) {
        stat(s, DEACTIVATE_EMPTY);
        discard_slab(s, page);
        stat(s, FREE_SLAB);
    }
}

if (page->freelist) checks whether the slab page's own freelist is empty. If it is empty, all of the slab's objects were handed over to the CPU's kmem_cache_cpu freelist when the slab was frozen. If it is not empty, some of this frozen slab's objects have since been freed by other CPUs, so the DEACTIVATE_REMOTE_FREES statistic is bumped and tail is set to DEACTIVATE_TO_TAIL, meaning the slab will later be queued at the tail of the partial list.

The while loop that follows is stage one of deactivating the CPU's slab. It walks the CPU's freelist, fetching each next free object with get_freepointer(), and the inner do-while loop uses the atomic __cmpxchg_double_slab(), introduced in an earlier article, to splice the object back onto the head of the slab page's freelist, retrying until the compare-and-exchange succeeds. One point worth noting: this stage does not return every object. The outer loop exits when nextfree = get_freepointer(s, freelist) yields NULL, so if deactivate_slab() was entered with a non-empty freelist, exactly one object is still held when the loop ends; the reason is analyzed shortly. In short, while the page is still frozen (hence no need to take list_lock), stage one drains all of the CPU's free objects except the last one back to the page's freelist.
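
The "drain all but the last object" loop can be demonstrated in isolation. Below is a user-space sketch under simplified assumptions, a plain singly linked free list and no concurrency, so the __cmpxchg_double_slab() retry collapses into ordinary pointer splicing:

#include <stdio.h>

struct object { struct object *next; };

/* Push every object of the cpu freelist except the last one onto the
 * head of the page freelist, mimicking stage one of deactivate_slab().
 * Returns the single object held back (NULL if the list was empty). */
static struct object *drain_all_but_last(struct object *cpu_freelist,
                                         struct object **page_freelist)
{
    struct object *nextfree;

    while (cpu_freelist && (nextfree = cpu_freelist->next)) {
        /* Equivalent of set_freepointer() + __cmpxchg_double_slab():
         * splice the object onto the head of the page freelist. */
        cpu_freelist->next = *page_freelist;
        *page_freelist = cpu_freelist;
        cpu_freelist = nextfree;
    }
    return cpu_freelist; /* stage two links this one back in */
}

int main(void)
{
    struct object objs[3] = { { &objs[1] }, { &objs[2] }, { NULL } };
    struct object *page_fl = NULL;
    struct object *kept = drain_all_but_last(&objs[0], &page_fl);

    /* Prints kept=2 head=1: objects 0 and 1 went back in LIFO order,
     * while the last object (2) is held back for stage two. */
    printf("kept=%td head=%td\n", kept - objs, page_fl - objs);
    return 0;
}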

Then comes stage two, the code from the redo label onward. It first snapshots the slab page's freelist and counters into the temporary old structure for later use. If if (freelist) is true, i.e. stage one held one object back, that last object is now linked in front of old.freelist, new.freelist takes it as the new head, and new.inuse is decremented for it; at this point new.freelist covers all of the slab's free objects. Holding that object back means returning it happens in the same compare-and-exchange that unfreezes the slab, so the choice of target list below is made against the slab's true final object count. Next, new.frozen = 0 marks the prospective state as unfrozen.

The target state is then decided. if (!new.inuse && n->nr_partial > s->min_partial) means no object in the slab is in use and the node already holds more partial slabs than the minimum, so the slab should be destroyed and m is set to M_FREE. Otherwise, else if (new.freelist) means the freelist is non-empty and only some objects are in use, so m is M_PARTIAL; taking the node's list_lock here ensures acquire_slab() can never observe a frozen slab on the list. The final else means the freelist is empty, every object in the slab is in use, and m is M_FULL.

The comparison if (l != m) asks whether the slab's recorded list state l (initially M_NONE) differs from the intended state m; if so, a move is needed: the slab is first taken off its old list with remove_partial() or remove_full() according to l, then added with add_partial() or add_full() according to m, and l is updated to m. Now we reach if (!__cmpxchg_double_slab()): it atomically checks whether the slab has been modified since the snapshot at redo, and if not installs new's free objects and counters on the slab while clearing frozen; otherwise it jumps back to the redo label and repeats the whole procedure. If all goes well, the slab is now fully deactivated.
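
The three-way choice of target state can be factored out as a small pure function. A sketch mirroring the kernel logic (slab_target_state() and its parameter list are reconstructed for illustration):

enum slab_mode { M_NONE, M_PARTIAL, M_FULL, M_FREE };

/* Mirror of stage two's target-state selection in deactivate_slab():
 * inuse       - objects still allocated once the held-back one returns
 * has_free    - whether new.freelist is non-NULL
 * nr_partial  - partial slabs the node already holds (n->nr_partial)
 * min_partial - per-cache floor of partial slabs kept (s->min_partial) */
static enum slab_mode slab_target_state(unsigned int inuse, int has_free,
                                        unsigned long nr_partial,
                                        unsigned long min_partial)
{
    if (!inuse && nr_partial > min_partial)
        return M_FREE;      /* empty, node has spares: discard the slab */
    if (has_free)
        return M_PARTIAL;   /* some objects free: node's partial list */
    return M_FULL;          /* fully allocated: full list */
}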

Finally, if m is M_FREE, the slab is no longer needed: the DEACTIVATE_EMPTY statistic is updated, discard_slab() destroys the slab, and FREE_SLAB is counted.

This concludes the analysis of the SLUB allocator.