Memory Management · 2015-11-21

【Linux Memory Management】SLUB Allocation Algorithm (4)

The previous articles analyzed the initialization of the SLUB allocator and the creation of slab caches (the slab resource pools). This article goes one step further and looks at how SLUB actually allocates objects.
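Before diving into the internals, here is a minimal sketch of how a caller typically uses this interface from a kernel module. The cache name, object type and field values below are made up for illustration; the calls themselves (kmem_cache_create()/kmem_cache_alloc()/kmem_cache_free()/kmem_cache_destroy()) are the standard slab API whose allocation path is analyzed in the rest of this article.

#include <linux/module.h>
#include <linux/slab.h>

struct foo {                          /* hypothetical object type */
    int  id;
    char name[32];
};

static struct kmem_cache *foo_cachep; /* hypothetical cache */

static int __init foo_init(void)
{
    struct foo *obj;

    /* Create a dedicated cache of 'struct foo' objects. */
    foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
                                   0, SLAB_HWCACHE_ALIGN, NULL);
    if (!foo_cachep)
        return -ENOMEM;

    /* This is the call analyzed below: kmem_cache_alloc() -> slab_alloc(). */
    obj = kmem_cache_alloc(foo_cachep, GFP_KERNEL);
    if (!obj) {
        kmem_cache_destroy(foo_cachep);
        return -ENOMEM;
    }
    obj->id = 1;

    kmem_cache_free(foo_cachep, obj);
    return 0;
}

static void __exit foo_exit(void)
{
    kmem_cache_destroy(foo_cachep);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");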

kmem_cache_alloc() is the entry point for allocating a slab object. Its implementation:

【file:/mm/slub.c】
void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
    void *ret = slab_alloc(s, gfpflags, _RET_IP_);

    trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size,
                s->size, gfpflags);

    return ret;
}

 

This function delegates the actual allocation to slab_alloc(), while trace_kmem_cache_alloc() records a trace event for the allocation. Let's look at the implementation of slab_alloc():

【file:/mm/slub.c】
static __always_inline void *slab_alloc(struct kmem_cache *s,
        gfp_t gfpflags, unsigned long addr)
{
    return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
}

 

This function is a thin wrapper around slab_alloc_node():

【file:/mm/slub.c】
/*
 * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
 * have the fastpath folded into their functions. So no function call
 * overhead for requests that can be satisfied on the fastpath.
 *
 * The fastpath works by first checking if the lockless freelist can be used.
 * If not then __slab_alloc is called for slow processing.
 *
 * Otherwise we can simply pick the next object from the lockless free list.
 */
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
        gfp_t gfpflags, int node, unsigned long addr)
{
    void **object;
    struct kmem_cache_cpu *c;
    struct page *page;
    unsigned long tid;

    if (slab_pre_alloc_hook(s, gfpflags))
        return NULL;

    s = memcg_kmem_get_cache(s, gfpflags);
redo:
    /*
     * Must read kmem_cache cpu data via this cpu ptr. Preemption is
     * enabled. We may switch back and forth between cpus while
     * reading from one cpu area. That does not matter as long
     * as we end up on the original cpu again when doing the cmpxchg.
     *
     * Preemption is disabled for the retrieval of the tid because that
     * must occur from the current processor. We cannot allow rescheduling
     * on a different processor between the determination of the pointer
     * and the retrieval of the tid.
     */
    preempt_disable();
    c = __this_cpu_ptr(s->cpu_slab);

    /*
     * The transaction ids are globally unique per cpu and per operation on
     * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
     * occurs on the right processor and that there was no operation on the
     * linked list in between.
     */
    tid = c->tid;
    preempt_enable();

    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node)))
        object = __slab_alloc(s, gfpflags, node, addr, c);

    else {
        void *next_object = get_freepointer_safe(s, object);

        /*
         * The cmpxchg will only match if there was no additional
         * operation and if we are on the right processor.
         *
         * The cmpxchg does the following atomically (without lock
         * semantics!)
         * 1. Relocate first pointer to the current per cpu area.
         * 2. Verify that tid and freelist have not been changed
         * 3. If they were not changed replace tid and freelist
         *
         * Since this is without lock semantics the protection is only
         * against code executing on this cpu *not* from access by
         * other cpus.
         */
        if (unlikely(!this_cpu_cmpxchg_double(
                s->cpu_slab->freelist, s->cpu_slab->tid,
                object, tid,
                next_object, next_tid(tid)))) {

            note_cmpxchg_failure("slab_alloc", s, tid);
            goto redo;
        }
        prefetch_freepointer(s, next_object);
        stat(s, ALLOC_FASTPATH);
    }

    if (unlikely(gfpflags & __GFP_ZERO) && object)
        memset(object, 0, s->object_size);

    slab_post_alloc_hook(s, gfpflags, object);

    return object;
}

 

If CONFIG_SLUB_DEBUG is enabled, slab_pre_alloc_hook() pre-processes the allocation (it sanitizes the gfp flags, checks that a sleeping allocation is not made in atomic context, and lets the fault-injection framework decide whether to fail this request); a non-zero return aborts the allocation with NULL. Then, if CONFIG_MEMCG_KMEM is enabled, memcg_kmem_get_cache() replaces the kmem_cache pointer with the kmem_cache of the current task's memory cgroup (memcg).
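For reference, in kernels of this vintage the debug variant of slab_pre_alloc_hook() looks roughly like the following (a reconstructed sketch, not a verbatim quote; with CONFIG_SLUB_DEBUG off it degenerates to a stub that simply returns 0):

static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
{
    flags &= gfp_allowed_mask;              /* mask off flags not allowed at this stage (e.g. early boot) */
    lockdep_trace_alloc(flags);             /* lockdep bookkeeping for reclaim-related deadlock checks */
    might_sleep_if(flags & __GFP_WAIT);     /* warn if a sleeping allocation is made in atomic context */

    return should_failslab(s->object_size, flags, s->flags);  /* fault-injection hook */
}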

The allocation proper then begins: preempt_disable() disables kernel preemption, __this_cpu_ptr() fetches the current CPU's kmem_cache_cpu structure, the tid (transaction id) of that kmem_cache_cpu is read, and preemption is re-enabled with preempt_enable().
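The tid is what later allows the cmpxchg to detect both a concurrent operation on the freelist and a migration to another CPU: each CPU's tid starts at its CPU number and advances in steps of a power of two no smaller than NR_CPUS, so tids of different CPUs can never collide. The helpers in mm/slub.c are roughly the following (reconstructed sketch):

#ifdef CONFIG_PREEMPT
/* Transaction ids start at the cpu number and advance by TID_STEP,
 * so tids of different cpus never overlap. */
#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
#else
#define TID_STEP 1
#endif

static inline unsigned long next_tid(unsigned long tid)
{
    return tid + TID_STEP;
}

static inline unsigned int init_tid(int cpu)
{
    return cpu;
}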

The following if (unlikely(!object || !node_match(page, node))) checks whether the per-CPU freelist is empty or whether the page currently used by this CPU does not belong to the requested NUMA node. If either is true, the allocation falls back to __slab_alloc(); otherwise the else branch performs the fast-path allocation.
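node_match() is a trivial helper; in this kernel version it looks roughly like this (reconstructed sketch):

static inline int node_match(struct page *page, int node)
{
#ifdef CONFIG_NUMA
    /* No page yet, or a specific node was requested and the page lives elsewhere. */
    if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
        return 0;
#endif
    return 1;
}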

__slab_alloc() will be analyzed later; let's look at the else branch first. It calls get_freepointer_safe() to read the address of the free object that follows the current one, then tries to claim the current object with the atomic this_cpu_cmpxchg_double(). On success, prefetch_freepointer() prefetches the next free object's data into the cache; on failure, note_cmpxchg_failure() records the event and control jumps back to the redo label to retry the allocation. The key here is the this_cpu_cmpxchg_double() atomic operation, which does three things: 1) it addresses the freelist and tid fields in the current CPU's per-CPU area; 2) it verifies that neither tid nor freelist has been modified, i.e. that the allocation was neither raced against nor migrated to another CPU; 3) if they are unchanged, it installs the new freelist and tid values in the same step.

Expanded into plain C, what this this_cpu_cmpxchg_double() call does is roughly the following (keeping in mind that the real thing happens as a single atomic instruction, without locking):

/* all of this happens as one atomic step */
if (this_cpu_read(s->cpu_slab->freelist) == object &&
    this_cpu_read(s->cpu_slab->tid) == tid) {
    /* nothing changed and we are still on the same cpu: commit */
    this_cpu_write(s->cpu_slab->freelist, next_object);
    this_cpu_write(s->cpu_slab->tid, next_tid(tid));
    return true;
} else {
    /* the freelist was touched or we migrated: the caller retries via redo */
    return false;
}

Because all of this is performed by a single atomic instruction, no lock/unlock is needed at all, and on multi-core systems there is no waiting on lock contention caused by CPU migration, which makes the fast path extremely efficient. This kind of lock-free technique is also an excellent tool for removing performance bottlenecks in ordinary application development.
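As a userspace analogy (not kernel code), the same freelist-plus-version idea can be expressed with C11 atomics: pair the list head with a counter and update both with one double-width compare-and-swap, retrying on failure just like the redo label above. The names below are made up for illustration; on x86-64 this typically needs gcc -mcx16 (and possibly -latomic) for the 16-byte CAS to be lock-free.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct node { struct node *next; };

struct head {
    struct node *first;    /* analogous to c->freelist */
    uintptr_t    version;  /* analogous to c->tid: defeats the ABA problem */
};

static _Atomic struct head list_head;

/* Pop one node lock-free; returns NULL when the list is empty. */
static struct node *lockfree_pop(void)
{
    struct head old = atomic_load(&list_head);
    struct head new;

    do {
        if (!old.first)
            return NULL;
        /* Safe only because nodes are never unmapped (as is true for slab objects). */
        new.first   = old.first->next;
        new.version = old.version + 1;
        /* On failure 'old' is reloaded with the current value: retry, like redo. */
    } while (!atomic_compare_exchange_weak(&list_head, &old, new));

    return old.first;
}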

Once an object has been obtained, it is zeroed with memset() if the caller passed __GFP_ZERO, and slab_post_alloc_hook() then performs the post-allocation processing (debugging/tracking hooks such as kmemleak).
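This __GFP_ZERO handling is also why kmem_cache_zalloc() needs no special support in SLUB; in include/linux/slab.h it is essentially just:

static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
{
    return kmem_cache_alloc(k, flags | __GFP_ZERO);
}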

Finally, let's come back to the implementation of __slab_alloc():

【file:/mm/slub.c】
/*
 * Slow path. The lockless freelist is empty or we need to perform
 * debugging duties.
 *
 * Processing is still very fast if new objects have been freed to the
 * regular freelist. In that case we simply take over the regular freelist
 * as the lockless freelist and zap the regular freelist.
 *
 * If that is not working then we fall back to the partial lists. We take the
 * first element of the freelist as the object to allocate now and move the
 * rest of the freelist to the lockless freelist.
 *
 * And if we were unable to get a new slab from the partial slab lists then
 * we need to allocate a new slab. This is the slowest path since it involves
 * a call to the page allocator and the setup of a new slab.
 */
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
              unsigned long addr, struct kmem_cache_cpu *c)
{
    void *freelist;
    struct page *page;
    unsigned long flags;

    local_irq_save(flags);
#ifdef CONFIG_PREEMPT
    /*
     * We may have been preempted and rescheduled on a different
     * cpu before disabling interrupts. Need to reload cpu area
     * pointer.
     */
    c = this_cpu_ptr(s->cpu_slab);
#endif

    page = c->page;
    if (!page)
        goto new_slab;
redo:

    if (unlikely(!node_match(page, node))) {
        stat(s, ALLOC_NODE_MISMATCH);
        deactivate_slab(s, page, c->freelist);
        c->page = NULL;
        c->freelist = NULL;
        goto new_slab;
    }

    /*
     * By rights, we should be searching for a slab page that was
     * PFMEMALLOC but right now, we are losing the pfmemalloc
     * information when the page leaves the per-cpu allocator
     */
    if (unlikely(!pfmemalloc_match(page, gfpflags))) {
        deactivate_slab(s, page, c->freelist);
        c->page = NULL;
        c->freelist = NULL;
        goto new_slab;
    }

    /* must check again c->freelist in case of cpu migration or IRQ */
    freelist = c->freelist;
    if (freelist)
        goto load_freelist;

    stat(s, ALLOC_SLOWPATH);

    freelist = get_freelist(s, page);

    if (!freelist) {
        c->page = NULL;
        stat(s, DEACTIVATE_BYPASS);
        goto new_slab;
    }

    stat(s, ALLOC_REFILL);

load_freelist:
    /*
     * freelist is pointing to the list of objects to be used.
     * page is pointing to the page from which the objects are obtained.
     * That page must be frozen for per cpu allocations to work.
     */
    VM_BUG_ON(!c->page->frozen);
    c->freelist = get_freepointer(s, freelist);
    c->tid = next_tid(c->tid);
    local_irq_restore(flags);
    return freelist;

new_slab:

    if (c->partial) {
        page = c->page = c->partial;
        c->partial = page->next;
        stat(s, CPU_PARTIAL_ALLOC);
        c->freelist = NULL;
        goto redo;
    }

    freelist = new_slab_objects(s, gfpflags, node, &c);

    if (unlikely(!freelist)) {
        if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
            slab_out_of_memory(s, gfpflags, node);

        local_irq_restore(flags);
        return NULL;
    }

    page = c->page;
    if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
        goto load_freelist;

    /* Only entered in the debug case */
    if (kmem_cache_debug(s) &&
            !alloc_debug_processing(s, page, freelist, addr))
        goto new_slab;	/* Slab failed checks. Next slab needed */

    deactivate_slab(s, page, get_freepointer(s, freelist));
    c->page = NULL;
    c->freelist = NULL;
    local_irq_restore(flags);
    return freelist;
}

 

__slab_alloc() is the slow path of slab allocation; it is taken when the lockless freelist is empty or when debugging work has to be performed.

The function first calls local_irq_save() to disable interrupts on the local processor while remembering their previous state. If CONFIG_PREEMPT is enabled, the task may have been preempted and rescheduled onto a different CPU before interrupts were disabled, so the per-CPU pointer is reloaded with this_cpu_ptr(). If c->page is NULL, i.e. there is no CPU-local slab, a new one is obtained via the new_slab branch.

When c->page is not NULL, node_match() checks whether the page belongs to the requested node; on a mismatch the CPU-local slab is deactivated with deactivate_slab() (handed back to the node's lists). pfmemalloc_match() then checks whether the page's pfmemalloc property is compatible with the allocation flags; if not, the slab is likewise deactivated and a new one is sought.
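pfmemalloc pages come from the emergency reserves handed out to slabs that are themselves used for memory reclaim (e.g. swap over the network), so an ordinary allocation must not be satisfied from them. The helper is roughly the following (reconstructed sketch):

static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
{
    /* The slab page was allocated from the pfmemalloc reserves: only callers
     * that are themselves allowed to dip into the reserves may use it. */
    if (unlikely(PageSlabPfmemalloc(page)))
        return gfp_pfmemalloc_allowed(gfpflags);

    return true;
}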

Next, the free-object pointer c->freelist is checked again: a CPU migration or an interrupt may have refilled it before local interrupts were disabled. If it is not empty, the code jumps to load_freelist, where an object is taken off the freelist, the per-CPU data (freelist and tid) is updated, interrupts are restored, and the object's address is returned. If it is still empty, the ALLOC_SLOWPATH statistic is updated and get_freelist() tries to take over the page's own freelist. If that fails (if (!freelist)), the page is exhausted and a new slab must be found; otherwise the refill is accounted (ALLOC_REFILL) and the code proceeds to load_freelist to take an object and return it.
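Conceptually, get_freelist() detaches the page's regular freelist and hands it over to the per-CPU lockless freelist in one atomic step. A simplified pseudocode sketch of its effect (not the exact kernel code, which uses cmpxchg_double on the (freelist, counters) pair):

/* pseudocode: the whole update is performed atomically */
freelist = page->freelist;          /* remember the page's free objects          */
page->freelist = NULL;              /* the page itself no longer owns them        */
page->inuse = page->objects;        /* all objects now count as handed out        */
page->frozen = (freelist != NULL);  /* stay frozen to this cpu if we got a refill,
                                       otherwise unfreeze and bypass the page      */
return freelist;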

Finally, look at the new_slab branch of the function. It first checks if (c->partial) whether the per-CPU partial list is non-empty; if so, the first slab is taken off the partial list and installed as c->page, the CPU_PARTIAL_ALLOC statistic is updated, and the code jumps back to redo to retry the allocation. If the partial list is empty, all slabs currently held by this CPU are fully in use, so a new slab must be obtained through new_slab_objects().

If that fails, if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) checks whether warnings are allowed for this allocation and the console rate limit has not been hit, in which case slab_out_of_memory() logs the failure; interrupts are then restored and NULL is returned to report the failure. If a freelist was obtained, if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags))) checks that debugging is disabled and the page's pfmemalloc property matches the request; if so, the code jumps to load_freelist to hand out an object. Otherwise, if debugging is enabled and alloc_debug_processing() fails its checks, the slab is abandoned and another one is requested via new_slab. In the remaining cases (the debug checks passed, or debugging is off but the pfmemalloc check failed), deactivate_slab() takes the page off the CPU slot, interrupts are restored, and the object already obtained (freelist) is returned.

Let's take a closer look at the implementation of new_slab_objects():

【file:/mm/slub.c】
static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
            int node, struct kmem_cache_cpu **pc)
{
    void *freelist;
    struct kmem_cache_cpu *c = *pc;
    struct page *page;

    freelist = get_partial(s, flags, node, c);

    if (freelist)
        return freelist;

    page = new_slab(s, flags, node);
    if (page) {
        c = __this_cpu_ptr(s->cpu_slab);
        if (c->page)
            flush_slab(s, c);

        /*
         * No other reference to the page yet so we can
         * muck around with it freely without cmpxchg
         */
        freelist = page->freelist;
        page->freelist = NULL;

        stat(s, ALLOC_SLAB);
        c->page = page;
        *pc = c;
    } else
        freelist = NULL;

    return freelist;
}

 

Before trying to create a new slab, this function first calls get_partial() to look for an existing slab with free objects on the partial lists and, if one is found, returns its free objects directly. Otherwise new_slab() allocates a brand-new slab from the page allocator. Once it is created, any slab that may meanwhile have appeared in c->page is flushed with flush_slab(), the new page's freelist is detached (page->freelist is set to NULL since the page is about to be frozen for this CPU), the page is installed as c->page, and the freelist is returned.

This concludes the object-allocation flow of the SLUB allocator.