【Linux内存源码分析】SLUB分配算法(4)

前面已经分析了slub分配算法的初始化及slab资源池的创建,现在进一步分析一下slub分配算法的分配实现。

kmem_cache_alloc()是申请slab对象的入口函数,其实现:

【file:/mm/slub.c】
void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
	void *ret = slab_alloc(s, gfpflags, _RET_IP_);

	trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size,
				s->size, gfpflags);

	return ret;
}

该函数主要是通过slab_alloc()来分配对象,而trace_kmem_cache_alloc()则是用于记录slab分配轨迹。    那么接下来分析slab_alloc()的实现:

【file:/mm/slub.c】
static __always_inline void *slab_alloc(struct kmem_cache *s,
		gfp_t gfpflags, unsigned long addr)
{
	return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
}

        该函数封装了slab_alloc_node()

【file:/mm/slub.c】
/*
 * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
 * have the fastpath folded into their functions. So no function call
 * overhead for requests that can be satisfied on the fastpath.
 *
 * The fastpath works by first checking if the lockless freelist can be used.
 * If not then __slab_alloc is called for slow processing.
 *
 * Otherwise we can simply pick the next object from the lockless free list.
 */
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
		gfp_t gfpflags, int node, unsigned long addr)
{
	void **object;
	struct kmem_cache_cpu *c;
	struct page *page;
	unsigned long tid;

	if (slab_pre_alloc_hook(s, gfpflags))
		return NULL;

	s = memcg_kmem_get_cache(s, gfpflags);
redo:
	/*
	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
	 * enabled. We may switch back and forth between cpus while
	 * reading from one cpu area. That does not matter as long
	 * as we end up on the original cpu again when doing the cmpxchg.
	 *
	 * Preemption is disabled for the retrieval of the tid because that
	 * must occur from the current processor. We cannot allow rescheduling
	 * on a different processor between the determination of the pointer
	 * and the retrieval of the tid.
	 */
	preempt_disable();
	c = __this_cpu_ptr(s->cpu_slab);

	/*
	 * The transaction ids are globally unique per cpu and per operation on
	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
	 * occurs on the right processor and that there was no operation on the
	 * linked list in between.
	 */
	tid = c->tid;
	preempt_enable();

	object = c->freelist;
	page = c->page;
	if (unlikely(!object || !node_match(page, node)))
		object = __slab_alloc(s, gfpflags, node, addr, c);

	else {
		void *next_object = get_freepointer_safe(s, object);

		/*
		 * The cmpxchg will only match if there was no additional
		 * operation and if we are on the right processor.
		 *
		 * The cmpxchg does the following atomically (without lock
		 * semantics!)
		 * 1. Relocate first pointer to the current per cpu area.
		 * 2. Verify that tid and freelist have not been changed
		 * 3. If they were not changed replace tid and freelist
		 *
		 * Since this is without lock semantics the protection is only
		 * against code executing on this cpu *not* from access by
		 * other cpus.
		 */
		if (unlikely(!this_cpu_cmpxchg_double(
				s->cpu_slab->freelist, s->cpu_slab->tid,
				object, tid,
				next_object, next_tid(tid)))) {

			note_cmpxchg_failure("slab_alloc", s, tid);
			goto redo;
		}
		prefetch_freepointer(s, next_object);
		stat(s, ALLOC_FASTPATH);
	}

	if (unlikely(gfpflags & __GFP_ZERO) && object)
		memset(object, 0, s->object_size);

	slab_post_alloc_hook(s, gfpflags, object);

	return object;
}

如果开启了CONFIG_SLUB_DEBUG配置的情况下,将会经由slab_pre_alloc_hook()slub分配进行预处理,确保申请OK;接着如果开启了CONFIG_MEMCG_KMEM配置的情况下,将会通过memcg_kmem_get_cache()kmem_cache结构指针转换为mcgroup组的kmem_cache指针。

接下来的slab分配,将会通过preempt_disable()先行禁止内核抢占,继而通过__this_cpu_ptr()获取当前CPUkmem_cache_cpu结构,随后取得kmem_cache_cputid值后preempt_enable()使能内核抢占。

往下的if (unlikely(!object || !node_match(page, node)))判断当前CPUslab空闲列表是否为空或者当前slab使用内存页面与管理节点是否不匹配。如果其中某一条件为否定,则将通过__slab_alloc()进行slab分配;否则将进入else分支进行分配操作。

__slab_alloc()的分配稍后分析,现在看一下else分支的动作。其先经get_freepointer_safe()取得slab中空闲对象地址,接着使用this_cpu_cmpxchg_double()原子指令操作取得该空闲对象,如果获取成功将使用prefetch_freepointer()刷新数据,否则将经note_cmpxchg_failure()记录日志后重回redo标签再次尝试分配。这里面的关键是this_cpu_cmpxchg_double()原子指令操作。该原子操作主要做了三件事情:1)重定向首指针指向当前CPU的空间;2)判断tidfreelist未被修改;3)如果未被修改,也就是相等,确信此次slab分配未被CPU迁移,接着将新的tidfreelist数据覆盖过去以更新。

具体将this_cpu_cmpxchg_double()的功能展开用C语言表述就是:

if
((__this_cpu_ptr(s->cpu_slab->freelist) == object) &&
(__this_cpu_ptr(s->cpu_slab->tid) == tid))

{

    __this_cpu_ptr(s->cpu_slab->freelist)
= next_object;

    __this_cpu_ptr(s->cpu_slab->tid) =
next_tid(tid);

    return true;

}

else

{

    return false;

}

这里使用原子操作,其通过单指令方式实现完以上功能免除了加锁解锁操作,且完全避免了多核的情况下CPU迁移锁资源等待所带来的性能开销,极大地提升了效率。这在通常的程序开发中消除性能瓶颈也是极佳的手段。

完了分配到对象后,将会根据申请标志__GFP_ZERO将该对象进行格式化操作,然后经由slab_post_alloc_hook()进行对象分配后处理。

最后回来看一下__slab_alloc()函数实现:

【file:/mm/slub.c】
/*
 * Slow path. The lockless freelist is empty or we need to perform
 * debugging duties.
 *
 * Processing is still very fast if new objects have been freed to the
 * regular freelist. In that case we simply take over the regular freelist
 * as the lockless freelist and zap the regular freelist.
 *
 * If that is not working then we fall back to the partial lists. We take the
 * first element of the freelist as the object to allocate now and move the
 * rest of the freelist to the lockless freelist.
 *
 * And if we were unable to get a new slab from the partial slab lists then
 * we need to allocate a new slab. This is the slowest path since it involves
 * a call to the page allocator and the setup of a new slab.
 */
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
			  unsigned long addr, struct kmem_cache_cpu *c)
{
	void *freelist;
	struct page *page;
	unsigned long flags;

	local_irq_save(flags);
#ifdef CONFIG_PREEMPT
	/*
	 * We may have been preempted and rescheduled on a different
	 * cpu before disabling interrupts. Need to reload cpu area
	 * pointer.
	 */
	c = this_cpu_ptr(s->cpu_slab);
#endif

	page = c->page;
	if (!page)
		goto new_slab;
redo:

	if (unlikely(!node_match(page, node))) {
		stat(s, ALLOC_NODE_MISMATCH);
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}

	/*
	 * By rights, we should be searching for a slab page that was
	 * PFMEMALLOC but right now, we are losing the pfmemalloc
	 * information when the page leaves the per-cpu allocator
	 */
	if (unlikely(!pfmemalloc_match(page, gfpflags))) {
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}

	/* must check again c->freelist in case of cpu migration or IRQ */
	freelist = c->freelist;
	if (freelist)
		goto load_freelist;

	stat(s, ALLOC_SLOWPATH);

	freelist = get_freelist(s, page);

	if (!freelist) {
		c->page = NULL;
		stat(s, DEACTIVATE_BYPASS);
		goto new_slab;
	}

	stat(s, ALLOC_REFILL);

load_freelist:
	/*
	 * freelist is pointing to the list of objects to be used.
	 * page is pointing to the page from which the objects are obtained.
	 * That page must be frozen for per cpu allocations to work.
	 */
	VM_BUG_ON(!c->page->frozen);
	c->freelist = get_freepointer(s, freelist);
	c->tid = next_tid(c->tid);
	local_irq_restore(flags);
	return freelist;

new_slab:

	if (c->partial) {
		page = c->page = c->partial;
		c->partial = page->next;
		stat(s, CPU_PARTIAL_ALLOC);
		c->freelist = NULL;
		goto redo;
	}

	freelist = new_slab_objects(s, gfpflags, node, &c);

	if (unlikely(!freelist)) {
		if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
			slab_out_of_memory(s, gfpflags, node);

		local_irq_restore(flags);
		return NULL;
	}

	page = c->page;
	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
		goto load_freelist;

	/* Only entered in the debug case */
	if (kmem_cache_debug(s) &&
			!alloc_debug_processing(s, page, freelist, addr))
		goto new_slab;	/* Slab failed checks. Next slab needed */

	deactivate_slab(s, page, get_freepointer(s, freelist));
	c->page = NULL;
	c->freelist = NULL;
	local_irq_restore(flags);
	return freelist;
}

__slab_alloc()是slab申请的慢路径,这是由于freelist是空的或者需要执行调试任务。

该函数会先行local_irq_save()禁止本地处理器的中断并且记住它们之前的状态。如果配置CONFIG_PREEMPT了,为了避免因调度切换到不同的CPU,该函数会重新通过this_cpu_ptr()获取CPU域的指针;如果c->page为空,也就是cpu local slab不存在就经由new_slab分支新分配一个。

c->page不为空的情况下,会经node_match()判断页面与节点是否匹配,如果节点不匹配就通过deactivate_slab()去激活cpu本地slab;再然后通过pfmemalloc_match()判断当前页面属性是否为pfmemalloc,如果不是则同样去激活。

接着会再次检查空闲对象指针freelist是否为空,避免在禁止本地处理器中断前因发生了CPU迁移或者中断,导致本地的空闲对象指针不为空。如果不为空的情况下,将会跳转至load_freelist,这里将会把对象从空闲队列中取出,并更新数据信息,然后恢复中断使能,返回对象地址。如果为空,将会更新慢路径申请对象的统计信息,并通过get_freelist()从页面中获取空闲队列。if (!freelist)表示获取空闲队列失败,此时则需要创建新的slab,否则更新统计信息进入load_freelist分支取得对象并返回。

最后看一下该函数的new_slab分支的实现,首先会if (c->partial)判断partial是否为空,不为空则从partial中取出,然后跳转回redo重试分配。如果partial为空,意味着当前所有的slab都已经满负荷使用,那么则需使用new_slab_objects()创建新的slab。如果创建失败,那么将if (!(gfpflags &
__GFP_NOWARN) && printk_ratelimit())
判断申请页面是否配置为无告警,并且送往控制台的消息数量在临界值内,则调用slab_out_of_memory()记录日志后使能中断并返回NULL表示申请失败。否则将会if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page,
gfpflags)))
判断是否未开启调试且页面属性匹配pfmemalloc,是则跳转至load_freelist分支进行slab对象分配;否则将会经if (kmem_cache_debug(s) && !alloc_debug_processing(s, page,
freelist, addr))
判断,若开启调试并且调试初始化失败,则返回创建新的slab。如果未开启调试或page调试初始化失败,都将会deactivate_slab()去激活该page,使能中断并返回。

       深入分析一下new_slab_objects()函数实现:

【file:/mm/slub.c】
static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
			int node, struct kmem_cache_cpu **pc)
{
	void *freelist;
	struct kmem_cache_cpu *c = *pc;
	struct page *page;

	freelist = get_partial(s, flags, node, c);

	if (freelist)
		return freelist;

	page = new_slab(s, flags, node);
	if (page) {
		c = __this_cpu_ptr(s->cpu_slab);
		if (c->page)
			flush_slab(s, c);

		/*
		 * No other reference to the page yet so we can
		 * muck around with it freely without cmpxchg
		 */
		freelist = page->freelist;
		page->freelist = NULL;

		stat(s, ALLOC_SLAB);
		c->page = page;
		*pc = c;
	} else
		freelist = NULL;

	return freelist;
}

该函数在尝试创建新的slab前,将先通过get_partial()获取存在空闲对象的slab并将对象返回;否则继而通过new_slab()创建slab,如果创建好slab后,将空闲对象链表摘下并返回。

这就是slub分配算法的对象分配流程。

发表评论

电子邮件地址不会被公开。 必填项已用*标注