Memory Management · 2017-01-09

[Linux Memory Management] An Analysis of the kmemleak Memory Leak Detector

The working principle of kmemleak is simple: it tracks the memory blocks allocated through interfaces such as kmalloc(), vmalloc() and kmem_cache_alloc(), adding their address, size, allocation call stack and other information to a search tree for management (older kernels used a prio search tree; in the code analyzed here it is the object_tree_root red-black tree). When a matching free operation occurs, the tracked information is removed from kmemleak's management.

During a memory scan (which also covers saved register values), if a block is found to have no pointer anywhere referring to its start address or to any address inside its range, that block is judged to be orphaned. This means its address can no longer reach any memory-freeing function by any path, so the block can be considered leaked.

The scanning algorithm itself is also simple:

  1. Mark all tracked memory objects white. Any object in the object tree that is still white after the scan will be judged orphaned.
  2. Scan memory starting from the data sections and the task stacks, checking whether any word of data matches an address recorded in kmemleak's search tree of tracked objects. If a pointer value is found referring to an object that is still white, that object is added to the gray list (marked grey).
  3. Scan the objects on the gray list in the same way, matching their contents against the tracked addresses in the tree; some objects that were white may turn grey and are appended to the end of the gray list.
  4. After the steps above, any object still marked white is considered orphaned and is reported through /sys/kernel/debug/kmemleak.

The main kmemleak functions are declared in the include/linux/kmemleak.h header. They are:

kmemleak_init – initialize kmemleak
kmemleak_alloc – notify of a memory block allocation
kmemleak_alloc_percpu – notify of a percpu memory block allocation
kmemleak_free – notify of a memory block freeing
kmemleak_free_part – notify of a partial memory block freeing
kmemleak_free_percpu – notify of a percpu memory block freeing
kmemleak_not_leak – mark an object as not a leak
kmemleak_ignore – do not scan or report an object as leak
kmemleak_scan_area – add scan areas inside a memory block
kmemleak_no_scan – do not scan a memory block
kmemleak_erase – erase an old value in a pointer variable
kmemleak_alloc_recursive – as kmemleak_alloc but checks the recursiveness
kmemleak_free_recursive – as kmemleak_free but checks the recursiveness

As can be seen, kmemleak exposes quite a few customization interfaces, such as excluding certain memory from scanning or marking certain blocks as not leaked, which is very handy for kernel development and debugging. After all, kmemleak's leak judgement is fairly crude, so false positives are hard to avoid entirely.
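As a quick illustration of how these annotations are used, here is a hypothetical driver snippet (the function and variable names are made up for the example; only the kmemleak_not_leak() call is the real API): a block that is intentionally kept for the lifetime of the system, and whose pointer the scanner may not be able to see, is annotated so it does not show up as a false positive.

#include <linux/slab.h>
#include <linux/kmemleak.h>

static void *persistent_buf;    /* hypothetical long-lived buffer */

static int example_setup(void)
{
    persistent_buf = kmalloc(4096, GFP_KERNEL);
    if (!persistent_buf)
        return -ENOMEM;

    /*
     * The buffer is never freed and its address may end up stored only
     * in a device register, where the scanner cannot see it; tell
     * kmemleak not to report it as a leak.
     */
    kmemleak_not_leak(persistent_buf);
    return 0;
}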

As usual, let us start from initialization. kmemleak_init() is kmemleak's memory-management initialization function: it sets up the tracking of allocated memory and is called from start_kernel(). Its implementation:

【file:/mm/kmemleak.c】
/*
 * Kmemleak initialization.
 */
void __init kmemleak_init(void)
{
    int i;
    unsigned long flags;

#ifdef CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF
    if (!kmemleak_skip_disable) {
        atomic_set(&kmemleak_early_log, 0);
        kmemleak_disable();
        return;
    }
#endif

    jiffies_min_age = msecs_to_jiffies(MSECS_MIN_AGE);
    jiffies_scan_wait = msecs_to_jiffies(SECS_SCAN_WAIT * 1000);

    object_cache = KMEM_CACHE(kmemleak_object, SLAB_NOLEAKTRACE);
    scan_area_cache = KMEM_CACHE(kmemleak_scan_area, SLAB_NOLEAKTRACE);

    if (crt_early_log >= ARRAY_SIZE(early_log))
        pr_warning("Early log buffer exceeded (%d), please increase "
               "DEBUG_KMEMLEAK_EARLY_LOG_SIZE\n", crt_early_log);

    /* the kernel is still in UP mode, so disabling the IRQs is enough */
    local_irq_save(flags);
    atomic_set(&kmemleak_early_log, 0);
    if (atomic_read(&kmemleak_error)) {
        local_irq_restore(flags);
        return;
    } else
        atomic_set(&kmemleak_enabled, 1);
    local_irq_restore(flags);

    /*
     * This is the point where tracking allocations is safe. Automatic
     * scanning is started during the late initcall. Add the early logged
     * callbacks to the kmemleak infrastructure.
     */
    for (i = 0; i < crt_early_log; i++) {
        struct early_log *log = &early_log[i];

        switch (log->op_type) {
        case KMEMLEAK_ALLOC:
            early_alloc(log);
            break;
        case KMEMLEAK_ALLOC_PERCPU:
            early_alloc_percpu(log);
            break;
        case KMEMLEAK_FREE:
            kmemleak_free(log->ptr);
            break;
        case KMEMLEAK_FREE_PART:
            kmemleak_free_part(log->ptr, log->size);
            break;
        case KMEMLEAK_FREE_PERCPU:
            kmemleak_free_percpu(log->ptr);
            break;
        case KMEMLEAK_NOT_LEAK:
            kmemleak_not_leak(log->ptr);
            break;
        case KMEMLEAK_IGNORE:
            kmemleak_ignore(log->ptr);
            break;
        case KMEMLEAK_SCAN_AREA:
            kmemleak_scan_area(log->ptr, log->size, GFP_KERNEL);
            break;
        case KMEMLEAK_NO_SCAN:
            kmemleak_no_scan(log->ptr);
            break;
        default:
            kmemleak_warn("Unknown early log operation: %d\n",
                      log->op_type);
        }

        if (atomic_read(&kmemleak_warning)) {
            print_log_trace(log);
            atomic_set(&kmemleak_warning, 0);
        }
    }
}

 

The function first initializes jiffies_min_age and jiffies_scan_wait: jiffies_min_age is the minimum age an object must reach before it is reported, and jiffies_scan_wait is the interval between automatic scans. It then creates the slab caches for the kmemleak_object and kmemleak_scan_area structures. Next it clears kmemleak_early_log, turning off the early-log recording, and checks kmemleak_error to see whether kmemleak has hit a serious error; if so it returns directly, otherwise it sets kmemleak_enabled to 1, meaning kmemleak is initialized and enabled. Finally, the for loop over early_log[] replays the early tracking records collected by log_early() before kmemleak was initialized.
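For reference, the early_log[] buffer replayed by this loop holds records of roughly the following shape; this is a sketch reconstructed from the fields used above (log->op_type, log->ptr, log->size, log->min_count, log->trace), not a verbatim quote of the kernel source.

/* one early-log record, filled in by log_early() below */
struct early_log {
    int op_type;                    /* kmemleak operation type (KMEMLEAK_ALLOC, ...) */
    const void *ptr;                /* allocated/freed memory block */
    size_t size;                    /* memory block size */
    int min_count;                  /* minimum reference count */
    unsigned long trace[MAX_TRACE]; /* stack trace at the time of the call */
    unsigned int trace_len;         /* stack trace length */
};

static struct early_log
    early_log[CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE] __initdata;
static int crt_early_log __initdata;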

The implementation of log_early():

【file:/mm/kmemleak.c】
/*
 * Log an early kmemleak_* call to the early_log buffer. These calls will be
 * processed later once kmemleak is fully initialized.
 */
static void __init log_early(int op_type, const void *ptr, size_t size,
                 int min_count)
{
    unsigned long flags;
    struct early_log *log;

    if (atomic_read(&kmemleak_error)) {
        /* kmemleak stopped recording, just count the requests */
        crt_early_log++;
        return;
    }

    if (crt_early_log >= ARRAY_SIZE(early_log)) {
        kmemleak_disable();
        return;
    }

    /*
     * There is no need for locking since the kernel is still in UP mode
     * at this stage. Disabling the IRQs is enough.
     */
    local_irq_save(flags);
    log = &early_log[crt_early_log];
    log->op_type = op_type;
    log->ptr = ptr;
    log->size = size;
    log->min_count = min_count;
    log->trace_len = __save_stack_trace(log->trace);
    crt_early_log++;
    local_irq_restore(flags);
}

 

This function is fairly simple. If kmemleak_error is non-zero, i.e. kmemleak has hit an error, it stops recording early-log entries but still increments the crt_early_log counter. Otherwise, if crt_early_log has already exceeded the capacity of early_log[], kmemleak_disable() is called to shut kmemleak down and the function returns. Finally, if there is still room, the current memory operation is recorded, including the operation type, memory address, size and call stack.

The places where log_early() is called correspond to the operation types it records: kmemleak_alloc(), kmemleak_alloc_percpu(), kmemleak_free(), kmemleak_free_part(), kmemleak_free_percpu(), kmemleak_not_leak(), kmemleak_ignore(), kmemleak_scan_area() and kmemleak_no_scan().

They correspond to the following type definitions:

【file:/mm/kmemleak.c】
/*
 * Early object allocation/freeing logging. Kmemleak is initialized after the
 * kernel allocator. However, both the kernel allocator and kmemleak may
 * allocate memory blocks which need to be tracked. Kmemleak defines an
 * arbitrary buffer to hold the allocation/freeing information before it is
 * fully initialized.
 */

/* kmemleak operation type for early logging */
enum {
    KMEMLEAK_ALLOC,
    KMEMLEAK_ALLOC_PERCPU,
    KMEMLEAK_FREE,
    KMEMLEAK_FREE_PART,
    KMEMLEAK_FREE_PERCPU,
    KMEMLEAK_NOT_LEAK,
    KMEMLEAK_IGNORE,
    KMEMLEAK_SCAN_AREA,
    KMEMLEAK_NO_SCAN
};

 

From the definitions alone one can already guess most of the picture: each operation is recorded and tracked, and then matched up once kmemleak is initialized. For example, if a KMEMLEAK_ALLOC record for some memory is later followed by a KMEMLEAK_FREE record for the same memory, the block has been freed and is removed from the tracking records; this is how allocations are managed and how leaks can be monitored.

Of course that is just a guess; to verify how it actually works, let us follow the KMEMLEAK_ALLOC and KMEMLEAK_FREE branches of kmemleak_init(), which call early_alloc() and kmemleak_free() respectively.

First, the early_alloc() function:

【file:/mm/kmemleak.c】
/*
 * Log an early allocated block and populate the stack trace.
 */
static void early_alloc(struct early_log *log)
{
    struct kmemleak_object *object;
    unsigned long flags;
    int i;

    if (!atomic_read(&kmemleak_enabled) || !log->ptr || IS_ERR(log->ptr))
        return;

    /*
     * RCU locking needed to ensure object is not freed via put_object().
     */
    rcu_read_lock();
    object = create_object((unsigned long)log->ptr, log->size,
                   log->min_count, GFP_ATOMIC);
    if (!object)
        goto out;
    spin_lock_irqsave(&object->lock, flags);
    for (i = 0; i < log->trace_len; i++)
        object->trace[i] = log->trace[i];
    object->trace_len = log->trace_len;
    spin_unlock_irqrestore(&object->lock, flags);
out:
    rcu_read_unlock();
}

 

The function first checks that kmemleak is enabled and that its arguments are valid, then creates a kmemleak tracking object via create_object() from the recorded information (pointer, size and so on), and finally copies the early-log call stack into the object, completing the tracking setup. It looks simple, but the real work is done inside create_object().

Let us look at the implementation of create_object():

【file:/mm/kmemleak.c】
/*
 * Create the metadata (struct kmemleak_object) corresponding to an allocated
 * memory block and add it to the object_list and object_tree_root.
 */
static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
                         int min_count, gfp_t gfp)
{
    unsigned long flags;
    struct kmemleak_object *object, *parent;
    struct rb_node **link, *rb_parent;

    object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
    if (!object) {
        pr_warning("Cannot allocate a kmemleak_object structure\n");
        kmemleak_disable();
        return NULL;
    }

    INIT_LIST_HEAD(&object->object_list);
    INIT_LIST_HEAD(&object->gray_list);
    INIT_HLIST_HEAD(&object->area_list);
    spin_lock_init(&object->lock);
    atomic_set(&object->use_count, 1);
    object->flags = OBJECT_ALLOCATED;
    object->pointer = ptr;
    object->size = size;
    object->min_count = min_count;
    object->count = 0;			/* white color initially */
    object->jiffies = jiffies;
    object->checksum = 0;

    /* task information */
    if (in_irq()) {
        object->pid = 0;
        strncpy(object->comm, "hardirq", sizeof(object->comm));
    } else if (in_softirq()) {
        object->pid = 0;
        strncpy(object->comm, "softirq", sizeof(object->comm));
    } else {
        object->pid = current->pid;
        /*
         * There is a small chance of a race with set_task_comm(),
         * however using get_task_comm() here may cause locking
         * dependency issues with current->alloc_lock. In the worst
         * case, the command line is not correct.
         */
        strncpy(object->comm, current->comm, sizeof(object->comm));
    }

    /* kernel backtrace */
    object->trace_len = __save_stack_trace(object->trace);

    write_lock_irqsave(&kmemleak_lock, flags);

    min_addr = min(min_addr, ptr);
    max_addr = max(max_addr, ptr + size);
    link = &object_tree_root.rb_node;
    rb_parent = NULL;
    while (*link) {
        rb_parent = *link;
        parent = rb_entry(rb_parent, struct kmemleak_object, rb_node);
        if (ptr + size <= parent->pointer)
            link = &parent->rb_node.rb_left;
        else if (parent->pointer + parent->size <= ptr)
            link = &parent->rb_node.rb_right;
        else {
            kmemleak_stop("Cannot insert 0x%lx into the object "
                      "search tree (overlaps existing)\n",
                      ptr);
            kmem_cache_free(object_cache, object);
            object = parent;
            spin_lock(&object->lock);
            dump_object_info(object);
            spin_unlock(&object->lock);
            goto out;
        }
    }
    rb_link_node(&object->rb_node, rb_parent, link);
    rb_insert_color(&object->rb_node, &object_tree_root);

    list_add_tail_rcu(&object->object_list, &object_list);
out:
    write_unlock_irqrestore(&kmemleak_lock, flags);
    return object;
}

 

The function first allocates a kmemleak_object slab object via kmem_cache_alloc() and initializes its members, recording the memory address, size, current jiffies, call stack and so on (although the stack is saved here, in the early_alloc() path it is then overwritten by the stack recorded in the early log). It then inserts the object into the object_tree_root red-black tree according to its address range, and finally also links it onto the object_list list via list_add_tail_rcu(), so each object is managed through two structures at once. The red-black tree algorithm itself will be analyzed separately some other time; for now just keep it in mind.
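The lookup counterpart of this insertion, used later by find_and_get_object(), walks the same red-black tree to find the object whose [pointer, pointer + size) range covers a given address. Below is a simplified sketch along the lines of the real lookup_object() helper (locking and reference counting omitted):

static struct kmemleak_object *lookup_object_sketch(unsigned long ptr, int alias)
{
    struct rb_node *rb = object_tree_root.rb_node;

    while (rb) {
        struct kmemleak_object *object =
            rb_entry(rb, struct kmemleak_object, rb_node);

        if (ptr < object->pointer)
            rb = object->rb_node.rb_left;
        else if (object->pointer + object->size <= ptr)
            rb = object->rb_node.rb_right;
        else if (object->pointer == ptr || alias)
            /* exact start address, or any address inside the block */
            return object;
        else
            break;      /* inside a block but alias lookup not allowed */
    }
    return NULL;
}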

The definition of the tracking object structure is worth examining, as several of its fields matter for understanding the code that follows:

【file:/mm/kmemleak.c】
/*
 * Structure holding the metadata for each allocated memory block.
 * Modifications to such objects should be made while holding the
 * object->lock. Insertions or deletions from object_list, gray_list or
 * rb_node are already protected by the corresponding locks or mutex (see
 * the notes on locking above). These objects are reference-counted
 * (use_count) and freed using the RCU mechanism.
 */
struct kmemleak_object {
    spinlock_t lock;            /* lock protecting this structure */
    unsigned long flags;		/* object status flags */
    struct list_head object_list;   /* entry on the object_list list */
    struct list_head gray_list;     /* entry on the gray_list list */
    struct rb_node rb_node;         /* node in the object red-black tree */
    struct rcu_head rcu;		/* object_list lockless traversal */
    /* object usage count; object freed when use_count == 0 */
    atomic_t use_count;             /* reference count; the object is freed when it drops to 0 */
    unsigned long pointer;          /* start address of the tracked memory block */
    size_t size;                    /* size of the tracked memory block */
    /* minimum number of a pointers found before it is considered leak */
    int min_count;                  /* minimum number of pointers that must be found */
    /* the total number of pointers found pointing to this object */
    int count;                      /* number of pointers found pointing to this object */
    /* checksum for detecting modified objects */
    u32 checksum;                   /* checksum of the tracked memory contents */
    /* memory ranges to be scanned inside an object (empty for all) */
    struct hlist_head area_list;
    unsigned long trace[MAX_TRACE];   /* allocation call stack of the tracked memory */
    unsigned int trace_len;           /* depth of the call stack */
    unsigned long jiffies;		/* creation timestamp */
    pid_t pid;			/* pid of the current task */
    char comm[TASK_COMM_LEN];	/* executable name */
};

 

As the scanning algorithm described earlier suggests, tracked objects are colour-coded, mainly according to count and min_count. There are in fact three colours rather than the two mentioned there: white, grey and black.

White: the object is orphaned or does not yet have enough pointers referencing it (count < min_count);
Grey: the object is not orphaned, i.e. enough references to it have been found (count >= min_count), or it was registered with min_count == 0 and is never reported;
Black: the object has been marked to be ignored, e.g. via kmemleak_ignore(), or contains no references to other objects, such as text sections (min_count == -1); since it is not a normal dynamic allocation it is never suspected of leaking. No particular feature relies on this colour.

Note that a newly created tracking object is not classified into any colour until the next memory scan whitens it and re-evaluates it.
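The colour tests themselves reduce to simple comparisons of count against min_count; mm/kmemleak.c defines color_white()/color_gray() roughly as follows (a sketch; KMEMLEAK_GREY and KMEMLEAK_BLACK are the special min_count values 0 and -1 mentioned above):

#define KMEMLEAK_GREY   0
#define KMEMLEAK_BLACK  -1

static bool color_white(const struct kmemleak_object *object)
{
    /* not enough pointers found yet: still a leak candidate */
    return object->count != KMEMLEAK_BLACK &&
           object->count < object->min_count;
}

static bool color_gray(const struct kmemleak_object *object)
{
    /* enough references found, or min_count == 0: never reported */
    return object->min_count != KMEMLEAK_BLACK &&
           object->count >= object->min_count;
}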

Next, let us analyze kmemleak_free(), which corresponds to KMEMLEAK_FREE. Its implementation:

【file:/mm/kmemleak.c】
/**
 * kmemleak_free - unregister a previously registered object
 * @ptr:	pointer to beginning of the object
 *
 * This function is called from the kernel allocators when an object (memory
 * block) is freed (kmem_cache_free, kfree, vfree etc.).
 */
void __ref kmemleak_free(const void *ptr)
{
    pr_debug("%s(0x%p)\n", __func__, ptr);

    if (atomic_read(&kmemleak_enabled) && ptr && !IS_ERR(ptr))
        delete_object_full((unsigned long)ptr);
    else if (atomic_read(&kmemleak_early_log))
        log_early(KMEMLEAK_FREE, ptr, 0, 0);
}

 

The function is quite simple, with just two branches: if kmemleak is already enabled, it calls delete_object_full() to remove the tracking object; otherwise, if the early-log recording phase is still active, it records the operation with log_early().

delete_object_full() simply looks up the object that covers the pointer and removes it. Its sibling delete_object_part(), used by kmemleak_free_part() for partial frees, follows the same look-up-and-delete path and additionally handles whatever is left of the block, so it is the more instructive one to read:

【file:/mm/kmemleak.c】
/*
 * Look up the metadata (struct kmemleak_object) corresponding to ptr and
 * delete it. If the memory block is partially freed, the function may create
 * additional metadata for the remaining parts of the block.
 */
static void delete_object_part(unsigned long ptr, size_t size)
{
    struct kmemleak_object *object;
    unsigned long start, end;

    object = find_and_get_object(ptr, 1);
    if (!object) {
#ifdef DEBUG
        kmemleak_warn("Partially freeing unknown object at 0x%08lx "
                  "(size %zu)\n", ptr, size);
#endif
        return;
    }
    __delete_object(object);

    /*
     * Create one or two objects that may result from the memory block
     * split. Note that partial freeing is only done by free_bootmem() and
     * this happens before kmemleak_init() is called. The path below is
     * only executed during early log recording in kmemleak_init(), so
     * GFP_KERNEL is enough.
     */
    start = object->pointer;
    end = object->pointer + object->size;
    if (ptr > start)
        create_object(start, ptr - start, object->min_count,
                  GFP_KERNEL);
    if (ptr + size < end)
        create_object(ptr + size, end - ptr - size, object->min_count,
                  GFP_KERNEL);

    put_object(object);
}

 

As expected, the function first uses find_and_get_object() to look up the tracking object for the address being freed, then removes it with __delete_object(). The code that follows handles the fragmentation caused by partial frees (done by free_bootmem()): it checks where the freed range sits inside the original block and re-registers the remaining part or parts with create_object(). The final put_object() is just the clean-up after deletion: it decrements the reference count, and if use_count has reached 0 the object is queued for freeing through RCU.
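For completeness, __delete_object() essentially undoes what create_object() did; a simplified sketch of its key steps (sanity checks trimmed):

static void __delete_object(struct kmemleak_object *object)
{
    unsigned long flags;

    /* unlink from the search tree and from object_list */
    write_lock_irqsave(&kmemleak_lock, flags);
    rb_erase(&object->rb_node, &object_tree_root);
    list_del_rcu(&object->object_list);
    write_unlock_irqrestore(&kmemleak_lock, flags);

    /* mark the block as no longer allocated so the scanner skips it */
    spin_lock_irqsave(&object->lock, flags);
    object->flags &= ~OBJECT_ALLOCATED;
    spin_unlock_irqrestore(&object->lock, flags);

    /* drop the reference taken at creation time */
    put_object(object);
}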

At this point kmemleak's model for tracking memory should be fairly clear. The analysis so far has only been an extension of the kmemleak_init() path, but in the normal allocation and free paths the commonly used hooks are kmemleak_alloc() and kmemleak_free(). kmemleak_free() was analyzed above, and kmemleak_alloc() is implemented very similarly:

【file:/mm/kmemleak.c】
/**
 * kmemleak_alloc - register a newly allocated object
 * @ptr:	pointer to beginning of the object
 * @size:	size of the object
 * @min_count:	minimum number of references to this object. If during memory
 *		scanning a number of references less than @min_count is found,
 *		the object is reported as a memory leak. If @min_count is 0,
 *		the object is never reported as a leak. If @min_count is -1,
 *		the object is ignored (not scanned and not reported as a leak)
 * @gfp:	kmalloc() flags used for kmemleak internal memory allocations
 *
 * This function is called from the kernel allocators when a new object
 * (memory block) is allocated (kmem_cache_alloc, kmalloc, vmalloc etc.).
 */
void __ref kmemleak_alloc(const void *ptr, size_t size, int min_count,
              gfp_t gfp)
{
    pr_debug("%s(0x%p, %zu, %d)\n", __func__, ptr, size, min_count);

    if (atomic_read(&kmemleak_enabled) && ptr && !IS_ERR(ptr))
        create_object((unsigned long)ptr, size, min_count, gfp);
    else if (atomic_read(&kmemleak_early_log))
        log_early(KMEMLEAK_ALLOC, ptr, size, min_count);
}

 

Again there are just two branches, distinguishing the cases before and after kmemleak has been initialized.
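To put the pair into context, a hypothetical allocator that hands out memory from its own pool could register and unregister its blocks with exactly these two calls (my_pool_carve()/my_pool_return() are made-up helpers for the example; the kmemleak_alloc()/kmemleak_free() signatures are the ones shown above):

static void *my_pool_alloc(size_t size, gfp_t gfp)
{
    void *ptr = my_pool_carve(size);        /* hypothetical pool helper */

    if (ptr)
        /* min_count = 1: report a leak if no pointer to it is found */
        kmemleak_alloc(ptr, size, 1, gfp);
    return ptr;
}

static void my_pool_free(void *ptr)
{
    kmemleak_free(ptr);                     /* stop tracking first */
    my_pool_return(ptr);                    /* hypothetical pool helper */
}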

This completes the analysis of kmemleak's initialization and memory tracking; next let us see how the leak detection itself is implemented. There is actually a second initialization involved: whereas kmemleak_init() sets up the memory tracking, kmemleak_late_init() initializes the detection machinery. Its implementation:

【file:/mm/kmemleak.c】
/*
 * Late initialization function.
 */
static int __init kmemleak_late_init(void)
{
    struct dentry *dentry;

    atomic_set(&kmemleak_initialized, 1);

    if (atomic_read(&kmemleak_error)) {
        /*
         * Some error occurred and kmemleak was disabled. There is a
         * small chance that kmemleak_disable() was called immediately
         * after setting kmemleak_initialized and we may end up with
         * two clean-up threads but serialized by scan_mutex.
         */
        schedule_work(&cleanup_work);
        return -ENOMEM;
    }

    dentry = debugfs_create_file("kmemleak", S_IRUGO, NULL, NULL,
                     &kmemleak_fops);
    if (!dentry)
        pr_warning("Failed to create the debugfs kmemleak file\n");
    mutex_lock(&scan_mutex);
    start_scan_thread();
    mutex_unlock(&scan_mutex);

    pr_info("Kernel memory leak detector initialized\n");

    return 0;
}
late_initcall(kmemleak_late_init);

 

This function is registered via the late_initcall() macro, so it runs after the module_init() initcalls. It first sets the kmemleak_initialized flag, then checks whether kmemleak has hit an error; if so, as a safety measure it schedules the cleanup_work work item via schedule_work() to clean up kmemleak's state.

The cleanup_work work item is implemented as follows:

【file:/mm/kmemleak.c】
/*
 * Stop the memory scanning thread and free the kmemleak internal objects if
 * no previous scan thread (otherwise, kmemleak may still have some useful
 * information on memory leaks).
 */
static void kmemleak_do_cleanup(struct work_struct *work)
{
    struct kmemleak_object *object;
    bool cleanup = scan_thread == NULL;

    mutex_lock(&scan_mutex);
    stop_scan_thread();

    if (cleanup) {
        rcu_read_lock();
        list_for_each_entry_rcu(object, &object_list, object_list)
            delete_object_full(object->pointer);
        rcu_read_unlock();
    }
    mutex_unlock(&scan_mutex);
}

static DECLARE_WORK(cleanup_work, kmemleak_do_cleanup);

 

The body of the work item is kmemleak_do_cleanup(). It first stops the scanning thread; then, if no scan thread had ever been started (cleanup is true), it walks object_list and removes every tracked object with delete_object_full(). Otherwise the objects are left in place, since kmemleak may still hold useful information about leaks.

Back in kmemleak_late_init(): if everything is fine, it creates the "kmemleak" file in debugfs with debugfs_create_file(), registering the kmemleak_fops operations so that commands written to the file are dispatched to the right handlers, and finally creates kmemleak's kernel scanning thread via start_scan_thread().

【file:/mm/kmemleak.c】
/*
 * Start the automatic memory scanning thread. This function must be called
 * with the scan_mutex held.
 */
static void start_scan_thread(void)
{
    if (scan_thread)
        return;
    scan_thread = kthread_run(kmemleak_scan_thread, NULL, "kmemleak");
    if (IS_ERR(scan_thread)) {
        pr_warning("Failed to create the scan thread\n");
        scan_thread = NULL;
    }
}

 

From the implementation of start_scan_thread() we can see that the body of the kernel thread is kmemleak_scan_thread().

The implementation of kmemleak_scan_thread():

【file:/mm/kmemleak.c】
/*
 * Thread function performing automatic memory scanning. Unreferenced objects
 * at the end of a memory scan are reported but only the first time.
 */
static int kmemleak_scan_thread(void *arg)
{
    static int first_run = 1;

    pr_info("Automatic memory scanning thread started\n");
    set_user_nice(current, 10);

    /*
     * Wait before the first scan to allow the system to fully initialize.
     */
    if (first_run) {
        first_run = 0;
        ssleep(SECS_FIRST_SCAN);
    }

    while (!kthread_should_stop()) {
        signed long timeout = jiffies_scan_wait;

        mutex_lock(&scan_mutex);
        kmemleak_scan();
        mutex_unlock(&scan_mutex);

        /* wait before the next scan */
        while (timeout && !kthread_should_stop())
            timeout = schedule_timeout_interruptible(timeout);
    }

    pr_info("Automatic memory scanning thread ended\n");

    return 0;
}

 

The function first prints an informational message and lowers the thread's priority with set_user_nice(). On the first run it clears first_run and sleeps for SECS_FIRST_SCAN seconds so the system can finish initializing. It then enters the scanning loop: as long as kthread_should_stop() does not request termination, it runs kmemleak_scan() under scan_mutex, then waits jiffies_scan_wait via schedule_timeout_interruptible() before scanning again. The kthread_should_stop() condition becomes true when kernel code calls kthread_stop() on the thread, which is exactly what the stop_scan_thread() helper used by the debugfs interface below does.

Besides being started from kmemleak_late_init() via start_scan_thread(), the kmemleak kernel thread can also be controlled through another path: the seemingly unremarkable debugfs_create_file("kmemleak", S_IRUGO, NULL, NULL, &kmemleak_fops) call in the same initialization function, where kmemleak_fops defines how operations on the file are handled.

【file:/mm/kmemleak.c】
static const struct file_operations kmemleak_fops = {
    .owner		= THIS_MODULE,
    .open		= kmemleak_open,
    .read		= seq_read,
    .write		= kmemleak_write,
    .llseek		= seq_lseek,
    .release	= kmemleak_release,
};

 

There is one key function here, kmemleak_write():

【file:/mm/kmemleak.c】
/*
 * File write operation to configure kmemleak at run-time. The following
 * commands can be written to the /sys/kernel/debug/kmemleak file:
 *   off	- disable kmemleak (irreversible)
 *   stack=on	- enable the task stacks scanning
 *   stack=off	- disable the tasks stacks scanning
 *   scan=on	- start the automatic memory scanning thread
 *   scan=off	- stop the automatic memory scanning thread
 *   scan=...	- set the automatic memory scanning period in seconds (0 to
 *		  disable it)
 *   scan	- trigger a memory scan
 *   clear	- mark all current reported unreferenced kmemleak objects as
 *		  grey to ignore printing them
 *   dump=...	- dump information about the object found at the given address
 */
static ssize_t kmemleak_write(struct file *file, const char __user *user_buf,
                  size_t size, loff_t *ppos)
{
    char buf[64];
    int buf_size;
    int ret;

    if (!atomic_read(&kmemleak_enabled))
        return -EBUSY;

    buf_size = min(size, (sizeof(buf) - 1));
    if (strncpy_from_user(buf, user_buf, buf_size) < 0)
        return -EFAULT;
    buf[buf_size] = 0;

    ret = mutex_lock_interruptible(&scan_mutex);
    if (ret < 0)
        return ret;

    if (strncmp(buf, "off", 3) == 0)
        kmemleak_disable();
    else if (strncmp(buf, "stack=on", 8) == 0)
        kmemleak_stack_scan = 1;
    else if (strncmp(buf, "stack=off", 9) == 0)
        kmemleak_stack_scan = 0;
    else if (strncmp(buf, "scan=on", 7) == 0)
        start_scan_thread();
    else if (strncmp(buf, "scan=off", 8) == 0)
        stop_scan_thread();
    else if (strncmp(buf, "scan=", 5) == 0) {
        unsigned long secs;

        ret = kstrtoul(buf + 5, 0, &secs);
        if (ret < 0)
            goto out;
        stop_scan_thread();
        if (secs) {
            jiffies_scan_wait = msecs_to_jiffies(secs * 1000);
            start_scan_thread();
        }
    } else if (strncmp(buf, "scan", 4) == 0)
        kmemleak_scan();
    else if (strncmp(buf, "clear", 5) == 0)
        kmemleak_clear();
    else if (strncmp(buf, "dump=", 5) == 0)
        ret = dump_str_object_info(buf + 5);
    else
        ret = -EINVAL;

out:
    mutex_unlock(&scan_mutex);
    if (ret < 0)
        return ret;

    /* ignore the rest of the buffer, only one command at a time */
    *ppos += size;
    return size;
}

 

This function does exactly one thing: whenever a command string is written to /sys/kernel/debug/kmemleak it is invoked, parses the command and reacts accordingly, which is where start_scan_thread() and stop_scan_thread() come in.
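The stop_scan_thread() used by the scan=off command is the counterpart of start_scan_thread() shown earlier; its implementation is essentially the following (a sketch along the lines of the real helper), and the kthread_stop() call here is what makes kthread_should_stop() in kmemleak_scan_thread() return true:

/*
 * Stop the automatic memory scanning thread.
 * Like start_scan_thread(), this must be called with scan_mutex held.
 */
static void stop_scan_thread(void)
{
    if (scan_thread) {
        kthread_stop(scan_thread);
        scan_thread = NULL;
    }
}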

Having digressed again, let us wrap up the analysis with kmemleak_scan() and see how the leak scan itself is implemented.

【file:/mm/kmemleak.c】
/*
 * Scan data sections and all the referenced memory blocks allocated via the
 * kernel's standard allocators. This function must be called with the
 * scan_mutex held.
 */
static void kmemleak_scan(void)
{
    unsigned long flags;
    struct kmemleak_object *object;
    int i;
    int new_leaks = 0;

    jiffies_last_scan = jiffies;

    /* prepare the kmemleak_object's */
    rcu_read_lock();
    list_for_each_entry_rcu(object, &object_list, object_list) {
        spin_lock_irqsave(&object->lock, flags);
#ifdef DEBUG
        /*
         * With a few exceptions there should be a maximum of
         * 1 reference to any object at this point.
         */
        if (atomic_read(&object->use_count) > 1) {
            pr_debug("object->use_count = %d\n",
                 atomic_read(&object->use_count));
            dump_object_info(object);
        }
#endif
        /* reset the reference count (whiten the object) */
        object->count = 0;
        if (color_gray(object) && get_object(object))
            list_add_tail(&object->gray_list, &gray_list);

        spin_unlock_irqrestore(&object->lock, flags);
    }
    rcu_read_unlock();

    /* data/bss scanning */
    scan_block(_sdata, _edata, NULL, 1);
    scan_block(__bss_start, __bss_stop, NULL, 1);

#ifdef CONFIG_SMP
    /* per-cpu sections scanning */
    for_each_possible_cpu(i)
        scan_block(__per_cpu_start + per_cpu_offset(i),
               __per_cpu_end + per_cpu_offset(i), NULL, 1);
#endif

    /*
     * Struct page scanning for each node.
     */
    lock_memory_hotplug();
    for_each_online_node(i) {
        unsigned long start_pfn = node_start_pfn(i);
        unsigned long end_pfn = node_end_pfn(i);
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
            struct page *page;

            if (!pfn_valid(pfn))
                continue;
            page = pfn_to_page(pfn);
            /* only scan if page is in use */
            if (page_count(page) == 0)
                continue;
            scan_block(page, page + 1, NULL, 1);
        }
    }
    unlock_memory_hotplug();

    /*
     * Scanning the task stacks (may introduce false negatives).
     */
    if (kmemleak_stack_scan) {
        struct task_struct *p, *g;

        read_lock(&tasklist_lock);
        do_each_thread(g, p) {
            scan_block(task_stack_page(p), task_stack_page(p) +
                   THREAD_SIZE, NULL, 0);
        } while_each_thread(g, p);
        read_unlock(&tasklist_lock);
    }

    /*
     * Scan the objects already referenced from the sections scanned
     * above.
     */
    scan_gray_list();

    /*
     * Check for new or unreferenced objects modified since the previous
     * scan and color them gray until the next scan.
     */
    rcu_read_lock();
    list_for_each_entry_rcu(object, &object_list, object_list) {
        spin_lock_irqsave(&object->lock, flags);
        if (color_white(object) && (object->flags & OBJECT_ALLOCATED)
            && update_checksum(object) && get_object(object)) {
            /* color it gray temporarily */
            object->count = object->min_count;
            list_add_tail(&object->gray_list, &gray_list);
        }
        spin_unlock_irqrestore(&object->lock, flags);
    }
    rcu_read_unlock();

    /*
     * Re-scan the gray list for modified unreferenced objects.
     */
    scan_gray_list();

    /*
     * If scanning was stopped do not report any new unreferenced objects.
     */
    if (scan_should_stop())
        return;

    /*
     * Scanning result reporting.
     */
    rcu_read_lock();
    list_for_each_entry_rcu(object, &object_list, object_list) {
        spin_lock_irqsave(&object->lock, flags);
        if (unreferenced_object(object) &&
            !(object->flags & OBJECT_REPORTED)) {
            object->flags |= OBJECT_REPORTED;
            new_leaks++;
        }
        spin_unlock_irqrestore(&object->lock, flags);
    }
    rcu_read_unlock();

    if (new_leaks)
        pr_info("%d new suspected memory leaks (see "
            "/sys/kernel/debug/kmemleak)\n", new_leaks);

}

 

Before the scan proper, kmemleak does some preparation: it walks object_list and resets each object's count to 0, whitening it; if color_gray() still reports the object as grey (which at this point means min_count == 0) and get_object() successfully takes a reference, the object is added to gray_list.

The real scan then starts. First the data and bss sections are scanned (code below), delimited by _sdata/_edata and __bss_start/__bss_stop respectively; both sections hold the program's global and static variables, the only difference being whether they are initialized;

    /* data/bss scanning */
    scan_block(_sdata, _edata, NULL, 1);
    scan_block(__bss_start, __bss_stop, NULL, 1);

Next the per-CPU areas are scanned (code below), with for_each_possible_cpu() iterating over each CPU's section;

    /* per-cpu sections scanning */
    for_each_possible_cpu(i)
        scan_block(__per_cpu_start + per_cpu_offset(i),
               __per_cpu_end + per_cpu_offset(i), NULL, 1);

Then the struct page descriptors of each online node's in-use page frames are scanned (code below), since they may also contain references to dynamically allocated memory;

    /*
     * Struct page scanning for each node.
     */
    lock_memory_hotplug();
    for_each_online_node(i) {
        unsigned long start_pfn = node_start_pfn(i);
        unsigned long end_pfn = node_end_pfn(i);
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
            struct page *page;

            if (!pfn_valid(pfn))
                continue;
            page = pfn_to_page(pfn);
            /* only scan if page is in use */
            if (page_count(page) == 0)
                continue;
            scan_block(page, page + 1, NULL, 1);
        }
    }
    unlock_memory_hotplug();

After that the kernel stacks of tasks are scanned (code below); the do_each_thread()/while_each_thread() pair iterates over every thread (and thus every process), since each thread has its own kernel stack;

    /*
     * Scanning the task stacks (may introduce false negatives).
     */
    if (kmemleak_stack_scan) {
        struct task_struct *p, *g;

        read_lock(&tasklist_lock);
        do_each_thread(g, p) {
            scan_block(task_stack_page(p), task_stack_page(p) +
                   THREAD_SIZE, NULL, 0);
        } while_each_thread(g, p);
        read_unlock(&tasklist_lock);
    }

Finally, scan_gray_list() scans inside the tracked memory blocks that have been referenced so far; its code is not quoted in full here (a simplified sketch follows), but it essentially walks gray_list and calls scan_object() on each entry. Right after that, objects that are new or were modified since the previous scan and are still unreferenced are picked up and temporarily coloured grey, and scan_gray_list() is run once more.
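A simplified sketch of scan_gray_list() (reference-count details and warnings trimmed): it drains gray_list entry by entry, and since scan_object() may itself append newly greyed objects to the tail of the list, the loop naturally keeps going until the list is empty:

static void scan_gray_list(void)
{
    struct kmemleak_object *object, *tmp;

    object = list_entry(gray_list.next, typeof(*object), gray_list);
    while (&object->gray_list != &gray_list) {
        cond_resched();

        /* scanning this object may add new entries to gray_list */
        if (!scan_should_stop())
            scan_object(object);

        tmp = list_entry(object->gray_list.next, typeof(*object),
                         gray_list);

        /* remove the object from the list and drop its reference */
        list_del(&object->gray_list);
        put_object(object);

        object = tmp;
    }
}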

Once scanning is done, scan_should_stop() is checked: if scanning was stopped, no new results are reported. Otherwise object_list is walked to find objects that are still white; the memory they track is suspected of leaking, and each newly found one is flagged with OBJECT_REPORTED. Nothing is written out at this point; the results are exposed through the file operations, and the report is generated when the kmemleak file is opened via kmemleak_open().
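The test deciding whether an object is reported combines its colour, its allocated state and its age; the real unreferenced_object() in mm/kmemleak.c is along the lines of the following sketch, where jiffies_min_age (initialized in kmemleak_init() above) prevents very recent allocations from being reported prematurely:

static bool unreferenced_object(struct kmemleak_object *object)
{
    /* still white, still allocated, and older than the minimum age */
    return (color_white(object) && (object->flags & OBJECT_ALLOCATED)) &&
           time_before_eq(object->jiffies + jiffies_min_age,
                          jiffies_last_scan);
}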

Before finishing, let us look at the implementation of the block scanning function scan_block():

【file:/mm/kmemleak.c】
/*
 * Scan a memory block (exclusive range) for valid pointers and add those
 * found to the gray list.
 */
static void scan_block(void *_start, void *_end,
               struct kmemleak_object *scanned, int allow_resched)
{
    unsigned long *ptr;
    unsigned long *start = PTR_ALIGN(_start, BYTES_PER_POINTER);
    unsigned long *end = _end - (BYTES_PER_POINTER - 1);

    for (ptr = start; ptr < end; ptr++) {
        struct kmemleak_object *object;
        unsigned long flags;
        unsigned long pointer;

        if (allow_resched)
            cond_resched();
        if (scan_should_stop())
            break;

        /* don't scan uninitialized memory */
        if (!kmemcheck_is_obj_initialized((unsigned long)ptr,
                          BYTES_PER_POINTER))
            continue;

        pointer = *ptr;

        object = find_and_get_object(pointer, 1);
        if (!object)
            continue;
        if (object == scanned) {
            /* self referenced, ignore */
            put_object(object);
            continue;
        }

        /*
         * Avoid the lockdep recursive warning on object->lock being
         * previously acquired in scan_object(). These locks are
         * enclosed by scan_mutex.
         */
        spin_lock_irqsave_nested(&object->lock, flags,
                     SINGLE_DEPTH_NESTING);
        if (!color_white(object)) {
            /* non-orphan, ignored or new */
            spin_unlock_irqrestore(&object->lock, flags);
            put_object(object);
            continue;
        }

        /*
         * Increase the object's reference count (number of pointers
         * to the memory block). If this count reaches the required
         * minimum, the object's color will become gray and it will be
         * added to the gray_list.
         */
        object->count++;
        if (color_gray(object)) {
            list_add_tail(&object->gray_list, &gray_list);
            spin_unlock_irqrestore(&object->lock, flags);
            continue;
        }

        spin_unlock_irqrestore(&object->lock, flags);
        put_object(object);
    }
}

 

The function is a for loop that walks memory from _start to _end. Inside the loop it first calls cond_resched() if rescheduling is allowed, then checks scan_should_stop() to see whether scanning has been aborted, and uses kmemcheck_is_obj_initialized() to avoid reading uninitialized memory. Each aligned word is then read as a potential pointer value and looked up with find_and_get_object(): if no tracked object covers that value, the loop simply moves on; if the object found is the one currently being scanned, it is a self-reference and is ignored; if the object is not white (non-orphan, ignored or new), it is skipped as well. Otherwise count is incremented and, once the object turns grey according to count and min_count, it is appended to the gray_list.