http://blog.chinaunix.net/uid-26859697-id-4848272.html

The previous posts analyzed the memblock allocator, the setup of the kernel page tables, and the construction of the memory management framework. All of that is initialized inside setup_arch() for x86, so it is tailored to the architecture and carries clear processor-specific traits. The initialization that start_kernel() performs next, by contrast, belongs to Linux's generic memory management framework.

build_all_zonelists() initializes the zone lists of the memory nodes used by the page allocator; it prepares the ground for the memory management algorithm (the buddy allocator). Its implementation:


【file:/mm/page_alloc.c】
/*
 * Called with zonelists_mutex held always
 * unless system_state == SYSTEM_BOOTING.
 */
void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
{
    set_zonelist_order();

    if (system_state == SYSTEM_BOOTING) {
        __build_all_zonelists(NULL);
        mminit_verify_zonelist();
        cpuset_init_current_mems_allowed();
    } else {
#ifdef CONFIG_MEMORY_HOTPLUG
        if (zone)
            setup_zone_pageset(zone);
#endif
        /* we have to stop all cpus to guarantee there is no user
           of zonelist */
        stop_machine(__build_all_zonelists, pgdat, NULL);
        /* cpuset refresh routine should be here */
    }
    vm_total_pages = nr_free_pagecache_pages();
    /*
     * Disable grouping by mobility if the number of pages in the
     * system is too low to allow the mechanism to work. It would be
     * more accurate, but expensive to check per-zone. This check is
     * made on memory-hotadd so a system can start with mobility
     * disabled and enable it later
     */
    if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
        page_group_by_mobility_disabled = 1;
    else
        page_group_by_mobility_disabled = 0;

    printk("Built %i zonelists in %s order, mobility grouping %s. "
        "Total pages: %ld\n",
            nr_online_nodes,
            zonelist_order_name[current_zonelist_order],
            page_group_by_mobility_disabled ? "off" : "on",
            vm_total_pages);
#ifdef CONFIG_NUMA
    printk("Policy zone: %s\n", zone_names[policy_zone]);
#endif
}
First, look at set_zonelist_order():


【file:/mm/page_alloc.c】
static void set_zonelist_order(void)
{
    current_zonelist_order = ZONELIST_ORDER_ZONE;
}
This sets the ordering of the zonelist. ZONELIST_ORDER_ZONE denotes the order (-zonetype, [node] distance), so zone type is the primary key; ZONELIST_ORDER_NODE denotes the order ([node] distance, -zonetype), so node distance is the primary key. The distinction only matters on NUMA; in a non-NUMA environment the two are identical.
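To make the two orderings concrete, here is a minimal sketch that sorts a few zone records by each key. The struct zone_entry type, its fields, and the helper names are hypothetical and exist only for this illustration; this is not how the kernel constructs its zonelists.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for illustration only; not a kernel structure. */
struct zone_entry {
    int zone_type;  /* larger value = higher zone (e.g. 2 = HIGHMEM) */
    int distance;   /* distance of the owning node from the local node */
};

/* ZONELIST_ORDER_ZONE: key is (-zonetype, distance). */
static int cmp_zone_order(const void *a, const void *b)
{
    const struct zone_entry *x = a, *y = b;

    if (x->zone_type != y->zone_type)
        return y->zone_type - x->zone_type;  /* higher zone type first */
    return x->distance - y->distance;        /* then the nearer node */
}

/* ZONELIST_ORDER_NODE: key is (distance, -zonetype). */
static int cmp_node_order(const void *a, const void *b)
{
    const struct zone_entry *x = a, *y = b;

    if (x->distance != y->distance)
        return x->distance - y->distance;    /* nearer node first */
    return y->zone_type - x->zone_type;      /* then higher zone type */
}

static void sort_zone_order(struct zone_entry *v, size_t n)
{
    assert(v != NULL);
    qsort(v, n, sizeof(*v), cmp_zone_order);
}

static void sort_node_order(struct zone_entry *v, size_t n)
{
    assert(v != NULL);
    qsort(v, n, sizeof(*v), cmp_node_order);
}
```

With a single node every entry has the same distance, so both comparators reduce to "higher zone type first", which matches the statement that the two orders do not differ without NUMA.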

During initialization system_state is SYSTEM_BOOTING (the system only enters SYSTEM_RUNNING after start_kernel() has reached its last function, rest_init()), so the next function called is __build_all_zonelists():


【file:/mm/page_alloc.c】
/* return values int ....just for stop_machine() */
static int __build_all_zonelists(void *data)
{
    int nid;
    int cpu;
    pg_data_t *self = data;

#ifdef CONFIG_NUMA
    memset(node_load, 0, sizeof(node_load));
#endif

    if (self && !node_online(self->node_id)) {
        build_zonelists(self);
        build_zonelist_cache(self);
    }

    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        build_zonelists(pgdat);
        build_zonelist_cache(pgdat);
    }

    /*
     * Initialize the boot_pagesets that are going to be used
     * for bootstrapping processors. The real pagesets for
     * each zone will be allocated later when the per cpu
     * allocator is available.
     *
     * boot_pagesets are used also for bootstrapping offline
     * cpus if the system is already booted because the pagesets
     * are needed to initialize allocators on a specific cpu too.
     * F.e. the percpu allocator needs the page allocator which
     * needs the percpu allocator in order to allocate its pagesets
     * (a chicken-egg dilemma).
     */
    for_each_possible_cpu(cpu) {
        setup_pageset(&per_cpu(boot_pageset, cpu), 0);

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
        /*
         * We now know the "local memory node" for each node--
         * i.e., the node of the first zone in the generic zonelist.
         * Set up numa_mem percpu variable for on-line cpus. During
         * boot, only the boot cpu should be on-line; we'll init the
         * secondary cpus' numa_mem as they come on-line. During
         * node/memory hotplug, we'll fixup all on-line cpus.
         */
        if (cpu_online(cpu))
            set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
#endif
    }

    return 0;
}
First analyze build_zonelists() and build_zonelist_cache(), which this function calls, starting with build_zonelists():


【file:/mm/page_alloc.c】
static void build_zonelists(pg_data_t *pgdat)
{
    int node, local_node;
    enum zone_type j;
    struct zonelist *zonelist;

    local_node = pgdat->node_id;

    zonelist = &pgdat->node_zonelists[0];
    j = build_zonelists_node(pgdat, zonelist, 0);

    /*
     * Now we build the zonelist so that it contains the zones
     * of all the other nodes.
     * We don't want to pressure a particular node, so when
     * building the zones for node N, we make sure that the
     * zones coming right after the local ones are those from
     * node N+1 (modulo N)
     */
    for (node = local_node + 1; node < MAX_NUMNODES; node++) {
        if (!node_online(node))
            continue;
        j = build_zonelists_node(NODE_DATA(node), zonelist, j);
    }
    for (node = 0; node < local_node; node++) {
        if (!node_online(node))
            continue;
        j = build_zonelists_node(NODE_DATA(node), zonelist, j);
    }

    zonelist->_zonerefs[j].zone = NULL;
    zonelist->_zonerefs[j].zone_idx = 0;
}
build_zonelists_node() is implemented as:


【file:/mm/page_alloc.c】
/*
 * Builds allocation fallback zone lists.
 *
 * Add all populated zones of a node to the zonelist.
 */
static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
                int nr_zones)
{
    struct zone *zone;
    enum zone_type zone_type = MAX_NR_ZONES;

    do {
        zone_type--;
        zone = pgdat->node_zones + zone_type;
        if (populated_zone(zone)) {
            zoneref_set_zone(zone,
                &zonelist->_zonerefs[nr_zones++]);
            check_highest_zone(zone_type);
        }
    } while (zone_type);

    return nr_zones;
}
populated_zone() checks whether the zone's present_pages member is nonzero. If it is, the zone contains pages, and zoneref_set_zone() records it in the zonelist's _zonerefs array. check_highest_zone() is an empty function when NUMA is not enabled. build_zonelists_node() therefore fills _zonerefs in the order ZONE_HIGHMEM -> ZONE_NORMAL -> ZONE_DMA, that is, from the cheapest zone to allocate from to the most expensive. This is the fallback order used when allocating memory.
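The loop can be tried outside the kernel with simplified stand-in types. In this minimal sketch the demo_ names, the three-zone enum, and the plain pointer array used in place of _zonerefs are assumptions made for illustration; only the walk from the highest zone type downward and the present_pages test are taken from the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; all names are illustrative. */
enum demo_zone_type {
    DEMO_ZONE_DMA,
    DEMO_ZONE_NORMAL,
    DEMO_ZONE_HIGHMEM,
    DEMO_MAX_NR_ZONES
};

struct demo_zone {
    unsigned long present_pages;
};

struct demo_node {
    struct demo_zone node_zones[DEMO_MAX_NR_ZONES];
};

/*
 * Append the populated zones of one node to refs[], highest zone type
 * first, and return the new number of entries.
 */
static int demo_build_zonelists_node(struct demo_node *node,
                                     struct demo_zone **refs, int nr_zones)
{
    int zone_type = DEMO_MAX_NR_ZONES;

    assert(node != NULL && refs != NULL);
    do {
        struct demo_zone *zone;

        zone_type--;
        zone = &node->node_zones[zone_type];
        if (zone->present_pages != 0)   /* mirrors populated_zone() */
            refs[nr_zones++] = zone;
    } while (zone_type);

    return nr_zones;
}
```

For a node whose NORMAL zone is empty, the result holds HIGHMEM followed by DMA: empty zones are skipped and never appear in the fallback list.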

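Back in build_zonelists(): the code first lays out the local node's zones in fallback order, then appends the zones of the online nodes with IDs above the local node, and finally those of the online nodes with IDs below it. The list thus wraps around starting from node N+1, which, as the code comment says, avoids putting all the fallback pressure on one particular node.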
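The node visiting order is easy to check in isolation. The sketch below assumes a plain int array of online flags in place of the kernel's node_online() and uses a hypothetical helper name; it reproduces only the two loops from build_zonelists().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill order[] with node IDs in the order the non-NUMA build_zonelists()
 * visits them: the local node, then online nodes with higher IDs, then
 * online nodes with lower IDs. Returns the number of IDs written.
 * online[] is a hypothetical stand-in for node_online().
 */
static int demo_node_visit_order(int local_node, const int *online,
                                 int max_nodes, int *order)
{
    int n = 0;
    int node;

    assert(online != NULL && order != NULL);
    assert(local_node >= 0 && local_node < max_nodes);

    order[n++] = local_node;
    for (node = local_node + 1; node < max_nodes; node++)
        if (online[node])
            order[n++] = node;
    for (node = 0; node < local_node; node++)
        if (online[node])
            order[n++] = node;

    return n;
}
```

With nodes 0, 1, and 3 online and node 1 as the local node, the visiting order is 1, 3, 0.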

With build_zonelists() covered, return to __build_all_zonelists() and look at build_zonelist_cache():


【file:/mm/page_alloc.c】
/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
static void build_zonelist_cache(pg_data_t *pgdat)
{
    pgdat->node_zonelists[0].zlcache_ptr = NULL;
}
This function depends on CONFIG_NUMA and sets up the zlcache-related members. Since that option is not enabled here, it simply sets the pointer to NULL.

Because build_all_zonelists() calls __build_all_zonelists() with a NULL argument, the part of __build_all_zonelists() that actually runs is:

    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        build_zonelists(pgdat);
        build_zonelist_cache(pgdat);
    }

This mainly sets up the allocation fallback order of the zones within each memory node.

__build_all_zonelists() then continues with:

    for_each_possible_cpu(cpu) {
        setup_pageset(&per_cpu(boot_pageset, cpu), 0);

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
        if (cpu_online(cpu))
            set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
#endif
    }

CONFIG_HAVE_MEMORYLESS_NODES is not configured here, so the focus is on setup_pageset():


【file:/mm/page_alloc.c】
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
    pageset_init(p);
    pageset_set_batch(p, batch);
}
The two functions that setup_pageset() calls are fairly simple, so a quick pass is enough. First:


【file:/mm/page_alloc.c】
static void pageset_init(struct per_cpu_pageset *p)
{
    struct per_cpu_pages *pcp;
    int migratetype;

    memset(p, 0, sizeof(*p));

    pcp = &p->pcp;
    pcp->count = 0;
    for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
        INIT_LIST_HEAD(&pcp->lists[migratetype]);
}
pageset_init() mainly initializes the struct per_cpu_pages structure, while pageset_set_batch() configures it. pageset_set_batch() is implemented as:


【file:/mm/page_alloc.c】
/*
 * pcp->high and pcp->batch values are related and dependent on one another:
 * ->batch must never be higher then ->high.
 * The following function updates them in a safe manner without read side
 * locking.
 *
 * Any new users of pcp->batch and pcp->high should ensure they can cope with
 * those fields changing asynchronously (acording the the above rule).
 *
 * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
 * outside of boot time (or some other assurance that no concurrent updaters
 * exist).
 */
static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
        unsigned long batch)
{
    /* start with a fail safe value for batch */
    pcp->batch = 1;
    smp_wmb();

    /* Update high, then batch, in order */
    pcp->high = high;
    smp_wmb();

    pcp->batch = batch;
}

/* a companion to pageset_set_high() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
    pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
}
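For the boot pagesets, __build_all_zonelists() passes batch = 0, so it is worth working through the arithmetic. The following is a minimal user-space sketch: struct demo_pcp and the demo_ helpers are hypothetical, and the memory barriers are left out because the example is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the two fields discussed here. */
struct demo_pcp {
    unsigned long high;
    unsigned long batch;
};

static unsigned long demo_max_ul(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}

/*
 * Same arithmetic as pageset_set_batch(): high = 6 * batch and
 * batch is clamped to at least 1. The smp_wmb() barriers of the
 * kernel version are omitted in this single-threaded illustration.
 */
static void demo_pageset_set_batch(struct demo_pcp *pcp, unsigned long batch)
{
    assert(pcp != NULL);
    pcp->batch = 1;                       /* fail-safe value first */
    pcp->high = 6 * batch;
    pcp->batch = demo_max_ul(1UL, batch);
}
```

With batch = 0 the result is high = 0 and batch = 1, which suggests that the boot pagesets are set up to cache essentially nothing until the real per-zone pagesets replace them. A batch of 31, for comparison, gives high = 186.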
The parameter p of setup_pageset() is a pointer to struct per_cpu_pageset, the structure each zone uses to manage its per-CPU page cache. The cache holds some pre-allocated pages that serve single-page requests from the local CPU. The pcp member, of type struct per_cpu_pages, does the actual page management. Originally each management structure had an array of two pcp members, whose two queues handled cold and hot pages separately. The 3.14.12 version analyzed here has merged them and manages hot and cold pages together: hot pages sit at the front of the queue and cold pages at the back. That is enough for now; the details come later with the buddy algorithm.
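The "hot at the front, cold at the back" arrangement can be modelled with one doubly linked list. This is a minimal sketch under stated assumptions: the demo_ types and functions are hypothetical, the list is a simplified imitation of the kernel's circular list, and draining, counting, and migrate types are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list; all names are illustrative. */
struct demo_list {
    struct demo_list *prev, *next;
};

struct demo_page {
    struct demo_list lru;
    int id;
};

static void demo_list_init(struct demo_list *head)
{
    head->prev = head->next = head;
}

static void demo_insert(struct demo_list *item,
                        struct demo_list *prev, struct demo_list *next)
{
    next->prev = item;
    item->next = next;
    item->prev = prev;
    prev->next = item;
}

/* Return a page to the list: hot pages at the head, cold at the tail. */
static void demo_pcp_free(struct demo_list *head, struct demo_page *page,
                          int cold)
{
    if (cold)
        demo_insert(&page->lru, head->prev, head);
    else
        demo_insert(&page->lru, head, head->next);
}

/* Take a page: hot requests from the head, cold requests from the tail. */
static struct demo_page *demo_pcp_alloc(struct demo_list *head, int cold)
{
    struct demo_list *entry;

    if (head->next == head)
        return NULL;                      /* list is empty */
    entry = cold ? head->prev : head->next;
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    return (struct demo_page *)((char *)entry -
                                offsetof(struct demo_page, lru));
}
```

A single list serves both kinds of request: the most recently freed hot page is handed out first to callers that want a cache-warm page, while cold requests consume from the opposite end.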

At this point it is clear that __build_all_zonelists() is the memory management framework preparing for the page management algorithm that follows: it lays out the allocation order of the zones and initializes hot/cold page management.

Finally, return to build_all_zonelists(). Since the memory initialization debugging option CONFIG_DEBUG_MEMORY_INIT is not enabled, mminit_verify_zonelist() is an empty function.

With CONFIG_CPUSETS enabled, cpuset_init_current_mems_allowed() is implemented as follows:


【file:/kernel/cpuset.c】
void cpuset_init_current_mems_allowed(void)
{
    nodes_setall(current->mems_allowed);
}
Here current points to the task_struct of the currently running task. Its mems_allowed member is what the cpuset mechanism, which controls the CPUs and memory nodes available to the tasks in a cgroup, uses to record the memory nodes this task may allocate from. The member is of type nodemask_t:


【file:/include/linux/nodemask.h】
typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
This structure simply defines a bitmap in which each bit corresponds to one memory node; a bit set to 1 means that node's memory may be used. nodes_setall() sets every bit in the bitmap to 1.
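A user-space imitation of such a node bitmap might look like the sketch below. The DEMO_ macros, the demo_nodemask_t type, the helper names, and the value 64 for the node limit are all assumptions made for illustration; the kernel's DECLARE_BITMAP and nodes_setall are not reproduced here.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define DEMO_MAX_NUMNODES   64   /* hypothetical node limit */
#define DEMO_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n) \
    (((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* One bit per memory node, packed into an array of unsigned long. */
typedef struct {
    unsigned long bits[DEMO_BITS_TO_LONGS(DEMO_MAX_NUMNODES)];
} demo_nodemask_t;

static void demo_nodes_clear(demo_nodemask_t *mask)
{
    memset(mask->bits, 0, sizeof(mask->bits));
}

/* Mark every node below DEMO_MAX_NUMNODES as allowed. */
static void demo_nodes_setall(demo_nodemask_t *mask)
{
    int node;

    demo_nodes_clear(mask);
    for (node = 0; node < DEMO_MAX_NUMNODES; node++)
        mask->bits[node / DEMO_BITS_PER_LONG] |=
            1UL << (node % DEMO_BITS_PER_LONG);
}

static int demo_node_isset(int node, const demo_nodemask_t *mask)
{
    assert(node >= 0 && node < DEMO_MAX_NUMNODES);
    return (int)((mask->bits[node / DEMO_BITS_PER_LONG] >>
                  (node % DEMO_BITS_PER_LONG)) & 1UL);
}
```

After demo_nodes_setall(), demo_node_isset() returns 1 for every valid node ID, which corresponds to a task that is allowed to allocate from all nodes during boot.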

Lastly, look at the implementation of nr_free_pagecache_pages(), called from build_all_zonelists():


【file:/mm/page_alloc.c】
/**
 * nr_free_pagecache_pages - count number of pages beyond high watermark
 *
 * nr_free_pagecache_pages() counts the number of pages which are beyond the
 * high watermark within all zones.
 */
unsigned long nr_free_pagecache_pages(void)
{
    return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
}
nr_free_zone_pages(), which it calls, is implemented as:


【file:/mm/page_alloc.c】
/**
 * nr_free_zone_pages - count number of pages beyond high watermark
 * @offset: The zone index of the highest zone
 *
 * nr_free_zone_pages() counts the number of counts pages which are beyond the
 * high watermark within all zones at or below a given zone index. For each
 * zone, the number of pages is calculated as:
 * managed_pages - high_pages
 */
static unsigned long nr_free_zone_pages(int offset)
{
    struct zoneref *z;
    struct zone *zone;

    /* Just pick one node, since fallback list is circular */
    unsigned long sum = 0;

    struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);

    for_each_zone_zonelist(zone, z, zonelist, offset) {
        unsigned long size = zone->managed_pages;
        unsigned long high = high_wmark_pages(zone);
        if (size > high)
            sum += size - high;
    }

    return sum;
}
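nr_free_zone_pages() walks the zones in the zonelist at or below the given zone index and, for each zone, adds the number of managed pages that lie above the high watermark (managed_pages minus the high watermark). In effect it counts the pages across all zones that are available for allocation.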
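The summation can be reproduced with a small stand-alone function. In this sketch struct demo_zone_stat and the function name are hypothetical, and a flat array replaces the zonelist walk; only the "managed pages above the high watermark" rule comes from the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone numbers; not the kernel's struct zone. */
struct demo_zone_stat {
    unsigned long managed_pages;
    unsigned long high_wmark;
};

/*
 * Sum, over all given zones, the pages above the high watermark.
 * Zones at or below their watermark contribute nothing.
 */
static unsigned long demo_nr_free_zone_pages(const struct demo_zone_stat *zones,
                                             size_t nr_zones)
{
    unsigned long sum = 0;
    size_t i;

    assert(zones != NULL || nr_zones == 0);
    for (i = 0; i < nr_zones; i++) {
        unsigned long size = zones[i].managed_pages;
        unsigned long high = zones[i].high_wmark;

        if (size > high)
            sum += size - high;
    }
    return sum;
}
```

For three zones with (managed, high) pairs of (1000, 100), (50, 80), and (4000, 500), the result is 900 + 0 + 3500 = 4400 pages. The size > high check keeps the second zone from producing an unsigned underflow.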

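Next, build_all_zonelists() checks the number of page frames in the system to decide whether to enable mobility grouping, a mechanism that reduces fragmentation when large blocks are allocated. It is normally enabled only when there is enough memory; otherwise it adds overhead and hurts performance. pageblock_nr_pages is the number of pages in one pageblock, which here corresponds to the largest-order block of the buddy system.

At this point, the memory management framework is essentially ready.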
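The threshold test itself is a single comparison, shown here as a minimal sketch. The function name is hypothetical, and the concrete values in the note below (1024 pages per pageblock, 5 migrate types, 4 KiB pages) are assumed example values; the real ones depend on the kernel configuration.

```c
#include <assert.h>

/*
 * Returns 1 when mobility grouping would be disabled, following the
 * comparison in build_all_zonelists():
 *   vm_total_pages < pageblock_nr_pages * MIGRATE_TYPES
 */
static int demo_mobility_disabled(unsigned long vm_total_pages,
                                  unsigned long pageblock_nr_pages,
                                  unsigned long migrate_types)
{
    assert(pageblock_nr_pages > 0 && migrate_types > 0);
    return vm_total_pages < pageblock_nr_pages * migrate_types;
}
```

Under the assumed values the threshold is 1024 * 5 = 5120 pages, or 20 MiB with 4 KiB pages. A system with 5119 countable pages would run with grouping off, and one with 5120 or more would have it on.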