
TCMalloc Source Code Study (3): Small Object Allocation

 Frank__Library 2016-08-11

Thread-Local Cache

The thread-local cache corresponds to the class ThreadCache, with one instance per thread. It is initialized in the static function CreateCacheIfNecessary, which runs the first time a thread requests memory. The call stack at that point is:

1 tcmalloc::ThreadCache::CreateCacheIfNecessary()
2 tcmalloc::ThreadCache::GetCache()
3 do_malloc_no_errno(unsigned int size)
4 do_malloc(unsigned int size)
5 do_malloc_or_cpp_alloc(unsigned int size)
6 tc_malloc(unsigned int size)

The code of CreateCacheIfNecessary is as follows:

ThreadCache* ThreadCache::CreateCacheIfNecessary() {
  // Initialize per-thread data if necessary
  ThreadCache* heap = NULL;
  {
    SpinLockHolder h(Static::pageheap_lock());
    // On some old glibc's, and on freebsd's libc (as of freebsd 8.1),
    // calling pthread routines (even pthread_self) too early could
    // cause a segfault. Since we can call pthreads quite early, we
    // have to protect against that in such situations by making a
    // 'fake' pthread. This is not ideal since it doesn't work well
    // when linking tcmalloc statically with apps that create threads
    // before main, so we only do it if we have to.
#ifdef PTHREADS_CRASHES_IF_RUN_TOO_EARLY
    pthread_t me;
    if (!tsd_inited_) {
      memset(&me, 0, sizeof(me));
    } else {
      me = pthread_self();
    }
#else
    const pthread_t me = pthread_self();
#endif

    // This may be a recursive malloc call from pthread_setspecific()
    // In that case, the heap for this thread has already been created
    // and added to the linked list. So we search for that first.
    for (ThreadCache* h = thread_heaps_; h != NULL; h = h->next_) {
      if (h->tid_ == me) {
        heap = h;
        break;
      }
    }

    if (heap == NULL) heap = NewHeap(me);
  }

  // We call pthread_setspecific() outside the lock because it may
  // call malloc() recursively. We check for the recursive call using
  // the "in_setspecific_" flag so that we can avoid calling
  // pthread_setspecific() if we are already inside pthread_setspecific().
  if (!heap->in_setspecific_ && tsd_inited_) {
    heap->in_setspecific_ = true;
    perftools_pthread_setspecific(heap_key_, heap);
#ifdef HAVE_TLS
    // Also keep a copy in __thread for faster retrieval
    threadlocal_data_.heap = heap;
    SetMinSizeForSlowPath(kMaxSize + 1);
#endif
    heap->in_setspecific_ = false;
  }
  return heap;
}

The code is well commented. In short, the ThreadCache instances of all threads are strung together in a linked list: thread_heaps_ points to the list head, and NewHeap inserts new nodes at the front. If the platform supports TLS (for example Windows), the heap pointer is also kept in a TLS variable, which makes fetching the current thread's ThreadCache faster than going through pthread_getspecific().
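As an illustration of that fast path, here is a minimal sketch (not the gperftools code; tls_heap and GetCacheFast are hypothetical names) contrasting the TLS load with the pthread TSD lookup:

#include <pthread.h>

// Hypothetical sketch: the two ways of fetching the current thread's cache.
// With compiler-supported TLS the lookup is a single memory load; without
// it, every lookup goes through the pthread thread-specific-data machinery.
struct ThreadCache;                        // defined elsewhere in tcmalloc
static pthread_key_t heap_key;             // slow path: pthread TSD key
#ifdef HAVE_TLS
static __thread ThreadCache* tls_heap;     // fast path: one TLS slot per thread
#endif

inline ThreadCache* GetCacheFast() {
#ifdef HAVE_TLS
  return tls_heap;                         // one TLS load
#else
  return static_cast<ThreadCache*>(
      pthread_getspecific(heap_key));      // library call on every lookup
#endif
}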

Allocating from the Thread-Local Cache

Memory for a given size is allocated through the ThreadCache::Allocate interface:

inline void* ThreadCache::Allocate(size_t size, size_t cl) {
  ASSERT(size <= kMaxSize);
  ASSERT(size == Static::sizemap()->ByteSizeForClass(cl));

  FreeList* list = &list_[cl];
  if (list->empty()) {
    return FetchFromCentralCache(cl, size);
  }
  size_ -= size;
  return list->Pop();
}

ThreadCache's list_ is an array of FreeList, one entry per size class, each holding the free memory currently available for that class. In memory, a FreeList is a large contiguous region divided into object-sized nodes that are chained together. Unlike a typical list, a node has no explicitly defined next field; instead, the first few bytes of each free object store the address of the next one. Everything is built around void*, which keeps the implementation very compact. For example, reading and setting the next-node pointer:

inline void *SLL_Next(void *t) {
  return *(reinterpret_cast<void**>(t));
}

inline void SLL_SetNext(void *t, void *n) {
  *(reinterpret_cast<void**>(t)) = n;
}
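To make the intrusive layout concrete, here is a small self-contained demo (illustration only, not tcmalloc code) that carves a buffer into 32-byte objects, threads them together with the first-word-as-next trick, and then pops them all:

#include <cstdio>

// Same trick as the SLL_* helpers above: the first word of each free
// object stores the address of the next free object.
inline void* SLL_Next(void* t)             { return *reinterpret_cast<void**>(t); }
inline void  SLL_SetNext(void* t, void* n) { *reinterpret_cast<void**>(t) = n; }

int main() {
  const int kObjSize = 32, kNumObjs = 4;
  alignas(void*) static char buffer[kObjSize * kNumObjs];  // stand-in for a span

  // Carve the buffer into objects and link them: obj0 -> obj1 -> ... -> NULL
  void* head = NULL;
  for (int i = kNumObjs - 1; i >= 0; --i) {
    void* obj = buffer + i * kObjSize;
    SLL_SetNext(obj, head);   // the object's first word points at the old head
    head = obj;
  }

  // Pop each object off the head, just like FreeList::Pop().
  while (head != NULL) {
    void* obj = head;
    head = SLL_Next(head);
    printf("popped object at %p\n", obj);
  }
  return 0;
}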

Allocate checks whether the list is empty. If not, it pops an object straight off the head of the list; otherwise it has to request a batch from the central cache.

FetchFromCentralCache is the function through which a ThreadCache requests memory from the central cache:

// Remove some objects of class "cl" from central cache and add to thread heap.
// On success, return the first object for immediate use; otherwise return NULL.
void* ThreadCache::FetchFromCentralCache(size_t cl, size_t byte_size) {
  FreeList* list = &list_[cl];
  ASSERT(list->empty());
  const int batch_size = Static::sizemap()->num_objects_to_move(cl);

  const int num_to_move = min<int>(list->max_length(), batch_size);
  void *start, *end;
  int fetch_count = Static::central_cache()[cl].RemoveRange(
      &start, &end, num_to_move);
  ASSERT((start == NULL) == (fetch_count == 0));
  if (--fetch_count >= 0) {
    size_ += byte_size * fetch_count;
    list->PushRange(fetch_count, SLL_Next(start), end);
  }

  // Increase max length slowly up to batch_size. After that,
  // increase by batch_size in one shot so that the length is a
  // multiple of batch_size.
  if (list->max_length() < batch_size) {
    list->set_max_length(list->max_length() + 1);
  } else {
    // Don't let the list get too long. In 32 bit builds, the length
    // is represented by a 16 bit int, so we need to watch out for
    // integer overflow.
    int new_length = min<int>(list->max_length() + batch_size,
                              kMaxDynamicFreeListLength);
    // The list's max_length must always be a multiple of batch_size,
    // and kMaxDynamicFreeListLength is not necessarily a multiple
    // of batch_size.
    new_length -= new_length % batch_size;
    ASSERT(new_length % batch_size == 0);
    list->set_max_length(new_length);
  }

  return start;
}

In short, it asks the SizeMap how many objects one transfer should contain, moves that much memory from the central cache into the thread-local free list, and finally updates the list's max_length attribute. max_length grows with every fetch, so that the free lists of size classes the thread uses heavily can hold more and more objects.
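The growth schedule is easy to watch in isolation. A minimal standalone simulation (the batch_size and cap values here are illustrative; the real ones come from SizeMap::num_objects_to_move and kMaxDynamicFreeListLength) prints how max_length evolves:

#include <algorithm>
#include <cstdio>

int main() {
  const int batch_size = 32;                   // e.g. num_objects_to_move(cl)
  const int kMaxDynamicFreeListLength = 8192;  // illustrative cap
  int max_length = 1;
  for (int fetch = 1; fetch <= 40; ++fetch) {
    if (max_length < batch_size) {
      max_length += 1;                         // slow start: +1 per fetch
    } else {
      int n = std::min(max_length + batch_size, kMaxDynamicFreeListLength);
      n -= n % batch_size;                     // stay a multiple of batch_size
      max_length = n;
    }
    printf("after fetch %2d: max_length = %d\n", fetch, max_length);
  }
  return 0;
}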

Allocating from the Central Cache

Each FreeList in a ThreadCache has a corresponding CentralFreeList; moving memory from the central cache into a ThreadCache is done through CentralFreeList's RemoveRange interface:

int CentralFreeList::RemoveRange(void **start, void **end, int N) {
  ASSERT(N > 0);
  lock_.Lock();
  if (N == Static::sizemap()->num_objects_to_move(size_class_) &&
      used_slots_ > 0) {
    int slot = --used_slots_;
    ASSERT(slot >= 0);
    TCEntry *entry = &tc_slots_[slot];
    *start = entry->head;
    *end = entry->tail;
    lock_.Unlock();
    return N;
  }

  int result = 0;
  void* head = NULL;
  void* tail = NULL;
  // TODO: Prefetch multiple TCEntries?
  tail = FetchFromSpansSafe();
  if (tail != NULL) {
    SLL_SetNext(tail, NULL);
    head = tail;
    result = 1;
    while (result < N) {
      void *t = FetchFromSpans();
      if (!t) break;
      SLL_Push(&head, t);
      result++;
    }
  }
  lock_.Unlock();
  *start = head;
  *end = tail;
  return result;
}

Because a CentralFreeList is shared by all threads, it must take its lock before operating. In addition, for size classes whose memory moves back and forth frequently between the CentralFreeList and the per-thread FreeLists, the CentralFreeList maintains a transfer cache of TCEntry slots. Each RemoveRange first checks whether the transfer cache has a suitable entry and returns it directly if so; otherwise it has to fetch from the spans.
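The transfer-cache slot itself is tiny. Abridged from central_freelist.h, a TCEntry merely records the two endpoints of a pre-linked run of exactly num_objects_to_move objects, so a hit can be handed over without walking the list:

// Abridged from gperftools' central_freelist.h: one transfer-cache slot.
struct TCEntry {
  void* head;  // first object of the linked run
  void* tail;  // last object of the linked run
};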

Span

What is a Span? It describes a contiguous run of memory pages. It can act as a node linked with other Spans, and its pages can be carved into objects for small-object allocation. It is defined as the following struct:

struct Span {
  PageID start;   // Starting page number
  Length length;  // Number of pages in span
  Span*  next;    // Used when in link list
  Span*  prev;    // Used when in link list
  void*  objects; // Linked list of free objects
  ...
};

 

A CentralFreeList keeps two span lists:

Span empty_;          // Dummy header for list of empty spans

Span nonempty_;       // Dummy header for list of non-empty spans
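Both are circular doubly-linked lists headed by a dummy Span, so an empty list is one whose header links to itself. The helpers used below behave roughly like this (simplified from tcmalloc's DLL_* functions; the real ones live in span.h/span.cc):

// Simplified versions of the DLL_* helpers operating on Span lists.
inline bool DLL_IsEmpty(const Span* list) {
  return list->next == list;       // dummy header linking to itself
}

inline void DLL_Remove(Span* span) {
  span->prev->next = span->next;
  span->next->prev = span->prev;
  span->prev = NULL;
  span->next = NULL;
}

inline void DLL_Prepend(Span* list, Span* span) {
  span->next = list->next;         // splice in right after the header
  span->prev = list;
  list->next->prev = span;
  list->next = span;
}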

Back in RemoveRange: if the transfer cache misses, control falls through to FetchFromSpansSafe:

void* CentralFreeList::FetchFromSpansSafe() {
  void *t = FetchFromSpans();
  if (!t) {
    Populate();
    t = FetchFromSpans();
  }
  return t;
}

It first tries FetchFromSpans:

void* CentralFreeList::FetchFromSpans() {
  if (tcmalloc::DLL_IsEmpty(&nonempty_)) return NULL;
  Span* span = nonempty_.next;

  ASSERT(span->objects != NULL);
  span->refcount++;
  void* result = span->objects;
  span->objects = *(reinterpret_cast<void**>(result));
  if (span->objects == NULL) {
    // Move to empty list
    tcmalloc::DLL_Remove(span);
    tcmalloc::DLL_Prepend(&empty_, span);
    Event(span, 'E', 0);
  }
  counter_--;
  return result;
}

nonempty_ holds the spans that are not yet empty (i.e., whose memory has not all been handed out). FetchFromSpans fetches one object per call and increments the span's refcount; once all of a span's memory has been allocated, the span is moved from nonempty_ to empty_.

If nonempty_ is itself empty, memory has to come from the page heap. Populate is the function that requests it:

// Fetch memory from the system and add to the central cache freelist.
void CentralFreeList::Populate() {
  // Release central list lock while operating on pageheap
  lock_.Unlock();
  const size_t npages = Static::sizemap()->class_to_pages(size_class_);

  Span* span;
  {
    SpinLockHolder h(Static::pageheap_lock());
    span = Static::pageheap()->New(npages);
    if (span) Static::pageheap()->RegisterSizeClass(span, size_class_);
  }
  if (span == NULL) {
    Log(kLog, __FILE__, __LINE__,
        "tcmalloc: allocation failed", npages << kPageShift);
    lock_.Lock();
    return;
  }
  ASSERT(span->length == npages);

  // Cache sizeclass info eagerly. Locking is not necessary.
  // (Instead of being eager, we could just replace any stale info
  // about this span, but that seems to be no better in practice.)
  for (int i = 0; i < npages; i++) {
    Static::pageheap()->CacheSizeClass(span->start + i, size_class_);
  }

  // Split the block into pieces and add to the free-list
  // TODO: coloring of objects to avoid cache conflicts?
  void** tail = &span->objects;
  char* ptr = reinterpret_cast<char*>(span->start << kPageShift);
  char* limit = ptr + (npages << kPageShift);
  const size_t size = Static::sizemap()->ByteSizeForClass(size_class_);
  int num = 0;
  while (ptr + size <= limit) {
    *tail = ptr;
    tail = reinterpret_cast<void**>(ptr);
    ptr += size;
    num++;
  }
  ASSERT(ptr <= limit);
  *tail = NULL;
  span->refcount = 0;  // No sub-object in use yet

  // Add span to list of non-empty spans
  lock_.Lock();
  tcmalloc::DLL_Prepend(&nonempty_, span);
  ++num_spans_;
  counter_ += num;
}

The SizeMap decides how many pages to request from the page heap at a time. Operating on the page heap requires another lock, and the memory New'ed out of the page heap is likewise described by a Span. After allocation the span still has to be carved up: its pages are split into objects of the corresponding size and linked together, with the Span keeping the head pointer of that list for later allocations. Once carving is done, the new span is put on nonempty_ to await allocation.

Allocating from the Page Heap

The PageHeap maintains free_, an array of SpanList (a SpanList is defined as a pair of Span lists); entry n holds free memory that is exactly n pages long. Each free_[i] is further split into two usable lists: one for ordinary free memory, and one for memory that has been returned to the system. Allocation from the page heap goes through the PageHeap::New interface, shown after the declarations below.
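For reference, the relevant declarations look roughly like this (abridged from gperftools' page_heap.h):

// Abridged from page_heap.h: each length class keeps two doubly-linked
// span lists, one for ordinary free spans and one for spans whose pages
// have been released back to the operating system.
struct SpanList {
  Span normal;    // free spans backed by committed memory
  Span returned;  // free spans already released to the system
};

SpanList large_;            // free spans of length >= kMaxPages
SpanList free_[kMaxPages];  // free_[n]: free spans exactly n pages long

PageHeap::New itself: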

Span* PageHeap::New(Length n) {
  ASSERT(Check());
  ASSERT(n > 0);

  Span* result = SearchFreeAndLargeLists(n);
  if (result != NULL)
    return result;

  if (stats_.free_bytes != 0 && stats_.unmapped_bytes != 0
      && stats_.free_bytes + stats_.unmapped_bytes >= stats_.system_bytes / 4
      && (stats_.system_bytes / kForcedCoalesceInterval
          != (stats_.system_bytes + (n << kPageShift)) / kForcedCoalesceInterval)) {
    // We're about to grow heap, but there are lots of free pages.
    // tcmalloc's design decision to keep unmapped and free spans
    // separately and never coalesce them means that sometimes there
    // can be free pages span of sufficient size, but it consists of
    // "segments" of different type so page heap search cannot find
    // it. In order to prevent growing heap and wasting memory in such
    // case we're going to unmap all free pages. So that all free
    // spans are maximally coalesced.
    //
    // We're also limiting 'rate' of going into this path to be at
    // most once per 128 megs of heap growth. Otherwise programs that
    // grow heap frequently (and that means by small amount) could be
    // penalized with higher count of minor page faults.
    //
    // See also large_heap_fragmentation_unittest.cc and
    // https://code.google.com/p/gperftools/issues/detail?id=368
    ReleaseAtLeastNPages(static_cast<Length>(0x7fffffff));

    // then try again. If we are forced to grow heap because of large
    // spans fragmentation and not because of problem described above,
    // then at the very least we've just unmapped free but
    // insufficiently big large spans back to OS. So in case of really
    // unlucky memory fragmentation we'll be consuming virtual address
    // space, but not real memory
    result = SearchFreeAndLargeLists(n);
    if (result != NULL) return result;
  }

  // Grow the heap and try again.
  if (!GrowHeap(n)) {
    ASSERT(Check());
    return NULL;
  }
  return SearchFreeAndLargeLists(n);
}

The implementation first tries to satisfy the request from free memory. SearchFreeAndLargeLists looks like this:

Span* PageHeap::SearchFreeAndLargeLists(Length n) {
  ASSERT(Check());
  ASSERT(n > 0);

  // Find first size >= n that has a non-empty list
  for (Length s = n; s < kMaxPages; s++) {
    Span* ll = &free_[s].normal;
    // If we're lucky, ll is non-empty, meaning it has a suitable span.
    if (!DLL_IsEmpty(ll)) {
      ASSERT(ll->next->location == Span::ON_NORMAL_FREELIST);
      return Carve(ll->next, n);
    }
    // Alternatively, maybe there's a usable returned span.
    ll = &free_[s].returned;
    if (!DLL_IsEmpty(ll)) {
      // We did not call EnsureLimit before, to avoid releasing the span
      // that will be taken immediately back.
      // Calling EnsureLimit here is not very expensive, as it fails only if
      // there is no more normal spans (and it fails efficiently)
      // or SystemRelease does not work (there is probably no returned spans).
      if (EnsureLimit(n)) {
        // ll may have became empty due to coalescing
        if (!DLL_IsEmpty(ll)) {
          ASSERT(ll->next->location == Span::ON_RETURNED_FREELIST);
          return Carve(ll->next, n);
        }
      }
    }
  }
  // No luck in free lists, our last chance is in a larger class.
  return AllocLarge(n);  // May be NULL
}

It searches the free lists for every length >= n for a suitable Span, looking at the normal list before the returned list. A span it finds may describe more than n pages, so it has to be trimmed: Carve mainly splits off the excess pages and puts them back on the free lists. If the suitable Span is found among memory already returned to the system, taking it effectively means requesting memory from the system again, so EnsureLimit must first check that memory requested from the system will not exceed the configured limit.

If no usable Span is found there, the search falls through to AllocLarge. The PageHeap also keeps large_ (also a SpanList): memory longer than kMaxPages lives in large_ rather than free_. Searching it for a suitable Span works much like searching a free_[i] list, except that the spans chained into large_'s lists do not all describe the same amount of memory, so the search has to choose among different lengths.
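A simplified sketch of that search (modeled on PageHeap::AllocLarge, not the verbatim code, which scans both the normal and returned lists and then hands the winner to Carve): best fit wins, with ties broken by lower address:

// Simplified best-fit scan over one of large_'s lists.
Span* BestFit(Span* header, Length n) {
  Span* best = NULL;
  for (Span* span = header->next; span != header; span = span->next) {
    if (span->length < n) continue;            // too small to satisfy n pages
    if (best == NULL ||
        span->length < best->length ||         // tighter fit wins
        (span->length == best->length &&
         span->start < best->start)) {         // tie: lower address wins
      best = span;
    }
  }
  return best;  // may be NULL; the caller falls back to growing the heap
}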

If free memory still cannot be found, back in PageHeap::New, the free memory and the unmapped memory (that is, the memory tracked on the normal span lists and on the returned span lists) together may well contain enough small page runs. So all free memory is released (via PageHeap::ReleaseAtLeastNPages; as we will see below, memory successfully released back to the system ends up on the returned span lists). Releasing also coalesces spans with adjacent page runs (in PageHeap::MergeIntoFreeList), which achieves the goal of merging all the usable small runs on the returned span lists. After coalescing, the search for usable memory is run once more (PageHeap::SearchFreeAndLargeLists).

Finally, if usable memory really cannot be found, the heap is grown (PageHeap::GrowHeap), which obtains a page-aligned chunk of the requested size from the system.

This raises a question: why can memory that has been successfully released back to the system stay on the returned span lists and be handed out directly by a later allocation? Look at the code that releases one span's memory:

Length PageHeap::ReleaseLastNormalSpan(SpanList* slist) {
  Span* s = slist->normal.prev;
  ASSERT(s->location == Span::ON_NORMAL_FREELIST);
  if (TCMalloc_SystemRelease(reinterpret_cast<void*>(s->start << kPageShift),
                             static_cast<size_t>(s->length << kPageShift))) {
    RemoveFromFreeList(s);
    const Length n = s->length;
    s->location = Span::ON_RETURNED_FREELIST;
    MergeIntoFreeList(s);  // Coalesces if possible.
    return n;
  }
  return 0;
}

The key is in the implementation of TCMalloc_SystemRelease. On Linux, for example, it calls madvise with the MADV_DONTNEED flag, telling the system that this range of memory is no longer referenced and its associated resources can be freed; concretely, the physical pages backing the range may be reclaimed. Crucially, there is a further property: the next access to the released range still succeeds directly, without any additional system call; the system merely has to reload the memory on demand. It is this property that lets TCMalloc keep already-released memory around as usable (though not preferred) memory.

  MADV_DONTNEED
      Do not expect access in the near future. (For the time being, the
      application is finished with the given range, so the kernel can free
      resources associated with it.) Subsequent accesses of pages in this
      range will succeed, but will result either in re-loading of the
      memory contents from the underlying mapped file (see mmap(2)) or
      zero-fill-on-demand pages for mappings without an underlying file.
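A minimal sketch of that release path on Linux (illustration only; the real TCMalloc_SystemRelease also rounds the range to page boundaries before advising the kernel):

#include <sys/mman.h>

// Minimal sketch of the Linux release path. MADV_DONTNEED lets the kernel
// reclaim the physical pages while the virtual range stays valid, so the
// span can be handed out again later without another system call.
bool ReleasePages(void* ptr, size_t length) {  // ptr/length assumed page-aligned
  return madvise(ptr, length, MADV_DONTNEED) == 0;
}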

On Windows, however, TCMalloc_SystemRelease has no implementation and simply returns false (presumably because there is no madvise-like system call), so on Windows the system memory occupied by TCMalloc only ever grows and never shrinks, and the returned span lists are unusable. That looks like a memory leak of sorts, but fortunately there is PageHeap::EnsureLimit: it enforces a configurable upper bound on memory requested from the system, returning false once the limit would be exceeded, and every allocation path calls EnsureLimit so that memory growth stays within a controllable range.

Summary:

Looking at TCMalloc's small-object allocation as a whole, memory blocks circulate among ThreadCache, CentralFreeList, and PageHeap. Each tier is a cache, each representing memory in a different form, and allocation requests climb from one tier to the next in a clearly layered design. The overall design looks simple, yet it hides many details and optimizations. The Span data structure is cleverly conceived and compactly implemented, a very nice idea for describing memory; it deserves a separate, diagrammed analysis later.
