
Commit 2091a83

Merge pull request #6 from ffengc/dev
dev to main
2 parents 0feae7d + 286b82b commit 2091a83

12 files changed: +581 -44 lines changed

README.md

Lines changed: 106 additions & 3 deletions
@@ -1,9 +1,10 @@
# Google-tcmalloc-simulation-implementation (unfinished)

A study and simulated implementation of Google's open-source tcmalloc high-concurrency memory pool.

Project start date: 2024-05-04

- [Google-tcmalloc-simulation-implementation (unfinished)](#google-tcmalloc-simulation-implementation未完成)
- [==Bugs to fix (open issues in this project)==](#bugs-to-fix-项目目前待解决的问题)
- [Preface](#前言)
- [Overall threadCache design](#threadcache整体框架)
- [Writing the threadCache code](#开始写threadcache代码)
@@ -22,9 +23,23 @@
- [page\_cache memory release](#page_cache内存释放)
- [Requests larger than 256 KB](#大于256k的情况)
- [Fixing the use of `new` in the code](#处理代码中new的问题)
- [Making free work without passing the size](#解决free使其不用传大小)
- [In-depth testing under multithreaded scenarios](#多线程场景下深度测试)
- [Analyzing the performance bottleneck](#分析性能瓶颈)
- [Optimizing with a Radix Tree](#用radix-tree进行优化)

***

## ==Bugs to fix (open issues in this project)==

1. On ubuntu_arm64, running with multiple threads triggers a segmentation fault (cause unknown, to be fixed).
2. On ubuntu_arm64, only the third radix tree works; the first two cannot be used, and this still needs to be resolved.
3. On 32-bit Windows, the program sometimes runs successfully but hits sporadic segmentation faults; the cause is unknown and still needs fixing.

After the radix tree optimization, the simulated tcmalloc is more efficient than malloc (tested on win32, where the sporadic segmentation fault can still appear).

![](./assets/5.png)

## Preface

This project implements a high-concurrency memory pool. Its prototype is Google's open-source project tcmalloc, short for Thread-Caching Malloc, which implements efficient multithreaded memory management and is used to replace the system's allocation functions (malloc, free).
@@ -1196,4 +1211,92 @@ void page_cache::release_span_to_page(span* s) {
## Fixing the use of `new` in the code

Some places in the code use `new span`. That is clearly wrong: we are building this tcmalloc to replace malloc, and since `new` itself goes through `malloc`, the replacement code cannot keep using `new`. It has to be changed.

A fixed-size object pool was written earlier, and it can be used in place of `new`.

**Blog post: [What is the principle behind a memory pool? | A simple simulated memory-pool implementation | Preparation for studying the high-concurrency memory pool tcmalloc](https://blog.csdn.net/Yu_Cblog/article/details/131741601)**
page_cache.hpp
```cpp
class page_cache {
private:
    span_list __span_lists[PAGES_NUM];
    static page_cache __s_inst;
    page_cache() = default;
    page_cache(const page_cache&) = delete;
    std::unordered_map<PAGE_ID, span*> __id_span_map;
    object_pool<span> __span_pool;
```
We add one extra member, `object_pool<span> __span_pool;`.

Then every place that did `new span` is replaced with the pool, and every matching `delete` is replaced as well; a sketch of the substitution follows.
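A minimal sketch of what the substitution looks like inside page_cache; the surrounding code and the variable name `prev_span` are illustrative only, not the exact committed lines:

```cpp
// before: span bookkeeping objects went through the global heap
// span* new_s = new span;
// delete prev_span;

// after: they are served from the fixed-size object pool member
span* new_s = __span_pool.new_();   // placement-new inside the pool's 128 KB slab
__span_pool.delete_(prev_span);     // runs ~span() and pushes the block onto the pool's free list
```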
Then the following also needs to change.

tcmalloc.hpp
```cpp
static void* tcmalloc(size_t size) {
    if (size > MAX_BYTES) {
        // handle a large allocation (more than MAX_BYTES)
        size_t align_size = size_class::round_up(size);
        size_t k_page = align_size >> PAGE_SHIFT;
        page_cache::get_instance()->__page_mtx.lock();
        span* cur_span = page_cache::get_instance()->new_span(k_page); // go straight to the page cache
        page_cache::get_instance()->__page_mtx.unlock();
        void* ptr = (void*)(cur_span->__page_id << PAGE_SHIFT); // convert the span back to an address
        return ptr;
    }
    if (p_tls_thread_cache == nullptr) {
        // effectively a per-thread singleton
        // p_tls_thread_cache = new thread_cache;
        static object_pool<thread_cache> tc_pool;
        p_tls_thread_cache = tc_pool.new_();
    }
#ifdef PROJECT_DEBUG
    LOG(DEBUG) << "tcmalloc find tc from mem" << std::endl;
#endif
    return p_tls_thread_cache->allocate(size);
}
```
## Making free work without passing the size

Since we already have the mapping from page id to span, it is enough to add one field to span, obj_size, recording the size of the objects the span has been cut into.
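A minimal sketch of the resulting size-free release path, assuming the interfaces shown elsewhere in this commit (`map_obj_to_span`, `release_span_to_page`, a `thread_cache::deallocate(void*, size_t)` method); the exact body is an assumption, not the committed code:

```cpp
static void tcfree(void* ptr) {
    span* s = page_cache::get_instance()->map_obj_to_span(ptr); // page id -> span
    size_t size = s->__obj_size;                                // size recorded when the span was cut
    if (size > MAX_BYTES) {
        // large block: hand the whole span straight back to the page cache
        page_cache::get_instance()->__page_mtx.lock();
        page_cache::get_instance()->release_span_to_page(s, size);
        page_cache::get_instance()->__page_mtx.unlock();
        return;
    }
    p_tls_thread_cache->deallocate(ptr, size); // small object: back to this thread's cache
}
```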
## In-depth testing under multithreaded scenarios

**One thing must be made clear first: we are not out to reinvent the wheel. We compare against malloc, but the point is not how much faster than malloc we are, because in many details this implementation is still far behind the real tcmalloc.**

The test code is in bench\_mark.cc.

Results:
```bash
parallels@ubuntu-linux-22-04-desktop:~/Project/Google-tcmalloc-simulation-implementation$ ./out
==========================================================
4 threads, 10 rounds, 1000 concurrent allocs per round: cost: 27877 ms
4 threads, 10 rounds, 1000 concurrent deallocs per round: cost: 52190 ms
4 threads, 40000 concurrent alloc&dealloc operations in total: cost: 80067 ms


4 threads, 10 rounds, 1000 malloc calls per round: cost: 2227 ms
4 threads, 10 rounds, 1000 free calls per round: cost: 1385 ms
4 threads, 40000 malloc&free operations in total: cost: 3612 ms
==========================================================
parallels@ubuntu-linux-22-04-desktop:~/Project/Google-tcmalloc-simulation-implementation$
```
Much slower than malloc.

## Analyzing the performance bottleneck

Both Linux and Windows (Visual Studio) provide plenty of profiling tools that can show where most of the call time goes.

The conclusion, stated directly: the locks account for a large share of the time.

This can be optimized with a radix tree.

## Optimizing with a Radix Tree

For the radix tree we can use the one from the tcmalloc source directly: `page_map.hpp`.
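Below is a rough illustration of why a radix-tree page map removes the locking cost of the hash map. It is a one-level sketch written for this note, not the project's `page_map.hpp`: the project uses tcmalloc's three-level `TCMalloc_PageMap3`, because a flat array over all `SYS_BYTES - PAGE_SHIFT` page-id bits is only practical on 32-bit systems. The class and member names here are made up; `system_alloc`, `PAGE_ID`, and `PAGE_SHIFT` come from common.hpp.

```cpp
#include <cstring> // memset

// One preallocated slot per page id. Since the array is allocated up front and is never
// rehashed or moved, get() can be called without taking the page-cache mutex; set() is
// only called by code that already holds the page lock.
template <int BITS> // BITS = SYS_BYTES - PAGE_SHIFT
class simple_page_map1 {
private:
    static const size_t LENGTH = (size_t)1 << BITS;
    void** __array = nullptr;

public:
    simple_page_map1() {
        size_t bytes = sizeof(void*) * LENGTH;
        size_t pages = (bytes + (1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
        __array = (void**)system_alloc(pages); // from common.hpp
        memset(__array, 0, bytes);
    }
    void set(PAGE_ID id, void* span_ptr) { __array[id] = span_ptr; }
    void* get(PAGE_ID id) const {
        if ((id >> BITS) > 0) return nullptr; // page id out of range
        return __array[id];
    }
};
```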

assets/5.png

75.5 KB

bench_mark.cc

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
#include "./include/tcmalloc.hpp"
#include <atomic>
#include <ctime>    // clock()
#include <iostream>
#include <thread>
#include <vector>

// ntimes: number of allocations/frees per round
// rounds: number of rounds
void BenchmarkMalloc(size_t ntimes, size_t nworks, size_t rounds) {
    std::vector<std::thread> vthread(nworks);
    std::atomic<size_t> malloc_costtime(0);
    std::atomic<size_t> free_costtime(0);
    for (size_t k = 0; k < nworks; ++k) {
        vthread[k] = std::thread([&, k]() {
            std::vector<void*> v;
            v.reserve(ntimes);
            for (size_t j = 0; j < rounds; ++j) {
                size_t begin1 = clock();
                for (size_t i = 0; i < ntimes; i++) {
                    v.push_back(malloc(16));
                    // v.push_back(malloc((16 + i) % 8192 + 1));
                }
                size_t end1 = clock();
                size_t begin2 = clock();
                for (size_t i = 0; i < ntimes; i++) {
                    free(v[i]);
                }
                size_t end2 = clock();
                v.clear();
                malloc_costtime += (end1 - begin1);
                free_costtime += (end2 - begin2);
            }
        });
    }
    for (auto& t : vthread) {
        t.join();
    }
    std::cout << nworks << " threads run " << rounds << " rounds, each round malloc " << ntimes << " times, cost: " << malloc_costtime.load() << " ms\n";
    std::cout << nworks << " threads run " << rounds << " rounds, each round free " << ntimes << " times, cost: " << free_costtime.load() << " ms\n";
    std::cout << nworks << " threads run malloc and free " << nworks * rounds * ntimes << " times, total cost: " << malloc_costtime.load() + free_costtime.load() << " ms\n";
}

// args: allocations/frees per round, number of threads, number of rounds
void BenchmarkConcurrentMalloc(size_t ntimes, size_t nworks, size_t rounds) {
    std::vector<std::thread> vthread(nworks);
    std::atomic<size_t> malloc_costtime(0);
    std::atomic<size_t> free_costtime(0);
    for (size_t k = 0; k < nworks; ++k) {
        vthread[k] = std::thread([&]() {
            std::vector<void*> v;
            v.reserve(ntimes);
            for (size_t j = 0; j < rounds; ++j) {
                size_t begin1 = clock();
                for (size_t i = 0; i < ntimes; i++) {
                    v.push_back(tcmalloc(16));
                    // v.push_back(ConcurrentAlloc((16 + i) % 8192 + 1));
                }
                size_t end1 = clock();
                size_t begin2 = clock();
                for (size_t i = 0; i < ntimes; i++) {
                    tcfree(v[i]);
                }
                size_t end2 = clock();
                v.clear();
                malloc_costtime += (end1 - begin1);
                free_costtime += (end2 - begin2);
            }
        });
    }
    for (auto& t : vthread) {
        t.join();
    }
    std::cout << nworks << " threads run " << rounds << " rounds, each round malloc " << ntimes << " times, cost: " << malloc_costtime.load() << " ms\n";
    std::cout << nworks << " threads run " << rounds << " rounds, each round free " << ntimes << " times, cost: " << free_costtime.load() << " ms\n";
    std::cout << nworks << " threads run tcmalloc and tcfree " << nworks * rounds * ntimes << " times, total cost: " << malloc_costtime.load() + free_costtime.load() << " ms\n";
}

int main() {
    size_t n = 1000;
    BenchmarkConcurrentMalloc(n, 4, 10);
    std::cout << std::endl
              << std::endl;
    BenchmarkMalloc(n, 4, 10);
    return 0;
}

include/common.hpp

Lines changed: 3 additions & 0 deletions
@@ -28,8 +28,10 @@ static const size_t PAGE_SHIFT = 13;
#if defined(_WIN64) || defined(__x86_64__) || defined(__ppc64__) || defined(__aarch64__)
typedef unsigned long long PAGE_ID;
#define SYS_BYTES 64
#else
typedef size_t PAGE_ID;
#define SYS_BYTES 32
#endif

inline static void* system_alloc(size_t kpage) {
@@ -199,6 +201,7 @@ class span {
    size_t __use_count = 0;      // how many of the cut small blocks have been handed out to threadCache
    void* __free_list = nullptr; // free list of the small blocks cut from this span
    bool __is_use = false;       // whether this span is currently in use
    size_t __obj_size;           // size of the small objects this span is cut into
};

// doubly linked circular list with a sentinel head node

include/object_pool.hpp

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
#ifndef __YUFC_OBJECT_POOL_HPP__
#define __YUFC_OBJECT_POOL_HPP__

#include <iostream>
#include <vector>
#include "./common.hpp"

#define __DEFAULT_KB__ 128

template <class T>
class object_pool {
private:
    char* __memory = nullptr;    // char* so the big block is easy to slice
    size_t __remain_bytes = 0;   // bytes left in the big block while it is being sliced
    void* __free_list = nullptr; // free list formed by the blocks that are returned

public:
    T* new_() {
        T* obj = nullptr;
        // prefer reusing blocks that have been returned to the free list
        if (__free_list) {
            // pop from the head of the free list
            void* next = *((void**)__free_list);
            obj = (T*)__free_list;
            __free_list = next;
            return obj;
        }
        if (__remain_bytes < sizeof(T)) {
            // not enough space left, grab a fresh big block
            __remain_bytes = __DEFAULT_KB__ * 1024;
            __memory = (char*)malloc(__remain_bytes);
            if (__memory == nullptr) {
                throw std::bad_alloc();
            }
        }
        obj = (T*)__memory;
        size_t obj_size = sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T);
        __memory += obj_size;
        __remain_bytes -= obj_size;
        new (obj) T;
        return obj;
    }
    void delete_(T* obj) {
        obj->~T();
        *(void**)obj = __free_list;
        __free_list = obj;
    }
};

#endif
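A quick usage sketch of the pool above (hypothetical driver code, not part of the commit), showing how the free list recycles returned blocks:

```cpp
object_pool<span> pool;
span* a = pool.new_();  // placement-new constructs a span inside the current 128 KB slab
pool.delete_(a);        // runs the destructor and pushes the block onto the free list
span* b = pool.new_();  // reuses the block that a occupied, taken from the free list head
```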

include/page_cache.hpp

Lines changed: 6 additions & 1 deletion
@@ -4,14 +4,18 @@
#define __YUFC_PAGE_CACHE_HPP__

#include "./common.hpp"
#include "./object_pool.hpp"
#include "./page_map.hpp"

class page_cache {
private:
    span_list __span_lists[PAGES_NUM];
    static page_cache __s_inst;
    page_cache() = default;
    page_cache(const page_cache&) = delete;
    // std::unordered_map<PAGE_ID, span*> __id_span_map;
    TCMalloc_PageMap3<SYS_BYTES - PAGE_SHIFT> __id_span_map;
    object_pool<span> __span_pool;

public:
    std::mutex __page_mtx;
@@ -21,6 +25,7 @@ class page_cache {
    span* map_obj_to_span(void* obj);
    // release a free span back to the page cache and merge adjacent spans
    void release_span_to_page(span* s, size_t size = 0);

public:
    // get a span of k pages
    span* new_span(size_t k);
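For reference, a sketch of what `map_obj_to_span` can look like once the radix tree is in place (set/get being the interface of tcmalloc's page maps); the real body lives in the project's page_cache source, which is not part of this diff, so treat this as an assumption:

```cpp
span* page_cache::map_obj_to_span(void* obj) {
    PAGE_ID id = ((PAGE_ID)obj) >> PAGE_SHIFT; // address -> page id
    span* s = (span*)__id_span_map.get(id);    // direct radix-tree lookup, no page lock needed
    assert(s != nullptr);                      // every in-use page must have been registered via set()
    return s;
}
```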
