Performance Optimization

This post records the optimization work done on the QMWS project's servers after their performance during the public test turned out to be far worse than during the technical review. It was a fairly painful stretch: close to two consecutive all-nighters of changes. The first night was spent hunting for and guessing at the problems and figuring out how they arose; the second day was mainly spent fixing the corresponding performance issues.

The performance problems were mainly concentrated in:

Memory being consumed too quickly

Memory leaks

Too many protocol messages in certain time windows

Poorly implemented logic

Other costs such as network overhead weren't considered for now. Below I'll go into detail on how each of these problems arose and how they were solved step by step.

1. Memory consumed too quickly

Our project uses a C++ backend built on NFrame, an open-source server data-management framework originally written by our team's lead programmer, which I later helped develop. The GitHub address is below.


ketoo / NoahGameFrame

A fast, scalable, distributed game server engine/framework for C++, include actor library, network library,can be used as a real time multiplayer game engine ( MMO RPG/MOBA ), which support C#/Lua script/ Unity3d, Cocos2dx and plan to support Unreal.

https://github.com/ketoo/NoahGameFrame/wiki

I won't go into much detail about the framework here; if you're interested, pull it from GitHub and study it, or join the discussion group.

The main problem with the framework is that it makes a huge number of small variable and memory allocations, producing a lot of fragmentation that Linux does not reclaim promptly. We had also overestimated shared_ptr: smart pointers do remove a lot of manual memory management, but their lifetimes are hard to pin down. As a stopgap we brought in Google's TCMalloc and actively released memory back to the OS every 5 minutes, which did solve our problem.
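As a sketch of that stopgap (assuming the server links against gperftools; the project's actual code differs), a background thread can ask TCMalloc to return free memory to the OS on a 5-minute interval:

#include <chrono>
#include <thread>
#include <gperftools/malloc_extension.h>

// Minimal sketch: periodically hand TCMalloc's free spans back to the OS
// so fragmentation doesn't keep resident memory growing.
void StartPeriodicMemoryRelease()
{
    std::thread([] {
        for (;;)
        {
            std::this_thread::sleep_for(std::chrono::minutes(5));
            MallocExtension::instance()->ReleaseFreeMemory();  // gperftools API
        }
    }).detach();
}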

2. Memory leaks

The memory leaks were mainly plain code defects: the logic was not rigorous enough, and on some release paths a conditional check meant the object was never freed. There were also a few data-management issues, but nothing particularly serious.
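As an illustration of that kind of leak (a hypothetical example, not the project's code): an early return on a failed check skips the delete, and holding the object in a smart pointer removes that class of bug.

#include <memory>

struct Player { int id = 0; };

// Hypothetical helpers for the illustration.
static bool Validate(const Player& p) { return p.id >= 0; }
static void Register(Player*) {}

// Leaky version: the early return on a failed check skips the delete.
void AddPlayerLeaky()
{
    Player* p = new Player();
    if (!Validate(*p))
        return;                 // leak: p is never freed on this path
    Register(p);
    delete p;
}

// RAII version: unique_ptr frees the object on every exit path.
void AddPlayerSafe()
{
    auto p = std::make_unique<Player>();
    if (!Validate(*p))
        return;                 // no leak: unique_ptr cleans up
    Register(p.get());
}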

3. Too much logic triggered by certain operations

NFrame handles data changes through event callbacks: we register callbacks on the properties (Property) and tables (Record) we care about and run our logic there. You have to watch the requirements closely here, because a property can change several times within a single operation, and each change recomputes other properties. The problem we ran into was recalculating the combat power of multiple heroes whenever a property changed; an operation that touched several properties multiplied the combat-power calculations and weighed down the logic thread. The fix was to remove some of the callbacks and do that work at specific points inside the logic operation instead, cutting down the number of calls (see the sketch below).
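This is not NFrame's actual API; it is a minimal hypothetical sketch of the pattern we moved to: property callbacks only mark a hero as dirty, and combat power is recomputed once per operation.

#include <string>
#include <unordered_set>

// Coalesce per-property callbacks into one recalculation per operation
// instead of one recalculation per changed property.
class CombatPowerCache
{
public:
    // Called from each property callback: just record that the hero is dirty.
    void OnPropertyChanged(const std::string& heroId) { dirty_.insert(heroId); }

    // Called once, at a chosen point at the end of the logic operation.
    void Flush()
    {
        for (const auto& heroId : dirty_)
            Recalculate(heroId);          // expensive, now runs once per hero
        dirty_.clear();
    }

private:
    void Recalculate(const std::string& /*heroId*/) { /* sum up properties... */ }
    std::unordered_set<std::string> dirty_;
};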

4. Poorly implemented logic

These were the usual, common problems; after refactoring, cleanup, and performance testing they show up clearly, so I won't go into them.

 

Some recommended tools:

VLD: C++ memory-leak detection tool

VTune: C++ performance and memory profiling tool; integrates seamlessly with the VS IDE and is very pleasant to use

Valgrind: memory checking tool for Linux; the downside is there is no IDE integration and the reports are not easy to read, but the analysis is quite accurate

BoundsChecker (DevPartner): automated testing / coverage testing tool, from Borland (a name many people probably haven't heard of any more), a first-class product in the industry

 

The good news is that the problems have essentially all been tracked down, so subsequent test builds had no major performance issues, though we still need to push for better load-test results.

PS: The production environment still differs a lot from local testing. A few days ago I came across a testing approach I found quite interesting and want to share. You run a functional server A in production and a second, identical server B alongside it; A and B run exactly the same code. A serves real players, while B is an internal test server. A does one extra thing: it forwards every message it receives to B, and B processes those messages according to the same protocol. By watching B's data and performance you get a faithful replica of the production load, which makes it much easier to find and fix problems. There is a corresponding project on GitHub called TCPCopy; take a look if you're interested. A minimal sketch of the forwarding idea follows.
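This is a hypothetical illustration of the mirroring idea only (not TCPCopy's actual code; HandleMessage and SendToShadowServer are placeholder names for whatever the networking layer provides):

#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder stubs standing in for the real networking layer.
static void HandleMessage(const uint8_t*, std::size_t) { /* normal game logic on server A */ }
static void SendToShadowServer(const uint8_t*, std::size_t) { /* fire-and-forget copy to server B */ }

// On server A: process the message as usual, and mirror it to B so that B
// sees exactly the same protocol stream under real player load.
void OnClientMessage(const std::vector<uint8_t>& msg)
{
    HandleMessage(msg.data(), msg.size());
    SendToShadowServer(msg.data(), msg.size());   // B replays it for profiling and debugging
}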

Reposted from: boost::flat_map and its performance compared to map and unordered_map

I have run a benchmark on different data structures very recently at my company, so I feel I need to drop a word. It is very complicated to benchmark something correctly.

Benchmarking

On the web we rarely find (if ever) a well engineered benchmark. Until today I only found benchmarks that were done the journalist way (pretty quickly and sweeping dozens of variables under the carpet).

1) You need to consider cache warming

Most people running benchmarks are afraid of timer discrepancy, so they run their code thousands of times and measure the total time; they are just careful to use the same number of iterations for every operation, and then consider the results comparable.

The truth is, in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things by calling them once only. Intel has published a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).

2) RDTSC accuracy measurement

I also recommend doing this:

#include <cstdint>
#include <cstdio>
#include <algorithm>

typedef std::uint64_t u64;

u64 g_correctionFactor;  // number of clocks to subtract from each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#include <intrin.h>
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // serialize: flush the OOO instruction pipeline before reading the TSC
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warm up cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer's overhead with the measurer itself (crazy, eh..)
    u64 minDiff = errormeasure;  // start from the largest possible u64
    u64 maxDiff = 0;             // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);   // take many samples, keep the smallest ever seen.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", (unsigned long long)g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", (unsigned long long)g_accuracy);
}
#endif

This is a discrepancy measurer, and it takes the minimum of all measured values, to avoid occasionally getting a value around -10^18 (the first negative values of a 64-bit signed integer).

Notice the use of intrinsics and not inline assembly. First, inline assembly is rarely supported by compilers nowadays; but much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside, so this is a problem when benchmarking real-world code, especially when calling things just once. An intrinsic is suited here because it doesn't break the compiler's free reordering of instructions.
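A usage sketch for the helpers above: time a single call, once, and subtract the measurer's own overhead. DoTheOperationOnce is a hypothetical placeholder for the code under test, and WarmupRDTSC() must have been called first.

inline void DoTheOperationOnce() { /* the code under test */ }

inline u64 TimeOnce()
{
    u64 t0 = GetRDTSC();
    DoTheOperationOnce();
    u64 t1 = GetRDTSC();
    return (t1 - t0) - g_correctionFactor;   // only meaningful to within g_accuracy clocks
}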

3) parameters

The last problem is that people usually test too few variations of the scenario. A container's performance is affected by:

  1. Allocator
  2. size of the contained type
  3. cost of the copy, assignment, move, and construction operations of the contained type
  4. number of elements in the container (size of the problem)
  5. whether the type has trivial versions of the point-3 operations
  6. whether the type is a POD

Point 1 is important because containers do allocate from time to time, and it matters a lot whether they allocate using the CRT "new" or some user-defined operation, like pool allocation or a freelist or other…

Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type, the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types and lose for larger types.

Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.

Point 4 is a question of big O mixed with cache issues. Containers with bad complexity can largely outperform low-complexity containers for small numbers of elements (like map vs. vector: the vector's cache locality is good, while map fragments the memory). Then at some crossing point they lose, because the overall contained size starts to "leak" into main memory and cause cache misses, plus the fact that the asymptotic complexity starts to be felt.

Point 5 is about compilers being able to elide code that is empty or trivial at compile time. This can greatly optimize some operations, because the containers are templates, so each type will have its own performance profile.

Point 6: same as point 5, PODs can benefit from the fact that copy construction is just a memcpy, and some containers have a specific implementation for these cases, using partial template specialization or SFINAE to select algorithms according to the traits of T.
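To make point 6 concrete, here is a hedged illustration of trait-based dispatch (my own sketch, not boost's actual code); it uses C++17 if constexpr for brevity, where the classic approach would use partial specialization or SFINAE:

#include <cstddef>
#include <cstring>
#include <type_traits>

// Trivially copyable types are bulk-copied with memcpy; others copy element by element.
template <typename T>
void copy_range(const T* src, T* dst, std::size_t n)
{
    if constexpr (std::is_trivially_copyable<T>::value)
    {
        std::memcpy(dst, src, n * sizeof(T));    // POD fast path
    }
    else
    {
        for (std::size_t i = 0; i < n; ++i)      // generic path: runs copy assignments
            dst[i] = src[i];
    }
}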

About the flat map

Apparently the flat map is a sorted vector wrapper, like Loki AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insert and delete of single elements.
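To make the idea concrete, here is a minimal sketch of a "sorted vector map" (my own illustration, not boost's implementation): lookups are a binary search over contiguous memory, and insertions shift the tail of the vector.

#include <algorithm>
#include <utility>
#include <vector>

template <typename K, typename V>
class FlatMapSketch
{
public:
    // O(log N) search to find the slot, O(N) shifting of the tail on insert.
    void insert(const K& key, V value)
    {
        auto it = std::lower_bound(data_.begin(), data_.end(), key,
                                   [](const std::pair<K, V>& p, const K& k) { return p.first < k; });
        if (it == data_.end() || it->first != key)
            data_.insert(it, {key, std::move(value)});   // shifts everything after the slot
        else
            it->second = std::move(value);
    }

    // O(log N) binary search over contiguous memory: great cache locality.
    V* find(const K& key)
    {
        auto it = std::lower_bound(data_.begin(), data_.end(), key,
                                   [](const std::pair<K, V>& p, const K& k) { return p.first < k; });
        return (it != data_.end() && it->first == key) ? &it->second : nullptr;
    }

private:
    std::vector<std::pair<K, V>> data_;   // kept sorted by key
};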

This is still an ordered container. Most people usually don't need the ordering part, hence the existence of unordered containers.

Have you considered that maybe you need a flat_unordered_map? It would be something like google::sparse_hash_map or similar: an open-addressing hash map.

The problem with open-addressing hash maps is that at rehash time they have to copy everything over to the new, extended flat storage, whereas a standard unordered map only has to recreate the hash index while the allocated data stays where it is. The disadvantage, of course, is that the memory is fragmented like hell.

The criterion for a rehash in an open-addressing hash map is when the number of elements exceeds the size of the bucket vector multiplied by the load factor.

A typical load factor is 0.8, so you need to care about that. If you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon; this guarantees never having to spuriously rehash and recopy everything during filling.
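A small sketch of that pre-sizing rule; buckets_for is a hypothetical helper, and for std::unordered_map the same intent is expressed with max_load_factor() plus reserve():

#include <cstddef>
#include <unordered_map>

// Bucket count to request before filling an open-addressing map:
// intended_filling * (1 / load_factor) + epsilon.
std::size_t buckets_for(std::size_t intended_filling, double load_factor = 0.8)
{
    return static_cast<std::size_t>(intended_filling / load_factor) + 1;   // +1 is the "epsilon"
}

int main()
{
    std::unordered_map<int, int> m;
    m.max_load_factor(0.8f);
    m.reserve(10000);            // no rehash while inserting up to 10000 elements

    (void)buckets_for(10000);    // ~12501 buckets for a hypothetical open-addressing map
    return 0;
}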

The advantage of closed-addressing maps (std::unordered_map and friends) is that you don't have to care about those parameters.

But the boost flat_map is an ordered vector, so it will always have log(N) asymptotic complexity, which is worse than the open-addressing hash map (amortized constant time). You should consider that as well.

Benchmark results

This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.

tested types information:

typeid=__int64 .  sizeof=8 . ispod=yes
typeid=struct MediumTypePod .  sizeof=184 . ispod=yes

Insertion

EDIT:

Ok, because my previous results included a bug, they actually tested ordered insertion, which exhibited very fast behavior for the flat maps. I left those results further down because they are interesting.
This is the correct test:

[benchmark charts: random insertion]

I have checked the implementation; there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, so this benchmark exhibits the following asymptotic tendencies:

map : N * log(N)
hashmaps : amortized N
vector and flatmaps : N * N

Warning: hereafter, the test for std::map and both flat_maps is buggy and actually tests ordered insertion:
random insert of 100 elements without reservation

We can see that ordered insertion results in back-pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is nowhere near the optimality of back-insertion in a vector: that takes 3 million cycles, while we observe 4.8M here for boost (160% of the optimal).

random insert of 10000 elements without reservation

Random search of 3 elements (clocks renormalized to 1)

in size = 100

rand search within container of 100 elements

in size = 10000

rand search within container of 10000 elements

Iteration

over size 100 (only MediumPod type)

Iteration over 100 medium pods

over size 10000 (only MediumPod type)

Iteration over 10000 medium pods

best regards