Reposted from: boost::flat_map and its performance compared to map and unordered_map

I have run a benchmark on different data structures very recently at my company, so I feel I need to drop a word. It is very complicated to benchmark something correctly.

Benchmarking

On the web we rarely (if ever) find a well-engineered benchmark. Until today I had only found benchmarks done the journalist way: pretty quickly, and sweeping dozens of variables under the carpet.

1) You need to consider cache warming

Most people running benchmarks are afraid of timer discrepancy, so they run their code thousands of times and measure the total time; they are just careful to use the same number of iterations for every operation, and then consider the results comparable.

The truth is, in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things by calling them once only. Intel has published a paper describing how to use RDTSC (using a CPUID instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).

2) RDTSC accuracy measurement

I also recommend doing this:

#include <cstdint>
#include <cstdio>
#include <algorithm>

typedef uint64_t u64;

u64 g_correctionFactor;  // number of clocks to subtract from each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#include <intrin.h>      // __rdtsc, __cpuid
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // serializing instruction: flush the OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warm up cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer's overhead with the measurer itself (crazy, eh?)
    u64 minDiff = errormeasure;
    u64 maxDiff = 0;   // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);   // take many samples, keep the smallest ever seen.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif

This is a discrepancy measurer; it takes the minimum of all measured values, to avoid occasionally getting a value around -10^18 (the first negative values of a 64-bit integer) from time to time.
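For completeness, here is a minimal usage sketch (my own, not from the original answer) of how the warm-up and the correction factor would be applied to time a single cold call; function_under_test is a hypothetical placeholder for the code being measured:

void function_under_test();                          // hypothetical: the code you want to time

void BenchmarkOnce()
{
    WarmupRDTSC();                                   // calibrates g_correctionFactor and g_accuracy
    u64 t0 = GetRDTSC();
    function_under_test();                           // called exactly once, cold caches
    u64 t1 = GetRDTSC();
    u64 clocks = (t1 - t0) - g_correctionFactor;     // subtract the measurer's own overhead
    printf("cost: %llu clocks (+/- %llu)\n", clocks, g_accuracy);
}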

Notice the use of intrinsics and not inline assembly. First, inline assembly is rarely supported by compilers nowadays, but much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze its contents. This is a problem when benchmarking real-world code, especially when calling things just once. An intrinsic is better suited here, because it does not break the compiler's free reordering of instructions.
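For reference, a rough equivalent for GCC/Clang (my assumption, not from the original post; it presumes an x86-64 target) uses those compilers' own intrinsics headers, again without inline assembly:

#include <x86intrin.h>   // __rdtsc
#include <cpuid.h>       // __cpuid macro
#include <cstdint>

inline uint64_t GetRDTSC_GccClang()
{
    unsigned int a, b, c, d;
    __cpuid(0x80000000, a, b, c, d);  // serializing instruction to flush the OOO pipeline
    return __rdtsc();
}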

3) Parameters

The last problem is that people usually test too few variations of the scenario. A container's performance is affected by:

  1. Allocator
  2. Size of the contained type
  3. Cost of the contained type's copy, assignment, move, and construction operations
  4. Number of elements in the container (size of the problem)
  5. Whether the type has trivial versions of the operations in point 3
  6. Whether the type is a POD

Point 1 is important because containers do allocate from time to time, and it matters a lot whether they allocate using the CRT "new" or some user-defined mechanism, like a pool allocator, a freelist, or something else…
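For illustration, here is a minimal sketch (not part of the original benchmark) of swapping the default CRT allocation for a pool-like allocator, using C++17 polymorphic memory resources; the sizes are arbitrary:

#include <map>
#include <memory_resource>

int main()
{
    std::pmr::monotonic_buffer_resource pool(64 * 1024);   // one upfront 64 KiB slab instead of per-node new/delete
    std::pmr::map<int, long long> m(&pool);                // map nodes are carved out of the slab
    for (int i = 0; i < 1000; ++i)
        m.emplace(i, i * 2LL);
    return 0;
}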

Point 2 matters because some containers (say A) lose time copying things around, and the bigger the type, the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types and lose for larger types.

Point 3 is the same as point 2, except that it multiplies the cost by some weighting factor.

Point 4 is a question of big O mixed with cache issues. Containers with bad complexity can largely outperform containers with low complexity for small numbers of elements (for example vector vs. map: the vector's cache locality is good, whereas map fragments memory). Then, at some crossing point, they lose, because the overall contained size starts to "leak" into main memory and cause cache misses, and on top of that the asymptotic complexity starts to be felt.

Point 5 is about compilers being able to elide things that are empty or trivial at compile time. This can greatly optimize some operations, because containers are templates, so each type will have its own performance profile.

Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers have a specific implementation for these cases, using partial template specialization or SFINAE to select algorithms according to the traits of T.
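As an illustration of point 6, here is a minimal sketch (my own, not boost's actual code, and using if constexpr rather than SFINAE for brevity) of selecting a bulk memcpy for trivially copyable types:

#include <cstddef>
#include <cstring>
#include <type_traits>

template <class T>
void copy_range(const T* src, T* dst, std::size_t n)
{
    if constexpr (std::is_trivially_copyable_v<T>)
        std::memcpy(dst, src, n * sizeof(T));    // POD-like types: one bulk memcpy
    else
        for (std::size_t i = 0; i < n; ++i)      // non-trivial types: per-element copy assignment
            dst[i] = src[i];
}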

About the flat map

Apparently the flat map is a sorted vector wrapper, like Loki's AssocVector, but with some supplementary modernizations that come with C++11, exploiting move semantics to accelerate insertion and deletion of single elements.
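To make the idea concrete, here is a minimal sketch (my own, not boost's implementation) of a sorted-vector map where lookup and insertion both go through lower_bound:

#include <algorithm>
#include <utility>
#include <vector>

template <class K, class V>
struct TinyFlatMap
{
    std::vector<std::pair<K, V>> data;   // kept sorted by key

    V* find(const K& key)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<K, V>& p, const K& k) { return p.first < k; });
        return (it != data.end() && it->first == key) ? &it->second : nullptr;   // O(log N) lookup
    }

    void insert(const K& key, V value)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<K, V>& p, const K& k) { return p.first < k; });
        if (it == data.end() || it->first != key)
            data.insert(it, {key, std::move(value)});   // shifts (moves) the tail: O(N) per insertion
    }
};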

This is still an ordered container. Most people usually don't need the ordering part, hence the existence of unordered maps.

Have you considered that maybe you need a flat_unordered_map? Something like google::sparse_map or similar: an open-addressing hash map.

The problem with open-addressing hash maps is that at rehash time they have to copy everything over to the new, larger flat storage, whereas a standard unordered map only has to recreate the hash index while the allocated data stays where it is. The disadvantage, of course, is that the memory is fragmented like hell.

An open-addressing hash map rehashes when its size exceeds the size of the bucket vector multiplied by the load factor.

A typical load factor is 0.8, so you need to care about that: if you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon. This guarantees you never have to spuriously rehash and recopy everything during filling.
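As a concrete sketch of that rule (my own example; shown with std::unordered_map, whereas the text states the rule for an open-addressing map), pre-sizing could look like this:

#include <cstddef>
#include <unordered_map>

int main()
{
    const std::size_t intended_filling = 100000;           // how many elements we plan to insert
    const double load_factor = 0.8;                        // assumed load factor of the table

    std::unordered_map<int, long long> m;
    m.max_load_factor(static_cast<float>(load_factor));
    // intended_filling * (1/0.8) + epsilon buckets, as in the formula above;
    // std::unordered_map::reserve(intended_filling) would do this division for you.
    m.rehash(static_cast<std::size_t>(intended_filling * (1.0 / load_factor)) + 1);

    for (int i = 0; i < static_cast<int>(intended_filling); ++i)
        m.emplace(i, i);                                   // no rehash should occur during filling
    return 0;
}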

The advantage of closed-addressing maps (std::unordered_map and friends) is that you don't have to care about those parameters.

But the boost flat_map is an ordered vector, so it will always have log(N) asymptotic complexity, which is worse than the open-addressing hash map (amortized constant time). You should consider that as well.

Benchmark results

This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.

tested types information:

typeid=__int64 .  sizeof=8 . ispod=yes
typeid=struct MediumTypePod .  sizeof=184 . ispod=yes

Insertion

EDIT:

OK: because my previous results included a bug, they actually tested ordered insertion, which exhibits very fast behavior for the flat maps.
I left those results further down because they are interesting.
This is the correct test:

[charts: random insertion results]

I have checked the implementation: there is no deferred sort implemented in the flat maps here. Each insertion sorts on the fly (a lower_bound plus shifting the tail, i.e. O(N) per insertion into a sorted vector), so this benchmark exhibits the following asymptotic tendencies for N insertions:

map : N * log(N)
hashmaps : amortized N
vector and flatmaps : N * N

Warning: hereafter, the test for std::map and both flat_maps is buggy and actually tests ordered insertion:
random insert of 100 elements without reservation

We can see that ordered insertion results in back pushing and is extremely fast. However, from non-charted results of my benchmark, I can also say that this is nowhere near the optimality of raw back-insertion into a vector, which takes 3 million cycles; we observe 4.8M here for boost (160% of the optimum).

random insert of 10000 elements without reservation

Random search of 3 elements (clocks renormalized to 1)

in size = 100

rand search within container of 100 elements

in size = 10000

rand search within container of 10000 elements

Iteration

over size 100 (only MediumPod type)

Iteration over 100 medium pods

over size 10000 (only MediumPod type)

Iteration over 10000 medium pods

best regards

via: http://www.sudops.com/linux-kernel-tcp-ip-sysctl-optimize.html

There are many ways to tune TCP/IP and kernel parameters on Linux. Properly configured parameters can greatly improve system performance, and tuning can also target specific scenarios, such as an excessive number of TIME_WAIT sockets, DDoS attacks, and so on.
The configuration below goes in sysctl.conf and can be applied with sysctl -p. Some default values and explanations are included (collected and translated from the web); it took some effort, so please keep the link when reposting, thank you.
These parameters are for reference only; the exact values still need finer adjustment according to the actual situation, such as machine capacity and application scenario.

net.core.netdev_max_backlog = 400000
#Maximum number of packets allowed to queue when the network device receives packets faster than the kernel can process them.

net.core.optmem_max = 10000000
#Maximum buffer size allowed per socket.

net.core.rmem_default = 10000000
#Default size of the receive socket buffer, in bytes.

net.core.rmem_max = 10000000
#Maximum size of the receive socket buffer, in bytes.

net.core.somaxconn = 100000
#Linux kernel parameter: upper limit of the backlog (listen queue) of a listening socket.

net.core.wmem_default = 11059200
#Default send window size; for a larger BDP this should be larger as well.

net.core.wmem_max = 11059200
#Maximum send window size; for a larger BDP this should be larger as well.

net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
#Strict mode: 1 (recommended)
#Loose mode: 0

net.ipv4.tcp_congestion_control = bic
#The usual recommended setting is htcp.

net.ipv4.tcp_window_scaling = 0
#Disable tcp_window_scaling here.
#(This option enables window scaling as defined in RFC 1323; it must be enabled to support windows larger than 64 KB.)

net.ipv4.tcp_ecn = 0
#Turn off TCP Explicit Congestion Notification (tcp_ecn).

net.ipv4.tcp_sack = 1
#Enable tcp_sack.
#Selective Acknowledgment (SACK)
#improves performance by selectively acknowledging packets received out of order (so the sender only retransmits the missing segments);
#this option should be enabled (for WAN communication), but it increases CPU usage.

net.ipv4.tcp_max_tw_buckets = 10000
#Maximum number of TIME_WAIT sockets the system keeps at the same time.

net.ipv4.tcp_max_syn_backlog = 8192
#Length of the SYN queue; the default is 1024, raised to 8192 here to hold more connections waiting to be established.

net.ipv4.tcp_syncookies = 1
#Enable SYN cookies: when the SYN queue overflows, cookies are used to handle connections, which protects against small-scale SYN floods. Default is 0 (off).

net.ipv4.tcp_timestamps = 1
#Enable TCP timestamps.
#Enables RTT calculation in a way more accurate than retransmission timeouts (see RFC 1323); should be enabled for better performance.

net.ipv4.tcp_tw_reuse = 1
#Enable reuse: allow TIME_WAIT sockets to be reused for new TCP connections. Default is 0 (off).

net.ipv4.tcp_tw_recycle = 1
#Enable fast recycling of TIME_WAIT sockets in TCP connections. Default is 0 (off).

net.ipv4.tcp_fin_timeout = 10
#If the socket is closed by the local end, this determines how long it stays in the FIN-WAIT-2 state.

net.ipv4.tcp_keepalive_time = 1800
#How often TCP sends keepalive messages when keepalive is enabled. The default is 2 hours, changed to 30 minutes here.

net.ipv4.tcp_keepalive_probes = 3
#Number of keepalive probes sent if the peer does not respond.

net.ipv4.tcp_keepalive_intvl = 15
#Interval between keepalive probes.

net.ipv4.tcp_mem
#Determines how the TCP stack reflects memory usage; each value is in memory pages (usually 4 KB).
#The first value is the lower bound of memory usage.
#The second value is the level at which memory-pressure mode starts to apply pressure on buffer usage.
#The third value is the memory ceiling; at this level packets can be dropped to reduce memory usage. Increase these values for a larger BDP (but remember the unit is memory pages, not bytes).

net.ipv4.tcp_rmem
#Similar to tcp_wmem, except these are the values used for the receive buffers in auto-tuning.

net.ipv4.tcp_wmem = 30000000 30000000 30000000
#Defines the per-socket memory used for auto-tuning.
#The first value is the minimum number of bytes allocated for the socket's send buffer.
#The second value is the default (overridden by wmem_default), up to which the buffer can grow when the system is not under heavy load.
#The third value is the maximum number of bytes of send buffer space (overridden by wmem_max).

net.ipv4.ip_local_port_range = 1024 65000
#Port range used for outgoing connections. The default is rather small (32768 to 61000), changed to 1024-65000 here.

net.ipv4.netfilter.ip_conntrack_max=204800
#Limits the maximum number of TCP connections the system tracks.

net.ipv4.tcp_slow_start_after_idle = 0
#Disable TCP slow start after idle, i.e. re-initializing the congestion window after the connection has been idle for a while.

net.ipv4.route.gc_timeout = 100
#Route cache refresh frequency: how long after one route fails before switching to another route. Default is 300.

net.ipv4.tcp_syn_retries = 1
#Number of SYN packets sent before the kernel gives up establishing a connection.

net.ipv4.icmp_echo_ignore_broadcasts = 1
#Prevents amplification attacks.

net.ipv4.icmp_ignore_bogus_error_responses = 1
#Enables protection against bogus ICMP error messages.

net.inet.udp.checksum=1
#Protects against attacks using malformed UDP packets.

net.ipv4.conf.default.accept_source_route = 0
#Whether to accept IP packets containing source-route information. Boolean: 1 accepts, 0 rejects.
#The default is 1 on a Linux host acting as a gateway and 0 on an ordinary Linux host.
#From a security standpoint it is recommended to disable this feature.

NFrame is a server framework that a friend and I wrote in our spare time. Its core ideas are modularity, plugin-based design, layered architecture, and event-driven programming. We have now started using it for game projects; a Unity3D-based game project will appear soon, and you will be able to take a look later.

The code is written in C++. It already supports the Lua scripting language, and C# and Python will be supported later.

The project is open source on GitHub.
NFrame open-source server repository: NFrame
If you find it useful, please click star/watch; many thanks.

Here is a brief introduction:

Recently, talked into it by a colleague, I have been planning to immigrate to New Zealand as a skilled migrant, so I'm preparing to study English for the IELTS. New Zealand's minimum requirement is an overall IELTS score of 6.5. After looking at the vocabulary lists and the Cambridge IELTS practice tests, I found it is still quite difficult.

Chinese people generally learn English to a fairly low level; at school we just went through the motions, learning only basic reading and writing. As for listening and speaking, the less said the better: it's all tears, my spoken English is a mess, and after watching British and American shows and movies for so many years, I find the gap is still huge.

I can only manage the most basic greetings, goodbyes, and so on; nothing like the thrill of rattling off English the way they do in American shows.

I also saw that applying for a UK visa requires some special procedures, as follows:

New rules on translated documents for UK visa applications

Since 2009, the British Embassy has had the following requirements for translated documents in visa applications:

Applicants must provide English translations of all original documents, and applicants may not translate their own application documents;

When submitting translations, the following must also be provided: a. the translator's full name; b. the translator's employer; c. the employer's full address and contact information; d. proof of the translator's qualifications; e. certification of the translation's accuracy; f. the translator's signature / the translation company's stamp; g. the date of translation.

This is where a translation company comes in. I found a Shanghai translation company online; the quotes looked reasonable, so if I ever want to go to the UK, I can have them do the translation.