Linux Assemblycollection of fast libraries

LinAsm benchmarks

performance.png

LinAsm benchmarks were made to demonstrate how fast the libraries are. Benchmarks source and latest LinAsm release may be downloaded by the links at the bottom of this page. You may compile and run these tests on your own PC to get personal results and find is there any space for speed improvement for your platform and the code.

Following tests use vector extensions of x86 processors from first SSE version to SSE 4.2 instructions set. They do not use AVX extension for now. I will try to fix this and make AVX version in future. But I cannot provide any deadline for this.

Benchmarks compare LinAsm functions with scalar code and analogous functions from GNU libc and C++ STL. Not all LinAsm functions have analogs in C library, because the main goal of this project is not to replace C library or STL, but reimplement some widely used algorithms from different libraries and improve their speed. This is very critical for engineering calculations, data analysis and multimedia software.

Testing platform

All tests were made for single thread (except tests for ADTs, which have single and concurrent threads benchmarks) on Intel Core i5 CPU (2 physical cores with hyper-threading technology) using a laptop with 4Gb of RAM, running 64-bit Arch Linux live CD. No other tasks were running during the tests. All functions in LinAsm libraries are thread safe, except the abstract data types (which use access predicates to provide thread safety). So you may expect linear speed improvement for multithreading programs with no memory conflicts. CPU specification of the laptop is provided below. Its peak performance is around 6 GFlops per each core.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 37
model name      : Intel(R) Core(TM) i5 CPU       M 480  @ 2.67GHz
stepping        : 5
microcode       : 0x2
cpu MHz         : 2667.000
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
                  pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64
                  monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt lahf_lm ida arat dtherm tpr_shadow vnmi
                  flexpriority ept vpid
bogomips        : 5321.30
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

All benchmarks were compiled by GCC (version 5.3.0) with following optimization flags:

-std=gnu++11 -O2 -march=native -mtune=native -fomit-frame-pointer -fno-builtin -fno-inline

Numbers conversion

Numbers conversion library is the finite-state automaton that extracts number values from string, convert them to binary representation, and check numbers for overflow and underflow conditions. This is common task for importing data from CSV and TSV files.

For this test were generated 16,384 random numbers. They were converted to string format and then were extracted 1000 times from strings to get precise results. As you may see in following table, "sscanf" functions are very slow because they should recognize number format and then call associated core function to extract number value. "strtoul" and "strtod" work faster, but they are also not so very fast as assembly written variants, which can be much more effective for floating-point numbers.

FunctionBenchmark valueFunctionBenchmark value
Octal int numbers conversionDecimal int numbers conversion
sscanf3,737,705 nums/sstrtoul7,091,779 nums/s
LinAsm31,240,336 nums/sLinAsm33,405,495 nums/s
Accelerationx8.36Accelerationx4.71
Octal int numbers conversionHexadecimal flt numbers conversion
strtoul12,371,567 nums/ssscanf2,055,056 nums/s
LinAsm31,584,624 nums/sLinAsm18,666,089 nums/s
Accelerationx2.55Accelerationx9.08
Hexadecimal int numbers conversionHexadecimal flt numbers conversion
sscanf3,398,960 nums/sstrtod4,266,261 nums/s
LinAsm20,195,693 nums/sLinAsm18,725,130 nums/s
Accelerationx5.94Accelerationx4.39
Hexadecimal int numbers conversionDecimal flt numbers conversion
strtoul8,044,625 nums/ssscanf2,953,008 nums/s
LinAsm20,199,339 nums/sLinAsm44,920,448 nums/s
Accelerationx2.51Accelerationx15.21
Decimal int numbers conversionDecimal flt numbers conversion
sscanf3,155,849 nums/sstrtod5,935,734 nums/s
LinAsm33,511,791 nums/sLinAsm45,040,590 nums/s
Accelerationx10.62Accelerationx7.59
FunctionBenchmark valueFunctionBenchmark value

Time conversion

Time conversion is not most used algorithm in opposite to other functions described in these benchmarks. When I was processing time series, I found that lot of time were spent in time stamp extraction procedure. It took too much time, so I decide to improve such procedure. I developed time conversion library, which provides such functional to time and date values and do it very fast.

Following benchmark generates 131,072 time stamps and then converts them 100 times from Unix time format to Gregorian time and date format and vice versa. LinAsm functions beat standard lib C time functions in many times.

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Unix time to UTC Gregorian date conversionUnix time to local Gregorian date conversionGregorian date to Unix time conversion
gmtime16,662,701 calls/slocaltime1,070,848 calls/smktime1,004,177 calls/s
LinAsm28,851,300 calls/sLinAsm15,235,056 calls/sLinAsm66,723,597 calls/s
Accelerationx1.73Accelerationx14.23Accelerationx66.45
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Math library

Math library is excellent place for optimizations, especially if you work with data analysis algorithms and create models of real processes. Currently the Math library implements only basic math functions, I used for my projects. But I have plans to extend its functionality and add more functions to it.

This benchmark generates 131,072 random numbers and use them to calculate standard math functions, operating with these values as parameters. Each test runs in 1,000 round to get precise time, each function takes for calculation. Math functions guarantee precision loss no more than 2 less significant bits. It is enough for multimedia and data processing.

Scalar functions

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Absolute valueNatural logarithmHyperbolic ArcSine
GNU libc320,583,799 calls/sGNU libc27,791,520 calls/sGNU libc16,468,853 calls/s
LinAsm319,642,808 calls/sLinAsm45,552,691 calls/sLinAsm19,853,076 calls/s
Accelerationx1.00Accelerationx1.64Accelerationx1.21
Minimum valueNatural logarithm of value plus 1Hyperbolic ArcCosine
GNU libc288,547,315 calls/sGNU libc40,535,751 calls/sGNU libc17,313,269 calls/s
LinAsm357,292,558 calls/sLinAsm44,171,955 calls/sLinAsm26,042,793 calls/s
Accelerationx1.24Accelerationx1.09Accelerationx1.50
Maximum valueHypotenuseHyperbolic ArcTangent
GNU libc287,757,970 calls/sGNU libc48,949,533 calls/sGNU libc24,095,564 calls/s
LinAsm359,123,015 calls/sLinAsm91,293,740 calls/sLinAsm27,917,050 calls/s
Accelerationx1.25Accelerationx1.87Accelerationx1.16
Minimum absolute valueSineRound down (floor)
GNU libc114,655,067 calls/sGNU libc22,022,897 calls/sGNU libc322,160,945 calls/s
LinAsm317,098,355 calls/sLinAsm32,057,435 calls/sLinAsm320,740,804 calls/s
Accelerationx2.77Accelerationx1.46Accelerationx1.00
Maximum absolute valueCosineRound up (ceil)
GNU libc114,862,062 calls/sGNU libc21,873,853 calls/sGNU libc321,413,014 calls/s
LinAsm317,156,505 calls/sLinAsm31,470,238 calls/sLinAsm290,366,772 calls/s
Accelerationx2.76Accelerationx1.44Accelerationx0.90
Square rootSine and cosineRound to nearest even integer
GNU libc103,597,540 calls/sGNU libc14,070,492 calls/sGNU libc321,748,469 calls/s
LinAsm149,873,082 calls/sLinAsm26,016,987 calls/sLinAsm321,621,637 calls/s
Accelerationx1.45Accelerationx1.85Accelerationx1.00
Scale by power of 2TangentRound to nearest integer
GNU libc119,157,634 calls/sGNU libc16,292,607 calls/sGNU libc246,542,534 calls/s
LinAsm287,824,038 calls/sLinAsm25,387,566 calls/sLinAsm287,037,558 calls/s
Accelerationx2.42Accelerationx1.56Accelerationx1.16
Power of 2ArcSineTruncate to integer
GNU libc49,415,264 calls/sGNU libc35,767,892 calls/sGNU libc253,612,955 calls/s
LinAsm70,695,737 calls/sLinAsm40,013,384 calls/sLinAsm321,065,964 calls/s
Accelerationx1.43Accelerationx1.12Accelerationx1.27
Power of 10ArcCosineCheck for normal value
GNU libc17,334,208 calls/sGNU libc36,330,003 calls/sGNU libc290,038,870 calls/s
LinAsm56,934,765 calls/sLinAsm37,917,505 calls/sLinAsm318,984,655 calls/s
Accelerationx3.28Accelerationx1.04Accelerationx1.10
Power of EArcTangentCheck for finite value
GNU libc36,930,198 calls/sGNU libc28,743,095 calls/sGNU libc473,825,840 calls/s
LinAsm56,865,350 calls/sLinAsm47,107,414 calls/sLinAsm318,069,504 calls/s
Accelerationx1.54Accelerationx1.64Accelerationx0.67
Power of E minus 1ArcTangent2Check for infinite value
GNU libc40,208,793 calls/sGNU libc16,759,420 calls/sGNU libc475,281,035 calls/s
LinAsm57,936,776 calls/sLinAsm32,529,029 calls/sLinAsm357,350,738 calls/s
Accelerationx1.44Accelerationx1.94Accelerationx0.75
PowerHyperbolic sineCheck for NaN value
GNU libc17,057,306 calls/sGNU libc23,814,110 calls/sGNU libc478,990,996 calls/s
LinAsm34,880,151 calls/sLinAsm38,867,188 calls/sLinAsm320,172,910 calls/s
Accelerationx2.04Accelerationx1.63Accelerationx0.67
Logarithm to base 2Hyperbolic cosine
GNU libc48,285,716 calls/sGNU libc33,884,353 calls/s
LinAsm45,554,278 calls/sLinAsm38,644,402 calls/s
Accelerationx0.94Accelerationx1.14
Logarithm to base 10Hyperbolic tangent
GNU libc23,791,664 calls/sGNU libc25,536,869 calls/s
LinAsm45,552,268 calls/sLinAsm39,367,433 calls/s
Accelerationx1.91Accelerationx1.54
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Vectorized functions (single precision)

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Absolute valueLogarithm to base 10Hyperbolic sine
GNU libc324,308,110 calls/sGNU libc42,342,596 calls/sGNU libc27,369,086 calls/s
LinAsm1,295,451,434 calls/sLinAsm228,321,086 calls/sLinAsm198,967,638 calls/s
Accelerationx3.99Accelerationx5.39Accelerationx7.27
Minimum valueNatural logarithmHyperbolic cosine
GNU libc265,006,842 calls/sGNU libc62,844,510 calls/sGNU libc56,774,535 calls/s
LinAsm1,165,630,415 calls/sLinAsm228,444,821 calls/sLinAsm202,957,564 calls/s
Accelerationx4.40Accelerationx3.64Accelerationx3.57
Maximum valueNatural logarithm of value plus 1Hyperbolic tangent
GNU libc265,314,531 calls/sGNU libc51,105,329 calls/sGNU libc29,514,099 calls/s
LinAsm1,165,481,765 calls/sLinAsm237,639,816 calls/sLinAsm202,698,417 calls/s
Accelerationx4.39Accelerationx4.65Accelerationx6.87
Minimum absolute valueHypotenuseHyperbolic ArcSine
GNU libc100,371,575 calls/sGNU libc91,315,223 calls/sGNU libc30,934,969 calls/s
LinAsm1,141,266,159 calls/sLinAsm595,154,150 calls/sLinAsm94,050,661 calls/s
Accelerationx11.37Accelerationx6.52Accelerationx3.04
Maximum absolute valueSineHyperbolic ArcCosine
GNU libc98,612,418 calls/sGNU libc74,064,899 calls/sGNU libc34,980,431 calls/s
LinAsm1,131,885,254 calls/sLinAsm42,163,212 calls/sLinAsm113,234,823 calls/s
Accelerationx11.48Accelerationx0.57Accelerationx3.24
Square rootCosineHyperbolic ArcTangent
GNU libc98,377,251 calls/sGNU libc73,978,695 calls/sGNU libc30,142,314 calls/s
LinAsm674,817,836 calls/sLinAsm42,213,056 calls/sLinAsm159,531,687 calls/s
Accelerationx6.86Accelerationx0.57Accelerationx5.29
Power of 2Sine and cosineRound down (floor)
GNU libc52,858,140 calls/sGNU libc65,247,241 calls/sGNU libc321,852,478 calls/s
LinAsm352,005,180 calls/sLinAsm41,650,174 calls/sLinAsm1,281,182,366 calls/s
Accelerationx6.66Accelerationx0.64Accelerationx3.98
Power of 10TangentRound up (ceil)
GNU libc34,224,983 calls/sGNU libc29,651,034 calls/sGNU libc322,019,933 calls/s
LinAsm262,794,537 calls/sLinAsm41,811,619 calls/sLinAsm1,277,422,152 calls/s
Accelerationx7.68Accelerationx1.41Accelerationx3.97
Power of EArcSineRound to nearest even integer
GNU libc76,305,289 calls/sGNU libc70,338,925 calls/sGNU libc321,658,149 calls/s
LinAsm263,001,376 calls/sLinAsm166,928,615 calls/sLinAsm1,275,920,059 calls/s
Accelerationx3.45Accelerationx2.37Accelerationx3.97
Power of E minus 1ArcCosineRound to nearest integer
GNU libc41,108,198 calls/sGNU libc65,136,500 calls/sGNU libc191,374,697 calls/s
LinAsm267,795,355 calls/sLinAsm169,324,706 calls/sLinAsm1,146,311,871 calls/s
Accelerationx6.51Accelerationx2.60Accelerationx5.99
PowerArcTangentTruncate to integer
GNU libc9,932,465 calls/sGNU libc55,373,646 calls/sGNU libc207,982,695 calls/s
LinAsm33,038,060 calls/sLinAsm262,521,262 calls/sLinAsm1,275,589,500 calls/s
Accelerationx3.33Accelerationx4.74Accelerationx6.13
Logarithm to base 2ArcTangent2
GNU libc57,154,090 calls/sGNU libc24,149,571 calls/s
LinAsm228,205,961 calls/sLinAsm189,508,590 calls/s
Accelerationx3.99Accelerationx7.85
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Vectorized functions (double precision)

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Absolute valueLogarithm to base 10Hyperbolic sine
GNU libc359,388,560 calls/sGNU libc23,953,735 calls/sGNU libc23,796,774 calls/s
LinAsm697,567,363 calls/sLinAsm71,724,842 calls/sLinAsm74,368,091 calls/s
Accelerationx1.94Accelerationx2.99Accelerationx3.13
Minimum valueNatural logarithmHyperbolic cosine
GNU libc261,926,679 calls/sGNU libc27,622,351 calls/sGNU libc33,538,382 calls/s
LinAsm628,559,020 calls/sLinAsm70,711,205 calls/sLinAsm74,089,859 calls/s
Accelerationx2.40Accelerationx2.56Accelerationx2.21
Maximum valueNatural logarithm of value plus 1Hyperbolic tangent
GNU libc262,361,131 calls/sGNU libc40,496,201 calls/sGNU libc25,308,355 calls/s
LinAsm570,279,381 calls/sLinAsm73,550,247 calls/sLinAsm78,192,188 calls/s
Accelerationx2.17Accelerationx1.82Accelerationx3.09
Minimum absolute valueHypotenuseHyperbolic ArcSine
GNU libc106,739,601 calls/sGNU libc46,784,021 calls/sGNU libc16,465,227 calls/s
LinAsm568,036,107 calls/sLinAsm176,954,234 calls/sLinAsm31,500,673 calls/s
Accelerationx5.32Accelerationx3.78Accelerationx1.91
Maximum absolute valueSineHyperbolic ArcCosine
GNU libc106,794,055 calls/sGNU libc22,453,277 calls/sGNU libc17,414,551 calls/s
LinAsm567,515,968 calls/sLinAsm32,942,922 calls/sLinAsm39,087,076 calls/s
Accelerationx5.31Accelerationx1.47Accelerationx2.24
Square rootCosineHyperbolic ArcTangent
GNU libc105,112,995 calls/sGNU libc22,226,302 calls/sGNU libc24,041,115 calls/s
LinAsm226,788,703 calls/sLinAsm33,508,911 calls/sLinAsm52,987,191 calls/s
Accelerationx2.16Accelerationx1.51Accelerationx2.20
Power of 2Sine and cosineRound down (floor)
GNU libc49,244,566 calls/sGNU libc14,274,131 calls/sGNU libc360,992,071 calls/s
LinAsm124,966,988 calls/sLinAsm31,691,947 calls/sLinAsm713,763,090 calls/s
Accelerationx2.54Accelerationx2.22Accelerationx1.98
Power of 10TangentRound up (ceil)
GNU libc17,057,982 calls/sGNU libc16,601,831 calls/sGNU libc360,542,288 calls/s
LinAsm100,099,548 calls/sLinAsm31,780,370 calls/sLinAsm715,045,670 calls/s
Accelerationx5.87Accelerationx1.91Accelerationx1.98
Power of EArcSineRound to nearest even integer
GNU libc37,026,020 calls/sGNU libc35,287,695 calls/sGNU libc360,795,385 calls/s
LinAsm102,103,679 calls/sLinAsm60,062,778 calls/sLinAsm719,332,379 calls/s
Accelerationx2.76Accelerationx1.70Accelerationx1.99
Power of E minus 1ArcCosineRound to nearest integer
GNU libc35,559,414 calls/sGNU libc36,338,212 calls/sGNU libc186,619,046 calls/s
LinAsm102,977,477 calls/sLinAsm61,322,835 calls/sLinAsm630,066,152 calls/s
Accelerationx2.90Accelerationx1.69Accelerationx3.38
PowerArcTangentTruncate to integer
GNU libc12,458,586 calls/sGNU libc28,707,818 calls/sGNU libc217,375,092 calls/s
LinAsm25,872,407 calls/sLinAsm86,893,719 calls/sLinAsm711,336,124 calls/s
Accelerationx2.08Accelerationx3.03Accelerationx3.27
Logarithm to base 2ArcTangent2
GNU libc48,030,011 calls/sGNU libc16,523,095 calls/s
LinAsm71,724,235 calls/sLinAsm63,311,670 calls/s
Accelerationx1.49Accelerationx3.83
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Array library

Array library can improve trivial array operations and replace scalar code with vector code, written in assembly language, which uses SIMD extension of modern processors. To run this code your CPU should support SSE 4.2 extension (almost all modern processors were bought during last 5 years).

Array functions intensively use memory prefetch commands. It improves processing of really big portions of data, because of efficient use of cache memory. The main bottleneck of modern computers is memory subsystem, which is much slower than CPU, and require non trivial code optimization to get high performance. You also may see that single precision numbers take 2 times less CPU cycles to process. It is because of memory bandwidth. Single precision data are 2 times shorter than double precision and take 2 times less memory space.

Next benchmark generates 131,072 elements wide random arrays, and tests scalar and vector code for these arrays 10,000 times to get precise time values for scalar and vector code. Sort functions are ran only ones to sort random arrays in ascending sort order. To measure performance of integer operations was generated random array of bytes (uint8_t) and then scalar and vector operations were invoked.

Integer operations

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
InitializationScalar bitwise XORFind key
Classic scalar code26,247,096,533 nums/sClassic scalar code1,458,059,974 nums/sClassic scalar code22,368,863,536 nums/s
LinAsm vector code25,523,404,363 nums/sLinAsm vector code22,431,021,895 nums/sLinAsm vector code16,624,115,820 nums/s
Accelerationx0.97Accelerationx15.38Accelerationx0.74
Copying elementsVector bitwise XORCount keys
Classic scalar code12,446,514,964 nums/sClassic scalar code968,334,933 nums/sClassic scalar code968,235,259 nums/s
LinAsm vector code15,538,309,629 nums/sLinAsm vector code14,115,773,423 nums/sLinAsm vector code13,217,213,210 nums/s
Accelerationx1.25Accelerationx14.58Accelerationx13.65
Moving elementsScalar additionReplace keys
Classic scalar code15,133,066,835 nums/sClassic scalar code1,458,167,989 nums/sClassic scalar code1,365,258,630 nums/s
LinAsm vector code15,462,039,920 nums/sLinAsm vector code22,630,209,340 nums/sLinAsm vector code7,337,564,794 nums/s
Accelerationx1.02Accelerationx15.52Accelerationx5.37
Bitwise NOTVector additionReversing elements order
Classic scalar code1,452,957,585 nums/sClassic scalar code968,668,665 nums/sClassic scalar code1,937,538,396 nums/s
LinAsm vector code20,958,282,749 nums/sLinAsm vector code14,104,830,273 nums/sLinAsm vector code21,802,399,684 nums/s
Accelerationx14.42Accelerationx14.56Accelerationx11.25
Scalar bitwise ANDScalar subtractionQuick sort
Classic scalar code1,458,618,396 nums/sClassic scalar code1,458,334,739 nums/sClassic scalar code5,459,417 nums/s
LinAsm vector code22,256,960,611 nums/sLinAsm vector code21,899,201,354 nums/sLinAsm vector code23,145,501 nums/s
Accelerationx15.26Accelerationx15.02Accelerationx4.24
Vector bitwise ANDVector subtractionRadix sort
Classic scalar code968,523,913 nums/sClassic scalar code965,818,843 nums/sClassic scalar code5,486,116 nums/s
LinAsm vector code14,114,702,070 nums/sLinAsm vector code14,148,484,301 nums/sLinAsm vector code304,006,272 nums/s
Accelerationx14.57Accelerationx14.65Accelerationx55.41
Scalar bitwise ORMinimum valueCompare
Classic scalar code1,459,029,787 nums/sClassic scalar code972,005,815 nums/sClassic scalar code14,908,584,089 nums/s
LinAsm vector code22,673,501,492 nums/sLinAsm vector code25,873,112,942 nums/sLinAsm vector code12,473,248,041 nums/s
Accelerationx15.54Accelerationx26.62Accelerationx0.84
Vector bitwise ORMaximum value
Classic scalar code968,519,403 nums/sClassic scalar code972,691,812 nums/s
LinAsm vector code14,120,594,696 nums/sLinAsm vector code25,832,753,615 nums/s
Accelerationx14.58Accelerationx26.56
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Single precision operations

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
InitializationSquare rootSum of elements
Classic scalar code1,449,448,350 nums/sClassic scalar code96,666,381 nums/sClassic scalar code973,652,319 nums/s
LinAsm vector code5,314,771,813 nums/sLinAsm vector code673,151,614 nums/sLinAsm vector code5,721,688,843 nums/s
Accelerationx3.67Accelerationx6.96Accelerationx5.88
Copying elementsReversing elements orderSum of squared values
Classic scalar code2,583,627,939 nums/sClassic scalar code1,880,596,186 nums/sClassic scalar code962,859,536 nums/s
LinAsm vector code2,995,891,876 nums/sLinAsm vector code3,905,499,219 nums/sLinAsm vector code4,493,268,963 nums/s
Accelerationx1.16Accelerationx2.08Accelerationx4.67
Moving elementsMinimum valueSum of absolute values
Classic scalar code3,085,921,651 nums/sClassic scalar code729,849,891 nums/sClassic scalar code289,142,233 nums/s
LinAsm vector code3,076,979,620 nums/sLinAsm vector code4,910,519,391 nums/sLinAsm vector code5,069,867,899 nums/s
Accelerationx1.00Accelerationx6.73Accelerationx17.53
AdditionMaximum valueConvolution
Classic scalar code957,867,472 nums/sClassic scalar code730,042,489 nums/sClassic scalar code962,786,062 nums/s
LinAsm vector code2,806,874,647 nums/sLinAsm vector code4,966,436,955 nums/sLinAsm vector code3,018,673,549 nums/s
Accelerationx2.93Accelerationx6.80Accelerationx3.14
SubtractionMinimum absolute valueSum of squared differences
Classic scalar code953,232,282 nums/sClassic scalar code264,656,716 nums/sClassic scalar code957,652,131 nums/s
LinAsm vector code2,803,691,325 nums/sLinAsm vector code3,472,789,571 nums/sLinAsm vector code2,962,268,467 nums/s
Accelerationx2.94Accelerationx13.12Accelerationx3.09
MultiplicationMaximum absolute valueSum of absolute differences
Classic scalar code956,900,317 nums/sClassic scalar code264,173,973 nums/sClassic scalar code288,280,274 nums/s
LinAsm vector code2,795,946,245 nums/sLinAsm vector code3,468,637,424 nums/sLinAsm vector code2,979,043,493 nums/s
Accelerationx2.92Accelerationx13.13Accelerationx10.33
DivisionRound down (floor)Quick sort
Classic scalar code208,187,694 nums/sClassic scalar code320,398,717 nums/sClassic scalar code6,958,462 nums/s
LinAsm vector code830,988,071 nums/sLinAsm vector code4,600,269,931 nums/sLinAsm vector code13,721,876 nums/s
Accelerationx3.99Accelerationx14.36Accelerationx1.97
Negative valueRound up (ceil)Radix sort
Classic scalar code963,286,249 nums/sClassic scalar code320,696,628 nums/sClassic scalar code7,006,796 nums/s
LinAsm vector code4,644,524,991 nums/sLinAsm vector code4,602,883,966 nums/sLinAsm vector code67,867,415 nums/s
Accelerationx4.82Accelerationx14.35Accelerationx9.69
Absolute valueRound to nearest even integerCheck for infinite values
Classic scalar code319,869,503 nums/sClassic scalar code320,649,957 nums/sClassic scalar code287,970,298 nums/s
LinAsm vector code4,642,667,776 nums/sLinAsm vector code4,596,676,416 nums/sLinAsm vector code3,494,387,871 nums/s
Accelerationx14.51Accelerationx14.34Accelerationx12.13
Sign functionRound away from zeroCheck for NaN values
Classic scalar code186,030,029 nums/sClassic scalar code252,217,392 nums/sClassic scalar code319,586,719 nums/s
LinAsm vector code1,600,463,599 nums/sLinAsm vector code2,696,486,393 nums/sLinAsm vector code3,505,537,468 nums/s
Accelerationx8.60Accelerationx10.69Accelerationx10.97
Squared valueTruncate to integer
Classic scalar code964,552,912 nums/sClassic scalar code277,532,181 nums/s
LinAsm vector code4,597,414,611 nums/sLinAsm vector code4,594,490,375 nums/s
Accelerationx4.77Accelerationx16.55
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Double precision operations

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
InitializationSquare rootSum of elements
Classic scalar code1,453,092,179 nums/sClassic scalar code100,327,992 nums/sClassic scalar code973,468,701 nums/s
LinAsm vector code2,365,891,956 nums/sLinAsm vector code226,840,676 nums/sLinAsm vector code2,872,380,716 nums/s
Accelerationx1.63Accelerationx2.26Accelerationx2.95
Copying elementsReversing elements orderSum of squared values
Classic scalar code797,545,076 nums/sClassic scalar code1,419,314,419 nums/sClassic scalar code962,659,125 nums/s
LinAsm vector code813,777,818 nums/sLinAsm vector code1,772,287,120 nums/sLinAsm vector code2,233,439,817 nums/s
Accelerationx1.02Accelerationx1.25Accelerationx2.32
Moving elementsMinimum valueSum of absolute values
Classic scalar code432,968,797 nums/sClassic scalar code729,150,031 nums/sClassic scalar code287,651,279 nums/s
LinAsm vector code817,853,959 nums/sLinAsm vector code2,223,770,794 nums/sLinAsm vector code2,531,620,705 nums/s
Accelerationx1.89Accelerationx3.05Accelerationx8.80
AdditionMaximum valueConvolution
Classic scalar code654,079,848 nums/sClassic scalar code729,136,347 nums/sClassic scalar code873,583,158 nums/s
LinAsm vector code794,801,582 nums/sLinAsm vector code2,233,193,107 nums/sLinAsm vector code1,318,485,879 nums/s
Accelerationx1.22Accelerationx3.06Accelerationx1.51
SubtractionMinimum absolute valueSum of squared differences
Classic scalar code653,894,985 nums/sClassic scalar code263,060,735 nums/sClassic scalar code861,577,026 nums/s
LinAsm vector code796,559,037 nums/sLinAsm vector code1,735,046,479 nums/sLinAsm vector code1,321,087,255 nums/s
Accelerationx1.22Accelerationx6.60Accelerationx1.53
MultiplicationMaximum absolute valueSum of absolute differences
Classic scalar code652,176,045 nums/sClassic scalar code262,765,035 nums/sClassic scalar code285,274,870 nums/s
LinAsm vector code794,471,750 nums/sLinAsm vector code1,733,836,038 nums/sLinAsm vector code1,317,029,569 nums/s
Accelerationx1.22Accelerationx6.60Accelerationx4.62
DivisionRound down (floor)Quick sort
Classic scalar code131,699,220 nums/sClassic scalar code322,062,155 nums/sClassic scalar code6,486,939 nums/s
LinAsm vector code263,163,353 nums/sLinAsm vector code2,110,355,631 nums/sLinAsm vector code13,645,567 nums/s
Accelerationx2.00Accelerationx6.55Accelerationx2.10
Negative valueRound up (ceil)Radix sort
Classic scalar code955,197,021 nums/sClassic scalar code321,974,921 nums/sClassic scalar code6,457,000 nums/s
LinAsm vector code2,122,434,728 nums/sLinAsm vector code2,103,546,245 nums/sLinAsm vector code25,712,316 nums/s
Accelerationx2.22Accelerationx6.53Accelerationx3.98
Absolute valueRound to nearest even integerCheck for infinite values
Classic scalar code321,395,283 nums/sClassic scalar code322,361,620 nums/sClassic scalar code951,890,346 nums/s
LinAsm vector code2,125,749,456 nums/sLinAsm vector code2,111,207,895 nums/sLinAsm vector code1,748,375,151 nums/s
Accelerationx6.61Accelerationx6.55Accelerationx1.84
Sign functionRound away from zeroCheck for NaN values
Classic scalar code174,778,443 nums/sClassic scalar code250,692,468 nums/sClassic scalar code961,272,404 nums/s
LinAsm vector code797,555,044 nums/sLinAsm vector code1,338,027,530 nums/sLinAsm vector code1,509,675,944 nums/s
Accelerationx4.56Accelerationx5.34Accelerationx1.57
Squared valueTruncate to integer
Classic scalar code942,158,938 nums/sClassic scalar code254,941,349 nums/s
LinAsm vector code2,111,742,458 nums/sLinAsm vector code2,105,281,502 nums/s
Accelerationx2.24Accelerationx8.26
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Statistics library

Statistics library implements basic statistical functions, that process empirical data and return such values as data mean, standard deviation, skewness, kurtosis and robust parameters as median and inter-quartile range. They take regular arrays and process them to compute appropriate functions.

For this benchmark were generated 131,072 elements wide random arrays. The tests were ran 10,000 times to get precise time values. You may see, that the bottleneck for this benchmark is memory bandwidth, because single precision data has almost the same process time (see scalar code) as double precision data. But the vector code took 2 times less CPU cycles for single precision data, because of type size.

Single precision operations

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
MeanVarianceSkewness
Classic scalar code964,665,266 nums/sClassic scalar code959,956,922 nums/sClassic scalar code722,416,463 nums/s
LinAsm vector code5,785,564,798 nums/sLinAsm vector code3,956,627,930 nums/sLinAsm vector code2,998,994,331 nums/s
Accelerationx6.00Accelerationx4.12Accelerationx4.15
MedianStandard deviationKurtosis
Classic scalar code5,907,422 nums/sClassic scalar code960,055,218 nums/sClassic scalar code958,495,295 nums/s
LinAsm vector code565,367,611 nums/sLinAsm vector code3,984,589,193 nums/sLinAsm vector code3,196,202,715 nums/s
Accelerationx95.70Accelerationx4.15Accelerationx3.33
Lower quartileAbsolute deviationCovariance
Classic scalar code5,920,796 nums/sClassic scalar code287,672,263 nums/sClassic scalar code955,035,008 nums/s
LinAsm vector code600,245,462 nums/sLinAsm vector code4,458,969,358 nums/sLinAsm vector code2,975,861,423 nums/s
Accelerationx101.38Accelerationx15.50Accelerationx3.12
Upper quartileInter quartile rangeCorrelation
Classic scalar code5,944,117 nums/sClassic scalar code5,982,819 nums/sClassic scalar code554,712,477 nums/s
LinAsm vector code539,442,005 nums/sLinAsm vector code350,847,326 nums/sLinAsm vector code2,174,243,956 nums/s
Accelerationx90.75Accelerationx58.64Accelerationx3.92
Mid-rangeRange
Classic scalar code721,486,658 nums/sClassic scalar code723,473,804 nums/s
LinAsm vector code3,840,675,145 nums/sLinAsm vector code3,769,070,539 nums/s
Accelerationx5.32Accelerationx5.21
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Double precision operations

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
MeanVarianceSkewness
Classic scalar code964,541,098 nums/sClassic scalar code950,759,341 nums/sClassic scalar code713,286,659 nums/s
LinAsm vector code2,886,192,727 nums/sLinAsm vector code1,885,310,148 nums/sLinAsm vector code1,367,836,971 nums/s
Accelerationx2.99Accelerationx1.98Accelerationx1.92
MedianStandard deviationKurtosis
Classic scalar code6,519,883 nums/sClassic scalar code951,093,533 nums/sClassic scalar code940,796,992 nums/s
LinAsm vector code306,118,541 nums/sLinAsm vector code1,876,669,024 nums/sLinAsm vector code1,619,898,830 nums/s
Accelerationx46.95Accelerationx1.97Accelerationx1.72
Lower quartileAbsolute deviationCovariance
Classic scalar code6,482,688 nums/sClassic scalar code287,994,116 nums/sClassic scalar code824,464,410 nums/s
LinAsm vector code330,074,692 nums/sLinAsm vector code2,438,455,976 nums/sLinAsm vector code1,200,790,940 nums/s
Accelerationx50.92Accelerationx8.47Accelerationx1.46
Upper quartileInter quartile rangeCorrelation
Classic scalar code6,651,860 nums/sClassic scalar code6,581,159 nums/sClassic scalar code511,943,406 nums/s
LinAsm vector code272,086,588 nums/sLinAsm vector code209,122,296 nums/sLinAsm vector code996,020,717 nums/s
Accelerationx40.90Accelerationx31.78Accelerationx1.95
Mid-rangeRange
Classic scalar code715,817,471 nums/sClassic scalar code715,803,692 nums/s
LinAsm vector code1,901,722,547 nums/sLinAsm vector code1,885,016,271 nums/s
Accelerationx2.66Accelerationx2.63
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Object pool library

If you allocate and release a lot of small objects in your algorithm, then the Object Pool is a good idea to replace the standard lib C "malloc" and "free" functions. Object pool allocates predefined amount of objects, which also can be aligned by power of 2, and then returns them when you need a new object, or puts them back in the pool when the object is released. This operation is very fast and do not call heavy memory management functions.

This benchmark creates 16,384 objects of different sizes (from 8 to 256 bytes) and then allocate them and release 1,000 times.

FunctionBenchmark valueFunctionBenchmark value
Allocation of 8 byte blocksAllocation of 64 byte blocks
Libc malloc57,733,384 blocks/sLibc malloc57,770,358 blocks/s
LinAsm alloc258,139,671 blocks/sLinAsm alloc128,817,026 blocks/s
Accelerationx4.47Accelerationx2.23
Releasing of 8 byte blocksReleasing of 64 byte blocks
Libc free61,711,620 blocks/sLibc free61,062,234 blocks/s
LinAsm free251,573,474 blocks/sLinAsm free187,952,006 blocks/s
Accelerationx4.08Accelerationx3.08
Allocation of 16 byte blocksAllocation of 128 byte blocks
Libc malloc58,366,364 blocks/sLibc malloc18,916,890 blocks/s
LinAsm alloc237,857,380 blocks/sLinAsm alloc97,431,883 blocks/s
Accelerationx4.08Accelerationx5.15
Releasing of 16 byte blocksReleasing of 128 byte blocks
Libc free62,101,706 blocks/sLibc free21,270,774 blocks/s
LinAsm free250,377,515 blocks/sLinAsm free130,613,856 blocks/s
Accelerationx4.03Accelerationx6.14
Allocation of 32 byte blocksAllocation of 256 byte blocks
Libc malloc58,015,214 blocks/sLibc malloc10,030,150 blocks/s
LinAsm alloc187,071,330 blocks/sLinAsm alloc33,110,384 blocks/s
Accelerationx3.22Accelerationx3.30
Releasing of 32 byte blocksReleasing of 256 byte blocks
Libc free61,837,841 blocks/sLibc free8,430,772 blocks/s
LinAsm free233,604,504 blocks/sLinAsm free48,075,760 blocks/s
Accelerationx3.78Accelerationx5.70
FunctionBenchmark valueFunctionBenchmark value

Vector ADT

Vector is the first and very simple abstract data type. It is most used STL container that holds elements and may replace standard arrays. LinAsm implements it own vector container that around 2 times faster for most operations than STL vector. It has non STL interface, but much more optimized and cache friendly code. Benchmark just shows the speed (keys per seconds) you may expects for data processing algorithms, which are based on that data type.

For this benchmark were created 1,048,576 random keys in range [0, 131,072), then they were pushed and popped from the vector in single and concurrent threads. All other functions were also invoked in 16 rounds for filled vector in single and concurrent threads (4 threads) to measure the time each operation takes.

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Push (single thread)Find unordered (concurrent thread)Max value (concurrent threads)
STL code29,715,893 keys/sSTL code45,471,847 keys/sSTL code44,273,055 keys/s
LinAsm code63,250,151 keys/sLinAsm code185,315,862 keys/sLinAsm code259,325,075 keys/s
Accelerationx2.13Accelerationx4.08Accelerationx5.86
Push (concurrent threads)Find differences (single thread)IsEqual (single thread)
STL code1,817,788 keys/sSTL code46,601,607 keys/sSTL code157,952,779 keys/s
LinAsm code1,271,922 keys/sLinAsm code204,334,978 keys/sLinAsm code182,894,778 keys/s
Accelerationx0.70Accelerationx4.38Accelerationx1.16
Sort (single thread)Find differences (concurrent threads)IsEqual (concurrent threads)
STL code2,857,597 keys/sSTL code37,444,917 keys/sSTL code60,911,410 keys/s
LinAsm code10,842,321 keys/sLinAsm code54,293,581 keys/sLinAsm code55,241,078 keys/s
Accelerationx3.79Accelerationx1.45Accelerationx0.91
Unique values (single thread)Binary search (single thread)Copy constructor (single thread)
STL code64,988,533 keys/sSTL code2,138,969 keys/sSTL code172,299,388 keys/s
LinAsm code264,948,881 keys/sLinAsm code2,737,777 keys/sLinAsm code88,494,842 keys/s
Accelerationx4.08Accelerationx1.28Accelerationx0.51
Get (single thread)Binary search (concurrent threads)Copy (single thread)
STL code106,187,535 keys/sSTL code1,741,045 keys/sSTL code186,212,964 keys/s
LinAsm code266,331,463 keys/sLinAsm code2,817,737 keys/sLinAsm code125,783,541 keys/s
Accelerationx2.51Accelerationx1.62Accelerationx0.68
Get (concurrent threads)Count (single thread)Move (single thread)
STL code65,293,858 keys/sSTL code44,753,761 keys/sSTL code175,354,957 keys/s
LinAsm code166,497,705 keys/sLinAsm code302,437,873 keys/sLinAsm code130,294,759 keys/s
Accelerationx2.55Accelerationx6.76Accelerationx0.74
Set (single thread)Count (concurrent threads)Merge (single thread)
STL code62,734,020 keys/sSTL code50,242,298 keys/sSTL code18,310,634 keys/s
LinAsm code102,647,798 keys/sLinAsm code246,367,666 keys/sLinAsm code41,598,234 keys/s
Accelerationx1.64Accelerationx4.90Accelerationx2.27
Find (single thread)Min value (single thread)Reverse (single thread)
STL code136,602,613 keys/sSTL code49,583,777 keys/sSTL code97,791,198 keys/s
LinAsm code415,570,580 keys/sLinAsm code314,031,598 keys/sLinAsm code184,070,078 keys/s
Accelerationx3.04Accelerationx6.33Accelerationx1.88
Find (concurrent threads)Min value (concurrent threads)Pop (single thread)
STL code90,086,609 keys/sSTL code45,403,514 keys/sSTL code221,638,135 keys/s
LinAsm code275,584,315 keys/sLinAsm code246,803,652 keys/sLinAsm code323,223,057 keys/s
Accelerationx3.06Accelerationx5.44Accelerationx1.46
Find unordered (single thread)Max value (single thread)Pop (concurrent threads)
STL code49,829,771 keys/sSTL code48,670,234 keys/sSTL code10,820,667 keys/s
LinAsm code377,676,183 keys/sLinAsm code346,499,125 keys/sLinAsm code10,605,514 keys/s
Accelerationx7.58Accelerationx7.12Accelerationx0.98
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Deque ADT

Double ended queue is the second most popular abstract data type. Typically it is implemented as circular buffer or double licked list. LinAsm uses automatically extended circular buffer for deque. It makes it as fast as vector adt and around 2 times faster than deque from STL library. Benchmark creates 1,048,576 random keys with 8 duplicates (average case) in range [0, 131,072), then push them into tail and pop them from head in single and concurrent threads manner. Each deque operation is done 16 times to get precise time measurement. The result table is shown below. You may see how fast (keys per second) each function is.

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Push (single thread)Find differences (concurrent threads)IsEqual (concurrent threads)
STL code34,043,608 keys/sSTL code33,233,777 keys/sSTL code39,850,563 keys/s
LinAsm code60,844,734 keys/sLinAsm code53,357,539 keys/sLinAsm code70,584,555 keys/s
Accelerationx1.79Accelerationx1.61Accelerationx1.77
Push (concurrent threads)Count (single thread)Copy constructor (single thread)
STL code1,835,312 keys/sSTL code67,167,365 keys/sSTL code54,997,119 keys/s
LinAsm code1,255,515 keys/sLinAsm code326,739,399 keys/sLinAsm code89,869,698 keys/s
Accelerationx0.68Accelerationx4.86Accelerationx1.63
Get (single thread)Count (concurrent threads)Copy (single thread)
STL code131,612,681 keys/sSTL code58,774,190 keys/sSTL code114,652,212 keys/s
LinAsm code231,374,949 keys/sLinAsm code264,880,608 keys/sLinAsm code123,030,915 keys/s
Accelerationx1.76Accelerationx4.51Accelerationx1.07
Get (concurrent threads)Min value (single thread)Move (single thread)
STL code70,822,764 keys/sSTL code54,333,054 keys/sSTL code124,217,242 keys/s
LinAsm code110,527,351 keys/sLinAsm code329,076,105 keys/sLinAsm code131,069,919 keys/s
Accelerationx1.56Accelerationx6.06Accelerationx1.06
Set (single thread)Min value (concurrent threads)Reverse (single thread)
STL code61,998,311 keys/sSTL code54,048,115 keys/sSTL code80,735,590 keys/s
LinAsm code101,717,805 keys/sLinAsm code233,531,497 keys/sLinAsm code177,912,187 keys/s
Accelerationx1.64Accelerationx4.32Accelerationx2.20
Find (single thread)Max value (single thread)Pop (single thread)
STL code136,830,816 keys/sSTL code52,242,222 keys/sSTL code50,592,871 keys/s
LinAsm code368,494,833 keys/sLinAsm code332,177,475 keys/sLinAsm code250,850,581 keys/s
Accelerationx2.69Accelerationx6.36Accelerationx4.96
Find (concurrent threads)Max value (concurrent threads)Pop (concurrent threads)
STL code94,349,763 keys/sSTL code48,113,043 keys/sSTL code7,151,952 keys/s
LinAsm code246,576,072 keys/sLinAsm code222,544,423 keys/sLinAsm code8,425,015 keys/s
Accelerationx2.61Accelerationx4.63Accelerationx1.18
Find differences (single thread)IsEqual (single thread)
STL code65,933,808 keys/sSTL code74,385,799 keys/s
LinAsm code169,010,997 keys/sLinAsm code171,553,277 keys/s
Accelerationx2.56Accelerationx2.31
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Heap ADT

MinHeap and MaxHeap (also known as priority_queue in STL) is used when you require prioritization inside your service. Typicaly it is implemented by binary tree, projected onto plain array (vector). STL does not implement container priority_queue, but only the adaptor, which can be based on standard vector or queue. LinAsm uses regular array to store the keys. Instead of classic binary tree, it uses N-ary tree (currently each root node has 4 children, which are less or equal to the root). Child node (cluster of child elements) is aligned by CPU cache line, and all 4 children can be prefetched by one memory reading operation. It improves heap speed in 2 - 3 times in compare with standard array based binary tree.

Benchmark creates 1,048,576 random keys with 8 duplicates (average case) in range [0, 131,072), and put them into the heap in single thread and in concurrent threads manner (see the table below). Then it uses standard vector operations to find, count and check keys into the container. Multiple thread tests were ran in 4 threads on 2 cores CPU with hyper threading, enabled for each core. Concurrent threads were run in 16 rounds, to be sure that threads have worked in the same time. All the tests proved that cache aligned key store algorithm give you significant performance benefit.

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Push (single thread)Find (single thread)Copy constructor (single thread)
STL code8,259,694 keys/sSTL code82,477,976 keys/sSTL code95,221,291 keys/s
LinAsm code35,224,409 keys/sLinAsm code346,445,662 keys/sLinAsm code64,783,887 keys/s
Accelerationx4.26Accelerationx4.20Accelerationx0.68
Push (concurrent threads)Find (concurrent threads)Merge (single thread)
STL code627,915 keys/sSTL code91,439,914 keys/sSTL code973,027 keys/s
LinAsm code1,203,838 keys/sLinAsm code222,634,538 keys/sLinAsm code51,585,673 keys/s
Accelerationx1.92Accelerationx2.43Accelerationx53.02
Get (single thread)Count (single thread)Pop (single thread)
STL code87,441,442 keys/sSTL code54,900,267 keys/sSTL code1,162,532 keys/s
LinAsm code277,155,138 keys/sLinAsm code364,647,128 keys/sLinAsm code3,829,158 keys/s
Accelerationx3.17Accelerationx6.64Accelerationx3.29
Get (concurrent threads)Count (concurrent threads)Pop (concurrent threads)
STL code64,787,259 keys/sSTL code50,692,254 keys/sSTL code650,119 keys/s
LinAsm code197,005,564 keys/sLinAsm code248,504,165 keys/sLinAsm code2,013,109 keys/s
Accelerationx3.04Accelerationx4.90Accelerationx3.10
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

List ADT

List abstract data type uses big nodes, which hold more than one element into the node. Neighbor elements are grouped together to allow very fast access to previous and next elements. This improvement tries to optimize usage of CPU cache memory and accelerates standard list operations. This benchmark creates 1,048,576 of random elements, puts them into the list object and then tests speed of standard list operations. Each test shows how many keys per second were handled by the operation.

Multiple thread tests simultaneously invoke associated function in 4 threads, iterating all elements 16 times per thread, and show cumulative result (keys per second) for all of them. As you may see again, the memory speed is main limiter of all multiple thread algorithms. And show even less performance than single thread read access to data. It is because of CPU cache. All processes is trying to use CPU cache exclusively, and push out data of another threads, which downgrade the result. This can be fixed with computation intensive code among read operations.

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Insert (single thread)Find unordered (concurrent thread)IsEqual (concurrent threads)
STL code12,198,874 keys/sSTL code6,989,241 keys/sSTL code5,810,324 keys/s
LinAsm code40,128,990 keys/sLinAsm code73,150,735 keys/sLinAsm code27,199,091 keys/s
Accelerationx3.29Accelerationx10.47Accelerationx4.68
Insert (concurrent threads)Find differences (single thread)Copy constructor (single thread)
STL code908,216 keys/sSTL code7,037,253 keys/sSTL code5,234,321 keys/s
LinAsm code1,161,236 keys/sLinAsm code77,608,462 keys/sLinAsm code20,094,408 keys/s
Accelerationx1.28Accelerationx11.03Accelerationx3.84
Sort (single thread)Find differences (concurrent threads)Copy (single thread)
STL code1,087,356 keys/sSTL code5,938,818 keys/sSTL code6,786,415 keys/s
LinAsm code5,616,728 keys/sLinAsm code27,786,449 keys/sLinAsm code32,924,796 keys/s
Accelerationx5.17Accelerationx4.68Accelerationx4.85
Unique values (single thread)Count (single thread)Move (single thread)
STL code7,046,980 keys/sSTL code7,182,162 keys/sSTL code41,986,045 keys/s
LinAsm code86,307,595 keys/sLinAsm code89,658,058 keys/sLinAsm code28,848,947 keys/s
Accelerationx12.25Accelerationx12.48Accelerationx0.69
Get (single thread)Count (concurrent threads)Merge (single thread)
STL code7,371,889 keys/sSTL code7,310,894 keys/sSTL code6,532,960 keys/s
LinAsm code50,079,075 keys/sLinAsm code67,901,225 keys/sLinAsm code24,095,987 keys/s
Accelerationx6.79Accelerationx9.29Accelerationx3.69
Get (concurrent threads)Min value (single thread)Reverse (single thread)
STL code7,366,857 keys/sSTL code7,255,633 keys/sSTL code6,686,768 keys/s
LinAsm code34,691,303 keys/sLinAsm code80,566,254 keys/sLinAsm code34,113,523 keys/s
Accelerationx4.71Accelerationx11.10Accelerationx5.10
Set (single thread)Min value (concurrent threads)Remove (single thread)
STL code7,199,607 keys/sSTL code6,986,167 keys/sSTL code5,454,134 keys/s
LinAsm code35,812,405 keys/sLinAsm code59,496,855 keys/sLinAsm code17,444,377 keys/s
Accelerationx4.97Accelerationx8.52Accelerationx3.20
Find (single thread)Max value (single thread)Remove (concurrent threads)
STL code7,210,836 keys/sSTL code7,221,271 keys/sSTL code2,571,485 keys/s
LinAsm code110,623,916 keys/sLinAsm code80,208,078 keys/sLinAsm code2,537,423 keys/s
Accelerationx15.34Accelerationx11.11Accelerationx0.99
Find (concurrent threads)Max value (concurrent threads)
STL code7,299,470 keys/sSTL code7,194,653 keys/s
LinAsm code82,681,781 keys/sLinAsm code60,534,066 keys/s
Accelerationx11.33Accelerationx8.41
Find unordered (single thread)IsEqual (single thread)
STL code7,190,420 keys/sSTL code6,871,869 keys/s
LinAsm code107,913,483 keys/sLinAsm code75,142,472 keys/s
Accelerationx15.01Accelerationx10.93
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Hash table ADT

Hash table abstract data types (UniqueHash and MultiHash) provide the same functionality as STL "unordered_map" and "unordered_multimap" templates.

This benchmark creates 1,048,576 random keys in range [0, 131,072) (8 duplicates per key in average case) and put them into hash table. Then it invokes standard operations such as "find", "count", "unique", "remove" and iterates all the elements, using sequential iterators, to measure hash table performance. The result of each operation is represented as the number of keys divided by the total time, each operation has took.

Some functions were tested in both single and multiple thread modes. Actually they were ran in 4 simultaneous threads on 2 cores CPU, which support hyperthreading technology. Multiple thread tests used external iterators, which provide highly concurrent read only access to elements of hash table. Benchmark just cyclically invokes target functions for all hash table elements 8 times, to guarantee, that threads were run simultaneous all the time, and access the same memory blocks. As you may see the memory bandwidth is bottleneck for all multiple thread algorithms, if you do not sparse data access code with mathematical processing of found keys.

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Insert (single thread)Count (single thread)Copy constructor (single thread)
STL code1,978,687 keys/sSTL code682,946 keys/sSTL code4,698,680 keys/s
LinAsm code7,061,311 keys/sLinAsm code2,139,015 keys/sLinAsm code6,991,214 keys/s
Accelerationx3.57Accelerationx3.13Accelerationx1.49
Insert (concurrent threads)Count (concurrent threads)Unique values (single thread)
STL code238,547 keys/sSTL code643,413 keys/sSTL code5,473,898 keys/s
LinAsm code972,861 keys/sLinAsm code1,974,845 keys/sLinAsm code18,542,697 keys/s
Accelerationx4.08Accelerationx3.07Accelerationx3.39
Get (single thread)Min value (single thread)Remove (single thread)
STL code8,011,015 keys/sSTL code7,779,970 keys/sSTL code6,333,774 keys/s
LinAsm code17,632,010 keys/sLinAsm code110,819,782 keys/sLinAsm code13,552,480 keys/s
Accelerationx2.20Accelerationx14.24Accelerationx2.14
Get (concurrent threads)Min value (concurrent threads)Remove (concurrent threads)
STL code8,391,381 keys/sSTL code7,791,679 keys/sSTL code2,413,195 keys/s
LinAsm code13,733,900 keys/sLinAsm code51,965,267 keys/sLinAsm code4,407,341 keys/s
Accelerationx1.64Accelerationx6.67Accelerationx1.83
Find (single thread)Max value (single thread)
STL code3,102,324 keys/sSTL code7,772,445 keys/s
LinAsm code9,717,402 keys/sLinAsm code68,036,839 keys/s
Accelerationx3.13Accelerationx8.75
Find (concurrent threads)Max value (concurrent threads)
STL code3,155,613 keys/sSTL code8,030,079 keys/s
LinAsm code9,691,096 keys/sLinAsm code27,504,652 keys/s
Accelerationx3.07Accelerationx3.43
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Tree ADT

Tree ADTs, are provided by the LinAsm library (UniqueTree and MultiTree), are equivalent to STL "map" and "multimap" classes. They are based on in-memory B-trees, instead of red-black trees, are used by the STL. B-trees hold neighbor keys grouped into one big node (cluster of keys). This trick reduces tree height and optimize cache memory usage. It also speed up sequential iteration procedure, because next and previous elements are grouped together, and do not require addition pointer operations, except for the first/last element into the node. This benchmark shows how many keys per seconds each operation can handle if you insert elements in ascending, descending or random order. All tests were ran for 1,048,576 elements wide trees.

Multiple thread tests do the same as multiple thread tests for hash table and list ADTs. Each benchmark was run in 4 threads in 8 rounds, to be sure that all process were running simultaneously. The result shows cumulative speed of all the processes. And again we see speed downgrade, because of cache concurrency. You can be sure in that fact, because STL implementation shows the same behavior in multiple thread mode. And again it may be fixed by mathematically intensive code among data access code.

Ascending keys

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Insert (single thread)Find differences (single thread)Copy constructor (single thread)
STL code1,593,449 keys/sSTL code12,591,531 keys/sSTL code9,024,989 keys/s
LinAsm code8,598,128 keys/sLinAsm code38,047,892 keys/sLinAsm code7,388,359 keys/s
Accelerationx5.40Accelerationx3.02Accelerationx0.82
Insert (concurrent threads)Find differences (concurrent threads)Copy (single thread)
STL code191,149 keys/sSTL code7,063,691 keys/sSTL code3,230,083 keys/s
LinAsm code644,367 keys/sLinAsm code14,329,357 keys/sLinAsm code8,454,930 keys/s
Accelerationx3.37Accelerationx2.03Accelerationx2.62
Get (single thread)Count (single thread)Move (single thread)
STL code26,112,486 keys/sSTL code1,952,249 keys/sSTL code1,470,107 keys/s
LinAsm code63,837,577 keys/sLinAsm code3,348,980 keys/sLinAsm code6,154,643 keys/s
Accelerationx2.44Accelerationx1.72Accelerationx4.19
Get (concurrent threads)Count (concurrent threads)Unique values (single thread)
STL code22,968,257 keys/sSTL code715,721 keys/sSTL code1,617,892 keys/s
LinAsm code47,149,885 keys/sLinAsm code1,166,816 keys/sLinAsm code8,298,756 keys/s
Accelerationx2.05Accelerationx1.63Accelerationx5.13
Find (single thread)IsEqual (single thread)Remove (single thread)
STL code2,345,714 keys/sSTL code14,737,635 keys/sSTL code14,390,451 keys/s
LinAsm code7,178,074 keys/sLinAsm code37,713,888 keys/sLinAsm code30,083,192 keys/s
Accelerationx3.06Accelerationx2.56Accelerationx2.09
Find (concurrent threads)IsEqual (concurrent threads)Remove (concurrent threads)
STL code773,197 keys/sSTL code7,216,353 keys/sSTL code2,863,612 keys/s
LinAsm code1,927,927 keys/sLinAsm code14,312,060 keys/sLinAsm code3,927,950 keys/s
Accelerationx2.49Accelerationx1.98Accelerationx1.37
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Descending keys

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Insert (single thread)Find differences (single thread)Copy constructor (single thread)
STL code1,582,465 keys/sSTL code11,224,204 keys/sSTL code9,012,617 keys/s
LinAsm code6,276,257 keys/sLinAsm code45,869,312 keys/sLinAsm code7,187,087 keys/s
Accelerationx3.97Accelerationx4.09Accelerationx0.80
Insert (concurrent threads)Find differences (concurrent threads)Copy (single thread)
STL code192,800 keys/sSTL code6,332,801 keys/sSTL code2,947,764 keys/s
LinAsm code602,489 keys/sLinAsm code13,799,683 keys/sLinAsm code8,408,320 keys/s
Accelerationx3.12Accelerationx2.18Accelerationx2.85
Get (single thread)Count (single thread)Move (single thread)
STL code20,096,205 keys/sSTL code1,950,123 keys/sSTL code1,496,511 keys/s
LinAsm code71,926,952 keys/sLinAsm code3,387,828 keys/sLinAsm code6,484,944 keys/s
Accelerationx3.58Accelerationx1.74Accelerationx4.33
Get (concurrent threads)Count (concurrent threads)Unique values (single thread)
STL code18,257,647 keys/sSTL code389,913 keys/sSTL code1,561,946 keys/s
LinAsm code50,921,640 keys/sLinAsm code1,073,920 keys/sLinAsm code8,532,335 keys/s
Accelerationx2.79Accelerationx2.75Accelerationx5.46
Find (single thread)IsEqual (single thread)Remove (single thread)
STL code2,132,875 keys/sSTL code11,478,024 keys/sSTL code19,836,599 keys/s
LinAsm code7,119,795 keys/sLinAsm code45,275,934 keys/sLinAsm code34,423,465 keys/s
Accelerationx3.34Accelerationx3.94Accelerationx1.74
Find (concurrent threads)IsEqual (concurrent threads)Remove (concurrent threads)
STL code538,100 keys/sSTL code6,002,319 keys/sSTL code2,884,925 keys/s
LinAsm code1,815,265 keys/sLinAsm code14,333,414 keys/sLinAsm code3,821,561 keys/s
Accelerationx3.37Accelerationx2.39Accelerationx1.32
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Random keys

FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value
Insert (single thread)Find differences (single thread)Copy constructor (single thread)
STL code705,638 keys/sSTL code5,307,753 keys/sSTL code4,509,760 keys/s
LinAsm code2,253,828 keys/sLinAsm code34,806,530 keys/sLinAsm code7,058,120 keys/s
Accelerationx3.19Accelerationx6.56Accelerationx1.57
Insert (concurrent threads)Find differences (concurrent threads)Copy (single thread)
STL code149,931 keys/sSTL code2,625,713 keys/sSTL code1,832,863 keys/s
LinAsm code310,099 keys/sLinAsm code12,955,101 keys/sLinAsm code8,013,033 keys/s
Accelerationx2.07Accelerationx4.93Accelerationx4.37
Get (single thread)Count (single thread)Move (single thread)
STL code6,678,392 keys/sSTL code366,567 keys/sSTL code1,615,666 keys/s
LinAsm code48,064,646 keys/sLinAsm code1,105,134 keys/sLinAsm code6,413,899 keys/s
Accelerationx7.20Accelerationx3.01Accelerationx3.97
Get (concurrent threads)Count (concurrent threads)Unique values (single thread)
STL code6,473,388 keys/sSTL code348,122 keys/sSTL code5,188,184 keys/s
LinAsm code31,488,789 keys/sLinAsm code842,178 keys/sLinAsm code31,438,531 keys/s
Accelerationx4.86Accelerationx2.42Accelerationx6.06
Find (single thread)IsEqual (single thread)Remove (single thread)
STL code626,833 keys/sSTL code6,315,680 keys/sSTL code5,788,973 keys/s
LinAsm code1,907,373 keys/sLinAsm code35,808,412 keys/sLinAsm code24,653,006 keys/s
Accelerationx3.04Accelerationx5.67Accelerationx4.26
Find (concurrent threads)IsEqual (concurrent threads)Remove (concurrent threads)
STL code586,922 keys/sSTL code2,851,763 keys/sSTL code2,814,248 keys/s
LinAsm code1,442,942 keys/sLinAsm code12,821,653 keys/sLinAsm code3,888,592 keys/s
Accelerationx2.46Accelerationx4.50Accelerationx1.38
FunctionBenchmark valueFunctionBenchmark valueFunctionBenchmark value

Afterword and links

You may see that vectorized code and cache friendly algorithms may significantly improve performance of C and C++ programs, which do intensive calculations, or memory operations. The main reason of code slowdown on modern CPUs is slow memory access. And it is the place where we should do some kind of optimization, to get speed improvement. LinAsm can accelerate CPU intensive and memory intensive programs, using fast algorithms for common functions and optimized requests for memory reading/writing operations.

The source codes of all benchmarks and latest LinAsm release you may download by the links below. To reproduce these tests on your PC, a CPU should support SSE 4.2 operations. In other case you will get Invalid opcode error message.

Copyright 2012-2016 Jack Black. All rights reserved.