LinAsm benchmarks
LinAsm benchmarks were made to demonstrate how fast the libraries are. Benchmarks source and latest LinAsm release may be downloaded by the links at the bottom of this page. You may compile and run these tests on your own PC to get personal results and find is there any space for speed improvement for your platform and the code.
Following tests use vector extensions of x86 processors from first SSE version to SSE 4.2 instructions set. They do not use AVX extension for now. I will try to fix this and make AVX version in future. But I cannot provide any deadline for this.
Benchmarks compare LinAsm functions with scalar code and analogous functions from GNU libc and C++ STL. Not all LinAsm functions have analogs in C library, because the main goal of this project is not to replace C library or STL, but reimplement some widely used algorithms from different libraries and improve their speed. This is very critical for engineering calculations, data analysis and multimedia software.
- Testing platform
- Numbers conversion
- Time conversion
- Math library
- Scalar functions
- Vectorized functions (single precision)
- Vectorized functions (double precision)
- Array library
- Integer operations
- Single precision operations
- Double precision operations
- Statistics library
- Single precision operations
- Double precision operations
- Object pool library
- Vector ADT
- Deque ADT
- Heap ADT
- List ADT
- Hash table ADT
- Tree ADT
- Ascending keys
- Descending keys
- Random keys
- Afterword and links
Testing platform
All tests were made for single thread (except tests for ADTs, which have single and concurrent threads benchmarks) on Intel Core i5 CPU (2 physical cores with hyper-threading technology) using a laptop with 4Gb of RAM, running 64-bit Arch Linux live CD. No other tasks were running during the tests. All functions in LinAsm libraries are thread safe, except the abstract data types (which use access predicates to provide thread safety). So you may expect linear speed improvement for multithreading programs with no memory conflicts. CPU specification of the laptop is provided below. Its peak performance is around 6 GFlops per each core.
processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 37 model name : Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz stepping : 5 microcode : 0x2 cpu MHz : 2667.000 cache size : 3072 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt lahf_lm ida arat dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 5321.30 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:
All benchmarks were compiled by GCC (version 5.3.0) with following optimization flags:
-std=gnu++11 -O2 -march=native -mtune=native -fomit-frame-pointer -fno-builtin -fno-inline
Numbers conversion
Numbers conversion library is the finite-state automaton that extracts number values from string, convert them to binary representation, and check numbers for overflow and underflow conditions. This is common task for importing data from CSV and TSV files.
For this test were generated 16,384 random numbers. They were converted to string format and then were extracted 1000 times from strings to get precise results. As you may see in following table, "sscanf" functions are very slow because they should recognize number format and then call associated core function to extract number value. "strtoul" and "strtod" work faster, but they are also not so very fast as assembly written variants, which can be much more effective for floating-point numbers.
Function | Benchmark value | Function | Benchmark value |
---|---|---|---|
Octal int numbers conversion | Decimal int numbers conversion | ||
sscanf | 3,737,705 nums/s | strtoul | 7,091,779 nums/s |
LinAsm | 31,240,336 nums/s | LinAsm | 33,405,495 nums/s |
Acceleration | x8.36 | Acceleration | x4.71 |
Octal int numbers conversion | Hexadecimal flt numbers conversion | ||
strtoul | 12,371,567 nums/s | sscanf | 2,055,056 nums/s |
LinAsm | 31,584,624 nums/s | LinAsm | 18,666,089 nums/s |
Acceleration | x2.55 | Acceleration | x9.08 |
Hexadecimal int numbers conversion | Hexadecimal flt numbers conversion | ||
sscanf | 3,398,960 nums/s | strtod | 4,266,261 nums/s |
LinAsm | 20,195,693 nums/s | LinAsm | 18,725,130 nums/s |
Acceleration | x5.94 | Acceleration | x4.39 |
Hexadecimal int numbers conversion | Decimal flt numbers conversion | ||
strtoul | 8,044,625 nums/s | sscanf | 2,953,008 nums/s |
LinAsm | 20,199,339 nums/s | LinAsm | 44,920,448 nums/s |
Acceleration | x2.51 | Acceleration | x15.21 |
Decimal int numbers conversion | Decimal flt numbers conversion | ||
sscanf | 3,155,849 nums/s | strtod | 5,935,734 nums/s |
LinAsm | 33,511,791 nums/s | LinAsm | 45,040,590 nums/s |
Acceleration | x10.62 | Acceleration | x7.59 |
Function | Benchmark value | Function | Benchmark value |
Time conversion
Time conversion is not most used algorithm in opposite to other functions described in these benchmarks. When I was processing time series, I found that lot of time were spent in time stamp extraction procedure. It took too much time, so I decide to improve such procedure. I developed time conversion library, which provides such functional to time and date values and do it very fast.
Following benchmark generates 131,072 time stamps and then converts them 100 times from Unix time format to Gregorian time and date format and vice versa. LinAsm functions beat standard lib C time functions in many times.
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Unix time to UTC Gregorian date conversion | Unix time to local Gregorian date conversion | Gregorian date to Unix time conversion | |||
gmtime | 16,662,701 calls/s | localtime | 1,070,848 calls/s | mktime | 1,004,177 calls/s |
LinAsm | 28,851,300 calls/s | LinAsm | 15,235,056 calls/s | LinAsm | 66,723,597 calls/s |
Acceleration | x1.73 | Acceleration | x14.23 | Acceleration | x66.45 |
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Math library
Math library is excellent place for optimizations, especially if you work with data analysis algorithms and create models of real processes. Currently the Math library implements only basic math functions, I used for my projects. But I have plans to extend its functionality and add more functions to it.
This benchmark generates 131,072 random numbers and use them to calculate standard math functions, operating with these values as parameters. Each test runs in 1,000 round to get precise time, each function takes for calculation. Math functions guarantee precision loss no more than 2 less significant bits. It is enough for multimedia and data processing.
Scalar functions
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Absolute value | Natural logarithm | Hyperbolic ArcSine | |||
GNU libc | 320,583,799 calls/s | GNU libc | 27,791,520 calls/s | GNU libc | 16,468,853 calls/s |
LinAsm | 319,642,808 calls/s | LinAsm | 45,552,691 calls/s | LinAsm | 19,853,076 calls/s |
Acceleration | x1.00 | Acceleration | x1.64 | Acceleration | x1.21 |
Minimum value | Natural logarithm of value plus 1 | Hyperbolic ArcCosine | |||
GNU libc | 288,547,315 calls/s | GNU libc | 40,535,751 calls/s | GNU libc | 17,313,269 calls/s |
LinAsm | 357,292,558 calls/s | LinAsm | 44,171,955 calls/s | LinAsm | 26,042,793 calls/s |
Acceleration | x1.24 | Acceleration | x1.09 | Acceleration | x1.50 |
Maximum value | Hypotenuse | Hyperbolic ArcTangent | |||
GNU libc | 287,757,970 calls/s | GNU libc | 48,949,533 calls/s | GNU libc | 24,095,564 calls/s |
LinAsm | 359,123,015 calls/s | LinAsm | 91,293,740 calls/s | LinAsm | 27,917,050 calls/s |
Acceleration | x1.25 | Acceleration | x1.87 | Acceleration | x1.16 |
Minimum absolute value | Sine | Round down (floor) | |||
GNU libc | 114,655,067 calls/s | GNU libc | 22,022,897 calls/s | GNU libc | 322,160,945 calls/s |
LinAsm | 317,098,355 calls/s | LinAsm | 32,057,435 calls/s | LinAsm | 320,740,804 calls/s |
Acceleration | x2.77 | Acceleration | x1.46 | Acceleration | x1.00 |
Maximum absolute value | Cosine | Round up (ceil) | |||
GNU libc | 114,862,062 calls/s | GNU libc | 21,873,853 calls/s | GNU libc | 321,413,014 calls/s |
LinAsm | 317,156,505 calls/s | LinAsm | 31,470,238 calls/s | LinAsm | 290,366,772 calls/s |
Acceleration | x2.76 | Acceleration | x1.44 | Acceleration | x0.90 |
Square root | Sine and cosine | Round to nearest even integer | |||
GNU libc | 103,597,540 calls/s | GNU libc | 14,070,492 calls/s | GNU libc | 321,748,469 calls/s |
LinAsm | 149,873,082 calls/s | LinAsm | 26,016,987 calls/s | LinAsm | 321,621,637 calls/s |
Acceleration | x1.45 | Acceleration | x1.85 | Acceleration | x1.00 |
Scale by power of 2 | Tangent | Round to nearest integer | |||
GNU libc | 119,157,634 calls/s | GNU libc | 16,292,607 calls/s | GNU libc | 246,542,534 calls/s |
LinAsm | 287,824,038 calls/s | LinAsm | 25,387,566 calls/s | LinAsm | 287,037,558 calls/s |
Acceleration | x2.42 | Acceleration | x1.56 | Acceleration | x1.16 |
Power of 2 | ArcSine | Truncate to integer | |||
GNU libc | 49,415,264 calls/s | GNU libc | 35,767,892 calls/s | GNU libc | 253,612,955 calls/s |
LinAsm | 70,695,737 calls/s | LinAsm | 40,013,384 calls/s | LinAsm | 321,065,964 calls/s |
Acceleration | x1.43 | Acceleration | x1.12 | Acceleration | x1.27 |
Power of 10 | ArcCosine | Check for normal value | |||
GNU libc | 17,334,208 calls/s | GNU libc | 36,330,003 calls/s | GNU libc | 290,038,870 calls/s |
LinAsm | 56,934,765 calls/s | LinAsm | 37,917,505 calls/s | LinAsm | 318,984,655 calls/s |
Acceleration | x3.28 | Acceleration | x1.04 | Acceleration | x1.10 |
Power of E | ArcTangent | Check for finite value | |||
GNU libc | 36,930,198 calls/s | GNU libc | 28,743,095 calls/s | GNU libc | 473,825,840 calls/s |
LinAsm | 56,865,350 calls/s | LinAsm | 47,107,414 calls/s | LinAsm | 318,069,504 calls/s |
Acceleration | x1.54 | Acceleration | x1.64 | Acceleration | x0.67 |
Power of E minus 1 | ArcTangent2 | Check for infinite value | |||
GNU libc | 40,208,793 calls/s | GNU libc | 16,759,420 calls/s | GNU libc | 475,281,035 calls/s |
LinAsm | 57,936,776 calls/s | LinAsm | 32,529,029 calls/s | LinAsm | 357,350,738 calls/s |
Acceleration | x1.44 | Acceleration | x1.94 | Acceleration | x0.75 |
Power | Hyperbolic sine | Check for NaN value | |||
GNU libc | 17,057,306 calls/s | GNU libc | 23,814,110 calls/s | GNU libc | 478,990,996 calls/s |
LinAsm | 34,880,151 calls/s | LinAsm | 38,867,188 calls/s | LinAsm | 320,172,910 calls/s |
Acceleration | x2.04 | Acceleration | x1.63 | Acceleration | x0.67 |
Logarithm to base 2 | Hyperbolic cosine | ||||
GNU libc | 48,285,716 calls/s | GNU libc | 33,884,353 calls/s | ||
LinAsm | 45,554,278 calls/s | LinAsm | 38,644,402 calls/s | ||
Acceleration | x0.94 | Acceleration | x1.14 | ||
Logarithm to base 10 | Hyperbolic tangent | ||||
GNU libc | 23,791,664 calls/s | GNU libc | 25,536,869 calls/s | ||
LinAsm | 45,552,268 calls/s | LinAsm | 39,367,433 calls/s | ||
Acceleration | x1.91 | Acceleration | x1.54 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Vectorized functions (single precision)
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Absolute value | Logarithm to base 10 | Hyperbolic sine | |||
GNU libc | 324,308,110 calls/s | GNU libc | 42,342,596 calls/s | GNU libc | 27,369,086 calls/s |
LinAsm | 1,295,451,434 calls/s | LinAsm | 228,321,086 calls/s | LinAsm | 198,967,638 calls/s |
Acceleration | x3.99 | Acceleration | x5.39 | Acceleration | x7.27 |
Minimum value | Natural logarithm | Hyperbolic cosine | |||
GNU libc | 265,006,842 calls/s | GNU libc | 62,844,510 calls/s | GNU libc | 56,774,535 calls/s |
LinAsm | 1,165,630,415 calls/s | LinAsm | 228,444,821 calls/s | LinAsm | 202,957,564 calls/s |
Acceleration | x4.40 | Acceleration | x3.64 | Acceleration | x3.57 |
Maximum value | Natural logarithm of value plus 1 | Hyperbolic tangent | |||
GNU libc | 265,314,531 calls/s | GNU libc | 51,105,329 calls/s | GNU libc | 29,514,099 calls/s |
LinAsm | 1,165,481,765 calls/s | LinAsm | 237,639,816 calls/s | LinAsm | 202,698,417 calls/s |
Acceleration | x4.39 | Acceleration | x4.65 | Acceleration | x6.87 |
Minimum absolute value | Hypotenuse | Hyperbolic ArcSine | |||
GNU libc | 100,371,575 calls/s | GNU libc | 91,315,223 calls/s | GNU libc | 30,934,969 calls/s |
LinAsm | 1,141,266,159 calls/s | LinAsm | 595,154,150 calls/s | LinAsm | 94,050,661 calls/s |
Acceleration | x11.37 | Acceleration | x6.52 | Acceleration | x3.04 |
Maximum absolute value | Sine | Hyperbolic ArcCosine | |||
GNU libc | 98,612,418 calls/s | GNU libc | 74,064,899 calls/s | GNU libc | 34,980,431 calls/s |
LinAsm | 1,131,885,254 calls/s | LinAsm | 42,163,212 calls/s | LinAsm | 113,234,823 calls/s |
Acceleration | x11.48 | Acceleration | x0.57 | Acceleration | x3.24 |
Square root | Cosine | Hyperbolic ArcTangent | |||
GNU libc | 98,377,251 calls/s | GNU libc | 73,978,695 calls/s | GNU libc | 30,142,314 calls/s |
LinAsm | 674,817,836 calls/s | LinAsm | 42,213,056 calls/s | LinAsm | 159,531,687 calls/s |
Acceleration | x6.86 | Acceleration | x0.57 | Acceleration | x5.29 |
Power of 2 | Sine and cosine | Round down (floor) | |||
GNU libc | 52,858,140 calls/s | GNU libc | 65,247,241 calls/s | GNU libc | 321,852,478 calls/s |
LinAsm | 352,005,180 calls/s | LinAsm | 41,650,174 calls/s | LinAsm | 1,281,182,366 calls/s |
Acceleration | x6.66 | Acceleration | x0.64 | Acceleration | x3.98 |
Power of 10 | Tangent | Round up (ceil) | |||
GNU libc | 34,224,983 calls/s | GNU libc | 29,651,034 calls/s | GNU libc | 322,019,933 calls/s |
LinAsm | 262,794,537 calls/s | LinAsm | 41,811,619 calls/s | LinAsm | 1,277,422,152 calls/s |
Acceleration | x7.68 | Acceleration | x1.41 | Acceleration | x3.97 |
Power of E | ArcSine | Round to nearest even integer | |||
GNU libc | 76,305,289 calls/s | GNU libc | 70,338,925 calls/s | GNU libc | 321,658,149 calls/s |
LinAsm | 263,001,376 calls/s | LinAsm | 166,928,615 calls/s | LinAsm | 1,275,920,059 calls/s |
Acceleration | x3.45 | Acceleration | x2.37 | Acceleration | x3.97 |
Power of E minus 1 | ArcCosine | Round to nearest integer | |||
GNU libc | 41,108,198 calls/s | GNU libc | 65,136,500 calls/s | GNU libc | 191,374,697 calls/s |
LinAsm | 267,795,355 calls/s | LinAsm | 169,324,706 calls/s | LinAsm | 1,146,311,871 calls/s |
Acceleration | x6.51 | Acceleration | x2.60 | Acceleration | x5.99 |
Power | ArcTangent | Truncate to integer | |||
GNU libc | 9,932,465 calls/s | GNU libc | 55,373,646 calls/s | GNU libc | 207,982,695 calls/s |
LinAsm | 33,038,060 calls/s | LinAsm | 262,521,262 calls/s | LinAsm | 1,275,589,500 calls/s |
Acceleration | x3.33 | Acceleration | x4.74 | Acceleration | x6.13 |
Logarithm to base 2 | ArcTangent2 | ||||
GNU libc | 57,154,090 calls/s | GNU libc | 24,149,571 calls/s | ||
LinAsm | 228,205,961 calls/s | LinAsm | 189,508,590 calls/s | ||
Acceleration | x3.99 | Acceleration | x7.85 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Vectorized functions (double precision)
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Absolute value | Logarithm to base 10 | Hyperbolic sine | |||
GNU libc | 359,388,560 calls/s | GNU libc | 23,953,735 calls/s | GNU libc | 23,796,774 calls/s |
LinAsm | 697,567,363 calls/s | LinAsm | 71,724,842 calls/s | LinAsm | 74,368,091 calls/s |
Acceleration | x1.94 | Acceleration | x2.99 | Acceleration | x3.13 |
Minimum value | Natural logarithm | Hyperbolic cosine | |||
GNU libc | 261,926,679 calls/s | GNU libc | 27,622,351 calls/s | GNU libc | 33,538,382 calls/s |
LinAsm | 628,559,020 calls/s | LinAsm | 70,711,205 calls/s | LinAsm | 74,089,859 calls/s |
Acceleration | x2.40 | Acceleration | x2.56 | Acceleration | x2.21 |
Maximum value | Natural logarithm of value plus 1 | Hyperbolic tangent | |||
GNU libc | 262,361,131 calls/s | GNU libc | 40,496,201 calls/s | GNU libc | 25,308,355 calls/s |
LinAsm | 570,279,381 calls/s | LinAsm | 73,550,247 calls/s | LinAsm | 78,192,188 calls/s |
Acceleration | x2.17 | Acceleration | x1.82 | Acceleration | x3.09 |
Minimum absolute value | Hypotenuse | Hyperbolic ArcSine | |||
GNU libc | 106,739,601 calls/s | GNU libc | 46,784,021 calls/s | GNU libc | 16,465,227 calls/s |
LinAsm | 568,036,107 calls/s | LinAsm | 176,954,234 calls/s | LinAsm | 31,500,673 calls/s |
Acceleration | x5.32 | Acceleration | x3.78 | Acceleration | x1.91 |
Maximum absolute value | Sine | Hyperbolic ArcCosine | |||
GNU libc | 106,794,055 calls/s | GNU libc | 22,453,277 calls/s | GNU libc | 17,414,551 calls/s |
LinAsm | 567,515,968 calls/s | LinAsm | 32,942,922 calls/s | LinAsm | 39,087,076 calls/s |
Acceleration | x5.31 | Acceleration | x1.47 | Acceleration | x2.24 |
Square root | Cosine | Hyperbolic ArcTangent | |||
GNU libc | 105,112,995 calls/s | GNU libc | 22,226,302 calls/s | GNU libc | 24,041,115 calls/s |
LinAsm | 226,788,703 calls/s | LinAsm | 33,508,911 calls/s | LinAsm | 52,987,191 calls/s |
Acceleration | x2.16 | Acceleration | x1.51 | Acceleration | x2.20 |
Power of 2 | Sine and cosine | Round down (floor) | |||
GNU libc | 49,244,566 calls/s | GNU libc | 14,274,131 calls/s | GNU libc | 360,992,071 calls/s |
LinAsm | 124,966,988 calls/s | LinAsm | 31,691,947 calls/s | LinAsm | 713,763,090 calls/s |
Acceleration | x2.54 | Acceleration | x2.22 | Acceleration | x1.98 |
Power of 10 | Tangent | Round up (ceil) | |||
GNU libc | 17,057,982 calls/s | GNU libc | 16,601,831 calls/s | GNU libc | 360,542,288 calls/s |
LinAsm | 100,099,548 calls/s | LinAsm | 31,780,370 calls/s | LinAsm | 715,045,670 calls/s |
Acceleration | x5.87 | Acceleration | x1.91 | Acceleration | x1.98 |
Power of E | ArcSine | Round to nearest even integer | |||
GNU libc | 37,026,020 calls/s | GNU libc | 35,287,695 calls/s | GNU libc | 360,795,385 calls/s |
LinAsm | 102,103,679 calls/s | LinAsm | 60,062,778 calls/s | LinAsm | 719,332,379 calls/s |
Acceleration | x2.76 | Acceleration | x1.70 | Acceleration | x1.99 |
Power of E minus 1 | ArcCosine | Round to nearest integer | |||
GNU libc | 35,559,414 calls/s | GNU libc | 36,338,212 calls/s | GNU libc | 186,619,046 calls/s |
LinAsm | 102,977,477 calls/s | LinAsm | 61,322,835 calls/s | LinAsm | 630,066,152 calls/s |
Acceleration | x2.90 | Acceleration | x1.69 | Acceleration | x3.38 |
Power | ArcTangent | Truncate to integer | |||
GNU libc | 12,458,586 calls/s | GNU libc | 28,707,818 calls/s | GNU libc | 217,375,092 calls/s |
LinAsm | 25,872,407 calls/s | LinAsm | 86,893,719 calls/s | LinAsm | 711,336,124 calls/s |
Acceleration | x2.08 | Acceleration | x3.03 | Acceleration | x3.27 |
Logarithm to base 2 | ArcTangent2 | ||||
GNU libc | 48,030,011 calls/s | GNU libc | 16,523,095 calls/s | ||
LinAsm | 71,724,235 calls/s | LinAsm | 63,311,670 calls/s | ||
Acceleration | x1.49 | Acceleration | x3.83 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Array library
Array library can improve trivial array operations and replace scalar code with vector code, written in assembly language, which uses SIMD extension of modern processors. To run this code your CPU should support SSE 4.2 extension (almost all modern processors were bought during last 5 years).
Array functions intensively use memory prefetch commands. It improves processing of really big portions of data, because of efficient use of cache memory. The main bottleneck of modern computers is memory subsystem, which is much slower than CPU, and require non trivial code optimization to get high performance. You also may see that single precision numbers take 2 times less CPU cycles to process. It is because of memory bandwidth. Single precision data are 2 times shorter than double precision and take 2 times less memory space.
Next benchmark generates 131,072 elements wide random arrays, and tests scalar and vector code for these arrays 10,000 times to get precise time values for scalar and vector code. Sort functions are ran only ones to sort random arrays in ascending sort order. To measure performance of integer operations was generated random array of bytes (uint8_t) and then scalar and vector operations were invoked.
Integer operations
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Initialization | Scalar bitwise XOR | Find key | |||
Classic scalar code | 26,247,096,533 nums/s | Classic scalar code | 1,458,059,974 nums/s | Classic scalar code | 22,368,863,536 nums/s |
LinAsm vector code | 25,523,404,363 nums/s | LinAsm vector code | 22,431,021,895 nums/s | LinAsm vector code | 16,624,115,820 nums/s |
Acceleration | x0.97 | Acceleration | x15.38 | Acceleration | x0.74 |
Copying elements | Vector bitwise XOR | Count keys | |||
Classic scalar code | 12,446,514,964 nums/s | Classic scalar code | 968,334,933 nums/s | Classic scalar code | 968,235,259 nums/s |
LinAsm vector code | 15,538,309,629 nums/s | LinAsm vector code | 14,115,773,423 nums/s | LinAsm vector code | 13,217,213,210 nums/s |
Acceleration | x1.25 | Acceleration | x14.58 | Acceleration | x13.65 |
Moving elements | Scalar addition | Replace keys | |||
Classic scalar code | 15,133,066,835 nums/s | Classic scalar code | 1,458,167,989 nums/s | Classic scalar code | 1,365,258,630 nums/s |
LinAsm vector code | 15,462,039,920 nums/s | LinAsm vector code | 22,630,209,340 nums/s | LinAsm vector code | 7,337,564,794 nums/s |
Acceleration | x1.02 | Acceleration | x15.52 | Acceleration | x5.37 |
Bitwise NOT | Vector addition | Reversing elements order | |||
Classic scalar code | 1,452,957,585 nums/s | Classic scalar code | 968,668,665 nums/s | Classic scalar code | 1,937,538,396 nums/s |
LinAsm vector code | 20,958,282,749 nums/s | LinAsm vector code | 14,104,830,273 nums/s | LinAsm vector code | 21,802,399,684 nums/s |
Acceleration | x14.42 | Acceleration | x14.56 | Acceleration | x11.25 |
Scalar bitwise AND | Scalar subtraction | Quick sort | |||
Classic scalar code | 1,458,618,396 nums/s | Classic scalar code | 1,458,334,739 nums/s | Classic scalar code | 5,459,417 nums/s |
LinAsm vector code | 22,256,960,611 nums/s | LinAsm vector code | 21,899,201,354 nums/s | LinAsm vector code | 23,145,501 nums/s |
Acceleration | x15.26 | Acceleration | x15.02 | Acceleration | x4.24 |
Vector bitwise AND | Vector subtraction | Radix sort | |||
Classic scalar code | 968,523,913 nums/s | Classic scalar code | 965,818,843 nums/s | Classic scalar code | 5,486,116 nums/s |
LinAsm vector code | 14,114,702,070 nums/s | LinAsm vector code | 14,148,484,301 nums/s | LinAsm vector code | 304,006,272 nums/s |
Acceleration | x14.57 | Acceleration | x14.65 | Acceleration | x55.41 |
Scalar bitwise OR | Minimum value | Compare | |||
Classic scalar code | 1,459,029,787 nums/s | Classic scalar code | 972,005,815 nums/s | Classic scalar code | 14,908,584,089 nums/s |
LinAsm vector code | 22,673,501,492 nums/s | LinAsm vector code | 25,873,112,942 nums/s | LinAsm vector code | 12,473,248,041 nums/s |
Acceleration | x15.54 | Acceleration | x26.62 | Acceleration | x0.84 |
Vector bitwise OR | Maximum value | ||||
Classic scalar code | 968,519,403 nums/s | Classic scalar code | 972,691,812 nums/s | ||
LinAsm vector code | 14,120,594,696 nums/s | LinAsm vector code | 25,832,753,615 nums/s | ||
Acceleration | x14.58 | Acceleration | x26.56 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Single precision operations
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Initialization | Square root | Sum of elements | |||
Classic scalar code | 1,449,448,350 nums/s | Classic scalar code | 96,666,381 nums/s | Classic scalar code | 973,652,319 nums/s |
LinAsm vector code | 5,314,771,813 nums/s | LinAsm vector code | 673,151,614 nums/s | LinAsm vector code | 5,721,688,843 nums/s |
Acceleration | x3.67 | Acceleration | x6.96 | Acceleration | x5.88 |
Copying elements | Reversing elements order | Sum of squared values | |||
Classic scalar code | 2,583,627,939 nums/s | Classic scalar code | 1,880,596,186 nums/s | Classic scalar code | 962,859,536 nums/s |
LinAsm vector code | 2,995,891,876 nums/s | LinAsm vector code | 3,905,499,219 nums/s | LinAsm vector code | 4,493,268,963 nums/s |
Acceleration | x1.16 | Acceleration | x2.08 | Acceleration | x4.67 |
Moving elements | Minimum value | Sum of absolute values | |||
Classic scalar code | 3,085,921,651 nums/s | Classic scalar code | 729,849,891 nums/s | Classic scalar code | 289,142,233 nums/s |
LinAsm vector code | 3,076,979,620 nums/s | LinAsm vector code | 4,910,519,391 nums/s | LinAsm vector code | 5,069,867,899 nums/s |
Acceleration | x1.00 | Acceleration | x6.73 | Acceleration | x17.53 |
Addition | Maximum value | Convolution | |||
Classic scalar code | 957,867,472 nums/s | Classic scalar code | 730,042,489 nums/s | Classic scalar code | 962,786,062 nums/s |
LinAsm vector code | 2,806,874,647 nums/s | LinAsm vector code | 4,966,436,955 nums/s | LinAsm vector code | 3,018,673,549 nums/s |
Acceleration | x2.93 | Acceleration | x6.80 | Acceleration | x3.14 |
Subtraction | Minimum absolute value | Sum of squared differences | |||
Classic scalar code | 953,232,282 nums/s | Classic scalar code | 264,656,716 nums/s | Classic scalar code | 957,652,131 nums/s |
LinAsm vector code | 2,803,691,325 nums/s | LinAsm vector code | 3,472,789,571 nums/s | LinAsm vector code | 2,962,268,467 nums/s |
Acceleration | x2.94 | Acceleration | x13.12 | Acceleration | x3.09 |
Multiplication | Maximum absolute value | Sum of absolute differences | |||
Classic scalar code | 956,900,317 nums/s | Classic scalar code | 264,173,973 nums/s | Classic scalar code | 288,280,274 nums/s |
LinAsm vector code | 2,795,946,245 nums/s | LinAsm vector code | 3,468,637,424 nums/s | LinAsm vector code | 2,979,043,493 nums/s |
Acceleration | x2.92 | Acceleration | x13.13 | Acceleration | x10.33 |
Division | Round down (floor) | Quick sort | |||
Classic scalar code | 208,187,694 nums/s | Classic scalar code | 320,398,717 nums/s | Classic scalar code | 6,958,462 nums/s |
LinAsm vector code | 830,988,071 nums/s | LinAsm vector code | 4,600,269,931 nums/s | LinAsm vector code | 13,721,876 nums/s |
Acceleration | x3.99 | Acceleration | x14.36 | Acceleration | x1.97 |
Negative value | Round up (ceil) | Radix sort | |||
Classic scalar code | 963,286,249 nums/s | Classic scalar code | 320,696,628 nums/s | Classic scalar code | 7,006,796 nums/s |
LinAsm vector code | 4,644,524,991 nums/s | LinAsm vector code | 4,602,883,966 nums/s | LinAsm vector code | 67,867,415 nums/s |
Acceleration | x4.82 | Acceleration | x14.35 | Acceleration | x9.69 |
Absolute value | Round to nearest even integer | Check for infinite values | |||
Classic scalar code | 319,869,503 nums/s | Classic scalar code | 320,649,957 nums/s | Classic scalar code | 287,970,298 nums/s |
LinAsm vector code | 4,642,667,776 nums/s | LinAsm vector code | 4,596,676,416 nums/s | LinAsm vector code | 3,494,387,871 nums/s |
Acceleration | x14.51 | Acceleration | x14.34 | Acceleration | x12.13 |
Sign function | Round away from zero | Check for NaN values | |||
Classic scalar code | 186,030,029 nums/s | Classic scalar code | 252,217,392 nums/s | Classic scalar code | 319,586,719 nums/s |
LinAsm vector code | 1,600,463,599 nums/s | LinAsm vector code | 2,696,486,393 nums/s | LinAsm vector code | 3,505,537,468 nums/s |
Acceleration | x8.60 | Acceleration | x10.69 | Acceleration | x10.97 |
Squared value | Truncate to integer | ||||
Classic scalar code | 964,552,912 nums/s | Classic scalar code | 277,532,181 nums/s | ||
LinAsm vector code | 4,597,414,611 nums/s | LinAsm vector code | 4,594,490,375 nums/s | ||
Acceleration | x4.77 | Acceleration | x16.55 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Double precision operations
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Initialization | Square root | Sum of elements | |||
Classic scalar code | 1,453,092,179 nums/s | Classic scalar code | 100,327,992 nums/s | Classic scalar code | 973,468,701 nums/s |
LinAsm vector code | 2,365,891,956 nums/s | LinAsm vector code | 226,840,676 nums/s | LinAsm vector code | 2,872,380,716 nums/s |
Acceleration | x1.63 | Acceleration | x2.26 | Acceleration | x2.95 |
Copying elements | Reversing elements order | Sum of squared values | |||
Classic scalar code | 797,545,076 nums/s | Classic scalar code | 1,419,314,419 nums/s | Classic scalar code | 962,659,125 nums/s |
LinAsm vector code | 813,777,818 nums/s | LinAsm vector code | 1,772,287,120 nums/s | LinAsm vector code | 2,233,439,817 nums/s |
Acceleration | x1.02 | Acceleration | x1.25 | Acceleration | x2.32 |
Moving elements | Minimum value | Sum of absolute values | |||
Classic scalar code | 432,968,797 nums/s | Classic scalar code | 729,150,031 nums/s | Classic scalar code | 287,651,279 nums/s |
LinAsm vector code | 817,853,959 nums/s | LinAsm vector code | 2,223,770,794 nums/s | LinAsm vector code | 2,531,620,705 nums/s |
Acceleration | x1.89 | Acceleration | x3.05 | Acceleration | x8.80 |
Addition | Maximum value | Convolution | |||
Classic scalar code | 654,079,848 nums/s | Classic scalar code | 729,136,347 nums/s | Classic scalar code | 873,583,158 nums/s |
LinAsm vector code | 794,801,582 nums/s | LinAsm vector code | 2,233,193,107 nums/s | LinAsm vector code | 1,318,485,879 nums/s |
Acceleration | x1.22 | Acceleration | x3.06 | Acceleration | x1.51 |
Subtraction | Minimum absolute value | Sum of squared differences | |||
Classic scalar code | 653,894,985 nums/s | Classic scalar code | 263,060,735 nums/s | Classic scalar code | 861,577,026 nums/s |
LinAsm vector code | 796,559,037 nums/s | LinAsm vector code | 1,735,046,479 nums/s | LinAsm vector code | 1,321,087,255 nums/s |
Acceleration | x1.22 | Acceleration | x6.60 | Acceleration | x1.53 |
Multiplication | Maximum absolute value | Sum of absolute differences | |||
Classic scalar code | 652,176,045 nums/s | Classic scalar code | 262,765,035 nums/s | Classic scalar code | 285,274,870 nums/s |
LinAsm vector code | 794,471,750 nums/s | LinAsm vector code | 1,733,836,038 nums/s | LinAsm vector code | 1,317,029,569 nums/s |
Acceleration | x1.22 | Acceleration | x6.60 | Acceleration | x4.62 |
Division | Round down (floor) | Quick sort | |||
Classic scalar code | 131,699,220 nums/s | Classic scalar code | 322,062,155 nums/s | Classic scalar code | 6,486,939 nums/s |
LinAsm vector code | 263,163,353 nums/s | LinAsm vector code | 2,110,355,631 nums/s | LinAsm vector code | 13,645,567 nums/s |
Acceleration | x2.00 | Acceleration | x6.55 | Acceleration | x2.10 |
Negative value | Round up (ceil) | Radix sort | |||
Classic scalar code | 955,197,021 nums/s | Classic scalar code | 321,974,921 nums/s | Classic scalar code | 6,457,000 nums/s |
LinAsm vector code | 2,122,434,728 nums/s | LinAsm vector code | 2,103,546,245 nums/s | LinAsm vector code | 25,712,316 nums/s |
Acceleration | x2.22 | Acceleration | x6.53 | Acceleration | x3.98 |
Absolute value | Round to nearest even integer | Check for infinite values | |||
Classic scalar code | 321,395,283 nums/s | Classic scalar code | 322,361,620 nums/s | Classic scalar code | 951,890,346 nums/s |
LinAsm vector code | 2,125,749,456 nums/s | LinAsm vector code | 2,111,207,895 nums/s | LinAsm vector code | 1,748,375,151 nums/s |
Acceleration | x6.61 | Acceleration | x6.55 | Acceleration | x1.84 |
Sign function | Round away from zero | Check for NaN values | |||
Classic scalar code | 174,778,443 nums/s | Classic scalar code | 250,692,468 nums/s | Classic scalar code | 961,272,404 nums/s |
LinAsm vector code | 797,555,044 nums/s | LinAsm vector code | 1,338,027,530 nums/s | LinAsm vector code | 1,509,675,944 nums/s |
Acceleration | x4.56 | Acceleration | x5.34 | Acceleration | x1.57 |
Squared value | Truncate to integer | ||||
Classic scalar code | 942,158,938 nums/s | Classic scalar code | 254,941,349 nums/s | ||
LinAsm vector code | 2,111,742,458 nums/s | LinAsm vector code | 2,105,281,502 nums/s | ||
Acceleration | x2.24 | Acceleration | x8.26 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Statistics library
Statistics library implements basic statistical functions, that process empirical data and return such values as data mean, standard deviation, skewness, kurtosis and robust parameters as median and inter-quartile range. They take regular arrays and process them to compute appropriate functions.
For this benchmark were generated 131,072 elements wide random arrays. The tests were ran 10,000 times to get precise time values. You may see, that the bottleneck for this benchmark is memory bandwidth, because single precision data has almost the same process time (see scalar code) as double precision data. But the vector code took 2 times less CPU cycles for single precision data, because of type size.
Single precision operations
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Mean | Variance | Skewness | |||
Classic scalar code | 964,665,266 nums/s | Classic scalar code | 959,956,922 nums/s | Classic scalar code | 722,416,463 nums/s |
LinAsm vector code | 5,785,564,798 nums/s | LinAsm vector code | 3,956,627,930 nums/s | LinAsm vector code | 2,998,994,331 nums/s |
Acceleration | x6.00 | Acceleration | x4.12 | Acceleration | x4.15 |
Median | Standard deviation | Kurtosis | |||
Classic scalar code | 5,907,422 nums/s | Classic scalar code | 960,055,218 nums/s | Classic scalar code | 958,495,295 nums/s |
LinAsm vector code | 565,367,611 nums/s | LinAsm vector code | 3,984,589,193 nums/s | LinAsm vector code | 3,196,202,715 nums/s |
Acceleration | x95.70 | Acceleration | x4.15 | Acceleration | x3.33 |
Lower quartile | Absolute deviation | Covariance | |||
Classic scalar code | 5,920,796 nums/s | Classic scalar code | 287,672,263 nums/s | Classic scalar code | 955,035,008 nums/s |
LinAsm vector code | 600,245,462 nums/s | LinAsm vector code | 4,458,969,358 nums/s | LinAsm vector code | 2,975,861,423 nums/s |
Acceleration | x101.38 | Acceleration | x15.50 | Acceleration | x3.12 |
Upper quartile | Inter quartile range | Correlation | |||
Classic scalar code | 5,944,117 nums/s | Classic scalar code | 5,982,819 nums/s | Classic scalar code | 554,712,477 nums/s |
LinAsm vector code | 539,442,005 nums/s | LinAsm vector code | 350,847,326 nums/s | LinAsm vector code | 2,174,243,956 nums/s |
Acceleration | x90.75 | Acceleration | x58.64 | Acceleration | x3.92 |
Mid-range | Range | ||||
Classic scalar code | 721,486,658 nums/s | Classic scalar code | 723,473,804 nums/s | ||
LinAsm vector code | 3,840,675,145 nums/s | LinAsm vector code | 3,769,070,539 nums/s | ||
Acceleration | x5.32 | Acceleration | x5.21 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Double precision operations
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Mean | Variance | Skewness | |||
Classic scalar code | 964,541,098 nums/s | Classic scalar code | 950,759,341 nums/s | Classic scalar code | 713,286,659 nums/s |
LinAsm vector code | 2,886,192,727 nums/s | LinAsm vector code | 1,885,310,148 nums/s | LinAsm vector code | 1,367,836,971 nums/s |
Acceleration | x2.99 | Acceleration | x1.98 | Acceleration | x1.92 |
Median | Standard deviation | Kurtosis | |||
Classic scalar code | 6,519,883 nums/s | Classic scalar code | 951,093,533 nums/s | Classic scalar code | 940,796,992 nums/s |
LinAsm vector code | 306,118,541 nums/s | LinAsm vector code | 1,876,669,024 nums/s | LinAsm vector code | 1,619,898,830 nums/s |
Acceleration | x46.95 | Acceleration | x1.97 | Acceleration | x1.72 |
Lower quartile | Absolute deviation | Covariance | |||
Classic scalar code | 6,482,688 nums/s | Classic scalar code | 287,994,116 nums/s | Classic scalar code | 824,464,410 nums/s |
LinAsm vector code | 330,074,692 nums/s | LinAsm vector code | 2,438,455,976 nums/s | LinAsm vector code | 1,200,790,940 nums/s |
Acceleration | x50.92 | Acceleration | x8.47 | Acceleration | x1.46 |
Upper quartile | Inter quartile range | Correlation | |||
Classic scalar code | 6,651,860 nums/s | Classic scalar code | 6,581,159 nums/s | Classic scalar code | 511,943,406 nums/s |
LinAsm vector code | 272,086,588 nums/s | LinAsm vector code | 209,122,296 nums/s | LinAsm vector code | 996,020,717 nums/s |
Acceleration | x40.90 | Acceleration | x31.78 | Acceleration | x1.95 |
Mid-range | Range | ||||
Classic scalar code | 715,817,471 nums/s | Classic scalar code | 715,803,692 nums/s | ||
LinAsm vector code | 1,901,722,547 nums/s | LinAsm vector code | 1,885,016,271 nums/s | ||
Acceleration | x2.66 | Acceleration | x2.63 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Object pool library
If you allocate and release a lot of small objects in your algorithm, then the Object Pool is a good idea to replace the standard lib C "malloc" and "free" functions. Object pool allocates predefined amount of objects, which also can be aligned by power of 2, and then returns them when you need a new object, or puts them back in the pool when the object is released. This operation is very fast and do not call heavy memory management functions.
This benchmark creates 16,384 objects of different sizes (from 8 to 256 bytes) and then allocate them and release 1,000 times.
Function | Benchmark value | Function | Benchmark value |
---|---|---|---|
Allocation of 8 byte blocks | Allocation of 64 byte blocks | ||
Libc malloc | 57,733,384 blocks/s | Libc malloc | 57,770,358 blocks/s |
LinAsm alloc | 258,139,671 blocks/s | LinAsm alloc | 128,817,026 blocks/s |
Acceleration | x4.47 | Acceleration | x2.23 |
Releasing of 8 byte blocks | Releasing of 64 byte blocks | ||
Libc free | 61,711,620 blocks/s | Libc free | 61,062,234 blocks/s |
LinAsm free | 251,573,474 blocks/s | LinAsm free | 187,952,006 blocks/s |
Acceleration | x4.08 | Acceleration | x3.08 |
Allocation of 16 byte blocks | Allocation of 128 byte blocks | ||
Libc malloc | 58,366,364 blocks/s | Libc malloc | 18,916,890 blocks/s |
LinAsm alloc | 237,857,380 blocks/s | LinAsm alloc | 97,431,883 blocks/s |
Acceleration | x4.08 | Acceleration | x5.15 |
Releasing of 16 byte blocks | Releasing of 128 byte blocks | ||
Libc free | 62,101,706 blocks/s | Libc free | 21,270,774 blocks/s |
LinAsm free | 250,377,515 blocks/s | LinAsm free | 130,613,856 blocks/s |
Acceleration | x4.03 | Acceleration | x6.14 |
Allocation of 32 byte blocks | Allocation of 256 byte blocks | ||
Libc malloc | 58,015,214 blocks/s | Libc malloc | 10,030,150 blocks/s |
LinAsm alloc | 187,071,330 blocks/s | LinAsm alloc | 33,110,384 blocks/s |
Acceleration | x3.22 | Acceleration | x3.30 |
Releasing of 32 byte blocks | Releasing of 256 byte blocks | ||
Libc free | 61,837,841 blocks/s | Libc free | 8,430,772 blocks/s |
LinAsm free | 233,604,504 blocks/s | LinAsm free | 48,075,760 blocks/s |
Acceleration | x3.78 | Acceleration | x5.70 |
Function | Benchmark value | Function | Benchmark value |
Vector ADT
Vector is the first and very simple abstract data type. It is most used STL container that holds elements and may replace standard arrays. LinAsm implements it own vector container that around 2 times faster for most operations than STL vector. It has non STL interface, but much more optimized and cache friendly code. Benchmark just shows the speed (keys per seconds) you may expects for data processing algorithms, which are based on that data type.
For this benchmark were created 1,048,576 random keys in range [0, 131,072), then they were pushed and popped from the vector in single and concurrent threads. All other functions were also invoked in 16 rounds for filled vector in single and concurrent threads (4 threads) to measure the time each operation takes.
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Push (single thread) | Find unordered (concurrent thread) | Max value (concurrent threads) | |||
STL code | 29,715,893 keys/s | STL code | 45,471,847 keys/s | STL code | 44,273,055 keys/s |
LinAsm code | 63,250,151 keys/s | LinAsm code | 185,315,862 keys/s | LinAsm code | 259,325,075 keys/s |
Acceleration | x2.13 | Acceleration | x4.08 | Acceleration | x5.86 |
Push (concurrent threads) | Find differences (single thread) | IsEqual (single thread) | |||
STL code | 1,817,788 keys/s | STL code | 46,601,607 keys/s | STL code | 157,952,779 keys/s |
LinAsm code | 1,271,922 keys/s | LinAsm code | 204,334,978 keys/s | LinAsm code | 182,894,778 keys/s |
Acceleration | x0.70 | Acceleration | x4.38 | Acceleration | x1.16 |
Sort (single thread) | Find differences (concurrent threads) | IsEqual (concurrent threads) | |||
STL code | 2,857,597 keys/s | STL code | 37,444,917 keys/s | STL code | 60,911,410 keys/s |
LinAsm code | 10,842,321 keys/s | LinAsm code | 54,293,581 keys/s | LinAsm code | 55,241,078 keys/s |
Acceleration | x3.79 | Acceleration | x1.45 | Acceleration | x0.91 |
Unique values (single thread) | Binary search (single thread) | Copy constructor (single thread) | |||
STL code | 64,988,533 keys/s | STL code | 2,138,969 keys/s | STL code | 172,299,388 keys/s |
LinAsm code | 264,948,881 keys/s | LinAsm code | 2,737,777 keys/s | LinAsm code | 88,494,842 keys/s |
Acceleration | x4.08 | Acceleration | x1.28 | Acceleration | x0.51 |
Get (single thread) | Binary search (concurrent threads) | Copy (single thread) | |||
STL code | 106,187,535 keys/s | STL code | 1,741,045 keys/s | STL code | 186,212,964 keys/s |
LinAsm code | 266,331,463 keys/s | LinAsm code | 2,817,737 keys/s | LinAsm code | 125,783,541 keys/s |
Acceleration | x2.51 | Acceleration | x1.62 | Acceleration | x0.68 |
Get (concurrent threads) | Count (single thread) | Move (single thread) | |||
STL code | 65,293,858 keys/s | STL code | 44,753,761 keys/s | STL code | 175,354,957 keys/s |
LinAsm code | 166,497,705 keys/s | LinAsm code | 302,437,873 keys/s | LinAsm code | 130,294,759 keys/s |
Acceleration | x2.55 | Acceleration | x6.76 | Acceleration | x0.74 |
Set (single thread) | Count (concurrent threads) | Merge (single thread) | |||
STL code | 62,734,020 keys/s | STL code | 50,242,298 keys/s | STL code | 18,310,634 keys/s |
LinAsm code | 102,647,798 keys/s | LinAsm code | 246,367,666 keys/s | LinAsm code | 41,598,234 keys/s |
Acceleration | x1.64 | Acceleration | x4.90 | Acceleration | x2.27 |
Find (single thread) | Min value (single thread) | Reverse (single thread) | |||
STL code | 136,602,613 keys/s | STL code | 49,583,777 keys/s | STL code | 97,791,198 keys/s |
LinAsm code | 415,570,580 keys/s | LinAsm code | 314,031,598 keys/s | LinAsm code | 184,070,078 keys/s |
Acceleration | x3.04 | Acceleration | x6.33 | Acceleration | x1.88 |
Find (concurrent threads) | Min value (concurrent threads) | Pop (single thread) | |||
STL code | 90,086,609 keys/s | STL code | 45,403,514 keys/s | STL code | 221,638,135 keys/s |
LinAsm code | 275,584,315 keys/s | LinAsm code | 246,803,652 keys/s | LinAsm code | 323,223,057 keys/s |
Acceleration | x3.06 | Acceleration | x5.44 | Acceleration | x1.46 |
Find unordered (single thread) | Max value (single thread) | Pop (concurrent threads) | |||
STL code | 49,829,771 keys/s | STL code | 48,670,234 keys/s | STL code | 10,820,667 keys/s |
LinAsm code | 377,676,183 keys/s | LinAsm code | 346,499,125 keys/s | LinAsm code | 10,605,514 keys/s |
Acceleration | x7.58 | Acceleration | x7.12 | Acceleration | x0.98 |
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Deque ADT
Double ended queue is the second most popular abstract data type. Typically it is implemented as circular buffer or double licked list. LinAsm uses automatically extended circular buffer for deque. It makes it as fast as vector adt and around 2 times faster than deque from STL library. Benchmark creates 1,048,576 random keys with 8 duplicates (average case) in range [0, 131,072), then push them into tail and pop them from head in single and concurrent threads manner. Each deque operation is done 16 times to get precise time measurement. The result table is shown below. You may see how fast (keys per second) each function is.
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Push (single thread) | Find differences (concurrent threads) | IsEqual (concurrent threads) | |||
STL code | 34,043,608 keys/s | STL code | 33,233,777 keys/s | STL code | 39,850,563 keys/s |
LinAsm code | 60,844,734 keys/s | LinAsm code | 53,357,539 keys/s | LinAsm code | 70,584,555 keys/s |
Acceleration | x1.79 | Acceleration | x1.61 | Acceleration | x1.77 |
Push (concurrent threads) | Count (single thread) | Copy constructor (single thread) | |||
STL code | 1,835,312 keys/s | STL code | 67,167,365 keys/s | STL code | 54,997,119 keys/s |
LinAsm code | 1,255,515 keys/s | LinAsm code | 326,739,399 keys/s | LinAsm code | 89,869,698 keys/s |
Acceleration | x0.68 | Acceleration | x4.86 | Acceleration | x1.63 |
Get (single thread) | Count (concurrent threads) | Copy (single thread) | |||
STL code | 131,612,681 keys/s | STL code | 58,774,190 keys/s | STL code | 114,652,212 keys/s |
LinAsm code | 231,374,949 keys/s | LinAsm code | 264,880,608 keys/s | LinAsm code | 123,030,915 keys/s |
Acceleration | x1.76 | Acceleration | x4.51 | Acceleration | x1.07 |
Get (concurrent threads) | Min value (single thread) | Move (single thread) | |||
STL code | 70,822,764 keys/s | STL code | 54,333,054 keys/s | STL code | 124,217,242 keys/s |
LinAsm code | 110,527,351 keys/s | LinAsm code | 329,076,105 keys/s | LinAsm code | 131,069,919 keys/s |
Acceleration | x1.56 | Acceleration | x6.06 | Acceleration | x1.06 |
Set (single thread) | Min value (concurrent threads) | Reverse (single thread) | |||
STL code | 61,998,311 keys/s | STL code | 54,048,115 keys/s | STL code | 80,735,590 keys/s |
LinAsm code | 101,717,805 keys/s | LinAsm code | 233,531,497 keys/s | LinAsm code | 177,912,187 keys/s |
Acceleration | x1.64 | Acceleration | x4.32 | Acceleration | x2.20 |
Find (single thread) | Max value (single thread) | Pop (single thread) | |||
STL code | 136,830,816 keys/s | STL code | 52,242,222 keys/s | STL code | 50,592,871 keys/s |
LinAsm code | 368,494,833 keys/s | LinAsm code | 332,177,475 keys/s | LinAsm code | 250,850,581 keys/s |
Acceleration | x2.69 | Acceleration | x6.36 | Acceleration | x4.96 |
Find (concurrent threads) | Max value (concurrent threads) | Pop (concurrent threads) | |||
STL code | 94,349,763 keys/s | STL code | 48,113,043 keys/s | STL code | 7,151,952 keys/s |
LinAsm code | 246,576,072 keys/s | LinAsm code | 222,544,423 keys/s | LinAsm code | 8,425,015 keys/s |
Acceleration | x2.61 | Acceleration | x4.63 | Acceleration | x1.18 |
Find differences (single thread) | IsEqual (single thread) | ||||
STL code | 65,933,808 keys/s | STL code | 74,385,799 keys/s | ||
LinAsm code | 169,010,997 keys/s | LinAsm code | 171,553,277 keys/s | ||
Acceleration | x2.56 | Acceleration | x2.31 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Heap ADT
MinHeap and MaxHeap (also known as priority_queue in STL) is used when you require prioritization inside your service. Typicaly it is implemented by binary tree, projected onto plain array (vector). STL does not implement container priority_queue, but only the adaptor, which can be based on standard vector or queue. LinAsm uses regular array to store the keys. Instead of classic binary tree, it uses N-ary tree (currently each root node has 4 children, which are less or equal to the root). Child node (cluster of child elements) is aligned by CPU cache line, and all 4 children can be prefetched by one memory reading operation. It improves heap speed in 2 - 3 times in compare with standard array based binary tree.
Benchmark creates 1,048,576 random keys with 8 duplicates (average case) in range [0, 131,072), and put them into the heap in single thread and in concurrent threads manner (see the table below). Then it uses standard vector operations to find, count and check keys into the container. Multiple thread tests were ran in 4 threads on 2 cores CPU with hyper threading, enabled for each core. Concurrent threads were run in 16 rounds, to be sure that threads have worked in the same time. All the tests proved that cache aligned key store algorithm give you significant performance benefit.
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Push (single thread) | Find (single thread) | Copy constructor (single thread) | |||
STL code | 8,259,694 keys/s | STL code | 82,477,976 keys/s | STL code | 95,221,291 keys/s |
LinAsm code | 35,224,409 keys/s | LinAsm code | 346,445,662 keys/s | LinAsm code | 64,783,887 keys/s |
Acceleration | x4.26 | Acceleration | x4.20 | Acceleration | x0.68 |
Push (concurrent threads) | Find (concurrent threads) | Merge (single thread) | |||
STL code | 627,915 keys/s | STL code | 91,439,914 keys/s | STL code | 973,027 keys/s |
LinAsm code | 1,203,838 keys/s | LinAsm code | 222,634,538 keys/s | LinAsm code | 51,585,673 keys/s |
Acceleration | x1.92 | Acceleration | x2.43 | Acceleration | x53.02 |
Get (single thread) | Count (single thread) | Pop (single thread) | |||
STL code | 87,441,442 keys/s | STL code | 54,900,267 keys/s | STL code | 1,162,532 keys/s |
LinAsm code | 277,155,138 keys/s | LinAsm code | 364,647,128 keys/s | LinAsm code | 3,829,158 keys/s |
Acceleration | x3.17 | Acceleration | x6.64 | Acceleration | x3.29 |
Get (concurrent threads) | Count (concurrent threads) | Pop (concurrent threads) | |||
STL code | 64,787,259 keys/s | STL code | 50,692,254 keys/s | STL code | 650,119 keys/s |
LinAsm code | 197,005,564 keys/s | LinAsm code | 248,504,165 keys/s | LinAsm code | 2,013,109 keys/s |
Acceleration | x3.04 | Acceleration | x4.90 | Acceleration | x3.10 |
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
List ADT
List abstract data type uses big nodes, which hold more than one element into the node. Neighbor elements are grouped together to allow very fast access to previous and next elements. This improvement tries to optimize usage of CPU cache memory and accelerates standard list operations. This benchmark creates 1,048,576 of random elements, puts them into the list object and then tests speed of standard list operations. Each test shows how many keys per second were handled by the operation.
Multiple thread tests simultaneously invoke associated function in 4 threads, iterating all elements 16 times per thread, and show cumulative result (keys per second) for all of them. As you may see again, the memory speed is main limiter of all multiple thread algorithms. And show even less performance than single thread read access to data. It is because of CPU cache. All processes is trying to use CPU cache exclusively, and push out data of another threads, which downgrade the result. This can be fixed with computation intensive code among read operations.
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Insert (single thread) | Find unordered (concurrent thread) | IsEqual (concurrent threads) | |||
STL code | 12,198,874 keys/s | STL code | 6,989,241 keys/s | STL code | 5,810,324 keys/s |
LinAsm code | 40,128,990 keys/s | LinAsm code | 73,150,735 keys/s | LinAsm code | 27,199,091 keys/s |
Acceleration | x3.29 | Acceleration | x10.47 | Acceleration | x4.68 |
Insert (concurrent threads) | Find differences (single thread) | Copy constructor (single thread) | |||
STL code | 908,216 keys/s | STL code | 7,037,253 keys/s | STL code | 5,234,321 keys/s |
LinAsm code | 1,161,236 keys/s | LinAsm code | 77,608,462 keys/s | LinAsm code | 20,094,408 keys/s |
Acceleration | x1.28 | Acceleration | x11.03 | Acceleration | x3.84 |
Sort (single thread) | Find differences (concurrent threads) | Copy (single thread) | |||
STL code | 1,087,356 keys/s | STL code | 5,938,818 keys/s | STL code | 6,786,415 keys/s |
LinAsm code | 5,616,728 keys/s | LinAsm code | 27,786,449 keys/s | LinAsm code | 32,924,796 keys/s |
Acceleration | x5.17 | Acceleration | x4.68 | Acceleration | x4.85 |
Unique values (single thread) | Count (single thread) | Move (single thread) | |||
STL code | 7,046,980 keys/s | STL code | 7,182,162 keys/s | STL code | 41,986,045 keys/s |
LinAsm code | 86,307,595 keys/s | LinAsm code | 89,658,058 keys/s | LinAsm code | 28,848,947 keys/s |
Acceleration | x12.25 | Acceleration | x12.48 | Acceleration | x0.69 |
Get (single thread) | Count (concurrent threads) | Merge (single thread) | |||
STL code | 7,371,889 keys/s | STL code | 7,310,894 keys/s | STL code | 6,532,960 keys/s |
LinAsm code | 50,079,075 keys/s | LinAsm code | 67,901,225 keys/s | LinAsm code | 24,095,987 keys/s |
Acceleration | x6.79 | Acceleration | x9.29 | Acceleration | x3.69 |
Get (concurrent threads) | Min value (single thread) | Reverse (single thread) | |||
STL code | 7,366,857 keys/s | STL code | 7,255,633 keys/s | STL code | 6,686,768 keys/s |
LinAsm code | 34,691,303 keys/s | LinAsm code | 80,566,254 keys/s | LinAsm code | 34,113,523 keys/s |
Acceleration | x4.71 | Acceleration | x11.10 | Acceleration | x5.10 |
Set (single thread) | Min value (concurrent threads) | Remove (single thread) | |||
STL code | 7,199,607 keys/s | STL code | 6,986,167 keys/s | STL code | 5,454,134 keys/s |
LinAsm code | 35,812,405 keys/s | LinAsm code | 59,496,855 keys/s | LinAsm code | 17,444,377 keys/s |
Acceleration | x4.97 | Acceleration | x8.52 | Acceleration | x3.20 |
Find (single thread) | Max value (single thread) | Remove (concurrent threads) | |||
STL code | 7,210,836 keys/s | STL code | 7,221,271 keys/s | STL code | 2,571,485 keys/s |
LinAsm code | 110,623,916 keys/s | LinAsm code | 80,208,078 keys/s | LinAsm code | 2,537,423 keys/s |
Acceleration | x15.34 | Acceleration | x11.11 | Acceleration | x0.99 |
Find (concurrent threads) | Max value (concurrent threads) | ||||
STL code | 7,299,470 keys/s | STL code | 7,194,653 keys/s | ||
LinAsm code | 82,681,781 keys/s | LinAsm code | 60,534,066 keys/s | ||
Acceleration | x11.33 | Acceleration | x8.41 | ||
Find unordered (single thread) | IsEqual (single thread) | ||||
STL code | 7,190,420 keys/s | STL code | 6,871,869 keys/s | ||
LinAsm code | 107,913,483 keys/s | LinAsm code | 75,142,472 keys/s | ||
Acceleration | x15.01 | Acceleration | x10.93 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Hash table ADT
Hash table abstract data types (UniqueHash and MultiHash) provide the same functionality as STL "unordered_map" and "unordered_multimap" templates.
This benchmark creates 1,048,576 random keys in range [0, 131,072) (8 duplicates per key in average case) and put them into hash table. Then it invokes standard operations such as "find", "count", "unique", "remove" and iterates all the elements, using sequential iterators, to measure hash table performance. The result of each operation is represented as the number of keys divided by the total time, each operation has took.
Some functions were tested in both single and multiple thread modes. Actually they were ran in 4 simultaneous threads on 2 cores CPU, which support hyperthreading technology. Multiple thread tests used external iterators, which provide highly concurrent read only access to elements of hash table. Benchmark just cyclically invokes target functions for all hash table elements 8 times, to guarantee, that threads were run simultaneous all the time, and access the same memory blocks. As you may see the memory bandwidth is bottleneck for all multiple thread algorithms, if you do not sparse data access code with mathematical processing of found keys.
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Insert (single thread) | Count (single thread) | Copy constructor (single thread) | |||
STL code | 1,978,687 keys/s | STL code | 682,946 keys/s | STL code | 4,698,680 keys/s |
LinAsm code | 7,061,311 keys/s | LinAsm code | 2,139,015 keys/s | LinAsm code | 6,991,214 keys/s |
Acceleration | x3.57 | Acceleration | x3.13 | Acceleration | x1.49 |
Insert (concurrent threads) | Count (concurrent threads) | Unique values (single thread) | |||
STL code | 238,547 keys/s | STL code | 643,413 keys/s | STL code | 5,473,898 keys/s |
LinAsm code | 972,861 keys/s | LinAsm code | 1,974,845 keys/s | LinAsm code | 18,542,697 keys/s |
Acceleration | x4.08 | Acceleration | x3.07 | Acceleration | x3.39 |
Get (single thread) | Min value (single thread) | Remove (single thread) | |||
STL code | 8,011,015 keys/s | STL code | 7,779,970 keys/s | STL code | 6,333,774 keys/s |
LinAsm code | 17,632,010 keys/s | LinAsm code | 110,819,782 keys/s | LinAsm code | 13,552,480 keys/s |
Acceleration | x2.20 | Acceleration | x14.24 | Acceleration | x2.14 |
Get (concurrent threads) | Min value (concurrent threads) | Remove (concurrent threads) | |||
STL code | 8,391,381 keys/s | STL code | 7,791,679 keys/s | STL code | 2,413,195 keys/s |
LinAsm code | 13,733,900 keys/s | LinAsm code | 51,965,267 keys/s | LinAsm code | 4,407,341 keys/s |
Acceleration | x1.64 | Acceleration | x6.67 | Acceleration | x1.83 |
Find (single thread) | Max value (single thread) | ||||
STL code | 3,102,324 keys/s | STL code | 7,772,445 keys/s | ||
LinAsm code | 9,717,402 keys/s | LinAsm code | 68,036,839 keys/s | ||
Acceleration | x3.13 | Acceleration | x8.75 | ||
Find (concurrent threads) | Max value (concurrent threads) | ||||
STL code | 3,155,613 keys/s | STL code | 8,030,079 keys/s | ||
LinAsm code | 9,691,096 keys/s | LinAsm code | 27,504,652 keys/s | ||
Acceleration | x3.07 | Acceleration | x3.43 | ||
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Tree ADT
Tree ADTs, are provided by the LinAsm library (UniqueTree and MultiTree), are equivalent to STL "map" and "multimap" classes. They are based on in-memory B-trees, instead of red-black trees, are used by the STL. B-trees hold neighbor keys grouped into one big node (cluster of keys). This trick reduces tree height and optimize cache memory usage. It also speed up sequential iteration procedure, because next and previous elements are grouped together, and do not require addition pointer operations, except for the first/last element into the node. This benchmark shows how many keys per seconds each operation can handle if you insert elements in ascending, descending or random order. All tests were ran for 1,048,576 elements wide trees.
Multiple thread tests do the same as multiple thread tests for hash table and list ADTs. Each benchmark was run in 4 threads in 8 rounds, to be sure that all process were running simultaneously. The result shows cumulative speed of all the processes. And again we see speed downgrade, because of cache concurrency. You can be sure in that fact, because STL implementation shows the same behavior in multiple thread mode. And again it may be fixed by mathematically intensive code among data access code.
Ascending keys
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Insert (single thread) | Find differences (single thread) | Copy constructor (single thread) | |||
STL code | 1,593,449 keys/s | STL code | 12,591,531 keys/s | STL code | 9,024,989 keys/s |
LinAsm code | 8,598,128 keys/s | LinAsm code | 38,047,892 keys/s | LinAsm code | 7,388,359 keys/s |
Acceleration | x5.40 | Acceleration | x3.02 | Acceleration | x0.82 |
Insert (concurrent threads) | Find differences (concurrent threads) | Copy (single thread) | |||
STL code | 191,149 keys/s | STL code | 7,063,691 keys/s | STL code | 3,230,083 keys/s |
LinAsm code | 644,367 keys/s | LinAsm code | 14,329,357 keys/s | LinAsm code | 8,454,930 keys/s |
Acceleration | x3.37 | Acceleration | x2.03 | Acceleration | x2.62 |
Get (single thread) | Count (single thread) | Move (single thread) | |||
STL code | 26,112,486 keys/s | STL code | 1,952,249 keys/s | STL code | 1,470,107 keys/s |
LinAsm code | 63,837,577 keys/s | LinAsm code | 3,348,980 keys/s | LinAsm code | 6,154,643 keys/s |
Acceleration | x2.44 | Acceleration | x1.72 | Acceleration | x4.19 |
Get (concurrent threads) | Count (concurrent threads) | Unique values (single thread) | |||
STL code | 22,968,257 keys/s | STL code | 715,721 keys/s | STL code | 1,617,892 keys/s |
LinAsm code | 47,149,885 keys/s | LinAsm code | 1,166,816 keys/s | LinAsm code | 8,298,756 keys/s |
Acceleration | x2.05 | Acceleration | x1.63 | Acceleration | x5.13 |
Find (single thread) | IsEqual (single thread) | Remove (single thread) | |||
STL code | 2,345,714 keys/s | STL code | 14,737,635 keys/s | STL code | 14,390,451 keys/s |
LinAsm code | 7,178,074 keys/s | LinAsm code | 37,713,888 keys/s | LinAsm code | 30,083,192 keys/s |
Acceleration | x3.06 | Acceleration | x2.56 | Acceleration | x2.09 |
Find (concurrent threads) | IsEqual (concurrent threads) | Remove (concurrent threads) | |||
STL code | 773,197 keys/s | STL code | 7,216,353 keys/s | STL code | 2,863,612 keys/s |
LinAsm code | 1,927,927 keys/s | LinAsm code | 14,312,060 keys/s | LinAsm code | 3,927,950 keys/s |
Acceleration | x2.49 | Acceleration | x1.98 | Acceleration | x1.37 |
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Descending keys
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Insert (single thread) | Find differences (single thread) | Copy constructor (single thread) | |||
STL code | 1,582,465 keys/s | STL code | 11,224,204 keys/s | STL code | 9,012,617 keys/s |
LinAsm code | 6,276,257 keys/s | LinAsm code | 45,869,312 keys/s | LinAsm code | 7,187,087 keys/s |
Acceleration | x3.97 | Acceleration | x4.09 | Acceleration | x0.80 |
Insert (concurrent threads) | Find differences (concurrent threads) | Copy (single thread) | |||
STL code | 192,800 keys/s | STL code | 6,332,801 keys/s | STL code | 2,947,764 keys/s |
LinAsm code | 602,489 keys/s | LinAsm code | 13,799,683 keys/s | LinAsm code | 8,408,320 keys/s |
Acceleration | x3.12 | Acceleration | x2.18 | Acceleration | x2.85 |
Get (single thread) | Count (single thread) | Move (single thread) | |||
STL code | 20,096,205 keys/s | STL code | 1,950,123 keys/s | STL code | 1,496,511 keys/s |
LinAsm code | 71,926,952 keys/s | LinAsm code | 3,387,828 keys/s | LinAsm code | 6,484,944 keys/s |
Acceleration | x3.58 | Acceleration | x1.74 | Acceleration | x4.33 |
Get (concurrent threads) | Count (concurrent threads) | Unique values (single thread) | |||
STL code | 18,257,647 keys/s | STL code | 389,913 keys/s | STL code | 1,561,946 keys/s |
LinAsm code | 50,921,640 keys/s | LinAsm code | 1,073,920 keys/s | LinAsm code | 8,532,335 keys/s |
Acceleration | x2.79 | Acceleration | x2.75 | Acceleration | x5.46 |
Find (single thread) | IsEqual (single thread) | Remove (single thread) | |||
STL code | 2,132,875 keys/s | STL code | 11,478,024 keys/s | STL code | 19,836,599 keys/s |
LinAsm code | 7,119,795 keys/s | LinAsm code | 45,275,934 keys/s | LinAsm code | 34,423,465 keys/s |
Acceleration | x3.34 | Acceleration | x3.94 | Acceleration | x1.74 |
Find (concurrent threads) | IsEqual (concurrent threads) | Remove (concurrent threads) | |||
STL code | 538,100 keys/s | STL code | 6,002,319 keys/s | STL code | 2,884,925 keys/s |
LinAsm code | 1,815,265 keys/s | LinAsm code | 14,333,414 keys/s | LinAsm code | 3,821,561 keys/s |
Acceleration | x3.37 | Acceleration | x2.39 | Acceleration | x1.32 |
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Random keys
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
---|---|---|---|---|---|
Insert (single thread) | Find differences (single thread) | Copy constructor (single thread) | |||
STL code | 705,638 keys/s | STL code | 5,307,753 keys/s | STL code | 4,509,760 keys/s |
LinAsm code | 2,253,828 keys/s | LinAsm code | 34,806,530 keys/s | LinAsm code | 7,058,120 keys/s |
Acceleration | x3.19 | Acceleration | x6.56 | Acceleration | x1.57 |
Insert (concurrent threads) | Find differences (concurrent threads) | Copy (single thread) | |||
STL code | 149,931 keys/s | STL code | 2,625,713 keys/s | STL code | 1,832,863 keys/s |
LinAsm code | 310,099 keys/s | LinAsm code | 12,955,101 keys/s | LinAsm code | 8,013,033 keys/s |
Acceleration | x2.07 | Acceleration | x4.93 | Acceleration | x4.37 |
Get (single thread) | Count (single thread) | Move (single thread) | |||
STL code | 6,678,392 keys/s | STL code | 366,567 keys/s | STL code | 1,615,666 keys/s |
LinAsm code | 48,064,646 keys/s | LinAsm code | 1,105,134 keys/s | LinAsm code | 6,413,899 keys/s |
Acceleration | x7.20 | Acceleration | x3.01 | Acceleration | x3.97 |
Get (concurrent threads) | Count (concurrent threads) | Unique values (single thread) | |||
STL code | 6,473,388 keys/s | STL code | 348,122 keys/s | STL code | 5,188,184 keys/s |
LinAsm code | 31,488,789 keys/s | LinAsm code | 842,178 keys/s | LinAsm code | 31,438,531 keys/s |
Acceleration | x4.86 | Acceleration | x2.42 | Acceleration | x6.06 |
Find (single thread) | IsEqual (single thread) | Remove (single thread) | |||
STL code | 626,833 keys/s | STL code | 6,315,680 keys/s | STL code | 5,788,973 keys/s |
LinAsm code | 1,907,373 keys/s | LinAsm code | 35,808,412 keys/s | LinAsm code | 24,653,006 keys/s |
Acceleration | x3.04 | Acceleration | x5.67 | Acceleration | x4.26 |
Find (concurrent threads) | IsEqual (concurrent threads) | Remove (concurrent threads) | |||
STL code | 586,922 keys/s | STL code | 2,851,763 keys/s | STL code | 2,814,248 keys/s |
LinAsm code | 1,442,942 keys/s | LinAsm code | 12,821,653 keys/s | LinAsm code | 3,888,592 keys/s |
Acceleration | x2.46 | Acceleration | x4.50 | Acceleration | x1.38 |
Function | Benchmark value | Function | Benchmark value | Function | Benchmark value |
Afterword and links
You may see that vectorized code and cache friendly algorithms may significantly improve performance of C and C++ programs, which do intensive calculations, or memory operations. The main reason of code slowdown on modern CPUs is slow memory access. And it is the place where we should do some kind of optimization, to get speed improvement. LinAsm can accelerate CPU intensive and memory intensive programs, using fast algorithms for common functions and optimized requests for memory reading/writing operations.
The source codes of all benchmarks and latest LinAsm release you may download by the links below. To reproduce these tests on your PC, a CPU should support SSE 4.2 operations. In other case you will get Invalid opcode error message.