Welcome
Welcome to the Linux Assembly libraries project, which provides fast, SIMD-optimized assembly libraries for common and widely used algorithms. Most of them were originally developed for data analysis and time series processing, but by now they cover many more mathematical and general-purpose algorithms and abstract data types (ADTs). These libraries are written entirely in flat assembler (FASM) for x86-64 Linux and are intended to be called from C and C++ programs to make them very fast on that platform.
What does a typical program do?
Why use assembly in programs at all? Maybe we shouldn't?
Almost all programmers believe that modern C/C++ compilers are smart enough to produce fast, well-optimized code by themselves, so you should not use assembly at all. The compiler will do all the hard work for you; just relax and don't worry about the code.
This is an old myth for lazy programmers. I ran tests and found that the most popular compilers are still as limited as they were many years ago. They waste a lot of program execution time on data transfers among the processor's registers, RAM and the program stack, instead of on the real computation that the algorithms should do. Some scientists even joke that modern processors running engineering programs have the efficiency of a steam engine (20-25%). I'm not sure whether that is true; maybe they are good scientists but bad programmers. Either way, none of this sounds good if you have a big computation task.
In practice, most C/C++ compilers make poor use of the SIMD extensions of modern processors and cannot vectorize operations automatically and correctly. Programmers have to resort to code hacks to make this work. Usually they write SIMD operations using so-called "intrinsic functions" or "built-in functions", like this one:
#include <stdio.h>

// Definition of vector data types
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
    float array[4];
    v4sf vector;
} v4sfu;

int main (void)
{
    // Sample data arrays
    v4sfu a = {{1.0, 2.0, 3.0, 4.0}};
    v4sfu b = {{5.0, 6.0, 7.0, 8.0}};
    v4sfu c;

    // Get convolution of two arrays
    c.vector = a.vector * b.vector;
    c.vector = __builtin_ia32_haddps (c.vector, c.vector);
    c.vector = __builtin_ia32_haddps (c.vector, c.vector);

    // Print the result
    printf ("The convolution of two arrays is = %f\n", c.array[0]);
    return 0;
}
Who knows what this opaque instruction "__builtin_ia32_haddps" is doing? Are we working with SIMD registers, or reading memory blocks somewhere in RAM? In the latter case the data has to come from the CPU cache or be fetched from the memory banks, and if it is not available immediately, the CPU has to wait until it is transferred from RAM into the processor's cache or registers.
To find out what the actual code is, you have to compile what you wrote and then inspect its assembly output (usually with gcc -S). If you did everything right and have a smart compiler, you will find that both operands are already placed in XMM registers and the corresponding SIMD instruction "HADDPS" is issued.
This small example just shows what is going on under the cover of high-level programming languages, and what your compiler has to do to optimize the machine code. Not all compilers do it well, and the resulting programs waste a lot of time on unnecessary data transfers and on executing sequences of extra commands instead of a single CPU instruction provided by the hardware. This happens because you cannot control the code output and have to rely on the compiler's intellect, which is far from perfect. In such cases, assembly language is a good choice for software optimization.
Make it fast and simple
A good way to optimize software is to split its code into two parts. The first part consists of low-level, very fast assembly-written primitives which use all the power of your hardware: vector processing, register operands, optimal CPU cache usage and full loading of the processor's pipeline. The second part is high-level C/C++ code, which simply calls the assembly libraries to implement the main idea of the algorithm.
In that case you don't need to write strange code stuffed with intrinsic functions. You just write your algorithm as simply as the code below. Everything runs very fast, because the source code of the library is optimized for the best CPU usage: it is small, keeps all the data close to the execution units, performs no additional memory reads and writes, and causes a minimum of pipeline stalls. Your programs also stay free of mumbo-jumbo code such as intrinsic functions and dirty hacks that tell the compiler how to vectorize data, whose meaning is unclear to most programmers who are not familiar with the CPU's instruction set.
#include <stdio.h>
#include <Array.h>

int main (void)
{
    // Sample data arrays
    const size_t size = 4;
    float a[size] = {1.0, 2.0, 3.0, 4.0};
    float b[size] = {5.0, 6.0, 7.0, 8.0};

    // Get convolution of two arrays (sum of multiplied elements)
    float c = Array::SumMul (a, b, size);

    // Print the result
    printf ("The convolution of two arrays is = %f\n", c);
    return 0;
}
By the way, the code above is also scalable and stays the same for arrays of any size. All platform-specific code and SIMD instructions are handled by the assembly library, keeping the main idea of the program very clear.
If you wonder how fast assembly code can be, just take a look at these benchmarks.
The goal of LinAsm project
The main purpose of the Linux Assembly libraries is to provide very fast, optimized code for programs written in high-level languages (mostly C/C++). The project does not concentrate on writing pure assembly programs (that is not efficient), but only on assembly versions of common and widely used algorithms, math functions and abstract data types. The libraries can be used from any programming language which supports linking with shared libraries (*.so files) and understands C function names.
To get more information about the project, please read the "About" section.
Detailed information about the libraries and the functions they provide is covered in the project documentation.