Welcome to the Linux Assembly libraries project, which provides fast, optimized assembly libraries for common and widely used algorithms. Most of them were originally developed for data analysis and time series processing, but they now cover many more mathematical and general-purpose algorithms. The libraries are written entirely in assembly language for x86-64 Linux and are meant to be called from C and C++ programs, making them very fast on that platform.

What does a typical program do?


What is the reason to use assembly in programs? Maybe we shouldn't do this?

Almost all programmers believe that recent C/C++ compilers are smart enough to produce fast, well-optimized code by themselves, and that you should not use assembly at all. The compiler will do all the hard work for you. So relax, and don't worry about the code.

Has anyone checked this myth? I ran some tests and found that the most popular compilers are still as naive as they were many years ago. They waste a lot of execution time transferring data between the processor's registers, RAM and the program stack, instead of doing the real computing the algorithm calls for. Some scientists even joke that, running their programs, modern processors have the efficiency factor of a steam engine (20%-25%). I'm not sure this is true; maybe they are good scientists, but bad programmers. In any case, none of this sounds good.

In fact, almost all C/C++ compilers fail to make good use of the SIMD extensions of modern processors, and can't even parallelize vector operations automatically and correctly. Programmers have to resort to code hacks to make this happen. Usually they write SIMD operations using so-called "intrinsic functions" or "built-in functions", like this one:

# include <stdio.h>

// Definition of vector data types
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union
{
    float array[4];
    v4sf  vector;
} v4sfu;

int main (void)
{
    // Sample data arrays
    v4sfu a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    v4sfu b = {{5.0f, 6.0f, 7.0f, 8.0f}};
    v4sfu c;

    // Get the convolution of the two arrays
    c.vector = a.vector * b.vector;
    c.vector = __builtin_ia32_haddps (c.vector, c.vector);
    c.vector = __builtin_ia32_haddps (c.vector, c.vector);

    // Print the result
    printf ("The convolution of two arrays is = %f\n", c.array[0]);
    return 0;
}
Who can tell immediately what this opaque instruction does? And where are the operands? Are they already in XMM registers, or are they in main memory? In the latter case we pay a memory-read overhead: the data are not available immediately, and the CPU has to wait until they are transferred from RAM into the processor's cache or registers.


To find out what the actual code is, you have to compile what you wrote and then inspect its assembly output (usually with gcc -S). If you did everything right, and prayed regularly to the computer gods, you will find that both operands are already placed in XMM registers and that the appropriate SIMD instruction, HADDPS, is issued.

Of course this is a toy example, and real programs are much more complex and better optimized. I just want to show what is going on under the cover of high-level programming languages. If you spend some time studying the generated code, you will find that typical programs waste a lot of time on unnecessary data transfers, and on executing sequences of extra commands instead of calling the single CPU instruction the hardware provides. That is because you can't control the code: you rely on the compiler's intellect, which is far from perfect.

Most modern compilers can't read your mind and guess what you need. They only optimize the code they can understand, and not all of them can recognize that the code you wrote is vector code. You have to give them a very big hint, like the one above, to make use of the hardware capabilities and new instructions of your processor. The function "__builtin_ia32_haddps" (which is not a real C++ function) is that hint. This whole story turns really bad when you need extreme speed from your software, when you work with very big data banks, or when you build embedded systems without powerful processors.

Make it fast and simple


A good way to optimize software is to split its code into two parts. The first part consists of low-level, very fast assembly-written primitives which use all the power of your hardware: vector processing, register operands, optimal CPU cache usage and pipeline loading. The second part is high-level C/C++ code, which simply calls the assembly libraries to implement the main idea of the algorithm.

In that case you don't need to write strange code riddled with intrinsic functions. You just write your algorithm as in the very simple example below. Everything runs fast, because the library's code is optimized for the best CPU usage: it is small and keeps all the data close to the execution units, so there are no memory read/write overheads and a minimum of pipeline stalls. Your programs also stay free of mumbo-jumbo such as intrinsic functions and dirty hacks that tell the compiler how to vectorize data, whose meaning is unclear to most programmers who are not familiar with the CPU's instruction set.

# include <stdio.h>
# include <Array.h>

int main (void)
{
    // Sample data arrays
    const size_t size = 4;
    float a[size] = {1.0, 2.0, 3.0, 4.0};
    float b[size] = {5.0, 6.0, 7.0, 8.0};

    // Get the convolution of the two arrays (the sum of multiplied elements)
    float c = Array::SumMul (a, b, size);

    // Print the result
    printf ("The convolution of two arrays is = %f\n", c);
    return 0;
}

By the way, the code above also scales: it stays the same for arrays of any size. All the platform-specific code and SIMD instructions are handled by the assembly library, leaving the main idea of the program very clear.

If you wonder how fast assembly code can be, just take a look at these benchmarks.

Assembly is not portable, but very fast


Some people also say: "Assembly is not portable. That is why almost all programs are written in C/C++ or in other programming languages."

You may be surprised, but intrinsic functions are not portable either: even their names differ between compilers. If the target CPU doesn't support them, the compiler will simply drop them and supply emulation code to implement their functionality, or even fail the build. But I agree that it is a good idea to write all the code in a portable programming language like C/C++, or maybe even Python. Today lots of programs for Linux are written in Python (mainly for the GNOME DE).

Now let's talk about program execution speed and memory usage. This topic may not matter much for regular programs like text editors, mail clients and audio players, but it is critical for scientific, HD multimedia, financial and medical applications, embedded systems and computer games.

What if your brand-new program is slow and memory-hungry, but portable across operating systems and CPUs? I guess the target user of the final software product will prefer the best graphics and gaming speed over portability. That is because he doesn't care about portability: he just needs a solution for his tasks, and if you can provide it in the best way, he will choose your program.


A scientist who writes math code would rather wait a few hours for the program to process the data and have non-portable code, than wait a few days and have completely portable code. Almost all supercomputers are unique products, which is why portable math libraries are not widely used in the supercomputer industry; most of it relies on platform-specific code. I know of some projects that fall outside this statistic, but they are too few to speak of a wide practice. Medical software is similarly specialized.

Regular users prefer fast-reacting applications too. They don't want to wait even one second for a program to process their request: they expect an immediate response to their actions, and really hate slowly moving progress bars.

Actually, ever since the UNIX epoch began, the meaning of "portability" has been this:

In the Unix world, "portability" refers to porting to different Unix versions. For a GNU program, this kind of portability is desirable, but not paramount.
The primary purpose of GNU software is to run on top of the GNU kernel, compiled with the GNU C compiler, on various types of CPU. So the kinds of portability that are absolutely necessary are quite limited. But it is important to support Linux-based GNU systems, since they are the form of GNU that is popular.
Beyond that, it is good to support the other free operating systems (*BSD), and it is nice to support other Unix-like systems if you want to. Supporting a variety of Unix-like systems is desirable, although not paramount. It is usually not too hard, so you may as well do it. But you don't have to consider it an obligation, if it does turn out to be hard.
GNU Coding Standards (April 7, 2012)
Richard Stallman

That's it: portability is not obligatory, but desirable. And some programs will never run on anything other than their target platform. Portability is a good thing for widely used applications like "coreutils", but not for resource-hungry modules. Programs usually spend most of their running time processing critical sections, and those sections will be the bottleneck of your algorithm. That is why they need some kind of software optimization before the final release is made.

The "MPlayer" and "FFmpeg" projects use assembly modules to accelerate video streams; "Mpg123" does the same for audio streams, and "libjpeg-turbo" for image processing. They can also be launched on other platforms, using CPU-specific code for each of them. LinAsm does the same, but for a single platform: x86-64 Linux.

The goal of the LinAsm project


The main purpose of the Linux Assembly libraries is to provide very fast, optimized code that can be invoked from programs written in high-level languages (mostly C/C++). The project does not focus on writing pure assembly programs (that is not efficient), but only on assembly libraries for common and widely used algorithms. The libraries can be used from any programming language that supports linking with third-party static libraries (*.o files) and understands C-style function names.

To get more information about the project, please read the "About" section. Detailed information about the libraries and the functions they provide is covered in the project documentation.