Welcome

Welcome to the Linux Assembly libraries project, which provides fast and highly optimized assembly libraries for common and widely used algorithms. Most of them were originally developed for data analysis and time series processing, but they now cover many more mathematical and general-purpose algorithms. These libraries are written entirely in assembly language for x86-64 Linux, to be used inside C and C++ programs and to make them very fast on that platform.

What does a typical program do?

typical_program.png

What is the reason to use assembly in programs? Maybe we shouldn't do this at all?

Almost all programmers believe that modern C/C++ compilers are smart enough to produce very fast and optimized code by themselves, so you should not use assembly at all. The compiler will do all the hard work for you. So relax and don't worry about the code.

Has anyone checked this myth? I did the tests and found that the most popular compilers are still as dumb as they were many years ago. They waste a lot of program execution time on data transfers between the processor's registers, RAM and the program stack, instead of the real computation the main algorithm is supposed to do. Some scientists even joke that, running their programs, modern processors have the efficiency of a steam engine (20% - 25%). I'm not sure whether this is true. Maybe they are good scientists but bad programmers. Anyway, all of the above does not sound good.

In fact, almost all C/C++ compilers make poor use of the SIMD extensions of modern processors and cannot even vectorize operations automatically and correctly. Programmers have to use code hacks to make this happen. Usually, they write SIMD operations using so-called "intrinsic functions" or "built-in functions", like this one:

# include <stdio.h>

// Definition of vector data types
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union
{
    float array[4];
    v4sf  vector;
} v4sfu;

int main (void)
{
    // Sample data arrays
    v4sfu a = {1.0, 2.0, 3.0, 4.0};
    v4sfu b = {5.0, 6.0, 7.0, 8.0};
    v4sfu c;

    // Get the convolution of the two arrays: element-wise products,
    // then two horizontal additions (SSE3 "HADDPS"; compile with gcc -msse3)
    c.vector = a.vector * b.vector;
    c.vector = __builtin_ia32_haddps (c.vector, c.vector);
    c.vector = __builtin_ia32_haddps (c.vector, c.vector);

    // Print the result
    printf ("The convolution of two arrays is = %f\n", c.array[0]);
    return 0;
}

Who can tell me immediately what this opaque instruction does? And where are the operands? Are they already placed in XMM registers, or are they in main memory? In the second case we get a memory-read overhead: the data are not available immediately, and the CPU has to wait until they are transferred from RAM to the processor's cache or registers.

check_code.png

To find out what the actual code is, you have to compile the code you wrote and then inspect its assembly output (usually with gcc -S). If you do everything right, and regularly pray to the computer gods, you will find that both operands are already placed in XMM registers and that the appropriate SIMD instruction "HADDPS" is called.

Of course this is a contrived example, and real programs are much more complex and better optimized. I just want to show you what is going on under the cover of high-level programming languages. If you spend some time studying the generated code, you will find that typical programs waste a lot of time on unnecessary data transfers and on executing sequences of extra commands instead of calling the single CPU instruction the hardware provides. This happens because you don't control the code and have to rely on the compiler's intellect, which is far from perfect.

Most modern compilers can't read your mind and guess what you need. They just optimize the code they can understand, and not all of them can recognize that the code you wrote is vector code. You have to give them a very big hint, like the one above, to utilize the hardware capabilities and new instructions of your processor. The function "__builtin_ia32_haddps" (which is not a real C++ function) is that hint. This whole story turns really bad when you need extreme speed from your software, when you work with very big data sets, or when you build embedded systems without powerful processors.

Make it fast and simple

fast_code.png

A good way to optimize software is to split its code into two parts. The first part consists of low-level, very fast primitives written in assembly, which use all the power of your hardware: vector processing, register operands, optimal CPU cache usage and pipeline loading. The second part is high-level C/C++ code, which simply calls the assembly libraries to implement the main idea of the algorithm.

In that case you don't need to write strange code stuffed with intrinsic functions. You just write your algorithm as the very simple code below. Everything works very fast, because the source code of the library is optimized for the best CPU usage: it is small, it keeps all the data close to the execution units, there are no memory read/write overheads, and the code causes a minimum of pipeline stalls. Your programs also stay free of mumbo-jumbo such as intrinsic functions and dirty hacks that tell the compiler how to vectorize data, whose meaning is unclear to most programmers who are not familiar with the CPU's instruction set.

# include <stdio.h>
# include <Array.h>

int main (void)
{
    // Sample data arrays
    const size_t size = 4;
    float a[size] = {1.0, 2.0, 3.0, 4.0};
    float b[size] = {5.0, 6.0, 7.0, 8.0};

    // Get convolution of two arrays (sum of multiplied elements)
    float c = Array::SumMul (a, b, size);

    // Print the result
    printf ("The convolution of two arrays is = %f\n", c);
    return 0;
}

By the way, the above code is also scalable: it stays the same for arrays of any size. All platform-specific code and SIMD instructions are handled by the assembly library, keeping the main idea of the program very clear.
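
For example, here is a minimal sketch of the same call with an array whose size is only known at run time. The Array::SumMul call is taken from the example above; the size and fill values are arbitrary, and error checking is omitted for brevity.

# include <stdio.h>
# include <stdlib.h>
# include <Array.h>

int main (void)
{
    // Array size chosen at run time (an arbitrary value for illustration)
    const size_t size = 1000000;

    // Allocate and fill the sample arrays
    float *a = (float*) malloc (size * sizeof (float));
    float *b = (float*) malloc (size * sizeof (float));
    for (size_t i = 0; i < size; i++)
    {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    // The call is exactly the same as for the fixed-size arrays above
    float c = Array::SumMul (a, b, size);

    // Print the result
    printf ("The convolution of two arrays is = %f\n", c);

    free (a);
    free (b);
    return 0;
}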

If you wonder how fast assembly code can be, just take a look at these benchmarks.

Assembly is not portable, but very fast

program_port.png

Some people also say: "Assembly is not portable. That is why almost all programs are written in C/C++ or in other programming languages."

You will be surprised, but intrinsic functions are not portable either. Even their names differ among compilers. If the target CPU doesn't support them, the compiler will simply throw them out and provide emulation code to implement their functionality, or it will even break the build. But I agree that it is a good idea to write all the code in a portable programming language like C/C++, or maybe even in Python. Today lots of programs for Linux are written in Python (mainly for the GNOME DE).
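
For instance, the GCC-specific builtin from the example above would have to be spelled differently for other compilers. A rough equivalent written with the Intel-style intrinsics from <pmmintrin.h> (just a sketch, assuming an SSE3-capable target and, for GCC, the -msse3 compile flag) looks like this:

# include <stdio.h>
# include <pmmintrin.h>    // SSE3 intrinsics (_mm_hadd_ps)

int main (void)
{
    // Same sample data as before, packed into 128-bit vectors
    // (_mm_set_ps takes the elements in reverse order)
    __m128 a = _mm_set_ps (4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps (8.0f, 7.0f, 6.0f, 5.0f);

    // Element-wise products, then two horizontal additions
    __m128 c = _mm_mul_ps (a, b);
    c = _mm_hadd_ps (c, c);
    c = _mm_hadd_ps (c, c);

    // Print the result (after two horizontal additions the sum is in every element)
    printf ("The convolution of two arrays is = %f\n", _mm_cvtss_f32 (c));
    return 0;
}

The operation is identical; only the spelling of the hint changes from compiler to compiler.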

Now let's talk about program execution speed and memory usage. This topic may not matter much for regular programs such as text editors, mail clients and audio players, but it is critical for scientific, HD multimedia, financial and medical applications, embedded systems and computer games.

What do you think: what if your brand-new program is slow and memory-hungry, but portable across different operating systems and CPUs? I guess that the target user of the final software product will prefer the best graphics and gameplay speed over portability. That is because he doesn't care about portability; he just needs a solution for his tasks. And if you can provide it in the best way, he will choose your program.

critical_sections.png

A scientist writing math code would rather wait a few hours for the program to process the data with non-portable code than wait a few days with completely portable code. Almost all supercomputers are unique products, and that is why portable math libraries are not widely used in the supercomputer industry: most of them rely on platform-specific code. I know some projects that fall outside this statistic, but there are not enough of them to speak of a widespread practice. Medical software is also unique.

Regular users prefer responsive applications too. They don't want to wait even one second for a program to process their request. They prefer an immediate response to their actions and really hate slow-moving progress bars.

Actually, ever since the UNIX epoch began, the meaning of portability has been...

In the Unix world, "portability" refers to porting to different Unix versions. For a GNU program, this kind of portability is desirable, but not paramount.
The primary purpose of GNU software is to run on top of the GNU kernel, compiled with the GNU C compiler, on various types of CPU. So the kinds of portability that are absolutely necessary are quite limited. But it is important to support Linux-based GNU systems, since they are the form of GNU that is popular.
Beyond that, it is good to support the other free operating systems (*BSD), and it is nice to support other Unix-like systems if you want to. Supporting a variety of Unix-like systems is desirable, although not paramount. It is usually not too hard, so you may as well do it. But you don't have to consider it an obligation, if it does turn out to be hard.
GNU Coding Standards (April 7, 2012)
Richard Stallman

That's it: portability is not obligatory, but desirable. And some programs will never run on anything other than their target platform. Portability is a good thing for widely used applications like "coreutils", but not for resource-hungry modules. Programs usually spend a lot of their running time processing critical sections, and those sections will be the bottleneck of your algorithm. That is why these sections need some kind of optimization before the final release is made.

The "MPlayer" and "FFmpeg" projects use assembly modules to accelerate video streams. "Mpg123" does the same for audio streams, and "libjpeg-turbo" for image processing. They can also run on other platforms and use CPU-specific code there too. LinAsm does the same, but for a single platform: x86-64 Linux.

The goal of the LinAsm project

project_goal.png

The main purpose of the Linux Assembly libraries is to provide very fast and optimized code that can be invoked from programs written in high-level languages (mostly C/C++). The project does not concentrate on writing pure assembly programs (that is not efficient), but only on assembly libraries for common and widely used algorithms. The libraries can be used from any programming language that supports linking with third-party static libraries (*.o files) and understands C function names.

To get more information about the project, please read the "About" section. Detailed information about the libraries and the functions they provide can be found in the project documentation.