How to Choose the Right Algorithm and Compiler Options

Software Optimization Dangers

Software optimization can make a 10x difference in how fast your code runs, but these optimizations don't always preserve the behavior of your program.

For example, suppose you have a program that stores a set of random numbers in an array and then ends.

If you compile with a high optimization level like -O3 (which tells the compiler to optimize aggressively, even at the cost of compile time and code size), the compiler will notice that all you are doing is storing a bunch of numbers and never using them again. Because those values are never accessed anywhere else in the code, the compiler can decide not to generate the random values, or the array, at all!

In this article, we'll look at the performance difference between:
  1. Multiplication 500 million times.
  2. Multiplication x million times, where x is typed by the user on the command line.
  3. A lookup table that pre-calculates all the possible multiplications at compile time; at run time the value is just looked up.
  4. A lookup table, casting the negative indexes instead of multiplying by -1 (removing a multiplication, therefore theoretically faster).
  5. A lookup table with casting, but not storing duplicates in the table.

Multiplication at Run Time

Volume Change Example

A real-life case: you are listening to music. When you dial the volume up on your laptop, what's the calculation for the wave's new value? What sound should come out of the speakers?

(The original frequency) × (The current volume)

Let's set (The current volume) to be a number between 0.0 and 1.0. You can think of it as a percentage out of 100%.

To simulate this, we can use an array of frequencies (integers) instead of feeding in a song:

Some interesting observations:

  • The Multiplication at Run Time algorithm on X86_64 is faster at -O1 and -O2 than at -O3, even though -O3 is theoretically supposed to produce code that uses fewer resources (time and memory).
  • The Multiplication at Run Time algorithm on X86_64 consumes the least memory at -O0, -O2, and -O3. Even though we might guess that -O1 would use less memory than -O0 because it is faster, this did not turn out to be true.
  • The Multiplication at Run Time algorithm is both the slowest (8.844 sec) and the fastest (0.791 sec) of the algorithms, depending on the architecture and the optimization flag.

Multiplication at Run Time with Command Line Arg

Instead of defining MAX as a constant that can be folded in at compile time, we read user input and use that value wherever MAX was used.

Some interesting observations:

  • Using a command line argument instead of a constant increases the processing time, because handling the value is another task the program must defer to run time instead of resolving at compile time.

Lookup Table - Do All Possible Multiplications at Compile Time

Volume Change Example

Looking at the same volume example, we can create a lookup table to access all the values.
In the example below:
  • The lookup table is: multSoundFreqsTable
  • The values are stored in: soundFreqs
  • The user has turned the volume dial to 50% (VOLUME = 0.5)
We can look up every product (soundFreqs[i] * VOLUME) in the lookup table instead of calculating it at run time.

It might be easy to assume that this lookup table algorithm would run faster, but this is only true under certain conditions. For example, if you examine the table further below, you will find that -O3 optimization causes the Multiplication at Run Time algorithm to be faster on both X86_64 and AARCH64!

Lookup Table + Casting

Here, instead of multiplying by -1 twice to turn our random value into a positive index, we can use casting. Casting to an unsigned int always gives us a non-negative value: the numbers that were originally negative wrap around and are indexed downward from the maximum value of an unsigned int on your system.

Some interesting observations:

  • On both architectures, this algorithm is second only to the Multiplication at Run Time algorithm.

Lookup Table + Casting + No Duplicate Storage

Here, we're simply checking if this calculation has been done before. If it has, we just use that value instead of recalculating it.

Some interesting observations:

  • -Og compilation is slower than -O3 on AARCH64, but faster on X86_64

Table of Run times and Memory Usage

Some General interesting observations:

  • There was not much difference between -O1, -O2, -O3, or -Og for any of the algorithms, but there was a big jump from -O0 to any of the others.
  • In 3 of the 4 cases above, using -Og decreased performance.
  • Music usually has a sample rate of 88 200 samples/second when read from a CD onto the computer (44 100 samples per second × 2 channels), so even our slowest algorithm (44 630 902 frequencies/second) can keep up.

Distribution of Data

What if our samples were distributed differently? Would this affect any algorithm’s performance? First let’s look at some definitions:
  1. Normally Distributed Data: values cluster symmetrically around the center, so the mean and median of the set will be the same or similar
  2. Standard Deviation: a measure of how far individual observations differ from the mean

Realistically - More Clustered Values

The distribution of the data was random in all of our algorithms, because we were generating values with rand(). But if we were analyzing real music, the distribution of the data would probably be normal, i.e., the values would be closer to each other and at more consistent intervals. If we could narrow the frequencies used in the song to a smaller range, we would only have to do a fraction of the calculations.

In our example:
  1. Our range of values would be smaller than -32768 to +32767 (eg. -20000 to +20000)
  2. Our lookup table would be smaller
Therefore, only the Lookup Table Algorithm would be affected by differently distributed data.

Only Negative or Positive Values

If the range of values ran from 0 to +32767, we could remove our logic for restoring the sign of the value after it's been looked up in the table, decreasing run time.


You never really know which algorithm or optimization flag will result in faster code, or code that uses less memory, until you test it!
