Thoughts from the front lines

home

Hardware impications of Software

20 Aug 2014

As software engineers we want to make code run faster. It's often only an issue when it's an issue in production.

When we write software we look for things that will make it run slow like: - Create objects in loops - Recomputing values multiple times - Database cache misses

Something that we don't think about is hardware implications of code. Look at the following methods, what would you expect the runtime of the two to be (as a constant compared to the other).

void loopOverSecondIndex() {
    for(int i = 1; i < SIZE; i++){
        for(int j = 1; j < SIZE; j++) {
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}

void loopOverFirstIndex() {
    for(int j = 1; j < SIZE; j++) {
        for(int i = 1; i < SIZE; i++){
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}

I would say the ratio is 1:1 from a quick glance. However running the following program we got some supprising results.

#include <stdio.h>
#include <time.h> 

#define SIZE 2000
#define TEST_SIZE 100

int largeArray[SIZE][SIZE];

void setup();
long runTest(void (*funPointer)(void));
void loopOverSecondIndex();
void loopOverFirstIndex();
void loopOverAndModifyFirstIndex();
void printStats(char* testRun, long* results);

int main() {
    long firstIndexResults[TEST_SIZE];
    long firstIndexAndModifyResults[TEST_SIZE];
    long secondIndexResults[TEST_SIZE];

    for(int i = 0; i < TEST_SIZE; i++) {
        firstIndexResults[i] = runTest(loopOverFirstIndex);
    }
    printStats("First Index [i+1][j]", firstIndexResults);
    
    for(int i = 0; i < TEST_SIZE; i++) {
        firstIndexAndModifyResults[i] = runTest(loopOverAndModifyFirstIndex);
    }

    printStats("Second Index [i+1][j] += [i][j]", firstIndexAndModifyResults);
    for(int i = 0; i < TEST_SIZE; i++) {
        secondIndexResults[i] = runTest(loopOverSecondIndex);
    }
    printStats("Second Index [i][j+1]", secondIndexResults);

}

void printStats(char* testRun, long* results) {
    long long averageTime = 0;
    long min = results[0];
    long max = results[0];
    for(int i = 0; i < TEST_SIZE; i++) {
        averageTime += results[i];
        if( results[i] < min ) {
            min = results[i];
        }
        if( results[i] > max ) {
            max = results[i];
        }
    }

    printf("%s\n", testRun);
    printf("---- Stats for run ----\n");
    printf("Max Runtime:        %06ld\n", max);
    printf("Average Runtime:    %06ld\n", (long)(averageTime / TEST_SIZE));
    printf("Min Runtime:        %06ld\n\n", min);
}

void loopOverSecondIndex() {
    for(int i = 1; i < SIZE; i++){
        for(int j = 1; j < SIZE; j++) {
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}

void loopOverFirstIndex() {
    for(int j = 1; j < SIZE; j++) {
        for(int i = 1; i < SIZE; i++){
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}
void loopOverAndModifyFirstIndex() {
    for(int j = 1; j < SIZE; j++) {
        for(int i = 1; i < SIZE; i++){
            largeArray[i][j] += largeArray[i-1][j];
        }
    }
}

long runTest( void (*funPointer)(void) ) {
    setup();
    clock_t startTime, endTime;
    startTime = clock();

    (*funPointer)();
    
    endTime = clock();
    //return (endTime - startTime)/(double)CLOCKS_PER_SEC;
    return (endTime - startTime);
}

void setup() {
    for(int i = 0; i < SIZE; i++) {
        for(int j = 0; j < SIZE; j++) {
            largeArray[i][j] = 1;
        }
    }
}

To run it we run (from _include/software/hardware-runtime):

$> gradle mainExecutable && ./build/binaries/mainExecutable/main                          
:mainCExtractHeaders UP-TO-DATE
:compileMainExecutableMainC
:linkMainExecutable
:mainExecutable

BUILD SUCCESSFUL

Total time: 0.671 secs
First Index [i+1][j]
---- Stats for run ----
Max Runtime:        038741
Average Runtime:    032946
Min Runtime:        031661

Second Index [i][j+1]
---- Stats for run ----
Max Runtime:        013336
Average Runtime:    009877
Min Runtime:        009419

The loopOverSecondIndex method is almost 3.3 times as fast as loopOverFirstIndex. Wow... Thats a supprising result.

Dig into the results

So we need to go over some definitions:

So what does this mean to us as engineers?

The faster example ([i][j+1]) uses Locality to improve performance. When you use a varable like foo[0][0], that isn't the only data that gets pulled into the cache, the surrounding memory comes with it. For example you will have foo[0][0] through foo[0][3] in the cache. In this example ever 4 requests will only have 1 cache miss.

In the slower example ([i+1][j]) you pull in foo[0][0] though foo[0][3] into the cache. Instead of the next request being foo[0][1] it's foo[1][0]. This means that there will be a cache miss EVERY time. Forcing you to go out to L2 or RAM to get the value.

Something that processor also do for you, is try to pre-fetch values from memory in an attempt to speed up the runtime.

Sooo.... What does this mean?

If you care about runtime, you should thing about how things are layed out in memory and HOW you access things. If you have to loop over the column vs a row, you may want to transform the data to be where you can run along the row.