The complete guide to memory management in C++ for CFD

Memory management is crucial in C++; it determines how quickly (or slowly) our code will execute. Poor memory management will destroy any advantage you gain by using a compiled language such as C++, and it is entirely possible that a slower, interpreted language will then outperform it (see also the article on Compiled vs. Interpreted languages). However, memory management is really not that difficult, and we only need to understand a few basic concepts. C++ leaves memory management entirely up to us (we will later see that using the standard template library actually takes a fair bit, if not all, of that responsibility away from us), so once we know how it works, we can use it to our advantage to write performant code. This article will explore the role of memory in C++ (and programming in general, for that matter) and will equip you with the knowledge you need to master efficient memory management.

The Stack and the Heap

I’d say understanding memory in C++, and then knowing how to manage it, is probably about 50% of the language. Well, not in terms of keywords and features, but in terms of knowing how to work with it effectively. I would recommend the article What Every Programmer Should Know About Memory by Ulrich Drepper, but it is a 114-page document, and it has been on my reading list for years, so I wouldn’t expect you to read it in full either. Be that as it may, understanding the basics of memory is important, and we should have a shared understanding before we move on, otherwise certain concepts will not make sense at all.

Let’s start with the two most important building blocks of memory, the stack and the heap. When you run your application, it has access to your random access memory (RAM), and this is split into these two regions. The stack is relatively small compared to the heap, typically on the order of 8 megabytes (the exact size depends on your operating system’s settings, and you can change it, but there is usually no need to). The heap can take up as much space on your RAM as is available; there is no fixed limit to it. What is the difference? Well, the stack is small and fast, and the heap is large and a bit slower.

If you want an analogy, think about parking a car. You have two options: either you park in a multi-storey car park, or you use a valet parking service. If you use the multi-storey car park, you know exactly where you are parked. You know the level and, if we assume each parking box is marked, you know the location on that level as well. This means finding your car is fast. If you use a valet parking service, you hand over your car, and it will be parked somewhere, but you have no idea where; all you get is a number or ticket, which you can use to retrieve your car. So you have no idea where the car is, but you have a mechanism by which you can get your car back (i.e. showing your ticket). We can probably agree that, on average, using a multi-storey car park will be faster for locating your car compared to valet parking. In this analogy, the multi-storey car park is our stack, and the valet parking service is our heap.

So the stack is a small island of memory where we can retrieve data from quickly, but we are limited in how much space we have available and how much data we can store there. The heap, on the other hand, is vast, and we can store large amounts of data. However, if we wanted to retrieve it, we first need to find it via a reference and this takes a bit longer.

So when should we use the stack and the heap? We use the heap for our largest data arrays, that is, coordinates like x, y, and z, solution arrays for velocity, pressure, temperature, etc., and anything else that has the potential to grow in size. For example, you may only be concerned, for now, with small 2D simulations; even then, you should put all of the above-mentioned arrays on the heap to avoid a stack overflow (a scenario in which the data on the stack grows so much that there is no more space available). Everything else goes on the stack.

We said the stack is small, but that is only true in comparison to the heap; it can still store a good amount of data. Typically, the stack holds information that we access frequently. In the context of CFD, these are our solver settings, like the number of iterations, the time-step size, the CFL number, and so on. So if a variable holds a single value (like a number, a string, or a Boolean), put it on the stack, but if the variable holds several values of the same type, i.e. an array of numbers, strings, Booleans, etc., then, in general, it should go on the heap.

If this sounds overwhelming, don’t worry, even if you have no idea about the stack and heap, as long as you use the standard template library (which we will cover shortly), you pretty much use the stack and the heap correctly automatically. But let’s look at some code to see how we can use each of them.

#include <string>

int main() {
  // variables allocated on the stack
  int         a_stack = 27;
  float       b_stack = 2.71f;
  double      c_stack = 3.14;
  bool        d_stack = true;
  std::string e_stack("some text");
  
  // same variables, now allocated on the heap
  int         *a_heap = new int[1];
  float       *b_heap = new float[1];
  double      *c_heap = new double[1];
  bool        *d_heap = new bool[1];
  std::string *e_heap = new std::string[1];
  
  // memory has been allocated on heap, now assign values
  *a_heap = 27;
  *b_heap = 2.71;
  *c_heap = 3.14;
  *d_heap = true;
  *e_heap = "some text";
  
  // we used the heap, we have to manually clean up after ourselves
  delete [] a_heap;
  delete [] b_heap;
  delete [] c_heap;
  delete [] d_heap;
  delete [] e_heap;
  
  return 0;
}

On lines 5-9, we allocate variables on the stack, and on lines 12-16 on the heap. We notice that we first have to allocate memory on the heap (using the new keyword). This returns a memory address (think of it as the parking box the valet chose in our valet parking system), and we store this memory address in a pointer, declared with a leading asterisk, i.e. *a_heap. If we were to print the value of a_heap, i.e. std::cout << a_heap << std::endl; (don’t forget #include <iostream> to use std::cout and std::endl), then we get a memory address back which may look something like 0x22cb2b0. This is a hexadecimal number; we can convert it to base 10 (if you want to follow along, you have to remove the leading 0x, which is just the prefix marking the number as hexadecimal). If you do that, you get 36483760, i.e. roughly speaking, the 36483760th byte in our memory. The exact address will vary from run to run (in fact, if you run the code above, it is highly unlikely that you’ll get the same address as me!). We can use this memory address to locate the data and then assign values to it. We do that on lines 19-23: putting the leading asterisk, i.e. *, in front of our pointer (something called pointer dereferencing) gives us access to the value stored at that memory location, and we can then assign a value to that specific memory address.

If you understand this, then you understand pretty much everything there is to pointers, a concept that is typically very confusing and frustrating when you are starting out with C++. Returning to our valet parking analogy: when we drop off our car, we get a number or receipt. This number or receipt has no meaning to us, i.e. it does not state where the car is actually parked, and we can’t use it to retrieve the car ourselves. However, if we show our number or receipt to the person getting our car, they will be able to locate (dereference) our car and get it for us based on that number or receipt. A pointer is nothing more than that: an intermediary between you, the programmer, and your memory.

Whenever we use the heap, we have to clean up after ourselves. This is related to the general concept of garbage collection, and C++ is one language where garbage collection is not done automatically for us. Python is an example of a language where we don’t have to think about memory; data on the heap is automatically deleted for us, at the cost of a slight runtime overhead. In other words, every time we make a call to the new operator, we need a matching call to the delete operator; otherwise we are going to leak memory, and this can lead to nasty bugs that are difficult to track down. We can see that we do call the delete operator on lines 26-30 in our example above.
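
To make the danger concrete, here is a minimal sketch of how a leak typically sneaks in (the helper function, its name, and the sizes are made up purely for illustration): an allocation inside a function that returns before the matching delete is ever reached.

#include <cstddef>
#include <iostream>

// hypothetical helper, purely for illustration: it allocates on the heap but never frees
double sumOfCoordinates(std::size_t numberOfCells) {
  double *x = new double[numberOfCells]; // heap allocation via new

  double sum = 0.0;
  for (std::size_t i = 0; i < numberOfCells; ++i) {
    x[i] = static_cast<double>(i);
    sum += x[i];
  }
  return sum; // no matching delete [] x before returning, so this memory is leaked
}

int main() {
  // calling the helper once per "time step" makes the leaked memory pile up
  for (int timeStep = 0; timeStep < 1000; ++timeStep)
    std::cout << sumOfCoordinates(1000000) << std::endl;
  return 0;
}

Each call leaks 8 MB here, and after enough time steps the operating system will refuse further allocations, typically long after the original mistake was made, which is exactly why such bugs are hard to track down.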

Why is heap memory then so important? Well, the above example does not really give it the credit it deserves; the heap only becomes important for larger objects, as alluded to before (i.e. arrays of coordinates, solutions, etc.). Let’s look at another example.

#include <iostream>

int main() {
  // parameters go on the stack
  double domainStart = 0.0;
  double domainEnd = 1.0;
  int numberOfCells = 1000;
  double spacing = (domainEnd - domainStart) / numberOfCells;

  // large arrays go on the heap
  double *xCoordinate = new double[numberOfCells];

  // assign values
  for (int i = 0; i < numberOfCells; ++i) {
    // calculate the center (centroid) position in x
    xCoordinate[i] = spacing * i + spacing / 2.0;
    // or, we could have also used
    // *(xCoordinate + i) = spacing * i + spacing / 2.0;
  }

  std::cout << *(xCoordinate+0) << std::endl; // prints 0.0005
  std::cout << *(xCoordinate+1) << std::endl; // prints 0.0015
  std::cout << *(xCoordinate+2) << std::endl; // prints 0.0025

  std::cout << xCoordinate[0] << std::endl; // prints 0.0005
  std::cout << xCoordinate[1] << std::endl; // prints 0.0015
  std::cout << xCoordinate[2] << std::endl; // prints 0.0025

  // clean up the heap
  delete [] xCoordinate;

  return 0;
}

In this example, we are now storing several values behind our pointer xCoordinate; to be precise, the pointer now points to an array of 1000 elements. If we are using more than one element, we can access each of them with an extended pointer dereferencing technique of the form

*(pointer + offset)

where the offset is an integer that describes which element of the array we want to access. Line 18 shows how to use this dereferencing technique to assign values, and lines 21-23 show how to print values. Because this notation looks somewhat clumsy, C++ provides us with an alternative: we can use square brackets to access the data of the array at a specific location directly, as shown on line 16 for assigning values and on lines 25-27 for retrieving data. On line 30, we clean up after ourselves again, but only for the values stored on the heap, not the stack (see lines 5-8; we don’t need to delete those).

Let’s look at the memory layout in this case and add a few lines. Specifically, on line 28, put the following code:

std::cout << &xCoordinate[0] << std::endl;
std::cout << &xCoordinate[1] << std::endl;
std::cout << &xCoordinate[2] << std::endl;
std::cout << sizeof(double) << std::endl;

We are now using the so-called address-of operator, i.e. &. This operator gives us the memory address at which a variable is stored, which we then print. We also print the size of a double, which should be 8 (bytes), unless you have a very exotic PC. If you look at the output of the first 3 print statements, you may get something like 0x10422b0, 0x10422b8, and 0x10422c0. The last print statement returns 8 on my machine, i.e. as mentioned above, a double is 8 bytes large. Translating the hexadecimal numbers into base 10 gives 17048240, 17048248, and 17048256, i.e. there is an offset of 8 bytes between each address. From this, we can see that while the first address (location) may be unpredictable, all subsequent memory locations are contiguous in memory, i.e. array values that sit next to each other logically are laid out next to each other in memory as well.

And this also explains why that first address looks random. When you request storage for an array of 1000 doubles, as we do in this case, your system has to check where it can place 1000 doubles next to each other in memory. It looks for the first place where there is sufficient memory available, locks this memory range for your application, and uses it to store your doubles. Keep in mind that it is not just your application running: other applications and the operating system are running at the same time, and each of them is using the heap, so the system has to provide memory to all of them at once. Hence, the memory address you get appears random, because memory allocation is dynamic and changes over time, depending on the other applications and your operating system.

At this point, I want to be very clear about the examples above, though: they are just that, examples to help us understand the concepts of the heap and the stack. In this day and age, you should never use the new and delete keywords yourself, unless you have very specific reasons to do so! Why is that? Well, the examples above are simplified; more complex codes will have more complex memory management requirements, and you run the risk of introducing unwanted memory leaks and, subsequently, bugs, which are typically difficult to track down and cost time to fix. C++ provides us with smart pointers, which handle memory management for us, and we will look at them in more detail later.
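
To give a first taste (we will cover smart pointers properly later), here is a sketch of the earlier coordinate example using std::unique_ptr, which requires C++14 for std::make_unique; the array still lives on the heap, but the clean-up now happens automatically when the pointer goes out of scope.

#include <iostream>
#include <memory>

int main() {
  int numberOfCells = 1000;
  double spacing = 1.0 / numberOfCells;

  // heap allocation managed by a smart pointer, no manual delete [] required
  std::unique_ptr<double[]> xCoordinate = std::make_unique<double[]>(numberOfCells);

  for (int i = 0; i < numberOfCells; ++i)
    xCoordinate[i] = spacing * i + spacing / 2.0;

  std::cout << xCoordinate[0] << std::endl; // prints 0.0005

  return 0;
} // xCoordinate goes out of scope here and its memory is released automatically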

However, more to the point, we don’t want to use the new and delete keywords because it is not really C++-like, at least in the example shown above (and this is a typical use case). C++ provides us with so-called containers (more on them later as well), and these handle memory for us automatically, without us having to worry about the storage location. These containers are optimised for speed and memory usage, so we should make every effort to use them whenever we can.

One of these containers is called a vector, and this container stores all its entries on the heap for us. It takes care of both memory allocation and clean-up. The same example shown above can be achieved in a much simpler form with the vector container, as shown in the following:

#include <iostream>
#include <vector>
int main() {
  // parameters go on the stack
  double domainStart = 0.0;
  double domainEnd = 1.0;
  int numberOfCells = 1000;
  double spacing = (domainEnd - domainStart) / numberOfCells;

  // large arrays go on the heap
  std::vector<double> xCoordinate(numberOfCells);

  for (int i = 0; i < numberOfCells; ++i) {
    xCoordinate[i] = spacing * i + spacing / 2.0;
  }

  std::cout << xCoordinate[0] << std::endl;
  std::cout << xCoordinate[1] << std::endl;
  std::cout << xCoordinate[2] << std::endl;

  return 0;
}

Notice how the only change we had to make was to include the vector header on line 2 and then change the variable declaration on line 11. We were able to get rid of the delete keyword, as the vector class handles all memory for us. Using a vector feels just like working with a pointer in terms of assigning values and accessing them, except that the vector class has many more features that we can use (but which we are skipping over here for brevity). We will come back to the containers later and see what the vector container can do for us.
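
To give a small flavour of those extra features (the variable name below is purely illustrative), here is a sketch of a few commonly used member functions:

#include <iostream>
#include <vector>

int main() {
  std::vector<double> residuals;       // starts empty
  residuals.push_back(1.0e-1);         // grows automatically on the heap
  residuals.push_back(1.0e-3);

  std::cout << residuals.size() << std::endl; // prints 2
  std::cout << residuals.back() << std::endl; // prints 0.001

  residuals.resize(10, 0.0);           // grow to 10 entries, new entries set to 0
  residuals.clear();                   // remove all entries again

  return 0;
}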

Call by Value and Reference

Another related concept that you need to master comes in two phrases: call by value and call by reference. If you are starting out with C++, you will come across these terms eventually; they may make little to no sense at first, but with your new knowledge of memory management they should become clear now. Let’s jump straight into an example to see both cases.

#include <iostream>
#include <vector>

void someFunctionCallByValue(std::vector<double> x) {
  std::cout << "Call by value, address: " << &x[0] << std::endl;
}

void someFunctionCallByReference(std::vector<double> &x) {
  std::cout << "Call by reference, address: " << &x[0] << std::endl;
}

int main() {
  int numberOfCells = 1000;
  std::vector<double> x(numberOfCells);

  std::cout << "original memory location: " << &x[0] << std::endl;

  someFunctionCallByValue(x);
  someFunctionCallByReference(x);

  return 0;
}

What we are doing here is creating a vector x on line 14, which we then pass to two separate functions. One is call by value, and one is call by reference. How can you tell? Both functions accept a single argument, but in the call by reference version the argument is declared with an ampersand, i.e. the & on line 8, which turns the parameter into a reference rather than a copy. If you run the above code, a possible output could be the following:

original memory location: 0x6252b0
Call by value, address: 0x628210
Call by reference, address: 0x6252b0

Notice how the original memory location (of the first element in the vector) and the memory location printed inside the call by reference function are the same, whereas in the call by value function it is different. This really tells us all we need to know: if we pass a variable to a function by reference, that is, if we declare the parameter with the & operator in the function signature, no copy is made and the function works on the original data, so the address of the first element is unchanged. Since the memory is laid out contiguously, and we know the size of the vector (this is one of the properties the vector class can tell us), we can then access all the other elements in the vector as well. If we do not use the & operator (see the function signature on line 4), then we simply pass a copy of that variable to the function, and the copy lives at a different memory location.

We obviously want to avoid copying large amounts of data between functions, so using call by reference allows you to keep your memory footprint to a minimum. In general terms: if your variable is located on the heap, you want to pass it by reference to functions; if your variable is located on the stack, passing by value is a good default choice. There are good reasons to deviate from this advice from time to time, depending on your problem, though in most cases you should be fine if you follow this approach.
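
One refinement worth mentioning: if a function only needs to read the data, we can pass a const reference, which still avoids the copy and additionally lets the compiler stop us from accidentally modifying the original. The function names below are made up purely for illustration.

#include <iostream>
#include <vector>

// read-only access to a large array: no copy is made, and the data cannot be modified
double sumField(const std::vector<double> &field) {
  double sum = 0.0;
  for (double value : field)
    sum += value;
  return sum;
}

// a single double lives on the stack and is cheap to copy, so call by value is fine
double scaleValue(double value, double factor) {
  return value * factor;
}

int main() {
  std::vector<double> pressure(1000, 101325.0);
  std::cout << sumField(pressure) << std::endl;   // no copy of the 1000 entries is made
  std::cout << scaleValue(2.0, 3.0) << std::endl; // copying two doubles is negligible
  return 0;
}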

Memory caches

The last concept I want to talk about is caches. If you have ever bought a PC and tried to decipher what hardware it has, you have likely read up on the CPU and looked at its clock speed and number of cores. These days, CPUs typically have around 4-6 cores, each with a clock speed of around 3 GHz. A clock speed of 3 GHz, or 3·10⁹ Hz, means that each core runs through 3·10⁹ clock cycles per second, and your CPU (or core) gets to do exactly one thing per clock cycle. A CPU works by repeating the following 3 stages: fetch, decode, and execute, each consuming 1 clock cycle. During the fetch stage, the CPU simply retrieves an instruction telling it what to do; this could be adding numbers together, reading memory, writing something to a file, etc. The decode stage then works out what exactly needs to be done. The execute stage then performs the task, and the cycle repeats.

For CFD applications, most of the time is spent on processing a large amount of data. We either add, subtract, multiply, or divide values from arrays or matrices, which means we need to know which operation to perform (this is what the fetch and decode stage will tell us). Additionally, we need to know which numbers to add together, i.e., at which location within the array we should manipulate data. So we also need to have data available to each core before we can do anything.

Memory is brought into the CPU through a memory hierarchy. On the CPU itself, you have registers, of which you only have a handful. Memory is not loaded into the registers directly from your main memory, but rather through the hierarchy I alluded to above. Next to the registers, you have your level 1, or L1, cache. This one is significantly larger than your registers. To give you some idea: your registers may be able to hold about 8 variables of type double, while your L1 cache can typically hold somewhere between 4096 and 8192 doubles. Next, you have a level 2 and level 3 (or L2 and L3) cache, each increasing in size, and finally you have your main memory (RAM), followed by your hard disk. This hierarchy is also shown in the picture below, taken from here:

The smaller the storage capacity of a cache, the faster it can serve memory to your CPU. When your CPU needs data, it first checks whether it is available in the registers. If not, it checks the L1 cache and loads the data from there if it exists; this is known as a cache hit. If the L1 cache does not contain the data you need (this is called a cache miss), it is requested from the L2 cache and, if it doesn’t exist there, from the L3 cache. If the L3 cache does not have the data either, it is loaded from main memory (RAM), where it has to exist; otherwise, your application will terminate ungracefully. This is also the reason why you need sufficient RAM to run simulations: everything, e.g. your mesh and your solution data, is stored here until you dump it to your hard disk. If you don’t have enough space here, you simply can’t run your simulation (and will likely get a segmentation-fault type of message, indicating that your simulation could not run due to memory issues).

Theoretically, you can use your hard disk as RAM. With hard disks in the region of terabytes, this sounds very tempting. In fact, if you use a Linux distribution such as Ubuntu, you get asked at the installation stage whether you want to set up a swap space, which is essentially a portion of your hard disk that can be used as RAM. On an older solid-state drive, it can take around 3 milliseconds to access data. Compare that with the roughly 62.9 nanoseconds it takes to access data in RAM, and you quickly see why you don’t want to use your hard disk as RAM: you would slow down memory access by a factor of roughly 48,000 (3 ms / 62.9 ns).

So why are we spending this much time on the CPU and its caches? Because CFD applications are memory-hungry applications, and we need to manage that efficiently. Consider the following two scenarios:

// loop 1, loop order: i, j, k
for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
    for (int k = 0; k < 1000; ++k)
      mat[i][j][k] = ... // actually not important

and

// loop 2, loop order: k, j, i
for (int k = 0; k < 1000; ++k)
  for (int j = 0; j < 1000; ++j)
    for (int i = 0; i < 1000; ++i)
      mat[i][j][k] = ... // actually not important

Which of these two loops will perform faster? The first one will. Why? In C/C++, a multi-dimensional array (here we have a three-dimensional array) is laid out in memory such that the entries of the last dimension (here the third, i.e. the values accessed by k) are closest together. Whenever you load memory, you bring in a so-called cache line, which is not just the one value of your array you asked for, but a number of values that sit next to each other in memory. So, when we load data, we actually get more than we asked for: the neighbouring values are brought into the cache alongside the one we requested.

If we loop over that data in such a way that the third index changes fastest, i.e. it is the innermost loop (as in the first example) and not the outermost loop (as in the second), then we benefit from the data that has already been loaded and suffer far fewer cache misses. To give you an idea: loading memory from the L1 cache into the registers takes about 2-3 clock cycles, loading memory from L2 into L1 about 10 clock cycles, and from L3 into L2 about 20-30 clock cycles. If we keep in mind that the calculation we actually want to perform, i.e. adding or multiplying numbers together, only takes about 1 clock cycle, we really want to make sure that memory access is optimised in our code.
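
If you want to see this effect yourself, here is a sketch that times both loop orders on a flattened 3D array (a plain std::vector indexed by hand; the array size and variable names are just illustrative). Compile it without aggressive optimisation (e.g. -O1) to see the raw effect, because, as discussed below, the compiler may otherwise interchange the loops for you. The exact numbers will depend on your hardware, but the version with k innermost should be noticeably faster.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  const int N = 200;  // 200^3 doubles, roughly 64 MB on the heap
  std::vector<double> mat(static_cast<std::size_t>(N) * N * N, 0.0);

  // entry (i, j, k) is stored at i*N*N + j*N + k, so k is the contiguous (fastest-changing) index
  auto index = [N](int i, int j, int k) {
    return (static_cast<std::size_t>(i) * N + j) * N + k;
  };

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        mat[index(i, j, k)] = i + j + k;  // contiguous access, few cache misses
  auto middle = std::chrono::steady_clock::now();

  for (int k = 0; k < N; ++k)
    for (int j = 0; j < N; ++j)
      for (int i = 0; i < N; ++i)
        mat[index(i, j, k)] = i + j + k;  // strided access, many cache misses
  auto end = std::chrono::steady_clock::now();

  std::cout << "k innermost: " << std::chrono::duration<double>(middle - start).count() << " s" << std::endl;
  std::cout << "i innermost: " << std::chrono::duration<double>(end - middle).count() << " s" << std::endl;

  return 0;
}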

The roofline memory model

Most people will not know about caches, and that is fine, but as CFD developers we need to understand memory at least in its basics, and now you do. Here is another little secret: most applications are actually memory-bound, i.e. if you were to measure the performance of your application (typically done in GFLOPS, or (giga) floating-point operations per second, i.e. how many floating-point operations you are able to do per second), you wouldn’t be operating at the maximum throughput of your CPU but rather be limited by the rate at which you can load memory. This is also illustrated in the figure below (taken from here); due to the shape of this graph, it is referred to as the roofline model.

In this figure, we can see that we only reach peak performance if we perform enough (floating-point) operations (FLOPS) for each byte that we load. In other words, we want to maximise the number of FLOPS and minimise the amount of memory that is loaded, and one way of doing this is to avoid cache misses where possible (you can’t avoid cache misses altogether, as at some point you have to load new data into your caches; you just want to minimise them where you can).
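
As a rough, back-of-the-envelope illustration (the bandwidth figure below is just an assumed example): an update such as x[i] = x[i] + dt * rhs[i] performs 2 floating-point operations per element but moves 3 doubles, i.e. 24 bytes (two loads and one store), giving an arithmetic intensity of roughly 0.08 floating-point operations per byte. With a memory bandwidth of, say, 20 GB/s, that caps the loop at around 1.7 GFLOPS, far below what a modern core could compute, which puts us firmly on the memory-bound (sloped) part of the roofline.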

The good news is that there aren’t that many places where we can mess up our memory pipeline, and the compiler does a good job of optimising most memory deficiencies in our code for us automatically. For example, going back to the two loops we looked at above: if you were to compile the code with optimisation turned on, you would likely get the same performance, with the same number of cache misses, for both. Why? Because the compiler tries to optimise your code to reduce cache misses, and in this case it can do so simply by changing the order of the loops. It will only do that, however, if it can guarantee that it will not change the outcome of your program. If there is just a slight doubt, the compiler will not make any changes, and so more complex loops will be more difficult to optimise; hence, it is a good idea to do our best to help the compiler do its job.

By now, you may start to see why using a compiled language is a real advantage. The compiler does all sorts of memory optimisation in the background by rewriting our code into a form that is not necessarily readable but performant. We keep our readable code and just pay a small cost by waiting for the compiler to do its job. For smaller applications, that cost is barely noticeable, but once your code has 10,000-100,000 lines, you really start to feel the pain, waiting minutes for the compilation to finish. There are ways around that, by compiling your code into smaller chunks and then linking all of these together, so this is not necessarily a hard limitation. Interpreted languages such as Python do not provide us with a comparable mechanism to optimise our code when it is executed. The best we can do is to use a package that is implemented in C or C++ and compiled into machine code. This allows us to get fast performance, but it also means that we are limited to the syntax the package provides us. This is not necessarily a limitation, but it is certainly worth keeping in mind.

Summary

This is really all there is to memory, in a nutshell. Of course, we have skimmed over some details here, and you can make any topic arbitrarily more complex, but you should have a good overview of the matter now. If you are intrigued and want to know more, I can only recommend the book by Georg Hager and Gerhard Wellein: Introduction to High-Performance Computing for Scientists and Engineers. There is a lot of information in this book, which takes a more practical and applied focus, but you only really need to read the first three chapters. Everything afterwards deals with parallelisation and how to make it more efficient, but that is a more advanced topic we don’t have to worry about for now (but I’m sure we will circle back to it in the future).


Tom-Robin Teschner is a senior lecturer in computational fluid dynamics and course director for the MSc in computational fluid dynamics and the MSc in aerospace computational engineering at Cranfield University.