Relocation Semantics in the HPX Library

Google Summer of Code 2023 @ HPX by the Ste||ar group

Isidoros Tsaousis-Seiras
Aristotle University of Thessaloniki
Commits on GitHub

Motivation

HPX provides highly optimized facilities to ensure maximal resource utilization. However, there is still performance to be gained when moving objects around in memory. Proposal P1144 introduces the idea of relocation: a facility that hides the boilerplate of relocating objects and optimizes it when possible. Essentially, when relocating a buffer of objects, the uninitialized_relocate algorithm determines whether it is valid to replace the many move-constructor and destructor calls with a single memcpy(). This improvement benefits key primitives like vector.reserve(), leading to speedups in higher-level programs.

Approach

There are two main steps to implement relocation semantics. First, we need to determine if a type is trivially relocatable. Second, we need to implement the relocation mechanism. There is also a third step, which is specific to HPX, and that is to implement the parallelized version of the procedure.

1) Detecting trivial relocatability

Most, but not all, types happen to be agnostic to their location in memory; that is, they can be moved around without any special care. For a complete list of such types, see here. To understand this better, let's delve into some examples.


For example, an std::vector object is trivially relocatable, because it is just a thin wrapper around a pointer to a heap-allocated buffer. This is what trivially relocating it looks like:

Trivial relocation of a vector

It is perfectly safe to memcpy an std::vector to a new location and deallocate the memory holding the old object. That is because none of its members refer to the address of the vector object itself; they only point to the separately allocated buffer.
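To make the picture concrete, here is a minimal, self-contained sketch of the idea. Note that memcpy-ing a std::vector like this is not sanctioned by the current standard (that is exactly what P1144 seeks to legitimize); it relies on the behavior of mainstream implementations and is shown purely for illustration.

#include <cstdio>
#include <cstring>
#include <memory>
#include <new>
#include <vector>

int main()
{
    // Conceptual demo only: it illustrates what "trivially relocating" the
    // object means, not code you should write today.
    using vec = std::vector<int>;

    void* old_storage = ::operator new(sizeof(vec));
    vec* old_obj = ::new (old_storage) vec{1, 2, 3};

    alignas(vec) unsigned char new_storage[sizeof(vec)];
    std::memcpy(new_storage, old_obj, sizeof(vec));    // relocate the raw bytes
    ::operator delete(old_storage);                    // free the old object WITHOUT running its destructor

    vec* new_obj = std::launder(reinterpret_cast<vec*>(new_storage));
    std::printf("%d\n", (*new_obj)[2]);                // the heap buffer now belongs to the relocated object
    std::destroy_at(new_obj);                          // destroy only the relocated object
}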


In contrast, an std::string is not trivially relocatable (in libstdc++), because its data pointer can point into a buffer stored inside the object itself (the small string optimization). This is what attempting to trivially relocate it would look like:

Invalid trivial relocation of an std::string

The copied object will point to the same buffer as the original, so when the memory holding the original object is freed, the copied one will be left with a dangling pointer.


Another example of a non-trivially relocatable type is std::list (in both libc++ and libstdc++). It is implemented as a circular linked list whose sentinel node lives inside the list object itself, so the last node points back into the object. This is what attempting to trivially relocate it would look like:

Invalid trivial relocation of an std::list

If we were to memcpy the list object to a new location, the last node would still point back into the old object, so after freeing it, that pointer would be dangling.


While there's no definitive sign that a type is trivially relocatable, certain common cases can be identified:

  1. Types that are trivially copyable
  2. Types without special member functions and composed solely of trivially relocatable sub-members

We can test for the first case using std::is_trivially_copyable. The second case cannot be detected with today's C++, and even if it could, it would be a very limited solution. For example, neither case covers std::vector, which is trivially relocatable even though it is not trivially copyable and does have special member functions.
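A quick self-contained sketch of where the detectable case ends (the point type is made up for illustration):

#include <type_traits>
#include <vector>

struct point { int x; int y; };    // case 1: trivially copyable, hence safe to relocate by memcpy

static_assert(std::is_trivially_copyable_v<point>);

// std::vector is trivially relocatable in practice, yet the test does not catch it:
// its destructor and copy/move operations are non-trivial.
static_assert(!std::is_trivially_copyable_v<std::vector<int>>);

int main() {}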

To cover all cases, P1144 proposes letting the user declare that a type is trivially relocatable.

To achieve this, HPX offers the following interface, implemented here:

HPX_DECLARE_TRIVIALLY_RELOCATABLE(custom_type);

Which expands to:

template <> struct hpx::is_trivially_relocatable<custom_type> : std::true_type {};

Here, the user is specializing hpx::is_trivially_relocatable<T> for their type. By default, the trait is true for trivially copyable types and false otherwise. That way, the simple cases are optimized automatically, while users can opt in their own types to benefit further.
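As a concrete illustration, here is a minimal self-contained sketch that mirrors this mechanism. The real trait and macro are the HPX ones named above; the hpx_sketch namespace, SKETCH_DECLARE_TRIVIALLY_RELOCATABLE, and my_buffer are made up for this example.

#include <cstddef>
#include <type_traits>

namespace hpx_sketch {
    // Default: a type is considered trivially relocatable iff it is trivially copyable.
    template <typename T>
    struct is_trivially_relocatable : std::is_trivially_copyable<T> {};
}

#define SKETCH_DECLARE_TRIVIALLY_RELOCATABLE(T)                                \
    template <>                                                                \
    struct hpx_sketch::is_trivially_relocatable<T> : std::true_type {};

// A type that owns a heap buffer: not trivially copyable, but nothing in it
// refers to the object's own address, so its author can opt in.
struct my_buffer
{
    int* data = nullptr;
    std::size_t size = 0;
    ~my_buffer() { delete[] data; }
};
SKETCH_DECLARE_TRIVIALLY_RELOCATABLE(my_buffer)

static_assert(hpx_sketch::is_trivially_relocatable<int>::value);        // covered by the default
static_assert(!std::is_trivially_copyable_v<my_buffer>);
static_assert(hpx_sketch::is_trivially_relocatable<my_buffer>::value);  // opted in via the macro

int main() {}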


It is important to note that trivial relocatability is implementation-specific, so HPX cannot automatically optimize the relocation of standard types like std::vector. As the standard lacks a way to denote trivial relocatability, the only way to confirm that such an optimization is safe is to examine each STL vendor's implementation and label its types as trivially relocatable. That, however, is impractical due to the maintenance burden it would entail. Hence, users must designate types as trivially relocatable themselves.

2) Applying relocation

Let t_buffer be a buffer of n objects of type T, and t_buffer_new another buffer large enough to fit n objects of type T.

The desired state after the relocation is:

  1. t_buffer_new contains the objects originally in t_buffer
  2. t_buffer can be repurposed or freed without further cleanup


The traditional approach is to move each object from t_buffer to t_buffer_new and destroy the moved-from objects in t_buffer.


This amounts to n calls to T's move constructor and n calls to its destructor. Depending on the type, these operations can be expensive. For example, libc++'s std::vector move constructor sets the data pointer of the moved-from object to nullptr and its size and capacity to 0.
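In code, the generic path looks roughly like the following sketch (relocate_by_move is a made-up name and exception handling is omitted; this is not HPX's actual implementation):

#include <cstddef>
#include <memory>
#include <new>
#include <utility>

// Generic path: n move-constructions into the destination, followed by
// n destructions of the moved-from sources.
template <typename T>
void relocate_by_move(T* src, std::size_t n, T* dst)
{
    for (std::size_t i = 0; i != n; ++i)
    {
        ::new (static_cast<void*>(dst + i)) T(std::move(src[i]));  // move-construct into dst
        std::destroy_at(src + i);                                  // destroy the moved-from object
    }
}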

But since we know that std::vector is a trivially relocatable type, we can do better.


We can simply bitwise-copy the entire buffer from t_buffer to t_buffer_new. This amounts to a single call to memcpy for n*sizeof(T) bytes.
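The corresponding sketch of the optimized path (relocate_by_memcpy is likewise a made-up name):

#include <cstddef>
#include <cstring>

// Fast path: one bulk byte copy, no constructors or destructors run.
// Afterwards the source region must be treated as raw storage and never
// accessed or destroyed again.
template <typename T>
void relocate_by_memcpy(T* src, std::size_t n, T* dst)
{
    std::memcpy(static_cast<void*>(dst), static_cast<const void*>(src),
        n * sizeof(T));
}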

The interface the user sees is:

uninitialized_relocate_n(t_buffer, n, t_buffer_new);

So the implementation details are hidden from them. Internally, the optimized path will be chosen if all of the following conditions are met:

  1. T is trivially relocatable
  2. T is not volatile (memcpy cannot be used on volatile objects)
  3. The buffer is contiguous

As seen here.
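Combining the pieces, the selection logic amounts to something like the sketch below. It reuses the helpers and the stand-in trait from the earlier snippets; the real HPX implementation linked above handles more cases, and the contiguity requirement is satisfied here by taking raw pointers.

#include <cstddef>
#include <type_traits>

template <typename T>
T* uninitialized_relocate_n_sketch(T* src, std::size_t n, T* dst)
{
    if constexpr (hpx_sketch::is_trivially_relocatable<T>::value &&
        !std::is_volatile_v<T>)
    {
        relocate_by_memcpy(src, n, dst);    // single memcpy
    }
    else
    {
        relocate_by_move(src, n, dst);      // n moves + n destructions
    }
    return dst + n;
}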

Testing this for trivially relocatable STL types, we see that the optimized path is substantially faster.

Testing was done on an Intel Xeon E5-2630 v4 machine
The code used for testing and the full results can be found here

Important note:

The user must not access or destroy the objects in the original buffer. They have either already been destroyed (on the generic path) or their bytes now belong to the objects in the new buffer (on the memcpy path), so destroying them would corrupt the state of the relocated objects.


3) HPX bindings and parallelization

It's easy to see that both the traditional and the optimized approach are embarrassingly parallel: every chunk of the buffer can be relocated independently of the others. Using HPX's parallelization facilities, we can easily turn this into a ready-to-use parallel algorithm. Specifically, the parallelization was achieved using the internal partitioner_with_cleanup facility. The implementation can be seen here.
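To make "embarrassingly parallel" concrete, here is a generic sketch that splits the fast path into independent chunks. This is not how HPX implements it (HPX uses partitioner_with_cleanup and its own executors); plain std::thread is used here purely for illustration, and parallel_relocate_by_memcpy is a made-up name.

#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Each worker relocates one independent chunk of a trivially relocatable buffer.
template <typename T>
void parallel_relocate_by_memcpy(T* src, std::size_t n, T* dst, std::size_t num_workers)
{
    if (num_workers == 0)
        num_workers = 1;
    std::size_t const chunk = (n + num_workers - 1) / num_workers;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk)
    {
        std::size_t const end = begin + chunk < n ? begin + chunk : n;
        workers.emplace_back([=] {
            std::memcpy(static_cast<void*>(dst + begin),
                static_cast<void const*>(src + begin), (end - begin) * sizeof(T));
        });
    }
    for (std::thread& w : workers)
        w.join();
}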

The parallelization yields high observable speedup, as seen in the graph below.

Testing was done on an Intel Xeon E5-2630 v4 machine

Implementation

Code additions in the above PRs are accompanied by matching tests.

Future work

The P1144 proposal is still active, so it's essential to monitor its progress and integrate relevant updates into HPX. While the proposal awaits formal acceptance, promoting its adoption in the community and applying its advantages within HPX itself is a great way to help improve our software.

Some of the features that are still missing from the HPX implementation are:

Special thanks

My deepest thanks go to Giannis Gonidelis, Hartmut Kaiser, and Panagiotis Syskakis for their mentorship and invaluable insights throughout this project.

Additionally, I'd like to express my gratitude to P1144's author, Arthur O'Dwyer, not only for his pivotal code reviews and insights but also for pioneering the proposal itself.