Relocation Semantics in the HPX Library

Google Summer of Code 2023 @ HPX by the Ste||ar group

Isidoros Tsaousis-Seiras
Aristotle University of Thessaloniki
Commits on GitHub

Motivation

HPX provides highly optimized facilities to ensure maximal resource utilization. However, there is still performance to be gained when moving objects around in memory. Proposal P1144 introduces the idea of relocation: a facility that hides the boilerplate of relocating objects and optimizes it when possible. Essentially, when relocating a buffer of objects, the uninitialized_relocate algorithm determines whether it is valid to replace the many move-constructor and destructor calls with a single memcpy(). This improvement benefits key primitives like vector.reserve(), leading to speedups in higher-level programs.

Approach

There are two main steps to implement relocation semantics. First, we need to determine if a type is trivially relocatable. Second, we need to implement the relocation mechanism. There is also a third step, which is specific to HPX, and that is to implement the parallelized version of the procedure.

1) Detecting trivial relocatability

Most, but not all, types happen to be agnostic to their location in memory; that is, they can be moved around without any special care. For a complete list of such types, see here. To understand this better, let's delve into some examples.


For example, an std::vector object is trivially relocatable, because it is just a thin wrapper around a pointer to a heap-allocated buffer. This is what trivially relocating it looks like:

Trivial relocation of a vector

It is perfectly safe to memcpy an std::vector to a new location and deallocate the memory holding the old object. That is because none of its members refer to the address of the vector object itself; they only point to the separately allocated buffer.
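To make the picture concrete, here is a minimal, self-contained sketch of the idea. Note that memcpy-ing a std::vector like this is not sanctioned by the current standard (that is exactly what P1144 seeks to legitimize); it relies on the behavior of mainstream implementations and is shown purely for illustration.

#include <cstdio>
#include <cstring>
#include <memory>
#include <new>
#include <vector>

int main()
{
    // Conceptual demo only: it illustrates what "trivially relocating" the
    // object means, not code you should write today.
    using vec = std::vector<int>;

    void* old_storage = ::operator new(sizeof(vec));
    vec* old_obj = ::new (old_storage) vec{1, 2, 3};

    alignas(vec) unsigned char new_storage[sizeof(vec)];
    std::memcpy(new_storage, old_obj, sizeof(vec));    // relocate the raw bytes
    ::operator delete(old_storage);                    // free the old object WITHOUT running its destructor

    vec* new_obj = std::launder(reinterpret_cast<vec*>(new_storage));
    std::printf("%d\n", (*new_obj)[2]);                // the heap buffer now belongs to the relocated object
    std::destroy_at(new_obj);                          // destroy only the relocated object
}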


In contrast, an std::string is not trivially relocatable (in libstdc++), because its data pointer can point into a buffer stored inside the object itself (the small string optimization). This is what attempting to trivially relocate it would look like:

Invalid trivial relocation of an std::string

The copied object will point to the same buffer as the original, so when the memory holding the original object is freed, the copied one will be left with a dangling pointer.


Another example of a non-trivially relocatable type is std::list (in both libc++ and libstdc++). It is implemented as a circular linked list whose sentinel node lives inside the list object itself, so the last node points back into the object. This is what attempting to trivially relocate it would look like:

Invalid trivial relocation of an std::list

If we were to memcpy the list object to a new location, the last node would still point back into the old object, so after freeing it, that pointer would be dangling.


While there's no definitive sign that a type is trivially relocatable, certain common cases can be identified:

  1. Types that are trivially copyable
  2. Types without special member functions and composed solely of trivially relocatable sub-members

We can test for the first case using std::is_trivially_copyable. The second case cannot be detected with today's C++, and even if it could, it would be a very limited solution. For example, neither case covers std::vector, which is trivially relocatable even though it is not trivially copyable and does have special member functions.
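A quick self-contained sketch of where the detectable case ends (the point type is made up for illustration):

#include <type_traits>
#include <vector>

struct point { int x; int y; };    // case 1: trivially copyable, hence safe to relocate by memcpy

static_assert(std::is_trivially_copyable_v<point>);

// std::vector is trivially relocatable in practice, yet the test does not catch it:
// its destructor and copy/move operations are non-trivial.
static_assert(!std::is_trivially_copyable_v<std::vector<int>>);

int main() {}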

To cover all cases, P1144 proposes letting the user declare that a type is trivially relocatable.

To achieve this, HPX offers the following interface, implemented here:

HPX_DECLARE_TRIVIALLY_RELOCATABLE(custom_type);

Which expands to:

template <> struct hpx::is_trivially_relocatable<custom_type> : std::true_type {};

Here, the user is specializing hpx::is_trivially_relocatable<T> for their type. By default, the trait is true for trivially copyable types and false otherwise. That way, the simple cases are optimized automatically, while users can opt in their own types to benefit further.
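As a concrete illustration, here is a minimal self-contained sketch that mirrors this mechanism. The real trait and macro are the HPX ones named above; the hpx_sketch namespace, SKETCH_DECLARE_TRIVIALLY_RELOCATABLE, and my_buffer are made up for this example.

#include <cstddef>
#include <type_traits>

namespace hpx_sketch {
    // Default: a type is considered trivially relocatable iff it is trivially copyable.
    template <typename T>
    struct is_trivially_relocatable : std::is_trivially_copyable<T> {};
}

#define SKETCH_DECLARE_TRIVIALLY_RELOCATABLE(T)                                \
    template <>                                                                \
    struct hpx_sketch::is_trivially_relocatable<T> : std::true_type {};

// A type that owns a heap buffer: not trivially copyable, but nothing in it
// refers to the object's own address, so its author can opt in.
struct my_buffer
{
    int* data = nullptr;
    std::size_t size = 0;
    ~my_buffer() { delete[] data; }
};
SKETCH_DECLARE_TRIVIALLY_RELOCATABLE(my_buffer)

static_assert(hpx_sketch::is_trivially_relocatable<int>::value);        // covered by the default
static_assert(!std::is_trivially_copyable_v<my_buffer>);
static_assert(hpx_sketch::is_trivially_relocatable<my_buffer>::value);  // opted in via the macro

int main() {}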


It is important to note that trivial relocatability is implementation-specific, so HPX cannot automatically optimize the relocation of standard types like std::vector. As the standard lacks a way to denote trivial relocatability, the only way to confirm that such an optimization is safe is to examine each STL vendor's implementation and label its types as trivially relocatable. That, however, is impractical due to the maintenance burden it would entail. Hence, users must designate types as trivially relocatable themselves.

2) Applying relocation

Let t_buffer be a buffer of n objects of type T, and t_buffer_new another buffer large enough to fit n objects of type T.

The desired state after the relocation is:

  1. t_buffer_new contains the objects originally in t_buffer
  2. t_buffer can be repurposed or freed without further cleanup


The traditional approach is to move each object from t_buffer to t_buffer_new and destroy the moved-from objects in t_buffer.


This amounts to n calls to T's move constructor and n calls to its destructor. Depending on the type, these operations can be expensive. For example, libc++'s std::vector move constructor sets the data pointer of the moved-from object to nullptr and its size and capacity to 0.
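In code, the generic path looks roughly like the following sketch (relocate_by_move is a made-up name and exception handling is omitted; this is not HPX's actual implementation):

#include <cstddef>
#include <memory>
#include <new>
#include <utility>

// Generic path: n move-constructions into the destination, followed by
// n destructions of the moved-from sources.
template <typename T>
void relocate_by_move(T* src, std::size_t n, T* dst)
{
    for (std::size_t i = 0; i != n; ++i)
    {
        ::new (static_cast<void*>(dst + i)) T(std::move(src[i]));  // move-construct into dst
        std::destroy_at(src + i);                                  // destroy the moved-from object
    }
}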

But since we know that std::vector is a trivially relocatable type, we can do better.


We can simply bitwise-copy the entire buffer from t_buffer to t_buffer_new. This amounts to a single call to memcpy for n*sizeof(T) bytes.
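The corresponding sketch of the optimized path (relocate_by_memcpy is likewise a made-up name):

#include <cstddef>
#include <cstring>

// Fast path: one bulk byte copy, no constructors or destructors run.
// Afterwards the source region must be treated as raw storage and never
// accessed or destroyed again.
template <typename T>
void relocate_by_memcpy(T* src, std::size_t n, T* dst)
{
    std::memcpy(static_cast<void*>(dst), static_cast<const void*>(src),
        n * sizeof(T));
}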

The interface the user sees is:

uninitialized_relocate_n(t_buffer, n, t_buffer_new);

So the implementation details are hidden from them. Internally, the optimized path will be chosen if all of the following conditions are met:

  1. T is trivially relocatable
  2. T is not volatile (memcpy cannot be used on volatile objects)
  3. The buffer is contiguous

As seen here.
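Combining the pieces, the selection logic amounts to something like the sketch below. It reuses the helpers and the stand-in trait from the earlier snippets; the real HPX implementation linked above handles more cases, and the contiguity requirement is satisfied here by taking raw pointers.

#include <cstddef>
#include <type_traits>

template <typename T>
T* uninitialized_relocate_n_sketch(T* src, std::size_t n, T* dst)
{
    if constexpr (hpx_sketch::is_trivially_relocatable<T>::value &&
        !std::is_volatile_v<T>)
    {
        relocate_by_memcpy(src, n, dst);    // single memcpy
    }
    else
    {
        relocate_by_move(src, n, dst);      // n moves + n destructions
    }
    return dst + n;
}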

Testing this for trivially relocatable STL types, we see that the optimized path is substantially faster.

Testing was done on an Intel Xeon E5-2630 v4 machine
The code used for testing and the full results can be found here

Important note:

The user must not access or destroy the objects in the original buffer. They have either already been destroyed (on the generic path) or their bytes now belong to the objects in the new buffer (on the memcpy path), so destroying them would corrupt the state of the relocated objects.


3) HPX bindings and parallelization

It's easy to see that both the traditional and the optimized approach are embarrassingly parallel: every chunk of the buffer can be relocated independently of the others. Using HPX's parallelization facilities, we can easily turn this into a ready-to-use parallel algorithm. Specifically, the parallelization was achieved using the internal partitioner_with_cleanup facility. The implementation can be seen here.
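To make "embarrassingly parallel" concrete, here is a generic sketch that splits the fast path into independent chunks. This is not how HPX implements it (HPX uses partitioner_with_cleanup and its own executors); plain std::thread is used here purely for illustration, and parallel_relocate_by_memcpy is a made-up name.

#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Each worker relocates one independent chunk of a trivially relocatable buffer.
template <typename T>
void parallel_relocate_by_memcpy(T* src, std::size_t n, T* dst, std::size_t num_workers)
{
    if (num_workers == 0)
        num_workers = 1;
    std::size_t const chunk = (n + num_workers - 1) / num_workers;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk)
    {
        std::size_t const end = begin + chunk < n ? begin + chunk : n;
        workers.emplace_back([=] {
            std::memcpy(static_cast<void*>(dst + begin),
                static_cast<void const*>(src + begin), (end - begin) * sizeof(T));
        });
    }
    for (std::thread& w : workers)
        w.join();
}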

The parallelization yields high observable speedup, as seen in the graph below.

Testing was done on an Intel Xeon E5-2630 v4 machine

Implementation

Code additions in the above PRs are accompanied by matching tests.

Future work

The P1144 proposal is still active, so it's essential to monitor its progress and integrate relevant updates into HPX. While the proposal awaits formal acceptance, promoting its adoption in the community and applying its advantages within HPX itself is a great way to help improve our software.

Some of the features that are still missing from the HPX implementation are:

Special thanks

My deepest thanks go to Giannis Gonidelis, Hartmut Kaiser, and Panagiotis Syskakis for their mentorship and invaluable insights throughout this project.

Additionally, I'd like to express my gratitude to P1144's author, Arthur O'Dwyer, not only for his pivotal code reviews and insights but also for pioneering the proposal itself.