
C++ std::async


In order to facilitate spawning concurrent tasks, C++11 provides us with std::async. I will touch on many topics, but I’ll try to focus on the so-called launch policies defined under std::launch. std::async returns a std::future object, which is basically a ‘waitable’ proxy that returns the result of the scheduled task.

Test environments:
Linux: Ubuntu 12.04 x64, gcc 4.6.3, libstdc++ 6-4.6
Windows: Windows 7 Pro x64, Visual Studio 2012 (w/o CTP)

The what

There are 2 launch policies (LP) defined by the standard:

  • std::launch::async, which requires the task to be launched immediately, as if on a new thread;
  • std::launch::deferred (aka sync), which uses the caller thread and executes the task only when the result of the future is requested.

For simplicity and unambiguity I will refer to std::launch::async as eager (task), and std::launch::deferred as lazy (task).


A lazy task will be started iff one of two methods of the future is called: wait() or get(). Using wait_for() or wait_until() (the so-called timed waiting functions) results in the status future_status::deferred being returned, and the task will not be started.

In the case of an eager task, execution starts immediately and every call to wait()/get() potentially blocks. An important thing to notice is that if the underlying thread finishes execution, it is released before any method is called. This means that the thread is kept only as long as it’s needed, and no longer.

Default policy

When you don’t explicitly specify the LP, you leave the choice to the implementation. Let’s check how different implementations decide on the LP.


No matter how many tasks you schedule, they will all be lazy. In other words, Linux doesn’t provide any default parallelism – if you want it, you set it.


Everything is run in parallel (eagerly). But won’t it be very slow to run so many threads at once? Microsoft found its way around this by using a thread pool underneath, which tries not to create too many threads too fast. This will be discussed in more detail later.


A very important aspect of future objects is their destructor. By design, futures don’t wait (block) for the result to be made available, so you are not required to call wait() or anything else. Unfortunately, when using an eager task, the destructor of the future is required to wait for the task to finish!

This leads to the following gotcha that prevents achieving parallel execution:

std::async(std::launch::async, ...); // Returns std::future<...> which is immediately destroyed.
std::async(std::launch::async, ...); // Waits for the previous future's destructor to join before starting the new task.

This issue was raised by Herb Sutter, and hopefully it will get fixed in a new revision of the standard. For now you should use the following “idiom”:

auto a = std::async(std::launch::async, ...);
auto b = std::async(std::launch::async, ...);
auto c = std::async(std::launch::async, ...);
// Destructors of c, b, a (in that order) join here, but all of them were still able to execute concurrently.


What about thread pools, you ask? In theory there is nothing in the standard regarding TPs, so every eager task (lazy tasks don’t care about multiple threads, as they always run on the calling thread) should be run immediately on a new thread. Reality is, however, different.


g++ (based on pthreads) follows the standard to the letter: if you request 1000 eager tasks, you are going to get 1000 threads. It might not be the most efficient approach, but at least you definitely won’t run into deadlock problems where a thread raising an event – for which hundreds of threads are waiting – can’t be scheduled due to the limit of the pool.


Visual Studio (2012) is trickier, as it uses a TP. I don’t know the internals of Microsoft’s creation, but let’s observe its behavior ourselves.


The observations we should draw are these (some are based on a more detailed view of the sample data, so you will have to take my word for it):

  • Even though the system could schedule more threads, it never assigns 1000 of them at once, but ramps up gradually. I suspect this is the standard behavior of a TP, which schedules a new thread only when it sees little chance of reusing an already existing one.
  • There are 2 inflection points that we are interested in, at 75 and at 325 threads. Past these values the TP seems less willing to schedule new threads, and there is some delay between the request to std::async and the actual thread creation.
  • When launching the first eager task, the system preallocates ~8 threads so that it is ready for new potential tasks.
  • After all tasks have finished their job, ~16 threads are left as an idle pool waiting for new tasks.

Now let’s take a look at std::thread scheduling.


This case is very peculiar. Not only do we reach the 1k thread count, we overshoot it! The threads are also scheduled much faster, without introducing any considerable delays (as was the case with std::async). The last curiosity is that after joining all the threads, there were still 4 threads waiting, so the underlying system still behaved somewhat like a regular TP.

The why

Just before we start digging deeper, there is a reasonable question to ask ourselves: why use futures at all?

  • No need to manually wrap every call in a std::thread and handle passing the parameters and retrieving the result. Exceptions are also handled nicely here.
  • Clean abstraction representing an asynchronous result which might not be immediately available.

While the first reason makes us code faster, the second one is much more important. By providing a future, you inform the users of your code that they should not expect the result immediately.

Why is this important? Because it allows others to organize their code properly. When they see int foo(), nothing changes and they work as usual. On the other hand, seeing std::future<int> foo() makes them:

  • schedule execution of some other tasks (parallelism ftw!) before retrieving this result;
  • attach the result to a group of objects the method will wait on, rendering a nice “busy animation” while waiting;
  • compose it with other tasks (when std::future::then()-like methods are ready);
  • think twice before calling this method over and over (as it might be quite slow);
  • simply call get() right away to work in the fashion they are used to;
  • do other crazy stuff with it that I can’t predict.


There are several ways of implementing asynchronous results:

  • provide an additional parameter specifying a callback function;
  • return a handle which you can query/wait on for completion, and with which you can go back to the creator object to get the result.

Callbacks are the bread and butter of asynchronous programming nowadays, but are they the best idea? I tend to disagree.

First, most of the time callbacks are tied to a specific object which has to exist at the time of the call, which might not always be the case. In order for everything to work smoothly, you either have to detach your callback (which is hardly possible) or check (in every callback!) for the existence of the parent object. This is both tedious and sometimes hard to solve, as it’s hard to decide whether pretending that the callback didn’t happen is the best course of action.

Second, callbacks are called by the server, not by the client. You can be in an arbitrary program state, and just when you least expect it you will have to deal with the callback being called. This is like a goto coming from anywhere, providing you with happy hours of debugging a parallel program and deciphering the logs just to find the cause of failure of your little callback.

Third, it’s hard to work with callbacks. You can’t simply render a waiting animation while waiting for several asynchronous results. You have to manually create some synchronization primitive on which you will wait until all the scheduled tasks have finished.


IMHO handles are a step in the right direction. You can wait on the proxy or query it every time your inner program loop executes, so there are no nondeterministic gotos. Unfortunately, it’s still a bit of a chore to keep the parent object around and to get the result from it. You can make the object static, but then you have to deal with all the “pleasures” of concurrent threads and non-trivial construction/destruction.

The when

Return value

In the next chapter, “The which”, I describe when to use eager and lazy tasks. You know what? The more eager the task, the more likely its return value should be a future. And vice versa: the results of lazy tasks are less likely to be worth exposing as futures.

So why ever use lazy tasks? Laziness and eagerness are an implementation secret. You present an interface of asynchronous results. Depending on various conditions, these results might truly be computed in parallel, or not. That gives you great flexibility.


float calc(float a, float b, float c);

Imagine that the calculation of a, b and c might take a lot of time. Because of that, you might be tempted to do something like this:

float calc(std::future<float> a, std::future<float> b, std::future<float> c);

First, how do you know which of the arguments is in fact slow to calculate? It might vary from client to client, so you would be forced to ‘futurize’ all your parameters. This leads to an interface which is not pleasant to use, as every argument would have to be wrapped like this:

calc(std::async(std::launch::deferred, []{return a;}), ...);

On the other hand, without futures you require the client to calculate all the parameters beforehand, while your code could (in the meantime, while the client’s eager tasks were running) schedule its own tasks and use the parallel power of the machine. By “forcing” clients to use asynchronous parameters, you might get much better efficiency, as the client code now starts to use the power of the machine.

My solution is not to use multiple asynchronous parameters but a single wrapping class:

class Wrapper {
public:
    std::future<float> getA();
    std::future<float> getB();
    std::future<float> getC();
};

float calc(const Wrapper& w);

Now calc() can have several overloads, where every variation of Wrapper can carry a different combination of asynchronous and synchronous results, based on which you might behave differently.

Also, the Wrapper can be customized so that the user can either pass a float, which will automatically be turned into a lazy task, or a lambda, which will be turned into an eager task. This way the whole interface can be asynchronous, while clients can easily fill it with values without having to manually turn everything into futures.

The which

So the “big” question is: when should we use eager tasks, when lazy ones, and when leave it all to the implementation?

As I’ve previously described, Windows will run everything eagerly, while Linux will run everything lazily – there is no common balancing logic behind the scenes (“schedule up to N eagerly and then only lazily”). Because of this, we should control ourselves whether to create a lazy or an eager task, as otherwise the program might not behave uniformly at all when ported.


Let’s imagine that an architect defined the following interface:

class ISales {
public:
    virtual std::future<Result> forRegion(const std::string& region) = 0;
    virtual std::future<Result> forPartner(const std::string& partner) = 0;
    virtual std::future<Result> forPeriod(const Period& period) = 0;
};

Our class implements the interface, and we have to decide what the LP of each of the method results should be. Code dependent on ISales is expected to work in the following manner:

  1. Schedule the computation of all the data that will be requested by simply calling the for*() methods.
  2. [Optional] Calculate your stuff.
  3. get() the requested results.
  4. Finish calculations.


In order to decide on the LP, we have to define a set of indicators. None of the following indicators should be treated as absolute. You will have to balance and measure things before you end up with the proper LP.


Long running

As Microsoft proposed, long running tasks (longer than ~50 ms) should be made asynchronous (i.e., eager). The logic behind this statement is obvious: you don’t want to block your application. While your caller thread waits on another thread’s result, you are just wasting time-slice cycles which will be given to some other process. The shorter the period of time the caller has to wait, the better. Thus, scheduling long running tasks right when they are created provides the best results.

CPU bound

Waiting on an event/IO operation/network packet doesn’t consume CPU. Such operations should naturally be made eager, as the cost of the accompanying thread shouldn’t be large, while the caller thread would otherwise have to block for the same period of time as the helper thread (of the eager task).

Hardware concurrency

Don’t bite off more than you can chew. Is it reasonable to schedule yet another thread (eager task) when you know there are already a hundred threads running?

It’s meaningless to measure the currently running threads, though. First, the active threads might belong to a pool, so they are not actively doing anything, just waiting to be used. Second, even if those threads are in use, they might simply be waiting on something, not using the CPU.

Framework time

The only count available to you is the number of threads that can run in parallel, retrieved via std::thread::hardware_concurrency(). Keeping in mind what I said in the previous paragraph, I would suggest the following way of tuning the LP:

  1. Prepare a parameterized framework which can easily change the LP of created tasks. You would probably have to build a layer on top of std::async so that every task goes through it. When scheduling a task, you would provide an enumeration (with more gradations than std::launch) specifying how strongly you want it to be asynchronous. This is really important, as some tasks have to be run asynchronously; this is discussed in more detail later.
  2. Run tests on various machines (2/4/6/8 parallel threads supported) and note the results.
  3. Derive the optimal strategy for the provided hardware concurrency.
  4. Store the tuned parameters inside the program (configuration file) and ship it.

I believe that conservative usage of eager tasks should suffice in most cases, so you shouldn’t worry about this too much.

Don’t make the caller thread a lazy bastard!

A common mistake that many programmers make when playing with concurrency can be presented as follows:

auto a = std::async(std::launch::async, []{ return doFoo(); });
auto b = std::async(std::launch::async, []{ return doBar(); });
auto c = std::async(std::launch::async, []{ return doXyz(); });
return {a.get(), b.get(), c.get()};

The main thread does nothing but block! Instead of scheduling everything as eager tasks, let the main thread always do something in the meantime:

auto a = std::async(std::launch::async, []{ return doFoo(); });
auto b = std::async(std::launch::async, []{ return doBar(); });
auto c = doXyz();
return {a.get(), b.get(), c};

You could, in theory, apply this optimization to our interface, but it would be hard to determine whether the caller has used their thread enough. Forcing them to block because you thought they hadn’t done anything is rather risky and potentially limits the concurrency of your code.

Internal dependencies

One of the biggest dangers of thread pools is their limit on the number of concurrently running threads. Why is that a danger? Internal dependencies. Take a look at the following code:

WaitableObject w;
auto a0 = std::async(std::launch::async, [&w]{ w.wait(); /* do some stuff */ });
auto a1 = std::async(std::launch::async, [&w]{ /* do some stuff */ w.raiseEvent(); });

The problem here is that when a0 is scheduled, it takes the last available slot in the pool. The other threads are long running ones that deal with packets, user events, etc., and they won’t be joined till the end of the program. a1 won’t be scheduled, as there is no room for it, so a0 waits forever. In other words: deadlock.

This scenario isn’t very common, but when you multiply the number of threads and increase the complexity of inter-thread relations, you can easily deadlock yourself.

One solution to this problem is to make your functions pure (without any side effects on their parameters or global state), or at least to move all the synchronization into the future result. If a1 is to raise an event, it’s much better to pass its future to a0 so that a0 can wait on it. This way you have explicitly created a dependency chain, and it will be much easier to reason about the code.

The second solution is not to have a limit on the TP, and from my observations there doesn’t seem to be one in Microsoft’s implementation.

In the context of our discussion, it seems obvious that such interdependent tasks should be made eager, as a lazy task can easily be omitted and never run.


I hope I helped some of you understand the usage of std::async. I am eagerly awaiting the different patterns that programmers will create when dealing with asynchronous results. Until next time!