Posts Tagged ‘C++11x’

For loops in times of C++11 part 1

27/09/2014

Regular for

We all know for. It has been our friend since time immemorial. It’s simple to understand, it’s easy to convert into efficient assembly and it creates a new scope (and we do like scopes, don’t we?). Can we improve on it? Let’s talk about its deficiencies.

Note: I am assuming you understand that raw loops are something that should be used only at the lowest level of implementation, and that your logic should be based on algorithms.

Tediousness

Writing

for (int i = 0; i < MAX_FOO; ++i)

doesn’t seem to involve many keystrokes, but if you consider that you will write similar code many times in your career, the picture changes a little bit. Because of this, people create custom macros in Vim, or snippets in Visual Studio and other IDEs featuring similar functionality.

Type

for (? idx = 0; ... 

What should be the type of our index? That’s obvious! It should be the same as the type of the sentinel (aka the max value). If we get the type wrong, we get nice warnings from the compiler:

foo.cpp: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

That’s not a big issue, but why waste your precious time fixing such silly mistakes? And even if the conversion (promotion) between the types is acceptable, you may be paying the conversion cost on every loop iteration (unless the compiler optimizes it away).
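
For instance, a signed index compared against an unsigned container size is enough to trigger the warning:

#include <vector>

void example(const std::vector<int>& v) {
    // v.size() returns std::size_t (unsigned), while i is a signed int,
    // so every iteration compares signed with unsigned: -Wsign-compare
    for (int i = 0; i < v.size(); ++i) {
        // ...
    }
}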

Performance and risk

Let’s get to something really important: efficiency and safety. If you see the following code:

for (int i = 0; i < slotMgr.maxSlots(); ++i) {
    ...
    ...
    ...
}

Can you quickly tell, without a doubt, whether:

  • the index (i) won’t be modified inside the loop, making it execute fewer or more than maxSlots() times? It usually isn’t, but in specific cases one might want to, for instance, retry using a slot in the next loop iteration by decrementing the index in the loop body.
  • the maxSlots() value won’t change during execution? Maybe some operation in the body modifies (directly or indirectly) the slot count?

The first problem with the above scenarios is that, for someone reading the code, it is harder to understand (abstract) the loop logic. If you are not sure whether the index will change, or whether the slot count will be altered, you will spend more time figuring out what’s going on.

The second problem is efficiency. The compiler might have a hard time telling whether the maxSlots() call is constant throughout the whole loop. If it isn’t (and many compilers have to assume that scenario), you will be paying the cost of a function call on every iteration, and that call itself might not be as trivial as a simple return this->size;.
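
One way to regain both clarity and speed by hand is to hoist the bound yourself (a small sketch, assuming maxSlots() returns std::size_t):

const std::size_t slotCount = slotMgr.maxSlots(); // evaluated exactly once
for (std::size_t i = 0; i < slotCount; ++i) {
    // the bound can no longer change behind our back
}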

Range-based for

With the advent of C++11 we got the lovely range-based fors:

for (auto& x : vector_of_T)

Under the hood it simply uses begin() and end() (members, or free functions such as std::begin() and std::end()), which is what allows you to put even a C-style array in place of vector_of_T.
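
For example, a plain array works just as well as a container here:

int primes[] = {2, 3, 5, 7};
for (auto& p : primes) { // arrays are handled out of the box
    std::cout << p << std::endl;
}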

That’s all cool, but how does it help us with regular index-based loops? Do you expect us to write something like this?

auto v = std::vector<int>{0, 1, 2, 3}; // Madness!

// Imitates for (int i = 0; i < 4; ++i)
for (auto i : v) { 
    ...
}

God forbid! Let’s keep the terseness of this syntactic sugar, add usability (no handwritten vectors) and safety – meet ForRange!

ForRange

The idea is simple. I want:

for (auto i : range(slotMgr.maxSlots())) {
    ...
}

Requirements for range(T):

  • as fast as hand written loop;
  • no need to specify type (auto is enough);
  • hoists the maxSlots() value for efficiency and simplicity;
  • allows arbitrary begin, end and step.

Implementation of ForRange is relatively simple, although a bit of code must be written. Nevertheless, I encourage you to try to implement it on your own. I will come back with my implementation in part 2, together with some discussion of the pros and cons of a given design.
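
In the meantime, here is a minimal sketch of the idea, just to show that no magic is involved. It covers only the single-argument, begin-at-zero, step-of-one case, and the names are purely illustrative (this is not the part 2 design):

namespace detail {

template <typename T>
class RangeIter {
public:
    explicit RangeIter(T value) : value_(value) {}
    T operator*() const { return value_; }
    RangeIter& operator++() { ++value_; return *this; }
    bool operator!=(const RangeIter& other) const { return value_ != other.value_; }
private:
    T value_;
};

template <typename T>
class Range {
public:
    Range(T begin, T end) : begin_(begin), end_(end) {}
    RangeIter<T> begin() const { return RangeIter<T>(begin_); }
    RangeIter<T> end() const { return RangeIter<T>(end_); }
private:
    T begin_;
    T end_;
};

} // namespace detail

template <typename T>
detail::Range<T> range(T end) {
    // The bound is evaluated once, here, which gives us the hoisting for free.
    return detail::Range<T>(T(0), end);
}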

C++ std::async

03/03/2013

In order to facilitate spawning concurrent tasks, C++11 provides us with std::async. I will discuss many topics, but I’ll try to focus on the so-called launch policies defined under std::launch. std::async returns a std::future object, which is basically a ‘waitable’ proxy that returns the result of the scheduled task.

Test environments:
Linux: Ubuntu 12.04 x64, gcc 4.6.3, libstdc++ 6-4.6
Windows: Windows 7 Pro x64, Visual Studio 2012 (w/o CTP)

The what

There are 2 launch policies (LP) defined by the standard:

  • std::launch::async, which requires the task to be launched immediately on a new thread;
  • std::launch::deferred (aka sync), which uses the caller’s thread and executes the task only when the result of the future is requested.

For simplicity and unambiguity I will refer to std::launch::async as eager (task), and std::launch::deferred as lazy (task).

Behavior

A lazy task will be started iff one of these methods of the future is called: wait() or get(). Using wait_for() or wait_until() (the so-called timed waiting functions) merely returns the status future_status::deferred, and the task will not be started.

In the case of an eager task, execution starts immediately and every call to wait()/get() is a potential block. An important thing to notice is that if the underlying thread finishes execution, it is released even before any method is called on the future. This means that the thread is kept only as long as it’s needed, and no longer.
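
A small example of the two behaviors (a sketch using the explicit policies discussed above):

#include <cassert>
#include <chrono>
#include <future>

int main() {
    // Lazy: nothing runs until get()/wait(); a timed wait only reports "deferred".
    auto lazy = std::async(std::launch::deferred, []{ return 42; });
    assert(lazy.wait_for(std::chrono::seconds(0)) == std::future_status::deferred);
    assert(lazy.get() == 42); // the task runs now, on the calling thread

    // Eager: starts on another thread immediately; get() blocks until it finishes.
    auto eager = std::async(std::launch::async, []{ return 42; });
    assert(eager.get() == 42);
}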

Default policy

When you don’t explicitly specify the LP, you leave the choice to the implementation. Let’s check how different implementations decide on the LP.

Linux

No matter how many tasks you schedule, they all will be lazy. In other words Linux doesn’t provide any default parallelism – if you want it, you set it.

Windows

Everything is run in parallel (eagerly). But won’t it be very slow to run so many threads at once? Microsoft found its way around this and uses a thread pool underneath, which tries not to create too many threads too fast. This will be discussed in more detail later.

Destruction

A very important aspect of future objects is their destructor. By design, futures don’t wait (block) for the result to become available, so you are not required to call wait() or the like. Unfortunately, for an eager task the destructor of the future is required to wait for the task to finish!

This leads to the following gotcha that prevents achieving parallel execution:

std::async(std::launch::async, ...); // Returns std::future<...> which is immediately destroyed. 
std::async(std::launch::async, ...); // Blocks until the first future's destructor joins before starting the new task.

This issue was raised by Herb Sutter and hopefully in the new revision of the standard it will get fixed. For now you should use the following “idiom”:

auto a = std::async(std::launch::async, ...);
auto b = std::async(std::launch::async, ...);
auto c = std::async(std::launch::async, ...);
// c, b and a are joined here (in that order), but all of them were still able to execute concurrently.

Pooling

What about thread pools, you ask? In theory there is nothing in the standard regarding TPs, so every eager task should immediately be run on a new thread (lazy tasks don’t care about multiple threads, as they always run on the calling thread). Reality is, however, different.

Linux

g++ (based on pthreads) follows the standard to the letter: if you request 1000 eager tasks, you are gonna get them. It might not be the most efficient, but at least you definitely won’t run into deadlock problems where a thread raising an event – for which hundreds of threads are waiting – won’t be scheduled due to the limit of the pool.

Windows

Visual Studio (2012) is trickier, as it uses a TP. I don’t know the internals of the Microsoft creation, but let’s observe its behavior ourselves.

[Figure tp_async: thread count over time while scheduling 1000 eager tasks via std::async]

The observations we should draw are these (some are based on a more detailed view of the sample data, so you will have to take my word for it):

  • Even though the system could schedule more threads, it never assigns 1k of them at once, but ramps up gradually. I suspect this is the standard behavior of a TP, which schedules a new thread only when it sees little chance of reusing an already existing one.
  • There are 2 inflection points that we are interested in, at 75 and at 325 threads. Past these values the TP seems to be less willing to schedule new threads, and there is some delay between a request to std::async and the actual thread creation.
  • When launching the first eager task the system preallocates ~8 threads so that it is ready for new potential tasks.
  • After all tasks have finished their job, ~16 threads are left as the new idle pool waiting for new tasks.

Now let’s take a look at std::thread scheduling.

[Figure tp_thread: thread count over time while creating 1000 std::thread objects]

This case is very peculiar. Not only do we reach the 1k thread count, we overshoot it! The threads are also scheduled much faster, without the considerable delays seen in the std::async case. The last curiosity is that after joining all threads, 4 threads were still waiting, so the underlying system still behaved somewhat like a regular TP.

The why

Just before we start digging deeper, there is a reasonable question to ask ourselves: why the heck use futures at all?

  • No need to manually wrap every call in a std::thread and handle passing the parameters and retrieving the result. Exceptions are also nicely handled here.
  • Clean abstraction representing an asynchronous result which might not be immediately available.

While the first reason makes us code faster, the second one is much more important. By providing a future you inform the users of your code that they should not expect the result immediately.

Why is this important? Because it allows others to properly organize their code. When they see int foo(), nothing changes and they work as usual. On the other hand, seeing std::future<int> foo() makes them:

  • schedule execution of some other tasks (parallelism ftw!) before retrieving this result;
  • attach the result to a group of objects that their code will wait on, perhaps rendering a nice “busy animation” while waiting;
  • compose it with other tasks (when std::future::then()-like methods are ready);
  • think twice before calling this method many times over and over (as it might be quite slow);
  • only use get() to work in a fashion they are used to;
  • do other crazy stuff with it that I can’t predict.

Competition

There are several ways of implementing asynchronous results:

  • provide an additional parameter specifying a callback function;
  • return a handle which you can query or wait on for completion, and with which you can go back to the creator object to get the result.

Callbacks

Callbacks are the bread&butter of async. programming nowadays, but are they the best idea? I tend to disagree.

First, most of the time callbacks are tied to a specific object which has to exist at the time of the call, and that might not always be the case. For everything to work smoothly you either have to detach your callback (which is hardly possible) or check (in every callback!) for the existence of the parent object. This is both tedious and sometimes hard to get right, as it’s hard to decide whether pretending the callback never happened is the best course of action.

Second, callbacks are called by the server, not by the client. You can be in an arbitrary program state and, just when you least expect it, you have to deal with the callback being invoked. It is like a goto coming from anywhere, providing you with happy hours of debugging a parallel program and deciphering logs just to find the cause of failure of your little callback.

Third, it’s hard to work with callbacks. You can’t simply render a waiting animation while waiting for several async results. You have to manually create some synchronization primitive on which you will wait until all the scheduled tasks have finished.

Handles

IMHO handles are a step in the right direction. You can wait on the proxy or query it every time your inner program loop executes, so there are no nondeterministic gotos. Unfortunately it’s still a bit of a chore to keep the parent object around and to get the result from it. You can make the object static, but then you have to deal with all the “pleasures” of concurrent threads and non-trivial construction/destruction.

The when

Return value

In the next chapter, “The which”, I describe when to use eager and lazy tasks. You know what? The more eager the task, the more likely its return value should be a future. And vice versa: the lazier the task, the less likely it needs to be a future.

So why ever use lazy tasks? Because laziness and eagerness are an implementation secret. You present an interface of async results; depending on various conditions those results might truly be computed in parallel, or not. That gives you great flexibility.

Parameter

float calc(float a, float b, float c);

Imagine that calculating a, b and c might take a lot of time. Because of that you might be tempted to do something like this:

float calc(std::future<float> a, std::future<float> b, std::future<float> c);

First, how do you know which of the arguments is in fact slow to calculate? It might vary from client to client, so you would be forced to ‘futurize’ all your parameters. This leads to an interface which is not pleasant to use, as every argument would have to be wrapped like this:

calc(std::async(std::launch::deferred, []{return a;}), ...);

On the other hand, with plain parameters you require the client to calculate all of them beforehand, while with future parameters your code could schedule its own tasks (in the meantime, while the client’s eager tasks are still running) and use the parallel power of the machine. By “forcing” clients to pass async parameters you might get much better efficiency, as the client code now starts to exploit the power of the machine.

My solution is not to use multiple async. parameters but a single wrapping class:

class Wrapper {
public:
    std::future<float> getA();
    std::future<float> getB();
    std::future<float> getC();
};
float calc(const Wrapper& w);

Now calc() can have several overloads, where each variation of Wrapper can hold a different combination of async and sync results, based on which you might behave differently.

Also, the Wrapper can be customized so that the user can either pass a float, which is automatically turned into a lazy task, or a lambda, which is turned into an eager task. This way the whole interface can be async, while clients can easily fill it with values without having to manually turn everything into futures.
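
A minimal sketch of such a customizable wrapper, shown for a single value (the setter overloads are illustrative and not part of the original interface):

#include <functional>
#include <future>
#include <utility>

class Wrapper {
public:
    // A plain value becomes a lazy (deferred) task that just returns it.
    void setA(float value) {
        a_ = std::async(std::launch::deferred, [value] { return value; });
    }
    // A callable becomes an eager (async) task computed on another thread.
    void setA(std::function<float()> fn) {
        a_ = std::async(std::launch::async, std::move(fn));
    }
    std::future<float> getA() { return std::move(a_); }

private:
    std::future<float> a_;
};

A client with a ready value writes w.setA(3.14f); a client with an expensive computation passes a lambda instead, and calc() never has to know the difference.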

The which

So the “big” question is: when should we use eager tasks, when lazy ones, and when should we leave it all to the implementation?

As I’ve previously described, Windows will run everything eagerly, while Linux will run everything lazily – there is no common balancing logic behind the scenes (“schedule up to N eagerly and then only lazily”). Because of this we should rather control ourselves whether to create a lazy or an eager task, as the ported program might otherwise not behave uniformly at all.

Interface

Let’s imagine that an architect defined the following interface:

class ISales {
public:
    virtual std::future<Result> forRegion(const std::string& region) = 0;
    virtual std::future<Result> forPartner(const std::string& partner) = 0;
    virtual std::future<Result> forPeriod(const Period& period) = 0;
};

Our class implements the interface and we have to decide what the LP of each method’s result should be. Code dependent on ISales is expected to work in the following manner:

  1. Schedule the execution of all data that will be requested by simply calling the for*() methods.
  2. [Optional] Calculate your stuff.
  3. get() the requested results.
  4. Finish calculations.
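
For illustration, a hypothetical caller following these four steps might look like this (doLocalWork() and combine() are made-up helpers, not part of the interface):

Result buildReport(ISales& sales) {
    auto region  = sales.forRegion("EMEA");  // 1. schedule everything up front
    auto partner = sales.forPartner("ACME"); //    (the implementation decides how)
    doLocalWork();                           // 2. (optional) our own calculations
    Result r1 = region.get();                // 3. retrieve the requested results
    Result r2 = partner.get();
    return combine(r1, r2);                  // 4. finish the calculations
}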

Indicators

In order to decide on the LP, we have to define a set of indicators. None of the following indicators should be treated as absolute. You will have to balance and measure things before you end up with the proper LP.

Latency

As Microsoft proposes, long-running tasks (longer than ~50 ms) should be made asynchronous (i.e., eager). The logic behind this statement is obvious: you don’t want to block your application. While your caller thread waits on another thread’s result, you are just wasting time-slice cycles which will be given to some other process. The shorter the period the caller has to wait, the better. Thus, scheduling long-running tasks right when they are created provides the best results.

CPU bound

Waiting on an event/IO operation/network packet doesn’t consume CPU. Such operations should naturally be made eager, as the cost of the accompanying thread shouldn’t be large, while a lazy version would force the caller thread to block for the same period of time that the eager task’s helper thread spends waiting.

Hardware concurrency

Don’t bite off more than you can chew. Is it reasonable to schedule yet another thread (eager task) when you know there are already a hundred threads running?

It’s meaningless to simply count the currently running threads. First, the threads that exist might belong to a pool, so they are not actively doing anything, just waiting to be used by you. Second, even if those threads are in use, they might simply be waiting on something, not using the CPU.

Framework time

The only count that is available to you is the number of threads the hardware can run in parallel. It is retrieved via std::thread::hardware_concurrency(). Keeping in mind what I’ve said in the previous paragraph, I would suggest the following way of tuning the LP:

  1. Prepare a parameterized framework which can easily change the LP of created tasks. You would probably have to build a layer on top of std::async so that every task goes through it (see the sketch after this list). When scheduling a task you would provide an enumeration (with more gradations than std::launch) specifying how much you want it to be async. This is really important, as some tasks have to run async; this is discussed in more detail later.
  2. Run tests on various machines (2/4/6/8 parallel threads supported) and note the results.
  3. Derive the optimal strategy for the provided hardware concurrency.
  4. Store the tuned parameters inside the program (configuration file) and ship it.
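
As a rough illustration of point 1, such a layer over std::async might look like this (the Eagerness enum and the decision rule are illustrative; a real framework would make the thresholds tunable, as described above):

#include <future>
#include <thread>
#include <utility>

enum class Eagerness { Must, Prefer, DontCare }; // finer grades than std::launch

template <typename F>
auto schedule(Eagerness e, F&& f) -> std::future<decltype(f())> {
    // 'Must' is always async; 'Prefer' is async only if the hardware has headroom;
    // 'DontCare' falls back to a deferred task.
    const unsigned hw = std::thread::hardware_concurrency();
    const bool eager = (e == Eagerness::Must) ||
                       (e == Eagerness::Prefer && hw > 2);
    return std::async(eager ? std::launch::async : std::launch::deferred,
                      std::forward<F>(f));
}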

I believe that a conservative usage of eager tasks should suffice in most cases, so you shouldn’t worry about this that much.

Don’t make the caller thread a lazy bastard!

A common mistake many programmers make when playing with concurrency can be presented as follows:

auto a = std::async(std::launch::async, []{ return doFoo(); });
auto b = std::async(std::launch::async, []{ return doBar(); });
auto c = std::async(std::launch::async, []{ return doXyz(); });
return {a.get(), b.get(), c.get()};

The main thread does nothing but block! Instead of scheduling everything as eager tasks, let the main thread always do something in the meantime.

auto a = std::async(std::launch::async, []{ return doFoo(); });
auto b = std::async(std::launch::async, []{ return doBar(); });
auto c = doXyz();
return {a.get(), b.get(), c};

You could, in theory, apply this optimization to our interface, but it would be hard to determine whether the caller has used their thread enough. Forcing them to block because you thought they hadn’t done anything is rather risky and limits the concurrency potential of your code.

Internal dependencies

One of the biggest dangers of thread pools is the limit on the maximum number of concurrently running threads. Why is that a danger? Because of internal dependencies. Take a look at the following code:

WaitableObject w;
auto a0 = std::async(std::launch::async, [&w]{ w.wait(); /* do some stuff */ });
auto a1 = std::async(std::launch::async, [&w]{ /* do some stuff */ w.raiseEvent(); });
a1.wait();
a0.wait();

The problem here is that when a0 is scheduled it takes the last available slot in the pool. The other threads are long-running ones that deal with packets, user events, etc., and they won’t be joined until the end of the program. a1 won’t be scheduled as there is no room for it. In other words: deadlock.

This scenario isn’t very common, but when you multiply the number of threads and increase the complexity of inter-thread relations, you can easily deadlock yourself.

One of the solutions to this problem is to make your functions pure (without any side effects on their parameters or global state), or at least to express all the synchronization through the future result. If a1 is meant to raise an event, it’s much better to pass a1’s future to a0 so that a0 can wait on it. This way you explicitly create a dependency chain and it becomes much easier to reason about the code.
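
For example, instead of sharing a WaitableObject, the dependency can be expressed through the future itself (a sketch; here the “event” is simply the completion of the first task):

#include <future>

int main() {
    // The first task's completion is the event; share() lets other tasks wait on it.
    std::shared_future<void> event =
        std::async(std::launch::async, []{ /* do some stuff */ }).share();

    auto a0 = std::async(std::launch::async, [event]{
        event.wait(); // explicit dependency: runs its work only after the event task
        /* do some stuff */
    });

    a0.wait();
}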

The second solution is not to have a limit on the TP at all, and from my observations there doesn’t seem to be one in the Microsoft implementation.

In the context of our discussion it seems obvious that such tasks should be made eager, as a lazy task can easily be omitted and never run.

Finale

I hope I helped some of you understand the usage of std::async. I am eagerly awaiting the different patterns that programmers will create when dealing with asynchronous results. Until next time!

Thank you C++ for … strongly typed enumerations

02/12/2012

enum needs a scope

“Enumerations are scope-leeches – they infect the outer scope.”

C++03

class Foo {
	enum Single {
		ONE = 1, TWO = 2, THREE = 3
	};

	enum Double {
		ONE = 2, TWO = 4, THREE = 6
	};
};

Result

error: declaration of ‘ONE’ conflicts with previous declaration ‘Foo::Single Foo::ONE’

error: declaration of ‘TWO’ conflicts with previous declaration ‘Foo::Single Foo::TWO’

error: declaration of ‘THREE’ conflicts with previous declaration ‘Foo::Single Foo::THREE’

Solution #1: Prefix

class Foo {
public:
	enum Single {
		Single_ONE = 1, Single_TWO = 2, Single_THREE = 3
	};

	enum Double {
		Double_ONE = 2, Double_TWO = 4, Double_THREE = 6
	};
};

int main() {
	// Prints: 1
	std::cout << Foo::Single_ONE << std::endl;
	// Prints: 4
	std::cout << Foo::Double_TWO << std::endl;
}

Solution #2: Wrap

class Foo {
	struct Single {
		enum Enum {
			ONE = 1, TWO = 2, THREE = 3
		};
	};

	struct Double {
		enum Enum {
			ONE = 2, TWO = 4, THREE = 6
		};
	};
};

int main() {
	// New way of declaring the above
	// enums (note the required Enum)
	Foo::Single::Enum s;
	Foo::Double::Enum d;

	// Prints 1
	std::cout << Foo::Single::ONE << std::endl;
	// Prints 4
	std::cout << Foo::Double::TWO << std::endl;
}

C++11

class Foo {
public:
	enum class Single {
		ONE = 1, TWO = 2, THREE = 3
	};

	enum class Double {
		ONE = 2, TWO = 4, THREE = 6
	};
};

int main() {
	Foo::Single s; // Typical way to use
	Foo::Double d; // the enumerations

	// The reason to cast to int will be discussed later.
	// Prints: 1
	std::cout << (int)Foo::Single::ONE << std::endl;
	// Prints: 4
	std::cout << (int)Foo::Double::TWO << std::endl;
}

enum’s double life as an int

“Enumerations are implicitly convertible between each other, as they are ints in nature.”

C++03

// Ordered by their strength
enum Unit {
	LIGHT_INFANTRY,
	HEAVY_INFANTRY,
	COMMANDO
};

// Ordered by range
enum UnitRange {
	SMALL,
	MEDIUM,
	LARGE
};

/// @return Biggest unit range
UnitRange GetBest() {
	return LARGE;
}

Unit GetUnitToBuild() {
	// Poor programmer assumed GetBest() will
	// return the best unit the AI could build.
	if (HEAVY_INFANTRY >= GetBest()) {
		return HEAVY_INFANTRY;
	} else {
		// Will always choose this path.
		return LIGHT_INFANTRY;
	}
}

C++11

// Ordered by their strength
enum class Unit {
	LIGHT_INFANTRY,
	HEAVY_INFANTRY,
	COMMANDO
};

// Ordered by range
enum class UnitRange {
	SMALL,
	MEDIUM,
	LARGE
};

/// @return Biggest unit range
UnitRange GetBest() {
	return UnitRange::LARGE;
}

Unit GetUnitToBuild() {
	if (Unit::HEAVY_INFANTRY >= GetBest()) {
		return Unit::HEAVY_INFANTRY;
	} else {
		return Unit::LIGHT_INFANTRY;
	}
}

Result

error: no match for ‘operator>=’ in ‘(Unit)1 >= GetBest()’

enum is of some type

“Enumerations are internally ints, even if they don’t have to be.”

C++03

enum Recipent {
	ALL,
	SERVER,
	CLIENT,
	NODE
};

enum Response {
	NONE,
	IMMEDIATE,
	DELAYED,
	PERIODIC
};

// Size of our packet is 24B.
struct Packet {
	unsigned char data[16];
	Recipent reci;
	Response resp;
};

C++11

enum Recipent : unsigned char {
	ALL,
	SERVER,
	CLIENT,
	NODE
};

enum Response : unsigned char {
	NONE,
	IMMEDIATE,
	DELAYED,
	PERIODIC
};

// Now the size is only 18B.
struct Packet {
	unsigned char data[16];
	Recipent reci;
	Response resp;
};

Comment

While bit fields provide even better ‘compression’, they are not as easy to use and require careful casting from/to the enum.
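
For comparison, a bit-field variant might look like this (a sketch; the field widths are chosen to fit the four enumerators, and every access needs a cast):

// reci and resp now share a single byte, but they are plain integers,
// so every read/write needs an explicit cast.
struct PacketBits {
	unsigned char data[16];
	unsigned char reci : 2; // holds a Recipent value
	unsigned char resp : 2; // holds a Response value
};

// PacketBits p;
// p.reci = static_cast<unsigned char>(SERVER);
// Recipent r = static_cast<Recipent>(p.reci);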

Thank you C++ for … initializer list

25/11/2012

Most vexing parse

“Everything that resembles a function is a function.”

C++03

struct Bar {
};

struct Foo {
	Foo() : x(0) {}
	Foo(const Bar&) : x(1) {}
	Foo(const Bar&, const Bar&) : x(2) {}

	int x;
};

int main() {
	Foo a();
	Foo b(Bar());
	Foo c(Bar(), Bar());

	std::cout << a.x << std::endl;
	std::cout << b.x << std::endl;
	std::cout << c.x << std::endl;
}

Result

Line 17: request for member ‘x’ in ‘a’, which is of non-class type ‘Foo()’

Line 18: request for member ‘x’ in ‘b’, which is of non-class type ‘Foo(Bar (*)())’

Line 19: request for member ‘x’ in ‘c’, which is of non-class type ‘Foo(Bar (*)(), Bar (*)())’

C++11

int main() {
	Foo a{}; // or just: Foo a;
	Foo b{Bar()};
	Foo c{Bar(), Bar()};

	std::cout << a.x << std::endl;
	std::cout << b.x << std::endl;
	std::cout << c.x << std::endl;
}

Result

0

1

2

Container initialization

“It’s not easy to initialize a container, bar some fancy comma operator tricks.”

C++03

struct Foo {
	Foo() : _container() {
		_container.push_back(1);
		_container.push_back(2);
		_container.push_back(3);
		_container.push_back(4);
		_container.push_back(5);
		_container.push_back(6);
		_container.push_back(7);
	}

	// Alternative
	Foo(int) : _container(helper()) {
	}

	static std::vector<int> helper() {
		std::vector<int> result;
		result.push_back(1);
		result.push_back(2);
		result.push_back(3);
		result.push_back(4);
		result.push_back(5);
		result.push_back(6);
		result.push_back(7);
		return result;
	}

	std::vector<int> _container;
};

C++11

struct Foo {
	Foo() : _container{ 1, 2, 3, 4, 5, 6, 7 } {
	}

	std::vector<int> _container;
};

Brief constructors

You (almost) always have to spell out the type name.

C++03

struct YouDontWantToSpellThis {
	YouDontWantToSpellThis(int a, int b) {
		;
	}
};

std::vector<YouDontWantToSpellThis> container;

container.push_back(YouDontWantToSpellThis(1, 2));

C++11

container.push_back({1, 2});