For loops in times of C++11 part 1


Regular for

We all know for. It was our friend since time immemorial. It’s simple to understand, it’s easy to convert into efficient assembly and it creates a new scope (and we do like scopes, don’t we?). Can we improve it? Let’s then talk about its deficiencies.

Note: I am assuming you understand that raw loops are something that should be used on the lowest level of implementation and your logic should be based on algorithms.



for (int i = 0; i < MAX_FOO; ++i)

doesn’t seem to involve many keystrokes, but if you consider that you will write similar code many times in your career that changes the picture a little bit. Because of it people either create their custom macros in Vim or snippets in Visual Studio or in IDEs featuring similar functionality.


for (? idx = 0; ... 

What should be the type of our index? That’s obvious! It should be the same as the sentinel (aka max value). If we get the type wrong we got nice warnings from the compiler:

foo.cpp: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

That’s not a big issue, but why to waste your precious time fixing such stupid mistakes? And what if the conversion (promotion) between types is acceptable, but you will be constantly paying a conversion cost every loop iteration (unless the compiler does its optimizations)?

Performance and risk

Let’s get to something really important: efficiency and safety. If you see following code:

for (int i = 0; i < slotMgr.maxSlots(); ++i) {

Can you, without a doubt, quickly tell if:

  • the index (i) won’t be modified inside the loop so that the loop will be executed less/more than maxSlots() times? It might not be the case, but in specific cases one might want to – for instance – retry using a slot in the next loop iteration by decrementing the iterator in the loop body.
  • maxSlots() value won’t change during execution? Maybe some body operation modifies (indirectly or directly) the slots count?

First problem with above scenarios is that for someone who will be reading the code it’s harder to understand (abstract) the loop logic. If you are not sure whether index won’t change, or slots value will be altered – then you will spend more time to understand what’s going on.

Second problem is efficiency. Compiler might have hard time telling whether the maxSlots() call is constants throughout the whole loop. If it isn’t (any many compilers have to assume such scenario) every loop you will be paying a cost of a function call which itself might not be as trivial as simple return this->size;.

Range-based for

With the advent of C++11 we got the lovely range based fors:

for (auto& x : vector_of_T)

What it does it simply uses std::begin() and std::end() free functions that allow you to put even C-style arrays in place of vector_of_T.

That’s all cool, but how does it help us in case of regular loops? Do you expect us to write something like this?

auto v = std::vector<int>{0, 1, 2, 3}; // Madness!

// Imitates for (int i = 0; i < 4; ++i)
for (auto i : v) { 

God forbid! Let’s keep the terseness of this syntactic sugar, add usability (no handwritten vectors) and safety – meet ForRange!


The idea is simple. I want:

for (auto i : range(slotMgr.maxSlots())) {

Requirements for range(T):

  • as fast as hand written loop;
  • no need to specify type (auto is enough);
  • hoists the maxSlots() value for efficiency and simplicity;
  • allows arbitrary begin, end and step.

Implementation of ForRange is relatively simple, although a bit of code must be written. Never the less I encourage you to try to implement it on your own. I will come back with my implementation in part 2, together with some discussion about pros and cons of a given design.

Introduction to C++ templates pt.1


This series is intended for beginner C++ programmers, or those who feel a little rusty about their C++ programming skills in the wake of the new standard.

Thanks to C++11 it’s much easier to create templates. Especially that we can now create variadic templates that allow us to generically handle multiple arguments (in form of types and values). There is much knowledge about generic programming and I will not pretend to know everything there is to know. Never the less, I will try my best to share the best tips & tricks with you.

Before I jump into the task and it’s template solution I will talk for a while (this part) about iterations/loops in C++, which is required before we move onto templates themselves (in the next parts).

Calamity of abundance

Everyone of you is (or at least, should be) familiar with a for loop.

for (int i = 0; i < 10; ++i) { ... }

Above loop body is run exactly 10 times (I am assuming the body isn’t modifying the index). Most of the times, however, we don’t want/have to specify the loop count as we just iterate over a container and process its elements. C++11 offers many ways to do it.

std::vector<int> foo { 2, 4, 6, 8 };

// ver. A C++98
for (std::vector<int>::constant_iterator
     it = foo.begin();
     it != foo.end();
        << *it /* get the value pointed by iterator */
        << std::endl;

// ver. B C++11
for (auto it = foo.begin(); it != foo.end(); ++it) {
    std::cout << *it << std::endl;

// ver. C C++11
for (int /* can be auto */ val : foo) {
    std::cout << val << std::endl;

// ver. D C++11
auto end = std::end(foo);
for (auto it = std::begin(foo); it != end; ++it) {
    std::cout << *it << std::endl;

As you can see there are many ways to iterate over a collection. Before we start discussing the loops, notice the – so called – brace initialization of std::vector. If you are a little confused, think of it as a syntactic sugar to allow similar initialization that is allowed for C arrays(int[]) and structs.


In all the loop examples you can see (directly, or indirectly as in ver. C) the usage of iterators. Iterators are a basic concept in C++ which allows decoupling of the underlying container and way of retrieving it’s values, or iterating over them. I don’t want to repeat what it’s clearly described at C++ reference, so please take a look there, first.

To add some value what you have just read, hear are some popular examples of iterators:

  • InputIterator: Input TCP package stream where the connection provider throws the underlying buffer away after your iterations.
  • OutputIterator: Output TCP package stream that, right after writing to an element, sends it to the client and doesn’t allow to rewrite the same package again.
  • ForwardIterator (mutable): Single linked list, where (due to only one pointer) you are only allowed to go forward. The underlying elements won’t be invalidated so you can safely reiterate the whole sequence again.
  • BidirectionalIterator (mutable): Double linked list which – duh! – allows to traverse back and forth multiple times.
  • RandomAccessIterator (mutable): Arrays/vectors/(hash)maps which allow direct access to elements so you don’t have to iterate from beginning 8 times to get to 8th element. Just do *(it + 8) to get there in constant time.

What’s important to remember is that the more advanced iterator you wish to use, the less containers will be able to comply to the requirements. Thus if you can, you should target weaker iterators (think of targeting elderly Windows XP, and not limiting your users to Windows 8.1), and optionally allows some optimizations for stronger iterators.

Free functions

Take a look at version C of the loop. Version D presents an “unrolled” version C. Apart from the end iterator optimization (calculate it only once, as you won’t be able to modify the underlying sequence), the two important functions are std::begin() and std::end(). These are so called free functions which are overloaded so that they can take either a C array, any object that has a begin() member, or std::valarray / std::initializer_list objects (which don’t provide begin member).

These free functions allow to take suitable iterators from a vast array of object types which greatly eases the way of retrieving iterators from various containers.

In the next part we will feast upon std::for_each specification…

Data Oriented Design – Principles


Data Oriented Design (DOD) is an approach to designing software which focuses on context (data and transformations affecting it). Object Oriented Design (OOD), on the other hand, strives to decompose the problem into manageable pieces of responsibility represented as classes. DOD favors carefully crafted solution, well suited to the problem, while OOD wishes to find general patterns which will cover the requirements (including potential ones).

There is no clear winner among these two methodologies. While OOD might seem to be the prevalent design school nowadays, there are no reasons to ignore DOD. Why is that? DOD shines in two fields: performance and flexibility. By understanding the environment we can reach the optimal efficiency for a given platform. We can always create a feasible solution, not being distracted by (often) conflicting requirements that the pattern should cover.

I do believe that these approaches can be merged and you are the one to determine how DOD or OOD your application should be.



Imprint this image in your mind
Close your eyes
Imagine a mosaic of such images
This is your solution


  1. You shall determine properties of input data:
    * size (copying vs sharing)
    * uniqueness (mapping hash : data, and storing only hash vs keeping everything as is)
    * divisibility (many blocks (array) vs single entity)
    * source access method (stream vs database (various optimizations already present) vs simple binary (example: file))
    * data format (binary, JSON, XML, UTF-8 plain text)
    * influx (stable data vs bursts vs steady throughput)
  2. You shall know the detailed requirements of xforms
    * latency (time from scheduling the task to receiving the answer)
    * throughput (number of elements that should be xformed per second/minute/hour)
    * reentrancy (should the xform be reentrant and produce expected results)
    * pre/post-conditions and invariants (how restrained is your xform (postcond. & invariants) and how it should react to broken precond. (undefined behavior and greater efficiency vs double checking and slower))
  3. You shall define the behavior in concurrent scenarios
    * eventual guarantees (what will be the final state/guarantee in case of multiple actors?)
    * independence (strong synchronization and worse efficiency vs loose synchronization and better efficiency)
  4. You shall think like the platform
    * OS API functions (are there any higher level synchronization primitives, async IO read/write, etc.)
    * thread/process restrictions (max threads per process, cost of spawning proc./thread, inter-process messaging)
    * atomic file operations (can you safely lock on a file handle, or not (like in Windows))
    * user rights (what privileges are required for your solution to operate)

  5. You shall understand the machine
    * 32/64-bit (larger pointers (data size) and (potential) lack of support in case of some libraries vs more registers and ongoing support by the compilers)
    * CPU family (ARM/x86-x64/Cell) (efficiency/support of/for floating point arithmetics, weak/strong synchronization guarantees, branching efficiency, instruction/data cache sizes, unaligned data costs, speed (MHz))
    * peripherals (GPU present and potential GPGPU usage, disk speed (SSD/HDD/flash) and the cost of fetching data from second storage, access patterns of peripherals, DMA present and ability to delegate IO operations)
  6. You shall avoid dynamism
    * branching (avoid complex conditional logic and use early-out checks only if you do know that the performance is gained)
    * sort (data shall be sorted so that similar operations will be present in instruction cache)
    * virtual calls (while VC is relatively light in itself, the pipeline disturbing calls via pointers can seriously degrade performance)
    * bulk (operate on data in SIMD like fashion where the data will be properly aligned and processed in understandable and pre-fetch-friendly way)
    * static (strive to calculate as much as possible during compilation time, and prefer logic that can be optimized before running the code)
  7. You shall simplify
    * feature creep (implement only this, what should be implemented; eagerness will waste precious time)
    * no patterns (most of the time you are not designing widely used library, so focus on the job, not on the general way that the problem could be solved)
    * clarity (no matter how you optimize, make sure that the code is readable and commented where needed)
  8. You shall structure your code around data
    * states (determine the state of your data)
    * assign operations (moving from one state to another will be done with some xform)
  9. You shall not forget the universal & timeless design principles
    * DRY (one reason to change – one place to do it)
    * orthogonality (less dependencies between modules, easier reusability)
    * testability (no tests, no product)
    * morphism (decompose application so even the heavily-customized xforms can be swapped)
    * documentation (clearly present the goals, features and restrictions of your architecture)
  10. You shall measure
    * statistics (in critical parts of your program add logging procedures so that you know what are the usage statistics of your data)
    * tuning (according to stats tune the properties of your code in order to reach the set goals)

C++ std::async


In order to facilitate spawning concurrent tasks C++11x provides us with std::async. I will discuss many topics, but I’ll try to focus on so called launch policies defined under std::launch. std::async returns a std::future object which is basically a ‘waitable’ proxy which returns the result of scheduled task.

Test environments:
Linux: Ubuntu 12.04 x64, gcc 4.6.3, libstdc++ 6-4.6
Windows: Windows 7 Pro x64, Visual Studio 2012 (w/o CTP)

The what

There are 2 launch policies (LP) defined by the standard:

  • std::launch::async that requires the thread to be launched immediately;
  • std::launch::deferred (aka sync) that uses caller thread and executes only when result of the future is requested

For simplicity and unambiguity I will refer to std::launch::async as eager (task), and std::launch::deferred as lazy (task).


Lazy task will be started iff one of the methods of future is called: wait() or get(). Using wait_for() or wait_until() (so called non-timed waiting functions) results in returning status future_status::deferred and the task will not be started.

In case of eager task execution starts immediately and every call to wait()/get() results in a potential block. Important thing to notice is the fact that if the underlying thread finishes execution it is released before any method is called. This means that the thread is kept only as long as it’s needed, and no longer.

Default policy

When not explicitly specifying the LP you leave it to the implementation. Let’s check how different implementations decide on the LP.


No matter how many tasks you schedule, they all will be lazy. In other words Linux doesn’t provide any default parallelism – if you want it, you set it.


Everything is run in parallel (eagerly). But won’t it be very slow to run so many thread at once? Microsoft found its way and is using a thread pool underneath which tries no to create too many threads too fast. It will be discussed in more details later.


Very important aspect of future objects is their destructors. By design, futures don’t wait (block) for the result to be made available, so you are not required to wait() or something. Unfortunately, when using eager task the destructor of future is required to wait for it to finish!

This leads to following gotcha that prevents achieving parallel execution:

std::async(std::launch::async, ...); // Returns std::future<...> which is immediately destroyed. 
std::async(std::launch::async, ...); // Wait for the destructor to join before starting new task.

This issue was raised by Herb Sutter and hopefully in the new revision of the standard it will get fixed. For now you should use the following “idiom”:

auto a = std::async(std::launch::async, ...);
auto b = std::async(std::launch::async, ...);
auto c = std::async(std::launch::async, ...);
// Join of c, b, a (in that order) happens here, but still all of them were able to be executed concurrently.


What about thread pools, you ask? In theory there is nothing written in the standard regarding TPs, so every eager task (lazy tasks don’t care about multiple threads as they will always be run on the ‘main’ thread) should be immediately run on a new thread. Reality is, however, different.


g++ (based on pthreads) follows the standard letter-by-letter and if you request 1000 eager tasks, you are gonna get them. It might not be the most efficient, but at least you are definitely won’t run into deadlock problems where a thread raising an event – for which hundreds of threads are waiting – won’t be scheduled due to limit of the pool.


Visual Studio (2012) is more tricky as it uses a TP. I don’t know the internals of the Microsoft creation, but let’s observe its behavior ourselves.


Observations that we should draw are these (some are based on more detailed view of the sample data, thus you have to believe me):

  • Even though the system could schedule more threads it never assigns 1k of them, but does it more gradually. I suspect this to be the standard behavior of TP where it schedules new thread only when it sees that there is not much chance to get an already used one.
  • There are 2 inflection points that we are interested in, at 75 and at 325 threads. Past these values the TP seems to be less willing to schedule new threads, and there is some delay between request to std::async and actual thread creation.
  • When launching the first eager task the system preallocates ~8 thread so that it is ready for new potential tasks.
  • After all tasks have finished their job, ~16 threads are left as the new idle pool waiting for new tasks.

Now let’s take a look at the std::threads scheduling.


This case is very peculiar. Not only we reach the 1k threads count, but we overshoot it! The threads are also scheduled much faster, not introducing any considerable delays (as it was in the case of std::async). Last curiosity is that after joining all threads, there were still 4 threads waiting, thus the underlying system still behaved more like a regular TP.

The why

Just before we start digging deeper, there is a reasonable question to ask ourselves: Why the heck to use the futures?

  • No need to manually wrap every call in a std::thread and handle passing the parameters and retrieving the result. Exceptions are also nicely handled here.
  • Clean abstraction representing an asynchronous result which might not be immediately available.

While first reason makes us code faster, the second one is much more important. By providing a future you inform the users of your code that they should not expect the result immediately.

Why is this important? Because it allows others to properly organize their code. When they see int foo() nothing is gonna change and they shall work as usual. On the other hand, seeing std::future foo() makes them:

  • schedule execution of some other tasks (parallelism ftw!) before retrieving this result;
  • attach the result to a group of objects on which the method will be waiting for, and while waiting it might render a nice “busy animation”;
  • composite it with other tasks (when std::future::then()-like method are ready;
  • think twice before calling this method many times over and over (as it might be quite slow);
  • only use get() to work in a fashion they are used to;
  • do other crazy stuff with it that I can’t predict.


There are several ways of implementing asynchronous results:

  • provide an additional parameter specify a callback function;
  • return a handle which you can query/wait for the completion and with which you can go to creator object to get our result;

Callbacks are the bread&butter of async. programming nowadays, but are they the best idea? I tend to disagree.

First, most of the times callbacks are tied to a specific object which has to exist at the time of the call, which might not always be the case. In order for everything to work smoothly either you have to detach your callback (which is hardly possible) or check (in every callback!) for existence of the parent object. This is both tedious and sometimes hard to solve as it’s hard to decide whether pretending that callback didn’t happen is the best course of actions.

Second, callbacks are called by the server, not by client. You can be in an arbitrary program state and just when you least expect it you will have to deal with the callback being called. This is like a goto coming from anywhere just to provide you with happy hours of debugging parallel program while deciphering the logs just to get the cause of failure of your little callback.

Third, it’s hard to work with the callbacks. You can’t simply render a waiting animation when waiting for several async. results. You have to manually create some synchronization primitive on which you will wait until all the schedules tasks have finished.


IMHO handles are the step in the right direction. You can wait on the proxy or query it every time your inner program loop executes, so there are no nondeterministic gotos. Unfortunately it’s still a little bit of a chore to keep the parent object around and to get the result from it. You can set the object to be static but in that case you have to deal with all the “pleasures” of concurrent threads and non-trivial construction/destruction.

The when

Return value

In the next “The which” chapter I described when to use eager and lazy tasks. You know what? The more eager the task, the more likely return value should be a future. And vice versa: lazy tasks are less likely not to become futures.

So why to use lazy tasks, ever? Laziness and eagerness is an implementation secret. You present an interface of async. result. Depending on various conditions these results might truly be parallel, or not. That gives you great flexibility.


float calc(float a, float b, float c);

Imagine that calculation of a, b and c might take a lot of time. Because of that you might think to do something like this:

float calc(std::future<float> a, std::future<float> b, std::future<float> c);

First, how do you know which of the arguments is in fact slow to calculate? It might vary from client to client, thus you would be forced to ‘futurize’ all your parameters. This leads to an interface which is not pleasant to use, as every argument would have to be wrapped like this:

calc(std::async(std::launch::deferred, []{return a;}), ...);

On the other hand you require the client to calculate all the parameters beforehand, while your code could (in the meantime while client eager tasks were running) schedule your own tasks to use the parallel power of the machine. By “forcing” them to use async. parameters you might get much better efficiency as now the client code starts to use the power of the machine.

My solution is not to use multiple async. parameters but a single wrapping class:

class Wrapper {
    std::future<float> getA();
    std::future<float> getB();
    std::future<float> getC();
float calc(const Wrapper& w);

Now the calc() can be have several overloads where every variation of Wrapper can have different combination of async. and sync. results basing on which you might behave differently.

Also, the Wrapper can be customized so that the user can either pass a float which will be turned automatically to lazy task, or a lambda which will be changed to an eager task. This way the whole interface can be async. while the clients can easily fill it with values not having to manually turn everything to futures.

The which

So the “big” question is when to use eager tasks, lazy or leave it all to implementation?

As I’ve previously described Windows will run everything eagerly, while Linux will run everything as lazy – there is no common balancing logic behind the scenes (“schedule up to N eagerly and then only lazy”). Because of it we should rather control whether or not to create a lazy or eager task ourselves, as the program might not behave uniformly at all, when ported.


Let’s imagine that an architect defined a following interface:

class ISales {
    virtual std::future<Result> forRegion(const std::string& region) = 0;
    virtual std::future<Result> forPartner(const std::string& partner) = 0;
    virtual std::future<Result> forPeriod(const Period& period) = 0;

Our class implements the interface and we have to decide what should be the TP of every of the method results. Code dependent on ISales is expected to work in the following manner:

  1. Schedule the execution of all data that will be requested by simply calling the for*() methods.
  2. [Optional] Calculate your stuff.
  3. get() the requested results.
  4. Finish calculations.


In order to decide on the TP, we have to define a set of indicators. None of the following indicators should be treated as absolute. You will have to balance and measure things before you end up with the proper TP.


Like Microsoft proposed, long running tasks (50 ms) should be made asynchronous (ie., eager). Logic behind this statement is obvious: you don’t want to block your application. While your caller thread waits on the other threads result, you are just wasting time slice cycles which will be given to some other process. The shorter period of time the caller has to wait, the better. Thus scheduling the long running tasks right when created, provides the best results.

CPU bound

Waiting on event/IO operation/network packet doesn’t consume CPU. Such operations should naturally be made eager as the cost of accompanying thread shouldn’t be large; while the caller thread would have to block for the same period of time as the helper thread (of eager tasks).

Hardware concurrency

Don’t bite off more than you can chew. Is it reasonable to schedule yet another thread (eager task) when you know there are already hundred threads running?

It’s meaningless to measure the currently running threads. First, the threads that are active might be the ones of a pool, so they are not actively doing anything, just waiting to be used by you. Second, even if those threads are used, they might be simply waiting on something, not using the CPU.

Framework time

The only count that is available to you is the number of parallel runnable threads. This is retrieved via std::thread::hardware_concurrency(). Keeping in mind what I’ve said in previous paragraph I would suggest the following way of tuning the TP:

  1. Prepare a parameterized framework which can easily change the TP of created tasks. You would probably have to build a layer on top of the std::async so that every task would have to go through it. When scheduling the task you would have to provide an enumeration (more gradations than std::launch) specifying how much you want it be async. This is really important as some tasks have to be run async. This is discussed in more details later.
  2. Run tests on various machines (2/4/6/8 parallel threads supported) and note the results.
  3. Derive the optimal strategy for the provided hardware concurrency.
  4. Store the tuned parameters inside the program (configuration file) and ship it.

I believe that a conservative usage of eager tasks should suffice in most cases, so you shouldn’t worry about this that much.

Don’t make the caller thread a lazy bastard!

A common problem that many programmers do when playing with concurrency can be presented as follows:

auto a = std::async(std::async, []{ doFoo(); });
auto b = std::async(std::async, []{ doBar(); });
auto c = std::async(std::async, []{ doXyz(); });
return {a.get(), b.get(), c.get());

The main thread is only blocking! Instead of scheduling everything to eager tasks, let the main thread always do something in the mean time.

auto a = std::async(std::async, []{ doFoo(); });
auto b = std::async(std::async, []{ doBar(); });
auto c = doXyz();
return {a.get(), b.get(), c);

You could, in theory, apply this optimization to our interface, but it would be hard to determine whether the caller has used his thread enough. Forcing him to block because you thought he hasn’t done anything is rather risky and potentially limits the concurrency potential of your code.

Internal dependencies

One of the biggest dangers of thread pools is that you are able to specify max number of concurrently running threads. Why is that? Thanks to internal dependencies. Take a look at the following code:

WaitableObject w;
auto a0 = std::async(std::launch::async, [&w]{ w.wait(); /* do some stuff */ });
auto a1 = std::async(std::launch::async, [&w]{ /* do some stuff */ w.raiseEvent(); });

Problem here is that when a0 is scheduled, it has taken all the available slots in the pool. Other threads are long running ones that deal with packets, user events, etc.; and they won’t be joined till the end of the program. a1 won’t be scheduled as there is no place for it. In other words: deadlock.

This scenario isn’t very common but when you multiply the number of threads and increase complexity of inter-thread relations you can easily deadlock yourself.

One of the solutions to this problem is to make your functions pure (without any side effects to it’s parameters or global state) or at least leave all the synchronization in future result. If a1 was to raise an event then it’s much better to pass it to a0 so that it can wait for it. This way you explicitly created a dependency chain and it will be much easier to reason about the code.

Second solution is not to have a limit on the TP, and from my observations there doesn’t seem to be any in case of Microsoft implementation.

In the context of our discussion it seems obvious that such tasks should be made eager, as the lazy task can be easily omitted and never run.


I hope I helped some of you understand the usage of std::async. I am eagerly awaiting different patterns that programmers will create when dealing with asynchronous results. To the next time!

Microsoft Office (Excel) Interop Crash Course


There comes one day in your life that you will be requested to execute some Excel logic on the server side, so that the user (for instance) doesn’t have to manually refresh some pivot tables. Unfortunately, Excel Interop (EI) is a living hell full of gotchas and bugs, thus let me give you a few hints.

Use only if you have to

When dealing with Excel documents (server side) you may have two group of tasks: data based and logic based. Data modifications are simple changes to the contents of the sheets. Whereas logic transformations require faking a functionality of Excel, like entering 1 + 1 should give us a final value 2 when entered into the cell.

Modifications of data are usually handled easily by free tools (such as EPPlus). Unfortunately, faking Excel logic is not that simple, thus EI becomes you a very powerful tool. It allows you to basically perform the same steps that the user could do manually (in an Excel window).

Accept the fact that it might not work

Microsoft has stated that they do not support interop services, so if it doesn’t work for you then … you have a problem. If it doesn’t for you (and you are sure that you made everything you could to make it work) then you can always resort to VBA macros. While this solution is often frowned upon, it’s much more reliable and hundred times easier to implement.

ALWAYS properly clean up your objects

You will be dealing with COM objects which is a framework that I am not very found of, alas it’s even worse when called from C#. If you do not properly clean all the COM objects you referenced, you will end up with a zombie EXCEL.exe process left even after your process has been finished. There might be even other bad things that happened, so it even more serious.

Repeat after me:

  1. I shall not keep multiple references to the same COM object as after cleanup all of them will be invalidated.
  2. I shall Close(), Quit() or use any other cleanup method that an object provides.
  3. I shall always use FinalReleaseComObject() on every COM object I’ve stopped using.
  4. I shall null my reference.
  5. I shall call Garbage Collector multiple times due to “funny” behavior of COM objects:
  6. I am aware of exceptions and thus the cleanup will always be done in finally block or by using & IDisposable wrapper class.

This rule is very tedious to follow, and most of the guidelines that I’ll present stem from this one.

Let’s make Excel headless

As I explained using EI is like sending messages to Excel process. It won’t spawn a window, but you will be attacked by a horde of pop ups asking you whether “You really wish to overwrite file foo.xlsx>” etc. To deal with them, I set the following properties for my Application instance.

var app = new Interop.Excel.Application();
app.Interactive = false;
app.DisplayAlerts = false;

There might be some other flags you might have to clear/set, thus you have to experiment yourself.

Double dot = your biggest enemy

Take a look at the following code:

var refToC = refToA.b.c;
// Use refToC
// Close, release, null and gc refToA and refToC

Have we properly cleaned up our objects? No, reference to B is dangling! WTF, you ask? Unfortunately, the refToB (which is not explicitly defined) is created, and we haven’t released it. Because of that lovely behavior every “going deeper” via a reference requires a separate object.

var a = comObj.a;
var b = a.b;
var c = b.c;
c.Foo(); c.Bar();

Forget about Enumerator and foreach

foreach is just a wrapper over Enumerator thus I will just deal with the latter. So what is the problem? Internally, the Enumerator object holds a private reference to a COM object. Sadly, the Enumerator doesn’t provide any cleanup mechanisms (Dispose() and similar), thus you can never be sure when the reference to the COM will be dropped.

Thus, prepare to manually iterate every container using Count and for.

Indices are 1-based

As the last guideline an easy one: every time you call get_Item(), Item[i] and Item(i) (and i is an integer) remember that you index from 1, not 0. If you forget about it, a nice exception will be raised.

One optimization trick for C++


In one of his trainings Andrei has touched a very interesting subject of optimizing C++ (or just code in general) with respect to laying out the code, or basically understanding how the assembly will look like. In this post I would like to dig a little deeper into the subject.

Meet d10

Task is simple:

Count how many 10-base digits a given uint64_t value consists of.

For the purpose of this post (to avoid repetition) let’s assume that every presented implementation of the above task has the following declaration:

uint32_t d10(uint64_t value)

In order to neatly group all the constants in our code, the following enumeration is used:

enum Base10 : uint64_t {
	D1Min	= 0UL,
	D1Max	= 9UL,
	D2Min	= 10UL,
	D2Max	= 99UL,
	D3Min	= 100UL,
	D3Max	= 999UL,
	D4Min	= 1000UL,
	D4Max	= 9999UL,
	D5Min	= 10000UL,
	D5Max	= 99999UL,
	D6Min	= 100000UL,
	D6Max	= 999999UL,
	D7Min	= 1000000UL,
	D7Max	= 9999999UL,
	D8Min	= 10000000UL,
	D8Max	= 99999999UL,
	D9Min	= 100000000UL,
	D9Max	= 999999999UL,
	D10Min	= 1000000000UL,
	D10Max	= 9999999999UL,
	D11Min	= 10000000000UL,
	D11Max	= 99999999999UL,
	D12Min	= 100000000000UL,
	D12Max	= 999999999999UL,
	D13Min	= 1000000000000UL,
	D13Max	= 9999999999999UL,
	D14Min	= 10000000000000UL,
	D14Max	= 99999999999999UL,
	D15Min	= 100000000000000UL,
	D15Max	= 999999999999999UL,
	D16Min	= 1000000000000000UL,
	D16Max	= 9999999999999999UL,
	D17Min	= 10000000000000000UL,
	D17Max	= 99999999999999999UL,
	D18Min	= 100000000000000000UL,
	D18Max	= 999999999999999999UL,
	D19Min	= 1000000000000000000UL,
	D19Max	= 9999999999999999999UL,
	D20Min	= 10000000000000000000UL,
	D20Max	= 18446744073709551615UL

The important thing to remember is that the enumeration values become part of the instruction code stream and do not have to referenced via pointers/temporary values/stack.

Performance tests should be interpreted as follows: smaller is better. In other words 200% score means that the given algorithm was 4 times worse then the baseline.

Specs of the test machine are: Intel Pentium Dual Core T4200 2GHz, 4GB DDR2, ubuntu 12.04, gcc 4.6.3 -O3 64-bit.

Let’s be compact

The compact version will become the baseline for our tests and it basically represents the standard way that most of us would implement d10.


	uint32_t result = 0U;

	do {
		value /= 10UL;
	} while (value);

	return result;


Dividing is slow! It’s roughly 5x slower than multiplying. Because of that compilers introduce an optimization trick which basically turns

MOV eax, $dividend
MOV edi, $divisor # We can't divide by an immediate value
DIV edi


MOV eax, $dividend
MOV edi, $magic_number # We can't multiply by an immediate value
MUL edi
SHR edx, $magic_number_2 # Result of multiplication is put to edx

In other words, we turned th division into multiplication by some magic numbers where basically multiplication and shift result in the same operation as division. Very nifty.

Apart from that, the compact code isn’t very interesting:

  1. Increment ecx register (where result is stored)
  2. Do the multiplication-shift (aka division)
  3. Execute test instruction to check whether after division we got 0
  4. If we got 0, just end the code; if not – jump to the beginning

Base 10 is not enough

Let’s pretend that we do not want only to return the count of 10-base digits and we wish to make it more general.


	const double float_value = value;

	double float_result;
	for (float_result = 1.0;
		 float_value >= std::pow(10.0, float_result);

	return static_cast<uint32_t>(float_result);

Suddenly we got a solution which works for every base (base-2, base-16, etc.).


Let’s look at the performance results:


Horrible! There are, however, several reasons for such a low score:

  • We are making a call to another function which might not be very fast
  • In order to call the function we have to move arguments to xmm0 and xmm1 registers which are by use by us, and then retrieve the result by putting the stack values into the xmmX registers
  • Every conversion from/to int requires using cvtsi2sd/cvttsd2si instructions which take their time (15 i. on older CPUs, 4 on faster)
  • We cannot use ‘inlining’ of constant value 1.0, as it is a floating variable which means that we have to add it from memory location:
    addsd	xmmX, QWORD PTR .LC0[rip]
    	.long	0
    	.long	1072693248
  • As we need to use (for storing constant value 10.0) base register rbx we have to push and pop it from the stack every function call.

Thing to remember here is that operations on floats are not that slow as we might believe. They are a few cycles slower than standard operations, so it’s not a tragedy.

Fast-math to the rescue!

We all do know about the infamous -ffast-math flag for gcc which basically enables various optimizations which allow for faster floating point operations. The catch? They might break your code.

Because of that there are two camps: one believes that -ffast-math is just a death wish and should never be used, while the other thinks that if your code doesn’t do anything funny (denormals, etc.) it can safely be used.

In our case usage of -ffast-math resulted in correct results (tests passes), thus I gave the famous flag a chance.


Results are as follows:


Generally not that fast (pun intended). Some gain is observed, but nothing spectacular.


Being general didn’t help us so let’s take the road of precomputed values. During our first attempt we will choose a simple solution using a table through which we will walk. In LOOKUP which is std::array we stored following values: Base10::D1Min, Base10::D2Min, …, Base10::D20Min. Not pretty, but simple.


	uint32_t result;

	for (result = 0U;
		 value >= LOOKUP[result] && result < LOOKUP.size();

	return result;



Finally some improvements, but due to huge values of our first versions in future comparisons, they will be dropped. Enhance!


Ok, so even if the cpu has to:

  • fetch every data piece from via a pointer,
  • cannot inline the data in any way (as it doesn’t know the tale values),
  • has to make two comparisons ever time,

it’s still tens percents faster! In other words, avoiding computation (mul and shr) by providing a well organized memory that can be prefetched (but still not stored completely in the cache as the whole table takes 1.25kB, while the whole cache set is only 0.5kB) we can drastically reduce the execution time.

Partial unroll

Let’s take other approach (presented by Andrei), instead of using tables, we shall “unroll” the computations that we perform in compact version, so that we don’t have to iterate so many times.


	uint32_t result = 1U;
	while (true) {
		if (value < Base10::D2Min) return result;
		if (value < Base10::D3Min) return result + 1U;
		if (value < Base10::D4Min) return result + 2U;
		if (value < Base10::D5Min) return result + 3U;

		value /= 10000UL;
		result += 4;

	return result;


We basically traded less loops and computations (mul + shr) for more conditional logic. More branches, you say? But we know that the branch predictor will have problems with this, and no pre-fetching and … blah, blah. Test first, you fool!


Not only the unroll is faster than the compact (standard) version, but it also beats the table version which seems relatively simpler (less branches).

BTW, the generated assembly isn’t very innovative. Compare (cmp), return if below or equal (jbe) or iterate the loop. The only tinkering the compiler has done is that it changed the "jump if below n" to "jump if below or equal n – 1“.

Unroll like there’s no tomorrow!

Partial seems good? Then let’s make it fully unrolled!


	if (value <= Base10::D1Max) return 1;
	if (value <= Base10::D2Max) return 2;
	if (value <= Base10::D3Max) return 3;
	if (value <= Base10::D4Max) return 4;
	if (value <= Base10::D5Max) return 5;
	if (value <= Base10::D6Max) return 6;
	if (value <= Base10::D7Max) return 7;
	if (value <= Base10::D8Max) return 8;
	if (value <= Base10::D9Max) return 9;
	if (value <= Base10::D10Max) return 10;
	if (value <= Base10::D11Max) return 11;
	if (value <= Base10::D12Max) return 12;
	if (value <= Base10::D13Max) return 13;
	if (value <= Base10::D14Max) return 14;
	if (value <= Base10::D15Max) return 15;
	if (value <= Base10::D16Max) return 16;
	if (value <= Base10::D17Max) return 17;
	if (value <= Base10::D18Max) return 18;
	if (value <= Base10::D19Max) return 19;
	if (value <= Base10::D20Max) return 20;


We know it looks hideous and it’s unimplementable for bigger set of arguments, but what about the efficiency? Glad you asked.


Even faster, and the compiler still didn’t do any crazy stuff.

Hiding conditional logic

While unrolling provides us with a nice boost, there are still a lot of branches/jumps. What if we could “avoid” them?


		+ (value >= Base10::D2Min)
		+ (value >= Base10::D3Min)
		+ (value >= Base10::D4Min)
		+ (value >= Base10::D5Min)
		+ (value >= Base10::D6Min)
		+ (value >= Base10::D7Min)
		+ (value >= Base10::D8Min)
		+ (value >= Base10::D9Min)
		+ (value >= Base10::D10Min)
		+ (value >= Base10::D11Min)
		+ (value >= Base10::D12Min)
		+ (value >= Base10::D13Min)
		+ (value >= Base10::D14Min)
		+ (value >= Base10::D15Min)
		+ (value >= Base10::D16Min)
		+ (value >= Base10::D17Min)
		+ (value >= Base10::D18Min)
		+ (value >= Base10::D19Min)
		+ (value >= Base10::D20Min);


The obvious penalty of the above solution is that we will have to always go through all the digits, even though we could return early. The positive? There is no branching. CPU can nicely rush through all the code filling the pipeline.


While the penalties are clearly visible (big lag for small digits) the potential gains cannot be seen and it doesn’t seem that the lack of branches helps the CPU that much.

It’s all about searching

Looking at the fully unrolled loops we can easily see that what we’ve been doing all the time is plain stupid linear search. Being decently equipped engineers we know that we can turn this n into log(n) via binary search.


	auto upRange = std::upper_bound(LOOKUP.cbegin(), LOOKUP.cend(), value);

	if (upRange == LOOKUP.cend()) {
		return 20;
	} else {
		return std::distance(LOOKUP.cbegin(), upRange);


Being lazy and smart cookies at the same time, we used the standard algorithms.


Not the improvement we wanted.

Not for the faint of heart

Andrei didn’t present the fully unrolled binary search, so let’s show it.


	if (value < Base10::D8Min) {
		if (value < Base10::D4Min) {
			if (value < Base10::D2Min) {
				return 1;
			} else {
				if (value < Base10::D3Min) {
					return 2;
				} else {
					return 3;
		} else {
			if (value < Base10::D6Min) {
				if (value < Base10::D5Min) {
					return 4;
				} else {
					return 5;
			} else {
				if (value < Base10::D7Min) {
					return 6;
				} else {
					return 7;
	} else {
		if (value < Base10::D16Min) {
			if (value < Base10::D12Min) {
				if (value < Base10::D10Min) {
					if (value < Base10::D9Min) {
						return 8;
					} else {
						return 9;
				} else {
					if (value < Base10::D11Min) {
						return 10;
					} else {
						return 11;
			} else {
				if (value < Base10::D14Min) {
					if (value < Base10::D13Min) {
						return 12;
					} else {
						return 13;
				} else {
					if (value < Base10::D15Min) {
						return 14;
					} else {
						return 15;
		} else {
			if (value < Base10::D18Min) {
				if (value < Base10::D17Min) {
					return 16;
				} else {
					return 17;
			} else {
				if (value < Base10::D19Min) {
					return 18;
				} else {
					if (value < Base10::D20Min) {
						return 19;
					} else {
						return 20;


Yes, it’s uber horrible to write, but we didn’t do it to get the cleanest code award.


And we have the winner!

That’s the end of our journey

I wish this post will become a minor, but still valuable addition, to what Andrei has presented. To summarize all the tips I gathered while creating this material:

  • don’t assume – measure
  • compiler knows more tricks then you do, so concentrate on the algorithms and data layout, not micro-optimizations
  • general programming knowledge is applicable everywhere (i.e., binary search)
  • moving data from storage (cache) to code instructions is good, but don’t go overboard – you can’t squash all the code, everytime

As the percentage charts don’t show the differences between the fastest algorithms enough, one more chart.


Test all levels, god dammit!


Recently I’ve heard a discussion regarding how a test system should be implemented. The system follows a standard architecture found in many, many places.


One of my colleagues proposed a following solution:

Let’s test at the GUI level. If we do that we will have everything tested, as all layers have to work correctly in order to give us properly rendered page.

Drawing from my small testing experience, I can state this is a very bad idea. It’s very tempting to do so, but there are several flaws of this approach and hopefully you won’t make this mistake.

Observable State

The first major flaw is that the on the highest level of abstraction (GUI) we will only observe a small piece of the reality. Let’s say you have a table in your html with aggregated data (sum of all income, etc.). There is quite a lot of data, so you draw a conclusion that as the values match your expected state everything is in order.

Invalid scenario

This is what really happened.

  1. Database was created
  2. Database was filled with test input vector
  3. System executed several test script commands
  4. PHP retrieved aggregated values from tables
  5. Defect PHP command removed half of the values from database
  6. HTML consists of properly calculated data (which was captured in point 4)
  7. Database is torn-down
  8. New test begins

Covering Infections

This time, let’s presume that virtually all state that is relevant to the system is visible in the GUI. What if there is not one, but two defects, when one covers the other?

Invalid scenario

  1. SetUp() logic completed successfully
  2. Defective DB stored procedure which was supposed to add 10$ to every account only added 5$
  3. Defective (due to outdated logic) PHP function applies a bonus of 5$ to every read column
  4. Correct values are observed in HTML

Silent Defects

Remember, the more functional/system tests you will do, the more you are tied to a specific environment/input data. This time we have a low-level function that returns the week of the year, called YearWeek(). This procedure is called many times and we never (during our GUI tests) encountered any problems with it.

Unfortunately, after some time a QA finds out that in one part of the system a difference of week number A and week number B (which happens always later in the year) is negative! How come Assuumption: B > A Result: B - A < 0!?

What happened is that border weeks in December/January of one year are treated as weeks of the previous/next year. In order to fix this problem you will have to return both, the year and a week. Let’s assume that this functionality is used across the whole system. Do you think incorporating such change of logic can be easy in multiple places in the system?

Every insufficiently tested functionality is bound to show it’s weakness some day, which might require a lot of costly changes and delays.

Debugging Paranoia

Dang it! Your test environment reported an one test has failed. Value A in textbox T is invalid as we expected B. You know that you will do for the next n hours? You will dig from the very top of your system (GUI), to the very bottom (Database). Madness!

By having test for every layer, 1+ test will fail. Basing on the failed non-GUI tests you can quickly determine at what layer the error appeared. If all layers reported errors, someone created a defect in database. If only GUI failed, then you know you will take a closer look at the HTML/JavaScript/etc.


Testing GUI itself is tiresome. First, you probably need to understand and use some framework so that you can even start testing. Second, such test aren’t the fastest ones, thus you will need some more worker machines if you wish to follow the continuous integration path. Third, these test are very brittle, especially when comparing screen-to-screen. Any change in the design that is done by an artist, or changing the environment settings (resolution, bpp, browser settings) can lead to avalanche of failed tests.

Virtual destructors are like a virus


The good

During your life as a C++ programmer you will hear many guidelines like: use smart pointers, RAII, don’t play with pointers, measure what you optimize, etc. One of these rules goes as follows:

Every class which will be a base for other classes, has to have its destructor defined as virtual.

As we all do know, the virtual destructors are inherited. In other words, they behave like regular methods. If you have a base class A through which pointer/reference you will call a destructor defined in B (which inherits A), the latter will be used.

class A {
	A() { print("Constructor of A"); }
	virtual ~A() { print("Destructor of A"); }

class B : public A {
	B() { print("Constructor of B"); }
	virtual ~B() { print("Destructor of B"); }

int main() {
	print("A* = B");
	A* b0 = new B();
	delete b0;

	print("B* = B");
	B* b1 = new B();
	delete b1;

	return 0;

A* = B
Constructor of A
Constructor of B
Destructor of B
Destructor of A

B* = B
Constructor of A
Constructor of B
Destructor of B
Destructor of A

The bad

Everything works, as it should. But what if we break the virtual destructor rule?

class A {
	A() { print("Constructor of A"); }
	~A() { print("Destructor of A"); } // No virtual here

class B : public A {
	B() { print("Constructor of B"); }
	virtual ~B() { print("Destructor of B"); }

A* = B
Constructor of A
Constructor of B
Destructor of A
_BLOCK_TYPE_IS_VALID assert error!

B* = B
Constructor of A
Constructor of B
Destructor of B
Destructor of A

B pointer and destruction work correctly, but for A pointer we have an assert! What does the _BLOCK_TYPE_IS_VALID mean? Basically we are trying to delete memory which we shouldn’t. As far as I can tell, it happens because:

  1. We tell the machine to delete the b0
  2. We look at b0 and sees that we are dealing with a class
  3. As this is a class which we call through pointer/reference maybe we should use a virtual table (VT)?
  4. There is no virtual members in A thus no VT is used
  5. We call the standard destructor of A
  6. Bang! We are trying to delete class A but it happens that the pointer has lead us to object of B which contains VT which A didn’t know of. sizeof(A) is 1 (as AFAIK it’s not legal to have size equal 0) and sizeof(B) is 4 (due to presence of VT). We wish to delete 1 byte, but there is a block of 4 bytes. Due to DEBUG heap monitoring, the error was caught.

To prove it, take a look at the following code:

class A {
	A() { print("Constructor of A"); }
	~A() { print("Destructor of A"); }

class B : public A {
	B() { print("Constructor of B"); }
	~B() { print("Destructor of B"); } // No virtual here

A* = B
Constructor of A
Constructor of B
Destructor of A
B* = B
Constructor of A
Constructor of B
Destructor of B
Destructor of A

As you can see above, nothing exploded. As expected, B dtor wasn’t called, but it didn’t crash as delete had to clean the same size of block in case of A and B (which is 1B).

The ugly

What about such code?

class A {
	A() { print("Constructor of A"); }
	virtual ~A() { print("Destructor of A"); }

class B : public A {
	B() { print("Constructor of B"); }
	~B() { print("Destructor of B"); }

class C : public B {
	C() { print("Constructor of C"); }
	~C() { print("Destructor of C"); }

int main() {
	print("A* = C");
	A* c0 = new C();
	delete c0;

	print("B* = C");
	B* c1 = new C();
	delete c1;

	print("C* = C");
	C* c2 = new C();
	delete c2;

	return 0;

A* = C // ok
Constructor of A
Constructor of B
Constructor of C
Destructor of C
Destructor of B
Destructor of A
B* = C
Constructor of A
Constructor of B
Constructor of C
Destructor of C // WTF!?
Destructor of B
Destructor of A
C* = C // ok
Constructor of A
Constructor of B
Constructor of C
Destructor of C
Destructor of B
Destructor of A

How come c1 (which is object of C through non-virtual pointer to B) behaves like a standard, well-behaved class? Didn’t we hear that you shouldn’t delete a class through non-virtual-destructor of its base? The idea here is simple:

Virtual destructor defined in any base class above a class through which pointer you will delete your derived object will spread like a virus to all derived classes.

If you are infected with virtual dtor/method, the VT (which will be present in every class which derive from virtual class) will make sure that the bottom-most dtor/method is called. Hope this helps some of you as I felt very ashamed not knowing this fact :).

Forcing Entity Framework to run outside of transaction


When using transactions (via TransactionScope) and Entity Framework you might run into a problem when you would like to execute some commands outside of enclosing transaction.

using (var t1 = new TransactionScope()) {
    // ...
    logUsingEF("Done foo"); // Will not be called if an exception is thrown below
    // ...

Problem here is that if anything wrong happens (exception), your log won’t be written. A simple solution:

using (var t1 = new TransactionScope()) {
    // ...
    using (var t2 = new TransactionScope(TransactionScopeOption.Suppress)) {
        logUsingEF("Done foo"); // Will always run
        t2.Complete(); // Not required but keep for consistency sake
    // ...

How to deal with interfaces



Interfaces (in the sense of the implementation-deprived classes) are the bread and butter of most OO applications out there. They define the contract between client and server.

If I were to give you 4 tips regarding design of interfaces, these would be:

  • Interface should make it easy to do good things, and hard – wrong ones.
  • Prototype your interfaces – test how the interface really copes with the problems at hand.
  • Don’t throw too many eggs into one basket – just because several classes don’t adhere to some interface, doesn’t mean you shouldn’t implement it. Just put the outsiders into their own baskets.
  • Hide implementation. Strive to catch the general, abstracted idea. This way adding new implementers will be easy.

Alright, so you have your sexy interface that looks more of less like the thing beneath:

class ICommunicator {
	virtual ~ICommunicator() {}
	virtual void connect() = 0;
	virtual void* getData() = 0;

class TcpClient : public ICommunicator {
	virtual void connect() {
		// ...
	virtual void* getData() {
		// ...

The idea is rather simple:

  1. Client creates the instance of ICommunicator.
  2. Client connects to some station.
  3. If the connection succeeds (no exception has been thrown) client starts receiving data.


One day, your boss comes to your office (ok, let’s pretend you don’t have to visit him yourself) and asks you to add functionality to encrypt the transmission. Great, you think, it shouldn’t be that hard.

Unfortunately in order for the encryption to work (in our case) you have to follow these steps:

  1. Fetch the initial packet of unencrypted data.
  2. Get info regarding client to which you will be connecting to.
  3. Basing on the client type, specify the certificate.
  4. Switch to a new encrypted socket with a predetermined certificate.
  5. Fetch data as usual.

Now, how you gonna do it? There are several conflicting purposes:

  • small or no change to the interface,
  • low amount of work,
  • extensibility for future changes,
  • cohesiveness of code,
  • high performance,
  • readability & maintainability,

It’s virtually impossible to cover all of them so you will have to decide yourself. For the purpose of this post I will just deal with changing the interface.

Can I make the change?

Before you even start thinking about changing the interface, then you better check if you are allowed to do it. There may be thousands of developers dependent on your code out there, and making any change will definitely break their code. Some companies even prohibit changes to the interface and force you to create always a new one with version number. In other words, when changing IFoo you will create IFooV1.

Does it fit here?

Ask yourself: do I want every implementor of the interface to support encryption? Is this an integral part of the interface that I missed during design?

If we could hide the whole handshaking/certificate swapping behind the connect then no changes should be made. It would be quite reasonable in fact (and I would chose this solution), but due to various constraints we won’t be able to neatly hide it.

The question remains: is encryption a part of requirements? Can every current implementor provide a reasonable way of encrypting the channel? I am not so sure. It is quite a heavy burden placed on the implementors and because of many different ways of implementing the encryption even further changes to the interface could be required.

Can I catch the general idea of interface better?

If you know you are gonna change the interface, maybe it’s time for some refactorization? I mean, is there a way of decomposing interface (i.e., splitting methods into ‘n’ children) so that every implementer will have something meaningful to do in every new method?


class IFoo {
	virtual ~IFoo() {}
	virtual void doStuff() = 0;
// Decomposed
class IFoo {
	virtual ~IFoo() {}
	virtual void initialize() = 0;
	virtual void execute() = 0;
	virtual void finish() = 0;

Such decomposition may not be applicable in our case, but I’ve seen quite a few code samples where such change (together with some parameters/return values) allowed new implementors to be created without much interference in the logic of already created classes.

Can I accept a nop?

This question is quite often asked as the first one, which many times lead to big interfaces. Before you start noping your methods ask yourself: can we accept dummy implementation of encryption? Does it make any sense to lie to our clients that we encrypt the data in any meaningful way? In my humble opinion, encryption is used when we really care about secrecy and removing it would make our products incomplete and not market-ready.

The idea is simple of noping: instead of forcing everyone to implement every method, allow them to provide dummy logic/implementation. In case of

void doStuff() = 0;

not implementing it shouldn’t cause much problems. But what if our doStuff looks like this:

IBar* doStuff(IBaz& output2) = 0;

? Such methods require a global commitment where we establish a law that you should return nullptr if you don’t implement the method, and you shouldn’t modify the output parameters in any way.

While this solution works, it has at least two problems. First, it requires handling nops result in every client using the interface (which is quite ugly & copy-paste-ish). Second, we deal with much more fundamental issue: why do we use interface when we don’t require implementation? I mean, the essence of interface is that we can depend on some invariants/requirements and build our code on them. In my understanding all the logic should be done in implementors. If we force clients to handle various nop scenarios then something feels wrong.

I don’t want to sound too harsh here, because I know some libraries use this approach to allow heavy modification of code logic via interfaces. This is a valid method to deal with soft-requirement-type (optional) methods, but think twice before going this road. It might lead to architecture where everything is so optional that the foundations on which you build your code are very unstable.

Nop part II: exceptions

Another way of dealing with nop methods in interface is the use of exceptions for not implemented methods.

class IStream {
	virtual ~IStream() {}
	virtual void push(int val) { throw new Exception(); }
	virtual int fetch() { throw new Exception(); }

class OutputStream : public IStream {
	virtual void push(int val) { /* ... */ }
	// fetch() will throw if called

This abomination is used where we run into an issue where we have both: wide interface with optional requirements AND non-nopable logic, which we either can’t nop due to language restrictions or because of logic reasons (if you fetch something you don’t want to get nulls/0s all the time, because it doesn’t have any sense).

If you ask me, that would be the last thing I would use in my code. It creates an unwritten rule that only a part of implemented interface can be used, and those exceptions lurk around to show their head right during a presentation for your client. What’s more, the implementation hiding provided by the interface is thrown out the window, as you have have to know what class you received. I mean, isn’t dynamic_cast just about the same?

Is the interface even a good idea?

As you already see, the interfaces aren’t as nice as the OO books tend to describe them. They definitely provide a clean abstraction mechanism that defines the architecture … at a cost. Static nature of interfaces force us to go great lengths when allowing broader range of methods to be used, which leads to a problem of noping and partial implementations.

What if I do have some defined set of functions that I would like to call and I don’t require them implemented in any way (fully optional)? If you have one way channel then we get to land of … events and messages!

Think of it: you may have 1000s of events defined and 800 of them never used by anyone. Should you care? Not one bit (just skip the question why the heck did you implement 800 events which are not used). You add 15 more messages. Do your clients care? Nope, unless there are really interested in the new batch. You just got yourself a very flexible interface implementation … at a cost :). Now you live in a land where nothing is certain and debugging might be hell on earth where all the events are running around, calling different events in random order, etc.

Hope this helps some of you look at the interfaces from different perspectives.