Posts Tagged ‘C#’

One optimization trick for C++

03/02/2013

In one of his trainings, Andrei touched on a very interesting subject: optimizing C++ (or code in general) with respect to laying out the code, or basically understanding what the assembly will look like. In this post I would like to dig a little deeper into the subject.

Meet d10

Task is simple:

Count how many base-10 digits a given uint64_t value consists of.

For the purpose of this post (to avoid repetition) let’s assume that every presented implementation of the above task has the following declaration:

uint32_t d10(uint64_t value)

In order to neatly group all the constants in our code, the following enumeration is used:

enum Base10 : uint64_t {
	D1Min	= 0UL,
	D1Max	= 9UL,
	D2Min	= 10UL,
	D2Max	= 99UL,
	D3Min	= 100UL,
	D3Max	= 999UL,
	D4Min	= 1000UL,
	D4Max	= 9999UL,
	D5Min	= 10000UL,
	D5Max	= 99999UL,
	D6Min	= 100000UL,
	D6Max	= 999999UL,
	D7Min	= 1000000UL,
	D7Max	= 9999999UL,
	D8Min	= 10000000UL,
	D8Max	= 99999999UL,
	D9Min	= 100000000UL,
	D9Max	= 999999999UL,
	D10Min	= 1000000000UL,
	D10Max	= 9999999999UL,
	D11Min	= 10000000000UL,
	D11Max	= 99999999999UL,
	D12Min	= 100000000000UL,
	D12Max	= 999999999999UL,
	D13Min	= 1000000000000UL,
	D13Max	= 9999999999999UL,
	D14Min	= 10000000000000UL,
	D14Max	= 99999999999999UL,
	D15Min	= 100000000000000UL,
	D15Max	= 999999999999999UL,
	D16Min	= 1000000000000000UL,
	D16Max	= 9999999999999999UL,
	D17Min	= 10000000000000000UL,
	D17Max	= 99999999999999999UL,
	D18Min	= 100000000000000000UL,
	D18Max	= 999999999999999999UL,
	D19Min	= 1000000000000000000UL,
	D19Max	= 9999999999999999999UL,
	D20Min	= 10000000000000000000UL,
	D20Max	= 18446744073709551615UL
};

The important thing to remember is that the enumeration values become part of the instruction stream and do not have to be referenced via pointers/temporary values/the stack.

Performance tests should be interpreted as follows: smaller is better. In other words, a 200% score means that the given algorithm took twice as long as the baseline.

Specs of the test machine are: Intel Pentium Dual Core T4200 2GHz, 4GB DDR2, ubuntu 12.04, gcc 4.6.3 -O3 64-bit.

Let’s be compact

The compact version will become the baseline for our tests and it basically represents the standard way that most of us would implement d10.

compact

	uint32_t result = 0U;

	do {
		++result;
		value /= 10UL;
	} while (value);

	return result;

Remarks

Dividing is slow! It’s roughly 5x slower than multiplying. Because of that, compilers employ an optimization trick which basically turns

MOV eax, $dividend
MOV edi, $divisor # We can't divide by an immediate value
DIV edi

into

MOV eax, $dividend
MOV edi, $magic_number # We can't multiply by an immediate value
MUL edi
SHR edx, $magic_number_2 # Result of multiplication is put to edx

In other words, we turned the division into a multiplication by a magic number: the multiplication followed by a shift produces the same result as the division. Very nifty.
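This multiply-and-shift trick can be reproduced by hand. Below is a sketch of unsigned division by 10; the magic constant is ⌈2^67/10⌉ truncated to 64 bits, and unsigned __int128 is a GCC/Clang extension used here only to grab the high half of the product:

```cpp
#include <cstdint>

// Unsigned division by 10 without a DIV instruction:
// multiply by the magic constant, keep the high 64 bits of
// the 128-bit product, then shift right by 3.
uint64_t div10(uint64_t n) {
    const uint64_t magic = 0xCCCCCCCCCCCCCCCDULL;  // ceil(2^67 / 10)
    uint64_t high = (uint64_t)(((unsigned __int128)n * magic) >> 64);
    return high >> 3;
}
```

This mirrors what gcc emits for `value /= 10` at -O2 and above; the magic constant and shift amount differ per divisor.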

Apart from that, the compact code isn’t very interesting:

  1. Increment the ecx register (where result is stored)
  2. Do the multiplication-shift (a.k.a. division)
  3. Execute the test instruction to check whether the division produced 0
  4. If it did, end; if not – jump back to the beginning

Base 10 is not enough

Let’s pretend that we don’t want to return only the count of base-10 digits and wish to make it more general.

general

	const double float_value = value;

	double float_result;
	for (float_result = 1.0;
		 float_value >= std::pow(10.0, float_result);
		 ++float_result);

	return static_cast<uint32_t>(float_result);

Suddenly we have a solution which works for every base (base-2, base-16, etc.).
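As an aside, the same generality can be had without any floating point at all; a minimal integer sketch (not one of the benchmarked versions) counting digits in an arbitrary base:

```cpp
#include <cstdint>

// Digit count in an arbitrary base, using only integer division.
uint32_t digits(uint64_t value, uint64_t base) {
    uint32_t result = 1U;
    while (value >= base) {
        value /= base;
        ++result;
    }
    return result;
}
```

digits(value, 10) matches d10 for every input, while digits(value, 2) and digits(value, 16) cover the other common bases.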

Remarks

Let’s look at the performance results:

general

Horrible! There are, however, several reasons for such a low score:

  • We are making a call to another function, which might not be very fast
  • In order to call the function we have to move the arguments into the xmm0 and xmm1 registers, which are already in use by us, and then retrieve the result by moving the stack values back into the xmmX registers
  • Every conversion from/to int requires the cvtsi2sd/cvttsd2si instructions, which take their time (around 15 cycles on older CPUs, around 4 on newer ones)
  • We cannot ‘inline’ the constant value 1.0, as it is a floating-point constant, which means that we have to load it from a memory location:
    addsd	xmmX, QWORD PTR .LC0[rip]
    .LC0:
    	.long	0
    	.long	1072693248
    
  • As we need the callee-saved register rbx (for storing the constant value 10.0), we have to push and pop it on the stack every function call.

The thing to remember here is that operations on floats are not as slow as we might believe. They are a few cycles slower than standard integer operations, so it’s not a tragedy.

Fast-math to the rescue!

We all know about the infamous -ffast-math flag for gcc, which enables various optimizations that allow for faster floating-point operations. The catch? They might break your code.

Because of that there are two camps: one believes that -ffast-math is just a death wish and should never be used, while the other thinks that if your code doesn’t do anything funny (denormals, etc.) it can safely be used.

In our case the usage of -ffast-math produced correct results (tests passed), so I gave the famous flag a chance.

Remarks

Results are as follows:

general-fast

Generally not that fast (pun intended). Some gain is observed, but nothing spectacular.

Table

Being general didn’t help us, so let’s take the road of precomputed values. For our first attempt we will choose a simple solution: a table through which we will walk. In LOOKUP, which is a std::array, we store the following values: Base10::D1Min, Base10::D2Min, …, Base10::D20Min. Not pretty, but simple.

table

	uint32_t result;

	for (result = 0U;
		 result < LOOKUP.size() && value >= LOOKUP[result]; // Bounds check first, to avoid reading past the end
		 ++result);

	return result;
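The post doesn’t show LOOKUP itself; a plausible sketch of its definition (the exact declaration used in the benchmark may differ) is:

```cpp
#include <array>
#include <cstdint>

// Lower bound of every digit count: 10^0 .. 10^19,
// i.e. Base10::D1Min through Base10::D20Min.
static const std::array<uint64_t, 20> LOOKUP = {{
    1UL,                     10UL,
    100UL,                   1000UL,
    10000UL,                 100000UL,
    1000000UL,               10000000UL,
    100000000UL,             1000000000UL,
    10000000000UL,           100000000000UL,
    1000000000000UL,         10000000000000UL,
    100000000000000UL,       1000000000000000UL,
    10000000000000000UL,     100000000000000000UL,
    1000000000000000000UL,   10000000000000000000UL
}};
```

Walking this array with the loop above yields the digit count as the index of the first entry greater than the value.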

Remarks

Results:
table

Finally some improvements. Due to the huge scores of our first versions, they will be dropped from future comparisons. Enhance!

table_2

Ok, so even if the CPU has to:

  • fetch every piece of data via a pointer,
  • give up on inlining the data (as it doesn’t know the table values),
  • make two comparisons every time,

it’s still tens of percent faster! In other words, by avoiding computation (mul and shr) and providing well-organized memory that can be prefetched (but still not stored completely in the cache, as the whole table takes 1.25kB while the whole cache set is only 0.5kB), we can drastically reduce the execution time.

Partial unroll

Let’s take another approach (presented by Andrei): instead of using tables, we shall “unroll” the computations that we perform in the compact version, so that we don’t have to iterate so many times.

partial_unroll

	uint32_t result = 1U;
	
	while (true) {
		if (value < Base10::D2Min) return result;
		if (value < Base10::D3Min) return result + 1U;
		if (value < Base10::D4Min) return result + 2U;
		if (value < Base10::D5Min) return result + 3U;

		value /= 10000UL;
		result += 4;
	}

	return result;

Remarks

We basically traded loop iterations and computation (mul + shr) for more conditional logic. More branches, you say? Surely the branch predictor will have problems with this, and there’s no pre-fetching, and … blah, blah. Test first, you fool!

punroll

Not only is the unroll faster than the compact (standard) version, it also beats the table version, which seems relatively simpler (fewer branches).

BTW, the generated assembly isn’t very innovative: compare (cmp), return if below or equal (jbe), or iterate the loop. The only tinkering the compiler has done is to change “jump if below n” into “jump if below or equal n – 1”.

Unroll like there’s no tomorrow!

Partial seems good? Then let’s make it fully unrolled!

unroll

	if (value <= Base10::D1Max) return 1;
	if (value <= Base10::D2Max) return 2;
	if (value <= Base10::D3Max) return 3;
	if (value <= Base10::D4Max) return 4;
	if (value <= Base10::D5Max) return 5;
	if (value <= Base10::D6Max) return 6;
	if (value <= Base10::D7Max) return 7;
	if (value <= Base10::D8Max) return 8;
	if (value <= Base10::D9Max) return 9;
	if (value <= Base10::D10Max) return 10;
	if (value <= Base10::D11Max) return 11;
	if (value <= Base10::D12Max) return 12;
	if (value <= Base10::D13Max) return 13;
	if (value <= Base10::D14Max) return 14;
	if (value <= Base10::D15Max) return 15;
	if (value <= Base10::D16Max) return 16;
	if (value <= Base10::D17Max) return 17;
	if (value <= Base10::D18Max) return 18;
	if (value <= Base10::D19Max) return 19;
	if (value <= Base10::D20Max) return 20;

Remarks

We know it looks hideous and it doesn’t scale to bigger input domains, but what about the efficiency? Glad you asked.

unroll

Even faster, and the compiler still didn’t do any crazy stuff.

Hiding conditional logic

While unrolling provides us with a nice boost, there are still a lot of branches/jumps. What if we could “avoid” them?

unroll_v2

	return
		1
		+ (value >= Base10::D2Min)
		+ (value >= Base10::D3Min)
		+ (value >= Base10::D4Min)
		+ (value >= Base10::D5Min)
		+ (value >= Base10::D6Min)
		+ (value >= Base10::D7Min)
		+ (value >= Base10::D8Min)
		+ (value >= Base10::D9Min)
		+ (value >= Base10::D10Min)
		+ (value >= Base10::D11Min)
		+ (value >= Base10::D12Min)
		+ (value >= Base10::D13Min)
		+ (value >= Base10::D14Min)
		+ (value >= Base10::D15Min)
		+ (value >= Base10::D16Min)
		+ (value >= Base10::D17Min)
		+ (value >= Base10::D18Min)
		+ (value >= Base10::D19Min)
		+ (value >= Base10::D20Min);

Remarks

The obvious penalty of the above solution is that we always have to go through all the comparisons, even when we could return early. The positive? There is no branching. The CPU can nicely rush through all the code, filling the pipeline.

unroll_v2

While the penalties are clearly visible (a big lag for small values), the potential gains are not, and it doesn’t seem that the lack of branches helps the CPU that much.

It’s all about searching

Looking at the fully unrolled loops we can easily see that what we’ve been doing all this time is plain stupid linear search. Being decently equipped engineers, we know that we can turn this O(n) into O(log n) via binary search.

binary_stl

	auto upRange = std::upper_bound(LOOKUP.cbegin(), LOOKUP.cend(), value);

	if (upRange == LOOKUP.cend()) {
		return 20;
	} else {
		return std::distance(LOOKUP.cbegin(), upRange);
	}

Remarks

Being lazy and smart cookies at the same time, we used the standard algorithms.

binary_stl

Not the improvement we wanted.

Not for the faint of heart

Andrei didn’t present the fully unrolled binary search, so let’s show it.

binary

	if (value < Base10::D8Min) {
		if (value < Base10::D4Min) {
			if (value < Base10::D2Min) {
				return 1;
			} else {
				if (value < Base10::D3Min) {
					return 2;
				} else {
					return 3;
				}
			}
		} else {
			if (value < Base10::D6Min) {
				if (value < Base10::D5Min) {
					return 4;
				} else {
					return 5;
				}
			} else {
				if (value < Base10::D7Min) {
					return 6;
				} else {
					return 7;
				}
			}
		}
	} else {
		if (value < Base10::D16Min) {
			if (value < Base10::D12Min) {
				if (value < Base10::D10Min) {
					if (value < Base10::D9Min) {
						return 8;
					} else {
						return 9;
					}
				} else {
					if (value < Base10::D11Min) {
						return 10;
					} else {
						return 11;
					}
				}
			} else {
				if (value < Base10::D14Min) {
					if (value < Base10::D13Min) {
						return 12;
					} else {
						return 13;
					}
				} else {
					if (value < Base10::D15Min) {
						return 14;
					} else {
						return 15;
					}
				}
			}
		} else {
			if (value < Base10::D18Min) {
				if (value < Base10::D17Min) {
					return 16;
				} else {
					return 17;
				}
			} else {
				if (value < Base10::D19Min) {
					return 18;
				} else {
					if (value < Base10::D20Min) {
						return 19;
					} else {
						return 20;
					}
				}
			}
		}
	}

Remarks

Yes, it’s uber-horrible to write, but we didn’t do it to win the cleanest-code award.

binary

And we have the winner!

That’s the end of our journey

I hope this post becomes a minor, but still valuable, addition to what Andrei has presented. To summarize the tips I gathered while creating this material:

  • don’t assume – measure
  • the compiler knows more tricks than you do, so concentrate on the algorithms and data layout, not micro-optimizations
  • general programming knowledge is applicable everywhere (i.e., binary search)
  • moving data from storage (cache) into the instruction stream is good, but don’t go overboard – you can’t squash all the data into the code every time

As the percentage charts don’t show the differences between the fastest algorithms clearly enough, here is one more chart.

total

Thank you C++ for … exceptions

10/12/2012

Return value is not the best error handling mechanism

C

This example closely resembles the code I’ve seen in Windows CE drivers.

BOOL doStuff() {
    void* mem_buff = NULL;
    PHANDLE res_1_handle = NULL;
    PHANDLE res_2_handle = NULL;
    BOOL result = FALSE;

    if (NULL == (mem_buff = malloc(WORK_BUFFER_SIZE))) {
        goto error;
    }

    if (RES_ERR == acqRes1(res_1_handle)) {
        goto error;
    }

    if (RES_ERR == acqRes2(res_2_handle)) {
        goto error;
    }

    if (compute(res_1_handle, res_2_handle, mem_buff)) {
        result = TRUE;
    }

    error:
    if (res_2_handle != NULL) {
        relRes2(res_2_handle);
        res_2_handle = NULL;
    }

    if (res_1_handle != NULL) {
        relRes1(res_1_handle);
        res_1_handle = NULL;
    }

    if (mem_buff != NULL) {
        free(mem_buff);
        mem_buff = NULL;
    }

    return result;
}

What’s not visible above is the amount of repeated error handling logic deep in the bowels of relRes[12]() and acqRes[12]().

C++11

As you may guess, the resource classes could be written much better using templates, so that there is much less code repetition. But as we try to stay close to the C code, C++ can’t shine that well here.

class Resource1 {
public:
    Resource1() : m_handle(NULL) {
        if (RES_ERR == acqRes1(m_handle)) {
            // A bare 'throw;' outside a catch clause calls std::terminate,
            // so throw a real exception object instead.
            throw std::runtime_error("acqRes1 failed");
        }
    }

    ~Resource1() {
        relRes1(m_handle);
        m_handle = NULL;
    }

    PHANDLE get() {
        return m_handle;
    }

private:
    PHANDLE m_handle;
};

class Resource2 {
public:
    Resource2() : m_handle(NULL) {
        if (RES_ERR == acqRes2(m_handle)) {
            throw;
        }
    }

    ~Resource2() {
        relRes2(m_handle);
        m_handle = NULL;
    }

    PHANDLE get() {
        return m_handle;
    }

private:
    PHANDLE m_handle;
};

bool doStuff() {
    try {
        Resource1 res1;
        Resource2 res2;
        // unique_ptr<void> cannot delete its pointee; the array form can.
        std::unique_ptr<char[]> buffer(
            new char[WORK_BUFFER_SIZE]
        );

        return compute(res1.get(),
                       res2.get(),
                       buffer.get());
    }
    catch (...)
    {
        return false;
    }
}

You can choose your winner now.

Thank you C++ for … strongly typed enumerations

02/12/2012

enum needs a scope

“Enumerations are scope-leeches – they infect the outer scope.”

C++03

class Foo {
	enum Single {
		ONE = 1, TWO = 2, THREE = 3
	};

	enum Double {
		ONE = 2, TWO = 4, THREE = 6
	};
};

Result

error: declaration of ‘ONE’ conflicts with previous declaration ‘Foo::Single Foo::ONE’

error: declaration of ‘TWO’ conflicts with previous declaration ‘Foo::Single Foo::TWO’

error: declaration of ‘THREE’ conflicts with previous declaration ‘Foo::Single Foo::THREE’

Solution #1: Prefix

class Foo {
public:
	enum Single {
		Single_ONE = 1, Single_TWO = 2, Single_THREE = 3
	};

	enum Double {
		Double_ONE = 2, Double_TWO = 4, Double_THREE = 6
	};
};

int main() {
	// Prints: 1
	std::cout << Foo::Single_ONE << std::endl;
	// Prints: 4
	std::cout << Foo::Double_TWO << std::endl;
}

Solution #2: Wrap

class Foo {
public:
	struct Single {
		enum Enum {
			ONE = 1, TWO = 2, THREE = 3
		};
	};

	struct Double {
		enum Enum {
			ONE = 2, TWO = 4, THREE = 6
		};
	};
};

int main() {
	// New way of declaring the above
	// enums (note the required Enum)
	Foo::Single::Enum s;
	Foo::Double::Enum d;

	// Prints 1
	std::cout << Foo::Single::ONE << std::endl;
	// Prints 4
	std::cout << Foo::Double::TWO << std::endl;
}

C++11

class Foo {
public:
	enum class Single {
		ONE = 1, TWO = 2, THREE = 3
	};

	enum class Double {
		ONE = 2, TWO = 4, THREE = 6
	};
};

int main() {
	Foo::Single s; // Typical way to use
	Foo::Double d; // the enumerations

	// The reason to cast to int will be discussed later.
	// Prints: 1
	std::cout << (int)Foo::Single::ONE << std::endl;
	// Prints: 4
	std::cout << (int)Foo::Double::TWO << std::endl;
}

enum‘s double life as an int

“Enumerations are implicitly convertible between each other, as they are ints in nature.”

C++03

// Ordered by their strength
enum Unit {
	LIGHT_INFANTRY,
	HEAVY_INFANTRY,
	COMMANDO
};

// Ordered by range
enum UnitRange {
	SMALL,
	MEDIUM,
	LARGE
};

/// @return Biggest unit range
UnitRange GetBest() {
	return LARGE;
}

Unit GetUnitToBuild() {
	// Poor programmer assumed GetBest() will
	// return the best unit the AI could build.
	if (HEAVY_INFANTRY >= GetBest()) {
		return HEAVY_INFANTRY;
	} else {
		// Will always choose this path.
		return LIGHT_INFANTRY;
	}
}

C++11

// Ordered by their strength
enum class Unit {
	LIGHT_INFANTRY,
	HEAVY_INFANTRY,
	COMMANDO
};

// Ordered by range
enum class UnitRange {
	SMALL,
	MEDIUM,
	LARGE
};

/// @return Biggest unit range
UnitRange GetBest() {
	return UnitRange::LARGE;
}

Unit GetUnitToBuild() {
	if (Unit::HEAVY_INFANTRY >= GetBest()) {
		return Unit::HEAVY_INFANTRY;
	} else {
		return Unit::LIGHT_INFANTRY;
	}
}

Result

error: no match for ‘operator>=’ in ‘(Unit)1 >= GetBest()’
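This is also where the cast mentioned earlier comes in: if a cross-enum comparison really is intended, C++11 makes you spell it out. A small sketch (the helper name is made up):

```cpp
// Strongly typed enums, as in the C++11 example above.
enum class Unit { LIGHT_INFANTRY, HEAVY_INFANTRY, COMMANDO };
enum class UnitRange { SMALL, MEDIUM, LARGE };

// Hypothetical helper: an intentional cross-enum comparison
// must now go through explicit casts, so it can't happen by accident.
bool atLeastAsStrongAsRangeIsWide(Unit u, UnitRange r) {
    return static_cast<int>(u) >= static_cast<int>(r);
}
```

The accidental bug from the C++03 version becomes a compile error, while the deliberate version stays possible and visibly ugly.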

enum is of some type

“Enumerations are internally ints, even though they don’t have to be.”

C++03

enum Recipent {
	ALL,
	SERVER,
	CLIENT,
	NODE
};

enum Response {
	NONE,
	IMMEDIATE,
	DELAYED,
	PERIODIC
};

// Size of our packet is 24B.
struct Packet {
	unsigned char data[16];
	Recipent reci;
	Response resp;
};

C++11

enum Recipent : unsigned char {
	ALL,
	SERVER,
	CLIENT,
	NODE
};

enum Response : unsigned char {
	NONE,
	IMMEDIATE,
	DELAYED,
	PERIODIC
};

// Now the size is only 18B.
struct Packet {
	unsigned char data[16];
	Recipent reci;
	Response resp;
};

Comment

While bit fields provide even better ‘compression’, they are not as easy to use and require careful casting from/to the enum.
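The size claims above can be verified directly; a sketch assuming a typical platform where int is 4 bytes and plain enums default to an int-sized underlying type:

```cpp
// Stand-ins mirroring the Packet layouts above.
enum SmallRecipent : unsigned char { ALL, SERVER, CLIENT, NODE };
enum SmallResponse : unsigned char { NONE, IMMEDIATE, DELAYED, PERIODIC };

// Default (int-sized) members: 16 + 4 + 4 = 24 bytes.
struct PacketDefault {
    unsigned char data[16];
    int reci;   // stands in for a plain C++03 enum
    int resp;
};

// 1-byte enums: 16 + 1 + 1 = 18 bytes, no padding needed
// because every member has alignment 1.
struct PacketSmall {
    unsigned char data[16];
    SmallRecipent reci;
    SmallResponse resp;
};

static_assert(sizeof(PacketSmall) == 18, "1-byte enums add no padding");
```

The static_assert makes the layout assumption explicit, so a port to an unusual ABI fails loudly at compile time instead of silently changing the wire format.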

ASP.NET dynamic controls & one gotcha you might have missed

24/11/2012

One of the most unintuitive topics in ASP.NET site development, bristling with gotchas, is the creation of dynamic controls. Dynamic controls are controls which are not static on the page. In other words, one page can contain a different number and different types of controls, depending on its state.

There are several rules to keep in mind while creating such controls:

  • Due to the stateless nature of websites, controls have to be recreated on every postback/page request. Creating them only once will result in an empty page after a postback.
  • When recreating the controls you have to create the same ones that you created when rendering the last client request. Failing to do so will cost you a nice exception of the form “Invalid View State”.
  • Recreation of dynamically rendered controls should be done in LoadViewState(), which you should override.
  • Creation of a new set of controls requires deleting the old ones and creating the new set in the Load() method, during which you will probably consume the various events telling you to change the state of the page.

There is, however, one gotcha that is not mentioned very often, or not emphasized enough: whenever you are recreating your dynamic controls, remember to assign the same IDs to the controls in order for the events to work. If you are sure you have done everything correctly (regarding dynamic controls) and for some reason the events are not handled, you’d better check the IDs.

FileInfo Exists lies! Or not?

20/11/2012

I recently wrote code that basically looks like this:

FileInfo GetFile()
{
    var info = new FileInfo(path_to_not_yet_created_file);
    
    var someObj = new SomeObj();

    // ... Logic gathering/preparing various pieces of data ...

    someObj.SaveToFile(info);

    return info;
}

void foo()
{
    FileInfo file = GetFile();
    
    // Defensive programming ftw!
    Debug.Assert(file.Exists);
    
    // ... More code ...
}

Seems reasonable.

  1. Create a FileInfo pointing to the file to which you are going to save your result.
  2. Do some computations.
  3. Write the result to the new file pointed to by the FileInfo.
  4. Validate that the file was created, using the Exists property.

But it doesn’t work. Exists always returns false. The file is definitely there, the process has adequate access rights, the path matches … even your dog can see the file. Madness!

After playing around for some time I learned one simple truth: a FileInfo isn’t automatically refreshed and won’t reflect the current state of the file when you query it. In Debug.Assert(file.Exists), Exists will always return false because at the point of creation (var info = new FileInfo(path_to_not_yet_created_file)) the file didn’t exist, and that cached answer never changes unless you call file.Refresh() first.

I will stick with plain ol’ string for now, and remember not to use FileInfo so much. Hope that saves you some time.

No translation to SQL

26/10/2012

When you start to like LINQ (to SQL) very much, someday you may find yourself in a nasty situation where “LINQ cannot translate method to SQL” during … runtime. That’s sad but quite logical: LINQ maps the logic to an equivalent SQL statement. If there is no equivalent, it won’t work. In other words:

var foo = my_context.my_table.Where(val => SuperComplexMethod(val));

won’t work.

The usual answer is to use Expression&lt;Func&gt;, but it only helps when a translation is possible (like converting Math.Pow()).

Another option is to restructure your function so that it uses only the supported logic.

Sometimes, however, you are too lazy to do all of that just to call some arbitrary function. The solution? Just use AsEnumerable(). It basically tells LINQ: “Ok, you can’t do what I want using the underlying database, so leave it to me”. There might be some performance issues due to not being able to use the power of the database, but many times this is the easiest way to go.

var foo = my_context.my_table.AsEnumerable().Where(val => SuperComplexMethod(val));

Mutex-free C# application deadlock

01/10/2012

We’ve all heard about & encountered deadlocks. Nasty buggers they are, won’t you agree? After several long-night fights against them, you should be well equipped to deal with them. You remember not to call external (delegate) code in a synchronized context, you carefully add various logs to monitor the behavior of mutexes, but to your surprise there is still an observable deadlock!

First, let’s make sure that we are dealing with a deadlock, not some funky system behavior, an external library bug, or an endless loop. To test thread behavior you can basically do two things: change the number of threads or add Thread.Yield(). We will focus on the first option.

In order to change the number of actively running threads we use the ThreadPool. The ThreadPool is the default way of using threads, so I can safely assume that you use it directly or indirectly (Tasks).

We changed the number of threads and can observe that the deadlock indeed appears, so we eagerly start debugging our poor parallel program, but the problem is not there.

What you should always check when using thread pools is whether the synchronization logic has taken all available threads. Take a look at the following example (assume the ThreadPool can run 10 threads concurrently).

The problem should be visible now.

  1. #10 waits for an empty slot.
  2. The old threads won’t finish, as they are long-running.
  3. Short-lived thread #9 (or a chain of inter-dependent threads) is waiting on #10.
  4. #10 starves to death, locking the whole application.

You might not encounter this problem every day, but be careful when:

  • you frequently use BeginInvoke() in your code,
  • you port your application to older system on which the default number of threads is lower than you assumed.

While thread pools provide efficient creation of threads, they bring their own issues. Remember to allow a sufficiently high number of max threads in the pool, or reduce the number of threads that have to run concurrently, in order to avoid synchronization starvation.

Events + Locks = Problems

14/07/2012

TL;DR: Never raise events from a synchronized context.

Let’s imagine that you are creating some middle layer between your forms and the business layer (logic). In order not to allow races and data corruption, you introduce a synchronized context. Obviously you won’t be synchronizing on this. So this is our first implementation:

class ThreadSafe {
    private Object m_lock = new Object();

    public void Foo() { lock(m_lock) { ... } }
}

Everything seems OK, but being event-lovers as we are we want to raise them in order to tell other components that our state changed. From now on I will mostly add only new parts of the code to reduce the clutter.

class ThreadSafe {
    public event EventHandler DoneFoo;

    // Only the enclosing class can raise DoneFoo directly, so we expose a
    // protected virtual method to let subclasses raise it as well.
    protected virtual void OnFoo(EventArgs e) {
        if (DoneFoo != null) DoneFoo(this, e);
    }
}

Now we can finally hook the event to our Foo() method.

class ThreadSafe {
    public void Foo() {
        lock(m_lock) { ... OnFoo(EventArgs.Empty); }
    }
}

And … boom! You’ve got a deadlock, sir. Why? Imagine that you have an event listener class called ListenOnFoo that operates on GUI elements.

class ListenOnFoo {
    private ThreadSafe m_ts;

    // Register event handler
    public ListenOnFoo(ThreadSafe ts) { ts.DoneFoo += OnDoneFoo; m_ts = ts; }

    // Remember that we have to adhere to the EventHandler delegate declaration.
    private void OnDoneFoo(Object sender, EventArgs e) {
        DoInternalLogic();
    }

    // As we are dealing with GUI we have to make sure we call the logic via [Begin]Invoke() in order not to corrupt the GUI.
    private void DoInternalLogic() {
        if (this.InvokeRequired) {
            this.Invoke(new MethodInvoker(DoInternalLogic));
            return; // The invoked call already ran the logic.
        }

        m_ts.Foo(); // DEADLOCK!
    }
}

Obviously, DoInternalLogic introduces an infinite recursion problem, but let’s skip that for the moment. As you can probably see, the first thread holds m_lock while it waits inside Invoke() for the GUI thread, and the GUI thread in turn blocks on m_lock inside Foo(). We have got a deadlock.

The problem with this kind of issue is that it is very difficult to track down, and in the case of complex GUI logic it could take us a long time to find. Because of that, it’s important not to make such mistakes in the first place.

First, the general rule of thumb is not to use Control.Invoke() and instead use its younger sister, Control.BeginInvoke(). The difference between the two (as you can probably guess from the popular naming convention) is that the latter calls the delegate asynchronously, allowing the first thread (in our case) to finish execution and release m_lock.

The issue with BeginInvoke() is that it won’t return the result right away (as it is non-blocking). If you happen to have invokable methods with a non-void return (or when you do not want to proceed before they successfully modify the inner object state), then consider handling the IAsyncResult that BeginInvoke() returns.

I won’t discuss ways of dealing with asynchronous programs, as this is a very broad topic (decently described on MSDN). What I will say for now is that you should think about the logic of your code and find a place where you can block while waiting on the IAsyncResult (definitely not in the GUI thread!).

The second general rule of thumb is to keep synchronized regions as small and simple as possible. In our case, the simplest solution would be to move the OnFoo() call out of the lock.

class ThreadSafe {
    public void Foo() {
        lock(m_lock) { ... }
        OnFoo(EventArgs.Empty); // Not in the lock anymore
    }
}

That was easy, but I would rather go even further and disallow calling methods in locked regions altogether. I know it sounds crazy, but if you allow code like the following, you might have a serious deadlock issue in the future.

void Bar() {
    lock(m_lock) {
        Baz();
        Bay();
        Bax();
    }
}

You won’t know whether any of these methods raises an event or invokes a delegate. Obviously, with enough care and comments you can safely use methods inside synchronized regions but, please, KISS!