Intro

Recently I have been reading Q&A for the 70-483 Microsoft certification exam. Most of the questions are very simple, but some of them were actually hard to answer correctly. One of them pushed me to do a little research into how the C# compiler concatenates strings. At first I thought the question was very easy and there was nothing special about it, but then I realized that I was not 100% confident in my answer. I will show you why.

Question

You are developing an application that will convert data into multiple output formats. You are developing a code segment that will produce tab-delimited output. All output routines implement the following interface:

public interface IOutputFormatter<T>
{
    string GetOutput(IEnumerator<T> iterator, int recordSize);
}

The application includes the following code:

public class Formatter : IOutputFormatter<string>
{
    private readonly Func<int, char> suffix = col => col % 2 == 0 ? '\n' : '\t';

    public string GetOutput(IEnumerator<string> iterator, int recordSize)
    {
        // Insert your code here
    }
}

You need to minimize the completion time of the GetOutput() method. Which code segment should you insert?

A:

string output = null;
for (int i = 1; iterator.MoveNext(); i++)
{
    output = string.Concat(output, iterator.Current, suffix(i));
}

return output;

B:

var output = new StringBuilder();
for (int i = 1; iterator.MoveNext(); i++)
{
    output.Append(iterator.Current);
    output.Append(suffix(i));
}

return output.ToString();

C:

string output = null;
for (int i = 1; iterator.MoveNext(); i++)
{
    output = output + iterator.Current + suffix(i);
}

return output;

D:

string output = null;
for (int i = 1; iterator.MoveNext(); i++)
{
    output += iterator.Current + suffix(i);
}

return output;

Research

At first, I thought that I should choose the answer with a StringBuilder (B), because it is a best practice to use a StringBuilder to concatenate strings. I always use a StringBuilder when I need to concatenate more than 3 string variables. But then I decided to spend some time profiling the different answers to gain some insight into how the C# compiler works.

I used the following code to generate test data and call the GetOutput method:

var strings = new List<string>();
var testDataSize = 100000;
for (var i = 0; i < testDataSize; i++)
{
    strings.Add($"item{i}");
}

var output = new Formatter().GetOutput(strings.GetEnumerator(), 1000);

After that, I started collecting data for further analysis.

One of the best things in Visual Studio 2012+ is the built-in profiler. I am not going to describe all the cool features it has; we will just use it to measure how long a single method takes to execute. To access the profiler, go to Analyze => Performance Profiler in the top menu of Visual Studio. You will see the following screen:

[Screenshot: Analysis target]

Select "Performance Wizard" and "Instrumentation":

[Screenshot: Performance Wizard]

Then click "Next", "Next", "Next" and the profiler will be launched.

I did this for all 4 versions, and here are the results:

A - more than 1 minute

[Screenshot: Formatter A]

B (StringBuilder) - around 1 second

[Screenshot: Formatter B]

C - more than 1 minute

[Screenshot: Formatter C]

D - around 1 minute

[Screenshot: Formatter D]
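If you do not have the profiler at hand, a rough comparison can also be made with System.Diagnostics.Stopwatch. The following is only a minimal sketch: the ConcatTiming class and its Measure helper are hypothetical names introduced for illustration, only answers B and D are included, and the data set is smaller than in my test so that the slow version finishes quickly. A single Stopwatch run is far less precise than an instrumented profile, so treat the numbers as an order-of-magnitude check.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

public static class ConcatTiming
{
    // Same alternating suffix as in the Formatter class from the question.
    private static readonly Func<int, char> Suffix = col => col % 2 == 0 ? '\n' : '\t';

    // Answer B: append every piece to a StringBuilder.
    public static string WithStringBuilder(IEnumerator<string> iterator)
    {
        var output = new StringBuilder();
        for (int i = 1; iterator.MoveNext(); i++)
        {
            output.Append(iterator.Current);
            output.Append(Suffix(i));
        }

        return output.ToString();
    }

    // Answer D: repeated += on a string.
    public static string WithPlusEquals(IEnumerator<string> iterator)
    {
        string output = null;
        for (int i = 1; iterator.MoveNext(); i++)
        {
            output += iterator.Current + Suffix(i);
        }

        return output;
    }

    // Hypothetical helper: runs one formatter over the data and returns the elapsed time.
    public static TimeSpan Measure(Func<IEnumerator<string>, string> formatter, List<string> data)
    {
        var watch = Stopwatch.StartNew();
        formatter(data.GetEnumerator());
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        var strings = new List<string>();
        for (var i = 0; i < 20000; i++)
        {
            strings.Add($"item{i}");
        }

        Console.WriteLine($"B: {Measure(WithStringBuilder, strings).TotalMilliseconds} ms");
        Console.WriteLine($"D: {Measure(WithPlusEquals, strings).TotalMilliseconds} ms");
    }
}
```

Both methods produce the same string for non-empty input, so the only thing being compared is the time each one takes.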

It is obvious that the version with a StringBuilder is the fastest. But you know what? It is not the correct answer according to the Q&A. I don't know why, but the guy who put all those questions together decided that the correct answer is D. According to the explanation in that document:

A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, and the new data is then appended to the new buffer. The performance of a concatenation operation for a String or StringBuilder object depends on the frequency of memory allocations. A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation allocates memory only if the StringBuilder object buffer is too small to accommodate the new data. Use the String class if you are concatenating a fixed number of String objects. In that case, the compiler may even combine individual concatenation operations into a single operation. Use a StringBuilder object if you are concatenating an arbitrary number of strings; for example, if you're using a loop to concatenate a random number of strings of user input.

So he is trying to say that something like this:

string s = @""
    + "line1"
    + "line2"
    + "line3"
    + "line4"
    + "line5"
    + "line6";

...will be optimized by the compiler to this:

string s = "line1line2line3line4line5line6";

And it will. You can always check the resulting IL code in Release mode to better understand how everything works inside:

IL_004c: ldstr        "line1line2line3line4line5line6"
IL_0051: stloc.3      // s
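The same folding can be observed without reading IL. The sketch below relies on two things I am assuming you are comfortable with: a concatenation of string literals is a compile-time constant, and identical string literals in one assembly refer to the same interned object. The ConstantFolding class and its method names are made up for this illustration.

```csharp
using System;

public static class ConstantFolding
{
    // Literal + literal is folded at compile time, so both variables
    // point to the same interned string object.
    public static bool LiteralsAreFolded()
    {
        string folded = "line1" + "line2" + "line3";
        string literal = "line1line2line3";
        return object.ReferenceEquals(folded, literal);
    }

    // With a run-time value involved, the concatenation happens at run time
    // and produces a new string object with the same contents.
    public static bool RuntimeConcatIsSameObject(string prefix)
    {
        string joined = prefix + "line2line3";
        string literal = "line1line2line3";
        return object.ReferenceEquals(joined, literal);
    }

    public static void Main()
    {
        Console.WriteLine(LiteralsAreFolded());
        Console.WriteLine(RuntimeConcatIsSameObject("line1"));
    }
}
```

The first check succeeds because only one string exists after compilation. The second one fails even though the contents are equal, because the string was built while the program was running.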

But once you add something more complex, you will get totally different IL code. Consider this example:

string[] arr = new string[6]{
    "line1",
    "line2",
    "line3",
    "line4",
    "line5",
    "line6"
};

var arrOutput = string.Empty;
foreach(var line in arr)
{
    output += line;
}

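Here we create a fixed-size array of string literals and then concatenate them. Note that the loop appends to output, a string local declared earlier in the same method, which is why the IL below refers to output. What could be simpler?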

It will generate the following IL code:

IL_005e: ldc.i4.6     
IL_005f: newarr       [mscorlib]System.String
IL_0064: dup          
IL_0065: ldc.i4.0     
IL_0066: ldstr        "line1"
IL_006b: stelem.ref   
IL_006c: dup          
IL_006d: ldc.i4.1     
IL_006e: ldstr        "line2"
IL_0073: stelem.ref   
IL_0074: dup          
IL_0075: ldc.i4.2     
IL_0076: ldstr        "line3"
IL_007b: stelem.ref   
IL_007c: dup          
IL_007d: ldc.i4.3     
IL_007e: ldstr        "line4"
IL_0083: stelem.ref   
IL_0084: dup          
IL_0085: ldc.i4.4     
IL_0086: ldstr        "line5"
IL_008b: stelem.ref   
IL_008c: dup          
IL_008d: ldc.i4.5     
IL_008e: ldstr        "line6"
IL_0093: stelem.ref   
IL_0094: stloc.s      arr
IL_0096: ldsfld       string [mscorlib]System.String::Empty
IL_009b: stloc.s      arrOutput
IL_009e: ldloc.s      arr
IL_00a0: stloc.s      V_8
IL_00a2: ldc.i4.0     
IL_00a3: stloc.s      V_9
IL_00a5: br.s         IL_00bf
IL_00a7: ldloc.s      V_8
IL_00a9: ldloc.s      V_9
IL_00ab: ldelem.ref   
IL_00ac: stloc.s      line
IL_00af: ldloc.2      // output
IL_00b0: ldloc.s      line
IL_00b2: call         string [mscorlib]System.String::Concat(string, string)
IL_00b7: stloc.2      // output

The important line is:

IL_00b2: call         string [mscorlib]System.String::Concat(string, string)

The compiler didn't optimize this code. It calls string.Concat on every iteration, and each call allocates a new string and copies both operands into it, so the total amount of copying grows quadratically with the number of pieces. Whenever you see string.Concat being called in a loop, in IL code or in the profiler, be ready for performance problems in your application.
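To get a feeling for the cost, here is a back-of-the-envelope sketch that counts how many characters end up being copied. It is a simplified model, not a measurement: it assumes every piece has the same length, it ignores allocation and garbage collection entirely, and the append estimate ignores any copying a buffer does when it grows. The CopyCost class and its method names are hypothetical.

```csharp
using System;

public static class CopyCost
{
    // Repeated concatenation: every step builds a new string that holds
    // everything accumulated so far plus the new piece.
    public static long CharsCopiedByRepeatedConcat(int pieces, int pieceLength)
    {
        long copied = 0;
        long currentLength = 0;
        for (int i = 0; i < pieces; i++)
        {
            currentLength += pieceLength;
            copied += currentLength;
        }

        return copied;
    }

    // Appending into a large enough buffer: every piece is copied once.
    public static long CharsCopiedByAppend(int pieces, int pieceLength)
    {
        return (long)pieces * pieceLength;
    }

    public static void Main()
    {
        Console.WriteLine(CharsCopiedByRepeatedConcat(100000, 6));
        Console.WriteLine(CharsCopiedByAppend(100000, 6));
    }
}
```

For 100,000 pieces of 6 characters each, the first method returns 30,000,300,000 and the second returns 600,000. Under this model, that gap is the difference between a minute and a second.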

We can see similar IL code for answer D:

IL_0006: ldloc.0      // output
IL_0007: ldarg.1      // iterator
IL_0008: callvirt     instance !0/*string*/ class [mscorlib]System.Collections.Generic.IEnumerator`1<string>::get_Current()
IL_000d: ldarg.0      // this
IL_000e: ldfld        class [mscorlib]System.Func`2<int32, char> Boades.Demo.StringBuilderVsString.FormatterD::suffix
IL_0013: ldloc.1      // i
IL_0014: callvirt     instance !1/*char*/ class [mscorlib]System.Func`2<int32, char>::Invoke(!0/*int32*/)
IL_0019: stloc.2      // V_2
IL_001a: ldloca.s     V_2
IL_001c: call         instance string [mscorlib]System.Char::ToString()
IL_0021: call         string [mscorlib]System.String::Concat(string, string, string)
IL_0026: stloc.0      // output

We see the call to string.Concat:

IL_0021: call         string [mscorlib]System.String::Concat(string, string, string)

It means the compiler didn't optimize the code in this case either. I doubt it can optimize anything here.

Anyway, let's take a look at the IL code for answer B:

IL_000a: ldloc.0      // output
IL_000b: ldarg.1      // iterator
IL_000c: callvirt     instance !0/*string*/ class [mscorlib]System.Collections.Generic.IEnumerator`1<string>::get_Current()
IL_0011: callvirt     instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
IL_0016: pop      
IL_0017: ldloc.0      // output
IL_0018: ldarg.0      // this
IL_0019: ldfld        class [mscorlib]System.Func`2<int32, char> Boades.Demo.StringBuilderVsString.FormatterB::suffix
IL_001e: ldloc.1      // i
IL_001f: callvirt     instance !1/*char*/ class [mscorlib]System.Func`2<int32, char>::Invoke(!0/*int32*/)
IL_0024: callvirt     instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(char)
IL_0029: pop          

We don't see any calls to string.Concat here, which means this code will run much faster on big data sets than anything that uses string.Concat.
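The reason Append is cheap is that a StringBuilder only allocates when it runs out of room. One way to see this, as a rough sketch, is to watch the Capacity property while appending. The BuilderGrowth class and its ObserveCapacities method are hypothetical names, and the exact capacity values are an implementation detail that differs between runtime versions, so I am not listing any expected numbers here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BuilderGrowth
{
    // Appends `count` single characters and records every distinct
    // Capacity value the builder reports along the way.
    public static List<int> ObserveCapacities(int count)
    {
        var builder = new StringBuilder();
        var capacities = new List<int> { builder.Capacity };
        for (var i = 0; i < count; i++)
        {
            builder.Append('x');
            if (builder.Capacity != capacities[capacities.Count - 1])
            {
                capacities.Add(builder.Capacity);
            }
        }

        return capacities;
    }

    public static void Main()
    {
        var capacities = ObserveCapacities(1000);
        Console.WriteLine($"1000 appends, {capacities.Count - 1} capacity changes");
        Console.WriteLine(string.Join(", ", capacities));
    }
}
```

The number of capacity changes should be far smaller than the number of appends, while every += on a string allocates a new object.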

Summary

There are many cases when you shouldn't use a StringBuilder. A StringBuilder works best when you are dealing with unpredictable input and an unknown number of strings. When the number of strings is fixed and small, or when you need some micro-optimizations, consider using string.Join or plain string.Concat instead. Make sure you know how your code is going to be used and apply any optimization accordingly. Do not forget to profile your app when making those changes.
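As a small illustration of the last point, both string.Join and string.Concat have overloads that accept a whole sequence, so the loop disappears from your own code. This is a minimal sketch with hypothetical names (JoinAndConcat, TabDelimited, Glue). It is not a drop-in replacement for GetOutput from the question, because it puts the separator only between the items and does not alternate tabs and newlines.

```csharp
using System;
using System.Collections.Generic;

public static class JoinAndConcat
{
    // Puts a tab between the fields, with no trailing separator.
    public static string TabDelimited(IEnumerable<string> fields)
    {
        return string.Join("\t", fields);
    }

    // Glues the parts together without any separator.
    public static string Glue(IEnumerable<string> parts)
    {
        return string.Concat(parts);
    }

    public static void Main()
    {
        var fields = new List<string> { "id", "name", "email" };
        Console.WriteLine(TabDelimited(fields));
        Console.WriteLine(Glue(fields));
    }
}
```

Whether this beats a StringBuilder in your scenario is something only a profiler run can tell you.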

