To fill in the vacuum of technical articles on this no-blog I decided to put in this article about small but interesting problem with LINQ to Objects performance that a guy posted on a developers' forum a while ago. Fear not for I am working on more serious technical article but I am doing it slowly like a glacier moving.
The task in question was a simple implementation of arithmetic coding
using LINQ to Objects. Basically arithmetic encoding writes a number telling how many times a symbol repeats and then the symbol that repeats. For example 111 is encoded as 31 (3 times 1) and 21 is encoded as 1211 (1 time 2 and 1 time 1). The original version was this:
public static string Encode(string input)
StringBuilder result = new StringBuilder();
var piece = input.AsEnumerable();
var u0 = piece.First();
var first = piece.TakeWhile(c => c == u0);
var count = first.Count();
piece = piece.Skip(count);
The algorithm takes the first character of the string ( piece.First() ) queries for repeating characters ( piece.TakeWhile(c => c == u0) ), counts the number of repeating characters ( first.Count() ), encodes them in the result ( result.Append(count.ToString()).Append(u0) ) then reassigns what is left to the same variable ( piece = piece.Skip(count); ) and continues looping until the input is depleted ( while (piece.Any()) ).
If you test it with a short string it works fine:
string input = "1234567890abcdefghijklmnopqrstuvwxyz";
string result = Encode(input);
However if you pass a longer input it becomes really slow:
for (int i = 0; i < 20; i++)
input += "1234567890abcdefghijklmnopqrstuvwxyz";
result = Encode(input);
The reason for this behavior is that LINQ extension methods defer the execution. When you query an enumerable you do not produce a new collection. You produce an enumerable class that contains the original enumerable and contains information on how to produce a new enumerable. When invoked the new enumerable is produced using this information. Note that like all enumerable objects every enumerator that is produced is independent of the previous one. The fact that you enumerated the collection one does not mean that the next time you enumerate the same enumerator would be used. In fact a new one would be produced.
In this case the Skip() method produces its result in that way. When we assign it to the piece variable we get an enumerable that can be called when needed. However the next time we call the Skip() method we produce an enumerable that contains the previous enumerable. In order to execute the result it has to invoke the original nested enumerable which in turn has to invoke its own nested enumerable and so on until they recursively reach the original string. Creating and destroying the amount of enumerators needed for a longer string kills the performance.
The good news is that we can easily solve this problem. We just need to add ToArray() after the Skip() method:
piece = piece.Skip(count).ToArray();
This method will create an array from the enumerable. This way we avoid all the nesting because the array is produced on the spot just once. Suddenly the performance is back to normal and this method works just fine. Of course if you need every bit of performance there are better ways to solve this task in C#. This method still creates a lot of objects and arrays and puts some pressure on the garbage collector. However the performance here suffers the most by input that is not suitable for arithmetic coding in the first place (i.e. one that does not consist of long sequences of repeating symbols).
I hope that this will help you avoid performance problems with LINQ to Objects. I have no doubt that everyone is using it. I just cannot imagine how people used to write a loop to filter a collection in the past.