I believe every developer has been in a situation where the software behaves in a way that simply cannot be. Every developer has also applied fixes that make no sense, with no idea how or why they worked. Recently I came across the most brutal version of "cannot be" I have ever encountered while working on software.
In the project I work on, users can attach images and other files to comments on various items. When the user uploads a file, a preview of the image is displayed, and then the user can post the comment. This is all done via AJAX. When the comment is posted, the server returns the HTML for the comment, which is displayed on the page. Behind the scenes, when the user uploads a file, a temporary folder is created for the comment, and the images and their thumbnails are stored in it. The preview images are loaded from this folder. When the comment is posted, it is stored in the database together with the images, the temporary folder is deleted, and the HTML for the comment is returned to the client. The images for the actual comment are loaded from the database. As you may know, I am a big advocate of storing user files in the database, and this includes temporary files, but using a temporary folder was not my decision. In fact, we moved the files to the database after the whole upload was already written. The original version copied the temporary files to a new location on the file system.
The first problem we ran into was that the delete of the folder sometimes failed, probably due to some OS file lock. My solution was to catch the IO exception, sleep the thread for 10 milliseconds and try again. Very rarely, the delete in the catch block would still throw an exception. This is why I added a second try/catch inside the catch block and forced the thread to sleep for a whole second. It seems the file lock issue is quite common. Even Google Chrome has a similar solution to the problem (it tries to delete the file twice).
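For readers who want to see the shape of that retry logic, here is a minimal sketch. It is written in Python rather than the project's actual .NET code, and the function name, its parameters and the injectable `rmtree`/`sleep` hooks are all hypothetical. It also assumes that the one-second sleep is followed by a third and final delete attempt, which is my reading of the nested try/catch described above.

```python
import shutil
import time


def delete_folder_with_retry(folder, delays=(0.01, 1.0),
                             rmtree=shutil.rmtree, sleep=time.sleep):
    """Delete `folder`, pausing and retrying when the OS refuses.

    `delays` holds the pause (in seconds) taken after each failed attempt:
    10 ms after the first failure, a full second after the second one.
    Returns the number of attempts made; re-raises the last OSError if
    every attempt fails. (Hypothetical helper, not the project's code.)
    """
    attempts = 0
    # One attempt per delay, plus a final attempt with nothing left to wait for.
    for delay in (*delays, None):
        attempts += 1
        try:
            rmtree(folder)
            return attempts
        except FileNotFoundError:
            # The folder is already gone, which is the outcome we wanted.
            return attempts
        except OSError:
            if delay is None:
                raise  # out of retries, let the caller see the failure
            sleep(delay)
```

Calling `delete_folder_with_retry(path)` on an unlocked folder removes it on the first attempt. The `rmtree` and `sleep` parameters exist only so the retry path can be exercised in a test without a real file lock or a real one-second wait.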
However, we noticed something even stranger. When a comment was posted, the text appeared immediately, but there was a significant lag before the image was displayed. This only happened the first time a comment was displayed. Of course my code was blamed because it forced a thread to sleep. I was quick to point out that my code (including the sleep) had already executed by the time the HTML was sent back to the client, and the text of the comment did appear immediately. It was the image request that was lagging. Clearly my code could not cause that. A coworker of mine commented out all the code responsible for deleting the temporary folders and… the images were loading fast… WTF?!?!? He was quick to refactor the code to use the wonderful Task Parallel Library to delay the deletion of the temporary folder by scheduling a task on another thread. As cool as TPL is, I stopped him and demanded that we find out why this seemingly impossible thing was happening instead of hacking around it. I just did not believe that a sleep in one request could cause the next request to lag. What is more, the second request was only triggered after the first request had returned, because the link to the image is in the HTML that the first request returns.
After several hours of commenting out random lines of code, changing sleep timings and staring blankly at the code, I noticed that after the AJAX request finished and before the request for the image began, the Visual Studio status bar went crazy, looping through different DLLs. Then I had my eureka moment. Because the folder structure of the project had changed (i.e. a folder was deleted), IIS had to recompile the project. The recompilation, triggered by the delete and not by the sleep, caused the delay. Now that I think of it, I am so glad I insisted on investigating the cause of the issue. If we had implemented a delay for deleting the temporary folder, it would have appeared that the problem was solved because the image would show up instantly. However, once the delayed delete finally ran, the folder would be removed and some later request would still hit the recompilation lag. That would have been much worse because we would not have been able to link it to attaching files. It would have looked like a seemingly random, inexplicable delay for the whole system.
As much as I would like to, I cannot add this issue to the list of arguments for storing user files in the database instead of on the file system. My quick search did not find an easy way to disable the recompilation triggered by a certain folder, but even if it had, I would still not use it. Even when you store user files on the file system, you should pull the folder out of the website folder structure. This simplifies updates to the code and backups of the user files. Of course, you should never store user files on the file system, but even more importantly, you should always know the answer to the question "Why?".