How Hard Are URLs in Text?

   Abandon all hope, you who read this, for this is where you will find out how seemingly simple programming problems can turn out to be really hard. For our current project we wanted to implement a seemingly simple feature: when the user posts a comment via a text area, new lines should be converted to "<br />" in the resulting HTML, HTML tags should be encoded so that our project is not vulnerable to script injection while users can still post HTML as text in their comments, and finally URLs should be detected and converted to links. Sounds simple? As soon as I was charged with this task I remembered reading a post by the wise Jeff Atwood on his famous blog Coding Horror about how hard URLs can be. The post deals with the issue of a closing parenthesis at the end of a URL, and I immediately decided that I was going to ignore this issue like most systems do. However, while working on this task I hit many more walls.

   As you probably expect, we will start with a regular expression for detecting URLs. At first I fell into the trap of using a complex regular expression meant for validating URLs. When you validate a URL you need to be much more exact and disallow certain characters. Detection is different: if a user posts an invalid URL it is still a URL, and you can still make it a link even if the link is broken. So we go with the simplest possible regex found on Jeff's blog:

   \bhttp://[^\s]+

   I wanted more links detected so I added several more protocols:

   \b(http(s)?|ftp|file)://[^\s]+

   We also wanted to detect text that starts with "www." and has no protocol prefix as a link:

   \b(((http(s)?|ftp|file)://)|www\.)[^\s]+
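   To see what this pattern actually picks up, here is a minimal sketch that lists the matches in a piece of text. The class and method names (UrlPatternDemo, FindUrls) are made up for illustration and are not part of the project code.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlPatternDemo
{
    // The final pattern from the text above.
    private const string UrlPattern = @"\b(((http(s)?|ftp|file)://)|www\.)[^\s]+";

    // Hypothetical helper: returns every substring the pattern treats as a URL.
    public static List<string> FindUrls(string input)
    {
        var found = new List<string>();
        foreach (Match match in Regex.Matches(input, UrlPattern))
        {
            found.Add(match.Value);
        }
        return found;
    }
}
```

   Because [^\s]+ takes everything up to the next whitespace, calling FindUrls on "(see http://example.com)" yields "http://example.com)" with the closing parenthesis included, which is exactly the issue from Jeff's post that I chose to ignore.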

   So far so good. We have crafted a regex to detect the links. Now we need to replace them. Like most platforms, the .NET Framework has a method for replacing regex matches within a text. I went for the simplest overload, which has the signature Regex.Replace(string input, string pattern, string replacement), and wrote a method to replace the URLs with links:

   public static string ReplaceUrlsWithLinks(string input)
   {
       return Regex.Replace(input,
           @"\b(((http(s)?|ftp|file)://)|www\.)[^\s]+",
           "<a target='_blank' href='$&'>$&</a>"); // $& references the whole match
   }

   We are almost done... or at least it seems so. Before we call this method we need to encode the HTML special characters in the string. If we encoded after replacing the links, the <a> tags we had just generated would be encoded as well and would show up as text. We also need to replace the new lines with the <br /> tag.

   Utilities.ReplaceUrlsWithLinks(Server.HtmlEncode(txtComment.Text).Replace(Environment.NewLine, "<br />")) *
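   The ordering argument is easy to check with a small sketch. It assumes System.Net.WebUtility.HtmlEncode as a stand-in for Server.HtmlEncode so that it runs outside a web page, and the class and method names are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class EncodingOrderDemo
{
    private const string UrlPattern = @"\b(((http(s)?|ftp|file)://)|www\.)[^\s]+";
    private const string LinkTemplate = "<a target='_blank' href='$&'>$&</a>";

    // Encode the user's text first, then wrap the URLs: the generated <a> tags stay real markup.
    public static string EncodeThenLink(string input)
    {
        return Regex.Replace(WebUtility.HtmlEncode(input), UrlPattern, LinkTemplate);
    }

    // Wrap the URLs first, then encode: the generated <a> tags are encoded along with everything else.
    public static string LinkThenEncode(string input)
    {
        return WebUtility.HtmlEncode(Regex.Replace(input, UrlPattern, LinkTemplate));
    }
}
```

   With the input "<b> http://example.com", EncodeThenLink turns the user's <b> into "&lt;b&gt;" and keeps the anchor as a working link, while LinkThenEncode produces a string that starts with "&lt;b&gt; &lt;a" and is displayed as plain text.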

   It is hard to follow the order in which the methods are called, so we either need to split the code across separate lines or use the power of C# and create extension methods.

   public static string HtmlEncode(this string input)
   {
       return HttpUtility.HtmlEncode(input);
   }

   txtComment.Text.HtmlEncode().Replace(Environment.NewLine, "<br />").ReplaceUrlsWithLinks();

   This is much better. It even works for most links, but we forgot about the links starting with "www.". Browsers do not really like href attributes that have no protocol: they treat them as relative paths. We need to add "http://" in front of these links but not in front of the others. For this task we summon the great spirits of functional programming. We are going to use the overload of the Regex.Replace method with the signature Replace(string input, string pattern, MatchEvaluator evaluator, RegexOptions options). MatchEvaluator is a delegate that is called for each regex match and returns the replacement string, so you can pass custom C# code there instead of using the regex replacement patterns. While we are at it we will pass some regex options: RegexOptions.IgnoreCase lets "HTTP://" and "WWW." match too, and RegexOptions.ExplicitCapture stops the unnamed groups from being captured, since we only need the whole match. Our method now looks like this:

   public static string ReplaceUrlsWithLinks(this string input)
   {
       return Regex.Replace(input,
           @"\b(((http(s)?|ftp|file)://)|www\.)[^\s]+",
           match =>
           {
               // A link without a protocol would be treated as a relative path, so prefix it.
               string url = match.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                   ? "http://" + match.Value
                   : match.Value;
               return String.Format("<a target='_blank' href='{0}'>{1}</a>", url, match.Value);
           },
           RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
   }
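   The lambda is only one way to supply a MatchEvaluator. As a sketch of an alternative (not what we used in the project), the same logic can live in named methods, which makes the href rule easy to test on its own. The names LinkEvaluatorDemo, ToHref, ToAnchor and Linkify are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkEvaluatorDemo
{
    private const string UrlPattern = @"\b(((http(s)?|ftp|file)://)|www\.)[^\s]+";

    // Decides what goes into the href attribute for a matched piece of text.
    public static string ToHref(string matchedText)
    {
        return matchedText.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? "http://" + matchedText
            : matchedText;
    }

    // Has the MatchEvaluator shape: takes a Match, returns the replacement string.
    public static string ToAnchor(Match match)
    {
        return String.Format("<a target='_blank' href='{0}'>{1}</a>", ToHref(match.Value), match.Value);
    }

    public static string Linkify(string input)
    {
        return Regex.Replace(input, UrlPattern, new MatchEvaluator(ToAnchor),
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }
}
```

   Calling Linkify on "go to WWW.Example.com now" keeps the visible text as the user typed it and only changes the href to "http://WWW.Example.com".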

   This works until we put a new line directly after the link. In that case the new line has already been replaced with "<br />", and since no whitespace separates it from the URL, the regex swallows the start of the tag ("<br") as part of the URL. We need to move the call to String.Replace after the ReplaceUrlsWithLinks call:

   txtComment.Text.HtmlEncode().ReplaceUrlsWithLinks().Replace(Environment.NewLine, "<br />")
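   The difference between the two orderings can be reproduced in isolation. This is a small sketch with hypothetical names, and it hard-codes "\r\n" instead of Environment.NewLine so that it behaves the same on every platform.

```csharp
using System.Text.RegularExpressions;

public static class NewlineOrderDemo
{
    private const string UrlPattern = @"\b(((http(s)?|ftp|file)://)|www\.)[^\s]+";

    // Wrong order: the new line is already "<br />" when the regex runs.
    public static string MatchAfterBreaks(string input)
    {
        return Regex.Match(input.Replace("\r\n", "<br />"), UrlPattern).Value;
    }

    // Right order: the regex still sees the new line, which counts as whitespace.
    public static string MatchBeforeBreaks(string input)
    {
        return Regex.Match(input, UrlPattern).Value;
    }
}
```

   For the input "http://example.com\r\nnext", MatchAfterBreaks returns "http://example.com<br" while MatchBeforeBreaks returns just "http://example.com".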

   Now this seems to work... until someone finds out what is wrong with it. In the real project we also needed to reverse the process to provide editing functionality, so I wrote the following methods:

   public static string ReplaceLinksWithUrls(this string input)
   {
       return Regex.Replace(input, @"<a\b[^>]*>(?<URL>.*?)</a>", "${URL}", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
   }

   public static string HtmlDecode(this string input)
   {
       return HttpUtility.HtmlDecode(input);
   }

   Then you can turn the stored HTML back into the original text like this:

   text.Replace("<br />", Environment.NewLine).ReplaceLinksWithUrls().HtmlDecode()
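   To convince yourself that the two directions really undo each other, here is a self-contained round-trip sketch. It assumes System.Net.WebUtility in place of HttpUtility and "\n" in place of Environment.NewLine so that it runs anywhere, and the names CommentRoundTripDemo, ToHtml and ToText are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class CommentRoundTripDemo
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture;

    // Forward direction: encode, then linkify, then convert new lines.
    public static string ToHtml(string text)
    {
        string encoded = WebUtility.HtmlEncode(text);
        string linked = Regex.Replace(encoded,
            @"\b(((http(s)?|ftp|file)://)|www\.)[^\s]+",
            match =>
            {
                string url = match.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                    ? "http://" + match.Value
                    : match.Value;
                return String.Format("<a target='_blank' href='{0}'>{1}</a>", url, match.Value);
            },
            Options);
        return linked.Replace("\n", "<br />");
    }

    // Reverse direction: the same three steps undone in the opposite order.
    public static string ToText(string html)
    {
        string withNewLines = html.Replace("<br />", "\n");
        string withoutLinks = Regex.Replace(withNewLines,
            @"<a\b[^>]*>(?<URL>.*?)</a>", "${URL}", Options);
        return WebUtility.HtmlDecode(withoutLinks);
    }
}
```

   For a comment such as "5 < 6, see www.example.com\nbye", ToText(ToHtml(comment)) gives back the original string. A sketch like this only covers the inputs you try, so it is a sanity check and not a proof that every comment survives the round trip.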

   It took me many hours to come up with this solution, and in the process I hit many more walls that I did not write about, because they turned out to be built on a longer road than this one, so you do not need to go there. This solution is probably not perfect. Even while I was writing this article I found ways to improve the original version. If you find any weaknesses in the solution, or you know a shorter or easier one, I would appreciate it if you posted it in the comments.

* Environment.NewLine is the way to get the new line sequence in .NET, but on Windows it returns "\r\n". When you send the string from JavaScript in an AJAX call you have to look for "\n" instead, because this is what JavaScript believes a new line is. And yes, I hit this wall too.
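   One way around this wall, shown here only as a sketch that the project code above does not use, is to normalize every kind of new line before inserting the tags. The names NewLineDemo and NewLinesToBreaks are made up.

```csharp
public static class NewLineDemo
{
    // Collapse "\r\n" and lone "\r" into "\n" first, then emit one <br /> per new line.
    public static string NewLinesToBreaks(string text)
    {
        return text
            .Replace("\r\n", "\n")
            .Replace("\r", "\n")
            .Replace("\n", "<br />");
    }
}
```

   This way it no longer matters whether the text came from a regular postback or from JavaScript: NewLinesToBreaks("a\r\nb\nc") gives "a<br />b<br />c".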
Tags:   english programming 
Posted by:   Stilgar
02:37 03.09.2009
