January 10, 2012

Remove Wild Characters & HTML from a string

While parsing email or reading a string from a Rich Text Editor (RTE), we may find junk data (HTML, wild characters, etc.) inside that string.

In that case we need a cleanup method like the one below.
    using System.Text.RegularExpressions;

    public string RemoveHTMLandWildChar(string input)
    {
        // Remove HTML tags (the pattern also matches tags that span several lines)
        input = Regex.Replace(input, @"<(.|\n)*?>", string.Empty);

        // Remove wild characters: bracketed tokens such as [abc]
        input = Regex.Replace(input, @"\[\w+]", string.Empty);
        return input;
    }
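
To see the method in action, here is a minimal sketch that wraps it in a static class so it can be called without an instance. The class name StringCleaner and the sample input are assumptions made for illustration only; the two patterns are the ones used above.

```csharp
using System.Text.RegularExpressions;

// StringCleaner is a hypothetical wrapper class, not part of the original post.
public static class StringCleaner
{
    public static string RemoveHTMLandWildChar(string input)
    {
        // Remove HTML tags: "<", then as few characters as possible
        // (including newlines), then ">".
        input = Regex.Replace(input, @"<(.|\n)*?>", string.Empty);

        // Remove bracketed tokens such as [img1].
        input = Regex.Replace(input, @"\[\w+]", string.Empty);
        return input;
    }
}
```

Calling StringCleaner.RemoveHTMLandWildChar("<p>Hello <b>World</b> [img1]!</p>") returns "Hello World !". Note that only the tags and the bracketed token are removed; the surrounding spaces stay, so you may want to trim or collapse whitespace afterwards.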

Description: This is what \[\w+] means:
  • \[ - Regular expressions don't have to start with a backslash (\). We start with one here because the opening square bracket has a special meaning to the regex parser (it begins a character class), so it has to be escaped with a backslash. \[ means "match a literal opening square bracket".
  • \w - This means a word character, which is an alphanumeric character or the underscore character.
  • + - The plus sign means "find one or more". Thus \w+ means find one or more word characters. The * character, by the way, means "find zero or more".
  • ] - The closing square bracket doesn't need to be escaped here because no character class is open, so we don't need a \ before it.
Thus the whole expression, \[\w+], means "match one or more word characters that are surrounded by square brackets".
If you want this to say "match zero or more..." then change the regular expression to \[\w*].
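
The difference between + and * is easiest to see side by side. The sketch below is only an illustration: the class and method names (BracketPatterns, StripOneOrMore, StripZeroOrMore) are made up for this example, and the sample string is an assumption.

```csharp
using System.Text.RegularExpressions;

// BracketPatterns and its method names are hypothetical, used only for this comparison.
public static class BracketPatterns
{
    // \[\w+] : brackets containing one or more word characters.
    public static string StripOneOrMore(string input)
    {
        return Regex.Replace(input, @"\[\w+]", string.Empty);
    }

    // \[\w*] : brackets containing zero or more word characters,
    // so an empty pair [] is removed as well.
    public static string StripZeroOrMore(string input)
    {
        return Regex.Replace(input, @"\[\w*]", string.Empty);
    }
}
```

For the input "a[] b[tag] c[two words]", StripOneOrMore returns "a[] b c[two words]" and StripZeroOrMore returns "a b c[two words]". Neither pattern removes [two words], because a space is not a word character.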
