Regular Expressions - Get Plain Text from an HTML String (by Stripping HTML Tags)


In this article we will look at a simple and quick method to remove (strip) HTML tags from a string.


We are going to use two regular expressions. You can combine them into one if you want; I kept them separate so that each is easier to explain :).


The regular expressions are:

  1. Pattern_1 : <(.|)\b(.|\n)+?>

  2. Pattern_2 :  &(.|)\b(.|\n)+?;
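
To see what each pattern actually matches, a small helper can list the matches instead of replacing them. This is only a minimal sketch: the class name PatternDemo, the method FindAll, and the sample string in the note below are illustrative and not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // The two patterns from the list above, unchanged.
    public const string HtmlTagPattern = @"<(.|)\b(.|\n)+?>";
    public const string HtmlEncodedPattern = @"&(.|)\b(.|\n)+?;";

    // Returns every substring of input that the pattern matches, in order.
    public static string[] FindAll(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the sample string <div class="box">Hi</div> '<' &amp; done, FindAll with HtmlTagPattern should return the two tags <div class="box"> and </div>, and with HtmlEncodedPattern it should return &amp;. The quoted '<' is not matched, because \b requires a word boundary directly after the '<' or one character later.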


Now we are going to find every HTML tag and replace it with an empty string.


using System.Text.RegularExpressions;

//Pattern_1 :

const string HTML_TAG_PATTERN = @"<(.|)\b(.|\n)+?>";


//Pattern_2 :

const string HTML_ENCODED_PATTERN = @"&(.|)\b(.|\n)+?;";


string testString = @"<span id=""s_Id_1""  class=""sMain"" style=""color;red;""  >This is test of  <p class=""pMain"">Regular Expression</p>This regular expression will match all HTML tags and their attributes. This will LEAVE the content of the tags within the string

Lets check '<' and  '>' char and some more like &nbps; &lt; &gt; …many more

</span>";


public static string StripHTML (string inputString)

{

  string Output_1 =Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);

  return Output_1 ;

}


StripHTML(testString);


This finds every HTML tag and replaces it with an empty string. In the test string, Pattern_1 matches the opening <span ...> and <p class="pMain"> tags and the closing </p> and </span> tags:


Match Pattern_1 : <span id="s_Id_1"  class="sMain" style="color;red;"  >This is test of  <p class="pMain">Regular Expression</p>This regular expression will match all HTML tags and their attributes. This will LEAVE the content of the tags within the string
Lets check '<' and  '>' char and some more like &nbps; &lt; &gt; ...many more
</span>


As a result we get Output_1 as follows. Because each tag is replaced with an empty string, the text on either side of a tag is joined directly:

Output_1 : This is test of  Regular ExpressionThis regular expression will match all HTML tags and their attributes. This will LEAVE the content of the tags within the string
Lets check '<' and  '>' char and some more like &nbps; &lt; &gt; …many more
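
One case worth testing before relying on Pattern_1 is a single-letter tag with no attributes, such as <b> or <p>. There, (.|) consumes the tag name, \b matches right before the '>', and (.|\n)+? must still consume at least one character, so it swallows the '>' and runs on to the next '>'. The sketch below shows this, together with one possible tweak (changing +? to *?). The tweaked pattern and the names TagPatternCheck, OriginalPattern, RelaxedPattern, and Strip are assumptions made for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternCheck
{
    // Pattern_1 exactly as given in the article.
    public const string OriginalPattern = @"<(.|)\b(.|\n)+?>";

    // Hypothetical variant: *? lets the lazy group match zero characters,
    // so a one-letter tag such as <b> can end at its own '>'.
    public const string RelaxedPattern = @"<(.|)\b(.|\n)*?>";

    // Removes every match of the given pattern from the input.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, string.Empty);
    }
}
```

With the input <b>bold</b> text, Strip with OriginalPattern returns " text" (the word "bold" is lost along with the tags), while Strip with RelaxedPattern returns "bold text". Multi-letter tags and tags with attributes, like those in the test string, are matched the same way by both patterns.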


Now that we have successfully stripped the HTML tags, let's add Pattern_2 to remove HTML-encoded entities such as &lt; and &gt;.


public static string StripHTML (string inputString)

{

  string Output_1 =Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);

  string Output_2 =Regex.Replace(Output_1 , HTML_ENCODED_PATTERN , string.Empty);


  return Output_2;

}


Now when we execute StripHTML(testString), Pattern_2 matches the encoded entities &nbps; &lt; and &gt; in Output_1 and removes them, resulting in Output_2:


Match Pattern_2 - Output_1 : This is test of  Regular ExpressionThis regular expression will match all HTML tags and their attributes. This will LEAVE the content of the tags within the string
Lets check '<' and  '>' char and some more like &nbps; &lt; &gt; …many more


Output_2 : This is test of  Regular ExpressionThis regular expression will match all HTML tags and their attributes. This will LEAVE the content of the tags within the string
Lets check '<' and  '>' char and some more like    …many more
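
As mentioned at the start, the two patterns can be combined into one. Here is a minimal sketch, assuming the two patterns are simply joined with the alternation operator |. The names CombinedStripper, CombinedPattern, and StripHtmlCombined are illustrative and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedStripper
{
    // Pattern_1 and Pattern_2 from the article joined with '|' (alternation).
    public const string CombinedPattern =
        @"<(.|)\b(.|\n)+?>|&(.|)\b(.|\n)+?;";

    // Removes tags and encoded entities in a single pass over the input.
    public static string StripHtmlCombined(string inputString)
    {
        return Regex.Replace(inputString, CombinedPattern, string.Empty);
    }
}
```

For example, StripHtmlCombined applied to <p class="x">1 &lt; 2</p> returns "1  2", with two spaces left where the entity stood between spaces. A single pass is not guaranteed to give the same result as the two-step version on every input (for instance, when removing a tag joins text into something that looks like an entity), so it is worth comparing both on your own data.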


Hope this works for you! Have fun :)