Tuesday, 16 July 2013

How To Extract HTML Links With Regular Expression

In this tutorial, we will show you how to extract hyperlinks from an HTML page. For example, to get the link from the following content:
this is text1 <a href='' target='_blank'>hello</a> this is text2...
  1. First, get the “value” of the a tag – Result : a href='' target='_blank'
  2. Then, get the “link” from the value extracted above – Result :

1. Regular Expression Pattern

Extract A tag Regular Expression Pattern

(?i)		#all checks are case insensitive (an inline flag, not a capturing group)
<a              #start with "<a"
  (		#  start of group #1
    [^>]+	#     anything except ">", at least one character
   )		#  end of group #1
  >		#     followed by ">"
    (.+?)	#	group #2, match anything (as few characters as possible)
         </a>	#	  end with "</a>"

Extract Link From A tag Regular Expression Pattern

\s*			   #can start with whitespace
  (?i)			   # all checks are case insensitive
     href		   #  followed by the word "href"
        \s*=\s*		   #   allows spaces on either side of the equal sign
              (		   #    start of group #1
               "([^"]*")   #      a string enclosed in double quotes - "string"
               |	   #	  ..or
               '[^']*'	   #      a string enclosed in single quotes - 'string'
               |           #	  ..or
               ([^'">]+)   #      an unquoted string without single quotes, double quotes or ">"
	      )		   #    end of group #1

2. Java Link Extractor Example

Here’s a simple Java link extractor example. It uses the 1st pattern to extract the a tag value, then applies the 2nd pattern to that value to extract the link.
package com.mkyong.crawler.core;

import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLLinkExtractor {

	private Pattern patternTag, patternLink;
	private Matcher matcherTag, matcherLink;

	private static final String HTML_A_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
	private static final String HTML_A_HREF_TAG_PATTERN =
		"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">]+))";

	public HTMLLinkExtractor() {
		patternTag = Pattern.compile(HTML_A_TAG_PATTERN);
		patternLink = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
	}

	/**
	 * Extract links from html with regular expression
	 *
	 * @param html
	 *            html content to scan
	 * @return Vector links and link text
	 */
	public Vector<HtmlLink> grabHTMLLinks(final String html) {
		Vector<HtmlLink> result = new Vector<HtmlLink>();
		matcherTag = patternTag.matcher(html);
		while (matcherTag.find()) {
			String href = matcherTag.group(1); // href
			String linkText = matcherTag.group(2); // link text
			matcherLink = patternLink.matcher(href);
			while (matcherLink.find()) {
				String link = matcherLink.group(1); // link
				HtmlLink obj = new HtmlLink();
				obj.setLink(link);
				obj.setLinkText(linkText);
				result.add(obj);
			}
		}
		return result;
	}

	class HtmlLink {
		String link;
		String linkText;

		@Override
		public String toString() {
			return new StringBuffer("Link : ").append(this.link)
			.append(" Link Text : ").append(this.linkText).toString();
		}

		public String getLink() {
			return link;
		}

		public void setLink(String link) {
			this.link = replaceInvalidChar(link);
		}

		public String getLinkText() {
			return linkText;
		}

		public void setLinkText(String linkText) {
			this.linkText = linkText;
		}

		// strip the surrounding quotes captured by the href pattern
		private String replaceInvalidChar(String link){
			link = link.replaceAll("'", "");
			link = link.replaceAll("\"", "");
			return link;
		}
	}
}

3. Unit Test

Unit test with TestNG. Simulate the HTML content via @DataProvider.
package com.mkyong.crawler.core;

import java.util.Vector;
import org.testng.Assert;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import com.mkyong.crawler.core.HTMLLinkExtractor.HtmlLink;

/**
 * HTML link extractor Testing
 *
 * @author mkyong
 */
public class TestHTMLLinkExtractor {

	private HTMLLinkExtractor htmlLinkExtractor;
	String TEST_LINK = "";

	@BeforeClass
	public void initData() {
		htmlLinkExtractor = new HTMLLinkExtractor();
	}

	@DataProvider
	public Object[][] HTMLContentProvider() {
	  return new Object[][] {
	    new Object[] { "abc hahaha <a href='" + TEST_LINK + "'>google</a>" },
	    new Object[] { "abc hahaha <a HREF='" + TEST_LINK + "'>google</a>" },
	    new Object[] { "abc hahaha <A HREF='" + TEST_LINK + "'>google</A> , "
		+ "abc hahaha <A HREF='" + TEST_LINK + "' target='_blank'>google</A>" },
	    new Object[] { "abc hahaha <A HREF='" + TEST_LINK + "' target='_blank'>google</A>" },
	    new Object[] { "abc hahaha <A target='_blank' HREF='" + TEST_LINK + "'>google</A>" },
	    new Object[] { "abc hahaha <A target='_blank' HREF=\"" + TEST_LINK + "\">google</A>" },
	    new Object[] { "abc hahaha <a HREF=" + TEST_LINK + ">google</a>" }, };
	}

	@Test(dataProvider = "HTMLContentProvider")
	public void ValidHTMLLinkTest(String html) {
		Vector<HtmlLink> links = htmlLinkExtractor.grabHTMLLinks(html);
		// there must be at least one link
		Assert.assertTrue(links.size() != 0);
		for (int i = 0; i < links.size(); i++) {
			HtmlLink htmlLinks = links.get(i);
			Assert.assertEquals(htmlLinks.getLink(), TEST_LINK);
		}
	}
}

[TestNG] Running:
PASSED: ValidHTMLLinkTest("abc hahaha <a href=''>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF=''>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF=''>google</A> , abc hahaha <A HREF='' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF='' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A target='_blank' HREF=''>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A target='_blank' HREF="">google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF=>google</a>")
Top 10 Simple Java Regular Expression

Regular expressions are an art of programming: they are hard to debug, learn and understand, but their powerful features still attract many developers. Let’s explore the following 10 practical regular expressions ~ enjoy :)

1. Username Regular Expression Pattern

^                    # Start of the line
  [a-z0-9_-]	     # Match characters and symbols in the list, a-z, 0-9 , underscore , hyphen
             {3,15}  # Length at least 3 characters and maximum length of 15 
$                    # End of the line
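
As a rough sketch, here is how the username pattern above might be used from Java. The class and method names (UsernameValidator, isValid) are made up for illustration and are not part of the original tutorial.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class UsernameValidator {

    // Same pattern as the breakdown above: 3 to 15 characters
    // taken from a-z, 0-9, underscore or hyphen.
    private static final Pattern USERNAME = Pattern.compile("^[a-z0-9_-]{3,15}$");

    static boolean isValid(String username) {
        return USERNAME.matcher(username).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("mkyong_34")); // true
        System.out.println(isValid("mk"));        // false: shorter than 3 characters
        System.out.println(isValid("Mkyong"));    // false: uppercase is not in the list
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call.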

2. Password Regular Expression Pattern

(			# Start of group
  (?=.*\d)		#   must contain at least one digit from 0-9
  (?=.*[a-z])		#   must contain at least one lowercase character
  (?=.*[A-Z])		#   must contain at least one uppercase character
  (?=.*[@#$%])		#   must contain at least one special symbol from the list "@#$%"
              .		#     match anything, given the previous conditions hold
                {6,20}	#        length at least 6 characters and maximum of 20	
)			# End of group
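
The sketch below shows one way the password pattern could be applied in Java. PasswordValidator and isValid are hypothetical names chosen for this example. Note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class PasswordValidator {

    // Four lookaheads (digit, lowercase, uppercase, one of @#$%)
    // followed by any 6 to 20 characters.
    private static final Pattern PASSWORD = Pattern.compile(
            "((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%]).{6,20})");

    static boolean isValid(String password) {
        return PASSWORD.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("mkyong1A@")); // true
        System.out.println(isValid("mkyong12@")); // false: no uppercase character
        System.out.println(isValid("mY1A@"));     // false: only 5 characters
    }
}
```

Each lookahead checks one condition without consuming input, so the order of the four conditions does not matter.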

3. Hexadecimal Color Code Regular Expression Pattern

^		 #start of the line
 #		 #  must contain a "#" symbol
 (		 #  start of group #1
  [A-Fa-f0-9]{6} #    any characters in the list, exactly 6 of them
  |		 #    ..or
  [A-Fa-f0-9]{3} #    any characters in the list, exactly 3 of them
 )		 #  end of group #1 
$		 #end of the line
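
Here is a minimal Java sketch of the hexadecimal color code check. The names HexColorValidator and isValid are assumptions made for this example only.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class HexColorValidator {

    // "#" followed by exactly 6 or exactly 3 hexadecimal digits.
    private static final Pattern HEX_COLOR =
            Pattern.compile("^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$");

    static boolean isValid(String color) {
        return HEX_COLOR.matcher(color).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("#1f1f1F")); // true
        System.out.println(isValid("#AFA"));    // true
        System.out.println(isValid("123456"));  // false: missing "#"
        System.out.println(isValid("#afafah")); // false: "h" is not a hex digit
    }
}
```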

4. Email Regular Expression Pattern

^			#start of the line
  [_A-Za-z0-9-]+	#  must start with characters in the bracket [ ], one or more (+)
  (			#  start of group #1
    \.[_A-Za-z0-9-]+	#     followed by a dot "." and characters in the bracket [ ], one or more (+)
  )*			#  end of group #1, this group is optional (*)
    @			#     must contain a "@" symbol
     [A-Za-z0-9]+       #        followed by characters in the bracket [ ], one or more (+)
      (			#	   start of group #2 - first level TLD checking
       \.[A-Za-z0-9]+  #	     followed by a dot "." and characters in the bracket [ ], one or more (+)
      )*		#	   end of group #2, this group is optional (*)
      (			#	   start of group #3 - second level TLD checking
       \.[A-Za-z]{2,}  #	     followed by a dot "." and letters in the bracket [ ], with minimum length of 2
      )			#	   end of group #3
$			#end of the line
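
The email pattern can be tried out with a small Java sketch like the one below. EmailValidator and isValid are hypothetical names. The pattern is the one from the breakdown above and is a simple format check, not a full validation of every legal email address.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EmailValidator {

    // Local part, "@", domain labels, then a final label of 2 or more letters.
    private static final Pattern EMAIL = Pattern.compile(
            "^[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@"
            + "[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    static boolean isValid(String email) {
        return EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("mkyong@yahoo.com"));     // true
        System.out.println(isValid("mkyong-100@yahoo.com")); // true
        System.out.println(isValid("mkyong"));               // false: no "@"
        System.out.println(isValid("mkyong@.com.my"));       // false: domain starts with a dot
    }
}
```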

5. Image File Extension Regular Expression Pattern

(			#Start of the group #1
 [^\s]+			#  must contains one or more anything (except white space)
       (		#    start of the group #2
         \.		#	follow by a dot "."
         (?i)		#	ignore the case sensitive checking
             (		#	  start of the group #3
              jpg	#	    contains characters "jpg"
              |		#	    ..or
              png	#	    contains characters "png"
              |		#	    ..or
              gif	#	    contains characters "gif"
              |		#	    ..or
              bmp	#	    contains characters "bmp"
             )		#	  end of the group #3
       )		#     end of the group #2	
  $			#  end of the string
)			#end of the group #1
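
A short Java sketch of the image file extension check follows. ImageFileValidator and isImageFile are names invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ImageFileValidator {

    // One or more non-whitespace characters, a dot, then
    // jpg, png, gif or bmp in any letter case.
    private static final Pattern IMAGE_FILE =
            Pattern.compile("([^\\s]+(\\.(?i)(jpg|png|gif|bmp))$)");

    static boolean isImageFile(String fileName) {
        return IMAGE_FILE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isImageFile("a.jpg"));        // true
        System.out.println(isImageFile("a.JPG"));        // true: extension is case insensitive
        System.out.println(isImageFile("a.txt"));        // false: not an image extension
        System.out.println(isImageFile("my photo.png")); // false: contains white space
    }
}
```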

6. IP Address Regular Expression Pattern

^		#start of the line
 (		#  start of group #1
   [01]?\d\d? #    Can be one or two digits. If three digits appear, it must start with either 0 or 1
		#    e.g ([0-9], [0-9][0-9],[0-1][0-9][0-9])
    |		#    ...or
   2[0-4]\d	#    start with 2, followed by 0-4 and end with any digit (2[0-4][0-9]) 
    |           #    ...or
   25[0-5]      #    start with 25, followed by 0-5 (25[0-5]) 
 )		#  end of group #1
  \.            #  followed by a dot "."
....            # repeat the group and dot 2 more times, then the group once more without a trailing dot
$		#end of the line
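
Since the breakdown above abbreviates the repeated part, the Java sketch below spells out the whole pattern by joining four copies of the same group with dots. IpAddressValidator, OCTET and isValid are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class IpAddressValidator {

    // One number from 0 to 255 (values below 200 may be zero padded).
    private static final String OCTET = "([01]?\\d\\d?|2[0-4]\\d|25[0-5])";

    // Four such numbers separated by three dots.
    private static final Pattern IP_ADDRESS = Pattern.compile(
            "^" + OCTET + "\\." + OCTET + "\\." + OCTET + "\\." + OCTET + "$");

    static boolean isValid(String ip) {
        return IP_ADDRESS.matcher(ip).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("192.168.1.1")); // true
        System.out.println(isValid("256.1.1.1"));   // false: 256 is out of range
        System.out.println(isValid("10.10.10"));    // false: only three numbers
    }
}
```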

7. Time Format Regular Expression Pattern

Time in 12-Hour Format Regular Expression Pattern

(				#start of group #1
 1[012]				#  start with 10, 11, 12
 |				#  or
 [1-9]				#  start with 1,2,...9
)				#end of group #1
 :				#    followed by a colon (:)
  [0-5][0-9]			#   followed by 0..5 and 0..9, which means 00 to 59
            (\s)?		#        followed by a white space (optional)
                  (?i)		#          next checking is case insensitive
                      (am|pm)	#            followed by am or pm

Time in 24-Hour Format Regular Expression Pattern

(				#start of group #1
 [01]?[0-9]			#  start with 0-9, 00-09 or 10-19
 |				#  or
 2[0-3]				#  start with 20-23
)				#end of group #1
 :				#  followed by a colon (:)
  [0-5][0-9]			#    followed by 0..5 and 0..9, which means 00 to 59
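
Both time patterns can be exercised with a small Java sketch such as this one. TimeValidator, isValid12Hour and isValid24Hour are hypothetical names introduced for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class TimeValidator {

    // 1-12, colon, 00-59, optional white space, am or pm in any case.
    private static final Pattern TIME_12H =
            Pattern.compile("(1[012]|[1-9]):[0-5][0-9](\\s)?(?i)(am|pm)");

    // 0-23 (a leading zero is allowed below 10), colon, 00-59.
    private static final Pattern TIME_24H =
            Pattern.compile("([01]?[0-9]|2[0-3]):[0-5][0-9]");

    static boolean isValid12Hour(String time) {
        return TIME_12H.matcher(time).matches();
    }

    static boolean isValid24Hour(String time) {
        return TIME_24H.matcher(time).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid12Hour("1:00am"));   // true
        System.out.println(isValid12Hour("12:59 PM")); // true
        System.out.println(isValid12Hour("13:00pm"));  // false: hour above 12
        System.out.println(isValid24Hour("23:59"));    // true
        System.out.println(isValid24Hour("24:00"));    // false: hour above 23
    }
}
```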

8. Date Format (dd/mm/yyyy) Regular Expression Pattern

(			#start of group #1
 0?[1-9]		#  01-09 or 1-9
 |                  	#  ..or
 [12][0-9]		#  10-19 or 20-29
 |			#  ..or
 3[01]			#  30, 31
) 			#end of group #1
  /			#  follow by a "/"
   (			#    start of group #2
    0?[1-9]		#	01-09 or 1-9
    |			#	..or
    1[012]		#	10,11,12
    )			#    end of group #2
     /			#	follow by a "/"
      (			#	  start of group #3
       (19|20)\d\d	#	    19[0-9][0-9] or 20[0-9][0-9]
       )		#	  end of group #3
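
The date pattern only checks the format, so it accepts values such as 31/02/2012. The Java sketch below uses the three groups to add a calendar check with java.time.YearMonth. That extra check is an addition for illustration, not part of the pattern, and the names DateValidator, matchesFormat and isRealDate are hypothetical.

```java
import java.time.YearMonth;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DateValidator {

    // group 1 = day, group 2 = month, group 3 = four digit year
    private static final Pattern DATE = Pattern.compile(
            "(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

    // Format check only, exactly what the pattern above does.
    static boolean matchesFormat(String date) {
        return DATE.matcher(date).matches();
    }

    // Format check plus a days-in-month check (leap years included).
    static boolean isRealDate(String date) {
        Matcher m = DATE.matcher(date);
        if (!m.matches()) {
            return false;
        }
        int day = Integer.parseInt(m.group(1));
        int month = Integer.parseInt(m.group(2));
        int year = Integer.parseInt(m.group(3));
        return day <= YearMonth.of(year, month).lengthOfMonth();
    }

    public static void main(String[] args) {
        System.out.println(matchesFormat("31/02/2012")); // true: format is fine
        System.out.println(isRealDate("31/02/2012"));    // false: February has no 31st
        System.out.println(isRealDate("29/02/2012"));    // true: 2012 is a leap year
    }
}
```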

9. HTML tag Regular Expression Pattern

<	  	#start with opening tag "<"
 (		#   start of group #1
   "[^"]*"	#	a string enclosed in double quotes - "string"
   |		#	..or
   '[^']*'	#	a string enclosed in single quotes - 'string'
   |		#	..or
   [^'">]	#	any single character except a single quote, double quote or ">"
 )		#   end of group #1
 *		# 0 or more
>		#end with closing tag ">"

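To see why the quoted alternatives matter, here is a minimal Java sketch of the HTML tag pattern. HtmlTagValidator and isTag are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class HtmlTagValidator {

    // "<", then any mix of quoted strings and other characters, then ">".
    // A ">" inside a quoted string does not end the tag.
    private static final Pattern HTML_TAG =
            Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");

    static boolean isTag(String text) {
        return HTML_TAG.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTag("<b>"));               // true
        System.out.println(isTag("<input value='>'>")); // true: the quoted ">" is allowed
        System.out.println(isTag("<input value='>"));   // false: the quote is never closed
    }
}
```
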
10. HTML links Regular Expression Pattern

HTML A tag Regular Expression Pattern

(?i)		#all checks are case insensitive (an inline flag, not a capturing group)
<a              #start with "<a"
  (		#  start of group #1
    [^>]+	#     anything except ">", at least one character
   )		#  end of group #1
  >		#     followed by ">"
    (.+?)	#	group #2, match anything (as few characters as possible)
         </a>	#	  end with "</a>"

Extract HTML link Regular Expression Pattern

\s*			   #can start with whitespace
  (?i)			   # all checks are case insensitive
     href		   #  followed by the word "href"
        \s*=\s*		   #   allows spaces on either side of the equal sign
              (		   #    start of group #1
               "([^"]*")   #      a string enclosed in double quotes - "string"
               |	   #	  ..or
               '[^']*'	   #      a string enclosed in single quotes - 'string'
               |           #	  ..or
               ([^'">]+)   #      an unquoted string without single quotes, double quotes or ">"
	      )		   #    end of group #1

Monday, 26 November 2012

Tutorial on Lists in HTML

Hyper Text Markup Language

We commonly use lists to build navigation bars and simple drop-down menus, but leave that aside for now. I am going to give a brief description of lists.


A list is a set of list items.
There are 4 types of lists.

Ordered Lists

An ordered list is represented in HTML by the <ol> element and closed with </ol>. It contains list items, each represented by <li> and closed with </li>.

Unordered Lists

An unordered list is represented in HTML by the <ul> element and closed with </ul>. It contains list items, each represented by <li> and closed with </li>.


Above are the list types for ordered lists.

<ol type="1">

  1. siddhu

  2. pavan

  3. chakitha

You can replace the type with any one of the above. In the same way, unordered lists also have 3 list types: 1) disc


A definition list is useful for defining terms: it provides a definition term (dt) and a definition description (dd).

