Showing posts with label Html. Show all posts
Showing posts with label Html. Show all posts

Tuesday, 16 July 2013

How To Extract HTML Links With Regular Expression

// siddhu vydyabhushana // 6 comments
In this tutorial, we will show you how to extract hyperlink from a HTML page. For example, to get the link from following content :
this is text1 <a href='mkyong.com' target='_blank'>hello</a> this is text2...
  1. First get the “value” from a tag – Result : a href='mkyong.com' target='_blank'
  2. Later get the “link” from above extracted value – Result : mkyong.com

1. Regular Expression Pattern

Extract A tag Regular Expression Pattern
(?i)<a([^>]+)>(.+?)</a>
Extract Link From A tag Regular Expression Pattern
\s*(?i)href\s*=\s*(\"([^"]*\")|'[^']*'|([^'">\s]+));
Description
(		#start of group #1
 ?i		#  all checking are case insensive
)		#end of group #1
<a              #start with "<a"
  (		#  start of group #2
    [^>]+	#     anything except (">"), at least one character
   )		#  end of group #2
  >		#     follow by ">"
    (.+?)	#	match anything 
         </a>	#	  end with "</a>
\s*			   #can start with whitespace
  (?i)			   # all checking are case insensive
     href		   #  follow by "href" word
        \s*=\s*		   #   allows spaces on either side of the equal sign,
              (		   #    start of group #1
               "([^"]*")   #      allow string with double quotes enclosed - "string"
               |	   #	  ..or
               '[^']*'	   #        allow string with single quotes enclosed - 'string'
               |           #	  ..or
               ([^'">]+)   #      can't contains one single quotes, double quotes ">"
	      )		   #    end of group #1

2. Java Link Extractor Example

Here’s a simple Java Link extractor example, to extract the a tag value from 1st pattern, and use 2nd pattern to extract the link from 1st pattern.
HTMLLinkExtractor.java
package com.mkyong.crawler.core;
 
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class HTMLLinkExtractor {
 
	private Pattern patternTag, patternLink;
	private Matcher matcherTag, matcherLink;
 
	private static final String HTML_A_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
	private static final String HTML_A_HREF_TAG_PATTERN = 
		"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
 
 
	public HTMLLinkExtractor() {
		patternTag = Pattern.compile(HTML_A_TAG_PATTERN);
		patternLink = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
	}
 
	/**
	 * Validate html with regular expression
	 * 
	 * @param html
	 *            html content for validation
	 * @return Vector links and link text
	 */
	public Vector<HtmlLink> grabHTMLLinks(final String html) {
 
		Vector<HtmlLink> result = new Vector<HtmlLink>();
 
		matcherTag = patternTag.matcher(html);
 
		while (matcherTag.find()) {
 
			String href = matcherTag.group(1); // href
			String linkText = matcherTag.group(2); // link text
 
			matcherLink = patternLink.matcher(href);
 
			while (matcherLink.find()) {
 
				String link = matcherLink.group(1); // link
				HtmlLink obj = new HtmlLink();
				obj.setLink(link);
				obj.setLinkText(linkText);
 
				result.add(obj);
 
			}
 
		}
 
		return result;
 
	}
 
	class HtmlLink {
 
		String link;
		String linkText;
 
		HtmlLink(){};
 
		@Override
		public String toString() {
			return new StringBuffer("Link : ").append(this.link)
			.append(" Link Text : ").append(this.linkText).toString();
		}
 
		public String getLink() {
			return link;
		}
 
		public void setLink(String link) {
			this.link = replaceInvalidChar(link);
		}
 
		public String getLinkText() {
			return linkText;
		}
 
		public void setLinkText(String linkText) {
			this.linkText = linkText;
		}
 
		private String replaceInvalidChar(String link){
			link = link.replaceAll("'", "");
			link = link.replaceAll("\"", "");
			return link;
		}
 
	}
}

3. Unit Test

Unit test with TestNG. Simulate the HTML content via @DataProvider.
TestHTMLLinkExtractor.java
package com.mkyong.crawler.core;
 
import java.util.Vector;
 
import org.testng.Assert;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
 
import com.mkyong.crawler.core.HTMLLinkExtractor.HtmlLink;
 
/**
 * HTML link extrator Testing
 * 
 * @author mkyong
 * 
 */
public class TestHTMLLinkExtractor {
 
	private HTMLLinkExtractor htmlLinkExtractor;
	String TEST_LINK = "http://www.google.com";
 
	@BeforeClass
	public void initData() {
		htmlLinkExtractor = new HTMLLinkExtractor();
	}
 
	@DataProvider
	public Object[][] HTMLContentProvider() {
	  return new Object[][] {
	    new Object[] { "abc hahaha <a href='" + TEST_LINK + "'>google</a>" },
	    new Object[] { "abc hahaha <a HREF='" + TEST_LINK + "'>google</a>" },
 
	    new Object[] { "abc hahaha <A HREF='" + TEST_LINK + "'>google</A> , "
		+ "abc hahaha <A HREF='" + TEST_LINK + "' target='_blank'>google</A>" },
 
	    new Object[] { "abc hahaha <A HREF='" + TEST_LINK + "' target='_blank'>google</A>" },
	    new Object[] { "abc hahaha <A target='_blank' HREF='" + TEST_LINK + "'>google</A>" },
	    new Object[] { "abc hahaha <A target='_blank' HREF=\"" + TEST_LINK + "\">google</A>" },
	    new Object[] { "abc hahaha <a HREF=" + TEST_LINK + ">google</a>" }, };
	}
 
	@Test(dataProvider = "HTMLContentProvider")
	public void ValidHTMLLinkTest(String html) {
 
		Vector<HtmlLink> links = htmlLinkExtractor.grabHTMLLinks(html);
 
		//there must have something
		Assert.assertTrue(links.size() != 0);
 
		for (int i = 0; i < links.size(); i++) {
			HtmlLink htmlLinks = links.get(i);
			//System.out.println(htmlLinks);
			Assert.assertEquals(htmlLinks.getLink(), TEST_LINK);
		}
 
	}
}
Result
[TestNG] Running:
  /private/var/folders/w8/jxyz5pf51lz7nmqm_hv5z5br0000gn/T/testng-eclipse--530204890/testng-customsuite.xml
 
PASSED: ValidHTMLLinkTest("abc hahaha <a href='http://www.google.com'>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF='http://www.google.com'>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF='http://www.google.com'>google</A> , abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A target='_blank' HREF='http://www.google.com'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A target='_blank' HREF="http://www.google.com">google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF=http://www.google.com>google</a>")
Read More

Top 10 Simple Java Regular Expression

// siddhu vydyabhushana // 1 comment
Regular expression is an art of the programing, it’s hard to debug , learn and understand, but the powerful features are still attract many developers to code regular expression. Let’s explore the following 10 practical regular expression ~ enjoy :)

1. Username Regular Expression Pattern

 ^[a-z0-9_-]{3,15}$
^                    # Start of the line
  [a-z0-9_-]	     # Match characters and symbols in the list, a-z, 0-9 , underscore , hyphen
             {3,15}  # Length at least 3 characters and maximum length of 15 
$                    # End of the line

2. Password Regular Expression Pattern

((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%]).{6,20})
(			# Start of group
  (?=.*\d)		#   must contains one digit from 0-9
  (?=.*[a-z])		#   must contains one lowercase characters
  (?=.*[A-Z])		#   must contains one uppercase characters
  (?=.*[@#$%])		#   must contains one special symbols in the list "@#$%"
              .		#     match anything with previous condition checking
                {6,20}	#        length at least 6 characters and maximum of 20	
)			# End of group

3. Hexadecimal Color Code Regular Expression Pattern

^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$
^		 #start of the line
 #		 #  must constains a "#" symbols
 (		 #  start of group #1
  [A-Fa-f0-9]{6} #    any strings in the list, with length of 6
  |		 #    ..or
  [A-Fa-f0-9]{3} #    any strings in the list, with length of 3
 )		 #  end of group #1 
$		 #end of the line

4. Email Regular Expression Pattern

^[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+
(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$
^			#start of the line
  [_A-Za-z0-9-]+	#  must start with string in the bracket [ ], must contains one or more (+)
  (			#  start of group #1
    \\.[_A-Za-z0-9-]+	#     follow by a dot "." and string in the bracket [ ], must contains one or more (+)
  )*			#  end of group #1, this group is optional (*)
    @			#     must contains a "@" symbol
     [A-Za-z0-9]+       #        follow by string in the bracket [ ], must contains one or more (+)
      (			#	   start of group #2 - first level TLD checking
       \\.[A-Za-z0-9]+  #	     follow by a dot "." and string in the bracket [ ], must contains one or more (+)
      )*		#	   end of group #2, this group is optional (*)
      (			#	   start of group #3 - second level TLD checking
       \\.[A-Za-z]{2,}  #	     follow by a dot "." and string in the bracket [ ], with minimum length of 2
      )			#	   end of group #3
$			#end of the line

5. Image File Extension Regular Expression Pattern

([^\s]+(\.(?i)(jpg|png|gif|bmp))$)
(			#Start of the group #1
 [^\s]+			#  must contains one or more anything (except white space)
       (		#    start of the group #2
         \.		#	follow by a dot "."
         (?i)		#	ignore the case sensitive checking
             (		#	  start of the group #3
              jpg	#	    contains characters "jpg"
              |		#	    ..or
              png	#	    contains characters "png"
              |		#	    ..or
              gif	#	    contains characters "gif"
              |		#	    ..or
              bmp	#	    contains characters "bmp"
             )		#	  end of the group #3
       )		#     end of the group #2	
  $			#  end of the string
)			#end of the group #1

6. IP Address Regular Expression Pattern

^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.
([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])$
^		#start of the line
 (		#  start of group #1
   [01]?\\d\\d? #    Can be one or two digits. If three digits appear, it must start either 0 or 1
		#    e.g ([0-9], [0-9][0-9],[0-1][0-9][0-9])
    |		#    ...or
   2[0-4]\\d	#    start with 2, follow by 0-4 and end with any digit (2[0-4][0-9]) 
    |           #    ...or
   25[0-5]      #    start with 2, follow by 5 and end with 0-5 (25[0-5]) 
 )		#  end of group #2
  \.            #  follow by a dot "."
....            # repeat with 3 time (3x)
$		#end of the line

7. Time Format Regular Expression Pattern

Time in 12-Hour Format Regular Expression Pattern

(1[012]|[1-9]):[0-5][0-9](\\s)?(?i)(am|pm)
(				#start of group #1
 1[012]				#  start with 10, 11, 12
 |				#  or
 [1-9]				#  start with 1,2,...9
)				#end of group #1
 :				#    follow by a semi colon (:)
  [0-5][0-9]			#   follow by 0..5 and 0..9, which means 00 to 59
            (\\s)?		#        follow by a white space (optional)
                  (?i)		#          next checking is case insensitive
                      (am|pm)	#            follow by am or pm

Time in 24-Hour Format Regular Expression Pattern

([01]?[0-9]|2[0-3]):[0-5][0-9]
(				#start of group #1
 [01]?[0-9]			#  start with 0-9,1-9,00-09,10-19
 |				#  or
 2[0-3]				#  start with 20-23
)				#end of group #1
 :				#  follow by a semi colon (:)
  [0-5][0-9]			#    follow by 0..5 and 0..9, which means 00 to 59

8. Date Format (dd/mm/yyyy) Regular Expression Pattern

(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)
(			#start of group #1
 0?[1-9]		#  01-09 or 1-9
 |                  	#  ..or
 [12][0-9]		#  10-19 or 20-29
 |			#  ..or
 3[01]			#  30, 31
) 			#end of group #1
  /			#  follow by a "/"
   (			#    start of group #2
    0?[1-9]		#	01-09 or 1-9
    |			#	..or
    1[012]		#	10,11,12
    )			#    end of group #2
     /			#	follow by a "/"
      (			#	  start of group #3
       (19|20)\\d\\d	#	    19[0-9][0-9] or 20[0-9][0-9]
       )		#	  end of group #3

9. HTML tag Regular Expression Pattern

<("[^"]*"|'[^']*'|[^'">])*>
<	  	#start with opening tag "<"
 (		#   start of group #1
   "[^"]*"	#	only two double quotes are allow - "string"
   |		#	..or
   '[^']*'	#	only two single quotes are allow - 'string'
   |		#	..or
   [^'">]	#	cant contains one single quotes, double quotes and ">"
 )		#   end of group #1
 *		# 0 or more
>		#end with closing tag ">"

10. HTML links Regular Expression Pattern

HTML A tag Regular Expression Pattern

(?i)<a([^>]+)>(.+?)</a>
(		#start of group #1
 ?i		#  all checking are case insensive
)		#end of group #1
<a              #start with "<a"
  (		#  start of group #2
    [^>]+	#     anything except (">"), at least one character
   )		#  end of group #2
  >		#     follow by ">"
    (.+?)	#	match anything 
         </a>	#	  end with "</a>

Extract HTML link Regular Expression Pattern

\s*(?i)href\s*=\s*(\"([^"]*\")|'[^']*'|([^'">\s]+));
\s*			   #can start with whitespace
  (?i)			   # all checking are case insensive
     href		   #  follow by "href" word
        \s*=\s*		   #   allows spaces on either side of the equal sign,
              (		   #    start of group #1
               "([^"]*")   #      only two double quotes are allow - "string"
               |	   #	  ..or
               '[^']*'	   #      only two single quotes are allow - 'string'
               |           #	  ..or
               ([^'">]+)   #     cant contains one single / double quotes and ">"
	      )		   #    end of group #1

Read More

Monday, 26 November 2012

Tutorial on Lists in HTML

// siddhu vydyabhushana // Leave a Comment
HTML
Hyper Text Markup Language

Generally we are implementing lists , using lists on navigation bars , simple drop downs but leave it .Am going to give am brief description on lists.

DEFINITION

A list of items with set of list items is called lists.
there are 4 types oof lists
1)Ordered
2)Unordered
3)definition
4)nested

Ordered Lists

it will be represented in html is <ol> element and close with</ol>.It contains item lists it will be represented as<li> and closed with </li>.

Unordered Lists


it will be represented in html is <ul> element and close with</ul>.It contains item lists it will be represented as<li> and closed with </li>.




1,A,a,I,i

above are the list types in orders list


OUTPUT
 
<ol type="1">
<li>siddhu</li>
<li>pavan</li>
<li>chakitha</li>
</ol>

  1. siddhu

  2. pavan

  3. chakitha


U replace with any one of the above..same as for unordered lists also we have 3 lists types 1)disks
         2)circle
         3)square

DEFINITION LISTS

it will be use full to define definitions and there is a facility to give definition term(dt),definition desccription(dd)



Read More