<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Techspiration &#187; web</title>
	<atom:link href="http://www.blog.karthikbala.com/tag/web/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.blog.karthikbala.com</link>
	<description>Technology, Spirituality and Rationality</description>
	<lastBuildDate>Sat, 03 Apr 2010 15:21:26 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Crawling made Easy</title>
		<link>http://www.blog.karthikbala.com/crawling-made-easy/</link>
		<comments>http://www.blog.karthikbala.com/crawling-made-easy/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 17:39:43 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[crawling]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.blog.karthikbala.com/?p=5</guid>
		<description><![CDATA[In this post I will explain how to crawl a website and extract the content you need, all using Java.]]></description>
			<content:encoded><![CDATA[<div class="entry-content">
<p>In this post I will explain how to crawl a website and extract the content you need, all using Java.</p>
<p>I am working on a project that crawls different Telugu cinema websites, extracts the required content, and displays it all together on a single page, i.e. on my personal web page.</p>
<p>As part of that I crawled <a title="www.telugulo.com" href="http://www.telugulo.com/" target="_blank">www.telugulo.com</a>, from which I extracted the cinema section, so let's see how I did it. Excuse me if you don't understand Telugu; it won't be a constraint for learning to crawl.</p>
<p>When I clicked the cinema link on the homepage of www.telugulo.com, it took me to a page with the URL <strong>http://telugulo.com/news.php?section=2</strong>. I found the headlines on this page interesting and decided to extract all of them.</p>
<p>Then I wrote a program that opens the page and extracts the headlines every half hour, so that I always have the latest ones. I can then use the headlines for any purpose, such as displaying them directly on my website.</p>
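<p>The half-hourly refresh can be sketched with the JDK's scheduled executor. This is only an illustration: the class name HeadlinePoller, the method startPolling and the placeholder task are assumptions, not part of the original program.</p>

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: the names here are illustrative, not from the original program.
final class HeadlinePoller {
    // Runs the task once immediately, then again each time the given delay has
    // passed since the previous run finished. Call shutdownNow() on the result to stop.
    static ScheduledExecutorService startPolling(Runnable task, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(task, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) {
        // Placeholder task; the real one would fetch the page and extract the headlines.
        startPolling(() -> System.out.println("crawling..."), 30, TimeUnit.MINUTES);
    }
}
```

<p>If the task throws an exception, the executor stops scheduling further runs, so a real crawl task should catch and log its own errors.</p>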
<p><code>So now let's look at how Java helps in doing all this:</code></p>
<p>Note the URL you want to crawl; here it is http://telugulo.com/news.php?section=2</p>
<ol>
<li><span style="color: #333399;">String strurl = "http://telugulo.com/news.php?section=2";</span></li>
<li><span style="color: #333399;">URL url = new URL(strurl);</span></li>
<li><span style="color: #333399;">System.out.println("Received url is "+url);</span></li>
<li><span style="color: #333399;">URLConnection hpCon = url.openConnection();</span></li>
<li><span style="color: #333399;">InputStream ins = hpCon.getInputStream();</span></li>
<li><span style="color: #333399;">InputStreamReader bis = new InputStreamReader(ins);</span></li>
<li><span style="color: #333399;">BufferedReader </span><span style="color: #333399;">teluStream</span><span style="color: #333399;"> = new BufferedReader(bis);</span></li>
<li><span style="color: #333399;">System.out.println("got the stream");</span></li>
</ol>
<p>Look at the above code (it needs the URL and URLConnection classes from java.net, and InputStream, InputStreamReader and BufferedReader from java.io):</p>
<p>line 4 opens a connection to the given URL; hpCon now represents that connection,</p>
<p>line 5 gets the content of the page as a stream,</p>
<p>line 7 wraps it in a BufferedReader, which reads a character stream efficiently.</p>
<p>So now <span style="color: #333399;">teluStream.readLine()</span> is the function you need to read the content of the page line by line.</p>
<p>Once you have a line, check whether it is a headline and, if so, extract it and save it in a local file.</p>
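<p>Put together, the steps above might look like the following self-contained sketch. The class name PageReader, the countLines helper, the ten-second timeouts and the choice of UTF-8 are assumptions made for illustration; the original program may have differed.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical class: a minimal sketch of "open the URL, read it line by line".
final class PageReader {
    // Reads the stream line by line (decoded as UTF-8, an assumption) and counts the lines.
    static int countLines(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the URL used in this post; another URL can be passed as an argument.
        String strurl = args.length != 0 ? args[0] : "http://telugulo.com/news.php?section=2";
        URLConnection hpCon = URI.create(strurl).toURL().openConnection();
        hpCon.setConnectTimeout(10_000); // assumed timeouts, in milliseconds
        hpCon.setReadTimeout(10_000);
        try (InputStream ins = hpCon.getInputStream()) {
            System.out.println("lines read: " + countLines(ins));
        }
    }
}
```

<p>The try-with-resources blocks close the reader and the stream even when reading fails, which the shorter listing above leaves out.</p>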
<p>I recommend using Firefox, where you can add several plugins that make your crawling job easier.</p>
<p>Now let me tell you how I extracted the headlines:</p>
<p>The headlines turned out to be images rather than text: each headline on that page is an image, so those images are what I want to extract.</p>
<p>There are many images on the page, so to extract exactly the headlines we need to find something that distinguishes them from the other images.</p>
<p>I found that the headline images start after a particular phrase,</p>
<pre id="line342"><strong><span class="attribute-name">alt</span>=<span class="attribute-value">"taja"</span></strong></pre>
<p>which means "fresh".</p>
<p>So I just called <span style="color: #333399;">teluStream.readLine()</span> until a line contained the phrase alt="taja", and then extracted the first five images after it, which are none other than the required headlines.</p>
<p>I used a while loop like the one below to extract the headline images:</p>
<ol>
<li><span style="color: #333399;">String streamLine;</span></li>
<li><span style="color: #333399;">int i = 0;</span></li>
<li><span style="color: #333399;">String gifStr[] = new String[5];</span></li>
<li><span style="color: #333399;">while((streamLine = teluStream.readLine())!=null)</span></li>
<li><span style="color: #333399;">{</span></li>
<li><span style="color: #333399;">if(streamLine.contains(regx))</span></li>
<li><span style="color: #333399;">{</span></li>
<li><span style="color: #333399;">while((streamLine = teluStream.readLine())!= null &amp;&amp; i&lt;5)</span></li>
<li><span style="color: #333399;">{</span></li>
<li><span style="color: #333399;">if(streamLine.contains("gif"))</span></li>
<li><span style="color: #333399;">{</span></li>
<li><span style="color: #333399;">gifStr[i] = streamLine;</span></li>
<li><span style="color: #333399;">System.out.println("gif string: "+i+gifStr[i]);</span></li>
<li><span style="color: #333399;">i++;</span></li>
<li><span style="color: #333399;">}</span></li>
<li><span style="color: #333399;">}</span></li>
<li><span style="color: #333399;">break;</span></li>
<li><span style="color: #333399;">}</span></li>
<li><span style="color: #333399;">}</span></li>
</ol>
<p>The outer while loop iterates over all the lines of the page, but I only wanted the images after the phrase alt="taja", so</p>
<p>line 6 checks whether the line contains the taja phrase (regx is assumed to be a String variable holding alt="taja"); if so,</p>
<p>line 8 is another while loop, which keeps reading lines while i &lt; 5,</p>
<p>line 10 checks whether the line contains a gif (image); if so,</p>
<p>in line 12 we store that line in the array.</p>
<p>Having stored an image line, we increment i in line 14.</p>
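<p>The same scan can be written as a single loop with a flag instead of two nested loops. The sketch below is an alternative, not the original code: the class HeadlineScanner, the method gifLinesAfter and the sample page in main are made up for illustration.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: a single-loop variant of the marker scan described above.
final class HeadlineScanner {
    // Skips lines until one contains the marker, then collects up to max of the
    // following lines that contain ".gif". The marker line itself is not collected.
    static List<String> gifLinesAfter(BufferedReader reader, String marker, int max)
            throws IOException {
        List<String> found = new ArrayList<>();
        boolean seenMarker = false;
        String line;
        while (found.size() < max && (line = reader.readLine()) != null) {
            if (!seenMarker) {
                seenMarker = line.contains(marker);
            } else if (line.contains(".gif")) {
                found.add(line);
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Invented sample page, so the example runs without a network connection.
        String page = String.join("\n",
                "<img src=\"./images/logo.gif\">",
                "<img src=\"./images/taja.gif\" alt=\"taja\">",
                "<img src=\"./images/Head-one.gif\" border=0>",
                "<p>no image here</p>",
                "<img src=\"./images/Head-two.gif\" border=0>");
        BufferedReader reader = new BufferedReader(new StringReader(page));
        gifLinesAfter(reader, "alt=\"taja\"", 5).forEach(System.out::println);
    }
}
```

<p>Checking the count before calling readLine() means no extra line is consumed once enough images have been found.</p>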
<p>We are almost done: we have found the lines in the page that contain the information we want.</p>
<p>Now we have to extract the exact path of the image from each of those lines.</p>
<p>The programmer who designed the page writes only the image path in the page, and it is the browser's duty to fetch the image from that path and display it in the specified area.</p>
<p>So our duty is to get the path, open a connection to it, and read the content of the image, just as we read the content of the page at the very beginning.</p>
<p>But first let's see how to get the exact path from a line we extracted out of the big page.</p>
<p>The line we extracted looks like this:</p>
<pre id="line342"><strong>&lt;<span class="start-tag">td</span>&gt;&lt;<span class="start-tag">a</span><span class="attribute-name"> href</span>=<span class="attribute-value">"view_news.php?id=6528"</span>&gt;&lt;<span class="start-tag">img</span><span class="attribute-name"> src</span>=<span class="attribute-value">"./images/Head-chiranjeevi-jeevithaniki-pargu-lankai-enduku.gif" </span><span class="attribute-name">border</span>=<span class="attribute-value">0</span>&gt;&lt;/<span class="end-tag">a</span>&gt;&lt;/<span class="end-tag">td</span>&gt;</strong></pre>
<p>We want the img src from the above line; a simple string manipulation gives it to us. Let's see the manipulation part of the game:</p>
<ol>
<li><span style="color: #333399;"> /*</span></li>
<li><span style="color: #333399;"> * This function returns the image path from given String</span></li>
<li><span style="color: #333399;"> */</span></li>
<li><span style="color: #333399;"> public String getImageurl(String urlLine) throws Exception {</span></li>
<li><span style="color: #333399;"> String str1[], str2[] = new String[0]; // initialised so that the method compiles</span></li>
<li><span style="color: #333399;"> URL url = new URL(siteName);</span></li>
<li><span style="color: #333399;"> </span></li>
<li><span style="color: #333399;"> str1 = urlLine.split("&lt;img src=");</span></li>
<li><span style="color: #333399;"> if (str1[1].contains("\""))</span></li>
<li><span style="color: #333399;"> {</span></li>
<li><span style="color: #333399;"> str2 = str1[1].split("\"");</span></li>
<li><span style="color: #333399;"> } </span></li>
<li><span style="color: #333399;"> System.out.println(str2.length + "  " + str2[0]);</span></li>
<li><span style="color: #333399;"> int l = str2.length;</span></li>
<li><span style="color: #333399;"> String imageUrl = "http://" + "www.telugulo.com" + "/" + str2[1];</span></li>
<li><span style="color: #333399;"> System.out.println(imageUrl);</span></li>
<li><span style="color: #333399;"> return imageUrl;</span></li>
<li><span style="color: #333399;"> }</span></li>
</ol>
<p>getImageurl is the function that returns the exact path of the image.</p>
<p>Its argument, String urlLine, is the line that contains the img src.</p>
<p>When I looked at the line closely, I found that the img src sits between double quotes, but there is another pair of double quotes earlier in the line (around the href value).</p>
<p>So I used the split function to split the line into two parts, so that the section with the first pair of quotes is removed.</p>
<p>The split function splits a string into parts around the argument you pass to it (which Java treats as a regular expression) and always returns an array with the resulting pieces.</p>
<p>If you use split(" is ") on the line "karthik is hero":</p>
<p><span style="color: #333399;">String line = "karthik is hero";</span></p>
<p><span style="color: #333399;">String result[] = line.split(" is ");</span></p>
<p>then result[0] will contain "karthik" and result[1] will contain "hero". Notice that the result does not include the value you used to split.</p>
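<p>Because String.split treats its argument as a regular expression, a delimiter containing characters such as "." or "|" can behave unexpectedly. A small sketch of a literal split follows; the class SplitDemo and the helper splitLiteral are illustrative names, not part of the original program.</p>

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper showing how split behaves with a literal delimiter.
final class SplitDemo {
    // Pattern.quote makes the delimiter match literally instead of as a regex.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // The delimiter itself never appears in the result.
        System.out.println(Arrays.toString(splitLiteral("karthik is hero", " is ")));

        // A delimiter at the very start of the text produces an empty first element.
        String[] parts = splitLiteral("\"a.gif\" border=0", "\"");
        System.out.println(parts.length + " parts, first is empty: " + parts[0].isEmpty());
    }
}
```

<p>The double quote used in this post is not a special regex character, so a plain split works there too; quoting only matters for delimiters that are.</p>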
<p>So line 8 splits the given line into 2 parts, where the second part holds the img src.</p>
<p>The second part looks like this:</p>
<p><strong><strong><span class="attribute-value">"./images/Head-chiranjeevi-jeevithaniki-pargu-lankai-enduku.gif" </span><span class="attribute-name">border</span>=<span class="attribute-value">0</span>&gt;&lt;/<span class="end-tag">a</span>&gt;&lt;/<span class="end-tag">td</span>&gt;</strong></strong></p>
<p>Notice that <strong><strong>&lt;<span class="start-tag">img</span><span class="attribute-name"> src</span>=<span class="attribute-value"> </span></strong></strong><span class="attribute-value">is not included above, nor is it included in the first part of the split result.</span></p>
<p>Line 11 splits the above text on <strong>"</strong><br />
Now guess how many parts it will be split into.</p>
<p>Exactly, into 3 parts,</p>
<p>where the first part is an empty string, because the double quote we split on is the very first character,</p>
<p>the second part is <strong><strong><span class="attribute-value">./images/Head-chiranjeevi-jeevithaniki-pargu-lankai-enduku.gif</span></strong></strong></p>
<p>and the third part is <strong><strong><span class="attribute-value"> </span><span class="attribute-name">border</span>=<span class="attribute-value">0</span>&gt;&lt;/<span class="end-tag">a</span>&gt;&lt;/<span class="end-tag">td</span>&gt;</strong></strong></p>
<p>So what we want is the second part, but it is not a complete URL: the host name, i.e. www.telugulo.com, is missing.</p>
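<p>Another way to complete the URL, instead of gluing strings together, is to resolve the relative path against the address of the page it came from. The sketch below is an alternative under that assumption; the class ImageUrlResolver, its two methods and the shortened file name Head-example.gif are made up for illustration.</p>

```java
import java.net.URI;

// Hypothetical helper: extracts the src value and resolves it against the page URL.
final class ImageUrlResolver {
    // Returns the text between src=" and the next double quote, or null if there is none.
    static String srcOf(String line) {
        String key = "src=\"";
        int start = line.indexOf(key);
        if (start == -1) {
            return null;
        }
        start += key.length();
        int end = line.indexOf('"', start);
        return end == -1 ? null : line.substring(start, end);
    }

    // Resolves a relative path such as ./images/x.gif against the page's own URL.
    static String absoluteUrl(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        // Invented line in the same shape as the one extracted from the page.
        String line = "<td><a href=\"view_news.php?id=6528\">"
                + "<img src=\"./images/Head-example.gif\" border=0></a></td>";
        System.out.println(absoluteUrl("http://telugulo.com/news.php?section=2", srcOf(line)));
    }
}
```

<p>Resolving also removes the leading "./" from the path and keeps whatever host the page was fetched from, so the host name does not have to be hard-coded.</p>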
<p>Line 15 adds the host name to the img src, making it a complete URL, on which you can open a connection as we did initially, get the stream, and write it into a file headline.gif.</p></div>
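<p>The final step, writing the image stream into headline.gif, is not shown in the listings. A minimal sketch might look like this, assuming the complete image URL is passed on the command line; the class ImageSaver and the method save are illustrative names.</p>

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies an image stream into a local file.
final class ImageSaver {
    // Writes every byte of the stream to target, replacing an existing file,
    // and returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java ImageSaver IMAGE_URL");
            return;
        }
        // Images are binary data, so the raw stream is copied without a Reader.
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            long bytes = save(in, Paths.get("headline.gif"));
            System.out.println("wrote " + bytes + " bytes to headline.gif");
        }
    }
}
```

<p>Unlike the page itself, the image should not go through a BufferedReader, because decoding the bytes as characters would corrupt the file.</p>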
]]></content:encoded>
			<wfw:commentRss>http://www.blog.karthikbala.com/crawling-made-easy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
