Crawling Made Easy
Posted: June 8th, 2009 | Author: admin | Filed under: Programming | Tags: crawling, java, Programming, web
In this post I will explain how to crawl a website and extract the content you need, all using Java.
I am working on a project that crawls different Telugu cinema websites, extracts the content I need, and displays it all together on a single page, i.e. on my personal web page.
As part of that I crawled www.telugulo.com and extracted its cinema section, so let's see how I did it. Excuse me if you don't understand Telugu, but it won't be a constraint for learning to crawl.
When I clicked the cinema link on the homepage of www.telugulo.com, it took me to a page with the URL http://telugulo.com/news.php?section=2. I found the headlines on this page interesting and decided to extract all of them.
Then I wrote a program that opens the page and extracts the headlines every half hour, so that I always have the latest ones. I can then use the headlines for any purpose, such as displaying them directly on my website.
So now let's look at how Java helps in doing all this:
Note the URL you want to crawl; here it is http://telugulo.com/news.php?section=2
- String strurl = "http://telugulo.com/news.php?section=2";
- URL url = new URL(strurl);
- System.out.println("Received url is " + url);
- URLConnection hpCon = url.openConnection();
- InputStream ins = hpCon.getInputStream();
- InputStreamReader bis = new InputStreamReader(ins);
- BufferedReader teluStream = new BufferedReader(bis);
- System.out.println("got the stream");
Look at the above code, which uses classes from java.net (URL, URLConnection) and java.io (InputStream, InputStreamReader, BufferedReader):
line 4 opens a connection to the given URL; hpCon is now that connection,
line 5 gets the content of the page as a stream,
line 7 wraps it in a BufferedReader, which is used to read a character stream.
So now teluStream.readLine() is the function you need to read the content of the page line by line.
Once you have a line, check whether it is a headline and, if so, extract it and save it in a local file.
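The steps above can be put together as a small helper. This is only a sketch of one way to do it: the class name PageReader and its method names are mine, not part of the original program, and I am assuming the page is UTF-8 encoded, which I have not verified for this site. IOExceptions are wrapped in UncheckedIOException just to keep the sketch short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for illustration.
class PageReader {

    // Reads every line from the given reader into a list, then closes the reader.
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Opens a connection to pageUrl and returns the page content line by line.
    // Assumption: the page is UTF-8 encoded.
    static List<String> fetchLines(String pageUrl) {
        try {
            URLConnection connection = new URL(pageUrl).openConnection();
            return readLines(new InputStreamReader(
                    connection.getInputStream(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling PageReader.fetchLines("http://telugulo.com/news.php?section=2") would then give you the lines to inspect. Keeping readLines separate from the network call means you can try the parsing on a saved copy of the page first.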
I recommend using Firefox, where you can add several plugins that make your crawling job easier.
Now let me tell you how I extracted the headlines:
I found that the headlines are images rather than text: each headline we read on that page is an image, so I want to extract those images.
There are many images on the page, so to extract exactly the headlines we need to find something that distinguishes them from the other images.
I found that the headline images start after a particular phrase
alt="taja"
which means fresh.
So I just called teluStream.readLine() until a line contained the phrase alt="taja", and then extracted the first five images, which are none other than the required headlines.
I used a while loop like the one below to extract the headline images:
- String streamLine;
- int i = 0;
- String[] gifStr = new String[5];
- while((streamLine = teluStream.readLine()) != null)
- {
-     if(streamLine.contains(regx)) // regx holds the marker phrase alt="taja"
-     {
-         while(i < 5 && (streamLine = teluStream.readLine()) != null)
-         {
-             if(streamLine.contains("gif"))
-             {
-                 gifStr[i] = streamLine;
-                 System.out.println("gif string: " + i + " " + gifStr[i]);
-                 i++;
-             }
-         }
-         break;
-     }
- }
The outer while loop iterates over all the lines of the page, but I only wanted the images after the phrase alt="taja", so
line 6 checks whether the line contains the taja phrase; if so,
line 8 is another while loop, which runs while i < 5,
line 10 checks whether the line contains a gif (image); if so,
line 12 stores that line in an array,
and having got an image, line 14 increments i.
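The same idea can be written as a single loop over any Reader, which makes it easy to try on a small test string before pointing it at the live page. This is a sketch under my own naming (HeadlineLines and afterMarker are hypothetical, not from the original program), and it returns a list instead of filling a fixed array.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for illustration.
class HeadlineLines {

    // Returns up to maxCount lines containing "gif" that come after the first
    // line containing marker. Returns an empty list if the marker never appears.
    static List<String> afterMarker(Reader page, String marker, int maxCount) {
        List<String> found = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(page)) {
            boolean markerSeen = false;
            String line;
            while (found.size() < maxCount && (line = reader.readLine()) != null) {
                if (!markerSeen) {
                    // Still looking for the marker; the marker line itself is not collected.
                    markerSeen = line.contains(marker);
                } else if (line.contains("gif")) {
                    found.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

For the page in this post the call would be something like afterMarker(reader, "alt=\"taja\"", 5). Checking the size limit before reading means no extra line is consumed once five images have been found.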
We are almost done: we have found the lines of the page that contain the information we want.
Now from each of those lines we have to extract the exact path of the image.
The programmer who designed the page only writes the image path in the page; it is the browser's job to fetch the image from that path and display it in the specified area.
So our job is to get the path, open a connection to it, and read the content of the image, just as we read the content of the page at the very beginning.
But first let's see how to get the exact path from a line we extracted out of the big page.
The line we extracted looks like this:
<td><a href="view_news.php?id=6528"><img src="./images/Head-chiranjeevi-jeevithaniki-pargu-lankai-enduku.gif" border=0></a></td>
We want the img src in the above line. Simple string manipulation gives it to us, so let's see the manipulation part of the game:
- /*
-  * This function returns the image path from the given String
-  */
- public String getImageurl(String urlLine) throws Exception {
-     String[] str1;
-     String[] str2 = new String[0];
-     str1 = urlLine.split("<img src=");
-     if (str1.length > 1 && str1[1].contains("\""))
-     {
-         str2 = str1[1].split("\"");
-     }
-     if (str2.length < 2) throw new Exception("no quoted img src found in: " + urlLine);
-     System.out.println(str2.length + " " + str2[1]);
-     String imageUrl = "http://" + "www.telugulo.com" + "/" + str2[1];
-     System.out.println(imageUrl);
-     return imageUrl;
- }
getImageurl is the function that returns the exact path of the image.
String urlLine, the argument passed to the function, is the line that contains the img src.
When I looked at the line carefully I found that the img src sits between double quotes, but there is another pair of double quotes earlier in the line.
So I used the split function to split the line into two parts, so that the section with the first pair of quotes is removed.
The split function splits a string into parts around the argument you pass to it (which it treats as a regular expression) and always returns an array of the resulting pieces.
If you use split("is") on the line "karthik is hero":
String line = "karthik is hero";
String[] result = line.split("is");
then result[0] will contain "karthik " and result[1] will contain " hero" (the surrounding spaces are kept). Notice that the result does not include the value you used to split.
So line 7 splits the given line into 2 parts, where the second part has the img src.
The second part looks like this:
"./images/Head-chiranjeevi-jeevithaniki-pargu-lankai-enduku.gif" border=0></a></td>
Notice that <img src= is not included in the above, nor is it included in the first part of the split result.
Line 10 splits the above line on the double quote (").
Now guess how many parts the line will be split into.
Exactly, into 3 parts,
where the first part is an empty string, because the double quote, which is the split value, is the first character,
the second part is ./images/Head-chiranjeevi-jeevithaniki-pargu-lankai-enduku.gif
and the third part is border=0></a></td> (with a leading space).
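If you want to check the three parts yourself, or prefer not to depend on where the quotes fall, here is a small sketch. The splitOnQuote method is the approach described above; the extract method is an alternative I am swapping in that uses a regular expression instead of split, so treat it as an assumption and not as what the original program does. The class name ImgSrc is also hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for illustration.
class ImgSrc {

    // Matches the double-quoted value of the src attribute inside an <img> tag.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*\\ssrc=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // The approach from the post: split on the double quote.
    // For input starting with a quote, the first element is an empty string.
    static String[] splitOnQuote(String afterImgSrc) {
        return afterImgSrc.split("\"");
    }

    // Alternative (not from the post): returns the src value of the first
    // <img> tag in htmlLine, or null if the line has no quoted img src.
    static String extract(String htmlLine) {
        Matcher matcher = IMG_SRC.matcher(htmlLine);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

The regular expression version does not care whether other quoted attributes such as href come before the img tag, at the cost of being a little harder to read.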
So what we want is the second part, but it is not a complete URL: the host name, i.e. www.telugulo.com, is missing.
Line 14 adds the host name to the img src, making it a complete URL, on which you can open a connection as we did at the start, get the stream, and write it to a file headline.gif.
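This last step can be sketched as follows. Instead of gluing the host name on by hand, the sketch resolves the relative path against the page URL with java.net.URI, which is a substitution of mine and not what the function above does; note that it reuses the host of the page URL (telugulo.com) and not www.telugulo.com. The class name ImageSaver and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; the name is an assumption for illustration.
class ImageSaver {

    // Resolves a relative image path such as "./images/x.gif" against the URL
    // of the page it was found on, and returns the absolute URL as a string.
    static String resolve(String pageUrl, String imgSrc) {
        return URI.create(pageUrl).resolve(imgSrc).toString();
    }

    // Copies all bytes from in to target, replacing an existing file,
    // closes the stream, and returns the number of bytes written.
    static long save(InputStream in, Path target) {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To download a headline you would pass the stream of the image connection, for example url.openConnection().getInputStream() for the resolved URL, together with Paths.get("headline.gif"). Images are binary data, so the stream is copied as bytes here and not read through a BufferedReader.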
