GB2456049A

GB2456049A - Visual web crawler

Info

Publication number: GB2456049A
Application number: GB0722043A
Authority: GB
Inventors: Javid Zeeshan
Original assignee: Individual
Current assignee: Individual
Priority date: 2007-11-12
Filing date: 2007-11-12
Publication date: 2009-07-08
Also published as: GB0722043D0

Abstract

A process for automatically crawling a website, building a visual hierarchy of its links and parsing HTML tags, comprising: a website being provided to a web spider that parses the main page of the website and builds a visual hierarchy of all referral links within the provided domain. The spider parses predefined tags and stores the parsed tags, along with their content, in a database. After parsing the main page, the spider visits each page in the hierarchy one by one, parses it for the requested tags and stores the tag data in the database against the website link.

Description

1. Description

1.1. Technical Field of Invention

The invention relates to crawling websites, parsing the HTML tags of web pages and presenting the hierarchy of website links in a visual format.

1.2. Background

A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner.

This process is called web crawling or spidering. Search engines in particular use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

Most search engines have their own crawler to index website pages; however, these spiders do not provide a visual hierarchy of website links, nor do they allow users to specify which HTML tags to parse on a specific website.

The situation becomes more complex when, for example, a group of companies has multiple internal websites and wants to parse only the HTML title tag, or wants to query data only against a specific HTML tag using a search engine.

In such cases there is a clear need for a methodology that gives the flexibility to parse specific HTML tags and the option to add or skip web pages or sites visually.

1.3. Summary of the Invention

The invented visual web spider navigates a given URL or can fetch a website URL from a predefined list in the database. The visual web spider navigates the main page of the website and parses predefined HTML tags, which include <Title>, <Meta>, <Body> and <A href="...">; however, it also gives the option to define the HTML tags to parse according to the user's requirements. After parsing the main page, the visual web spider builds the top-level hierarchy of the main page from its <a href="..."> links. On completing the main page, the visual web spider moves to the first parsed link of the main page, runs the same operation on that web page, parses its HTML tags and builds a nested hierarchy of referral links. The visual web spider applies the same operation to each parsed link and keeps building the nested hierarchy from the root URL to the last parsed link.

While parsing the pages of a website, the visual web spider stores the related information in the database and builds an index of all web pages, which could be used for search engine purposes. The visual web spider also allows the parsed website's sitemap to be retrieved in a visual hierarchy format.
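The storage step described above can be sketched as follows. This is a minimal illustration only, assuming an SQLite database; the table name `parsed_tags`, its columns and the helper names `open_store`, `store_tags` and `search` are hypothetical and do not appear in the specification.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) table that holds one row per parsed tag."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parsed_tags ("
        " url TEXT NOT NULL,"
        " tag TEXT NOT NULL,"
        " content TEXT NOT NULL)"
    )
    # Index on the tag column to help queries that are restricted to one tag.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_tag ON parsed_tags (tag)")
    return conn


def store_tags(conn, url, tags):
    """Store (tag, content) pairs against the URL of the page they came from."""
    conn.executemany(
        "INSERT INTO parsed_tags (url, tag, content) VALUES (?, ?, ?)",
        [(url, tag, content) for tag, content in tags],
    )
    conn.commit()


def search(conn, term, tag=None):
    """Return URLs whose stored content contains term, optionally within one tag only."""
    sql = "SELECT DISTINCT url FROM parsed_tags WHERE content LIKE ?"
    params = ["%" + term + "%"]
    if tag is not None:
        sql += " AND tag = ?"
        params.append(tag)
    sql += " ORDER BY url"
    return [row[0] for row in conn.execute(sql, params)]
```

With such a layout, a search could be limited to one tag, for example `search(conn, "Cricket", tag="title")`, or run across all stored tags by omitting the `tag` argument.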

1.4. Brief Description of the Diagram

1.4.1. Figure 1 illustrates the hierarchical representation of the process in the light of the present invention.

1.4.2. Figure 2 illustrates how the visual web spider parses HTML tags and crawls from page to page.

1.4.3. Figure 3 shows the visual web spider screen with the tree view hierarchy of the website's links while crawling.

1.5. Detailed Description of the Invention with Diagrams

Figure 1: For better understanding, all processes have been assigned process reference numbers.

As shown in Figure 1, the visual web spider starts crawling and in process 2 checks whether the user wants to select HTML tags from the library; the user can also add desired HTML tags to parse, labelled process 3 in the diagram. The visual web spider has the following predefined HTML tags in the database to parse:

<Head>...</Head>
<Title>...</Title>
<meta name="description" content="...">

Once the URL to parse has been finalized, in process 6 the spider starts navigating the selected URL and moves to process 7, where it parses all HTML tags and, in process 8, builds the tree hierarchy with the help of the <A href=".."> tags. Once the complete page has been navigated, the spider indexes all tags and content and puts them into the database in process 9.
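The tag parsing of process 7 might be sketched as below, using Python's standard `html.parser` module. The class name `TagCollector`, the function `parse_page` and the `DEFAULT_TAGS` tuple are hypothetical names chosen for illustration; collecting links even when `a` is not among the selected tags is an assumption made here so that the hierarchy can still be built.

```python
from html.parser import HTMLParser

# Default tag set, following the tags named in the description; callers may pass their own.
DEFAULT_TAGS = ("title", "meta", "body", "a")


class TagCollector(HTMLParser):
    """Collect the text of selected tags, meta content values and href links."""

    def __init__(self, wanted=DEFAULT_TAGS):
        super().__init__()
        self.wanted = set(wanted)
        self.records = []  # (tag, content) pairs, in the order the tags are completed
        self.links = []    # href values of <a> tags, always collected
        self._open = []    # [tag, text parts] entries for selected tags still open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            if "meta" in self.wanted and attrs.get("content"):
                self.records.append(("meta", attrs["content"]))
            return  # meta is a void element and has no text content
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag in self.wanted:
            self._open.append([tag, []])

    def handle_data(self, data):
        # Text belongs to every selected tag that is currently open,
        # for example an <a> nested inside <body>.
        for entry in self._open:
            entry[1].append(data)

    def handle_endtag(self, tag):
        # Close the innermost open entry for this tag, if there is one.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                name, parts = self._open.pop(i)
                self.records.append((name, " ".join("".join(parts).split())))
                break


def parse_page(html, wanted=DEFAULT_TAGS):
    """Return (records, links) for one page of HTML source."""
    collector = TagCollector(wanted)
    collector.feed(html)
    collector.close()
    return collector.records, collector.links
```

Passing a different `wanted` tuple, such as `("title",)`, corresponds to the user selecting only specific tags to parse.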

In process 10, the spider checks whether the current page has any child links. For example, if the spider navigates news.bbc.co.uk, the child links could be sports.htm, business.htm, technology.htm, etc. In this scenario the spider takes the first link of news.bbc.co.uk, which is sports.htm, and starts parsing it, going through processes 6 to 10 again and indexing the whole page content. In process 10 the spider again checks whether sports.htm has child links, which could be cricket.htm, football.htm, etc. The tree hierarchy keeps nesting until the spider finds no child link in process 10. Each time, in process 10, the spider adds the child link to the tree hierarchy, which is labelled process 11.

Where the spider cannot find a child link in process 10, it moves to process 12, where it checks whether any sibling link is available; in the current scenario, after parsing all nested links of cricket.htm, the next sibling is football.htm. On finding a sibling, the spider repeats processes 6 to 11, parses all tags of the web page and updates the database.

In process 13 the spider checks whether the parent link has been parsed successfully down to its last child; on a positive response the spider stops parsing and updates the domain status to "parsed" in the database.
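The child-then-sibling traversal of processes 6 to 13 roughly corresponds to a depth-first walk, which could be sketched as follows. The function name `crawl` and the caller-supplied `fetch_links` function are hypothetical; the `visited` set is an assumption added here to avoid looping on pages that link back to each other, and is not described in the specification.

```python
from urllib.parse import urljoin, urlparse


def crawl(root_url, fetch_links):
    """Depth-first crawl returning a nested tree {"url": ..., "children": [...]}.

    fetch_links(url) is a caller-supplied (hypothetical) function that returns
    the href values found on the page at url.
    """
    domain = urlparse(root_url).netloc
    visited = set()  # assumption: pages already seen are not crawled again

    def visit(url):
        visited.add(url)
        node = {"url": url, "children": []}
        for href in fetch_links(url):
            child = urljoin(url, href)  # resolve relative links such as "sports.htm"
            # Stay within the provided domain and skip pages already in the tree.
            if urlparse(child).netloc != domain or child in visited:
                continue
            # Recursing here handles all nested children of this link first;
            # the next loop iteration then moves on to the sibling link.
            node["children"].append(visit(child))
        return node

    return visit(root_url)
```

Because the sketch is recursive, a very deeply nested site could reach Python's recursion limit; an explicit stack would avoid that at the cost of slightly longer code.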

Figure 2 illustrates an example of parsed HTML pages. In the figure, main.htm is taken as the main page, which has three links as follows:

Child1.htm
Child2.htm
Child3.htm

As shown in the figure, in this process the spider parses the <meta ...>, <title>, <body> and <a href> tags and updates the database.

After parsing main.htm, control moves to child1.htm, where the spider runs a similar process, parses the desired HTML tags and updates the database. Once child1.htm has been parsed successfully, the spider moves to child1-1.htm, which is the first child of child1.htm. This is how the complete procedure goes from the root link to the last nested link: the spider crawls all pages one by one and parses all links.

Figure 3 represents the main screen of the spider: the topmost text box shows the current URL of the website, the next box shows the total number of parsed links of the website, and the bottom text box shows the current sub-link of the website.

As shown, a browser control has been integrated into the spider screen in order to give the user a view of the current web page against the website link hierarchy tree shown under the browser control. The user also has the option to skip the current link or to stop or pause the crawling process.
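As a plain-text stand-in for the tree view on this screen, the link hierarchy could be rendered as indented lines. The sketch below assumes each node is a dictionary with a "url" string and a "children" list; both that shape and the function name `render_tree` are hypothetical.

```python
def render_tree(node, depth=0):
    """Return the hierarchy as lines of text, indenting each nesting level by two spaces."""
    lines = ["  " * depth + node["url"]]
    for child in node["children"]:
        lines.extend(render_tree(child, depth + 1))
    return lines


# Example hierarchy modelled on the pages named for Figure 2.
example = {
    "url": "main.htm",
    "children": [
        {"url": "child1.htm", "children": [{"url": "child1-1.htm", "children": []}]},
        {"url": "child2.htm", "children": []},
        {"url": "child3.htm", "children": []},
    ],
}
print("\n".join(render_tree(example)))
```

Each level of nesting in the output corresponds to one level of the tree view, from the root link down to the last nested link.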

Claims

2. Claims

1. Website crawling means, wherein a spider crawls all HREF links of a given website, or of a website fetched from the database, and develops a visual hierarchy of the whole website's href links in tree view style from the root link to the last nested link of the website.
2. On crawling each website link, the invented process parses the defined HTML tags from the web page code and puts them into the database.
3. Defined tags according to claim 2, wherein the process has the predefined tags <Head>, <Meta>, <Body> and <A href="..">; however, other tags can also be included according to requirements with the help of a control panel.
4. The invented process shows website pages visually, as an Internet browser does, and also provides the current status of the navigated link against the tree hierarchy on the same screen.
5. The spider builds indexes for all web pages and stores all tag data in the database.
6. The invented process provides a backbone for search engines to run their queries against the text of specific tags (i.e. title, meta, body) or generally.
7. The invented process can be scheduled to parse a list of websites automatically at a desired time.
8. On resume, the invented process keeps track of the last navigated link and restarts navigation from that link.
