GB2456049A - Visual web crawler - Google Patents

Info

Publication number
GB2456049A
Authority
GB
United Kingdom
Prior art keywords
tags
website
spider
link
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0722043A
Other versions
GB0722043D0 (en)
Inventor
Javid Zeeshan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to GB0722043A
Publication of GB0722043D0
Publication of GB2456049A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A process for automatically crawling a website, building a visual hierarchy of its links and parsing its HTML tags, comprising: a website being provided to a web spider that parses the main page of the website and builds a visual hierarchy of all referral links within the provided domain. The spider parses predefined tags and stores the parsed tags, along with their content, in a database. After parsing the main page, the spider visits every page in the hierarchy one by one, parses each page for the requested tags and stores the tag data in the database against the website link.

Description

1. Description
1.1 Technical Field of Invention
The invention relates to crawling websites, parsing the HTML tags of web pages and presenting the hierarchy of website links in a visual format.
1.2. Background
A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner.
This process is called web crawling or spidering. Search engines in particular use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
Most search engines have their own crawler to index website pages; however, these spiders do not provide a visual hierarchy of website links, nor do they allow users to specify which HTML tags to parse on a specific website.
The situation becomes complex when, for example, a group of companies has multiple internal websites and wants to parse only the HTML title tag, or wants to query data only against a specific HTML tag using a search engine.
In such cases there is a clear need for a methodology that gives the flexibility to parse specific HTML tags and the option to add or skip web pages or sites visually.
1.3. Summary of the Invention
The invented visual web spider navigates a given URL, or can fetch a website URL from a predefined list in the database. The visual web spider navigates the main page of the website and parses predefined HTML tags, which include <Title>, <Meta>, <Body> and <A href="...">; however, the visual web spider also gives an option to define the HTML tags to parse according to the user's requirements. After parsing the main page, the visual web spider builds the top-level hierarchy of the main page from its <a href="..."> links. On completing the main page, the visual web spider moves to the first parsed link of the main page, runs the same operation on that web page, parses its HTML tags and builds a nested hierarchy of referral links. The visual web spider applies the same operation to each parsed link and keeps building the nested hierarchy from the root URL to the last parsed link.
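The tag-parsing step described above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions, not the patented implementation: the class name `TagCollector`, the constant `DEFAULT_TAGS`, the record layout and the sample page `PAGE` are all hypothetical.

```python
from html.parser import HTMLParser

# Default tags named in the summary; the set is meant to be user-configurable.
DEFAULT_TAGS = {"title", "meta", "body", "a"}


class TagCollector(HTMLParser):
    """Collects attributes and text for a configurable set of HTML tags."""

    def __init__(self, wanted=DEFAULT_TAGS):
        super().__init__()
        self.wanted = {t.lower() for t in wanted}
        self.records = []  # one dict per matched tag occurrence
        self.links = []    # href values of <a> tags, in document order
        self._open = []    # records of wanted tags that are currently open

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lower-cased.
        if tag not in self.wanted:
            return
        record = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.records.append(record)
        if tag == "a" and record["attrs"].get("href"):
            self.links.append(record["attrs"]["href"])
        if tag != "meta":  # <meta> is a void element: no text, no end tag
            self._open.append(record)

    def handle_endtag(self, tag):
        # Close the innermost open record for this tag (and anything nested in it).
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text belongs to every wanted tag that is still open.
        for record in self._open:
            record["text"] += data


# Hypothetical sample page, used only to exercise the sketch.
PAGE = """<html><head><title>Home</title>
<meta name="description" content="Demo page"></head>
<body><a href="sports.htm">Sports</a> <a href="business.htm">Business</a></body></html>"""

collector = TagCollector()
collector.feed(PAGE)
print(collector.links)  # ['sports.htm', 'business.htm']
```

Passing a different `wanted` set, for example `TagCollector({"h1", "a"})`, corresponds to the option of letting the user define which tags are parsed.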
While parsing website pages, the visual web spider stores the related information in the database and builds an index of all web pages, which could be used for search engine purposes. The visual web spider also allows the sitemap of a parsed website to be retrieved in visual hierarchy format.
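One way to picture the storage, the index and the sitemap retrieval is a small SQLite sketch. The description does not specify a schema, so the table names `links` and `tag_data`, the column names and the helper functions below are assumptions chosen purely for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) schema: one table for the link tree, one for tag data."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY, parent TEXT);
        CREATE TABLE IF NOT EXISTS tag_data (url TEXT, tag TEXT, content TEXT);
        CREATE INDEX IF NOT EXISTS idx_tag ON tag_data (tag);
    """)
    return db


def save_page(db, url, parent, tags):
    """Store a page under its parent link together with its (tag, content) pairs."""
    db.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (url, parent))
    db.executemany("INSERT INTO tag_data VALUES (?, ?, ?)",
                   [(url, tag, content) for tag, content in tags])
    db.commit()


def sitemap(db, root):
    """Return the stored hierarchy as indented lines, children in insertion order."""
    lines = []

    def walk(url, depth):
        lines.append("  " * depth + url)
        rows = db.execute(
            "SELECT url FROM links WHERE parent = ? ORDER BY rowid", (url,)
        ).fetchall()
        for (child,) in rows:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


def search(db, tag, term):
    """Find pages whose content for one specific tag contains the search term."""
    rows = db.execute(
        "SELECT DISTINCT url FROM tag_data WHERE tag = ? AND content LIKE ? ORDER BY url",
        (tag, f"%{term}%"),
    )
    return [url for (url,) in rows]


db = open_store()
save_page(db, "main.htm", None, [("title", "Main page")])
save_page(db, "child1.htm", "main.htm", [("title", "Cricket news")])
save_page(db, "child2.htm", "main.htm", [("title", "Football news")])
print("\n".join(sitemap(db, "main.htm")))
print(search(db, "title", "cricket"))  # ['child1.htm']
```

Keeping the tag name in its own column is what makes a query restricted to one tag (title only, meta only, and so on) a plain `WHERE` clause.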
1.4. Brief Description of the Diagram
1.4.1. Figure 1 illustrates the hierarchical representation of the process according to the present invention.
1.4.2. Figure 2 illustrates how the visual web spider parses HTML tags and crawls from page to page.
1.4.3. Figure 3 shows the Visual Web Spider screen, with the tree-view hierarchy of the website's links displayed while crawling.
1.5. Detailed Description of the Invention with Diagrams
Figure 1: For ease of understanding, every process has been assigned a process reference number.
As shown in Figure 1, the visual web spider starts crawling and, in process 2, checks whether the user wants to select HTML tags from the library; the user can also add further HTML tags to parse, labelled process 3 in the diagram. The visual web spider has the following predefined HTML tags in the database to parse:
<Head>...</Head>
<Title>...</Title>
<meta name="description" content="...">
<meta name="keywords" content="...">
<Body>...</Body>
<a href="...">...</a>
After the user commits the HTML tags, the spider moves to process 4, where the user either selects a URL from the library or resumes an existing process; the spider also gives the option to add a URL manually to start the parsing process, labelled process 5 in the diagram.
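Processes 2 to 5 amount to choosing a tag set and a start URL. A minimal sketch follows; the function names are hypothetical, and the precedence used when several URL sources are supplied (manual URL first, then a resumed URL, then the first library entry) is an assumption, since the description does not state one.

```python
# Tag names from the predefined library listed above (lower-cased).
PREDEFINED_TAGS = ("head", "title", "meta", "body", "a")


def select_tags(use_library=True, extra_tags=()):
    """Processes 2 and 3: start from the library and append user-defined tags."""
    selected = list(PREDEFINED_TAGS) if use_library else []
    for tag in extra_tags:
        # Accept input such as "<H1>", "h1" or " H1 " and normalise it to "h1".
        name = tag.strip().strip("<>/").lower()
        if name and name not in selected:
            selected.append(name)
    return selected


def select_start_url(library_urls, manual_url=None, resume_url=None):
    """Processes 4 and 5: pick the URL to crawl (precedence is an assumption)."""
    if manual_url:
        return manual_url
    if resume_url:
        return resume_url
    if library_urls:
        return library_urls[0]
    raise ValueError("no URL available to start crawling")


print(select_tags(extra_tags=["<H1>", "title"]))
# ['head', 'title', 'meta', 'body', 'a', 'h1']
print(select_start_url(["http://www.example.com/"]))
# http://www.example.com/
```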
Once the URL to parse has been finalized, in process 6 the spider starts navigating the selected URL and moves to process 7, where it parses all HTML tags, and then builds the tree hierarchy from the <A href=".."> tags in process 8. Once the complete page has been navigated, the spider indexes all tags and content and puts them into the database in process 9.
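Building the tree from href values in process 8 requires turning each href into an absolute link and keeping only links inside the crawled domain, as the abstract requires. The sketch below uses `urllib.parse` from the standard library; the function name `child_links` and the exact filtering rules (same host, http or https only, fragments dropped, duplicates removed) are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def child_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep unique same-host links, in order."""
    host = urlparse(page_url).netloc
    seen, result = set(), []
    for href in hrefs:
        # Make the link absolute, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue  # mailto:, javascript:, or a link to another domain
        if absolute != page_url and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


hrefs = ["sports.htm", "#top", "http://example.com/x",
         "sports.htm", "mailto:a@example.com", "/business.htm"]
print(child_links("http://news.bbc.co.uk/", hrefs))
# ['http://news.bbc.co.uk/sports.htm', 'http://news.bbc.co.uk/business.htm']
```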
In process 10, the spider checks whether the current page has any child link available. For example, if the spider navigates news.bbc.co.uk, the child links could be sports.htm, business.htm, technology.htm and so on. In this scenario the spider takes the first link of news.bbc.co.uk, which is sports.htm, and starts parsing it, going through processes 6 to 10 again and indexing the whole page content. In process 10 the spider again checks whether sports.htm has any child links, which could be cricket.htm, football.htm and so on. The tree hierarchy keeps nesting until the spider finds no child link in process 10. Each time a child link is found in process 10, the spider adds it to the tree hierarchy, which is labelled process 11.
When the spider cannot find a child link in process 10, it moves to process 12, where it checks whether any sibling link is available; in the current scenario, after parsing all nested links of cricket.htm, the next sibling is football.htm. On finding a sibling, the spider starts the same loop from process 6 to 11, parses all tags of the web page and updates the database.
In process 13 the spider checks whether the parent link has been parsed successfully down to its last child; if so, the spider stops parsing and updates the domain status to "parsed" in the database.
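The child-first, then-sibling order of processes 10 to 13 is a depth-first traversal. The sketch below replays it on an in-memory stand-in for the website (a dictionary mapping each page to its child links) so that it runs without network access. The names `SITE` and `crawl` are hypothetical, and the `visited` set that guards against pages linking back to each other is an added assumption; the description does not say how repeated links are handled.

```python
# Hypothetical stand-in for the website: page -> child links found on that page.
SITE = {
    "news.bbc.co.uk": ["sports.htm", "business.htm", "technology.htm"],
    "sports.htm": ["cricket.htm", "football.htm"],
    "cricket.htm": ["sports.htm"],  # link back to the parent, ignored below
}


def crawl(site, root):
    """Depth-first crawl: descend to the first child, then move on to siblings."""
    visited = {root}
    order = [root]       # pages in the order they were parsed (processes 6 to 9)
    tree = {root: []}    # parent -> children, i.e. the tree hierarchy
    stack = [(root, iter(site.get(root, [])))]
    while stack:
        parent, remaining = stack[-1]
        link = next(remaining, None)
        if link is None:
            stack.pop()              # process 13: this parent is fully parsed
            continue
        if link in visited:
            continue                 # already parsed elsewhere in the tree
        visited.add(link)
        order.append(link)
        tree[parent].append(link)    # process 11: add the child to the hierarchy
        tree[link] = []
        # Process 10: descend into the new page; its siblings (process 12)
        # stay queued in the parent's iterator until this branch is done.
        stack.append((link, iter(site.get(link, []))))
    return order, tree, "parsed"


order, tree, status = crawl(SITE, "news.bbc.co.uk")
print(order)
# ['news.bbc.co.uk', 'sports.htm', 'cricket.htm', 'football.htm',
#  'business.htm', 'technology.htm']
```

An explicit stack is used instead of recursion so that a deeply nested site cannot exhaust the interpreter's recursion limit.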
Figure 2 illustrates an example of parsed HTML pages. In the figure, main.htm is the main page, which has three links as follows: Child1.htm, Child2.htm and Child3.htm. As shown in the figure, the spider parses the <meta...>, <title>, <body> and <a href> tags and updates the database.
After parsing main.htm, control moves to child1.htm, where the spider runs the same process, parses the desired HTML tags and updates the database. Once child1.htm has been parsed successfully, the spider moves to child1-1.htm, which is the first child of child1.htm. In this way the complete procedure runs from the root link to the last nested link, and the spider crawls all pages one by one and parses all links.
Figure 3 represents the main screen of the spider. The topmost text box shows the current URL of the website, the next box shows the total number of parsed links of the website, and the bottom text box shows the current sub-link of the website.
As shown, a browser control has been integrated into the spider screen to give the user a view of the current web page alongside the website link hierarchy tree, which appears under the browser control. The user also has the option to skip the current link or to stop or pause the crawling process.
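The status fields and the skip, stop and pause controls on this screen can be modelled as a small state object that the crawl loop consults before each link. This is a sketch under assumptions: the names `CrawlStatus` and `run`, the flat list of pending links and the sample file names are hypothetical and stand in for the real user interface and tree traversal.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlStatus:
    current_url: str = ""        # top text box: website being crawled
    total_parsed: int = 0        # next box: number of links parsed so far
    current_sublink: str = ""    # bottom text box: link being processed
    skip: set = field(default_factory=set)  # links the user chose to skip
    paused: bool = False
    stopped: bool = False


def run(status, pending, parse_page):
    """Process pending links until the list is empty or the user pauses or stops."""
    while pending and not (status.paused or status.stopped):
        link = pending.pop(0)
        status.current_sublink = link
        if link in status.skip:
            continue
        parse_page(link)
        status.total_parsed += 1
    return status


parsed = []
status = CrawlStatus(current_url="http://www.example.com/", skip={"b.htm"})
pending = ["a.htm", "b.htm", "c.htm"]


def parse_and_pause(link):
    parsed.append(link)
    status.paused = True  # simulate the user pressing "pause" during the first page


run(status, pending, parse_and_pause)
print(parsed, pending)    # ['a.htm'] ['b.htm', 'c.htm']

status.paused = False     # resume: the remaining links are still queued
run(status, pending, parsed.append)
print(parsed, status.total_parsed)  # ['a.htm', 'c.htm'] 2
```

Because the unprocessed links stay in `pending`, resuming continues from the link after the last one navigated rather than starting again from the root.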

Claims (8)

  1. Website crawling means, wherein a spider crawls all HREF links of a website that is given or fetched from the database, and develops a visual hierarchy of the whole website's href links in tree-view style from the root link to the last nested link of the website.
  2. On crawling every single website link, the invented process parses the defined HTML tags from the web page code and puts them into the database.
  3. Defined tags according to claim 2, wherein the process has the predefined tags <Head>, <Meta>, <Body> and <A href="..">; however, other tags can also be included according to requirements with the help of a control panel.
  4. The invented process shows website pages visually, as an Internet browser does, and also provides the current status of the navigated link against the tree hierarchy on the same screen.
  5. The spider makes indexes for all web pages and stores all tag data in the database.
  6. The invented process provides a backbone for search engines to run their queries against the text of specific tags (i.e. title, meta, body) or generally.
  7. The invented process can be scheduled to parse a list of websites automatically at a desired time.
  8. On resume, the invented process keeps track of the last navigated link and starts navigation processing from the same link.
GB0722043A 2007-11-12 2007-11-12 Visual web crawler Withdrawn GB2456049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0722043A GB2456049A (en) 2007-11-12 2007-11-12 Visual web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0722043A GB2456049A (en) 2007-11-12 2007-11-12 Visual web crawler

Publications (2)

Publication Number Publication Date
GB0722043D0 GB0722043D0 (en) 2007-12-19
GB2456049A true GB2456049A (en) 2009-07-08

Family

ID=38858456

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0722043A Withdrawn GB2456049A (en) 2007-11-12 2007-11-12 Visual web crawler

Country Status (1)

Country Link
GB (1) GB2456049A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9286378B1 (en) * 2012-08-31 2016-03-15 Facebook, Inc. System and methods for URL entity extraction
CN105760514A (en) * 2016-02-24 2016-07-13 西安交通大学 Method for automatically obtaining short text of knowledge domain from community question-and-answer website
CN107122389A (en) * 2017-03-03 2017-09-01 杭州电子科技大学 It is a kind of to realize the method that streaming and multi-mode quickly search URL link in webpage

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765402A (en) * 2019-10-31 2020-02-07 同方知网(北京)技术有限公司 Visual acquisition system and method based on network resources
CN117891992B (en) * 2023-12-22 2024-09-06 赛迪检测认证中心有限公司 Data crawling method and device, electronic equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9286378B1 (en) * 2012-08-31 2016-03-15 Facebook, Inc. System and methods for URL entity extraction
CN105760514A (en) * 2016-02-24 2016-07-13 西安交通大学 Method for automatically obtaining short text of knowledge domain from community question-and-answer website
CN105760514B (en) * 2016-02-24 2018-12-07 西安交通大学 A method of ken short text is obtained automatically from community question and answer website
CN107122389A (en) * 2017-03-03 2017-09-01 杭州电子科技大学 It is a kind of to realize the method that streaming and multi-mode quickly search URL link in webpage
CN107122389B (en) * 2017-03-03 2018-05-04 杭州电子科技大学 A kind of method realized streaming and multi-mode and quickly search URL link in webpage

Also Published As

Publication number Publication date
GB0722043D0 (en) 2007-12-19

Similar Documents

Publication Publication Date Title
Seymour et al. History of search engines
US6604099B1 (en) Majority schema in semi-structured data
US6931397B1 (en) System and method for automatic generation of dynamic search abstracts contain metadata by crawler
US8554800B2 (en) System, methods and applications for structured document indexing
Ahmadi-Abkenari et al. An architecture for a focused trend parallel Web crawler with the application of clickstream analysis
US9092756B2 (en) Information-retrieval systems, methods and software with content relevancy enhancements
US8275766B2 (en) Systems and methods for detecting network resource interaction and improved search result reporting
US7698329B2 (en) Method for improving quality of search results by avoiding indexing sections of pages
CN101004762A (en) Network web page system of a dynamic multidimensional Internet
Ankalkoti Survey on search engine optimization tools & techniques
Parikh et al. Search engine optimization
GB2456049A (en) Visual web crawler
US20080208803A1 (en) System and method for characterising a web page
Wukovitz Using internet search engines and library catalogs to locate toxicology information
Jadhav et al. Significant role of search engine in higher education
US8315998B1 (en) Methods and apparatus for focusing search results on the semantic web
Penman et al. Web scraping made simple with sitescraper
Sharma et al. Search engine: a backbone for information extraction in ICT scenario
Enache Optimization Methods And Seo Tools
Hossein Farajpahlou et al. How are XML‐based Marc 21 and Dublin Core records indexed and ranked by general search engines in dynamic online environments?
Lam et al. Web information extraction
Aliyu et al. Google query optimization tool
CA2514165A1 (en) Metadata content management and searching system and method
Alafif et al. Domain and range identifier module for semantic web search engines
Ahuja et al. Hidden web data extraction tools

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)