CN103514189A - Implementing method for web crawler based on search engines - Google Patents

Implementing method for web crawler based on search engines

Publication number
CN103514189A
Authority
CN
China
Prior art keywords
url
functional module
search
webpage
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210211633.8A
Other languages
Chinese (zh)
Inventor
蒋志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI BETOP INFORMATION TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI BETOP INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-06-25
Filing date
2012-06-25
Publication date
2014-01-15
Application filed by SHANGHAI BETOP INFORMATION TECHNOLOGY Co Ltd
Priority to CN201210211633.8A
Publication of CN103514189A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for implementing a search-engine-based web crawler. The web crawler comprises a socket functional module, an HTTP functional module, a regular expression functional module, a depth-first search functional module and a breadth-first search functional module. Traditional web crawler technology adopts a fixed search strategy and therefore lacks adaptability. To overcome this drawback, the method can meet a variety of customer requirements, thereby achieving a web crawler technology that updates data in real time.

Description

Method for implementing a search-engine-based web crawler
Technical field
The present invention relates to information search technology, and in particular to a method for implementing a search-engine-based web crawler.
Background art
With the development of the Internet, the network has gradually replaced other channels through which people obtain information. In the early days of the Internet, people mainly obtained the information they needed by browsing portal websites. However, with the rapid growth of the Web, finding the needed information in this way has become increasingly difficult. At present, most people obtain useful information through search engines, so the development of search engine technology directly affects the speed and quality with which people obtain the information they need.
In 1994, the world's first Web search tool, WebCrawler, was launched. Currently popular search engines include Baidu, Google, Yahoo, Infoseek, Inktomi, Teoma and Live Search. For reasons of trade secrecy, the technical details of the crawler systems used by these search engines are generally not disclosed, and the existing literature is limited to summary introductions. As network information resources grow exponentially and change dynamically, the information retrieval services provided by traditional search engines can no longer meet people's growing demand for personalized service, and they face a serious challenge.
A traditional crawler starts from the URLs of one or several initial pages and obtains the URLs on those pages. While crawling web pages, it continuously extracts new URLs from the current page and puts them into a queue until a certain stop condition of the system is met. The workflow of a focused crawler is more complex: it must filter out links irrelevant to the topic according to a certain web page analysis algorithm, retain the useful links, and put them into the queue of URLs waiting to be crawled. It then selects the next web page URL to crawl from the queue according to a certain search strategy and repeats the above process until a certain stop condition of the system is reached. In addition, all web pages fetched by the crawler are stored by the system, analysed and filtered to some extent, and indexed for later query and retrieval.
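The focused-crawler workflow described above (filter links by relevance, queue the useful ones, pick the next URL by a search strategy, stop on a condition) can be sketched roughly as follows. This is only an illustrative sketch: the names `focused_crawl`, `get_links` and `score`, the relevance threshold, the best-first strategy and the in-memory link table are all assumptions and do not come from the patent text.

```python
import heapq

def focused_crawl(seeds, get_links, score, threshold=0.5, max_pages=100):
    """Best-first crawl: keep only relevant links, fetch the best one next.

    All names and parameters here are hypothetical, for illustration only.
    """
    # heapq is a min-heap, so negative scores make the best link pop first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = []                                    # pages "stored", in crawl order
    while frontier and len(fetched) < max_pages:    # stop condition
        _, url = heapq.heappop(frontier)            # search strategy: best score first
        fetched.append(url)
        for link in get_links(url):
            s = score(link)
            # Filter out links judged irrelevant to the topic.
            if link not in seen and s >= threshold:
                seen.add(link)
                heapq.heappush(frontier, (-s, link))
    return fetched

# Hypothetical in-memory link table standing in for real web pages.
LINKS = {
    "seed": ["sports/1", "news/1", "ads/1"],
    "news/1": ["news/2"],
}

def topic_score(url):
    """Toy relevance score: news pages are most relevant, ads are not."""
    if url.startswith("news"):
        return 0.9
    if url.startswith("sports"):
        return 0.6
    return 0.1

crawl_order = focused_crawl(["seed"], lambda u: LINKS.get(u, []), topic_score)
print(crawl_order)
```

In this toy run the advertisement link is filtered out, and the news pages are fetched before the sports page because they score higher.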
Summary of the invention
To overcome the drawback that traditional web crawler technology adopts a fixed search strategy and lacks adaptability, the method proposed by the present invention can meet a variety of customer requirements and achieves a crawler technology that updates data in real time.
The method disclosed by the invention consists of five modules: a socket functional module, an HTTP functional module, a regular expression functional module, a depth-first search functional module and a breadth-first search functional module.
The socket functional module is the foundation on which the web crawler relies and is part of the structure of the knowledge management system. Through a socket, the client establishes a connection with the server.
The HTTP functional module: the client must define a set of URLs to determine the addresses to be browsed. After the client and the server are connected, the client sends a request. Upon receiving the request, the server returns a corresponding response message, so that the code of the web page can be extracted.
The regular expression functional module: a regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace the matched substring, or to extract from a string the substrings that satisfy a certain condition. After the client obtains the code of the web page from the server, the regular expression serves as a template: the URL character pattern is matched against the searched string, and the corresponding URLs are then extracted.
The depth-first search functional module: when the method starts, none of the URLs on the web page have been visited. The first URL extracted from the web page by the regular expression is taken as the initial starting point S and marked as visited. The link addresses W of the web pages connected to S are then searched in turn from S. If W has not been visited, depth-first traversal continues with W as the new starting point until the user's requirement is met. If unvisited vertices remain at that point, another unvisited vertex is selected as a new source and the above process is repeated, until all vertices in the web page have been visited, subject to the user's requirement.
The breadth-first search functional module: when the method starts, none of the URLs on the web page have been visited. The crawler starts from the URL of the initial page p0, searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the above process until the client's requirement is met.
Brief description of the drawings
The accompanying drawing is mainly intended to provide a further understanding of the present invention. It illustrates an embodiment of the invention and, together with this description, serves to explain the principle of the invention. In the drawing:
Fig. 1 schematically shows a flow chart of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the accompanying drawing.
In the embodiment shown in Fig. 1, the method disclosed by the invention consists of five modules: a socket functional module (1), an HTTP functional module (2), a regular expression functional module (3), a depth-first search functional module (4) and a breadth-first search functional module (5).
The socket functional module (1) is the foundation on which the web crawler relies and is part of the structure of the knowledge management system. Through a socket, the client establishes a connection with the server.
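As an illustration of the connection step described above, a minimal Python sketch might look like the following. The function name `open_connection`, the default port and timeout, and the example host are assumptions made for illustration; the patent does not specify an implementation language or API.

```python
import socket

def open_connection(host, port=80, timeout=5.0):
    """Open a TCP connection from the crawler (client) to a web server.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # create_connection resolves the host name and connects in one call.
    return socket.create_connection((host, port), timeout=timeout)

# Example usage (requires network access; "example.com" is a placeholder):
#   with open_connection("example.com") as sock:
#       print(sock.getpeername())
```

The returned socket object is what the HTTP step would then use to send a request and read the response.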
The HTTP functional module (2): the client must define a set of URLs to determine the addresses to be browsed. After the client and the server are connected, the client sends a request. Upon receiving the request, the server returns a corresponding response message, so that the code of the web page can be extracted.
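The request and response exchange described above can be sketched over a plain socket as shown below. This is a simplified sketch under stated assumptions: the helper names `build_get_request`, `parse_response` and `http_get` are hypothetical, only plain HTTP is handled, and chunked transfer encoding, redirects and HTTPS are deliberately ignored.

```python
import socket

def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",   # ask the server to close so we can read to EOF
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def parse_response(raw):
    """Split a raw response into (status_line, headers_text, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, _, headers = head.decode("iso-8859-1").partition("\r\n")
    return status_line, headers, body

def http_get(host, port=80, path="/", timeout=5.0):
    """Send a GET request over a socket and return the parsed response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:       # server closed the connection: response complete
                break
            chunks.append(data)
    return parse_response(b"".join(chunks))

# Example usage (requires network access; "example.com" is a placeholder):
#   status, headers, body = http_get("example.com")
```

The body returned here is the page source that the regular expression step would scan for URLs.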
The regular expression functional module (3): a regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace the matched substring, or to extract from a string the substrings that satisfy a certain condition. After the client obtains the code of the web page from the server, the regular expression serves as a template: the URL character pattern is matched against the searched string, and the corresponding URLs are then extracted.
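A small sketch of URL extraction with a regular expression is given below. The pattern `HREF_RE`, the helper `extract_urls` and the sample HTML are assumptions for illustration; the patent does not disclose the actual expression used. A regular expression is not a full HTML parser, so this pattern only covers quoted `href` attributes on anchor tags.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: the quoted href value of an <a ...> tag, without the fragment.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)

def extract_urls(html, base_url):
    """Return the distinct http(s) URLs linked from the page, in page order."""
    urls = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href.strip())   # resolve relative links
        if absolute.startswith(("http://", "https://")) and absolute not in urls:
            urls.append(absolute)
    return urls

sample_html = (
    '<a href="/news">News</a> '
    "<A class=\"x\" HREF='http://example.org/a'>A</A> "
    '<a href="mailto:x@y.z">mail</a> '
    '<a href="/news">duplicate</a>'
)
found = extract_urls(sample_html, "http://example.com/index.html")
print(found)   # ['http://example.com/news', 'http://example.org/a']
```

The relative link is resolved against the page address, the `mailto:` link is discarded, and the duplicate link is reported only once.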
The depth-first search functional module (4): when the method starts, none of the URLs on the web page have been visited. The first URL extracted from the web page by the regular expression is taken as the initial starting point S and marked as visited. The link addresses W of the web pages connected to S are then searched in turn from S. If W has not been visited, depth-first traversal continues with W as the new starting point until the user's requirement is met. If unvisited vertices remain at that point, another unvisited vertex is selected as a new source and the above process is repeated, until all vertices in the web page have been visited, subject to the user's requirement.
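The depth-first traversal described above can be illustrated with the following sketch. The function `dfs_crawl`, the `limit` parameter (a stand-in for "the user's requirement") and the in-memory link graph are hypothetical; a real crawler would fetch each page and extract its links instead of looking them up in a dictionary.

```python
def dfs_crawl(start, get_links, all_urls=(), limit=None):
    """Depth-first traversal from `start`, restarting from unvisited URLs.

    `limit` is an assumed stand-in for the user's requirement.
    """
    visited = []      # URLs in the order they were visited
    seen = set()      # URLs already marked as visited

    def visit(url):
        if limit is not None and len(visited) >= limit:
            return    # requirement met: stop descending
        seen.add(url)
        visited.append(url)
        for w in get_links(url):
            if w not in seen:
                visit(w)          # continue depth-first from W

    visit(start)
    # If unvisited vertices remain, pick each one as a new source in turn.
    for url in all_urls:
        if url not in seen:
            visit(url)
    return visited

# Hypothetical link graph: S links to A and B, and so on.
GRAPH = {"S": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["S"], "X": ["A"]}
order = dfs_crawl("S", lambda u: GRAPH.get(u, []), all_urls=GRAPH)
print(order)   # ['S', 'A', 'C', 'B', 'X']
```

Recursion keeps the sketch close to the description. For deep link chains, an explicit stack would avoid Python's recursion limit.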
The breadth-first search functional module (5): when the method starts, none of the URLs on the web page have been visited. The crawler starts from the URL of the initial page p0, searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the above process until the client's requirement is met.
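A queue-based sketch of the breadth-first procedure described above follows. The names `bfs_crawl` and `done` are assumptions, as is first-in, first-out ordering for "a certain order"; the link graph is an in-memory stand-in for fetching pages and extracting their URLs.

```python
from collections import deque

def bfs_crawl(p0, get_links, done=lambda visited: False):
    """Breadth-first traversal from page p0 using a URL queue.

    `done` is an assumed predicate standing in for the client's requirement.
    """
    queue = deque([p0])
    seen = {p0}
    visited = []
    while queue and not done(visited):
        url = queue.popleft()        # FIFO order yields breadth-first traversal
        visited.append(url)
        for link in get_links(url):  # all URLs extracted from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph rooted at the initial page p0.
PAGES = {"p0": ["p1", "p2"], "p1": ["p3"], "p2": ["p3", "p0"], "p3": []}
bfs_order = bfs_crawl("p0", lambda u: PAGES.get(u, []))
print(bfs_order)   # ['p0', 'p1', 'p2', 'p3']
```

Passing a predicate such as `lambda visited: len(visited) >= 2` as `done` stops the crawl early, which mimics a requirement being satisfied.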
In some embodiments, breadth-first search (BFS) is used: starting from a certain link in the web page, all links in the linked page are visited. Once they have been visited, the next layer is accessed by a recursive algorithm.
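One possible reading of this layer-by-layer recursion is sketched below: each call visits every link of the current layer, collects the links for the next layer, and then recurses. The function `crawl_layers`, the `max_depth` cut-off and the link graph are assumptions for illustration, not details taken from the patent.

```python
def crawl_layers(layer, get_links, seen=None, max_depth=3):
    """Return the visited URLs grouped by layer, recursing one layer at a time."""
    if seen is None:
        seen = set(layer)
    if not layer or max_depth == 0:
        return []                    # nothing left to visit, or depth limit hit
    next_layer = []
    for url in layer:                # visit every link in the current layer
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                next_layer.append(link)
    # Recurse to access the next layer once this one is finished.
    return [layer] + crawl_layers(next_layer, get_links, seen, max_depth - 1)

# Hypothetical link graph rooted at p0.
SITE = {"p0": ["p1", "p2"], "p1": ["p3"], "p2": ["p3", "p0"], "p3": []}
layers = crawl_layers(["p0"], lambda u: SITE.get(u, []))
print(layers)   # [['p0'], ['p1', 'p2'], ['p3']]
```

Grouping by layer makes the crawl depth explicit, which is convenient when the requirement is expressed as "follow links at most N levels deep".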
The above embodiments are provided to enable those of ordinary skill in the art to implement or use the present invention. Those of ordinary skill in the art may make various modifications or variations to the above embodiments without departing from the inventive concept of the present invention. The scope of protection of the present invention is therefore not limited by the above embodiments, but should be the broadest scope consistent with the inventive features recited in the claims.

Claims (6)

1. A method for implementing a search-engine-based web crawler, characterized in that it comprises:
a socket functional module,
an HTTP functional module,
a regular expression functional module,
a depth-first search functional module,
a breadth-first search functional module.
2. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
the socket functional module is the foundation on which the web crawler relies and is part of the structure of the knowledge management system, and the client establishes a connection with the server through a socket.
3. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the HTTP functional module, the client must define a set of URLs to determine the addresses to be browsed; after the client and the server are connected, the client sends a request; upon receiving the request, the server returns a corresponding response message, so that the code of the web page can be extracted.
4. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the regular expression functional module, a regular expression describes a string-matching pattern and can be used to check whether a string contains a certain substring, to replace the matched substring, or to extract from a string the substrings that satisfy a certain condition. After the client obtains the code of the web page from the server, the regular expression serves as a template: the URL character pattern is matched against the searched string, and the corresponding URLs are then extracted.
5. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the depth-first search functional module, none of the URLs on the web page have been visited when the method starts. The first URL extracted from the web page by the regular expression is taken as the initial starting point S and marked as visited. The link addresses W of the web pages connected to S are then searched in turn from S. If W has not been visited, depth-first traversal continues with W as the new starting point until the user's requirement is met. If unvisited vertices remain at that point, another unvisited vertex is selected as a new source and the above process is repeated, until all vertices in the web page have been visited, subject to the user's requirement.
6. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the breadth-first search functional module, none of the URLs on the web page have been visited when the method starts. The crawler starts from the URL of the initial page p0, searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the above process until the client's requirement is met.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210211633.8A CN103514189A (en) 2012-06-25 2012-06-25 Implementing method for web crawler based on search engines

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210211633.8A CN103514189A (en) 2012-06-25 2012-06-25 Implementing method for web crawler based on search engines

Publications (1)

Publication Number Publication Date
CN103514189A true CN103514189A (en) 2014-01-15

Family

ID=49896925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210211633.8A Pending CN103514189A (en) 2012-06-25 2012-06-25 Implementing method for web crawler based on search engines

Country Status (1)

Country Link
CN (1) CN103514189A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050281A (en) * 2014-06-26 2014-09-17 北京思特奇信息技术股份有限公司 Webpage information extraction method and device based on http protocol
CN104142985B (en) * 2014-07-23 2018-02-06 哈尔滨工业大学(威海) A kind of semi-automatic vertical reptile Core Generator and method
CN104142985A (en) * 2014-07-23 2014-11-12 哈尔滨工业大学(威海) Semi-automatic vertical crawler generation tool and method
CN105302876A (en) * 2015-09-28 2016-02-03 孙燕群 Regular expression based URL filtering method
CN106959976B (en) * 2016-01-12 2020-08-14 腾讯科技(深圳)有限公司 Search processing method and device
US10713302B2 (en) 2016-01-12 2020-07-14 Tencent Technology (Shenzhen) Company Limited Search processing method and device
CN106959976A (en) * 2016-01-12 2017-07-18 腾讯科技(深圳)有限公司 A kind of search processing method and device
CN105740363A (en) * 2016-01-26 2016-07-06 上海晶赞科技发展有限公司 Website target page discovery method and apparatus
CN106055722A (en) * 2016-07-26 2016-10-26 重庆兆光科技股份有限公司 Web crawler capturing method and system
CN106453689B (en) * 2016-11-11 2019-05-24 四川长虹电器股份有限公司 The method extracted and verify URL
CN106453689A (en) * 2016-11-11 2017-02-22 四川长虹电器股份有限公司 Method for extracting and verifying URL (Uniform Resource Locator)
CN108228623A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 A kind of data processing method and client device
CN108228623B (en) * 2016-12-14 2021-12-24 北京国双科技有限公司 Data processing method and client device
CN107423084A (en) * 2017-04-24 2017-12-01 武汉斗鱼网络科技有限公司 Modification of program method and device
CN112445954A (en) * 2019-08-29 2021-03-05 杭州中软安人网络通信股份有限公司 Method and device for automatically extracting webpage

Similar Documents

Publication Publication Date Title
CN103514189A (en) Implementing method for web crawler based on search engines
CN103365924B (en) A kind of method of internet information search, device and terminal
CN102930059B (en) Method for designing focused crawler
CN102591992A (en) Webpage classification identifying system and method based on vertical search and focused crawler technology
CN102098229B (en) Method and device for optimizing and auditing uniform resource locator (URL) as well as network device
CN102163226B (en) Adjacent sorting repetition-reducing method based on Map-Reduce and segmentation
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN101520798A (en) Webpage classification technology based on vertical search and focused crawler
CN102355488A (en) Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN102662965A (en) Method and system of automatically discovering hot news theme on the internet
CN101676907A (en) Method and system of directionally acquiring Internet resources
CN104869009A (en) Website data statistics system and method
CN103530429B (en) Webpage content extracting method
CN103116635B (en) Field-oriented method and system for collecting invisible web resources
CN102982118B (en) Searching method and device based on favorites
CN104750704A (en) Webpage uniform resource locator (URL) classification and identification method and device
CN102710795A (en) Hotspot collecting method and device
CN103984749A (en) Focused crawler method based on link analysis
CN102567521B (en) Webpage data capturing and filtering method
CN100477593C (en) Method and device for selecting correlative discussion zone in network community
US9336316B2 (en) Image URL-based junk detection
CN104199893A (en) System and method for publishing omnimedia contents fast
Bojars et al. Weaving sioc into the web of linked data
KR101248186B1 (en) System for generating blog using each content in search result page and method thereof
Wang et al. Research on lda model algorithm of news-oriented web crawler

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140115