CN103514189A - Implementing method for web crawler based on search engines - Google Patents
Implementing method for web crawler based on search engines
- Publication number
- CN103514189A
- Authority
- CN
- China
- Prior art keywords
- url
- functional module
- search
- webpage
- regular expression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method for implementing a search-engine-based web crawler. The web crawler comprises a socket function module, an HTTP function module, a regular expression function module, a depth-first search function module and a breadth-first search function module. Traditional web crawler technology adopts a fixed search strategy and lacks adaptability. To overcome this defect, the method can meet various customer requirements, yielding a web crawler technology that allows data to be updated in real time.
Description
Technical field
The present invention relates to information search technology, and in particular to a method for implementing a search-engine-based web crawler.
Background art
With the rapid development of the Internet, the network has gradually become the main channel through which people obtain information. In the early days of the Internet, people obtained the information they needed mainly by browsing portal websites. As the Web grew sharply, however, finding the required information this way became more and more difficult. Today most people obtain useful information through search engines, so the development of search engine technology directly affects the speed and quality with which people obtain the information they need.
In 1994, WebCrawler, described as the world's first Web search tool, was released. Popular search engines at present include Baidu, Google, Yahoo, Infoseek, Inktomi, Teoma and Live Search. For reasons of trade secrecy, the technical details of the crawler systems used by these search engines are generally not disclosed, and the existing literature is limited to overview-level introductions. As network information resources grow exponentially and change dynamically, the information retrieval services provided by traditional search engines can no longer meet people's growing demand for personalized service, and they face a major challenge.
A traditional crawler starts from the URLs of one or several initial pages and obtains the URLs on those pages. While fetching web pages, it continually extracts new URLs from the current page and puts them into a queue, until a stop condition of the system is met. The workflow of a focused crawler is more complex: it filters out links that are irrelevant to the topic according to a web page analysis algorithm, keeps the useful links, and puts them into the queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. In addition, all pages fetched by the crawler are stored by the system, analyzed, filtered and indexed for later query and retrieval.
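The focused-crawler loop described above can be sketched as follows. This is only an illustrative sketch and not the patented implementation: the names `focused_crawl`, `get_links`, `relevance`, `threshold` and `max_pages` are hypothetical, the "web page analysis algorithm" is reduced to a caller-supplied scoring function, and the "search strategy" is assumed to be "highest score first".

```python
import heapq


def focused_crawl(seeds, get_links, relevance, threshold=0.5, max_pages=100):
    """Crawl from the seed URLs, keeping only links judged relevant.

    get_links(url) returns the URLs found on a page (hypothetical callback).
    relevance(url) returns a score in [0, 1] (hypothetical callback).
    """
    seeds = list(dict.fromkeys(seeds))          # drop duplicate seeds, keep order
    seen = set(seeds)
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)
    fetched = []

    # Stop condition assumed here: queue exhausted or page budget reached.
    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)
        fetched.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            score = relevance(link)
            if score >= threshold:              # filter out off-topic links
                heapq.heappush(frontier, (-score, link))
    return fetched
```

In practice `get_links` would fetch the page and extract its URLs, and the returned pages would be stored and indexed; both steps are left out here to keep the loop itself visible.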
Summary of the invention
To overcome the shortcoming that traditional web crawler technology uses a fixed search strategy and lacks adaptability, the method proposed by the invention can meet multiple customer requirements and realizes a crawler technology that updates data in real time.
The disclosed method consists of five modules: a socket functional module, an HTTP functional module, a regular expression functional module, a depth-first search functional module, and a breadth-first search functional module.
The socket functional module provides the foundation on which the web crawler relies and is part of the structure of the knowledge management system. The client establishes a connection with the server through a socket.
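A client-to-server socket connection of the kind this module relies on might look like the sketch below. It is an assumption-laden illustration: the helper names (`serve_once`, `send_and_receive`, `demo_round_trip`) are hypothetical, and a throwaway local server that upper-cases its input stands in for a real web server so the example runs without network access.

```python
import socket
import threading


def serve_once(listener):
    """Stand-in server: accept one client and reply with its message upper-cased."""
    conn, _ = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def send_and_receive(host, port, message):
    """Client side: connect to the server, send bytes, and return the reply."""
    with socket.create_connection((host, port), timeout=5) as client:
        client.sendall(message)
        return client.recv(1024)


def demo_round_trip():
    """Start the stand-in server on a free local port and talk to it once."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))             # port 0: let the OS pick a port
    listener.listen(1)
    host, port = listener.getsockname()
    worker = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    worker.start()
    try:
        return send_and_receive(host, port, b"hello")
    finally:
        worker.join(timeout=5)
        listener.close()
```

A crawler would use only the client half (`send_and_receive`) against a real host and port.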
The HTTP functional module: the client must define a set of URLs that determine the addresses to be browsed. After the client and the server are connected, the client sends a request to the server. On receiving the request, the server returns a corresponding response message, from which the code of the web page can be extracted.
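The request and response exchange can be sketched over a raw socket as follows. This is a minimal sketch under stated assumptions, not the patent's code: `build_request`, `parse_response` and `fetch` are hypothetical names, only plain `http://` URLs are handled (no HTTPS, redirects or compression), and an HTTP/1.0 request is used so that the server closes the connection when the body is complete.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Build the bytes of a minimal HTTP/1.0 GET request for the given URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [f"GET {path} HTTP/1.0", f"Host: {parts.netloc}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status code, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    status = int(status_line.split(" ", 2)[1])
    return status, body


def fetch(url, timeout=10):
    """Connect, send the request, and read until the server closes the socket."""
    parts = urlsplit(url)
    address = (parts.hostname, parts.port or 80)
    with socket.create_connection(address, timeout=timeout) as conn:
        conn.sendall(build_request(url))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))
```

The body returned by `fetch` is the page source on which the URL extraction step then operates.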
The regular expression functional module: a regular expression describes a pattern for string matching. It can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string the substrings that satisfy a certain condition. After the client has obtained the code of the web page from the server, the regular expression serves as a template: the URL character pattern is matched against the searched string, and the corresponding URLs are extracted.
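As an illustration of using a regular expression as a URL template, the sketch below pulls `href` values out of page source. The pattern `HREF_PATTERN` and the function `extract_urls` are hypothetical and deliberately simple: quoted `href` attributes only, with relative links resolved against an assumed base URL. A regular expression is not a full HTML parser, so unusual markup may be missed.

```python
import re
from urllib.parse import urldefrag, urljoin

# Matches href="..." or href='...' and captures the quoted value.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_urls(html, base_url):
    """Return the distinct http(s) URLs linked from the page, in page order."""
    urls = []
    for match in HREF_PATTERN.finditer(html):
        absolute = urljoin(base_url, match.group(1).strip())
        absolute = urldefrag(absolute).url      # drop any #fragment part
        if absolute.startswith(("http://", "https://")) and absolute not in urls:
            urls.append(absolute)
    return urls
```

Non-web schemes such as `mailto:` are skipped because a crawler cannot fetch them as pages.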
The depth-first search functional module: when the method starts, none of the URLs on the web page have been visited. The first URL extracted from the web page by the regular expression is taken as the initial starting point S and marked as visited. The link addresses W of the web pages connected to S are then searched in turn from S. If W has not been visited, depth-first traversal continues with W as the new starting point, until the user's requirement is met. If unvisited vertices still remain at that point, another unvisited vertex is selected as a new source and the process is repeated, until all vertices in the web page have been visited, subject to the user's requirement being met.
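The depth-first traversal above can be sketched on an abstract link graph. The names `depth_first_crawl`, `sources`, `get_links` and `max_pages` are hypothetical, and "the user's requirement" is assumed here to be a simple page budget; the recursion mirrors the description, so very deep link chains could hit Python's recursion limit.

```python
def depth_first_crawl(sources, get_links, max_pages=1000):
    """Visit pages depth-first, trying each source that is still unvisited.

    get_links(url) returns the URLs linked from a page (hypothetical callback).
    """
    order = []          # pages in the order they were visited
    seen = set()        # pages already marked as visited

    def visit(url):
        if len(order) >= max_pages:             # assumed stop condition
            return
        seen.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                visit(link)                     # go deeper before moving on

    for source in sources:                      # pick a new source if needed
        if source not in seen:
            visit(source)
    return order
```

Passing more than one entry in `sources` corresponds to selecting another unvisited vertex as a new source once the first traversal is exhausted.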
The breadth-first search functional module: when the method starts, none of the URLs on the web page have been visited. Starting from the URL of the initial page p0, the crawler searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the process until the customer's requirement is met.
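A queue-based breadth-first traversal of this kind can be sketched as follows. The names `breadth_first_crawl`, `get_links` and `max_pages` are hypothetical, the "certain order" is assumed to be first-in, first-out, and the customer's requirement is again reduced to a page budget.

```python
from collections import deque


def breadth_first_crawl(start, get_links, max_pages=1000):
    """Visit pages in FIFO order, starting from the initial page.

    get_links(url) returns the URLs linked from a page (hypothetical callback).
    """
    queue = deque([start])      # the URL queue
    seen = {start}              # URLs already queued or visited
    order = []

    while queue and len(order) < max_pages:
        url = queue.popleft()   # oldest URL first
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Marking a URL as seen when it is queued, rather than when it is visited, keeps the same URL from entering the queue twice.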
Brief description of the drawings
The accompanying drawing is provided for a further understanding of the invention. It shows an embodiment of the invention and, together with this description, serves to explain the principle of the invention. In the drawing:
Fig. 1 schematically shows a flow chart of the invention.
Detailed description of the embodiments
The technical solution of the invention is described in detail below with reference to the accompanying drawing.
In the embodiment shown in Fig. 1:
The disclosed method consists of five modules: a socket functional module (1), an HTTP functional module (2), a regular expression functional module (3), a depth-first search functional module (4), and a breadth-first search functional module (5).
The socket functional module (1) provides the foundation on which the web crawler relies and is part of the structure of the knowledge management system. The client establishes a connection with the server through a socket.
The HTTP functional module (2): the client must define a set of URLs that determine the addresses to be browsed. After the client and the server are connected, the client sends a request to the server. On receiving the request, the server returns a corresponding response message, from which the code of the web page can be extracted.
The regular expression functional module (3): a regular expression describes a pattern for string matching. It can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string the substrings that satisfy a certain condition. After the client has obtained the code of the web page from the server, the regular expression serves as a template: the URL character pattern is matched against the searched string, and the corresponding URLs are extracted.
The depth-first search functional module (4): when the method starts, none of the URLs on the web page have been visited. The first URL extracted from the web page by the regular expression is taken as the initial starting point S and marked as visited. The link addresses W of the web pages connected to S are then searched in turn from S. If W has not been visited, depth-first traversal continues with W as the new starting point, until the user's requirement is met. If unvisited vertices still remain at that point, another unvisited vertex is selected as a new source and the process is repeated, until all vertices in the web page have been visited, subject to the user's requirement being met.
The breadth-first search functional module (5): when the method starts, none of the URLs on the web page have been visited. Starting from the URL of the initial page p0, the crawler searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the process until the customer's requirement is met.
In some embodiments, breadth-first search (BFS) is used: starting from a link on a web page, all links in the linked page are visited, and once they have been visited, a recursive algorithm is used to visit the next layer.
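One way to read this layer-by-layer recursion is sketched below: each call handles one whole layer of links and then recurses on the layer beneath it. The function name `crawl_by_level` and the parameters `get_links`, `seen` and `max_depth` are hypothetical, and the depth limit is an assumed stop condition that the source does not specify.

```python
def crawl_by_level(level, get_links, seen=None, max_depth=3):
    """Return the visited URLs grouped by layer, one list per layer.

    level is the list of URLs in the current layer.
    max_depth is how many further layers may be followed below this one.
    """
    if seen is None:
        seen = set(level)
    if not level or max_depth < 0:
        return []

    next_level = []
    for url in level:                           # finish the whole layer first
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                next_level.append(link)

    # Only after the current layer is done does the recursion go one layer down.
    return [level] + crawl_by_level(next_level, get_links, seen, max_depth - 1)
```

Flattening the returned lists gives the same visiting order as a queue-based breadth-first traversal over the same links.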
The above embodiments are provided to enable those of ordinary skill in the art to implement or use the invention. Those of ordinary skill in the art may make various modifications or variations to the above embodiments without departing from the inventive concept. The scope of protection of the invention is therefore not limited by the above embodiments and should be the widest scope consistent with the inventive features recited in the claims.
Claims (6)
1. A method for implementing a search-engine-based web crawler, characterized by comprising:
a socket functional module,
an HTTP functional module,
a regular expression functional module,
a depth-first search functional module, and
a breadth-first search functional module.
2. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
the socket functional module provides the foundation on which the web crawler relies and is part of the structure of the knowledge management system, and the client establishes a connection with the server through a socket.
3. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the HTTP functional module, the client must define a set of URLs that determine the addresses to be browsed; after the client and the server are connected, the client sends a request to the server; on receiving the request, the server returns a corresponding response message, from which the code of the web page can be extracted.
4. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the regular expression functional module, a regular expression describes a pattern for string matching and can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string the substrings that satisfy a certain condition. After the client has obtained the code of the web page from the server, the regular expression serves as a template: the URL character pattern is matched against the searched string, and the corresponding URLs are extracted.
5. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the depth-first search functional module, none of the URLs on the web page have been visited when the method starts. The first URL extracted from the web page by the regular expression is taken as the initial starting point S and marked as visited. The link addresses W of the web pages connected to S are then searched in turn from S. If W has not been visited, depth-first traversal continues with W as the new starting point, until the user's requirement is met. If unvisited vertices still remain at that point, another unvisited vertex is selected as a new source and the process is repeated, until all vertices in the web page have been visited, subject to the user's requirement being met.
6. The method for implementing a search-engine-based web crawler as claimed in claim 1, characterized in that:
in the breadth-first search functional module, none of the URLs on the web page have been visited when the method starts. Starting from the URL of the initial page p0, the crawler searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the process until the customer's requirement is met.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210211633.8A CN103514189A (en) | 2012-06-25 | 2012-06-25 | Implementing method for web crawler based on search engines |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210211633.8A CN103514189A (en) | 2012-06-25 | 2012-06-25 | Implementing method for web crawler based on search engines |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103514189A (en) | 2014-01-15 |
Family
ID=49896925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210211633.8A Pending CN103514189A (en) | 2012-06-25 | 2012-06-25 | Implementing method for web crawler based on search engines |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103514189A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050281A (en) * | 2014-06-26 | 2014-09-17 | 北京思特奇信息技术股份有限公司 | Webpage information extraction method and device based on http protocol |
CN104142985A (en) * | 2014-07-23 | 2014-11-12 | 哈尔滨工业大学(威海) | Semi-automatic vertical crawler generation tool and method |
CN105302876A (en) * | 2015-09-28 | 2016-02-03 | 孙燕群 | Regular expression based URL filtering method |
CN105740363A (en) * | 2016-01-26 | 2016-07-06 | 上海晶赞科技发展有限公司 | Website target page discovery method and apparatus |
CN106055722A (en) * | 2016-07-26 | 2016-10-26 | 重庆兆光科技股份有限公司 | Web crawler capturing method and system |
CN106453689A (en) * | 2016-11-11 | 2017-02-22 | 四川长虹电器股份有限公司 | Method for extracting and verifying URL (Uniform Resource Locator) |
CN106959976A (en) * | 2016-01-12 | 2017-07-18 | 腾讯科技(深圳)有限公司 | A kind of search processing method and device |
CN107423084A (en) * | 2017-04-24 | 2017-12-01 | 武汉斗鱼网络科技有限公司 | Modification of program method and device |
CN108228623A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | A kind of data processing method and client device |
CN112445954A (en) * | 2019-08-29 | 2021-03-05 | 杭州中软安人网络通信股份有限公司 | Method and device for automatically extracting webpage |
- 2012-06-25 CN CN201210211633.8A patent/CN103514189A/en active Pending
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050281A (en) * | 2014-06-26 | 2014-09-17 | 北京思特奇信息技术股份有限公司 | Webpage information extraction method and device based on http protocol |
CN104142985B (en) * | 2014-07-23 | 2018-02-06 | 哈尔滨工业大学(威海) | A kind of semi-automatic vertical reptile Core Generator and method |
CN104142985A (en) * | 2014-07-23 | 2014-11-12 | 哈尔滨工业大学(威海) | Semi-automatic vertical crawler generation tool and method |
CN105302876A (en) * | 2015-09-28 | 2016-02-03 | 孙燕群 | Regular expression based URL filtering method |
CN106959976B (en) * | 2016-01-12 | 2020-08-14 | 腾讯科技(深圳)有限公司 | Search processing method and device |
US10713302B2 (en) | 2016-01-12 | 2020-07-14 | Tencent Technology (Shenzhen) Company Limited | Search processing method and device |
CN106959976A (en) * | 2016-01-12 | 2017-07-18 | 腾讯科技(深圳)有限公司 | A kind of search processing method and device |
CN105740363A (en) * | 2016-01-26 | 2016-07-06 | 上海晶赞科技发展有限公司 | Website target page discovery method and apparatus |
CN106055722A (en) * | 2016-07-26 | 2016-10-26 | 重庆兆光科技股份有限公司 | Web crawler capturing method and system |
CN106453689B (en) * | 2016-11-11 | 2019-05-24 | 四川长虹电器股份有限公司 | The method extracted and verify URL |
CN106453689A (en) * | 2016-11-11 | 2017-02-22 | 四川长虹电器股份有限公司 | Method for extracting and verifying URL (Uniform Resource Locator) |
CN108228623A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | A kind of data processing method and client device |
CN108228623B (en) * | 2016-12-14 | 2021-12-24 | 北京国双科技有限公司 | Data processing method and client device |
CN107423084A (en) * | 2017-04-24 | 2017-12-01 | 武汉斗鱼网络科技有限公司 | Modification of program method and device |
CN112445954A (en) * | 2019-08-29 | 2021-03-05 | 杭州中软安人网络通信股份有限公司 | Method and device for automatically extracting webpage |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103514189A (en) | Implementing method for web crawler based on search engines | |
CN103365924B (en) | A kind of method of internet information search, device and terminal | |
CN102930059B (en) | Method for designing focused crawler | |
CN102591992A (en) | Webpage classification identifying system and method based on vertical search and focused crawler technology | |
CN102098229B (en) | Method and device for optimizing and auditing uniform resource locator (URL) as well as network device | |
CN102163226B (en) | Adjacent sorting repetition-reducing method based on Map-Reduce and segmentation | |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
CN101520798A (en) | Webpage classification technology based on vertical search and focused crawler | |
CN102355488A (en) | Crawler seed obtaining method and equipment and crawler crawling method and equipment | |
CN102662965A (en) | Method and system of automatically discovering hot news theme on the internet | |
CN101676907A (en) | Method and system of directionally acquiring Internet resources | |
CN104869009A (en) | Website data statistics system and method | |
CN103530429B (en) | Webpage content extracting method | |
CN103116635B (en) | Field-oriented method and system for collecting invisible web resources | |
CN102982118B (en) | Searching method and device based on favorites | |
CN104750704A (en) | Webpage uniform resource locator (URL) classification and identification method and device | |
CN102710795A (en) | Hotspot collecting method and device | |
CN103984749A (en) | Focused crawler method based on link analysis | |
CN102567521B (en) | Webpage data capturing and filtering method | |
CN100477593C (en) | Method and device for selecting correlative discussion zone in network community | |
US9336316B2 (en) | Image URL-based junk detection | |
CN104199893A (en) | System and method for publishing omnimedia contents fast | |
Bojars et al. | Weaving sioc into the web of linked data | |
KR101248186B1 (en) | System for generating blog using each content in search result page and method thereof | |
Wang et al. | Research on lda model algorithm of news-oriented web crawler |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140115 |