CN103514189A

CN103514189A - Implementing method for web crawler based on search engines

Info

Publication number: CN103514189A
Application number: CN201210211633.8A
Authority: CN
Inventors: 蒋志勇
Original assignee: SHANGHAI BETOP INFORMATION TECHNOLOGY Co Ltd
Current assignee: SHANGHAI BETOP INFORMATION TECHNOLOGY Co Ltd
Priority date: 2012-06-25
Filing date: 2012-06-25
Publication date: 2014-01-15

Abstract

The invention provides an implementing method for a web crawler based on search engines. The web crawler comprises a socket function module, an http function module, a regular expression function module, a depth-first search function module and a breadth-first search function module. In order to overcome the defects that a fixed search strategy is adopted in a traditional web crawler technology and is lacking in adaptability, the implementing method for the web crawler based on the search engines can meet various requirements of customers, and therefore the web crawler technology which enables data to be updated in real time is achieved.

Description

An implementation method for a web crawler based on search engines

Technical field

The present invention relates to information search technology, and in particular to an implementation method for a web crawler based on search engines.

Background art

With the development of the Internet, the network has gradually replaced other channels through which people obtain information. In the early days of the Internet, people mainly obtained the information they needed by browsing portal websites; however, with the rapid growth of the Web, finding the needed information in this way has become increasingly difficult. At present, most people obtain useful information through search engines, so the development of search engine technology directly affects the speed and quality with which people obtain the information they need.

In 1994, WebCrawler, the world's first Web search tool, came out; the more popular search engines at present include Baidu, Google, Yahoo, Infoseek, Inktomi, Teoma, and Live Search. For reasons of trade secrecy, the technical details of the crawler systems used by these search engines are generally not disclosed, and the existing literature is limited to summary introductions. As network information resources grow exponentially and change dynamically, the information retrieval services provided by traditional search engines can no longer meet people's growing demand for personalized services and face a huge challenge.

A traditional crawler starts from the URLs of one or several initial pages, obtains the URLs on those pages, and, while fetching pages, continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it must filter out links irrelevant to the topic according to some web page analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be fetched. It then selects the next page URL to fetch from the queue according to some search strategy and repeats the above process until a stop condition of the system is reached. In addition, all pages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later query and retrieval.
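As a non-limiting illustration of the focused crawler workflow described above, the following Python sketch keeps a priority queue of URLs and discards links whose relevance falls below a threshold. The names `focused_crawl`, `get_links`, `score`, and `threshold` are hypothetical and do not appear in the original text; the relevance function merely stands in for whatever web page analysis algorithm a real system would use.

```python
import heapq
from typing import Callable, Iterable, List


def focused_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,
    max_pages: int = 100,
) -> List[str]:
    """Visit URLs in order of decreasing relevance, skipping irrelevant links.

    `get_links` and `score` are assumptions supplied by the caller: the first
    returns the links found on a page, the second rates topic relevance.
    """
    visited = set()
    heap = []
    counter = 0  # tie-breaker so equal scores are served in insertion order
    for url in seeds:
        if url not in visited:
            visited.add(url)
            # heapq is a min-heap, so the score is negated to pop the best first
            heapq.heappush(heap, (-score(url), counter, url))
            counter += 1

    order: List[str] = []
    while heap and len(order) < max_pages:  # stop condition of the system
        _, _, url = heapq.heappop(heap)
        order.append(url)
        for link in get_links(url):
            if link in visited:
                continue
            visited.add(link)
            relevance = score(link)
            if relevance >= threshold:  # filter out links unrelated to the topic
                heapq.heappush(heap, (-relevance, counter, link))
                counter += 1
    return order
```

The search strategy here is "most relevant first"; other strategies only change how the queue is ordered.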

Summary of the invention

To overcome the shortcoming that traditional web crawler technology adopts a fixed search strategy and lacks adaptability, the method proposed by the present invention can meet the various demands of clients and realizes a crawler technology that updates data in real time.

The method disclosed by the invention consists of five modules: a socket function module, an http function module, a regular expression function module, a depth-first search function module, and a breadth-first search function module.

The socket function module is the background knowledge on which the web crawler relies and resides in the structure of a knowledge management system; the client establishes a connection with the server through a socket;
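As a non-limiting illustration, the connection step of the socket function module could be sketched in Python as follows. The function name `open_connection` and its parameters are hypothetical; the original text does not specify an interface.

```python
import socket


def open_connection(host: str, port: int = 80, timeout: float = 10.0) -> socket.socket:
    """Open a TCP connection from the crawler (client) to a web server.

    Port 80 and the 10-second timeout are assumed defaults for plain HTTP.
    """
    return socket.create_connection((host, port), timeout=timeout)
```

The caller is responsible for closing the returned socket, for example by using it in a `with` statement.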

In the http function module, the client must define a set of URLs to determine the addresses to be browsed. After the client and the server are connected, if the client sends a request, the server, upon receiving it, returns a corresponding response message, so that the code of the web page can be extracted;
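As a non-limiting illustration, the request and response exchange of the http function module could be sketched in Python over a plain socket. The helper names `build_get_request`, `split_response`, and `fetch` are hypothetical. The sketch assumes an ASCII `http://` URL, uses HTTP/1.0 so that the body is not chunk-encoded, and handles neither TLS nor redirects.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit


def build_get_request(url: str) -> bytes:
    """Build a minimal HTTP/1.0 GET request for an ASCII http:// URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {parts.netloc}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw: bytes) -> Tuple[int, bytes]:
    """Split a raw HTTP response into (status code, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0]
    return int(status_line.split()[1]), body


def fetch(url: str, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Send the request to the server and read the response until it closes."""
    parts = urlsplit(url)
    address = (parts.hostname, parts.port or 80)
    with socket.create_connection(address, timeout=timeout) as sock:
        sock.sendall(build_get_request(url))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return split_response(b"".join(chunks))
```

In practice a library such as `urllib.request` would handle encodings, redirects, and HTTPS; the socket-level version is shown only because the method names a socket module and an http module separately.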

In the regular expression function module, a regular expression describes a string-matching pattern and can be used to check whether a string contains a certain substring, to replace a matched substring, or to take out of a string the substrings that meet a certain condition. After the client obtains the code of the web page from the server, the regular expression serves as a template that matches URL character patterns against the searched string, and the corresponding URLs are then extracted;
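As a non-limiting illustration, URL extraction with a regular expression could be sketched in Python as follows. The pattern `HREF_PATTERN` and the function `extract_urls` are hypothetical; the original text does not give the expression it uses, and a regular expression of this kind only approximates HTML parsing.

```python
import re
from typing import List
from urllib.parse import urljoin

# Assumed pattern: the quoted value of an href attribute, up to any #fragment.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def extract_urls(html: str, base_url: str) -> List[str]:
    """Return the distinct absolute http(s) URLs linked from a page, in order."""
    seen = set()
    urls: List[str] = []
    for match in HREF_PATTERN.finditer(html):
        # Resolve relative links such as "/a" against the page's own URL.
        absolute = urljoin(base_url, match.group(1).strip())
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

Links with other schemes, such as `mailto:`, are dropped because the crawler cannot fetch them over HTTP.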

In the depth-first search function module, none of the URLs on the web page have been visited when the method starts. The first URL extracted from the web page by the regular expression is the initial starting point S and is marked as visited; the link addresses W of the pages connected to S are then searched from S in turn. If W has not been visited, depth-first traversal continues with W as the new starting point until the user's demand is met. If unvisited vertices still remain at this point, another unvisited vertex is selected as the new source point and the above process is repeated, until all vertices in the web page have been visited, subject to meeting the user's demand.
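As a non-limiting illustration, the depth-first traversal described above could be sketched in Python as follows. The names `crawl_depth_first`, `seeds`, `get_links`, and `max_pages` are hypothetical; `max_pages` stands in for "the user's demand", and an explicit stack is used in place of recursion to avoid Python's recursion limit.

```python
from typing import Callable, Iterable, List


def crawl_depth_first(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> List[str]:
    """Visit URLs depth-first, trying each seed in turn as a new source point."""
    visited = set()
    order: List[str] = []
    for seed in seeds:  # "select another unvisited vertex as the new source"
        stack = [seed]
        while stack and len(order) < max_pages:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)  # mark as visited
            order.append(url)
            links = list(get_links(url))
            # Push in reverse so the first link on the page is explored first.
            for link in reversed(links):
                if link not in visited:
                    stack.append(link)
    return order
```

Here `get_links` would typically fetch a page and extract its URLs; passing it in as a parameter keeps the traversal independent of the network code.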

In the breadth-first search function module, none of the URLs on the web page have been visited when the method starts. Starting from the URL of the initial page p0, the crawler searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the above process until the client's requirement is met.
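As a non-limiting illustration, the queue-based breadth-first traversal described above could be sketched in Python as follows. The names `crawl_breadth_first`, `get_links`, and `max_pages` are hypothetical; first-in, first-out order is assumed for "a certain order", and `max_pages` stands in for the client's requirement.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_breadth_first(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> List[str]:
    """Visit URLs breadth-first starting from the initial page p0."""
    visited = {start_url}
    queue = deque([start_url])
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # take the oldest URL from the queue
        order.append(url)
        for link in get_links(url):  # all URLs extracted from this page
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

Marking a URL as visited when it is enqueued, rather than when it is dequeued, keeps the same URL from entering the queue twice.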

Brief description of the drawings

The accompanying drawing is mainly intended to provide a further understanding of the present invention. The drawing shows an embodiment of the invention and, together with this specification, serves to explain the principle of the invention. In the drawing:

Fig. 1 schematically shows a flow chart of the present invention.

Embodiment

The technical solution of the present invention is described in detail below with reference to the accompanying drawing.

In the embodiment shown in Fig. 1, the method disclosed by the invention consists of five modules: a socket function module (1), an http function module (2), a regular expression function module (3), a depth-first search function module (4), and a breadth-first search function module (5).

The socket function module (1) is the background knowledge on which the web crawler relies and resides in the structure of a knowledge management system; the client establishes a connection with the server through a socket;

In the http function module (2), the client must define a set of URLs to determine the addresses to be browsed. After the client and the server are connected, if the client sends a request, the server, upon receiving it, returns a corresponding response message, so that the code of the web page can be extracted;

In the regular expression function module (3), a regular expression describes a string-matching pattern and can be used to check whether a string contains a certain substring, to replace a matched substring, or to take out of a string the substrings that meet a certain condition. After the client obtains the code of the web page from the server, the regular expression serves as a template that matches URL character patterns against the searched string, and the corresponding URLs are then extracted;

In the depth-first search function module (4), none of the URLs on the web page have been visited when the method starts. The first URL extracted from the web page by the regular expression is the initial starting point S and is marked as visited; the link addresses W of the pages connected to S are then searched from S in turn. If W has not been visited, depth-first traversal continues with W as the new starting point until the user's demand is met. If unvisited vertices still remain at this point, another unvisited vertex is selected as the new source point and the above process is repeated, until all vertices in the web page have been visited, subject to meeting the user's demand.

In the breadth-first search function module (5), none of the URLs on the web page have been visited when the method starts. Starting from the URL of the initial page p0, the crawler searches page p0 with the regular expression, extracts all URLs in the page, and adds them to the URL queue. The crawler then takes URLs from the queue in a certain order and repeats the above process until the client's requirement is met.

In some embodiments, breadth-first search is adopted: starting from a certain link on the web page, all links in the linked page are visited; once they have been visited, access to the next layer is carried out through a recursive algorithm.
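As a non-limiting illustration, the layer-by-layer variant described above could be sketched in Python with one recursive call per layer. The names `crawl_by_level`, `frontier`, `get_links`, and `max_depth` are hypothetical; the depth limit is an assumed stop condition that the original text does not specify.

```python
from typing import Callable, Iterable, List, Optional, Set


def crawl_by_level(
    frontier: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
    visited: Optional[Set[str]] = None,
) -> List[List[str]]:
    """Return the visited URLs grouped by layer, starting from `frontier`."""
    if visited is None:
        visited = set()
    # Keep only URLs not seen in earlier layers, dropping duplicates in order.
    current = [url for url in dict.fromkeys(frontier) if url not in visited]
    if not current or max_depth < 0:
        return []
    visited.update(current)
    next_frontier: List[str] = []
    for url in current:  # visit every link in the current layer first
        next_frontier.extend(get_links(url))
    # Then recurse to handle the next layer.
    return [current] + crawl_by_level(next_frontier, get_links, max_depth - 1, visited)
```

The recursion depth equals the number of layers rather than the number of pages, so it stays small for typical depth limits.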

The above embodiments are provided so that those of ordinary skill in the art can implement or use the present invention. Those of ordinary skill in the art may make various modifications or changes to the above embodiments without departing from the inventive idea of the present invention; the scope of protection of the present invention is therefore not limited by the above embodiments, and should be the widest scope consistent with the inventive features mentioned in the claims.

Claims

1. An implementation method for a web crawler based on search engines, characterized by comprising:

a socket function module,

an http function module,

a regular expression function module,

a depth-first search function module, and

a breadth-first search function module.

2. The implementation method for a web crawler based on search engines according to claim 1, characterized in that:

the socket function module is the background knowledge on which the web crawler relies and resides in the structure of a knowledge management system, and the client establishes a connection with the server through a socket.

3. The implementation method for a web crawler based on search engines according to claim 1, characterized in that:

in the http function module, the client must define a set of URLs to determine the addresses to be browsed; after the client and the server are connected, if the client sends a request, the server, upon receiving it, returns a corresponding response message, so that the code of the web page can be extracted.

4. The implementation method for a web crawler based on search engines according to claim 1, characterized in that:

in the regular expression function module, a regular expression describes a string-matching pattern and can be used to check whether a string contains a certain substring, to replace a matched substring, or to take out of a string the substrings that meet a certain condition; after the client obtains the code of the web page from the server, the regular expression serves as a template that matches URL character patterns against the searched string, and the corresponding URLs are then extracted.

5. The implementation method for a web crawler based on search engines according to claim 1, characterized in that:

6. The implementation method for a web crawler based on search engines according to claim 1, characterized in that: