CN109299392A - Method for optimizing data crawling by a web crawler

Info

Publication number: CN109299392A
Application number: CN201811391944.0A
Authority: CN
Inventors: 胡翔
Original assignee: Anhui Cloud Finance Information Technology Co Ltd
Current assignee: Anhui Cloud Finance Information Technology Co Ltd
Priority date: 2018-11-21
Filing date: 2018-11-21
Publication date: 2019-02-01

Abstract

The invention discloses a method for optimizing data crawling by a web crawler and relates to the field of network data transmission. The method comprises the following steps: step S01: establishing a high-quality template resource library and a template matching algorithm; step S02: screening URLs that meet the target data requirements by means of a depth-first traversal strategy and a backlink count strategy, and establishing an initial URL queue; step S03: crawling the corresponding target website data for the URLs in the initial URL queue by means of a breadth-first strategy; step S04: matching the target data against the template data in the template resource library according to the template matching algorithm; step S05: storing the successfully matched template data in the resource library. By establishing the initial URL queue with the depth-first traversal strategy and the backlink count strategy, and by matching the templates in the template library against the crawled data with the template matching algorithm, the invention improves the accuracy and efficiency of data crawling by a web crawler and avoids redundant invalid data and wasted storage resources.

Description

Method for optimizing data crawling by a web crawler

Technical field

The invention belongs to the field of network data transmission, and in particular relates to a method for optimizing data crawling by a web crawler.

Background art

With the development of the Internet, the amount of information it contains keeps growing, and search engines help people find content of interest among this mass of information. General-purpose search engines such as Baidu, Google and Bing provide Internet content search services to all users. These search engines must continuously obtain information from the Internet by means of crawler technology and store it so that people can retrieve it easily. Because the volume of data to be crawled is enormous, large search engines usually adopt a distributed processing mechanism, that is, they build a distributed crawler service system. The crawlers obtain target URLs from a unified download queue and then download the target web pages and store them at a specified location.

As search engines multiply and their crawlers continuously fetch website content, the load on website systems keeps increasing. On June 30, 1994, after joint discussion between search engine staff and the owners of websites crawled by search engines, an industry convention, the Robots protocol, was formally released. The full name of the Robots protocol (also called the crawler protocol or robot protocol) is the "Robots Exclusion Protocol". Through the Robots protocol, a website tells search engines which pages may be crawled and which pages may not.
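As an illustration of how a crawler can honor the Robots protocol, the following minimal sketch uses Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent and the example.com URLs are assumptions made for this sketch and do not come from the invention.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
# A page outside the disallowed prefix may be crawled.
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
# A page under /private/ may not be crawled.
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
```

A crawler would typically run such a check before adding a URL to its download queue.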

Current web crawlers therefore have the following disadvantages:

1. A large number of duplicate URLs are parsed out and fed back to the control center, which puts great pressure on the control center's URL deduplication module; returning these duplicate URLs also consumes a great deal of unnecessary network traffic and reduces the efficiency of the crawler system;

2. After a web page is crawled, it is returned to the data warehouse and only then analyzed during page parsing; many of the pages handled in this way have not been updated, which wastes a great deal of the crawler system's time and traffic.

Summary of the invention

The purpose of the invention is to provide a method for optimizing data crawling by a web crawler. The method establishes an initial URL queue by means of a depth-first traversal strategy and a backlink count strategy, and matches the templates in a template library against the crawled data by means of a template matching algorithm, thereby solving the problems that existing web crawlers crawl data with insufficient accuracy and efficiency and that the crawler system works inefficiently.

In order to solve the above technical problems, the present invention is achieved by the following technical solutions:

The invention is a method for optimizing data crawling by a web crawler, comprising the following steps:

Step S01: establishing a high-quality template resource library and a template matching algorithm;

Step S02: screening URLs that meet the target data requirements by means of a depth-first traversal strategy and a backlink count strategy, and establishing an initial URL queue;

Step S03: crawling the corresponding target website data for the URLs in the initial URL queue by means of a breadth-first strategy;

Step S04: matching the target data against the template data in the template resource library according to the template matching algorithm;

Step S05: storing the successfully matched template data in the resource library.
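The way steps S01 to S05 fit together can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `PAGES` dictionary stands in for real websites, a template is modelled simply as a set of required field names, and `match_template` and `run_pipeline` are hypothetical helper names.

```python
from collections import deque

# Hypothetical in-memory stand-in for real web pages (URL -> extracted fields).
PAGES = {
    "https://example.com/a": {"title": "A", "price": "10"},
    "https://example.com/b": {"title": "B"},
}

# S01: a template is modelled here as the set of fields a record must contain.
TEMPLATES = {"product": {"title", "price"}}

def match_template(record, templates):
    """S04: return the name of the first template whose fields are all present."""
    for name, fields in templates.items():
        if fields.issubset(record):
            return name
    return None

def run_pipeline(initial_queue, pages, templates):
    repository = []                  # S05 target, standing in for the resource library
    queue = deque(initial_queue)     # S02 output: the initial URL queue
    while queue:
        url = queue.popleft()        # S03: first in, first out (breadth-first order)
        record = pages.get(url)
        if record is None:
            continue                 # nothing could be fetched for this URL
        name = match_template(record, templates)
        if name is not None:
            repository.append((url, name, record))   # S05: keep matched data only
    return repository

stored = run_pipeline(
    ["https://example.com/a", "https://example.com/b"], PAGES, TEMPLATES
)
```

In this sketch only the first page is stored, because the second page lacks the `price` field required by the hypothetical `product` template.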

Preferably, in step S02, the depth-first traversal strategy extracts the links in a downloaded web page and inserts them at the end of the queue of URLs to be crawled.

Preferably, in step S02, the backlink count strategy obtains the number of links from other web pages that point to a downloaded web page and arranges the queue of pages to be crawled in descending order of that number.

Preferably, the information extracted from the downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the name of the JavaScript function used for lazy loading of dynamic pages, and the page analysis type.

Preferably, in step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the resource library and remove duplicates; a whitelist module is built into the filter, and the whitelist stores uncertain URLs.

Preferably, in step S05, the resource library uses HDFS distributed storage.

The invention has the following advantages:

The invention establishes an initial URL queue by means of a depth-first traversal strategy and a backlink count strategy, and matches the templates in the template library against the crawled data by means of a template matching algorithm, which improves the accuracy and efficiency of data crawling by a web crawler and avoids redundant invalid data and wasted storage resources.

Of course, a product implementing the invention need not achieve all of the above advantages at the same time.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a block diagram of the method of the invention for optimizing data crawling by a web crawler.

Specific embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the protection scope of the invention.

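Referring to Fig. 1, the invention is a method for optimizing data crawling by a web crawler, comprising the following steps:

Step S01: establishing a high-quality template resource library and a template matching algorithm;

Step S02: screening URLs that meet the target data requirements by means of a depth-first traversal strategy and a backlink count strategy, and establishing an initial URL queue;

Step S03: crawling the corresponding target website data for the URLs in the initial URL queue by means of a breadth-first strategy;

Step S04: matching the target data against the template data in the template resource library according to the template matching algorithm;

Step S05: storing the successfully matched template data in the resource library.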

In step S02, the depth-first traversal strategy extracts the links in a downloaded web page and inserts them at the end of the queue of URLs to be crawled. Under the depth-first traversal strategy, the web crawler starts from a start page and follows one link after another; when it has finished one chain of links, it moves to the next start page and continues following links.
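One possible reading of this depth-first traversal is sketched below. It is only an illustration: the `LINKS` graph is a hypothetical stand-in for links extracted from downloaded pages, and `depth_first_urls` is an assumed helper name. Each visited URL is appended to the end of the resulting queue, while an explicit stack makes the crawler exhaust one chain of links before backtracking.

```python
# Hypothetical link graph: page URL -> links found on that page.
LINKS = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": [],
}

def depth_first_urls(start_pages, links):
    """Follow each chain of links to its end before moving to the next one."""
    queue = []     # URLs to be crawled, in the order they are visited
    seen = set()   # avoids visiting the same URL twice
    for start in start_pages:
        stack = [start]
        while stack:
            url = stack.pop()
            if url in seen:
                continue
            seen.add(url)
            queue.append(url)   # insert at the end of the queue
            # Push in reverse so the first link on the page is followed first.
            stack.extend(reversed(links.get(url, [])))
    return queue

order = depth_first_urls(["start"], LINKS)
# order == ["start", "a", "c", "b"]
```

Page "c" is reached before page "b" because the crawler follows the chain start, a, c to its end before returning to the second link of the start page.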

In step S02, the backlink count strategy obtains the number of links from other web pages that point to a downloaded web page and arranges the queue of pages to be crawled in descending order of that number. The backlink count indicates the degree to which the content of a web page is recommended by others; for this reason, the crawling systems of search engines often use this indicator to evaluate the importance of web pages and thereby decide the order in which different pages are crawled.
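Ordering by backlink count can be sketched as follows. The link graph and the function name `order_by_backlinks` are assumptions made for illustration; counting each linking page once, ignoring self-links and breaking ties alphabetically are design choices of this sketch, not requirements stated by the invention.

```python
from collections import Counter

def order_by_backlinks(links):
    """Sort pages by how many other pages link to them, highest first."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each linking page only once
            if target != source:      # ignore links from a page to itself
                counts[target] += 1
    pages = set(links) | set(counts)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(pages, key=lambda page: (-counts[page], page))

# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["c"], "b": ["c", "a"], "c": []}
ranked = order_by_backlinks(LINKS)
# ranked == ["c", "a", "b"]
```

Here page "c" has two backlinks, page "a" has one and page "b" has none, so "c" would be crawled first.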

The information extracted from a downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the name of the JavaScript function used for lazy loading of dynamic pages, and the page analysis type.
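The extracted information could be held in a simple record such as the one sketched below. The class name, the field names and all example values are hypothetical; the invention only lists which pieces of information are extracted.

```python
from dataclasses import dataclass, asdict

@dataclass
class PageRecord:
    """One record per downloaded page; field names are illustrative only."""
    target_url: str
    page_type: str
    storage_path: str          # where the downloaded page is stored
    lazy_load_function: str    # JavaScript function name used for lazy loading
    analysis_type: str

record = PageRecord(
    target_url="https://example.com/list",
    page_type="dynamic",
    storage_path="/data/pages/list.html",
    lazy_load_function="loadMore",
    analysis_type="list",
)
fields = asdict(record)   # plain dictionary, convenient for storage or matching
```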

In step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the resource library and remove duplicates. A whitelist module is built into the filter, and the whitelist stores uncertain URLs: when an uncertain URL appears, the system stores it in the whitelist, and the user can choose, according to actual needs, whether to use the URLs in the whitelist.
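A filter with a whitelist module might look like the minimal sketch below. The invention does not define what makes a URL "uncertain", so this sketch assumes that the matching step reports `None` when it cannot decide; the class `DedupFilter`, its method `check` and the returned labels are likewise hypothetical.

```python
class DedupFilter:
    """Deduplicating filter with a whitelist for undecided URLs (illustrative)."""

    def __init__(self, known_urls):
        self.known = set(known_urls)   # URLs already present in the resource library
        self.whitelist = []            # uncertain URLs, kept for a later user decision

    def check(self, url, matched):
        """Classify a crawled URL.

        matched is True or False when template matching gave a clear answer,
        and None when the result is uncertain (an assumption of this sketch).
        """
        if matched is None:
            if url not in self.whitelist:
                self.whitelist.append(url)
            return "uncertain"
        if not matched:
            return "rejected"
        if url in self.known:
            return "duplicate"
        self.known.add(url)
        return "accepted"

dedup = DedupFilter(["https://example.com/old"])
results = [
    dedup.check("https://example.com/old", True),     # already stored
    dedup.check("https://example.com/new", True),     # first time seen
    dedup.check("https://example.com/new", True),     # now a duplicate
    dedup.check("https://example.com/maybe", None),   # goes to the whitelist
]
```

The user could later review `dedup.whitelist` and decide which of the uncertain URLs to crawl or store.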

In step S05, the resource library uses HDFS distributed storage.

It should be noted that the units included in the above system embodiment are divided only according to functional logic; the division is not limited to the one described, provided the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the protection scope of the invention.

In addition, a person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware, and the corresponding program can be stored in a computer-readable storage medium.

The preferred embodiments of the invention disclosed above are intended only to help illustrate the invention. The preferred embodiments do not describe every detail, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of this specification. These embodiments were chosen and described in detail in order to better explain the principles and practical application of the invention, so that those skilled in the art can well understand and use it. The invention is limited only by the claims and their full scope and equivalents.

Claims

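1. A method for optimizing data crawling by a web crawler, characterized by comprising the following steps:

Step S01: establishing a high-quality template resource library and a template matching algorithm;

Step S02: screening URLs that meet the target data requirements by means of a depth-first traversal strategy and a backlink count strategy, and establishing an initial URL queue;

Step S03: crawling the corresponding target website data for the URLs in the initial URL queue by means of a breadth-first strategy;

Step S04: matching the target data against the template data in the template resource library according to the template matching algorithm;

Step S05: storing the successfully matched template data in the resource library.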

2. The method for optimizing data crawling by a web crawler according to claim 1, characterized in that, in step S02, the depth-first traversal strategy extracts the links in a downloaded web page and inserts them at the end of the queue of URLs to be crawled.

3. The method for optimizing data crawling by a web crawler according to claim 1, characterized in that, in step S02, the backlink count strategy obtains the number of links from other web pages that point to a downloaded web page and arranges the queue of pages to be crawled in descending order of that number.

4. The method for optimizing data crawling by a web crawler according to claim 2, characterized in that the information extracted from the downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the name of the JavaScript function used for lazy loading of dynamic pages, and the page analysis type.

5. The method for optimizing data crawling by a web crawler according to claim 1, characterized in that, in step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the resource library and remove duplicates; a whitelist module is built into the filter, and the whitelist stores uncertain URLs.

6. The method for optimizing data crawling by a web crawler according to claim 1, characterized in that, in step S05, the resource library uses HDFS distributed storage.