CN109299392A - Optimization method for web crawler data crawling - Google Patents

Optimization method for web crawler data crawling

Info

Publication number: CN109299392A
Authority: CN (China)
Prior art keywords: data, template, URL, page, web crawler
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201811391944.0A
Other languages: Chinese (zh)
Inventor: 胡翔
Current Assignee: Anhui Cloud Finance Information Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Anhui Cloud Finance Information Technology Co Ltd
Priority date: 2018-11-21 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2018-11-21
Publication date: 2019-02-01

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an optimization method for web crawler data crawling and relates to the field of network data transmission. The method comprises the following steps. Step S01: establish a high-quality template repository and a template matching algorithm. Step S02: filter the URLs that meet the target-data criteria using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue. Step S03: crawl the data of the corresponding target websites for the URLs in the primary URL queue using a breadth-first strategy. Step S04: match the target data against the template data in the template repository according to the template matching algorithm. Step S05: store the successfully matched template data in the repository. By building the primary URL queue with the depth-first traversal strategy and the backlink-count strategy, and by matching the templates in the template repository against the crawled data with the template matching algorithm, the invention improves the accuracy and efficiency of web crawler data crawling and avoids redundant invalid data and wasted storage resources.

Description

Optimization method for web crawler data crawling
Technical field
The invention belongs to the field of network data transmission, and in particular relates to an optimization method for web crawler data crawling.
Background
With the development of the Internet, the amount of information it contains keeps growing, and search engines help people find the content that interests them within this mass of information. General-purpose search engines such as Baidu, Google and Bing provide Internet content search services to all users. These search engines rely on crawler technology to fetch information from the Internet continuously and store it so that people can retrieve it easily. Because the volume of data to be crawled is huge, large search engines usually adopt a distributed processing mechanism, that is, they build a distributed crawler service system. These crawlers obtain target URLs from a unified download queue, then download each target web page and store it at a designated location.
As search engines multiply, their crawlers fetch website content continuously, and the load on website systems keeps increasing. On June 30, 1994, after joint discussion between search engine staff and the owners of the websites being crawled, an industry standard, the Robots protocol, was formally released. The Robots protocol (also called the crawler protocol or robot protocol), whose full name is the Robots Exclusion Protocol, lets a website tell search engines which pages may be crawled and which may not.
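As a minimal sketch of how a crawler can honor the Robots protocol described above, the following uses Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent name `ExampleBot`, and the example.com URLs are illustrative assumptions and do not come from the patent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# "ExampleBot" is an assumed crawler name, used only for illustration.
allowed = parser.can_fetch("ExampleBot", "https://example.com/news/1.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/a.html")
```

Checking `can_fetch` before a URL enters the download queue keeps disallowed pages out of the crawl entirely, rather than discarding them after download.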
At present, web crawlers have the following disadvantages:
1. A large number of duplicate URLs are parsed out and reported back to the control center, which puts heavy pressure on the control center's URL deduplication module. Returning so many duplicate URLs also consumes a great deal of unnecessary network traffic and lowers the efficiency of the crawler system.
2. Crawled web pages are returned to the data warehouse and only then analyzed during page parsing. Many of these pages have not been updated, so the crawler system wastes considerable time and traffic on them.
Summary of the invention
The purpose of the invention is to provide an optimization method for web crawler data crawling. The method builds a primary URL queue using a depth-first traversal strategy and a backlink-count strategy, and uses a template matching algorithm to match the templates in the template repository against the crawled data. It thereby addresses the insufficient accuracy and efficiency of existing web crawler data crawling and the low working efficiency of existing crawler systems.
To solve the above technical problems, the invention is realized through the following technical solution:
The invention is an optimization method for web crawler data crawling, comprising the following steps:
Step S01: establish a high-quality template repository and a template matching algorithm;
Step S02: filter the URLs that meet the target-data criteria using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue;
Step S03: crawl the data of the corresponding target websites for the URLs in the primary URL queue using a breadth-first strategy;
Step S04: match the target data against the template data in the template repository according to the template matching algorithm;
Step S05: store the successfully matched template data in the repository.
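Step S03 can be illustrated with a short breadth-first sketch. The patent does not give an implementation, so the in-memory `LINKS` graph (standing in for real page downloads), the `get_links` callback, and the `max_pages` limit are assumptions made only for this example.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches: page -> outgoing links.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}


def breadth_first_crawl(primary_queue, get_links, max_pages=100):
    """Visit pages level by level, starting from the primary URL queue."""
    frontier = deque(primary_queue)
    seen = set(primary_queue)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()          # FIFO: the oldest queued URL is crawled first
        order.append(url)
        for link in get_links(url):
            if link not in seen:          # skip URLs that are already queued or visited
                seen.add(link)
                frontier.append(link)
    return order


order = breadth_first_crawl(["a"], lambda url: LINKS.get(url, []))
```

Tracking `seen` at enqueue time, rather than at visit time, keeps a URL from entering the queue twice, which is relevant to the duplicate-URL problem raised in the background section.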
Preferably, in step S02, the depth-first traversal strategy extracts the links in a downloaded web page and inserts them at the end of the queue of URLs to be crawled.
Preferably, in step S02, the backlink-count strategy obtains the number of links that point to a downloaded web page from other web pages and orders the queue of pages to be crawled from the highest link count to the lowest.
Preferably, the information extracted from the downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the name of the JavaScript function used for lazy loading of dynamic pages, and the page analysis type.
Preferably, in step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the repository and remove duplicates. The filter contains a whitelist module, and the whitelist stores uncertain URL addresses.
Preferably, in step S05, the repository uses HDFS distributed storage.
The invention has the following beneficial effects:
The invention builds the primary URL queue using the depth-first traversal strategy and the backlink-count strategy and uses the template matching algorithm to match the templates in the template repository against the crawled data. This improves the accuracy and efficiency of web crawler data crawling and avoids redundant invalid data and wasted storage resources.
Of course, a product implementing the invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of the optimization method for web crawler data crawling according to the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.
Referring to Fig. 1, the invention is an optimization method for web crawler data crawling, comprising the following steps:
Step S01: establish a high-quality template repository and a template matching algorithm;
Step S02: filter the URLs that meet the target-data criteria using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue;
Step S03: crawl the data of the corresponding target websites for the URLs in the primary URL queue using a breadth-first strategy;
Step S04: match the target data against the template data in the template repository according to the template matching algorithm;
Step S05: store the successfully matched template data in the repository.
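The following sketch shows one way steps S01 and S03 to S05 could fit together. The patent does not specify the template format or the matching algorithm, so the regular-expression templates in `TEMPLATES`, the sample `PAGES`, and the in-memory `store` list are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import deque

# S01: hypothetical template repository, template name -> regex with named fields.
TEMPLATES = {
    "article": re.compile(r"<h1>(?P<title>.+?)</h1>\s*<p>(?P<body>.+?)</p>", re.S),
}

# Hypothetical pages standing in for downloaded HTML.
PAGES = {
    "https://example.com/1": "<h1>Rates rise</h1><p>Details here.</p>",
    "https://example.com/2": "<div>no known structure</div>",
}


def match_template(html):
    """S04: return (template name, extracted fields) for the first matching template."""
    for name, pattern in TEMPLATES.items():
        found = pattern.search(html)
        if found:
            return name, found.groupdict()
    return None


def run_pipeline(primary_queue, fetch):
    store = []                                # S05: stand-in for the repository
    queue = deque(primary_queue)              # output of S02: the primary URL queue
    while queue:
        url = queue.popleft()                 # S03: process the queue in FIFO order
        result = match_template(fetch(url))   # S04: template matching
        if result is not None:
            name, fields = result
            store.append({"url": url, "template": name, **fields})
    return store


stored = run_pipeline(list(PAGES), PAGES.__getitem__)
```

Only the page that matches a template is stored; the unmatched page is dropped, which reflects the stated goal of avoiding invalid data in storage.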
In step S02, the depth-first traversal strategy extracts the links in a downloaded web page and inserts them at the end of the queue of URLs to be crawled. Under the depth-first traversal strategy, the web crawler starts from a start page and follows one link after another along a single path; after finishing that path, it moves to the next start page and continues following links.
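A depth-first traversal of this kind can be sketched with an explicit stack. The `LINKS` graph, the `get_links` callback, and the `max_depth` limit are assumptions for illustration; taking the next URL from the end of the frontier is one reading of the description, not a detail confirmed by the patent.

```python
# Hypothetical link graph standing in for downloaded pages: page -> links found on it.
LINKS = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "a1": [],
    "a2": [],
    "b": ["b1"],
    "b1": [],
}


def depth_first_urls(start, get_links, max_depth=3):
    """Follow one path of links to its end before moving on to the next path."""
    order, seen = [], set()
    stack = [(start, 0)]
    while stack:
        url, depth = stack.pop()              # LIFO: the most recently found link first
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        if depth < max_depth:
            # Push in reverse so the first link on the page is followed first.
            for link in reversed(get_links(url)):
                if link not in seen:
                    stack.append((link, depth + 1))
    return order


order = depth_first_urls("start", lambda url: LINKS.get(url, []))
```

The depth limit is a common safeguard against following an unbounded chain of links; the patent does not mention one.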
In step S02, the backlink-count strategy obtains the number of links that point to a downloaded web page from other web pages and orders the queue of pages to be crawled from the highest link count to the lowest. The backlink count indicates the degree to which a page's content is recommended by others. A search engine's crawling system therefore often uses this metric to evaluate the importance of a web page and so decide the order in which different pages are crawled.
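The ordering by backlink count can be sketched as follows. The `LINKS` graph and the tie-breaking rule (alphabetical by URL) are assumptions for this example; the patent only states that pages are ordered from the highest count to the lowest.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "p1": ["p3", "p2"],
    "p2": ["p3"],
    "p3": [],
    "p4": ["p3", "p2"],
}


def order_by_backlinks(candidates, link_graph):
    """Sort candidate URLs so that pages with more inbound links come first."""
    backlinks = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):           # count each linking page once
            if target != source:              # ignore self-links
                backlinks[target] += 1
    # Highest backlink count first; ties are broken by URL for a stable order.
    return sorted(candidates, key=lambda url: (-backlinks[url], url))


queue = order_by_backlinks(["p1", "p2", "p3", "p4"], LINKS)
```

Here `p3` is linked from three pages and `p2` from two, so they move to the front of the queue, while the unlinked pages `p1` and `p4` follow.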
The information extracted from the downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the name of the JavaScript function used for lazy loading of dynamic pages, and the page analysis type.
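The extracted fields listed above could be held in a small record type such as the one below. The class name `PageRecord`, the field names, and all sample values are hypothetical; the patent lists the fields but does not define a data structure.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PageRecord:
    """One downloaded page, with the fields listed in the description."""
    target_url: str
    page_type: str                          # e.g. "static" or "dynamic" (assumed values)
    storage_path: str                       # where the downloaded page is saved
    lazy_load_js_function: Optional[str]    # None when the page is not lazily loaded
    analysis_type: str


# Sample values are illustrative only.
record = PageRecord(
    target_url="https://example.com/list?page=1",
    page_type="dynamic",
    storage_path="/data/pages/example.com/list_1.html",
    lazy_load_js_function="loadMore",
    analysis_type="list",
)
row = asdict(record)
```

Converting the record with `asdict` gives a plain dictionary that can be serialized before it is handed to the matching and storage steps.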
In step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the repository and remove duplicates. The filter contains a whitelist module that stores uncertain URL addresses: when an uncertain URL address appears, the system stores it in the whitelist, and the user can choose, according to actual needs, whether to use the URL addresses in the whitelist.
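A filter with a whitelist module might look like the sketch below. The patent does not say how duplicates are detected or what makes a URL "uncertain", so the SHA-256 content fingerprint and the rule that missing content marks a URL as uncertain are assumptions, as are the class and method names.

```python
import hashlib


class DedupFilter:
    """Deduplicate crawled content and set aside URLs whose status is undecided."""

    def __init__(self):
        self.fingerprints = set()   # hashes of content already accepted
        self.whitelist = set()      # URLs whose status could not be decided

    def check(self, url, content):
        """Return 'new', 'duplicate', or 'uncertain' for a crawled item."""
        if content is None:         # assumption: missing content means undecidable
            self.whitelist.add(url)
            return "uncertain"
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.fingerprints:
            return "duplicate"
        self.fingerprints.add(digest)
        return "new"

    def pending(self, use_whitelist):
        """Whitelisted URLs, returned only when the user opts in."""
        return sorted(self.whitelist) if use_whitelist else []


dedup = DedupFilter()
results = [
    dedup.check("https://example.com/a", "same text"),
    dedup.check("https://example.com/b", "same text"),
    dedup.check("https://example.com/c", None),
]
```

The `use_whitelist` flag mirrors the user's choice described in the text: the uncertain URLs are kept, but they are only handed back when explicitly requested.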
In step S05, the repository uses HDFS distributed storage.
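One simple way to write matched data to HDFS is through the `hdfs dfs` command-line tool. The sketch below only builds the command lines and runs them on request; the file paths are hypothetical, and a working Hadoop client on the machine is assumed. The patent does not state how data is written to HDFS.

```python
import posixpath
import subprocess


def hdfs_put_commands(local_path, hdfs_dir):
    """Build the `hdfs dfs` commands that copy one local file into an HDFS directory."""
    target = posixpath.join(hdfs_dir, posixpath.basename(local_path))
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],          # create the directory if missing
        ["hdfs", "dfs", "-put", "-f", local_path, target],  # -f overwrites an existing file
    ]


def upload(local_path, hdfs_dir):
    """Run the commands; requires a configured Hadoop client on this machine."""
    for argv in hdfs_put_commands(local_path, hdfs_dir):
        subprocess.run(argv, check=True)


# Paths are illustrative only.
commands = hdfs_put_commands("/tmp/matched/article_1.json", "/crawler/matched")
```

Separating command construction from execution makes the logic easy to check without a cluster; `upload` is the only function that touches HDFS.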
It is worth noting that the units in the above system embodiment are divided only according to functional logic; the division is not limited to the one described, provided the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the invention.
In addition, a person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended only to help explain the invention. They do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of this specification. The embodiments were chosen and described in detail to better explain the principles and practical application of the invention, so that those skilled in the art can understand and use it well. The invention is limited only by the claims and their full scope and equivalents.

Claims (6)

1. An optimization method for web crawler data crawling, characterized by comprising the following steps:
Step S01: establish a high-quality template repository and a template matching algorithm;
Step S02: filter the URLs that meet the target-data criteria using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue;
Step S03: crawl the data of the corresponding target websites for the URLs in the primary URL queue using a breadth-first strategy;
Step S04: match the target data against the template data in the template repository according to the template matching algorithm;
Step S05: store the successfully matched template data in the repository.
2. The optimization method for web crawler data crawling according to claim 1, characterized in that, in step S02, the depth-first traversal strategy extracts the links in a downloaded web page and inserts them at the end of the queue of URLs to be crawled.
3. The optimization method for web crawler data crawling according to claim 1, characterized in that, in step S02, the backlink-count strategy obtains the number of links that point to a downloaded web page from other web pages and orders the queue of pages to be crawled from the highest link count to the lowest.
4. The optimization method for web crawler data crawling according to claim 2, characterized in that the information extracted from the downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the name of the JavaScript function used for lazy loading of dynamic pages, and the page analysis type.
5. The optimization method for web crawler data crawling according to claim 1, characterized in that, in step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the repository and remove duplicates, the filter contains a whitelist module, and the whitelist stores uncertain URL addresses.
6. The optimization method for web crawler data crawling according to claim 1, characterized in that, in step S05, the repository uses HDFS distributed storage.
Priority Applications (1)

Application number: CN201811391944.0A
Publication: CN109299392A (en)
Priority date: 2018-11-21
Filing date: 2018-11-21
Title: Optimization method for web crawler data crawling

Publications (1)

Publication number: CN109299392A
Publication date: 2019-02-01

Family

ID=65143260

Family Applications (1)

Application number: CN201811391944.0A
Title: Optimization method for web crawler data crawling
Status: Pending
Publication: CN109299392A (en)
Priority date: 2018-11-21
Filing date: 2018-11-21

Country Status (1)

Country: CN
Link: CN109299392A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 20190201