CN109299392A - A kind of optimization method of web crawlers crawl data - Google Patents
A method for optimizing data crawling by web crawlers
- Publication number
- CN109299392A
- Authority
- CN
- China
- Prior art keywords
- data
- template
- url
- page
- web crawlers
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method for optimizing data crawling by web crawlers and relates to the field of network data transmission. The method comprises the following steps. Step S01: establish a high-quality template repository and a template matching algorithm. Step S02: screen URLs that match the target data using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue. Step S03: crawl the data of the corresponding target websites from the primary URL queue using a breadth-first strategy. Step S04: using the template matching algorithm, match the target data against the template data in the template repository. Step S05: store the successfully matched template data in the repository. By building the primary URL queue with the depth-first traversal strategy and the backlink-count strategy, and by matching the crawled data against the templates in the template library with the template matching algorithm, the invention improves the accuracy and efficiency of web crawling and avoids redundant invalid data and wasted storage resources.
Description
Technical field
The invention belongs to the field of network data transmission, and in particular relates to a method for optimizing data crawling by web crawlers.
Background art
With the development of the internet, the amount of information it contains keeps growing, and search engines help people find the content they are interested in within this mass of information. General-purpose search engines such as Baidu, Google and Bing provide internet content search services to all users. These search engines rely on crawler technology to continuously fetch information from the internet and store it so that people can retrieve it easily. Because the volume of data to be crawled is huge, large search engines usually adopt a distributed processing mechanism, that is, they build a distributed crawler service system. The crawlers take target URLs from a unified download queue, then download the target web pages and store them at a specified location.
As search engines multiply, their crawlers fetch website content continuously and the load on website systems keeps rising. On June 30, 1994, after joint discussion between search engine staff and the owners of the websites being crawled, an industry standard, the Robots protocol, was formally released. The Robots protocol (also called the crawler protocol or robot protocol) is formally named the "web crawler exclusion standard" (Robots Exclusion Protocol). Through the Robots protocol, a website tells search engines which pages may be crawled and which may not.
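As a brief illustration of how a crawler might honor the Robots protocol, the sketch below uses Python's standard `urllib.robotparser` module. The rules, the user-agent name `ExampleBot` and the `example.com` URLs are hypothetical and chosen only for this illustration; they do not come from the patent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# A page outside the disallowed prefix may be crawled.
allowed = parser.can_fetch("ExampleBot", "https://example.com/news/1.html")
# A page under /private/ may not be crawled.
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/a.html")
```

A crawler would typically run this check before adding a URL to its download queue.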
Current web crawlers nevertheless have the following drawbacks:
1. A large number of duplicate URLs are parsed out and fed back to the control center, which puts great pressure on the control center's URL deduplication module; returning these duplicate URLs also consumes much unnecessary network traffic and lowers the efficiency of the crawler system;
2. After a web page is crawled it is returned to the data warehouse and only then analyzed during page parsing; many of the pages handled this way have not been updated at all, which costs the crawler system considerable time and traffic.
Summary of the invention
The purpose of the invention is to provide a method for optimizing data crawling by web crawlers. The method builds a primary URL queue using a depth-first traversal strategy and a backlink-count strategy, and matches the crawled data against the templates in a template library using a template matching algorithm. It thereby addresses the insufficient accuracy and efficiency of existing web crawlers and the low working efficiency of existing crawler systems.
To solve the above technical problems, the invention is implemented through the following technical solution:
The invention is a method for optimizing data crawling by web crawlers, comprising the following steps:
Step S01: establish a high-quality template repository and a template matching algorithm;
Step S02: screen URLs that match the target data using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue;
Step S03: crawl the data of the corresponding target websites from the primary URL queue using a breadth-first strategy;
Step S04: using the template matching algorithm, match the target data against the template data in the template repository;
Step S05: store the successfully matched template data in the repository.
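The five steps can be read as a small pipeline. The following is only a minimal sketch under stated assumptions: the function `run_pipeline`, its callable parameters, the in-memory `PAGES` dictionary and the `title=` page format are all hypothetical stand-ins, since the text does not specify how screening, downloading or template matching are implemented.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def run_pipeline(
    candidate_urls: Iterable[str],
    is_target: Callable[[str], bool],                  # S02 screening rule (assumed)
    fetch: Callable[[str], str],                       # S03 downloader (assumed)
    match_template: Callable[[str], Optional[dict]],   # S01/S04 matcher (assumed)
) -> List[dict]:
    # S02: keep only URLs that look like target data -> primary URL queue.
    primary_queue = deque(u for u in candidate_urls if is_target(u))
    repository: List[dict] = []        # stands in for the repository of S05
    while primary_queue:
        url = primary_queue.popleft()  # S03: process the queue in FIFO order
        page = fetch(url)
        record = match_template(page)  # S04: None means "no template matched"
        if record is not None:
            repository.append({"url": url, **record})  # S05: store the match
    return repository


# Hypothetical in-memory "website" used instead of real network access.
PAGES = {
    "https://example.com/item/1": "title=Widget",
    "https://example.com/item/2": "no structured data",
}


def match(page: str) -> Optional[dict]:
    # Toy template: a page of the form "title=<text>".
    return {"title": page.split("=", 1)[1]} if page.startswith("title=") else None


result = run_pipeline(
    ["https://example.com/item/1", "https://example.com/about", "https://example.com/item/2"],
    is_target=lambda u: "/item/" in u,
    fetch=PAGES.__getitem__,
    match_template=match,
)
```

Passing the screening rule, downloader and matcher as parameters keeps the sketch independent of any particular choice for those components.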
Preferably, in step S02, the depth-first traversal strategy extracts the links in the downloaded web pages and inserts them at the end of the queue of URLs to be crawled.
Preferably, in step S02, the backlink-count strategy obtains the number of links that other web pages make to a downloaded page, and the queue of pages to be crawled is ordered from the highest link count to the lowest.
Preferably, the information extracted from a downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the JavaScript function name used for lazy loading of dynamic pages, and the page analysis type.
Preferably, in step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the repository and remove duplicates; a whitelist module is built into the filter, and the whitelist stores uncertain URL addresses.
Preferably, in step S05, the repository uses HDFS distributed storage.
The invention has the following beneficial effects:
The invention builds the primary URL queue using the depth-first traversal strategy and the backlink-count strategy, and matches the crawled data against the templates in the template library using the template matching algorithm. This improves the accuracy and efficiency of web crawling and avoids redundant invalid data and wasted storage resources.
Of course, a product implementing the invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a block diagram of the method of the invention for optimizing data crawling by web crawlers.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, the invention is a method for optimizing data crawling by web crawlers, comprising the following steps:
Step S01: establish a high-quality template repository and a template matching algorithm;
Step S02: screen URLs that match the target data using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue;
Step S03: crawl the data of the corresponding target websites from the primary URL queue using a breadth-first strategy;
Step S04: using the template matching algorithm, match the target data against the template data in the template repository;
Step S05: store the successfully matched template data in the repository.
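The breadth-first crawl of step S03 can be sketched with a FIFO queue. This is an illustrative sketch only: the link graph `LINKS`, the single-letter page names and the `max_pages` limit are hypothetical, and a real crawler would download pages rather than read an in-memory dictionary.

```python
from collections import deque
from typing import Dict, List

# Hypothetical link graph: page -> pages it links to (an assumption).
LINKS: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def breadth_first_crawl(primary_queue: List[str],
                        links: Dict[str, List[str]],
                        max_pages: int = 100) -> List[str]:
    """Visit pages level by level, starting from the primary URL queue."""
    queue = deque(primary_queue)
    seen = set(primary_queue)          # avoids queuing the same URL twice
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()          # FIFO: oldest queued URL first
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)      # newly found links go to the end
    return order


bfs_order = breadth_first_crawl(["a"], LINKS)
```

The `seen` set also illustrates local URL deduplication, which keeps repeated URLs such as `d` from being queued more than once.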
In step S02, the depth-first traversal strategy extracts the links in the downloaded web pages and inserts them at the end of the queue of URLs to be crawled. Under the depth-first traversal strategy, the web crawler starts from a start page and follows links one after another; once it has finished that path it moves on to the next start page and continues following links.
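A minimal sketch of such a depth-first traversal follows. It assumes that the next URL is taken from the end of the pending list, which is what makes the traversal depth-first when extracted links are appended at the end; the text does not state which end of the queue is read. The link graph and page names are hypothetical.

```python
from typing import Dict, List

# Hypothetical link graph: page -> pages it links to (an assumption).
LINKS: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "e": ["f"],
    "f": [],
}


def depth_first_crawl(start_pages: List[str],
                      links: Dict[str, List[str]]) -> List[str]:
    """Follow one chain of links to its end before moving to the next start page."""
    visited: List[str] = []
    seen = set()
    for start in start_pages:
        pending = [start]
        while pending:
            url = pending.pop()        # assumption: read from the end (LIFO)
            if url in seen:
                continue
            seen.add(url)
            visited.append(url)
            # Extracted links are appended at the end of the pending list;
            # reversing keeps the first link on a page as the next one followed.
            pending.extend(reversed(links.get(url, [])))
    return visited


dfs_order = depth_first_crawl(["a", "e"], LINKS)
```

Here the path a, b, d is exhausted before c is visited, and only then does the crawl move on to the second start page e.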
In step S02, the backlink-count strategy obtains the number of links that other web pages make to a downloaded page, and the queue of pages to be crawled is ordered from the highest link count to the lowest. The backlink count indicates how strongly the content of a page is recommended by others; search engine crawling systems therefore often use this indicator to rate the importance of pages and so decide the order in which different pages are crawled.
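The ordering by backlink count can be sketched as follows. The link graph and page names are hypothetical, and counting each linking page once and breaking ties alphabetically are assumptions made for the illustration, not details given in the text.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical link graph: page -> pages it links to (an assumption).
LINKS: Dict[str, List[str]] = {
    "home": ["news", "about"],
    "news": ["about", "item"],
    "about": ["item"],
    "item": ["about"],
}


def order_by_backlinks(links: Dict[str, List[str]]) -> List[str]:
    """Return pages sorted from most to fewest inbound links."""
    counts: Counter = Counter()
    for source, targets in links.items():
        counts[source] += 0            # pages without backlinks still appear
        for target in set(targets):    # count each linking page once
            if target != source:       # ignore self-links
                counts[target] += 1
    # Highest backlink count first; ties broken alphabetically (assumption).
    return sorted(counts, key=lambda url: (-counts[url], url))


crawl_order = order_by_backlinks(LINKS)
```

In this toy graph `about` is linked from three pages, `item` from two, `news` from one and `home` from none, so the pages are queued in that order.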
The information extracted from a downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the JavaScript function name used for lazy loading of dynamic pages, and the page analysis type.
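These fields could be held in one record per downloaded page. The sketch below is only one possible layout: the class name `PageRecord`, the field names and all example values are hypothetical, since the text lists the fields but not their representation.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PageRecord:
    """One record per downloaded page; field names are illustrative only."""
    target_url: str
    page_type: str                          # e.g. "static" or "dynamic" (assumed values)
    storage_path: str                       # where the downloaded page is stored
    lazy_load_js_function: Optional[str]    # None for pages without lazy loading
    analysis_type: str                      # which parser should handle the page


record = PageRecord(
    target_url="https://example.com/item/1",
    page_type="dynamic",
    storage_path="/data/pages/item_1.html",
    lazy_load_js_function="loadMoreItems",
    analysis_type="html",
)
record_dict = asdict(record)   # plain dict, convenient for serialization
```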
In step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the repository and remove duplicates. A whitelist module is built into the filter, and the whitelist stores uncertain URL addresses: when an uncertain URL address appears, the system stores it in the whitelist, and the user can choose, according to actual needs, whether to use the URL addresses in the whitelist.
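A filter with a whitelist could look roughly like the sketch below. The text does not define what makes a URL address "uncertain"; this sketch assumes it means a URL whose page no template recognized. The class `DedupFilter`, its method names and the use of a SHA-256 content fingerprint for duplicate detection are likewise assumptions.

```python
import hashlib
from typing import List, Set


class DedupFilter:
    """Drops duplicate content and parks uncertain URLs in a whitelist."""

    def __init__(self) -> None:
        self._fingerprints: Set[str] = set()
        self.whitelist: List[str] = []

    def check(self, url: str, content: str, matched: bool) -> str:
        # Assumption: a URL is "uncertain" when no template matched its page.
        if not matched:
            if url not in self.whitelist:
                self.whitelist.append(url)
            return "uncertain"
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self._fingerprints:
            return "duplicate"          # same content is already stored
        self._fingerprints.add(digest)
        return "new"

    def release_whitelist(self, use_it: bool) -> List[str]:
        # The user decides whether the whitelisted URLs are used.
        return list(self.whitelist) if use_it else []


flt = DedupFilter()
first = flt.check("https://example.com/item/1", "title=Widget", matched=True)
second = flt.check("https://example.com/item/1?ref=x", "title=Widget", matched=True)
third = flt.check("https://example.com/unknown", "???", matched=False)
```

Only results marked `"new"` would be passed on to storage in step S05.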
In step S05, the repository uses HDFS distributed storage.
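One simple way to place matched data in HDFS is the Hadoop command-line client. The sketch below only builds the `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` commands and assumes that the `hdfs` client is installed and configured; the local file name and the target directory `/crawler/repository` are hypothetical.

```python
import posixpath
import subprocess
from typing import List


def hdfs_put_commands(local_path: str, hdfs_dir: str) -> List[List[str]]:
    """Build the CLI commands that copy one local file into an HDFS directory."""
    file_name = posixpath.basename(local_path)
    return [
        # Create the target directory (and parents) if it does not exist yet.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the file, overwriting an existing copy with the same name.
        ["hdfs", "dfs", "-put", "-f", local_path, posixpath.join(hdfs_dir, file_name)],
    ]


def upload(local_path: str, hdfs_dir: str) -> None:
    """Run the commands; requires a working Hadoop client on this machine."""
    for command in hdfs_put_commands(local_path, hdfs_dir):
        subprocess.run(command, check=True)


commands = hdfs_put_commands("/tmp/matched_0001.json", "/crawler/repository")
```

Calling `upload` is left out of the example because it needs a reachable HDFS cluster.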
It should be noted that the units described in the above embodiment are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the invention.
Furthermore, a person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended only to help explain the invention. The preferred embodiments do not describe every detail, nor do they limit the invention to the specific implementations described. Many modifications and variations can evidently be made in light of this specification. These embodiments were chosen and described in detail in order to better explain the principles and practical application of the invention, so that those skilled in the art can understand and use the invention well. The invention is limited only by the claims and their full scope and equivalents.
Claims (6)
1. A method for optimizing data crawling by web crawlers, characterized by comprising the following steps:
Step S01: establish a high-quality template repository and a template matching algorithm;
Step S02: screen URLs that match the target data using a depth-first traversal strategy and a backlink-count strategy, and build a primary URL queue;
Step S03: crawl the data of the corresponding target websites from the primary URL queue using a breadth-first strategy;
Step S04: using the template matching algorithm, match the target data against the template data in the template repository;
Step S05: store the successfully matched template data in the repository.
2. The method for optimizing data crawling by web crawlers according to claim 1, characterized in that, in step S02, the depth-first traversal strategy extracts the links in the downloaded web pages and inserts them at the end of the queue of URLs to be crawled.
3. The method for optimizing data crawling by web crawlers according to claim 1, characterized in that, in step S02, the backlink-count strategy obtains the number of links that other web pages make to a downloaded page, and the queue of pages to be crawled is ordered from the highest link count to the lowest.
4. The method for optimizing data crawling by web crawlers according to claim 2, characterized in that the information extracted from the downloaded web page includes: the target URL, the page type, the storage path of the downloaded page, the JavaScript function name used for lazy loading of dynamic pages, and the page analysis type.
5. The method for optimizing data crawling by web crawlers according to claim 1, characterized in that, in step S04, the template matching algorithm uses a filter to compare the crawled target data with the templates in the repository and remove duplicates; a whitelist module is built into the filter, and the whitelist stores uncertain URL addresses.
6. The method for optimizing data crawling by web crawlers according to claim 1, characterized in that, in step S05, the repository uses HDFS distributed storage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811391944.0A CN109299392A (en) | 2018-11-21 | 2018-11-21 | A kind of optimization method of web crawlers crawl data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811391944.0A CN109299392A (en) | 2018-11-21 | 2018-11-21 | A kind of optimization method of web crawlers crawl data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109299392A true CN109299392A (en) | 2019-02-01 |
Family
ID=65143260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811391944.0A Pending CN109299392A (en) | 2018-11-21 | 2018-11-21 | A kind of optimization method of web crawlers crawl data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299392A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060277175A1 (en) * | 2000-08-18 | 2006-12-07 | Dongming Jiang | Method and Apparatus for Focused Crawling |
US20130204981A1 (en) * | 2010-06-24 | 2013-08-08 | Amazon Technologies, Inc. | Url rescue by identifying information related to an item referenced in an invalid url |
CN103761330A (en) * | 2014-02-10 | 2014-04-30 | 赛特斯信息科技股份有限公司 | System and method for achieving automatic Internet information extraction based on template configuration |
CN105045838A (en) * | 2015-07-01 | 2015-11-11 | 华东师范大学 | Network crawler system based on distributed storage system |
CN107025296A (en) * | 2017-04-17 | 2017-08-08 | 山东辰华科技信息有限公司 | Based on science service information intelligent grasping system method of data capture |
CN108121706A (en) * | 2016-11-28 | 2018-06-05 | 央视国际网络无锡有限公司 | A kind of optimization method of distributed reptile |
Non-Patent Citations (1)
Title |
---|
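Li Song (李松): "Research on Hadoop-Based Distributed Web Crawlers" (基于Hadoop的分布式网络爬虫研究), China Master's Theses Full-text Database, Information Science and Technology series (《中国优秀硕士学位论文全文数据库信息科技辑》) * |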
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190201 |