CN109063019A - Implementation method for a lightweight vertical web crawler based on the producer-consumer pattern - Google Patents

Implementation method for a lightweight vertical web crawler based on the producer-consumer pattern

Info

Publication number
CN109063019A
Authority
CN
China
Prior art keywords
url
interface
consumed
producer
parsing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810763260.2A
Other languages
Chinese (zh)
Inventor
张晓双
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Shandong Hui Trade Electronic Port Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Hui Trade Electronic Port Co Ltd
Priority to CN201810763260.2A
Publication of CN109063019A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an implementation method for a lightweight vertical web crawler based on the producer-consumer pattern. The method comprises: configuring URL patterns and parsing rules in a configuration file according to actual needs; constructing a to-be-consumed URL queue container to form a producer-consumer pattern for URLs; configuring a URL parsing interface for the to-be-consumed URL queue container and defining the parsing rules of that interface according to actual needs; and parsing the incoming URLs through the URL parsing interface and storing the target URLs that match the pattern in a database. Compared with the prior art, the method improves the scalability of the system and reduces its degree of coupling.

Description

Implementation method for a lightweight vertical web crawler based on the producer-consumer pattern
Technical field
The present invention relates to the technical field of vertical-domain web crawlers, and specifically to an implementation method for a lightweight vertical web crawler based on the producer-consumer pattern.
Background
With the rapid development of information technology, information about people's lives and work is becoming digitized, and digital information of every kind floods the network. The potential value of this data is hard to overestimate, so extracting and using it efficiently is of great practical significance.
A web crawler is a program or script that automatically fetches information from the internet according to certain rules. Common web crawlers include general-purpose crawlers, vertical crawlers, and incremental crawlers. Each kind has its own application scenarios and therefore its own functions. Crawlers can be divided into three types:
Batch crawler: if the target web pages are known and their scope is fixed, a batch crawler can be used. Once the crawler has fetched the required pages, it can stop. The specific goal may vary: it may be to reach a given number of pages, or to finish crawling within a given time, and so on.
Incremental crawler: unlike the batch crawler described above, this kind of crawler keeps crawling continuously. If a page it has already fetched changes, the crawler fetches it again promptly, because pages across the network change all the time: new pages appear suddenly, pages are deleted, and page content is modified. To meet user expectations, an incremental crawler must update what it has crawled in a timely way. In this process it does not need to fetch new pages again; it updates the pages it has already crawled. This kind of crawler is suited to general commercial search engines.
Vertical crawler: a vertical web crawler, also called a focused web crawler, selectively crawls the relevant pages of the World Wide Web according to a predefined crawl target and obtains the required data. This kind of crawler crawls according to the content of the pages themselves. It generally crawls pages around a particular topic or a particular industry: it fetches only the pages related to that industry and can ignore information from all other industries. The main concern for this kind of crawler is limited system resources. It is not feasible to spend all resources crawling every page and then pick out what is needed, since that greatly reduces resource utilization. The crawler should therefore find the pages that best match the topic as early as possible and, ideally, never fetch irrelevant pages at all, so that resources are saved. Such crawlers are therefore confined to vertical search sites or vertical industry sites.
However, traditional vertical web crawlers suffer from tightly coupled systems and low efficiency, which is the problem to be solved.
Summary of the invention
The technical task of the present invention is to address the above deficiencies by providing a lightweight vertical web crawler system based on the producer-consumer pattern and an implementation method thereof.
The technical solution adopted by the present invention to solve this technical problem is an implementation method for a lightweight vertical web crawler based on the producer-consumer pattern, the method comprising:
configuring URL patterns and parsing rules in a configuration file according to actual needs;
constructing a to-be-consumed URL queue container to form a producer-consumer pattern for URLs;
configuring a URL parsing interface for the to-be-consumed URL queue container, and defining the parsing rules of the URL parsing interface according to actual needs;
parsing the incoming URLs through the URL parsing interface and storing the target URLs that match the pattern in a database.
Further, in a preferred method, the configuration file configures multi-threaded URL patterns.
Further, in a preferred method, the specific method comprises:
S1: load the configuration file and extract its entry URLs;
S2: when the length of the to-be-consumed queue is 0, wait for the producer group to produce; when the length is at least 1, the consumer group consumes one item, passing the URL to be consumed, together with the parsing rule that matches its URL pattern, to the parsing method to obtain more URLs;
S3: determine whether each resulting URL matches the target URL pattern given in the configuration file; if so, add it to the target URL queue; if not, append it to the end of the to-be-consumed queue;
S4: the producer group reads the entry URLs from the configuration file, passes the relevant parameters to the URL parsing interface to produce URLs, and wakes the consumer group to consume them.
Further, in a preferred method, when parsing through the URL parsing interface raises an exception, the method records the exception count for the URL and adds the URL back to the to-be-consumed queue; when the exception count reaches a set value, the URL is moved to an exception queue.
Further, in a preferred method, the to-be-consumed queue container is a singleton that provides a producer-group interface and a consumer-group interface for messages, and the interface methods are synchronized methods.
A lightweight vertical web crawler system based on the producer-consumer pattern comprises a configuration file module that is loaded when the system starts, and a singleton to-be-consumed queue module.
The configuration file module is used to configure URL patterns, and the parsing rules corresponding to those URL patterns, according to actual needs.
The to-be-consumed queue module comprises a producer-group interface, a consumer-group interface, and a URL parsing interface through which the producer-group interface and the consumer-group interface parse messages; the parsing rules of the URL parsing interface are defined according to actual needs.
Further, in a preferred structure, the system also comprises an exception monitoring module. When parsing through the URL parsing interface raises an exception, the exception monitoring module records the exception count and adds the URL back to the to-be-consumed queue module; when the exception count reaches a set value, the URL is moved to an exception queue.
Compared with the prior art, the lightweight vertical web crawler system based on the producer-consumer pattern and its implementation method use a producer group and a consumer group to execute crawl tasks on multiple threads, which improves the efficiency of crawler data collection. The lightweight component is configured mainly through a configuration file that sets parameters such as URL patterns and the corresponding parsing rules. Once the component is created and started, it obtains the corresponding target URLs and stores them in the database, which improves the scalability of the system and reduces its degree of coupling.
Brief description of the drawings
The present invention is further described below with reference to the drawing.
Figure 1 is a functional block diagram of the implementation method for the lightweight vertical web crawler based on the producer-consumer pattern.
Detailed description
The present invention is further explained below with reference to the drawing and specific embodiments.
A uniform resource locator (URL) is a concise representation of the location of, and access method for, a resource available on the internet; it is the address of a standard resource on the internet. Every file on the internet has a unique URL, which indicates where the file is located and how a browser should handle it.
The present invention is an implementation method for a lightweight vertical web crawler based on the producer-consumer pattern.
A vertical web crawler often has to produce URLs by parsing pages and parse pages by consuming URLs, repeating this produce-consume cycle until it has the required data. When one project needs several customized crawlers, or one crawler has several similar crawl tasks, implementing the multi-threaded produce-consume scheduling for each crawler becomes tedious. Extracting this process into a reusable component makes crawling and data collection more efficient to implement, improves the scalability of the system, and reduces its complexity.
Embodiment 1:
This patent involves technologies such as multi-threaded scheduling, message queues, the producer-consumer pattern, and web page parsing, and can be applied in fields such as small vertical web crawlers and data extraction.
Specific implementation:
Construct the to-be-consumed URL queue container as a singleton that provides producer and consumer interfaces for messages. Other modules can call these interfaces to produce and consume messages. The interface methods are synchronized to prevent messages from being lost because of concurrency problems.
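As a non-limiting illustration, a singleton queue container with synchronized producer and consumer interfaces might be sketched in Python as follows. The class name `PendingUrlQueue`, the method names `produce` and `consume`, and the `timeout` parameter are assumptions made for this sketch and do not appear in the original text.

```python
import threading
from collections import deque


class PendingUrlQueue:
    """Singleton container for URLs waiting to be consumed (hypothetical name)."""

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so that all threads share one instance.
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._items = deque()
                    instance._cond = threading.Condition()
                    cls._instance = instance
        return cls._instance

    def produce(self, url):
        """Producer interface: append a URL and wake one waiting consumer."""
        with self._cond:
            self._items.append(url)
            self._cond.notify()

    def consume(self, timeout=None):
        """Consumer interface: block while the queue is empty, then pop one URL.

        Returns None if the timeout expires while the queue is still empty.
        """
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            return self._items.popleft()

    def __len__(self):
        with self._cond:
            return len(self._items)
```

Every access to the underlying deque happens while holding the same condition lock, which plays the role of the synchronized interface methods: two threads cannot append or pop at the same time, so no message is lost to a race.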
Build the producer class and consumer class that parse pages, and expose a page parsing interface that takes the URL to be parsed and the corresponding parsing rule.
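One possible shape for such a page parsing interface is sketched below in Python. The sketch assumes that a parsing rule is a regular expression with one capture group, and that the page text has already been downloaded and is passed in as `html`; both are assumptions for illustration, and the function name `parse_page` is hypothetical. The original text does not specify the rule format.

```python
import re
from urllib.parse import urljoin


def parse_page(page_url, html, rule):
    """Extract URLs from `html` using `rule`, a regex with one capture group.

    Relative links are resolved against `page_url`; duplicates are dropped
    while the first-seen order is kept.
    """
    found = []
    for match in re.finditer(rule, html):
        absolute = urljoin(page_url, match.group(1))
        if absolute not in found:
            found.append(absolute)
    return found
```

A caller would pair each URL with the rule configured for its URL pattern, so the same function can serve both producers and consumers.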
When the component is created, load the configuration file and extract parameters such as the start URLs (entry URLs), the matching pattern for target URLs, the parsing rules for the different URL patterns, the number of queue messages, and the database.
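The original text does not give a file format for the configuration. Purely as an illustration, the parameters listed above could be held in a JSON document and loaded as in the following Python sketch. All key names (`entry_urls`, `target_pattern`, `parse_rules`, `max_queue_size`, `database_url`), the default queue size, and the example values are hypothetical.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class CrawlerConfig:
    entry_urls: list       # start URLs read by the producer group
    target_pattern: str    # regex that final target URLs must match
    parse_rules: dict      # URL pattern (regex) -> parsing rule (regex)
    max_queue_size: int    # number of queue messages
    database_url: str      # where target URLs are stored


def load_config(text):
    """Parse the JSON text and check that every pattern is a valid regex."""
    raw = json.loads(text)
    config = CrawlerConfig(
        entry_urls=list(raw["entry_urls"]),
        target_pattern=raw["target_pattern"],
        parse_rules=dict(raw["parse_rules"]),
        max_queue_size=int(raw.get("max_queue_size", 1000)),
        database_url=raw["database_url"],
    )
    re.compile(config.target_pattern)
    for url_pattern, rule in config.parse_rules.items():
        re.compile(url_pattern)
        re.compile(rule)
    return config


# Hypothetical configuration; a raw string keeps the JSON escapes intact.
EXAMPLE_CONFIG = r"""
{
  "entry_urls": ["http://example.com/news/"],
  "target_pattern": "^http://example\\.com/news/\\d+\\.html$",
  "parse_rules": {
    "^http://example\\.com/news/": "href=\"([^\"]+)\""
  },
  "max_queue_size": 500,
  "database_url": "sqlite:///crawl.db"
}
"""
```

Validating the patterns at load time means a malformed rule fails when the component is created rather than in the middle of a crawl.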
When the to-be-consumed URL queue is empty, wait for the producers to produce; otherwise consume one item: pass the URL to be consumed, together with the parsing rule that matches its pattern, to the parsing method to obtain more URLs. For each URL obtained, determine whether it matches the target URL pattern given in the configuration file; if it does, add it to the final target URL queue, otherwise append it to the end of the to-be-consumed URL queue.
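The consume-and-route step can be sketched in Python as follows. This is a single-threaded simplification under stated assumptions: the function name `consume_one` is hypothetical, `parse` stands for any callable that returns the URLs found on a page, and returning `False` on an empty queue stands in for waiting on the producers.

```python
import re
from collections import deque


def consume_one(pending, targets, target_pattern, parse):
    """Consume one URL from `pending` and route every URL that `parse` returns.

    Returns False when `pending` is empty (a threaded consumer would wait for
    producers here), True after one URL has been processed.
    """
    if len(pending) == 0:
        return False
    url = pending.popleft()
    for found in parse(url):
        if re.fullmatch(target_pattern, found):
            targets.append(found)      # matches the target pattern: final result
        else:
            pending.append(found)      # intermediate page: back of the queue
    return True
```

The sketch does not track visited URLs, so a site whose pages link to each other in a cycle would be re-queued indefinitely; the original text does not describe deduplication, and a practical crawler would likely need to add it.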
The producer group starts by reading the start URLs from the configuration file (there may be one or several), passes the relevant parameters to the parsing class interface to produce URLs, and wakes the consumers to consume them.
When URL parsing raises an exception, record the exception count and add the URL back to the to-be-consumed queue; when the exception count reaches the configured number of attempts, move the URL to the exception queue.
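The exception handling just described can be sketched in Python as follows. The function name `parse_with_retry`, the `attempts` dictionary used to record the exception count, and the default limit of 3 are assumptions for this sketch; the original text only says that the limit is a set value.

```python
from collections import deque


def parse_with_retry(url, attempts, pending, failed, parse, max_attempts=3):
    """Parse `url`; on error, count the failure and requeue or give up.

    `attempts` maps each URL to the number of failures seen so far.
    Returns the parsed URLs, or an empty list if parsing raised.
    """
    try:
        return parse(url)
    except Exception:
        attempts[url] = attempts.get(url, 0) + 1
        if attempts[url] >= max_attempts:
            failed.append(url)         # give up: move to the exception queue
        else:
            pending.append(url)        # try again later
        return []
```

Keeping failed URLs in a separate queue prevents a permanently broken page from circulating in the to-be-consumed queue forever, while still leaving a record that can be inspected or retried later.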
Those skilled in the art can readily implement the present invention from the specific embodiment above. It should be understood, however, that the present invention is not limited to this specific embodiment. On the basis of the disclosed embodiment, those skilled in the art may combine different technical features in any way to realize different technical solutions.

Claims (7)

1. An implementation method for a lightweight vertical web crawler based on the producer-consumer pattern, characterized in that the method comprises:
configuring URL patterns and parsing rules in a configuration file according to actual needs;
constructing a to-be-consumed URL queue container to form a producer-consumer pattern for URLs;
configuring a URL parsing interface for the to-be-consumed URL queue container, and defining the parsing rules of the URL parsing interface according to actual needs;
parsing the incoming URLs through the URL parsing interface and storing the target URLs that match the pattern in a database.
2. The implementation method for a lightweight vertical web crawler based on the producer-consumer pattern according to claim 1, characterized in that the configuration file configures multi-threaded URL patterns.
3. The implementation method for a lightweight vertical web crawler based on the producer-consumer pattern according to claim 1, characterized in that the specific method comprises:
S1: loading the configuration file and extracting its entry URLs;
S2: when the length of the to-be-consumed queue is 0, waiting for the producer group to produce; when the length is at least 1, the consumer group consuming one item, passing the URL to be consumed, together with the parsing rule that matches its URL pattern, to the parsing method to obtain more URLs;
S3: determining whether each resulting URL matches the target URL pattern given in the configuration file; if so, adding it to the target URL queue; if not, appending it to the end of the to-be-consumed queue;
S4: the producer group reading the entry URLs from the configuration file, passing the relevant parameters to the URL parsing interface to produce URLs, and waking the consumer group to consume them.
4. The implementation method for a lightweight vertical web crawler based on the producer-consumer pattern according to claim 3, characterized in that the method further comprises: when parsing through the URL parsing interface raises an exception, recording the exception count and adding the URL back to the to-be-consumed queue; and when the exception count reaches a set value, moving the URL to an exception queue.
5. The implementation method for a lightweight vertical web crawler based on the producer-consumer pattern according to claim 1, characterized in that the to-be-consumed queue container is a singleton that provides a producer-group interface and a consumer-group interface for messages, and the interface methods are synchronized methods.
6. A lightweight vertical web crawler system based on the producer-consumer pattern, characterized by comprising a configuration file module that is loaded when the system starts, and a singleton to-be-consumed queue module;
the configuration file module being used to configure URL patterns, and the parsing rules corresponding to those URL patterns, according to actual needs;
the to-be-consumed queue module comprising a producer-group interface, a consumer-group interface, and a URL parsing interface through which the producer-group interface and the consumer-group interface parse messages, the parsing rules of the URL parsing interface being defined according to actual needs.
7. The implementation method for a lightweight vertical web crawler based on the producer-consumer pattern according to claim 6, characterized by further comprising an exception monitoring module, wherein, when parsing through the URL parsing interface raises an exception, the exception monitoring module records the exception count and adds the URL back to the to-be-consumed queue module, and when the exception count reaches a set value, the URL is moved to an exception queue.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810763260.2A CN109063019A (en) 2018-07-12 2018-07-12 Implementation method for a lightweight vertical web crawler based on the producer-consumer pattern

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810763260.2A CN109063019A (en) 2018-07-12 2018-07-12 Implementation method for a lightweight vertical web crawler based on the producer-consumer pattern

Publications (1)

Publication Number Publication Date
CN109063019A true CN109063019A (en) 2018-12-21

Family ID: 64816152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810763260.2A Pending CN109063019A (en) 2018-07-12 2018-07-12 Implementation method for a lightweight vertical web crawler based on the producer-consumer pattern

Country Status (1)

Country Link
CN (1) CN109063019A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111404643A (en) * 2020-03-10 2020-07-10 山东汇贸电子口岸有限公司 Data receiving and transmitting processing method based on message queue
CN113362144A (en) * 2021-07-19 2021-09-07 海南炳祥投资咨询有限公司 E-commerce shopping recommendation method and system based on big data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677918A (en) * 2016-03-03 2016-06-15 浪潮软件股份有限公司 Distributed crawler architecture based on Kafka and Quartz and implementation method thereof
CN107943991A (en) * 2017-12-01 2018-04-20 成都嗨翻屋文化传播有限公司 Distributed crawler framework and implementation method based on an in-memory database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190715

Address after: 214029 No. 999 Gaolang East Road, Binhu District, Wuxi City, Jiangsu Province (Software Development Building) 707

Applicant after: Chaozhou Zhuoshu Big Data Industry Development Co.,Ltd.

Address before: 250100 S06 Floor, No. 1036 Tidal Road, Jinan High-tech Zone, Shandong Province

Applicant before: SHANDONG HUIMAO ELECTRONIC PORT Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20181221
