CN109063019A - Implementation method of a lightweight vertical web crawler based on the producer-consumer pattern - Google Patents
- Publication number
- CN109063019A (application number CN201810763260.2A)
- Authority
- CN
- China
- Prior art keywords
- url
- interface
- consumed
- producer
- parsing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an implementation method of a lightweight vertical web crawler based on the producer-consumer pattern. The method comprises: configuring URL patterns and parsing rules in a configuration file according to actual needs; constructing a to-be-consumed URL queue container so that URLs are handled in a producer-consumer pattern; configuring a URL parsing interface for the to-be-consumed URL queue container and defining the parsing rules of that interface according to actual needs; and parsing the URLs passed in through the URL parsing interface, storing the target URLs that match the configured pattern in a database. Compared with the prior art, the method improves the scalability of the system and reduces its degree of coupling.
Description
Technical field
The present invention relates to the technical field of vertical-domain web crawlers, and specifically to an implementation method of a lightweight vertical web crawler based on the producer-consumer pattern.
Background art
With the rapid development of information technology, information about people's lives and work is increasingly digitized, and digital information of all kinds floods the network. The potential value of this data is inestimable, so extracting and using these large volumes of data efficiently is of great practical significance.
A web crawler is a program or script that automatically crawls information from the Internet according to certain rules. Common web crawlers include general-purpose web crawlers, vertical web crawlers, and incremental web crawlers. Each kind of crawler has its own application scenarios and therefore different functions. Crawlers can be divided into three types:
Batch crawler: if the pages to be crawled are known and the range is fixed, a batch crawler can be used. Once the crawler has crawled the required pages, it can stop. The specific goal may vary: it may be to reach a specific number of pages, or to finish crawling within a specific time, and so on.
Incremental crawler: unlike the batch crawler described above, this kind of crawler keeps crawling continuously. If a page it has already crawled changes, it crawls that page again promptly, because pages across the network change all the time: pages are added, pages are deleted, and page contents are modified. To keep the user experience satisfactory, the incremental crawler must keep what it has crawled up to date in response to these changes. In this process it therefore does not need to fetch new pages, but rather updates the pages that have already been crawled. This kind of crawler is suited to general commercial search engines.
Vertical crawler: a vertical web crawler, also called a focused web crawler, selectively crawls relevant pages on the World Wide Web according to a predefined crawl target and obtains the required data. This kind of crawler crawls according to page content: it generally crawls pages about a particular topic or pages of a particular industry. It only needs to crawl pages related to that industry and can ignore information from any other industry. The key concern for this kind of crawler is that system resources are limited: we cannot spend all resources crawling every page and then pick out the needed pages afterwards, since that greatly reduces resource utilization. The crawler therefore needs to find the pages that best match the topic as early as possible, and preferably avoid fetching unnecessary pages altogether, so that resources are saved. Such crawlers are consequently limited to vertical search sites or vertical industry sites.
However, traditional vertical web crawlers suffer from high system coupling and low efficiency, which is a problem to be solved.
Summary of the invention
The technical task of the invention is to address the above deficiencies by providing a lightweight vertical web crawler system based on the producer-consumer pattern and a method of implementing it.
The technical solution adopted by the invention to solve this technical problem is an implementation method of a lightweight vertical web crawler based on the producer-consumer pattern, the method comprising:
Configuring URL patterns and parsing rules in a configuration file according to actual needs;
Constructing a to-be-consumed URL queue container so that URLs are handled in a producer-consumer pattern;
Configuring a URL parsing interface for the to-be-consumed URL queue container, and defining the parsing rules of the URL parsing interface according to actual needs;
Parsing the URLs passed in through the URL parsing interface and storing the target URLs that match the pattern in a database.
Further, in a preferred method, the configuration file configures multi-threaded URL patterns.
Further, in a preferred method, the specific steps comprise:
S1. Load the configuration file and extract the entry URL from it;
S2. When the to-be-consumed queue is empty (its length is not greater than 0), wait for the producer group to produce; when the queue holds at least one item, the consumer group consumes one item: the to-be-consumed URL is passed to the parsing method together with the parsing rule that matches its URL pattern, yielding more URLs;
S3. Check whether each resulting URL matches the target URL pattern given in the configuration file; if it does, add it to the target URL queue; if not, append it to the tail of the to-be-consumed queue;
S4. The producer group reads the entry URL from the configuration file, produces URLs by passing the relevant parameters to the URL parsing interface, and wakes the consumer group to consume.
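The routing decision in step S3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `route_urls`, the example URLs, and the regular expression `TARGET_PATTERN` are assumptions, and the pattern would normally come from the configuration file.

```python
import re
from collections import deque

# Hypothetical target pattern; in the described method it is read from the configuration file.
TARGET_PATTERN = re.compile(r"^https://example\.com/item/\d+$")


def route_urls(urls, pending, targets, pattern=TARGET_PATTERN):
    """Step S3: URLs matching the target pattern go to the target queue;
    all others are appended to the tail of the to-be-consumed queue."""
    for url in urls:
        if pattern.match(url):
            targets.append(url)
        else:
            pending.append(url)


# One URL is already waiting; two more were just obtained by parsing a page.
pending = deque(["https://example.com/list?page=1"])
targets = deque()
route_urls(
    ["https://example.com/item/42", "https://example.com/list?page=2"],
    pending,
    targets,
)
```

After the call, the item URL sits in `targets` and the second list page has joined the tail of `pending`, ready for a later consumer.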
Further, in a preferred method, the method also comprises: when parsing by the URL parsing interface raises an exception, recording the exception count for the URL and adding the URL back to the to-be-consumed queue; when the exception count reaches a set value, moving the URL to an exception queue.
Further, in a preferred method, the to-be-consumed queue container is a singleton that provides a producer group interface and a consumer group interface for messages, and the interface methods are synchronized methods.
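A singleton queue container with synchronized producer and consumer methods might look like the sketch below. The class name `PendingUrlQueue` and the method names `produce` and `consume` are assumptions for illustration; the source only states that the container is a singleton with synchronized interface methods.

```python
import threading
from collections import deque


class PendingUrlQueue:
    """Singleton container for URLs waiting to be consumed (illustrative names)."""

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        # Guard creation so concurrent callers still share one instance.
        with cls._instance_lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._items = deque()
                inst._cond = threading.Condition()
                cls._instance = inst
        return cls._instance

    def produce(self, url):
        """Producer interface: enqueue a URL and wake one waiting consumer."""
        with self._cond:
            self._items.append(url)
            self._cond.notify()

    def consume(self, timeout=None):
        """Consumer interface: block until a URL is available.
        Returns None if the timeout expires first."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            return self._items.popleft()


q1 = PendingUrlQueue()
q2 = PendingUrlQueue()
q1.produce("https://example.com/list?page=1")
first = q2.consume(timeout=1)
```

Because both methods take the same condition lock, a URL cannot be lost or handed to two consumers when several threads call the interfaces at once.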
A lightweight vertical web crawler system based on the producer-consumer pattern comprises a configuration file module that is loaded at system startup and a singleton to-be-consumed queue module;
The configuration file module is used to configure URL patterns and the parsing rules corresponding to those URL patterns according to actual needs;
The to-be-consumed queue module comprises a producer group interface, a consumer group interface, and a URL parsing interface that parses messages for the producer group interface and the consumer group interface; the parsing rules of the URL parsing interface are defined according to actual needs.
Further, in a preferred structure, the system also comprises an exception monitoring module. When parsing by the URL parsing interface raises an exception, the exception monitoring module records the exception count and adds the URL back to the to-be-consumed queue module; when the exception count reaches a set value, the URL is moved to the exception queue.
Compared with the prior art, the lightweight vertical web crawler system based on the producer-consumer pattern and its implementation method use a producer group and a consumer group to execute crawl tasks on multiple threads, which improves the efficiency of data collection. The lightweight component is configured mainly through a configuration file that sets parameters such as URL patterns and the corresponding parsing rules; once the component is created and started, it obtains the corresponding target URLs and stores them in the database. This improves the scalability of the system and reduces its degree of coupling.
Brief description of the drawings
The invention is further described below with reference to the accompanying drawing.
Figure 1 is a functional block diagram of the implementation method of the lightweight vertical web crawler based on the producer-consumer pattern.
Specific embodiment
The present invention will be further explained below with reference to the attached drawings and specific examples.
A uniform resource locator (URL) is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating where the file is located and how a browser should handle it.
The invention is an implementation method of a lightweight vertical web crawler based on the producer-consumer pattern. A vertical web crawler repeatedly produces URLs by parsing pages and parses pages by consuming URLs, continuing this produce-consume cycle until the required data is obtained. When a project needs several customized web crawlers, or one web crawler has several similar crawl tasks, implementing the crawling process on top of a produce-consume scheme that involves multi-thread scheduling is somewhat tedious. Extracting this process into a reusable component makes it easier to implement crawling and data collection efficiently, improves system scalability, and reduces system complexity.
Embodiment 1:
This patent involves technologies such as multi-thread scheduling, message queues, the producer-consumer pattern, and web page parsing, and can be applied in technical fields such as small vertical web crawlers and data extraction.
Specific embodiment:
Construct the to-be-consumed URL queue container as a singleton that provides producer and consumer interfaces for messages. Other modules can call these interfaces to produce and consume messages. The interface methods are synchronized methods, which prevents messages from being lost because of concurrency problems.
Construct the producer class and consumer class that parse pages, and expose a page parsing interface to which the URL to be parsed and the corresponding parsing rule are passed.
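A page parsing interface of this kind could be sketched as below. This is only an assumption about its shape: the function `parse_page`, the helper class `_LinkCollector`, and the choice of a regular expression as the "parsing rule" are illustrative, and the page HTML is passed in directly so the example needs no network access.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_page(base_url, html, rule):
    """Return the absolute links in `html` whose URL matches the regex `rule`.
    `base_url` is the URL being parsed; it resolves relative links."""
    collector = _LinkCollector()
    collector.feed(html)
    links = [urljoin(base_url, href) for href in collector.hrefs]
    return [url for url in links if re.search(rule, url)]


# Hypothetical page content and rule.
sample = '<a href="/item/1">one</a> <a href="/about">about</a> <a href="item/2">two</a>'
found = parse_page("https://example.com/list/", sample, r"/item/\d+$")
```

Keeping the rule as a parameter is what lets the same interface serve different URL patterns, each with its own rule from the configuration file.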
When the component is created, load the configuration file and extract from it the start URL (entry URL), the matching pattern for target URLs, the parsing rules between different URL patterns, the number of queue messages, the database, and other parameter information.
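The source does not specify a configuration file format. As one possible shape, the sketch below assumes a JSON document; every key name (`entry_urls`, `target_pattern`, `parse_rules`, `queue_size`, `database`) and every value is hypothetical.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class CrawlerConfig:
    entry_urls: list      # start (entry) URLs
    target_pattern: str   # regex that a final target URL must match
    parse_rules: dict     # URL pattern -> rule used to parse pages of that pattern
    queue_size: int       # maximum number of queued messages
    database: str         # where target URLs are stored


# Raw string so the JSON escape sequences reach the JSON parser unchanged.
SAMPLE_CONFIG = r"""
{
  "entry_urls": ["https://example.com/list?page=1"],
  "target_pattern": "^https://example\\.com/item/\\d+$",
  "parse_rules": {"/list\\?page=\\d+": "/item/\\d+$"},
  "queue_size": 1000,
  "database": "sqlite:///targets.db"
}
"""


def load_config(text):
    """Parse the configuration and fail early if any pattern is not a valid regex."""
    cfg = CrawlerConfig(**json.loads(text))
    re.compile(cfg.target_pattern)
    for url_pattern, rule in cfg.parse_rules.items():
        re.compile(url_pattern)
        re.compile(rule)
    return cfg


config = load_config(SAMPLE_CONFIG)
```

Validating the patterns at load time means a malformed rule is reported when the component is created rather than in the middle of a crawl.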
When the to-be-consumed URL queue is empty (its length is not greater than 0), wait for the producer to produce; otherwise consume one item: pass the to-be-consumed URL to the parsing method together with the parsing rule that matches its pattern to obtain more URLs, then check whether each URL matches the target URL pattern given in the configuration file. If it does, add it to the final target URL queue; otherwise append it to the tail of the to-be-consumed URL queue.
The producer group starts by reading the start URL from the configuration file (there may be one or several), produces URLs by passing the relevant parameters to the parsing class interface, and wakes the consumers to consume.
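The interplay between the producer and the consumer threads can be sketched with a condition variable, as below. Everything here is an illustrative assumption: `FAKE_SITE` stands in for real page parsing, URLs starting with `item/` play the role of target URLs, and the `in_flight` counter is added only so the demonstration threads can terminate cleanly.

```python
import threading
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links its page contains.
FAKE_SITE = {
    "entry/1": ["list/1", "list/2"],
    "list/1": ["item/1", "item/2"],
    "list/2": ["item/3"],
}

pending = deque()   # to-be-consumed URL queue
targets = []        # final target URLs
cond = threading.Condition()
state = {"producer_done": False, "in_flight": 0}


def producer(entry_urls):
    """Parse each entry URL, enqueue the produced URLs, then wake the consumers."""
    with cond:
        for url in entry_urls:
            pending.extend(FAKE_SITE.get(url, []))
        state["producer_done"] = True
        cond.notify_all()


def consumer():
    while True:
        with cond:
            while not pending:
                # Stop once nothing is queued and nobody can add more work.
                if state["producer_done"] and state["in_flight"] == 0:
                    return
                cond.wait()
            url = pending.popleft()
            state["in_flight"] += 1
        links = FAKE_SITE.get(url, [])  # "parse" outside the lock
        with cond:
            for link in links:
                if link.startswith("item/"):
                    targets.append(link)
                else:
                    pending.append(link)
            state["in_flight"] -= 1
            cond.notify_all()


workers = [threading.Thread(target=consumer) for _ in range(3)]
for worker in workers:
    worker.start()
feeder = threading.Thread(target=producer, args=(["entry/1"],))
feeder.start()
feeder.join()
for worker in workers:
    worker.join()
```

Consumers that start before the producer simply wait on the condition; the producer's `notify_all` call is the wake-up described in the text.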
When URL parsing raises an exception, record the exception count and add the URL back to the to-be-consumed queue; when the exception count reaches the configured number of attempts, move the URL to the exception queue.
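The retry handling can be sketched as follows. The limit `MAX_ATTEMPTS`, the function `consume_once`, and the deliberately failing `flaky_parse` are assumptions made for the example; the source only says that a configured attempt count triggers the move to the exception queue.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical configured limit


def consume_once(pending, failed, failures, parse):
    """Consume one URL. On a parse error, count the failure and either
    re-queue the URL or move it to the exception queue."""
    url = pending.popleft()
    try:
        return parse(url)
    except Exception:
        failures[url] = failures.get(url, 0) + 1
        if failures[url] >= MAX_ATTEMPTS:
            failed.append(url)    # give up: exception queue
        else:
            pending.append(url)   # retry later: back to the tail
        return []


def flaky_parse(url):
    """Stand-in parser that always fails for the URL "bad"."""
    if url == "bad":
        raise ValueError("cannot parse")
    return [url + "/child"]


pending = deque(["bad", "good"])
failed = deque()
failures = {}
results = []
while pending:
    results.extend(consume_once(pending, failed, failures, flaky_parse))
```

Re-queuing at the tail lets healthy URLs make progress between retries, and the cap keeps a permanently broken URL from circulating forever.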
A person skilled in the art can readily implement the invention from the above specific embodiments. It should be understood, however, that the invention is not limited to these specific embodiments. On the basis of the disclosed embodiments, a person skilled in the art can combine different technical features as desired to realize different technical solutions.
Claims (7)
1. An implementation method of a lightweight vertical web crawler based on the producer-consumer pattern, characterized in that the method comprises:
Configuring URL patterns and parsing rules in a configuration file according to actual needs;
Constructing a to-be-consumed URL queue container so that URLs are handled in a producer-consumer pattern;
Configuring a URL parsing interface for the to-be-consumed URL queue container, and defining the parsing rules of the URL parsing interface according to actual needs;
Parsing the URLs passed in through the URL parsing interface and storing the target URLs that match the pattern in a database.
2. The implementation method of a lightweight vertical web crawler based on the producer-consumer pattern according to claim 1, characterized in that the configuration file configures multi-threaded URL patterns.
3. The implementation method of a lightweight vertical web crawler based on the producer-consumer pattern according to claim 1, characterized in that the specific steps comprise:
S1. Load the configuration file and extract the entry URL from it;
S2. When the to-be-consumed queue is empty (its length is not greater than 0), wait for the producer group to produce; when the queue holds at least one item, the consumer group consumes one item: the to-be-consumed URL is passed to the parsing method together with the parsing rule that matches its URL pattern, yielding more URLs;
S3. Check whether each resulting URL matches the target URL pattern given in the configuration file; if it does, add it to the target URL queue; if not, append it to the tail of the to-be-consumed queue;
S4. The producer group reads the entry URL from the configuration file, produces URLs by passing the relevant parameters to the URL parsing interface, and wakes the consumer group to consume.
4. The implementation method of a lightweight vertical web crawler based on the producer-consumer pattern according to claim 3, characterized in that the method further comprises: when parsing by the URL parsing interface raises an exception, recording the exception count and adding the URL back to the to-be-consumed queue; when the exception count reaches a set value, moving the URL to an exception queue.
5. The implementation method of a lightweight vertical web crawler based on the producer-consumer pattern according to claim 1, characterized in that the to-be-consumed queue container is a singleton that provides a producer group interface and a consumer group interface for messages, and the interface methods are synchronized methods.
6. A lightweight vertical web crawler system based on the producer-consumer pattern, characterized in that it comprises a configuration file module that is loaded at system startup and a singleton to-be-consumed queue module;
The configuration file module is used to configure URL patterns and the parsing rules corresponding to those URL patterns according to actual needs;
The to-be-consumed queue module comprises a producer group interface, a consumer group interface, and a URL parsing interface that parses messages for the producer group interface and the consumer group interface; the parsing rules of the URL parsing interface are defined according to actual needs.
7. The implementation method of a lightweight vertical web crawler based on the producer-consumer pattern according to claim 6, characterized in that it further comprises an exception monitoring module; when parsing by the URL parsing interface raises an exception, the exception monitoring module records the exception count and adds the URL back to the to-be-consumed queue module; when the exception count reaches a set value, the URL is moved to the exception queue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810763260.2A (published as CN109063019A) | 2018-07-12 | 2018-07-12 | Implementation method of a lightweight vertical web crawler based on the producer-consumer pattern |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109063019A (en) | 2018-12-21 |
Family
ID=64816152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810763260.2A (Pending; published as CN109063019A) | Implementation method of a lightweight vertical web crawler based on the producer-consumer pattern | 2018-07-12 | 2018-07-12
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109063019A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111404643A (en) * | 2020-03-10 | 2020-07-10 | 山东汇贸电子口岸有限公司 | Data receiving and transmitting processing method based on message queue |
CN113362144A (en) * | 2021-07-19 | 2021-09-07 | 海南炳祥投资咨询有限公司 | E-commerce shopping recommendation method and system based on big data |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677918A (en) * | 2016-03-03 | 2016-06-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
CN107943991A (en) * | 2017-12-01 | 2018-04-20 | 成都嗨翻屋文化传播有限公司 | A kind of distributed reptile frame and implementation method based on memory database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20190715. Address after: 214029 No. 999 Gaolang East Road, Binhu District, Wuxi City, Jiangsu Province (Software Development Building) 707. Applicant after: Chaozhou Zhuoshu Big Data Industry Development Co.,Ltd. Address before: 250100 S06 Floor, No. 1036 Tidal Road, Jinan High-tech Zone, Shandong Province. Applicant before: SHANDONG HUIMAO ELECTRONIC PORT Co.,Ltd. |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20181221 |