CN109063216A - A kind of distributed vertical service search crawler frame - Google Patents
- Publication number
- CN109063216A (application CN201811208977.7A)
- Authority
- CN
- China
- Prior art keywords
- url
- crawler
- message queue
- consolidated storage
- downloading task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a distributed vertical business search crawler framework with the following steps. Step 1: using static crawler distribution, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are then issued from different IP resources. Step 2: target page URLs are aggregated into a central repository through a message queue pipeline. Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple network computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks. By using multiple IP resources to distribute crawl tasks sensibly across several networked computers, the invention provides a low-cost solution for resource-intensive web crawlers; a distributed crawler is therefore a good technical means of achieving online business search and continuous, uninterrupted crawling.
Description
Technical field
The present invention relates to the technical field of search crawler frameworks, and in particular to a distributed vertical business search crawler framework.
Background
Most websites adopt some preventive strategy against web crawlers, so that overly frequent requests do not consume excessive network and I/O resources and degrade site performance. To cope with such anti-crawler rules, a crawler needs multiple IP resources so that crawl tasks can be distributed sensibly across several networked computers. Public clouds are increasingly common and the cost of network resources keeps falling, which offers a low-cost solution for resource-intensive web crawlers. A distributed crawler is therefore a good technical means of achieving online business search and continuous, uninterrupted crawling.
It is therefore necessary to devise a distributed vertical business search crawler framework to address the above problems.
Summary of the invention
The purpose of the present invention is to provide a distributed vertical business search crawler framework. Using static crawler distribution, crawlers for the same target are copied to different network computers according to configuration and issue crawl requests from different IP resources. Target page URLs are aggregated into a central repository through a message queue pipeline; the central repository's loader program pushes the URLs through the message queue pipeline to network computer terminals, and listening crawlers execute the URL download tasks. The invention thus uses multiple IP resources to distribute crawl tasks sensibly across several networked computers. Because network resources are inexpensive, this provides a low-cost solution for resource-intensive web crawlers, so a distributed crawler is a good technical means of achieving online business search and continuous, uninterrupted crawling, which solves the problems raised in the background above.
To achieve the above object, the invention provides the following technical solution: a distributed vertical business search crawler framework, with the following steps:
Step 1: using static crawler distribution, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are then issued from different IP resources;
Step 2: target page URLs are aggregated into the central repository through the message queue pipeline;
Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple network computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks;
Step 4: when access to a URL is not refused, the page is indexed and passed to the central repository, and the central repository marks the URL as executed;
Step 5: when a network computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository schedules the URL again, returns it to the message queue, and pushes it to other network computer terminals, where listening crawlers again dynamically schedule and execute the URL download task;
Step 6: when the message that a URL download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding network computer terminal are recorded together; the next time this URL is searched, that terminal is not called on to execute the URL download task.
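Steps 3 to 6 can be sketched in a few lines of Python. This is a minimal single-process illustration, not the patented implementation: the class `CentralRepository`, its attribute names, and the caller-supplied `download(terminal, url)` function are all hypothetical, and the message queue pipeline is reduced to an in-memory deque.

```python
from collections import defaultdict, deque


class CentralRepository:
    """Hypothetical in-memory stand-in for the central repository."""

    def __init__(self, terminals):
        self.terminals = list(terminals)
        self.pending = deque()             # URLs waiting to be scheduled
        self.done = set()                  # URLs marked as executed (step 4)
        self.excluded = defaultdict(set)   # url -> terminals that failed on it (step 6)

    def submit(self, url):
        self.pending.append(url)

    def pick_terminal(self, url):
        """Return the first terminal not recorded as failing on this URL."""
        for terminal in self.terminals:
            if terminal not in self.excluded[url]:
                return terminal
        return None  # every known terminal has already failed on this URL

    def run(self, download):
        """Schedule pending URLs until the queue drains.

        `download(terminal, url)` is supplied by the caller and returns True
        on success, False when the terminal cannot execute the task.
        """
        unreachable = []
        while self.pending:
            url = self.pending.popleft()
            terminal = self.pick_terminal(url)
            if terminal is None:
                unreachable.append(url)
                continue
            if download(terminal, url):
                self.done.add(url)                 # step 4: mark as executed
            else:
                self.excluded[url].add(terminal)   # step 6: remember the failing pair
                self.pending.append(url)           # step 5: schedule the URL again
        return unreachable
```

A URL that fails on one terminal goes back to the end of the queue and is offered only to terminals that have not yet failed on it; URLs that no terminal can fetch are returned to the caller rather than retried forever, which is a design choice of this sketch and not something the text specifies.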
Preferably, the message queue is managed and scheduled centrally by the central repository. When a crawler on a network computer discovers a URL, it pushes the URL to the central repository through the message queue. The central repository checks whether the URL has already been downloaded: if so, the URL is discarded as a duplicate; if not, the central repository adds the new URL to the queue of messages to be crawled, from which network computer terminals execute URL download tasks. When a terminal fails to execute a URL download task, the URL is returned to the message queue and the listening crawlers dynamically schedule another download attempt. When a terminal completes a URL download task, the central repository records the URL as crawled, which avoids repeated crawling.
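The deduplication and state tracking described in this paragraph might look like the following sketch. The names `UrlFrontier`, `State`, `discover`, `next_task`, and `report` are illustrative assumptions. The sketch also discards URLs that are already queued, not only those already downloaded, which goes slightly beyond what the text states.

```python
from collections import deque
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    CRAWLED = "crawled"


class UrlFrontier:
    """Hypothetical central-repository view of every URL it has seen."""

    def __init__(self):
        self.state = {}          # url -> State
        self.to_crawl = deque()  # queue of URLs waiting to be downloaded

    def discover(self, url):
        """Called when a crawler reports a URL; True if it was newly queued."""
        if url in self.state:
            return False         # already known: discard as a duplicate
        self.state[url] = State.QUEUED
        self.to_crawl.append(url)
        return True

    def next_task(self):
        """Hand the next URL to a terminal, or None when the queue is empty."""
        return self.to_crawl.popleft() if self.to_crawl else None

    def report(self, url, ok):
        """Record the outcome of a download task."""
        if ok:
            self.state[url] = State.CRAWLED   # completed: never crawled again
        else:
            self.to_crawl.append(url)         # failed: back into the queue
```

Keeping the state in a dictionary keyed by URL makes the duplicate check a constant-time lookup; a production system would more plausibly keep this set in a shared store, which the patent does not specify.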
Technical effects and advantages of the invention:
The invention uses static crawler distribution to copy crawlers for the same target to different network computers according to configuration, and issues crawl requests from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline; the central repository's loader program pushes the URLs through the message queue pipeline to network computer terminals, and listening crawlers execute the URL download tasks. The invention thus uses multiple IP resources to distribute crawl tasks sensibly across several networked computers. Because network resources are inexpensive, this provides a low-cost solution for resource-intensive web crawlers, so a distributed crawler is a good technical means of achieving online business search and continuous, uninterrupted crawling.
Brief description of the drawings
Fig. 1 is an overall structural diagram of the invention.
Detailed description of embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment 1:
A distributed vertical business search crawler framework as shown in Fig. 1, with the following steps:
Step 1: using static crawler distribution, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are then issued from different IP resources;
Step 2: target page URLs are aggregated into the central repository through the message queue pipeline;
Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple network computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks;
Step 4: when access to a URL is not refused, the page is indexed and passed to the central repository, and the central repository marks the URL as executed;
Step 5: when a network computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository schedules the URL again, returns it to the message queue, and pushes it to other network computer terminals, where listening crawlers again dynamically schedule and execute the URL download task;
Step 6: when the message that a URL download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding network computer terminal are recorded together; the next time this URL is searched, that terminal is not called on to execute the URL download task.
Using static crawler distribution, crawlers for the same target are copied to different network computers according to configuration and issue crawl requests from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline; the central repository's loader program pushes the data through the message queue pipeline to network computer terminals, and listening crawlers execute the URL download tasks. The invention thus uses multiple IP resources to distribute crawl tasks sensibly across several networked computers. Because network resources are inexpensive, this provides a low-cost solution for resource-intensive web crawlers, so a distributed crawler is a good technical means of achieving online business search and continuous, uninterrupted crawling.
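The static distribution of step 1 can be pictured as expanding a placement configuration into one crawler copy per network computer. The sketch below is only illustrative: the configuration shape, the `CrawlerInstance` record, and the rule that each copy of a target must use a distinct IP are assumptions made for the example, and the host names and addresses are placeholders from documentation ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerInstance:
    target: str        # site the crawler is written for
    host: str          # network computer that receives the copy
    outbound_ip: str   # IP address this copy issues crawl requests from


def plan_static_distribution(config):
    """Expand {target: [(host, ip), ...]} into one crawler copy per host."""
    instances = []
    for target, placements in config.items():
        seen_ips = set()
        for host, ip in placements:
            if ip in seen_ips:
                # Two copies behind one IP would defeat the purpose of step 1.
                raise ValueError(f"{target}: IP {ip} is assigned twice")
            seen_ips.add(ip)
            instances.append(CrawlerInstance(target, host, ip))
    return instances


# Hypothetical placement: one target crawled from two machines.
example_config = {
    "shop.example": [("host-a", "192.0.2.10"), ("host-b", "198.51.100.20")],
}
example_plan = plan_static_distribution(example_config)
```

The distribution is static in the sense that the plan is fixed by configuration before crawling starts; the dynamic part of the framework is the later scheduling of URLs, not the placement of crawler copies.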
Embodiment 2:
A distributed vertical business search crawler framework as shown in Fig. 1, in which the message queue pipeline comprises two message queue managers. One serves as the primary message queue manager; the other is a standby that is enabled when the primary message queue manager fails.
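A primary/standby arrangement of this kind can be sketched as follows. The classes `QueueManager` and `FailoverPipeline` are hypothetical, and a failure is simulated with a simple `healthy` flag; the patent does not say how a fault in the primary manager is detected.

```python
import queue


class QueueManager:
    """Hypothetical message queue manager with a simulated health flag."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self._q = queue.Queue()

    def put(self, item):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self._q.put(item)

    def get(self):
        # Raises queue.Empty if nothing has been published to this manager.
        return self._q.get_nowait()


class FailoverPipeline:
    """Publish to the primary manager, falling back to the standby on failure."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def publish(self, url):
        try:
            self.primary.put(url)
            return self.primary.name
        except ConnectionError:
            self.standby.put(url)      # standby is enabled only when the primary fails
            return self.standby.name
```

`publish` returns the name of the manager that accepted the URL, so a caller can log when traffic has switched to the standby.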
Embodiment 3:
A distributed vertical business search crawler framework as shown in Fig. 1, in which the message queue pipeline comprises three message queue managers. One serves as the primary message queue manager; the other two are enabled when the primary message queue manager fails or becomes congested, serving as the standby and the relief manager respectively.
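One way to read the three-manager layout is as a routing rule: use the standby when the primary is down, the relief manager when the primary is congested, and the primary otherwise. The sketch below assumes that congestion means the primary's backlog has reached a threshold; the `congestion_limit` parameter and the `Manager` class are inventions of the example, since the text does not define how congestion is measured.

```python
from collections import deque


class Manager:
    """Hypothetical queue manager: a name, an up/down flag, and a backlog."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.items = deque()


def route(url, primary, standby, relief, congestion_limit):
    """Pick a queue manager for one URL under the three-manager layout."""
    if not primary.up:
        target = standby                         # fault: switch to the standby
    elif len(primary.items) >= congestion_limit:
        target = relief                          # congestion: divert to the relief manager
    else:
        target = primary
    target.items.append(url)
    return target.name
```

With `congestion_limit=2`, the first two URLs land on the primary and the third is diverted to the relief manager; once `primary.up` is set to `False`, every URL goes to the standby regardless of backlog.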
Embodiment 4:
A distributed vertical business search crawler framework as shown in Fig. 1, in which the message queue pipeline comprises more than three message queue managers. One serves as the primary message queue manager, and one serves as the standby for when the primary message queue manager fails. The remaining message queue managers are enabled when the primary message queue manager becomes congested; they are allocated evenly according to the number of crawlers and serve as relief managers.
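The even allocation according to the number of crawlers can be illustrated with a round-robin assignment, which keeps the number of crawlers per relief manager within one of each other. The function name and the identifiers below are hypothetical, and round-robin is only one reasonable reading of "allocated evenly".

```python
def assign_crawlers(crawler_ids, relief_queues):
    """Spread crawlers round-robin so queue sizes differ by at most one."""
    if not relief_queues:
        raise ValueError("at least one relief queue is required")
    assignment = {name: [] for name in relief_queues}
    for index, crawler in enumerate(crawler_ids):
        queue_name = relief_queues[index % len(relief_queues)]
        assignment[queue_name].append(crawler)
    return assignment


# Hypothetical example: five crawlers shared between two relief managers.
example = assign_crawlers(["c0", "c1", "c2", "c3", "c4"], ["relief-1", "relief-2"])
# example == {"relief-1": ["c0", "c2", "c4"], "relief-2": ["c1", "c3"]}
```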
Working principle of the invention:
Referring to Fig. 1 of the specification: in use, crawlers for the same target are first copied to different network computers according to configuration using static crawler distribution, and crawl requests are then issued from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline. The central repository's loader program then schedules the URLs and pushes them through the message queue pipeline to multiple network computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks. When access to a URL is not refused, the page is indexed and passed to the central repository, and the central repository marks the URL as executed. When a network computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository schedules the URL again and pushes it through the message queue pipeline to other network computer terminals, where listening crawlers dynamically schedule and execute the URL download task, and the result is finally passed to the URL repository. When a URL whose download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding network computer terminal are recorded together; the next time this URL is searched, that terminal is not called on to execute the URL download task. The message queue pipeline comprises two or more message queue managers: one serves as the primary message queue manager, and the others are enabled when the primary message queue manager fails or becomes congested, serving as standby and relief managers.
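The listening crawlers in this description behave like workers blocked on a task queue. The following sketch models them with threads and two in-process queues; `listening_crawler`, `crawl`, and the caller-supplied `fetch` function are hypothetical names, and a real deployment would replace `queue.Queue` with the message queue pipeline between separate machines.

```python
import queue
import threading


def listening_crawler(name, tasks, results, fetch):
    """Listen on the task queue and report each outcome to the result queue."""
    while True:
        url = tasks.get()
        if url is None:              # sentinel: stop listening
            tasks.task_done()
            break
        try:
            ok = fetch(url)          # caller-supplied download function
        except Exception:
            ok = False               # treat any error as a failed download task
        results.put((name, url, ok))
        tasks.task_done()


def crawl(urls, terminal_names, fetch):
    """Push URLs to listening crawlers and collect {url: succeeded}."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=listening_crawler, args=(n, tasks, results, fetch))
        for n in terminal_names
    ]
    for worker in workers:
        worker.start()
    for url in urls:
        tasks.put(url)
    for _ in workers:
        tasks.put(None)              # one sentinel per listening crawler
    for worker in workers:
        worker.join()
    outcomes = {}
    while not results.empty():       # safe: all producers have finished
        _name, url, ok = results.get()
        outcomes[url] = ok
    return outcomes
```

Which terminal picks up which URL is not fixed in advance: whichever listener is free takes the next task, which is the sense in which scheduling here is dynamic. The outcomes that come back with `ok` set to `False` are the ones the central repository would schedule again.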
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.
Claims (2)
1. A distributed vertical business search crawler framework, characterized by the following steps:
Step 1: using static crawler distribution, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are then issued from different IP resources;
Step 2: target page URLs are aggregated into the central repository through the message queue pipeline;
Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple network computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks;
Step 4: when access to a URL is not refused, the page is indexed and passed to the central repository, and the central repository marks the URL as executed;
Step 5: when a network computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository schedules the URL again, returns it to the message queue, and pushes it to other network computer terminals, where listening crawlers again dynamically schedule and execute the URL download task;
Step 6: when the message that a URL download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding network computer terminal are recorded together; the next time this URL is searched, that terminal is not called on to execute the URL download task.
2. The distributed vertical business search crawler framework according to claim 1, characterized in that: the message queue is managed and scheduled centrally by the central repository; when a crawler on a network computer discovers a URL, the URL is pushed to the central repository through the message queue; the central repository checks whether the URL has already been downloaded, discards it as a duplicate if so, and otherwise adds the new URL to the queue of messages to be crawled, from which network computer terminals execute URL download tasks; when a network computer terminal fails to execute a URL download task, the URL is returned to the message queue and the listening crawlers dynamically schedule another download attempt; when a network computer terminal completes a URL download task, the central repository records the URL as crawled, which avoids repeated crawling.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811208977.7A CN109063216A (en) | 2018-10-17 | 2018-10-17 | A kind of distributed vertical service search crawler frame |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811208977.7A CN109063216A (en) | 2018-10-17 | 2018-10-17 | A kind of distributed vertical service search crawler frame |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109063216A true CN109063216A (en) | 2018-12-21 |
Family
ID=64764112
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811208977.7A Pending CN109063216A (en) | 2018-10-17 | 2018-10-17 | A kind of distributed vertical service search crawler frame |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109063216A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111475728A (en) * | 2020-04-07 | 2020-07-31 | 腾讯云计算(北京)有限责任公司 | Cloud resource information searching method, device, equipment and storage medium |
CN111897825A (en) * | 2020-06-01 | 2020-11-06 | 中国人民财产保险股份有限公司 | Distributed transaction processing method and device |
CN113821705A (en) * | 2021-08-30 | 2021-12-21 | 湖南大学 | Webpage content acquisition method, terminal equipment and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102314463A (en) * | 2010-07-07 | 2012-01-11 | 北京瑞信在线系统技术有限公司 | Distributed crawler system and webpage data extraction method for the same |
US20120109927A1 (en) * | 2010-10-29 | 2012-05-03 | Fujitsu Limited | Architecture for distributed, parallel crawling of interactive client-server applications |
CN105260388A (en) * | 2015-09-11 | 2016-01-20 | 广州极数宝数据服务有限公司 | Optimization method of distributed vertical crawler service system |
CN105354337A (en) * | 2015-12-08 | 2016-02-24 | 北京奇虎科技有限公司 | Web crawler implementation method and web crawler system |
CN106021608A (en) * | 2016-06-22 | 2016-10-12 | 广东亿迅科技有限公司 | Distributed crawler system and implementing method thereof |
CN106874487A (en) * | 2017-02-21 | 2017-06-20 | 国信优易数据有限公司 | A kind of distributed reptile management system and its method |
-
2018
- 2018-10-17 CN CN201811208977.7A patent/CN109063216A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111475728A (en) * | 2020-04-07 | 2020-07-31 | 腾讯云计算(北京)有限责任公司 | Cloud resource information searching method, device, equipment and storage medium |
CN111475728B (en) * | 2020-04-07 | 2023-04-07 | 腾讯云计算(北京)有限责任公司 | Cloud resource information searching method, device, equipment and storage medium |
CN111897825A (en) * | 2020-06-01 | 2020-11-06 | 中国人民财产保险股份有限公司 | Distributed transaction processing method and device |
CN113821705A (en) * | 2021-08-30 | 2021-12-21 | 湖南大学 | Webpage content acquisition method, terminal equipment and readable storage medium |
CN113821705B (en) * | 2021-08-30 | 2024-02-20 | 湖南大学 | Webpage content acquisition method, terminal equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106708622B (en) | Cluster resource processing method and system and resource processing cluster | |
US10853046B2 (en) | Deployment of software applications on server clusters | |
US20210294634A1 (en) | Service Creation and Management | |
US10275281B2 (en) | Scheduling jobs for processing log files using a database system | |
US10332129B2 (en) | Methods and systems for processing a log file | |
US11831682B2 (en) | Highly scalable distributed connection interface for data capture from multiple network service and cloud-based sources | |
CN107071009A (en) | A kind of distributed big data crawler system of load balancing | |
CN109063216A (en) | A kind of distributed vertical service search crawler frame | |
DE112010004062T5 (en) | OPTIMIZING AN ARCHIVE MANAGEMENT PLANNING | |
US11106497B2 (en) | Distributed scheduling in a virtual machine environment | |
US10282175B2 (en) | Methods and systems for performing a partial build | |
CN107391775A (en) | A kind of general web crawlers model implementation method and system | |
CN112114950A (en) | Task scheduling method and device and cluster management system | |
US10924334B1 (en) | Monitoring distributed systems with auto-remediation | |
CN103618762A (en) | System and method for enterprise service bus state pretreatment based on AOP | |
US20220350587A1 (en) | Methods and systems for deployment of services | |
CN109614227A (en) | Task resource concocting method, device, electronic equipment and computer-readable medium | |
CN107239563A (en) | Public feelings information dynamic monitoring and controlling method | |
CN108390786A (en) | A kind of business O&M method, apparatus and electronic equipment | |
US11720406B2 (en) | System and method for determining and tracking cloud capacity metrics | |
US20180096003A1 (en) | Merging along object hierarchies | |
CN106254452A (en) | The big data access method of medical treatment under cloud platform | |
US10819557B1 (en) | Systems and methods for selective discovery of services | |
US11805146B2 (en) | System and method for detection promotion | |
Ali et al. | Probabilistic normed load monitoring in large scale distributed systems using mobile agents |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181221 |