CN109063216A - A distributed vertical service search crawler framework - Google Patents

A distributed vertical service search crawler framework

Publication number
CN109063216A
Authority
CN
China
Prior art keywords
url
crawler
message queue
central repository
download task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811208977.7A
Other languages
Chinese (zh)
Inventor
邓炽成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Zhitu Digital Research Information Technology Co Ltd
Original Assignee
Zhuhai Zhitu Digital Research Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-10-17
Filing date
2018-10-17
Publication date
2018-12-21
Application filed by Zhuhai Zhitu Digital Research Information Technology Co Ltd
Priority to CN201811208977.7A
Publication of CN109063216A
Legal status: Pending

Abstract

The invention discloses a distributed vertical service search crawler framework, with the following specific steps. Step 1: using static crawler distribution, crawlers for the same target are copied, according to configuration, onto different networked computers, which then issue crawl requests from different IP resources. Step 2: target page URLs are aggregated into a central repository through a message queue pipeline. Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple networked computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks. By using distributed vertical service search crawlers, the invention draws on multiple IP resources to distribute tasks sensibly and spreads crawl tasks across multiple networked computers. This provides a low-cost solution for resource-intensive web crawlers. Distributed crawling is therefore a sound technical means of achieving online business search and continuous, uninterrupted crawling.

Description

A distributed vertical service search crawler framework
Technical field
The present invention relates to the technical field of search crawler frameworks, and in particular to a distributed vertical service search crawler framework.
Background
Most websites adopt some preventive measures against web crawler behavior, so that excessively frequent requests do not consume too many network and I/O resources and degrade site performance. To work within anti-crawler rules, a crawler needs multiple IP resources to distribute tasks sensibly, spreading crawl tasks across multiple networked computers. Public clouds are now increasingly common and the cost of Internet resources keeps falling, which provides a low-cost solution for resource-intensive web crawlers. Distributed crawling is therefore a sound technical means of achieving online business search and continuous, uninterrupted crawling.
It is therefore necessary to invent a distributed vertical service search crawler framework to solve the above problems.
Summary of the invention
The purpose of the present invention is to provide a distributed vertical service search crawler framework. Using static crawler distribution, crawlers for the same target are copied, according to configuration, onto different networked computers, which issue crawl requests from different IP resources. Target page URLs are aggregated into a central repository through a message queue pipeline; the central repository's loader program pushes the URLs through the message queue pipeline to networked computer terminals, where listening crawlers execute the URL download tasks. By using distributed vertical service search crawlers, the invention draws on multiple IP resources to distribute tasks sensibly and spreads crawl tasks across multiple networked computers. Because the cost of Internet resources is low, this provides a low-cost solution for resource-intensive web crawlers. Distributed crawling is therefore a sound technical means of achieving online business search and continuous, uninterrupted crawling, and it solves the problems raised in the Background above.
To achieve the above purpose, the invention provides the following technical solution: a distributed vertical service search crawler framework, with the following specific steps:
Step 1: using static crawler distribution, crawlers for the same target are copied, according to configuration, onto different networked computers, which then issue crawl requests from different IP resources;
Step 2: target page URLs are aggregated into a central repository through a message queue pipeline;
Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple networked computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks;
Step 4: when access to a URL is not refused, the page is indexed and stored in the central repository, and the central repository marks the URL as having had its task executed;
Step 5: when a networked computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository then schedules the URL again, returns it to the message queue, and pushes it to other networked computer terminals, where listening crawlers dynamically schedule and execute the URL download task;
Step 6: when the message that a URL download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding networked computer terminal are recorded and saved at the same time; the next time this URL is searched, that terminal is not called on to execute the URL download task.
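Steps 5 and 6 can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions, not the patented implementation: the class `FailureLedger`, the function `reschedule`, and the terminal identifiers are hypothetical names, and picking the first eligible terminal is an assumed policy that the text does not specify.

```python
from collections import defaultdict


class FailureLedger:
    """Remembers which terminals could not download which URLs (step 6)."""

    def __init__(self):
        # url -> set of terminal ids that failed on it
        self._failed = defaultdict(set)

    def record_failure(self, url, terminal):
        self._failed[url].add(terminal)

    def eligible(self, url, terminals):
        """Terminals with no recorded failure for this URL, in input order."""
        blocked = self._failed.get(url, set())
        return [t for t in terminals if t not in blocked]


def reschedule(url, failed_terminal, terminals, ledger):
    """Step 5: record the failure, then pick another terminal for the URL.

    Returns the chosen terminal id, or None when every terminal has failed.
    Choosing the first eligible terminal is an assumption of this sketch.
    """
    ledger.record_failure(url, failed_terminal)
    candidates = ledger.eligible(url, terminals)
    return candidates[0] if candidates else None
```

In this sketch a URL that has failed on every terminal is returned as `None`, leaving it to the caller to decide whether to drop or retry it later.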
Preferably, the message queue is uniformly managed and scheduled by the central repository. When a crawler on a networked computer discovers a URL, it pushes the URL to the central repository through the message queue. The central repository deduplicates it by checking whether the URL has already been downloaded: if so, the URL is discarded; if not, the central repository adds the new URL to the message queue of URLs to be crawled, for a networked computer terminal to execute the URL download task. When a terminal fails to execute a URL download task, the URL is returned to the message queue and downloaded again under dynamic scheduling by the listening crawlers. When a terminal completes a URL download task, the central repository records the URL as crawled, which avoids repeated crawling.
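The deduplication and state tracking described in this paragraph can be sketched as follows. This is a minimal in-memory illustration, assuming a single process: `CentralRepository`, `UrlState`, and all method names are hypothetical, a `deque` stands in for the message queue, and discarding URLs that are already queued (not only those already downloaded) is an added assumption.

```python
from collections import deque
from enum import Enum


class UrlState(Enum):
    PENDING = "pending"   # queued, not yet downloaded
    CRAWLED = "crawled"   # download finished


class CentralRepository:
    """In-memory stand-in for the central repository and its queue."""

    def __init__(self):
        self.state = {}          # url -> UrlState
        self.to_crawl = deque()  # stand-in for the message queue

    def submit(self, url):
        """Deduplicate a discovered URL; return True if it was queued."""
        if url in self.state:
            return False         # already queued or crawled: discard
        self.state[url] = UrlState.PENDING
        self.to_crawl.append(url)
        return True

    def next_task(self):
        """Hand the next URL to a terminal, or None when the queue is empty."""
        return self.to_crawl.popleft() if self.to_crawl else None

    def report_failure(self, url):
        """A failed download goes back onto the queue for another attempt."""
        self.to_crawl.append(url)

    def report_success(self, url):
        """Mark the URL as crawled so it is never queued again."""
        self.state[url] = UrlState.CRAWLED
```

A real deployment would presumably back the state table and queue with shared services so that all terminals see the same view; the sketch only shows the control flow.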
Technical effects and advantages of the invention:
The invention uses static crawler distribution to copy crawlers for the same target, according to configuration, onto different networked computers, which issue crawl requests from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline; the central repository's loader program pushes the URLs through the message queue pipeline to networked computer terminals, where listening crawlers execute the URL download tasks. By using distributed vertical service search crawlers, the invention draws on multiple IP resources to distribute tasks sensibly and spreads crawl tasks across multiple networked computers. Because the cost of Internet resources is low, this provides a low-cost solution for resource-intensive web crawlers. Distributed crawling is therefore a sound technical means of achieving online business search and continuous, uninterrupted crawling.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall structure of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention, without creative effort, fall within the protection scope of the invention.
Embodiment 1:
A distributed vertical service search crawler framework as shown in Fig. 1, with the following specific steps:
Step 1: using static crawler distribution, crawlers for the same target are copied, according to configuration, onto different networked computers, which then issue crawl requests from different IP resources;
Step 2: target page URLs are aggregated into a central repository through a message queue pipeline;
Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple networked computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks;
Step 4: when access to a URL is not refused, the page is indexed and stored in the central repository, and the central repository marks the URL as having had its task executed;
Step 5: when a networked computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository then schedules the URL again, returns it to the message queue, and pushes it to other networked computer terminals, where listening crawlers dynamically schedule and execute the URL download task;
Step 6: when the message that a URL download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding networked computer terminal are recorded and saved at the same time; the next time this URL is searched, that terminal is not called on to execute the URL download task.
By using static crawler distribution, crawlers for the same target are copied, according to configuration, onto different networked computers, which issue crawl requests from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline; the central repository's loader program pushes the data through the message queue pipeline to networked computer terminals, where listening crawlers execute the URL download tasks. By using distributed vertical service search crawlers, the invention draws on multiple IP resources to distribute tasks sensibly and spreads crawl tasks across multiple networked computers. Because the cost of Internet resources is low, this provides a low-cost solution for resource-intensive web crawlers. Distributed crawling is therefore a sound technical means of achieving online business search and continuous, uninterrupted crawling.
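One way to picture the static distribution in step 1 is as a configuration that is expanded into one crawler copy per machine. The sketch below is an assumption-laden illustration, not the format the patent uses: `CrawlerCopy`, `plan_static_distribution`, `CONFIG`, the host names, and the IP addresses (taken from the 192.0.2.0/24 documentation range) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerCopy:
    target: str      # site the crawler is written for
    host: str        # networked computer that runs this copy
    egress_ip: str   # IP address the copy sends crawl requests from


def plan_static_distribution(config):
    """Expand {target: [(host, ip), ...]} into one CrawlerCopy per host."""
    copies = []
    for target, placements in config.items():
        for host, ip in placements:
            copies.append(CrawlerCopy(target, host, ip))
    return copies


# Hypothetical configuration; addresses come from a documentation range.
CONFIG = {
    "example.com": [
        ("node-a", "192.0.2.10"),
        ("node-b", "192.0.2.11"),
        ("node-c", "192.0.2.12"),
    ],
}
```

Calling `plan_static_distribution(CONFIG)` yields three copies of the same crawler, each bound to a different machine and outgoing IP address, which is the property step 1 relies on.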
Embodiment 2:
A distributed vertical service search crawler framework as shown in Fig. 1, in which the message queue pipeline includes two message queue managers: one serves as the primary message queue manager, and the other is a standby that is enabled when the primary message queue manager fails.
Embodiment 3:
A distributed vertical service search crawler framework as shown in Fig. 1, in which the message queue pipeline includes three message queue managers: one serves as the primary message queue manager, and the other two are enabled when the primary message queue manager fails and when it becomes congested, respectively, serving as standby and as congestion relief.
Embodiment 4:
A distributed vertical service search crawler framework as shown in Fig. 1, in which the message queue pipeline includes more than three message queue managers: one serves as the primary message queue manager, one serves as the standby for when the primary message queue manager fails, and the remaining message queue managers are enabled when the primary message queue manager becomes congested, are distributed evenly according to the number of crawlers, and serve as congestion relief.
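The primary, standby, and relief roles in Embodiments 2 to 4 can be sketched as a small selection routine. This is a hedged illustration only: `QueueManager`, `active_managers`, and `assign_crawlers` are hypothetical names, the health and congestion flags are assumed to be supplied by some external monitor, and keeping the primary in service alongside the relief managers during congestion is an assumption the text does not settle.

```python
from dataclasses import dataclass


@dataclass
class QueueManager:
    name: str
    healthy: bool = True
    congested: bool = False


def active_managers(primary, standby, relief):
    """Return the managers that should carry traffic right now."""
    if not primary.healthy:
        return [standby]                  # failure: switch to the standby
    if primary.congested:
        return [primary] + list(relief)   # congestion: enable relief managers
    return [primary]


def assign_crawlers(crawlers, managers):
    """Spread crawlers evenly over the managers, round-robin."""
    assignment = {m.name: [] for m in managers}
    for i, crawler in enumerate(crawlers):
        assignment[managers[i % len(managers)].name].append(crawler)
    return assignment
```

Round-robin assignment is one simple reading of "distributed evenly according to the number of crawlers"; other balancing rules would fit the text equally well.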
Working principle of the invention:
Referring to Fig. 1 of the specification, in use, static crawler distribution is first used to copy crawlers for the same target, according to configuration, onto different networked computers, which then issue crawl requests from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline. The central repository's loader program then schedules the URLs and pushes them through the message queue pipeline to multiple networked computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks. When access to a URL is not refused, the page is indexed and stored in the central repository, and the central repository marks the URL as having had its task executed. When a networked computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository schedules the URL again and pushes it through the message queue pipeline to other networked computer terminals, where listening crawlers dynamically schedule and execute the URL download task, and the URL is finally stored in the repository. When a URL whose download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding networked computer terminal are recorded and saved at the same time; the next time this URL is searched, that terminal is not called on to execute the URL download task.
The message queue pipeline includes two or more message queue managers: one serves as the primary message queue manager, and the others are enabled when the primary message queue manager fails or becomes congested, serving as standby and as congestion relief.
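The behavior of a single listening crawler in this flow can be sketched with Python's standard `queue` module. This is an illustrative single-process sketch, not the patented system: `listening_crawler` and `fake_fetch` are hypothetical names, the two `queue.Queue` objects stand in for the message queue pipeline in each direction, and the download function is injected so the example needs no network access.

```python
import queue


def listening_crawler(name, tasks, results, fetch):
    """Drain the task queue, reporting each outcome to the results queue.

    `fetch(url)` is a caller-supplied download function that returns the
    page body or raises an exception when access is refused.
    """
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return                     # nothing left to download
        try:
            body = fetch(url)
            results.put((name, url, "ok", body))
        except Exception as exc:       # refused or failed download
            results.put((name, url, "failed", str(exc)))


def fake_fetch(url):
    """Stand-in downloader: refuses any URL containing 'blocked'."""
    if "blocked" in url:
        raise PermissionError("access refused")
    return f"<html>{url}</html>"
```

The `"failed"` results are what the central repository would use to reschedule a URL on another terminal, and the `"ok"` results are what it would index and mark as executed.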
Finally, it should be noted that the foregoing are only preferred embodiments of the invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art may still modify the technical solutions described in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its protection scope.

Claims (2)

1. A distributed vertical service search crawler framework, characterized in that the specific steps are as follows:
Step 1: using static crawler distribution, crawlers for the same target are copied, according to configuration, onto different networked computers, which then issue crawl requests from different IP resources;
Step 2: target page URLs are aggregated into a central repository through a message queue pipeline;
Step 3: the central repository's loader program schedules the URLs and pushes them through the message queue pipeline to multiple networked computer terminals, where listening crawlers dynamically schedule and execute the URL download tasks;
Step 4: when access to a URL is not refused, the page is indexed and stored in the central repository, and the central repository marks the URL as having had its task executed;
Step 5: when a networked computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository then schedules the URL again, returns it to the message queue, and pushes it to other networked computer terminals, where listening crawlers dynamically schedule and execute the URL download task;
Step 6: when the message that a URL download task cannot be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding networked computer terminal are recorded and saved at the same time; the next time this URL is searched, that terminal is not called on to execute the URL download task.
2. The distributed vertical service search crawler framework according to claim 1, characterized in that: the message queue is uniformly managed and scheduled by the central repository; when a crawler on a networked computer discovers a URL, it pushes the URL to the central repository through the message queue; the central repository deduplicates it by checking whether the URL has already been downloaded, discarding the URL if so and, if not, adding the new URL to the message queue of URLs to be crawled, for a networked computer terminal to execute the URL download task; when a networked computer terminal fails to execute a URL download task, the URL is returned to the message queue and downloaded again under dynamic scheduling by the listening crawlers; when a networked computer terminal completes a URL download task, the central repository records the URL as crawled, which avoids repeated crawling.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811208977.7A 2018-10-17 2018-10-17 A distributed vertical service search crawler framework

Publications (1)

Publication Number Publication Date
CN109063216A 2018-12-21

Family ID: 64764112

Family Applications (1)

Application Number Title Priority Date Filing Date Legal Status
CN201811208977.7A A distributed vertical service search crawler framework 2018-10-17 2018-10-17 Pending

Country Status (1)

Country Link
CN (1) CN109063216A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线系统技术有限公司 Distributed crawler system and webpage data extraction method for the same
US20120109927A1 (en) * 2010-10-29 2012-05-03 Fujitsu Limited Architecture for distributed, parallel crawling of interactive client-server applications
CN105260388A (en) * 2015-09-11 2016-01-20 广州极数宝数据服务有限公司 Optimization method of distributed vertical crawler service system
CN105354337A (en) * 2015-12-08 2016-02-24 北京奇虎科技有限公司 Web crawler implementation method and web crawler system
CN106021608A (en) * 2016-06-22 2016-10-12 广东亿迅科技有限公司 Distributed crawler system and implementing method thereof
CN106874487A (en) * 2017-02-21 2017-06-20 国信优易数据有限公司 A kind of distributed reptile management system and its method

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475728A (en) * 2020-04-07 2020-07-31 腾讯云计算(北京)有限责任公司 Cloud resource information searching method, device, equipment and storage medium
CN111475728B (en) * 2020-04-07 2023-04-07 腾讯云计算(北京)有限责任公司 Cloud resource information searching method, device, equipment and storage medium
CN111897825A (en) * 2020-06-01 2020-11-06 中国人民财产保险股份有限公司 Distributed transaction processing method and device
CN113821705A (en) * 2021-08-30 2021-12-21 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN113821705B (en) * 2021-08-30 2024-02-20 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181221