CN106168985A - A crawler method supporting rapid distributed deployment - Google Patents

A crawler method supporting rapid distributed deployment

Info

Publication number
CN106168985A
Authority
CN
China
Prior art keywords
task
crawler
thread
server
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610751104.5A
Other languages
Chinese (zh)
Inventor
章水鑫
许伟
叶丹青
左强翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Car Easy To Amoy Network Information Technology Co Ltd
Original Assignee
Nanjing Car Easy To Amoy Network Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-08-26
Filing date
2016-08-26
Publication date
2016-11-30
Application filed by Nanjing Car Easy To Amoy Network Information Technology Co Ltd
Priority to CN201610751104.5A
Publication of CN106168985A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides a crawler method supporting rapid distributed deployment. A Dispatch module is responsible for generating tasks and is deployed on a single server; a Dispose module is responsible for processing tasks and is deployed on all crawler servers. The invention has the following advantages. Crawler tasks are generated at a single point, and all crawler servers obtain tasks through queues. Crawler tasks are divided into list tasks and detail-page tasks, and each task type corresponds to its own message queue. The Dispose module periodically checks the database configuration and manages the number of crawler threads. Because every crawler server takes its tasks from the message queues, if any one server fails, its tasks can be shared among the other servers. Every crawler server has the same configuration, so adding or removing a server requires no change to the project configuration.

Description

A crawler method supporting rapid distributed deployment
Technical field
The present invention relates to the field of information technology, and in particular to a crawler system that supports rapid distributed deployment.
Background
Crawler systems that collect large volumes of data are generally run on multiple servers, so that no single IP address crawls frequently enough to get blocked. The common practice is to split the crawl tasks and then assign each server the tasks it executes through configuration files. A system implemented this way has the following drawbacks:
Tasks are partitioned through server configuration, so adding or removing a server requires repartitioning the tasks, which often means modifying the configuration of several servers;
Adding or removing crawler threads on a server requires changing its configuration and restarting the service;
If a server's IP is blocked by some websites, moving the programs it runs to another server requires copying the configuration along with them, which is cumbersome;
If a server malfunctions, other servers cannot easily take over its work, and manual intervention is usually required;
There is no visualization tool showing how many crawl tasks each server has executed and how many remain to be executed.
Summary of the invention
To solve the above technical problems, the present invention proposes a crawler method supporting rapid distributed deployment, in which crawler tasks are generated at a single point and all crawler servers obtain tasks through queues. Crawler tasks are divided into list tasks and detail-page tasks, and each task type has its own message queue. The Dispose module periodically checks the database configuration and manages the number of crawler threads.
To achieve the above purpose, the technical solution adopted by the present invention is as follows: the Dispatch module is responsible for generating tasks and is deployed on a single server; the Dispose module is responsible for processing tasks and is deployed on all crawler servers.
The task queue is divided into list tasks (ListTask) and detail-page tasks (PageTask), and the Dispatch module and the Dispose module are decoupled by the message queue.
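The split into two task types, each with its own queue, can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `url` field, the `publish` and `consume` helpers, and the use of the in-process `queue.Queue` in place of a real message broker are all assumptions made for illustration. Only the names ListTask and PageTask come from the text.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class ListTask:
    """A list-page crawl task (the url field is a hypothetical payload)."""
    url: str


@dataclass(frozen=True)
class PageTask:
    """A detail-page crawl task (the url field is a hypothetical payload)."""
    url: str


# One queue per task type. queue.Queue stands in for a real message broker.
QUEUES = {ListTask: queue.Queue(), PageTask: queue.Queue()}


def publish(task):
    """Dispatch side: route a task to the queue that matches its type."""
    QUEUES[type(task)].put(task)


def consume(task_type):
    """Dispose side: take the next task of the given type, or None if empty."""
    try:
        return QUEUES[task_type].get_nowait()
    except queue.Empty:
        return None


publish(ListTask("https://example.com/list?page=1"))
publish(PageTask("https://example.com/item/42"))
```

Because the producer only calls `publish` and the consumers only call `consume`, neither side needs to know where the other is deployed, which is the decoupling the text describes.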
Further, the concrete operation flow is as follows. On startup, the Dispatch module creates a list-task thread and a detail-page-task thread; each thread continuously checks its own queue and, if it finds the message queue empty, generates tasks and puts them into the queue. On startup, the Dispose module creates a list-task management thread and a detail-page-task management thread; both threads periodically check the crawler thread count configured in the database, compare it with the number of crawler threads currently running, and start or destroy crawler threads accordingly.
With the above technical solution, the beneficial effects of the present invention include, but are not limited to:
1. Crawler tasks are generated at a single point by the Dispatch module, so a change to the task rules only requires updating the Dispatch module;
2. Adding or removing crawler threads on a server only requires changing the database configuration, and the crawler service does not need to be restarted;
3. Because all crawler servers take their tasks from the message queues, if any one server fails, its tasks can be shared among the other servers;
4. Every crawler server has the same configuration, so adding or removing a server requires no change to the project configuration;
5. The message queue provides a web-based management page that clearly shows task execution status, which helps in estimating whether more servers are needed.
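Advantage 3 can be illustrated with a small simulation, offered only as a sketch under stated assumptions. Three threads stand in for crawler servers and an in-process `queue.Queue` stands in for the message queue. The `crawler_server` function, its `fail_after` parameter, and the server names are hypothetical. One simulated server stops after a few tasks, and the remaining ones drain the queue.

```python
import queue
import threading


def crawler_server(name, task_queue, results, fail_after=None):
    """Pull tasks from the shared queue; optionally stop early to simulate a failure."""
    handled = 0
    while True:
        if fail_after is not None and handled >= fail_after:
            return  # simulated failure: this server stops taking tasks
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        results.append((name, task))  # list.append is atomic in CPython
        handled += 1


tasks = queue.Queue()
for i in range(20):
    tasks.put(i)

results = []
servers = [
    threading.Thread(target=crawler_server, args=("server-a", tasks, results, 3)),
    threading.Thread(target=crawler_server, args=("server-b", tasks, results)),
    threading.Thread(target=crawler_server, args=("server-c", tasks, results)),
]
for s in servers:
    s.start()
for s in servers:
    s.join()
```

The simulation only covers a server that stops between tasks. A task that is in flight when a server dies would need acknowledgement and redelivery support from the message queue, which the text does not describe.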
Brief description of the drawings
Fig. 1 is a schematic diagram of the Dispatch/Dispose modules of the present invention;
Fig. 2 is the Dispatch startup flow chart of the present invention;
Fig. 3 is the work flow chart of the task generation thread of the present invention;
Fig. 4 is the Dispose startup flow chart of the present invention;
Fig. 5 is the work flow chart of the ListTask/PageTask management threads of the present invention.
Detailed description of the invention
The present invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the Dispatch module is responsible for generating tasks and is deployed on a single server; the Dispose module is responsible for processing tasks and is deployed on all crawler servers.
The task queue is divided into list tasks (ListTask) and detail-page tasks (PageTask), and the Dispatch module and the Dispose module are decoupled by the message queue.
Further, as shown in Fig. 2 and Fig. 3, on startup the Dispatch module creates a list-task thread and a detail-page-task thread; each thread continuously checks its own queue and, if it finds the message queue empty, generates tasks and puts them into the queue.
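The Dispatch startup flow described above might look roughly like the following Python sketch. It assumes an in-process `queue.Queue` in place of the message queue and a caller-supplied `make_batch` function that produces the next batch of tasks. `generator_loop`, `start_dispatch`, and `poll_interval` are hypothetical names, not taken from the patent.

```python
import threading


def generator_loop(task_queue, make_batch, stop_event, poll_interval=0.01):
    """Refill task_queue with a fresh batch whenever it is found empty."""
    while not stop_event.is_set():
        if task_queue.empty():
            for task in make_batch():
                task_queue.put(task)
        # Sleep between checks, but wake up immediately when asked to stop.
        stop_event.wait(poll_interval)


def start_dispatch(list_queue, page_queue, make_list_batch, make_page_batch):
    """Start one generator thread per task type; returns (stop_event, threads)."""
    stop_event = threading.Event()
    threads = [
        threading.Thread(target=generator_loop,
                         args=(list_queue, make_list_batch, stop_event),
                         daemon=True),
        threading.Thread(target=generator_loop,
                         args=(page_queue, make_page_batch, stop_event),
                         daemon=True),
    ]
    for t in threads:
        t.start()
    return stop_event, threads
```

Checking for an empty queue before generating keeps a single producer from flooding the queue: new tasks appear only after the crawler servers have drained the previous batch.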
Further, as shown in Fig. 4 and Fig. 5, on startup the Dispose module creates a list-task management thread and a detail-page-task management thread; both threads periodically check the crawler thread count configured in the database, compare it with the number of crawler threads currently running, and start or destroy crawler threads accordingly.
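The management thread's compare-and-adjust step can be sketched as a small reconciliation loop. This is an illustration under assumptions, not the patent's code: `CrawlerPool`, `reconcile`, `manager_loop`, and `read_configured_count` are hypothetical names, the database lookup is replaced by a plain callable, and crawler threads are stopped cooperatively through a `threading.Event`.

```python
import queue
import threading


class CrawlerPool:
    """Keeps the number of running crawler threads equal to a configured target."""

    def __init__(self, task_queue, handle_task):
        self.task_queue = task_queue
        self.handle_task = handle_task
        self.workers = []  # list of (thread, stop_event) pairs

    def _worker(self, stop_event):
        # Crawler thread: take tasks until told to stop.
        while not stop_event.is_set():
            try:
                task = self.task_queue.get(timeout=0.05)
            except queue.Empty:
                continue
            self.handle_task(task)

    def reconcile(self, target):
        """Start or stop crawler threads until len(workers) == target."""
        while len(self.workers) < target:
            stop_event = threading.Event()
            t = threading.Thread(target=self._worker, args=(stop_event,), daemon=True)
            t.start()
            self.workers.append((t, stop_event))
        while len(self.workers) > target:
            t, stop_event = self.workers.pop()
            stop_event.set()  # cooperative stop: exits after its current task
            t.join()


def manager_loop(pool, read_configured_count, stop_event, interval=1.0):
    """Management thread: periodically re-read the configured count and reconcile."""
    while not stop_event.is_set():
        pool.reconcile(read_configured_count())
        stop_event.wait(interval)
```

In a deployment along the lines of the text, one `manager_loop` would run per task type, and `read_configured_count` would query the database, so that changing the stored value resizes the pool without a restart.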
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it. For those skilled in the art, the present invention may be subject to various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (4)

1. A crawler method supporting rapid distributed deployment, characterized in that a Dispatch module is responsible for generating tasks and is deployed on a single server, and a Dispose module is responsible for processing tasks and is deployed on all crawler servers.
2. The crawler method supporting rapid distributed deployment according to claim 1, characterized in that the task queue is divided into list tasks and detail-page tasks, and the Dispatch module and the Dispose module are decoupled by a message queue.
3. The crawler method supporting rapid distributed deployment according to claim 2, characterized in that, on startup, the Dispatch module creates a list-task thread and a detail-page-task thread; each thread continuously checks its own queue and, if it finds the message queue empty, generates tasks and puts them into the queue.
4. The crawler method supporting rapid distributed deployment according to claim 2, characterized in that, on startup, the Dispose module creates a list-task management thread and a detail-page-task management thread; both threads periodically check the crawler thread count configured in the database, compare it with the number of crawler threads currently running, and start or destroy crawler threads accordingly.
CN201610751104.5A 2016-08-26 2016-08-26 A crawler method supporting rapid distributed deployment Pending CN106168985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610751104.5A CN106168985A (en) 2016-08-26 2016-08-26 A crawler method supporting rapid distributed deployment

Publications (1)

Publication Number Publication Date
CN106168985A true CN106168985A (en) 2016-11-30

Family

ID=57376088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610751104.5A Pending CN106168985A (en) 2016-08-26 2016-08-26 A crawler method supporting rapid distributed deployment

Country Status (1)

Country Link
CN (1) CN106168985A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650570A (en) * 2020-12-29 2021-04-13 百果园技术(新加坡)有限公司 Dynamically expandable distributed crawler system, data processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线系统技术有限公司 Distributed crawler system and webpage data extraction method for the same
CN103514301A (en) * 2013-10-24 2014-01-15 深圳市同洲电子股份有限公司 Method and system for scheduling tasks of distributed network crawlers
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN105677918A (en) * 2016-03-03 2016-06-15 浪潮软件股份有限公司 Distributed crawler architecture based on Kafka and Quartz and implementation method thereof
CN105868258A (en) * 2015-12-28 2016-08-17 乐视网信息技术(北京)股份有限公司 Crawler system

Similar Documents

Publication Publication Date Title
CN106817408B (en) Distributed server cluster scheduling method and device
CN104699541A (en) Method, device, data transmission assembly and system for synchronizing data
CN104915259A (en) Task scheduling method applied to distributed acquisition system
CN103176892B (en) A page monitoring method and system
CN106874189A (en) An implementation method for an automated test system of a real-time power grid database system
US20190065249A1 (en) Method and system for managing data stream processing
CN105786611A (en) Method and device for task scheduling of distributed cluster
CN105338045A (en) Cloud computing resource processing device, method and cloud computing system
CN103345386A (en) Software production method, device and operation system
CN102789394B (en) Method, device and nodes for parallelly processing information and server cluster
CN110392106A (en) A method and device for pushing job status
CN106293945A (en) A cross-virtual-machine resource awareness method and system
DE102022120616A1 (en) Self-healing and data centers
CN104410511B (en) A server management method and system
CN111444309B (en) System for learning graph
CN103678488B (en) Distributed mass dynamic task engine and method for processing data with same
CN106168985A (en) A crawler method supporting rapid distributed deployment
CN104503885A (en) Timing door watching device and system
US20180081941A1 (en) Static hierarchy based query execution
CN114185502B (en) Log printing method, device, equipment and medium based on production line environment
CN105827445A (en) Laser marking machine monitoring method, device and system
DE102022119581A1 (en) MOTION DATA FOR FAILURE DETECTION
CN103067450B (en) Application control method and system for cloud environment
CN104166317A (en) Method and system for controlling automatic dispatch of photo-masks
CN106598721A (en) Media asset data circulating method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210000 Gulou District, Nanjing, Jiangsu Province, 518, Block C, No. 4, Gupinggang, Nanjing

Applicant after: Nanjing Sanbaiyun Information Technology Co., Ltd.

Address before: 210000 Gulou District, Nanjing, Jiangsu Province, 518, Block C, No. 4, Gupinggang, Nanjing

Applicant before: Nanjing car easy to Amoy network information technology Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20161130