CN103778165A - Dynamic collecting adjusting algorithm for spider dispatching center

Info

Publication number: CN103778165A
Application number: CN201210414966.0A
Authority: CN
Inventors: 李旭日
Original assignee: GUANGZHOU BANGFU SOFTWARE Co Ltd
Current assignee: GUANGZHOU BANGFU SOFTWARE Co Ltd
Priority date: 2012-10-26
Filing date: 2012-10-26
Publication date: 2014-05-07

Abstract

The invention discloses a dynamic collection adjustment algorithm for a spider (web crawler) dispatching center. The algorithm works in two ways. First, while a spider is running, the collection interval of each task is adjusted automatically according to the data collected, so that performance improves the longer the system runs. Second, the spider program's collection log is analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task dispatching thus depends on two key parameters, the collection interval and the key update period. Neither requires manual intervention: both adapt automatically to the update frequency and update period of each website, so that collection efficiency is maximized.

Description

A dynamic collection adjustment algorithm for a spider dispatching center

Technical field

The present invention relates to the field of Internet technology, and in particular to a dynamic collection adjustment algorithm for a spider dispatching center.

Background technology

In a search engine, web data is collected automatically by spiders issuing requests. Because the Internet contains a very large number of websites, spider tasks are usually defined per website for ease of management.

Spiders are usually distributed across servers in different clusters. For unified coordination and management, each spider automatically requests tasks from a single task scheduling center. Different websites update at different frequencies, so the scheduling time and scheduling interval of tasks have a large effect on spider performance and efficiency.

The most important work of the task scheduling center is therefore task distribution and the setting of the related parameters.

In the first existing technique, the task scheduling center uses a polling, first-in-first-out mechanism: all websites to be collected wait their turn in the order in which they were discovered.

This approach ignores the fact that different websites update at different frequencies, which makes it inefficient.

A second existing technique builds on the first by manually setting a fixed collection interval for each website.

This approach requires manual adjustment, and because the number of websites is large, the maintenance cost is very high. In addition, the update frequency of many websites changes often, and manual settings cannot keep up in time.

Summary of the invention

The object of the present invention is to solve the above problems by providing an automatic collection-interval adjustment mechanism for a spider dispatching center. It requires no manual intervention and adapts automatically to the update frequency and update period of each website, so that collection efficiency is maximized.

To achieve this object, the technical solution adopted by the present invention is a dynamic collection adjustment algorithm for a spider dispatching center, characterized in that the algorithm works in two ways. First, while the spider is running, the collection interval of each task is adjusted automatically according to the data collected, so that performance improves the longer the system runs. Second, the spider's collection log is analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task scheduling thus has two key parameters: the collection interval and the key update period.
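The second aspect, finding the period with the largest collection volume from the log, could be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify a log format or a granularity for the period, so the `(day, hour, new_items)` record layout, the one-hour buckets, the `top_n` parameter, and the function name `key_update_hours` are all hypothetical.

```python
from collections import defaultdict


def key_update_hours(log, top_n=2):
    """Return the hours of day with the highest average number of new items.

    `log` is an iterable of (day, hour, new_items) records. The record
    layout, the hourly buckets, and `top_n` are assumptions for illustration.
    """
    totals = defaultdict(int)  # hour of day -> new items summed over all days
    days = set()
    for day, hour, new_items in log:
        totals[hour] += new_items
        days.add(day)
    if not days:
        return []
    # Average over the number of days so the result reflects several days
    # of data rather than a single spike.
    averages = {hour: total / len(days) for hour, total in totals.items()}
    # Highest average first; ties are broken by the earlier hour.
    ranked = sorted(averages, key=lambda h: (-averages[h], h))
    return sorted(ranked[:top_n])


log = [
    ("2012-10-24", 9, 40), ("2012-10-24", 14, 5), ("2012-10-24", 20, 22),
    ("2012-10-25", 9, 35), ("2012-10-25", 14, 8), ("2012-10-25", 20, 30),
]
print(key_update_hours(log))  # [9, 20]
```

In this toy log the 09:00 and 20:00 buckets hold the most new items across both days, so they would be reported to the dispatching center as the key update period for the site.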

Algorithm steps:

1. The spider sends a task request to the task scheduling center.

2. The dispatching center divides all websites into two groups, according to whether the current time falls within the website's key update period.

3. Each group is sorted by last collection time plus collection interval. The task with the smallest such value that is also earlier than the current time is returned. If no task qualifies, an empty result is returned. Of the two groups, the one in its key update period takes priority.

4. If the spider did not obtain a task, it returns to step 1; if it did, it carries out data collection.

5. The amount collected is recorded in the log. If new data was collected, the dispatching center is notified to shorten the collection interval automatically, for example by multiplying it by a weight less than 1, such as 0.9. If no new data was collected, the collection interval is lengthened, for example by multiplying it by a weight greater than 1, such as 1.1.

6. The spider returns to step 1 to obtain the next task.

7. In addition, a background program concurrently builds collection-volume statistics from the collection log, combines the data of several days to find the time periods corresponding to the peaks in collection volume, and updates the dispatching center with them.
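The grouping and selection in steps 2 and 3 might look like the minimal sketch below. The `SiteTask` record, its field names, the use of hours of day for the key update period, and the function `next_task` are assumptions introduced for illustration; the patent describes only the rule, not a data model.

```python
from dataclasses import dataclass, field


@dataclass
class SiteTask:
    # Hypothetical per-website record; field names are illustrative only.
    name: str
    last_collected: float  # time of the last collection, in seconds
    interval: float        # current collection interval, in seconds
    key_hours: set = field(default_factory=set)  # hours flagged as key update period


def next_task(sites, now, current_hour):
    """Return the task to hand to a spider, or None if nothing is due."""
    # Step 2: split the sites by whether the current hour is a key update period.
    key_group = [s for s in sites if current_hour in s.key_hours]
    other_group = [s for s in sites if current_hour not in s.key_hours]
    # Step 3: the key-period group is examined first, so it takes priority.
    for group in (key_group, other_group):
        due = [s for s in group if s.last_collected + s.interval < now]
        if due:
            # Smallest (last collection time + interval) is the most overdue.
            return min(due, key=lambda s: s.last_collected + s.interval)
    return None  # nothing qualifies: the spider receives an empty reply


sites = [
    SiteTask("news", last_collected=0, interval=600, key_hours={9}),
    SiteTask("blog", last_collected=0, interval=300),
    SiteTask("forum", last_collected=900, interval=600, key_hours={9}),
]
print(next_task(sites, now=1000, current_hour=9).name)  # news
print(next_task(sites, now=1000, current_hour=3).name)  # blog
print(next_task(sites, now=100, current_hour=9))        # None
```

At hour 9 the "news" site is returned even though "blog" is more overdue, because the key-period group is consulted first. Outside the key period all sites compete in one group and the most overdue one wins.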

The algorithm requires no manual intervention and adapts automatically to the update frequency and update period of each website, so that collection efficiency is maximized.
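The adaptation to update frequency comes from the feedback rule in step 5, which can be sketched as a small function. The factors 0.9 and 1.1 are the example weights given in the text. The lower and upper bounds (`min_interval`, `max_interval`) are not in the patent; they are an assumption added here so that the interval cannot shrink to zero or grow without limit.

```python
def adjust_interval(interval, new_items, shrink=0.9, grow=1.1,
                    min_interval=60.0, max_interval=86400.0):
    """Return the next collection interval (seconds) after one collection run.

    The interval shrinks when the run found new data and grows when it did
    not. The clamping bounds are an assumption, not part of the source.
    """
    factor = shrink if new_items > 0 else grow
    return max(min_interval, min(max_interval, interval * factor))


# A site that yields new data on two of four runs.
interval = 600.0
for new_items in (3, 0, 0, 5):
    interval = adjust_interval(interval, new_items)
    print(round(interval, 2))
```

Because the adjustment is multiplicative, a site that keeps producing new data is revisited more and more often, while a quiet site drifts toward the upper bound.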

Embodiment:

To make the technical solution of the present invention easier to understand, the invention is further described below with reference to an embodiment.

Embodiment: a dynamic collection adjustment algorithm for a spider dispatching center, characterized in that the algorithm works in two ways. First, while the spider is running, the collection interval of each task is adjusted automatically according to the data collected, so that performance improves the longer the system runs. Second, the spider's collection log is analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task scheduling thus has two key parameters: the collection interval and the key update period.

Algorithm steps:

1. The spider sends a task request to the task scheduling center.

6. The spider returns to step 1 to obtain the next task.

The above is only a preferred embodiment of the present invention and does not limit the invention in form or in substance. Any equivalent embodiment obtained by a person skilled in the art through minor changes, modifications, or developments based on the technical content disclosed above, without departing from the scope of the technical solution of the present invention, is an equivalent embodiment of the invention. Likewise, any equivalent change, modification, or development made to the above embodiment in accordance with the essential technique of the invention still falls within the scope of the technical solution of the present invention.

Claims

1. A dynamic collection adjustment algorithm for a spider dispatching center, characterized in that the algorithm works in two ways. First, while the spider is running, the collection interval of each task is adjusted automatically according to the data collected, so that performance improves the longer the system runs. Second, the spider's collection log is analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task scheduling thus has two key parameters: the collection interval and the key update period.

Algorithm steps:

1. The spider sends a task request to the task scheduling center.

6. The spider returns to step 1 to obtain the next task.