CN103778165A - Dynamic collecting adjusting algorithm for spider dispatching center - Google Patents

Dynamic collecting adjusting algorithm for spider dispatching center

Info

Publication number
CN103778165A
Authority
CN
China
Prior art keywords
spider
task
time
dispatching center
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210414966.0A
Other languages
Chinese (zh)
Inventor
李旭日
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGZHOU BANGFU SOFTWARE Co Ltd
Original Assignee
GUANGZHOU BANGFU SOFTWARE Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGZHOU BANGFU SOFTWARE Co Ltd
Priority to CN201210414966.0A
Publication of CN103778165A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a dynamic collection adjustment algorithm for a spider dispatching center. The algorithm works on two fronts. First, while the spider is running, the collection interval of each task is adjusted automatically according to the data collected, so that performance improves the longer the system runs. Second, the collection log of the spider program is analyzed to find the time periods in which the most data is collected, and those periods are given priority in task configuration. Task dispatching therefore has two important parameters: the collection interval and the key update period. No manual intervention is needed; the algorithm adapts automatically to the update frequency and update periods of different websites, with the aim of maximizing collection efficiency.

Description

A dynamic collection adjustment algorithm for a spider dispatching center
Technical field
The present invention relates to the field of Internet technology, and in particular to a dynamic collection adjustment algorithm for a spider dispatching center.
Background
In a search engine, web data is collected automatically by spiders. Because there are many websites on the Internet, spider tasks are usually defined per website for ease of management.
Spiders are usually distributed across servers in different clusters. For unified coordination and management, each spider automatically requests tasks from a single task dispatching center. Different websites have different update frequencies, so the scheduling time and scheduling interval of tasks strongly affect the performance and efficiency of the spiders.
The most important work of the task dispatching center is therefore task distribution and the setting of the related parameters.
In the first existing technique, the task dispatching center uses a polling, first-in-first-out mechanism. All websites to be collected wait their turn in the order in which they were discovered.
This ignores the fact that different websites update at different frequencies, which leads to low efficiency.
The second existing technique builds on the first by manually setting a fixed collection interval for each website.
This requires manual adjustment, and because the number of websites is large, the maintenance cost is very high. The update frequency of many websites also changes often, so the intervals cannot be adjusted in time.
Summary of the invention
The object of the invention is to solve the above problems by providing a mechanism in the spider dispatching center that adjusts collection intervals automatically, without manual intervention, adapting to the update frequency and update periods of different websites, so as to maximize collection efficiency.
To achieve this object, the technical solution adopted by the invention is a dynamic collection adjustment algorithm for a spider dispatching center, characterized in that the algorithm works on two fronts: first, while the spider is running, the collection interval of each task is adjusted dynamically and automatically according to the data collected, so that performance improves the longer the system runs; second, the spider's collection log is analyzed to find the time periods in which the most data is collected, and those periods are given priority in task configuration. Task dispatching has two important parameters: the collection interval and the key update period.
Algorithm steps:
1. The spider sends a request for a task to the task dispatching center.
2. The dispatching center divides all websites into two groups; the grouping condition is whether the current time falls within the key update period of the website.
3. The two groups are sorted separately; the sort key is the last collection time plus the collection interval. The task whose key is smallest and earlier than the current time is returned. If no task qualifies, an empty result is returned directly. Between the two groups, the group within its key update period is tried first.
4. If the spider does not obtain a task, it returns to step 1; if it obtains one, it carries out data collection.
5. The amount collected this time is recorded in the log. If new data was collected, the dispatching center is notified to shorten the collection interval automatically, for example by multiplying it by a weight less than 1 such as 0.9; if no new data was collected, the collection interval is lengthened, for example by multiplying it by a weight greater than 1 such as 1.1.
6. The spider returns to step 1 to obtain the next task.
7. In addition, a background program runs concurrently: from the collection log it draws a collection-volume statistical chart, combines several days of data to find the time periods corresponding to the peaks in collection volume, and updates them to the dispatching center.
The algorithm needs no manual intervention and adapts automatically to the update frequency and update periods of different websites, with the aim of maximizing collection efficiency.
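The grouping and selection in steps 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `SiteTask` record, its field names, and the representation of the key update period as a set of hours of the day are all hypothetical, since the patent text does not specify any data structures.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SiteTask:
    # Hypothetical record; none of these field names come from the patent.
    name: str
    last_crawl: float                  # time of the last collection, in seconds
    interval: float                    # current collection interval, in seconds
    key_hours: Tuple[int, ...] = ()    # hours of day (0-23) in the key update period


def next_task(tasks: List[SiteTask], now: float, hour: int) -> Optional[SiteTask]:
    """Return the most overdue task, trying the key-update-period group first."""
    # Step 2: split the sites by whether the current hour is in their key period.
    in_period = [t for t in tasks if hour in t.key_hours]
    others = [t for t in tasks if hour not in t.key_hours]
    # Step 3: within each group, the sort key is last collection time + interval;
    # only tasks whose key is earlier than the current time are eligible.
    for group in (in_period, others):
        due = [t for t in group if t.last_crawl + t.interval < now]
        if due:
            return min(due, key=lambda t: t.last_crawl + t.interval)
    return None  # nothing is due: the spider gets an empty result
```

Taking the minimum of the due tasks is equivalent to sorting each group and returning its first eligible entry; a real dispatching center with many sites would more likely keep a priority queue per group.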
Embodiment:
To make the technical solution of the invention easier to understand, the invention is further described below with reference to an embodiment.
Embodiment: a dynamic collection adjustment algorithm for a spider dispatching center, characterized in that the algorithm works on two fronts: first, while the spider is running, the collection interval of each task is adjusted dynamically and automatically according to the data collected, so that performance improves the longer the system runs; second, the spider's collection log is analyzed to find the time periods in which the most data is collected, and those periods are given priority in task configuration. Task dispatching has two important parameters: the collection interval and the key update period.
Algorithm steps:
1. The spider sends a request for a task to the task dispatching center.
2. The dispatching center divides all websites into two groups; the grouping condition is whether the current time falls within the key update period of the website.
3. The two groups are sorted separately; the sort key is the last collection time plus the collection interval. The task whose key is smallest and earlier than the current time is returned. If no task qualifies, an empty result is returned directly. Between the two groups, the group within its key update period is tried first.
4. If the spider does not obtain a task, it returns to step 1; if it obtains one, it carries out data collection.
5. The amount collected this time is recorded in the log. If new data was collected, the dispatching center is notified to shorten the collection interval automatically, for example by multiplying it by a weight less than 1 such as 0.9; if no new data was collected, the collection interval is lengthened, for example by multiplying it by a weight greater than 1 such as 1.1.
6. The spider returns to step 1 to obtain the next task.
7. In addition, a background program runs concurrently: from the collection log it draws a collection-volume statistical chart, combines several days of data to find the time periods corresponding to the peaks in collection volume, and updates them to the dispatching center.
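The interval update of step 5 amounts to a multiplicative adjustment. A minimal sketch follows, using the example weights 0.9 and 1.1 from the text. The lower and upper bounds (`lo`, `hi`) are an assumption added here so that the interval cannot shrink toward zero or grow without limit; the patent does not mention any bounds.

```python
def adjust_interval(interval: float, new_items: int,
                    shrink: float = 0.9, grow: float = 1.1,
                    lo: float = 60.0, hi: float = 86400.0) -> float:
    """Return the next collection interval (seconds) after one collection run."""
    # New data found: the site updates faster than we visit, so visit sooner.
    # Nothing new: the visit was wasted, so back off.
    factor = shrink if new_items > 0 else grow
    # Clamping to [lo, hi] is a hypothetical safeguard, not part of the source.
    return max(lo, min(hi, interval * factor))
```

Under this rule, a site that yields new data on every visit has its interval shortened by 10% each time until the lower bound is reached, while a site that never changes drifts toward the upper bound.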
The algorithm needs no manual intervention and adapts automatically to the update frequency and update periods of different websites, with the aim of maximizing collection efficiency.
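The background program of step 7 can be approximated by pooling several days of log entries per hour of day and taking the busiest hours as the key update period. The sketch below makes several assumptions that are not in the patent: the log is a sequence of `(site, hour_of_day, new_item_count)` tuples, periods are whole hours, and the "peaks" are simply the `top_n` hours with the highest totals.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def key_update_hours(log: Iterable[Tuple[str, int, int]],
                     top_n: int = 3) -> Dict[str, List[int]]:
    """Map each site to the hours of day in which it yielded the most new data."""
    # Pool counts from all days into 24 hourly buckets per site.
    totals = defaultdict(lambda: [0] * 24)
    for site, hour, count in log:
        totals[site][hour] += count
    result = {}
    for site, per_hour in totals.items():
        # Rank hours by total volume; hours with no new data never qualify.
        ranked = sorted(range(24), key=lambda h: per_hour[h], reverse=True)
        result[site] = sorted(h for h in ranked[:top_n] if per_hour[h] > 0)
    return result
```

The returned hour lists could then be pushed to the dispatching center as the key update periods used in step 2. A production version would more likely detect peaks relative to each site's average volume rather than use a fixed `top_n`.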
The above is only a preferred embodiment of the invention and does not limit the invention in form or in substance. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes, modifications and developments, and these equivalent variations are equivalent embodiments of the invention. Likewise, any equivalent change, modification or development made to the above embodiment in accordance with the essential technology of the invention still falls within the scope of the technical solution of the invention.

Claims (1)

1. A dynamic collection adjustment algorithm for a spider dispatching center, characterized in that the algorithm works on two fronts: first, while the spider is running, the collection interval of each task is adjusted dynamically and automatically according to the data collected, so that performance improves the longer the system runs; second, the spider's collection log is analyzed to find the time periods in which the most data is collected, and those periods are given priority in task configuration. Task dispatching has two important parameters: the collection interval and the key update period.
Algorithm steps:
1. The spider sends a request for a task to the task dispatching center.
2. The dispatching center divides all websites into two groups; the grouping condition is whether the current time falls within the key update period of the website.
3. The two groups are sorted separately; the sort key is the last collection time plus the collection interval. The task whose key is smallest and earlier than the current time is returned. If no task qualifies, an empty result is returned directly. Between the two groups, the group within its key update period is tried first.
4. If the spider does not obtain a task, it returns to step 1; if it obtains one, it carries out data collection.
5. The amount collected this time is recorded in the log. If new data was collected, the dispatching center is notified to shorten the collection interval automatically, for example by multiplying it by a weight less than 1 such as 0.9; if no new data was collected, the collection interval is lengthened, for example by multiplying it by a weight greater than 1 such as 1.1.
6. The spider returns to step 1 to obtain the next task.
7. In addition, a background program runs concurrently: from the collection log it draws a collection-volume statistical chart, combines several days of data to find the time periods corresponding to the peaks in collection volume, and updates them to the dispatching center.
CN201210414966.0A 2012-10-26 2012-10-26 Dynamic collecting adjusting algorithm for spider dispatching center Pending CN103778165A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210414966.0A CN103778165A (en) 2012-10-26 2012-10-26 Dynamic collecting adjusting algorithm for spider dispatching center

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210414966.0A CN103778165A (en) 2012-10-26 2012-10-26 Dynamic collecting adjusting algorithm for spider dispatching center

Publications (1)

Publication Number Publication Date
CN103778165A true CN103778165A (en) 2014-05-07

Family

ID=50570407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210414966.0A Pending CN103778165A (en) 2012-10-26 2012-10-26 Dynamic collecting adjusting algorithm for spider dispatching center

Country Status (1)

Country Link
CN (1) CN103778165A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008213A (en) * 2014-06-24 2014-08-27 电子科技大学 Method and device for finding and counting webpage information updating
CN104820680A (en) * 2015-04-17 2015-08-05 南京大学 Universal distributed crawler scheduling system
CN105577718A (en) * 2014-10-15 2016-05-11 卓望数码技术(深圳)有限公司 Intelligent network information acquisition method and network information acquisition system
CN106294364A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Realize the method and apparatus that web crawlers captures webpage
CN107451218A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 On-Line review method for automatically releasing and device
CN109688207A (en) * 2018-12-11 2019-04-26 北京云中融信网络科技有限公司 Log transmission method, apparatus and server

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101436196A (en) * 2008-11-25 2009-05-20 北京邮电大学 Construction method capable of automatically and dynamically updating forum reptile crawler system
US20090204575A1 (en) * 2008-02-07 2009-08-13 Christopher Olston Modular web crawling policies and metrics
CN101739427A (en) * 2008-11-10 2010-06-16 中国移动通信集团公司 Crawler capturing method and device thereof
CN102402627A (en) * 2011-12-31 2012-04-04 凤凰在线(北京)信息技术有限公司 System and method for real-time intelligent capturing of article
US20120130970A1 (en) * 2010-11-18 2012-05-24 Shepherd Daniel W Method And Apparatus For Enhanced Web Browsing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨华: "网络信息动态采集策略的研究及应用", 《中国优秀硕士学位论文全文数据库》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008213A (en) * 2014-06-24 2014-08-27 电子科技大学 Method and device for finding and counting webpage information updating
CN105577718A (en) * 2014-10-15 2016-05-11 卓望数码技术(深圳)有限公司 Intelligent network information acquisition method and network information acquisition system
CN104820680A (en) * 2015-04-17 2015-08-05 南京大学 Universal distributed crawler scheduling system
CN104820680B (en) * 2015-04-17 2018-04-06 南京大学 A kind of universal distributed reptile scheduling system
CN106294364A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Realize the method and apparatus that web crawlers captures webpage
CN106294364B (en) * 2015-05-15 2020-04-10 阿里巴巴集团控股有限公司 Method and device for realizing web crawler to capture webpage
CN107451218A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 On-Line review method for automatically releasing and device
CN107451218B (en) * 2017-07-17 2020-04-03 云润大数据服务有限公司 Automatic publishing method and device for online comments
CN109688207A (en) * 2018-12-11 2019-04-26 北京云中融信网络科技有限公司 Log transmission method, apparatus and server
CN109688207B (en) * 2018-12-11 2022-06-03 北京云中融信网络科技有限公司 Log transmission method and device and server

Similar Documents

Publication Publication Date Title
CN103778165A (en) Dynamic collecting adjusting algorithm for spider dispatching center
CN109673232A (en) A kind of wisdom trickle irrigation cloud service management system based on micro services framework
CN107133713A (en) A kind of photovoltaic plant intelligently cleans the method for building up of decision system
CN105740124B (en) A kind of redundant data filter method towards cloud computing monitoring system
CN113675853B (en) Energy internet-oriented electricity consumption information acquisition system
CN1967620A (en) Online visible energy consumption audit management system
CN106251034A (en) Wisdom energy saving electric meter monitoring system based on cloud computing technology
CN108255981A (en) The storage and lookup method that section timestamp serial number index minute continuous time freezes
CN103517405B (en) A kind of method and system of network positions, mobile terminal and network side equipment
CN107819607B (en) Micro-service monitoring system based on dubbo
CN106201826A (en) A kind of diagnose the big affairs of oracle database and the method for focus affairs
CN106789347A (en) A kind of method that alarm association and network fault diagnosis are realized based on alarm data
CN102904744B (en) The acquisition method of performance data and system
CN105022823B (en) A kind of cloud service performance early warning event generation method based on data mining
CN107145609A (en) Tunnel traffic accident association rule algorithm based on FP Growth algorithms
CN104268665A (en) User behavior analysis method of management system
CN102026228A (en) Statistical method and equipment for communication network performance data
CN102761429B (en) A kind of abnormal call bill processing method and system
CN103413192A (en) Unit dispatching method based on power grid dispatching automatic system power load curve
CN105427543A (en) Temperature early warning method and system based on smart grid
CN102013996A (en) Data acquisition management method and system and telecommunication network management system
CN115660314A (en) Shadow shielding diagnosis method and device, electronic equipment and storage medium
CN106887848B (en) Voltage power-less real-time control method based on Fuzzy Pattern Recognition
CN103684877B (en) A kind of method and apparatus choosing infrastructure for Web content service
CN103455556A (en) Intelligent storage unit data clipping process

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140507