CN103778165A - Dynamic collecting adjusting algorithm for spider dispatching center - Google Patents
- Publication number
- CN103778165A (application number CN201210414966.0A)
- Authority
- CN
- China
- Prior art keywords
- spider
- task
- time
- dispatching center
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a dynamic collection-adjustment algorithm for a spider dispatching center. The algorithm works on two fronts. First, while a spider runs, the collection interval of each task is adjusted automatically according to the data collected, so that performance improves the longer the system runs. Second, the spider's collection logs are analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task scheduling therefore has two key parameters, the collection interval and the key update period. Both are adapted automatically, without manual intervention, to the update frequency and update period of each website, with the aim of maximizing collection efficiency.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a dynamic collection-adjustment algorithm for a spider dispatching center.
Background technology
In a search engine, web data is supplied by spiders that request and collect pages automatically. Because there are a great many websites on the Internet, spider tasks are usually defined per website for ease of management.
Spiders are usually distributed across servers in different clusters. For unified coordination and management, each spider automatically requests tasks from a single task dispatching center. Different websites have different update frequencies, so the scheduling time and scheduling interval of tasks have a large effect on the performance and efficiency of the spiders.
The most important work of the task dispatching center is therefore task distribution and the setting of the related parameters.
In the first existing approach, the task dispatching center uses polling with a first-in, first-out mechanism: all websites to be collected wait their turn in the order in which they were discovered.
This approach ignores the fact that different websites update at different frequencies, which makes it inefficient.
A second existing approach builds on the first by manually setting a fixed collection interval for each website.
This approach requires manual adjustment, and because the number of websites is large, the maintenance cost is very high. In addition, the update frequency of many websites changes often, and a manually set interval cannot be adjusted in time.
Summary of the invention
The object of the invention is to solve the problems described above by providing an automatic collection-interval adjustment mechanism for a spider dispatching center. The mechanism needs no manual intervention and adapts automatically to the update frequency and update period of each website, with the aim of maximizing collection efficiency.
To achieve this object, the technical solution adopted by the invention is a dynamic collection-adjustment algorithm for a spider dispatching center, characterized in that the algorithm works on two fronts. First, while the spider runs, the collection interval of each task is adjusted automatically and dynamically according to the data collected, so that performance improves the longer the system runs. Second, the spider's collection logs are analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task scheduling has two key parameters: the collection interval and the key update period.
Algorithm steps:
1. The spider sends a request for a task to the task dispatching center.
2. The dispatching center divides all websites into two groups, according to whether the current time falls within the website's key update period.
3. Each group is sorted separately by last collection time plus collection interval. The task with the smallest such time is returned, provided that time is earlier than the current time. If no task qualifies, an empty result is returned. The group that is in its key update period is tried first.
4. If the spider does not obtain a task, it returns to step 1. If it obtains one, it carries out data collection.
5. The amount collected is recorded in the log. If new data was collected, the dispatching center is notified to reduce the collection interval automatically, for example by multiplying it by a weight less than 1 such as 0.9. If no new data was collected, the interval is increased, for example by multiplying it by a weight greater than 1 such as 1.1.
6. The spider returns to step 1 to obtain the next task.
7. In addition, a background program running concurrently builds collection-volume statistics from the collection logs, combines the data from multiple days to find the time period that corresponds to the peak in collection volume, and updates the dispatching center with it.
The algorithm requires no manual intervention and adapts automatically to the update frequency and update period of each website, with the aim of maximizing collection efficiency.
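The grouping and selection performed by the dispatching center in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `SiteTask` class, its field names, and the representation of the key update period as a single time-of-day window are all assumptions made for the example. Taking the minimum of each group stands in for sorting the group and reading its first element.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import List, Optional


@dataclass
class SiteTask:
    """Scheduling state for one website (all names here are hypothetical)."""
    site: str
    last_collected: datetime
    interval: timedelta
    key_start: time  # assumed: key update period is one daily window
    key_end: time

    def due_at(self) -> datetime:
        # Sort key from step 3: last collection time plus collection interval.
        return self.last_collected + self.interval

    def in_key_period(self, now: datetime) -> bool:
        t = now.time()
        if self.key_start <= self.key_end:
            return self.key_start <= t < self.key_end
        # Window wraps past midnight, e.g. 22:00 to 02:00.
        return t >= self.key_start or t < self.key_end


def next_task(tasks: List[SiteTask], now: datetime) -> Optional[SiteTask]:
    """Return the task to hand to a spider, or None if nothing is due."""
    # Step 2: split sites by whether "now" is inside their key update period.
    key_group = [t for t in tasks if t.in_key_period(now)]
    other_group = [t for t in tasks if not t.in_key_period(now)]
    # Step 3: try the key-period group first, then the remaining sites.
    for group in (key_group, other_group):
        if not group:
            continue
        candidate = min(group, key=SiteTask.due_at)
        if candidate.due_at() < now:  # must already be due
            return candidate
    return None
```

Under these assumptions, a site that is inside its key update period and already due is returned ahead of a site outside its period, even if the latter has been due for longer.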
Embodiment:
To make the technical solution of the invention easier to understand, the invention is further described below with reference to an embodiment.
Embodiment: a dynamic collection-adjustment algorithm for a spider dispatching center, characterized in that the algorithm works on two fronts. First, while the spider runs, the collection interval of each task is adjusted automatically and dynamically according to the data collected, so that performance improves the longer the system runs. Second, the spider's collection logs are analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task scheduling has two key parameters: the collection interval and the key update period.
Algorithm steps:
1. The spider sends a request for a task to the task dispatching center.
2. The dispatching center divides all websites into two groups, according to whether the current time falls within the website's key update period.
3. Each group is sorted separately by last collection time plus collection interval. The task with the smallest such time is returned, provided that time is earlier than the current time. If no task qualifies, an empty result is returned. The group that is in its key update period is tried first.
4. If the spider does not obtain a task, it returns to step 1. If it obtains one, it carries out data collection.
5. The amount collected is recorded in the log. If new data was collected, the dispatching center is notified to reduce the collection interval automatically, for example by multiplying it by a weight less than 1 such as 0.9. If no new data was collected, the interval is increased, for example by multiplying it by a weight greater than 1 such as 1.1.
6. The spider returns to step 1 to obtain the next task.
7. In addition, a background program running concurrently builds collection-volume statistics from the collection logs, combines the data from multiple days to find the time period that corresponds to the peak in collection volume, and updates the dispatching center with it.
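The interval update of step 5 can be illustrated with a short sketch. The weights 0.9 and 1.1 are the example values given in the text. The lower and upper bounds are an assumption added here so that the interval cannot shrink or grow without limit; the source does not specify any bounds, and the function name is hypothetical.

```python
from datetime import timedelta

SHRINK = 0.9  # example weight from the text, used when new data was found
GROW = 1.1    # example weight from the text, used when nothing new was found
MIN_INTERVAL = timedelta(minutes=1)  # assumed bound, not from the source
MAX_INTERVAL = timedelta(days=7)     # assumed bound, not from the source


def adjust_interval(interval: timedelta, new_items: int) -> timedelta:
    """Return the next collection interval after one collection run."""
    factor = SHRINK if new_items > 0 else GROW
    adjusted = interval * factor
    # Clamp to the assumed bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, adjusted))
```

Because the update is multiplicative, repeated results compound: ten consecutive runs that each find new data reduce the interval to about 35% of its starting value (0.9^10 ≈ 0.349), provided the lower bound is not reached.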
The algorithm requires no manual intervention and adapts automatically to the update frequency and update period of each website, with the aim of maximizing collection efficiency.
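The background analysis of step 7 might look like the sketch below. The source does not describe the log format or the granularity of a time period, so the log entry layout `(site, collection time, new items)`, the one-hour buckets, and the function name are all assumptions made for illustration. Volumes from multiple days are summed per hour of day, and the hour with the largest total is reported for each site.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Assumed log record: (site, time of the collection run, number of new items).
LogEntry = Tuple[str, datetime, int]


def key_update_hours(log: Iterable[LogEntry]) -> Dict[str, int]:
    """Return, per site, the hour of day (0-23) with the largest total volume."""
    totals: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for site, when, new_items in log:
        # Entries from different days fall into the same hour-of-day bucket.
        totals[site][when.hour] += new_items
    # On a tie, the hour that first appeared in the log for that site wins.
    return {site: max(hours, key=hours.get) for site, hours in totals.items()}
```

The result would then be pushed to the dispatching center as each site's key update period. A production version would probably weight recent days more heavily or report a wider window than a single hour, but the source leaves those choices open.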
The above is only a preferred embodiment of the invention and does not limit the invention in any form or substance. Any embodiment that a person skilled in the art obtains by making minor changes, modifications, or developments using the technical content disclosed above, without departing from the scope of the technical solution of the invention, is an equivalent embodiment of the invention. Likewise, any equivalent change, modification, or development made to the above embodiment in accordance with the essential technology of the invention still falls within the scope of the technical solution of the invention.
Claims (1)
1. A dynamic collection-adjustment algorithm for a spider dispatching center, characterized in that the algorithm works on two fronts. First, while the spider runs, the collection interval of each task is adjusted automatically and dynamically according to the data collected, so that performance improves the longer the system runs. Second, the spider's collection logs are analyzed to find the time period in which the most data is collected, and that period is given priority in task configuration. Task scheduling has two key parameters: the collection interval and the key update period.
Algorithm steps:
1. The spider sends a request for a task to the task dispatching center.
2. The dispatching center divides all websites into two groups, according to whether the current time falls within the website's key update period.
3. Each group is sorted separately by last collection time plus collection interval. The task with the smallest such time is returned, provided that time is earlier than the current time. If no task qualifies, an empty result is returned. The group that is in its key update period is tried first.
4. If the spider does not obtain a task, it returns to step 1. If it obtains one, it carries out data collection.
5. The amount collected is recorded in the log. If new data was collected, the dispatching center is notified to reduce the collection interval automatically, for example by multiplying it by a weight less than 1 such as 0.9. If no new data was collected, the interval is increased, for example by multiplying it by a weight greater than 1 such as 1.1.
6. The spider returns to step 1 to obtain the next task.
7. In addition, a background program running concurrently builds collection-volume statistics from the collection logs, combines the data from multiple days to find the time period that corresponds to the peak in collection volume, and updates the dispatching center with it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210414966.0A CN103778165A (en) | 2012-10-26 | 2012-10-26 | Dynamic collecting adjusting algorithm for spider dispatching center |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103778165A (en) | 2014-05-07 |
Family
ID=50570407
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210414966.0A (pending; published as CN103778165A) | Dynamic collecting adjusting algorithm for spider dispatching center | 2012-10-26 | 2012-10-26 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103778165A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090204575A1 (en) * | 2008-02-07 | 2009-08-13 | Christopher Olston | Modular web crawling policies and metrics |
CN101739427A (en) * | 2008-11-10 | 2010-06-16 | 中国移动通信集团公司 | Crawler capturing method and device thereof |
CN101436196A (en) * | 2008-11-25 | 2009-05-20 | 北京邮电大学 | Construction method capable of automatically and dynamically updating forum reptile crawler system |
US20120130970A1 (en) * | 2010-11-18 | 2012-05-24 | Shepherd Daniel W | Method And Apparatus For Enhanced Web Browsing |
CN102402627A (en) * | 2011-12-31 | 2012-04-04 | 凤凰在线(北京)信息技术有限公司 | System and method for real-time intelligent capturing of article |
Non-Patent Citations (1)
Title |
---|
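杨华 (Yang Hua): "网络信息动态采集策略的研究及应用" (Research and Application of Dynamic Web Information Collection Strategies), 《中国优秀硕士学位论文全文数据库》 (China Master's Theses Full-text Database) * |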
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104008213A (en) * | 2014-06-24 | 2014-08-27 | 电子科技大学 | Method and device for finding and counting webpage information updating |
CN105577718A (en) * | 2014-10-15 | 2016-05-11 | 卓望数码技术(深圳)有限公司 | Intelligent network information acquisition method and network information acquisition system |
CN104820680A (en) * | 2015-04-17 | 2015-08-05 | 南京大学 | Universal distributed crawler scheduling system |
CN104820680B (en) * | 2015-04-17 | 2018-04-06 | 南京大学 | A kind of universal distributed reptile scheduling system |
CN106294364A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Realize the method and apparatus that web crawlers captures webpage |
CN106294364B (en) * | 2015-05-15 | 2020-04-10 | 阿里巴巴集团控股有限公司 | Method and device for realizing web crawler to capture webpage |
CN107451218A (en) * | 2017-07-17 | 2017-12-08 | 广州特道信息科技有限公司 | On-Line review method for automatically releasing and device |
CN107451218B (en) * | 2017-07-17 | 2020-04-03 | 云润大数据服务有限公司 | Automatic publishing method and device for online comments |
CN109688207A (en) * | 2018-12-11 | 2019-04-26 | 北京云中融信网络科技有限公司 | Log transmission method, apparatus and server |
CN109688207B (en) * | 2018-12-11 | 2022-06-03 | 北京云中融信网络科技有限公司 | Log transmission method and device and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140507 |