CN109388736A - Response scheduling method in crawler system - Google Patents
Response scheduling method in crawler system
- Publication number
- CN109388736A (application number CN201811106373.1A)
- Authority
- CN
- China
- Prior art keywords
- news
- frequency
- attributes
- entrance
- subtask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
This application relates to a response scheduling method in a crawler system. The method comprises: dividing a website into multiple entry seed tasks according to its hierarchical section structure; presetting an initial acquisition frequency for each entry seed task according to the section level it belongs to and the amount of news published per unit time; presetting multiple news attributes and formulating an adjustment rule for each attribute; and updating the calculated acquisition frequency of each entry seed task in real time according to the adjustment rules. The application thereby assigns each entry seed an optimal acquisition frequency in real time, avoids the waste of service resources caused by a mismatch between information output and acquisition frequency, and indirectly relieves pressure on the monitored site.
Description
Technical field
This application relates to the technical field of Internet resource search, and in particular to a response scheduling method in a crawler system.
Background
With the arrival of the big data era, the mining and analysis of massive data has become a research hotspot, and data acquisition is the basis of data mining and analysis. In data acquisition, the most important properties are real-time performance, accuracy, and comprehensiveness. Real-time performance, that is, whether information is discovered in time, can directly affect how an event develops, so the frequency at which the program scans the monitored site becomes critical when designing a crawler.
In the related art, a data collection system usually sets one uniform scan frequency for a whole website. During collection, this often leads to poor real-time performance for sections whose data changes quickly. For sections whose data changes slowly, system resources are wasted, and the crawler is also more likely to be detected as non-human browsing and banned. Some systems do divide a website into sections and set different acquisition frequencies for different sections and time periods. However, different websites, and different sections of the same website, each have their own peak periods, and when a site's publication volume or page views change, a previously preset frequency still suffers from poor real-time performance.
Summary of the invention
To overcome, at least to some extent, the poor real-time performance or wasted system resources caused by setting one uniform scan frequency for a website, the application provides a response scheduling method in a crawler system, comprising:
dividing the website into multiple entry seed tasks according to its hierarchical section structure;
presetting an initial acquisition frequency according to the section level of each entry seed task and the amount of news per unit time;
presetting multiple news attributes and formulating an adjustment rule corresponding to each news attribute;
updating the calculated acquisition frequency of each entry seed task in real time according to the adjustment rules.
Further, dividing the website into multiple entry seed tasks according to its hierarchical section structure comprises: the sections of the hierarchical section structure correspond one-to-one to the entry seed tasks.
Further, presetting the initial acquisition frequency according to the section level of each entry seed task and the amount of news per unit time comprises:
presetting the amount of news W published under one section column in one hour;
presetting the number of news items n contained in each page;
calculating the initial acquisition frequency p1 as p1 = 1/(W/n), where p1 is expressed in hours per crawl of the site.
Further, the multiple news attributes include one or more of:
the news output volume, the news read count, the level of the section where the news appears, the news collection time period, the website response speed, whether the news is hot news, and whether the news is an original (first) publication.
Further, formulating the adjustment rule corresponding to each news attribute comprises:
presetting a parameter for each news attribute;
determining the adjustment rule according to the change in the parameter collected in each cycle.
Further, the adjustment rule comprises:
presetting a relation parameter value between each news attribute and the acquisition frequency;
calculating the adjustment rule by an aggregate function.
Further, the aggregate function includes one or more of:
row count (count), average (avg), sum (sum), and maximum (max).
Further, updating the calculated acquisition frequency of each entry seed task in real time according to the adjustment rules comprises:
calculating an intermediate acquisition frequency from the adjustment rule of the first news attribute and the initial acquisition frequency;
starting from that intermediate acquisition frequency, successively applying and accumulating the adjustment rules of the other news attributes to obtain the calculated acquisition frequency.
The technical solution provided by the embodiments of this application can have the following benefits:
The application divides a website into multiple entry seed tasks according to its hierarchical section structure and sets the acquisition frequency of each entry seed task separately, which avoids the poor real-time performance or wasted system resources caused by setting one uniform scan frequency for a whole website. Further, the calculated acquisition frequency of each entry seed task is updated in real time according to the adjustment rules, so the acquisition frequency adjusts automatically across time periods, avoiding the waste of service resources caused by a mismatch between news output and acquisition frequency.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the application.
Brief description of the drawings
The accompanying drawings are incorporated into and form part of this specification. They show embodiments consistent with the application and, together with the specification, serve to explain its principles.
Fig. 1 is a flow chart of a response scheduling method in a crawler system provided by one embodiment of the application.
Detailed description
The invention is described in detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flow chart of a response scheduling method in a crawler system provided by one embodiment of the application.
As shown in Fig. 1, the method of this embodiment includes:
S1: dividing the website into multiple entry seed tasks according to its hierarchical section structure;
S2: presetting an initial acquisition frequency according to the section level of each entry seed task and the amount of news per unit time;
S3: presetting multiple news attributes and formulating an adjustment rule corresponding to each news attribute;
S4: updating the calculated acquisition frequency of each entry seed task in real time according to the adjustment rules.
As an optional implementation of the invention, dividing the website into multiple entry seed tasks according to its hierarchical section structure comprises: the sections of the hierarchical section structure correspond one-to-one to the entry seed tasks.
Sections are, for example, an entertainment section, a finance section, a sports section, and so on. During collection, real-time performance is often poor when collecting data from a fast-changing section such as entertainment. When collecting data from a slow-changing section such as finance, system resources are wasted, and the crawler is also more likely to be detected as non-human browsing and banned. An entry seed is therefore set separately for each section, which avoids the poor real-time performance or wasted system resources caused by setting one uniform acquisition frequency for a whole site.
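The one-to-one mapping between sections and entry seed tasks can be sketched as follows. This is only an illustrative sketch: the `SeedTask` fields, the path-based way of deriving a section's level, the example URLs, and the 30-minute placeholder interval are all assumptions, not details given in the application.

```python
from dataclasses import dataclass


@dataclass
class SeedTask:
    """One entry seed task, tied to exactly one section of the site."""
    section_path: str    # hypothetical naming, e.g. "home/finance"
    entry_url: str       # page the crawler starts from for this section
    level: int           # depth of the section in the site hierarchy
    interval_min: float  # acquisition interval in minutes, tuned later


def build_seed_tasks(sections, default_interval_min=30.0):
    """Create exactly one seed task per section (one-to-one mapping)."""
    tasks = {}
    for path, url in sections.items():
        level = path.count("/")  # "home" -> 0, "home/finance" -> 1
        tasks[path] = SeedTask(path, url, level, default_interval_min)
    return tasks


# Hypothetical site layout used only for illustration.
sections = {
    "home": "https://news.example.com/",
    "home/entertainment": "https://news.example.com/ent/",
    "home/finance": "https://news.example.com/finance/",
    "home/sports": "https://news.example.com/sports/",
}
tasks = build_seed_tasks(sections)
```

Keeping the interval on the task itself means each section can later be tuned independently, which is the point of splitting the site into separate entry seeds.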
As an optional implementation of the invention, presetting an initial acquisition frequency according to the section level of each entry seed task and the amount of news per unit time comprises:
presetting the amount of news W published under one section column in one hour;
presetting the number of news items n contained in each page;
calculating the initial acquisition frequency p1 as p1 = 1/(W/n), where p1 is expressed in hours per crawl of the site.
For example, suppose an entry page publishes 100 items in the hour from 10:00 to 11:00, with 50 items per page, and we want every news item to be stored within one hour. The preset value is calculated from the page data: the entry produces 100/50 = 2 pages of data in one hour, so two pages must be collected per hour. The frequency is therefore 1/2 = 0.5 hours per crawl, that is, a preset frequency of one crawl every 30 minutes.
Setting an initial acquisition frequency provides the basis for the subsequent calculation of the acquisition frequency.
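The formula p1 = 1/(W/n) and the worked example can be checked with a short sketch. The function name and the input validation are assumptions added for illustration; only the formula and the numbers 100 and 50 come from the text.

```python
def initial_interval_hours(news_per_hour, items_per_page):
    """p1 = 1 / (W / n): hours between two crawls of one entry page."""
    if news_per_hour <= 0 or items_per_page <= 0:
        raise ValueError("W and n must be positive")
    pages_per_hour = news_per_hour / items_per_page  # W / n
    return 1.0 / pages_per_hour


# Worked example from the text: W = 100 items per hour, n = 50 items per page.
p1 = initial_interval_hours(100, 50)
interval_min = p1 * 60  # convert hours per crawl to minutes per crawl
```

Note that p1 is an interval (time per crawl), so a smaller value means the entry page is crawled more often.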
As an optional implementation of the invention, the multiple news attributes include:
the news output volume, the news read count, the level of the section where the news appears, the news collection time period, the website response speed, whether the news is hot news, and whether the news is an original (first) publication.
Each news attribute has an adjustment rule for the acquisition frequency.
News output volume: the more news an entry publishes, the higher the acquisition frequency should be; otherwise data may be missed or delayed.
News read count: the number of visits to a news item in a given period indicates how many people are visiting in that period. Raising the acquisition frequency and concurrency during high-traffic periods is less likely to be detected by the site. The read count also reflects the quality of the news published in a column: among entries with the same acquisition frequency, the entry whose news has the higher read count should have its timeliness guaranteed first, so its frequency should be correspondingly higher.
Section level of the seed within the site: the hierarchical position of an entry seed within a website determines the probability that news under that entry is browsed and clicked. A home page entry usually has higher news output and higher quality, so its acquisition frequency should also be higher.
News collection time period: the periods in which a website publishes information are directly related to people's behavior. Each entry publishes a different amount of information in different periods, so the acquisition frequency of an entry should be calculated along the time dimension.
Website response speed: a website's response speed in a given period reflects its load capacity in that period. When the site's load capacity is weak, the acquisition frequency is lowered, which both reduces pressure on the site and prevents the site from detecting the crawler.
Whether the news is hot news: hot news implies high information quality and a high degree of attention, so it should be collected more promptly, at a higher frequency.
Original publication or repost: an original publication usually attracts more attention than a repost, so the acquisition frequency for original publications is higher than for reposts.
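The qualitative rules above can be summarized as a direction of change for the crawl interval. The sketch below is an assumption-laden illustration: the attribute keys, the `DIRECTION` table, and the function name are hypothetical, and the collection time period is left out because the text describes it as depending on preset per-period values rather than on a simple rise or fall.

```python
# Sign of the interval change when the measured attribute value goes UP.
# -1: crawl more often (shorter interval); +1: crawl less often.
# Keys are hypothetical names for the attributes described in the text.
DIRECTION = {
    "news_output": -1,         # more news published -> crawl more often
    "read_count": -1,          # more readers -> timeliness matters more
    "section_prominence": -1,  # closer to the home page -> crawl more often
    "response_time": +1,       # slower site -> back off
    "hot_news_count": -1,      # more hot news -> crawl more often
    "original_count": -1,      # more original publications -> crawl more often
}


def interval_change_sign(attribute, delta):
    """Return -1, 0 or +1: should the interval shrink, stay, or grow?"""
    if delta == 0:
        return 0
    trend = 1 if delta > 0 else -1
    return DIRECTION[attribute] * trend
```

The size of each change is not fixed by this table; it would come from the per-attribute adjustment rules.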
As an optional implementation of the invention, formulating the adjustment rule corresponding to each news attribute comprises:
presetting a parameter for each news attribute;
determining the adjustment rule according to the change in the parameter collected in each cycle.
As an optional implementation of the invention, the adjustment rule comprises:
presetting a relation parameter value between each news attribute and the acquisition frequency;
calculating the adjustment rule by an aggregate function.
For example:
The preset relation parameter value between website response speed and acquisition frequency is 2, which means that each time the measured response time increases by one fold, 2 minutes are added to the acquisition interval.
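One possible reading of this example is sketched below. It assumes that "response speed" is measured as a response time in seconds, that the adjustment grows linearly with each one-fold increase over a baseline, and that a faster-than-baseline response produces no adjustment; the function name, the baseline concept, and the clamping are all assumptions, since the text only gives the parameter value 2.

```python
def response_adjustment_min(baseline_s, current_s, relation_param=2.0):
    """Minutes to add to the crawl interval when the site slows down.

    Assumed rule: relation_param minutes per one-fold (100%) increase
    of the current response time over the baseline response time.
    """
    if baseline_s <= 0 or current_s <= 0:
        raise ValueError("response times must be positive")
    fold_increase = current_s / baseline_s - 1.0
    # Assumption: never shorten the interval from this rule alone.
    return relation_param * max(0.0, fold_increase)


# Response time doubles from 0.5 s to 1.0 s: one-fold increase.
added_doubling = response_adjustment_min(0.5, 1.0)
# Response time triples from 0.5 s to 1.5 s: two-fold increase.
added_tripling = response_adjustment_min(0.5, 1.5)
```

Other readings, such as counting successive doublings, would be equally compatible with the wording; the relation parameter is the only value the text fixes.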
As an optional implementation of the invention, the aggregate function includes one or more of:
row count (count), average (avg), sum (sum), and maximum (max).
For example, to calculate the response speed of a website:
the total time taken to collect 50 news items is calculated with the sum function (sum);
the average time per item, which is taken as the website's response speed, is calculated with the average function (avg).
Calculating each news attribute with aggregate functions is not only simple but also quantifies the attribute, providing the basis for the subsequent calculation of the acquisition frequency.
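The response-speed example can be sketched with the four named aggregate functions. The `aggregate` helper and the sample durations are hypothetical; only the use of count, avg, sum, and max over 50 collected items comes from the text.

```python
def aggregate(values, func):
    """Apply one of the named aggregate functions to a list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    funcs = {
        "count": len,
        "sum": sum,
        "avg": lambda v: sum(v) / len(v),
        "max": max,
    }
    return funcs[func](values)


# Hypothetical fetch durations, in seconds, for 50 collected news items.
durations = [0.25] * 25 + [0.75] * 25

total_time = aggregate(durations, "sum")      # total time for 50 items
response_speed = aggregate(durations, "avg")  # average time per item
```

In a real crawler these values would more likely be computed by the database holding the crawl log, for example with SQL aggregates, rather than in application code.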
As an optional implementation of the invention, updating the calculated acquisition frequency of each entry seed task in real time according to the adjustment rules comprises:
calculating an intermediate acquisition frequency from the adjustment rule of the first news attribute and the initial acquisition frequency;
starting from that intermediate acquisition frequency, successively applying and accumulating the adjustment rules of the other news attributes to obtain the calculated acquisition frequency.
The intermediate acquisition frequency after each news attribute is applied is shown in Table 1.
Table 1: Calculated acquisition frequency update table
In Table 1, the initial acquisition frequency calculated from the hourly news output is 20 minutes per crawl. The news read count increases, and the adjustment rule calculated by the aggregate function is -4, giving a first intermediate acquisition frequency of 16 minutes per crawl. The section level of the news is unchanged but the item count increases, and the adjustment rule calculated by the aggregate function is -4, giving a second intermediate acquisition frequency of 12 minutes per crawl. The collection period is an idle period, so the adjustment rule obtained from the preset relation parameter value between collection period and acquisition frequency is +1, giving a third intermediate acquisition frequency of 13 minutes per crawl. The website's response speed slows, and the adjustment rule calculated by the aggregate function is +4, giving a fourth intermediate acquisition frequency of 17 minutes per crawl. The number of hot news items increases, and the adjustment rule calculated by the aggregate function is -3, giving a fifth intermediate acquisition frequency of 14 minutes per crawl. The number of original publications decreases, and the adjustment rule calculated by the aggregate function is +4, giving a calculated acquisition frequency of 18 minutes per crawl.
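The cumulative update described for Table 1 can be reproduced with a short sketch. The adjustment values are taken from the walkthrough; the function name, the attribute labels, and the 1-minute lower bound on the interval are assumptions added for illustration.

```python
INITIAL_INTERVAL_MIN = 20  # from hourly news output, in minutes per crawl

# (attribute, adjustment in minutes) in the order used in the walkthrough.
ADJUSTMENTS = [
    ("news read count", -4),
    ("section level / item count", -4),
    ("collection period", +1),
    ("website response speed", +4),
    ("hot news count", -3),
    ("original publication count", +4),
]


def apply_adjustments(initial, adjustments, floor=1):
    """Accumulate the adjustments one attribute at a time.

    Returns the final interval and the intermediate interval after
    each attribute. The floor is an assumed safety bound.
    """
    current = initial
    trace = []
    for name, delta in adjustments:
        current = max(floor, current + delta)
        trace.append((name, current))
    return current, trace


final_interval, trace = apply_adjustments(INITIAL_INTERVAL_MIN, ADJUSTMENTS)
for name, value in trace:
    print(f"after {name}: {value} minutes per crawl")
```

Because the adjustments are simply added, the order of the attributes does not change the final value here; it only changes the intermediate values.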
In this embodiment, the website is divided into multiple entry seed tasks according to its hierarchical section structure, and the acquisition frequency of each entry seed task is set separately, which avoids the poor real-time performance or wasted system resources caused by setting one uniform scan frequency for a whole website. Further, the calculated acquisition frequency of each entry seed task is updated in real time according to the adjustment rules, so the acquisition frequency adjusts automatically across time periods. This avoids the waste of service resources caused by a mismatch between news output and acquisition frequency, and indirectly relieves pressure on the monitored site.
It should be understood that identical or similar parts of the above embodiments may be cross-referenced, and content not described in detail in one embodiment may be found in the same or similar content of other embodiments.
It should be noted that, in the description of this application, the terms "first", "second", and so on are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, unless otherwise stated, "multiple" means at least two.
Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of this application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of this application belong.
It should be understood that each part of this application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In this specification, a description referring to "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. Such schematic expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of this application have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting this application; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of this application.
It should be noted that the invention is not limited to the above preferred embodiments. Anyone may derive other products in various forms under the teaching of the invention; however, regardless of any change in shape or structure, any technical solution identical or similar to that of this application falls within the scope of protection of the invention.
Claims (8)
1. A response scheduling method in a crawler system, characterized by comprising:
dividing a website into multiple entry seed tasks according to its hierarchical section structure;
presetting an initial acquisition frequency according to the section level of each entry seed task and the amount of news per unit time;
presetting multiple news attributes and formulating an adjustment rule corresponding to each news attribute;
updating the calculated acquisition frequency of each entry seed task in real time according to the adjustment rules.
2. The method according to claim 1, characterized in that dividing the website into multiple entry seed tasks according to its hierarchical section structure comprises: the sections of the hierarchical section structure correspond one-to-one to the entry seed tasks.
3. The method according to claim 1, characterized in that presetting the initial acquisition frequency according to the section level of each entry seed task and the amount of news per unit time comprises:
presetting the amount of news W published under one section column in one hour;
presetting the number of news items n contained in each page;
calculating the initial acquisition frequency p1 as p1 = 1/(W/n), where p1 is expressed in hours per crawl of the site.
4. The method according to claim 1, characterized in that the multiple news attributes include one or more of:
the news output volume, the news read count, the level of the section where the news appears, the news collection time period, the website response speed, whether the news is hot news, and whether the news is an original (first) publication.
5. The method according to claim 1, characterized in that formulating the adjustment rule corresponding to each news attribute comprises:
presetting a parameter for each news attribute;
determining the adjustment rule according to the change in the parameter collected in each cycle.
6. The method according to claim 5, characterized in that the adjustment rule comprises:
presetting a relation parameter value between each news attribute and the acquisition frequency;
calculating the adjustment rule by an aggregate function.
7. The method according to claim 6, characterized in that the aggregate function includes one or more of:
row count (count), average (avg), sum (sum), and maximum (max).
8. The method according to claim 1, characterized in that updating the calculated acquisition frequency of each entry seed task in real time according to the adjustment rules comprises:
calculating an intermediate acquisition frequency from the adjustment rule of the first news attribute and the initial acquisition frequency;
starting from that intermediate acquisition frequency, successively applying and accumulating the adjustment rules of the other news attributes to obtain the calculated acquisition frequency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811106373.1A CN109388736A (en) | 2018-09-21 | 2018-09-21 | Response scheduling method in crawler system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811106373.1A CN109388736A (en) | 2018-09-21 | 2018-09-21 | Response scheduling method in crawler system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109388736A true CN109388736A (en) | 2019-02-26 |
Family
ID=65418723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811106373.1A Pending CN109388736A (en) | 2018-09-21 | 2018-09-21 | Response scheduling method in crawler system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109388736A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753163A (en) * | 2020-07-08 | 2020-10-09 | 北京鼎泰智源科技有限公司 | Data acquisition method |
CN112835931A (en) * | 2019-11-22 | 2021-05-25 | 珠海格力电器股份有限公司 | Method and device for determining data acquisition frequency |
WO2024078070A1 (en) * | 2022-10-14 | 2024-04-18 | 卡奥斯工业智能研究院(青岛)有限公司 | Data collection resource quantity control method and apparatus, and device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605670A (en) * | 2013-10-29 | 2014-02-26 | 北京奇虎科技有限公司 | Method and device for determining grabbing frequency of network resource points |
CN103617264A (en) * | 2013-12-02 | 2014-03-05 | 北京奇虎科技有限公司 | Method and device for grabbing timeliness seed page |
CN105117501A (en) * | 2015-10-09 | 2015-12-02 | 广州神马移动信息科技有限公司 | Web crawler scheduling method and web crawler system applying same |
CN105868327A (en) * | 2016-03-28 | 2016-08-17 | 浪潮软件集团有限公司 | Distributed web crawler capturing method based on different updating strategies |
CN106126716A (en) * | 2016-06-30 | 2016-11-16 | 北京奇艺世纪科技有限公司 | A kind of data crawling method and device |
CN107193828A (en) * | 2016-03-14 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Novel webpage capture method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190226 |