CN109325104B - Method for dynamically calculating news acquisition service resources - Google Patents

Info

Publication number
CN109325104B
CN109325104B CN201811274611.XA CN201811274611A CN109325104B CN 109325104 B CN109325104 B CN 109325104B CN 201811274611 A CN201811274611 A CN 201811274611A CN 109325104 B CN109325104 B CN 109325104B
Authority
CN
China
Prior art keywords
data
website
acquisition
frequency
acquisition frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811274611.XA
Other languages
Chinese (zh)
Other versions
CN109325104A (en)
Inventor
詹咏松
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Co Ltd
Original Assignee
Global Tone Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Co Ltd
Priority to CN201811274611.XA
Publication of CN109325104A
Application granted
Publication of CN109325104B
Current legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for dynamically calculating news acquisition service resources. Based on previously collected news data and the amount of acquisition resources invested to collect it, the method extracts features from the data, uses a logistic regression model to dynamically determine the data acquisition frequency for a specific website, and from that frequency dynamically determines the acquisition resources that must be invested for that website. The actually collected data volume and the resources actually invested are fed back to continuously correct the parameters of the logistic regression model, so that the acquisition frequency is dynamically corrected and optimized. With this method, the acquisition frequency and the resource input can be adjusted and optimized during acquisition, which addresses problems such as missed news and excessive acquisition cost and reduces acquisition cost while maintaining acquisition quality.

Description

Method for dynamically calculating news acquisition service resources
Technical Field
The invention belongs to the technical field of data analysis, and particularly relates to a method for dynamically calculating news acquisition service resources.
Background
News websites update their content many times a day, and there are a large number of such sites. Enterprises engaged in website data mining and analysis therefore need large amounts of server, bandwidth, and IP resources to collect data from news websites, and each type of resource carries a significant cost. If the acquisition frequency for a news website is too low, news items are easily missed; if it is too high, proxy IPs are also needed so that the news site is less likely to misjudge the collector as abusive traffic.
Existing acquisition systems generally collect website data at a single frequency. Some better systems use hierarchical management: websites are coarsely classified, and each class is collected at a fixed frequency. With these methods it is difficult to configure a reasonable acquisition frequency for each news website, so missed acquisition or excessive acquisition cost cannot be avoided.
Logistic regression is a supervised statistical learning method, and is mainly used for classifying samples.
In a linear regression model, the output is generally continuous, e.g., y = f(x) = ax + b, with one output y for each input x. Both the domain and the range of the model may be (−∞, +∞). For logistic regression, the domain may likewise be the continuous interval (−∞, +∞), but the range is generally discrete, i.e., it contains only a finite number of output values. For example, the range may contain only the two values {0, 1}, which can represent a classification of the sample such as high/low, sick/healthy, or negative/positive; this binary classification is the most common use of logistic regression. In general, the logistic regression model therefore maps x from the whole real line to a finite set of points and thereby classifies x: each value of x is assigned to some class y.
Logistic regression is a generalized linear model. It has basically the same form as the linear regression model: both contain the term ax + b, where a and b are the parameters to be solved. The difference lies in the dependent variable. Linear regression takes ax + b directly as the dependent variable, i.e., y = ax + b, whereas logistic regression maps ax + b to a hidden state p through a function S, p = S(ax + b), and then determines the value of the dependent variable by comparing p with 1 − p. The function S is the sigmoid function:
$$S(t) = \frac{1}{1 + e^{-t}} \tag{1}$$
Substituting t = ax + b gives the parametric form of the logistic regression model:
$$p(x; a, b) = \frac{1}{1 + e^{-(ax + b)}} \tag{2}$$
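As a minimal sketch of equations (1) and (2), the following Python snippet evaluates the sigmoid function and derives a binary class by comparing p with 1 − p. The parameter values a = 2.0 and b = −1.0 and the function names are illustrative assumptions, not values taken from the patent.

```python
import math


def sigmoid(t: float) -> float:
    """Equation (1): map any real t into the open interval (0, 1)."""
    # Branch on the sign of t so that math.exp never overflows.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict_proba(x: float, a: float, b: float) -> float:
    """Equation (2): hidden state p = S(a*x + b)."""
    return sigmoid(a * x + b)


def classify(x: float, a: float, b: float) -> int:
    """Return class 1 when p is larger than 1 - p, otherwise class 0."""
    p = predict_proba(x, a, b)
    return 1 if p > 1.0 - p else 0


# Hypothetical parameters, chosen only for illustration.
a, b = 2.0, -1.0
for x in (-3.0, 0.5, 3.0):
    print(x, round(predict_proba(x, a, b), 4), classify(x, a, b))
```

The decision boundary lies where ax + b = 0, so with these assumed parameters every x above 0.5 is assigned to class 1.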
The problem to be solved by the present invention is how to obtain a satisfactory data acquisition result with minimal computing, storage, and network resources. The invention dynamically evaluates the amount of data collected so far to predict the amount in the next collection, schedules acquisition resources accordingly, and reduces the acquisition resources required while maintaining acquisition accuracy.
Disclosure of Invention
Existing data acquisition systems set the acquisition frequency statically. As a result, acquisition is incomplete, important data is missed, computing, storage, and network resources are wasted, and acquisition cost is excessive. To solve these problems, the invention provides a method for dynamically calculating news acquisition service resources. With this method, the acquisition frequency and the resource input can be adjusted and optimized during acquisition, which addresses problems such as missed news and excessive acquisition cost and reduces acquisition cost while maintaining acquisition quality.
To achieve this aim, the invention adopts the following technical solution:
A method for dynamically calculating news acquisition service resources, characterized in that features are extracted from previously collected news data and the amount of acquisition resources invested to collect it; the data acquisition frequency for a specific website is determined dynamically through a logistic regression model; the acquisition resources that must be invested to collect data from that website are then determined dynamically; and the actually collected data volume and the resources actually invested are used as feedback information to continuously correct the parameters of the logistic regression model, so that the acquisition frequency is dynamically corrected and optimized.
The method comprises the following steps:
1) select the input data;
2) extract features from the input data;
3) normalize each feature value of the input data;
4) use whether the acquisition frequency is increased as the classification label: mark an increased frequency as 1 and a frequency that is not increased as 0;
5) combine the feature values of the input data with the corresponding classification labels to form a data set;
6) randomly divide the data set into two subsets, a training data set and a test data set;
7) select logistic regression as the classification algorithm;
8) train the logistic regression algorithm separately on the training data set of each website to obtain a logistic regression classification model for that website;
9) divide the acquisition frequency into several levels, denoted f1, f2, …, fn from low to high;
10) assign an initial acquisition frequency to each news website and set up an accumulator for it;
11) feed the test data set of each website into its logistic regression classification model to obtain a classification value;
12) if the classification value is 1, raise the website's acquisition frequency to the next higher level, or keep it at fn if the highest acquisition frequency fn has already been reached, and reset the website's accumulator; if the classification value is 0, keep the website's acquisition frequency unchanged and add 1 to the accumulator; if the accumulator reaches a specified threshold, lower the website's acquisition frequency, except that the frequency is kept unchanged once it is back at the website's initial acquisition frequency fi;
13) collect data from each news website at the new acquisition frequency, and use the features of the newly collected data as feedback information to correct and optimize the website's logistic regression classification model, so that the acquisition frequency stays at a reasonable level: neither so low that data is missed nor so high that resources are wasted and acquisition cost rises.
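Steps 1) to 8) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the patent does not name the features, so the two features used here (new articles found per crawl and requests spent per crawl), the synthetic data, the labeling rule, the 80/20 split ratio, and all function names are hypothetical. The model is fitted with simple batch gradient descent on the log loss, which is one common way to train logistic regression and is not necessarily the procedure the inventors used.

```python
import math
import random


def min_max_normalize(rows):
    """Step 3: scale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Step 6: random split into a training set and a test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def sigmoid(t):
    t = max(-60.0, min(60.0, t))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic(train, epochs=500, lr=0.5):
    """Steps 7-8: fit weights and bias by batch gradient descent."""
    w, b = [0.0] * len(train[0][0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in train:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(train) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(train)
    return w, b


def predict(model, x):
    """Step 11: return classification value 1 or 0."""
    w, b = model
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5 else 0


# Hypothetical history for one website:
# (new articles found per crawl, requests spent per crawl).
raw_features = [(n, c) for n in (0, 1, 2, 3, 4, 15, 16, 17, 18, 19)
                for c in (10, 20, 30, 40)]
# Step 4, assumed rule: label 1 (increase frequency) when many new articles appear.
labels = [1 if n >= 10 else 0 for n, _ in raw_features]

features = min_max_normalize(raw_features)          # step 3
dataset = list(zip(features, labels))               # step 5
train_set, test_set = split_dataset(dataset)        # step 6
model = train_logistic(train_set)                   # steps 7-8
accuracy = sum(predict(model, x) == y for x, y in test_set) / len(test_set)
print("test accuracy:", accuracy)
```

In a multi-site deployment, one such model would be trained per website, as step 8) requires.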
Preferably, in step 9), the acquisition frequency is divided into 5 levels, denoted f1, f2, f3, f4, and f5.
Preferably, in step 10), f1 is generally selected as the initial acquisition frequency of each website; for some important websites, a frequency higher than f1 may be used as the initial acquisition frequency to ensure data acquisition quality.
Preferably, in step 12), the threshold is set to 2; that is, if the classification value of a website is 0 twice in succession, the acquisition frequency of that website is lowered.
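The frequency adjustment rule of step 12), with the preferred values of 5 levels and a threshold of 2, can be sketched as a small state machine. The class and attribute names are hypothetical. Two details are assumptions of this sketch because the text leaves them open: the frequency drops by exactly one level at a time, and the accumulator is reset after each drop.

```python
from dataclasses import dataclass, field


@dataclass
class SiteSchedule:
    """Tracks the frequency level (1 = f1 ... max_level = fn) of one website."""
    initial_level: int = 1   # index of the initial frequency fi
    max_level: int = 5       # preferred embodiment: f1 to f5
    threshold: int = 2       # consecutive 0 results before lowering
    level: int = field(init=False)
    accumulator: int = field(init=False, default=0)

    def __post_init__(self):
        self.level = self.initial_level

    def update(self, classification: int) -> int:
        """Apply step 12) for one classification value and return the new level."""
        if classification == 1:
            # Raise by one level, capped at the highest frequency fn.
            self.level = min(self.level + 1, self.max_level)
            self.accumulator = 0
        else:
            self.accumulator += 1
            if self.accumulator >= self.threshold:
                # Lower by one level, but never below the initial frequency fi.
                self.level = max(self.level - 1, self.initial_level)
                self.accumulator = 0  # assumption: restart counting after a drop
        return self.level


ordinary = SiteSchedule()
print([ordinary.update(c) for c in (1, 1, 0, 0, 0, 0, 0, 0)])
# prints [2, 3, 3, 2, 2, 1, 1, 1]

important = SiteSchedule(initial_level=3)
print([important.update(c) for c in (0, 0, 1, 1, 1)])
# prints [3, 3, 4, 5, 5]
```

Keeping the initial level as a floor is what gives important websites their resource guarantee: a site that starts at f3 never falls below f3.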
The advantages and beneficial effects of the invention are as follows. The invention trains a classification algorithm on a website's previously collected data quality and the acquisition resources invested to obtain a trained model, and uses that model to determine the amount of acquisition resources to invest. The actual data acquisition quality and the resources actually invested are then used as feedback information to continuously correct and optimize the classification model. This keeps the resource input dynamically reasonable and avoids both data loss caused by investing too few acquisition resources and the waste and higher acquisition cost caused by investing too many. In addition, for important websites, the invention provides a resource guarantee by raising the initial frequency, which ensures the acquisition quality of important resources.
Detailed Description
The present invention will be further described with reference to the following examples.
Examples
A method for dynamically calculating news collection service resources is implemented according to the following steps:
1) select the input data;
2) extract features from the input data;
3) normalize each feature value of the input data;
4) use whether the acquisition frequency is increased as the classification label: mark an increased frequency as 1 and a frequency that is not increased as 0;
5) combine the feature values of the input data with the corresponding classification labels to form a data set;
6) randomly divide the data set into two subsets, a training data set accounting for 80% and a test data set accounting for 20%;
7) select logistic regression as the classification algorithm;
8) train the logistic regression algorithm separately on the training data set of each website to obtain a logistic regression classification model for that website;
9) divide the acquisition frequency into 5 levels, denoted f1, f2, f3, f4, and f5 from low to high;
10) assign the initial acquisition frequency f1 to each news website and set up an accumulator for it; for individual important websites, set the initial acquisition frequency to f3;
11) feed the test data set of each website into its logistic regression classification model to obtain a classification value;
12) if the classification value is 1, raise the website's acquisition frequency to the next higher level, or keep it at f5 if the highest acquisition frequency f5 has already been reached, and reset the website's accumulator; if the classification value is 0, keep the website's acquisition frequency unchanged and add 1 to the accumulator; if the accumulator reaches 2, lower the website's acquisition frequency, except that the frequency is kept unchanged once it is back at the website's initial acquisition frequency f1 or f3;
13) collect data from each news website at the new acquisition frequency, and use the features of the newly collected data as feedback information to correct and optimize the website's logistic regression classification model, so that the acquisition frequency stays at a reasonable level: neither so low that data is missed nor so high that resources are wasted and acquisition cost rises.
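The feedback correction in step 13) is not specified in detail. One plausible reading, shown here purely as a hedged sketch, is an incremental update: after each collection round, the newly observed feature vector and its label are used for one stochastic gradient step on the existing model parameters. The function names, the learning rate, the feature values, and the assumption that a label is available for the new observation are all hypothetical.

```python
import math


def sigmoid(t):
    t = max(-60.0, min(60.0, t))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(weights, bias, features):
    """Probability that the acquisition frequency should be increased."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def feedback_update(weights, bias, features, label, lr=0.1):
    """One stochastic gradient step on the log loss for a new observation."""
    err = predict_proba(weights, bias, features) - label
    new_weights = [w - lr * err * x for w, x in zip(weights, features)]
    new_bias = bias - lr * err
    return new_weights, new_bias


# Hypothetical starting model and one normalized observation whose feedback
# label says the frequency should have been increased (label = 1).
weights, bias = [0.0, 0.0], 0.0
observation = [0.9, 0.2]

before = predict_proba(weights, bias, observation)
for _ in range(20):
    weights, bias = feedback_update(weights, bias, observation, label=1)
after = predict_proba(weights, bias, observation)
print(round(before, 3), "->", round(after, 3))
```

Each update moves the predicted probability for the observed features toward the feedback label, which is the sense in which the model parameters are "continuously corrected". A full retraining on the enlarged data set would be an equally valid reading of the text.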
Finally, it should be noted that the above examples are given only to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived from them remain within the scope of the invention.

Claims (4)

1. A method for dynamically calculating news acquisition service resources, characterized in that: the quality of previously collected news data and the amount of acquisition resources invested to collect that data are taken as basic data; features are extracted from the data; the data acquisition frequency for a specific website is dynamically determined through a logistic regression model; the acquisition resources that must be invested to collect data from the specific website are then dynamically determined; and the actually collected data volume and the resources actually invested are used as feedback information to continuously correct the parameters of the logistic regression model, thereby dynamically correcting and optimizing the acquisition frequency;
wherein the method comprises the steps of:
1) selecting input data;
2) extracting features from the input data;
3) normalizing each feature value of the input data;
4) using whether the acquisition frequency is increased as the classification label, an increased frequency being marked as 1 and a frequency that is not increased being marked as 0;
5) combining the feature values of the input data with the corresponding classification labels to form a data set;
6) randomly dividing the data set into two subsets, one being a training data set and the other a test data set;
7) selecting logistic regression as the classification algorithm;
8) training the logistic regression algorithm separately on the training data set of each website to obtain a corresponding logistic regression classification model;
9) dividing the acquisition frequency into several levels, denoted f1, f2, …, fn from low to high;
10) assigning an initial acquisition frequency to each news website and setting up an accumulator;
11) feeding the test data set of each website into the logistic regression classification model to obtain a classification value;
12) if the classification value is 1, raising the acquisition frequency of the website to the next higher level, or keeping it at fn if the highest acquisition frequency fn has been reached, and resetting the accumulator corresponding to the website; if the classification value is 0, keeping the acquisition frequency of the website unchanged and adding 1 to the accumulator; if the value of the accumulator reaches a specified threshold, lowering the acquisition frequency of the website, and if the initial acquisition frequency fi of the website has been reached, keeping the acquisition frequency fi unchanged;
13) collecting data from each news website at the new acquisition frequency, and using the features of the newly collected data as feedback information to correct and optimize the logistic regression classification model of the website, so that the acquisition frequency of the website stays at a reasonable level, neither so low that data is missed nor so high that resources are wasted and acquisition cost rises.
2. The method of claim 1, wherein in step 9) the acquisition frequency is divided into 5 levels, denoted f1, f2, f3, f4, and f5.
3. The method of claim 1, wherein in step 10) f1 is selected as the initial acquisition frequency of each website, and for some preset websites a frequency higher than f1 is used as the initial acquisition frequency to ensure data acquisition quality.
4. The method of claim 1, wherein in step 12) the threshold is set to 2, that is, if the classification value of a website is 0 twice in succession, the acquisition frequency of that website is lowered.
CN201811274611.XA 2018-10-30 2018-10-30 Method for dynamically calculating news acquisition service resources Active CN109325104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811274611.XA CN109325104B (en) 2018-10-30 2018-10-30 Method for dynamically calculating news acquisition service resources

Publications (2)

Publication Number Publication Date
CN109325104A (en) 2019-02-12
CN109325104B (en) 2021-11-19

Family

ID=65259700

Country Status (1)

Country Link
CN (1) CN109325104B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357875B (en) * 2021-12-27 2022-09-02 广州龙数科技有限公司 Intelligent data processing system based on machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008158906A (en) * 2006-12-25 2008-07-10 Nec Corp System, method and program for adjusting collection interval in resource monitoring
CN104486166A (en) * 2014-12-31 2015-04-01 北京理工大学 QoS-based sampling period adjusting method for networked control system
CN107203623A (en) * 2017-05-26 2017-09-26 山东省科学院情报研究所 The load balancing adjusting method of network crawler system
CN108549595A (en) * 2018-04-18 2018-09-18 江苏物联网研究发展中心 A kind of computing system status information dynamic collecting method and system
CN108595666A (en) * 2018-04-28 2018-09-28 中译语通科技股份有限公司 Dynamic calculates the method for news collection Service Source, information data processing terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
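Informing the curious negotiator: Automatic news extraction from the Internet; Zhang, D; Simoff, SJ; 《Lecture Notes in Artificial Intelligence》; 20061231; full text *
智能新闻采集处理系统的设计与实现 (Design and Implementation of an Intelligent News Collection and Processing System); 张建林 (Zhang Jianlin); 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 20170930; full text *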

Also Published As

Publication number Publication date
CN109325104A (en) 2019-02-12

Similar Documents

Publication Publication Date Title
CN110659173A (en) Operation and maintenance system and method
CN109471847B (en) I/O congestion control method and control system
CN112529204A (en) Model training method, device and system
CN111489188B (en) Resident adjustable load potential mining method and system
CN109800220B (en) Big data cleaning method, system and related device
CN117493921B (en) Artificial intelligence energy-saving management method and system based on big data
CN109325104B (en) Method for dynamically calculating news acquisition service resources
CN110942098A (en) Power supply service quality analysis method based on Bayesian pruning decision tree
CN114912720A (en) Memory network-based power load prediction method, device, terminal and storage medium
CN115622867A (en) Industrial control system safety event early warning classification method and system
CN108777870B (en) LTE high-load cell discrimination method and system based on Pearson coefficient
CN112486676B (en) Data sharing and distributing device based on edge calculation
CN112925964A (en) Big data acquisition method based on cloud computing service and big data acquisition service system
CN104182470A (en) SVM (support vector machine) based mobile terminal application classification system and method
CN116502802A (en) Data management system based on big data and wireless sensing technology
US20230034061A1 (en) Method for managing proper operation of base station and system applying the method
CN112613521B (en) Multilevel data analysis system and method based on data conversion
WO2022062777A1 (en) Data management method, data management apparatus, and storage medium
CN115115107A (en) Photovoltaic power prediction method and device and computer equipment
CN113434869A (en) Data processing method and AI system based on threat perception big data and artificial intelligence
CN111709611A (en) Agricultural big data processing method and device
CN110401727B (en) IP address analysis method and device
CN115514621B (en) Fault monitoring method, electronic device and storage medium
CN111741083B (en) Communication data processing method based on edge computing and Internet of things and cloud server
CN117472589B (en) Park network service management method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant