CN109325104B - Method for dynamically calculating news acquisition service resources

Info

Publication number: CN109325104B
Application number: CN201811274611.XA
Authority: CN
Inventors: 詹咏松; 程国艮
Original assignee: Glabal Tone Communication Technology Co ltd
Current assignee: Glabal Tone Communication Technology Co ltd
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2021-11-19
Anticipated expiration: 2038-10-30
Also published as: CN109325104A

Abstract

The invention discloses a method for dynamically calculating news acquisition service resources. The method comprises the steps of extracting characteristics of data on the basis of previously collected news data and the amount of collected resources invested for collecting the data, dynamically analyzing and determining the data collection frequency of a specific website through a logistic regression model, further dynamically determining the collected resources needed to be invested for collecting the data of the specific website, continuously correcting parameters of the logistic regression model through the actually collected data amount and the resource investment as feedback information, and dynamically correcting and optimizing the collection frequency. By the method, the acquisition frequency and the resource input can be dynamically adjusted and optimized in the acquisition process, the problems of missed acquisition, overhigh acquisition cost and the like are effectively solved, and the acquisition cost is greatly reduced on the premise of ensuring the acquisition quality.

Description

Method for dynamically calculating news acquisition service resources

Technical Field

The invention belongs to the technical field of data analysis, and particularly relates to a method for dynamically calculating news acquisition service resources.

Background

News websites update their content frequently every day, and the number of sites is large. Enterprises engaged in website data mining and analysis therefore need many servers, considerable bandwidth, and many IP addresses to collect data from news websites, and each type of resource carries a significant cost. If the acquisition frequency for a news website is too low, news items are easily missed; if it is too high, proxy IPs are additionally needed so that the news site is less likely to misjudge the acquisition traffic.

Existing acquisition systems generally collect website data at a single frequency. Some better systems use hierarchical management: websites are coarsely classified, and each class is collected at a fixed frequency. With these methods it is difficult to configure the acquisition frequency of a news website reasonably, so missed acquisition or excessive acquisition cost cannot be avoided.

Logistic regression is a supervised statistical learning method, and is mainly used for classifying samples.

In a linear regression model the output is generally continuous, e.g., y = f(x) = ax + b, with one output y for each input x. Both the domain and the range of the model may be (-∞, +∞). For logistic regression the domain may likewise be the continuous interval (-∞, +∞), but the range is generally discrete, i.e., it has only a limited number of output values. For example, the range may contain only the two values {0, 1}, which can represent a classification of the sample such as high/low, sick/healthy, or negative/positive; this binary case is the most common form of logistic regression. In general, the logistic regression model maps x from the whole real line to a finite set of points and thereby classifies x: each value of x is assigned to some class y.

Logistic regression is a generalized linear model. Its form is basically the same as that of the linear regression model: both contain the term ax + b, where a and b are the parameters to be solved. The difference lies in the dependent variable. Multiple linear regression takes ax + b directly as the dependent variable, i.e., y = ax + b, whereas logistic regression maps ax + b to a hidden state p through a function S, p = S(ax + b), and then determines the value of the dependent variable by comparing p with 1 - p. The function S is the Sigmoid function:

$$S(t) = \frac{1}{1 + e^{-t}} \qquad (1)$$

Substituting t = ax + b gives the parametric form of the logistic regression model:

$$p = \frac{1}{1 + e^{-(ax + b)}} \qquad (2)$$
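As a minimal illustration of the mapping described above, the following Python sketch evaluates the standard Sigmoid function S(t) = 1 / (1 + e^(-t)) and the decision rule that compares p with 1 - p. The function names `sigmoid` and `classify` and the sample parameter values are assumptions made for illustration only; they do not come from the patent text.

```python
import math


def sigmoid(t: float) -> float:
    """Standard Sigmoid S(t) = 1 / (1 + e^(-t)), computed in an overflow-safe way."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)  # for negative t, exp(t) stays within [0, 1)
    return z / (1.0 + z)


def classify(x: float, a: float, b: float) -> int:
    """Hidden state p = S(a*x + b); return 1 when p >= 1 - p, otherwise 0."""
    p = sigmoid(a * x + b)
    return 1 if p >= 1.0 - p else 0


# Illustrative parameters (hypothetical): a = 1.0, b = -1.0
print(classify(2.0, 1.0, -1.0))  # a*x + b = 1.0 > 0, so p > 0.5
print(classify(0.0, 1.0, -1.0))  # a*x + b = -1.0 < 0, so p < 0.5
```

Comparing p with 1 - p is equivalent to testing whether ax + b is non-negative, which is why the decision boundary of the model is linear in x.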

The problem to be solved by the present invention is how to obtain a satisfactory data acquisition result with minimal computing, storage, and network resources. The invention dynamically evaluates the amount of data collected, predicts the next collection amount, allocates acquisition resources accordingly, and reduces the acquisition resources required while ensuring acquisition accuracy.

Disclosure of Invention

Existing data acquisition systems set the data acquisition frequency statically, which leads to incomplete acquisition, missed important data, wasted computing, storage, and network resources, and excessive acquisition cost. To solve these problems, the invention provides a method for dynamically calculating news acquisition service resources. With this method, the acquisition frequency and the resource input can be dynamically adjusted and optimized during acquisition, which effectively addresses missed acquisition and excessive acquisition cost and greatly reduces acquisition cost while maintaining acquisition quality.

In order to achieve the aim, the invention adopts the following technical scheme:

A method for dynamically calculating news acquisition service resources, characterized in that: features are extracted from previously collected news data and from the amount of acquisition resources invested to collect that data; the data acquisition frequency of a specific website is determined dynamically by a logistic regression model; the acquisition resources that must be invested to collect data from that website are then determined dynamically; and the amount of data actually collected and the amount of resources invested serve as feedback information to continuously correct the parameters of the logistic regression model, so that the acquisition frequency is dynamically corrected and optimized.

A method of dynamically computing a news gathering service resource, the method comprising the steps of:

1) selecting input data;

2) extracting input data characteristics;

3) normalizing each characteristic value of the input data;

4) using whether the acquisition frequency is increased as the classification label: an increased frequency is marked 1, and a frequency that is not increased is marked 0;

5) combining the characteristic values of the input data with the corresponding classification labels to form a data set;

6) randomly dividing the data set into two subsets, one being the training data set and the other the testing data set;

7) selecting a logistic regression algorithm as a classification algorithm;

8) respectively training a logistic regression algorithm by taking the training data set of each website as input to obtain a corresponding logistic regression classification model;

9) dividing the acquisition frequency into a plurality of classes, denoted f1, f2, …, fn from low to high;

10) allocating an initial acquisition frequency for each news website, and setting an accumulator;

11) taking a test data set of each website as input, and giving a classification value through a logistic regression classification model;

12) if the classification value is 1, raising the acquisition frequency of the website to the next higher level, or keeping it at fn if the highest acquisition frequency fn has already been reached, and resetting the accumulator corresponding to the website; if the classification value is 0, keeping the acquisition frequency of the website unchanged and adding 1 to the accumulator, then, if the value of the accumulator reaches a specified threshold, lowering the acquisition frequency of the website, except that once the initial acquisition frequency fi of the website has been reached the frequency is kept at fi;

13) collecting data from each news website at the new acquisition frequency, and using the features of the newly collected data as feedback information to correct and optimize the logistic regression classification model of that website, so that the acquisition frequency of the website stays at a reasonable level: not so low that data is missed, and not so high that resources are wasted and the acquisition cost rises.
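The normalization, training, and classification steps (3, 7, 8 and 11) can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the features, the normalization scheme, or the training procedure, so min-max scaling, batch gradient descent, the function names, and the sample history for one website are all hypothetical choices.

```python
import math


def min_max_normalize(rows):
    """Step 3 (assumed scheme): scale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    return [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]


def sigmoid(t):
    """Overflow-safe Sigmoid function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    return math.exp(t) / (1.0 + math.exp(t))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Steps 7-8: fit weights w and bias b by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Step 11: classification value, 1 = raise the frequency, 0 = do not."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical history for one website: [new articles per visit, resource units spent].
raw = [[2, 5], [3, 6], [4, 5], [20, 5], [25, 6], [30, 4]]
labels = [0, 0, 0, 1, 1, 1]  # step 4: 1 = the frequency was increased
X = min_max_normalize(raw)
w, b = train_logistic(X, labels)
print([predict(w, b, x) for x in X])
```

One model is trained per website, as step 8 states, so in practice the last four lines would run once for each site's own history.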

Preferably, in step 9), the acquisition frequency is divided into 5 classes, denoted f1, f2, f3, f4, and f5.

Preferably, in step 10), f1 is generally selected as the initial acquisition frequency of each website; for some important websites, a frequency higher than f1 may be used as the initial acquisition frequency to ensure data acquisition quality.

Preferably, in step 12), the threshold is set to 2; that is, if the classification value of a website is 0 twice in succession, the acquisition frequency of that website is reduced.
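The frequency adjustment rule of steps 9) to 12) can be sketched as a small state update. The class name `SiteSchedule`, the function `update`, and the field names are hypothetical, and frequency classes are stored as zero-based indices (0 stands for f1). The text does not say what happens to the accumulator after a reduction; this sketch assumes it restarts from zero.

```python
from dataclasses import dataclass


@dataclass
class SiteSchedule:
    level: int           # current frequency class as an index, 0 = f1
    floor: int           # initial class of the site; the frequency never drops below it
    top: int             # index of the highest class fn
    zero_count: int = 0  # the accumulator set up in step 10


def update(site: SiteSchedule, classification: int, threshold: int = 2) -> SiteSchedule:
    """Apply one classification value (step 12) to a site's schedule."""
    if classification == 1:
        # Raise by one class, capped at fn, and reset the accumulator.
        site.level = min(site.level + 1, site.top)
        site.zero_count = 0
    else:
        # Keep the frequency for now and count consecutive zeros.
        site.zero_count += 1
        if site.zero_count >= threshold:
            site.level = max(site.level - 1, site.floor)
            site.zero_count = 0  # assumption: restart counting after a reduction
    return site


# Ordinary site starting at f1 with five classes (indices 0..4).
site = SiteSchedule(level=0, floor=0, top=4)
print([update(site, c).level for c in [1, 1, 0, 0, 0, 0, 0, 0]])
# → [1, 2, 2, 1, 1, 0, 0, 0]
```

Because any classification value of 1 resets the accumulator, only consecutive zeros can trigger a reduction, which matches the "twice in succession" wording for a threshold of 2.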

The advantages and beneficial effects of the invention are as follows. A classification algorithm is trained on a website's previously collected data quality and acquisition resource input to obtain a model; the model determines the amount of acquisition resources to invest; and the actual acquisition quality and resource input are fed back to continuously correct and optimize the classification model. This keeps the resource input dynamically reasonable, avoiding both the data loss caused by too little acquisition resource input and the resource waste and higher acquisition cost caused by too much. For important websites, the invention also guarantees resources by raising the initial frequency, ensuring the acquisition quality of important resources.

Detailed Description

The present invention will be further described with reference to the following examples.

Examples

A method for dynamically calculating news collection service resources is implemented according to the following steps:

1) selecting input data;

2) extracting input data characteristics;

3) normalizing each characteristic value of the input data;

6) randomly dividing the data set into two subsets, a training data set accounting for 80% of the data and a testing data set accounting for 20%;

7) selecting a logistic regression algorithm as a classification algorithm;

9) dividing the acquisition frequency into 5 classes which are respectively marked as f1, f2, f3, f4 and f5 from low to high;

10) allocating the initial acquisition frequency f1 to each news website and setting an accumulator, with the initial acquisition frequency f3 set for individual important websites;

12) if the classification value is 1, raising the acquisition frequency of the website to the next higher level, or keeping it at f5 if the highest acquisition frequency f5 has already been reached, and resetting the accumulator corresponding to the website; if the classification value is 0, keeping the acquisition frequency of the website unchanged and adding 1 to the accumulator, then, if the value of the accumulator reaches 2, lowering the acquisition frequency of the website, except that once the initial acquisition frequency f1 or f3 of the website has been reached the frequency is kept at f1 or f3;
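Step 6 of this embodiment fixes the split at 80% training data and 20% testing data. A minimal sketch of such a random split is shown here; the function name `split_80_20` and the fixed seed are assumptions added so that the example is reproducible, and the patent does not prescribe how the shuffle is performed.

```python
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and return (training, testing) in an 80/20 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = len(shuffled) * 4 // 5           # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical sample identifiers for one website.
train, test = split_80_20(range(10))
print(len(train), len(test))  # → 8 2
```

Each website's data set is split separately, since a separate model is trained and tested per site.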

Finally, it should be noted that the above examples are given only to illustrate the present invention clearly and do not limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived from the above remain within the scope of the invention.

Claims

1. A method for dynamically calculating news gathering service resources is characterized in that: the method comprises the steps of taking the quality of news data collected in the past and the amount of collection resources invested for collecting the data as basic data, extracting the characteristics of the data, dynamically analyzing and determining the data collection frequency of a specific website through a logistic regression model, further dynamically determining the collection resources needed to be invested for collecting the data of the specific website, and continuously correcting the parameters of the logistic regression model through taking the actually collected data amount and the resource investment amount as feedback information to realize dynamic correction and optimization of the collection frequency;

wherein the method comprises the steps of:

1) selecting input data;

2) extracting input data characteristics;

3) normalizing each characteristic value of the input data;

7) selecting a logistic regression algorithm as a classification algorithm;

2. The method of claim 1, wherein in step 9) the acquisition frequency is divided into 5 classes, denoted f1, f2, f3, f4, and f5.

3. The method of claim 1, wherein in step 10) f1 is selected as the initial acquisition frequency of each website, and for some preset websites a frequency higher than f1 is adopted as the initial acquisition frequency to ensure data acquisition quality.

4. The method of claim 1, wherein in step 12) the threshold is set to 2; that is, if the classification value of a website is 0 two consecutive times, the acquisition frequency of that website is reduced.