CN113726558A - Network equipment flow prediction system based on random forest algorithm

Info

Publication number: CN113726558A
Application number: CN202110910221.2A
Authority: CN
Inventors: 吴飞; 李霆; 吴树霖; 沈立翔; 肖传奇; 孔美美; 吴珍
Original assignee: State Grid Fujian Electric Power Co Ltd
Current assignee: State Grid Fujian Electric Power Co Ltd
Priority date: 2021-08-09
Filing date: 2021-08-09
Publication date: 2021-11-30

Abstract

The invention discloses a network equipment flow prediction system based on a random forest algorithm, comprising a data acquisition module, a data processing module, a model construction module, a model evaluation module and an input-output module. The data acquisition module acquires data associated with network equipment in operation. The data processing module processes the collected data to obtain the evaluation indexes and their distribution. The model construction module constructs a random forest regression model by means of a random forest algorithm. The model evaluation module evaluates the random forest regression model against the evaluation standard for regression prediction models, using the collected data set together with the evaluation indexes and evaluation index distribution obtained by the data processing module, and persistently stores a model that meets the requirements. The input-output module performs the calculation using the relevant input data and the persistently stored model, and outputs a predicted value of the network equipment flow.

Description

Network equipment flow prediction system based on random forest algorithm

Technical Field

The invention belongs to the technical field of network equipment flow prediction, and particularly relates to a network equipment flow prediction system based on a random forest algorithm.

Background

When a network service is required, it is difficult to judge the total amount of flow that will be needed. Purchasing too much causes waste, while purchasing too little affects normal office work and everyday use. A method and a device capable of predicting the flow of equipment in a network are therefore needed, so that the approximate total flow can be predicted and logistics personnel and others can make a preliminary judgment when purchasing flow.

Disclosure of Invention

The invention aims to provide a network equipment flow prediction system based on a random forest algorithm, which can roughly predict the total flow of a later period from existing data. On the one hand, this facilitates logistics and user management and allows the flow to be updated in time; on the other hand, it allows a rough budget to be drawn up, lets large-scale systems and companies that require network monitoring know the approximate flow in time, and retains data for later statistics.

In order to achieve the technical effects, the invention is realized by the following technical scheme.

The network equipment flow prediction system based on the random forest algorithm comprises a data acquisition module, a data processing module, a model construction module, a model evaluation module and an input-output module, wherein the data acquisition module is used for acquiring relevant data associated with network equipment in operation;

a data processing module: according to the collected data, corresponding data processing is carried out, and then relevant evaluation index distribution and evaluation indexes are obtained;

a model construction module: constructing a random forest regression model through a random forest algorithm;

a model evaluation module: according to the evaluation standard of the regression prediction model, evaluating the random forest regression model by using the collected data set and the evaluation index distribution and evaluation index obtained in the data processing module, and performing persistent storage on the model meeting the requirements after evaluation;

an input-output module: performing the calculation using the relevant input data and the persistently stored model, and outputting a predicted value of the network equipment flow.

In the technical scheme, a random forest algorithm is selected. Like a traditional regression model, random forest regression can explain the influence of a plurality of independent variables on a dependent variable. Suppose that the dependent variable Y has n observations, i.e., n cases, and that k independent variables influence the observed values. When constructing each regression tree, the random forest randomly draws part of the observations of the dependent variable Y by the Bootstrap resampling method, and randomly selects a specified number of variables from the k independent variables to determine the nodes of the tree. Because of this randomness, the regression tree obtained may differ for each construction. On this basis, a random forest typically generates hundreds or even thousands of trees; for classification the most frequent output among the trees is taken as the final result, and for regression the outputs of the trees are averaged.

According to the technical scheme, the corresponding module units are set according to requirements, and the module units realize model construction and subsequent data input according to functions and relevance of other modules to obtain a predicted effect.

In the technical scheme, the evaluation indexes are used for evaluation, and if the evaluation indexes meet requirements, the model is subjected to persistent storage for subsequent application.

As a further improvement of the invention, the system also comprises an updating module: when the persistently stored model is evaluated as no longer meeting the standard, evaluation continues through the model evaluation module until a new model meeting the standard is obtained, and the non-compliant old model is replaced by the new model.

In the technical scheme, when the old model is eliminated, the corresponding updating and replacing can be carried out through the updating module so as to realize the effects of continuity, durability and updating.
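The replacement logic of the updating module can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `update_persisted_model`, the dictionary used as a model store, the RMSE limit, and the constant-level toy model are all hypothetical and are not taken from the invention.

```python
import math

def update_persisted_model(store, evaluate, train_candidate, max_rmse):
    """Replace store['model'] when it no longer meets the RMSE limit.

    Returns True only if the old model was replaced by a compliant new one.
    """
    if evaluate(store["model"]) <= max_rmse:
        return False                       # persisted model is still compliant
    candidate = train_candidate()          # build a new candidate model
    if evaluate(candidate) <= max_rmse:
        store["model"] = candidate         # swap in the compliant new model
        return True
    return False                           # keep looking; nothing replaced yet

# Toy setting (assumption): a "model" is just a constant traffic level.
recent = [120.0, 121.0, 119.0, 120.0]      # made-up recent traffic readings

def rmse_of(model):
    return math.sqrt(sum((v - model["level"]) ** 2 for v in recent) / len(recent))

store = {"model": {"level": 100.0}}        # stale persisted model
replaced = update_persisted_model(
    store, rmse_of, lambda: {"level": sum(recent) / len(recent)}, max_rmse=2.0)
```

In this toy run the stale model misses the recent readings by about 20 units, so it is replaced by the candidate fitted to the recent data.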

As a further improvement of the invention, the invention also comprises a prediction method of the network equipment flow prediction system based on the random forest algorithm, which comprises the following steps:

data collection: according to network flow data associated with the operation of the network equipment, acquiring related data associated with the network equipment in the operation of the network equipment, and classifying the related data to obtain a data set of a plurality of types of data;

data processing: filtering, processing and analyzing each type of data set to obtain the distribution condition of the evaluation index in each type of data set;

data feature processing: for some of the data sets among the plurality of types, noise is removed according to the type of the screened data while the curve features of the data set are preserved, and the evaluation index ratio of the data is obtained by normalization;

constructing a model: selecting relevant influence factors of various data by adopting a random forest algorithm, and carrying out model construction by combining a core index value and an evaluation index ratio by using a training set, and testing by using a testing set;

model evaluation, persistence and prediction: dividing the data set into a training set and a verification set, training the model with the training set, verifying it with the verification set, evaluating it according to the evaluation indexes, and, if the evaluation indexes meet the requirements, predicting the flow of the network equipment with the model.

In the technical scheme, a model is constructed by utilizing multiple types of data, so that the influence of multiple factors on the flow can be fully considered, and the prediction accuracy is improved; and the post-processing of the data set, the association with the evaluation index and the like provide a relatively accurate basis for the subsequent further evaluation.

The random forest in the technical scheme is established in a random mode and comprises a classifier with a plurality of decision trees. The class of its output is determined by the mode of the class output by the respective tree.

Randomness is mainly reflected in two aspects: (1) when each tree is trained, selecting a data set with the same size as N, which can be repeated, from all training samples (the number of samples is N) to train (namely bootstrap sampling); (2) at each node, a subset of all the features is randomly selected for calculating the optimal segmentation.
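The two sources of randomness described above can be illustrated with a short sketch. The helper names `bootstrap_indices` and `random_feature_subset`, the sample count of 10 and the feature count of 6 are assumptions chosen only for illustration.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (bootstrap sampling)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, m_try, rng):
    """Pick m_try distinct feature indices to consider at one node."""
    return rng.sample(range(n_features), m_try)

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
in_bag = bootstrap_indices(10, rng)         # same size as the training set, repeats allowed
out_of_bag = sorted(set(range(10)) - set(in_bag))   # cases never drawn for this tree
candidates = random_feature_subset(n_features=6, m_try=2, rng=rng)
```

Because the draw is with replacement, some cases appear several times in `in_bag` and others not at all; the latter form the out-of-bag cases for that tree.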

Its advantages are as follows:

1. On many current data sets it performs well and has clear advantages over other algorithms

2. It can process data of very high dimension (many features) without requiring feature selection

3. After training, it can indicate which features are important

4. When a random forest is created, an unbiased estimate of the generalization error is used, so the model generalizes well

5. Training is fast and easy to parallelize

6. During training, interactions between features can be detected

7. It is relatively simple to implement

8. For unbalanced data sets, it can balance the error.

9. Accuracy can still be maintained if a significant portion of the features are missing.

In the technical scheme, the later stage also involves model evaluation and persistence, so that whether the model is qualified can be monitored; a qualified model is retained for subsequent reuse and verification, which improves accuracy.

As a further improvement of the present invention, the data at least includes 3 types, which are respectively network device traffic data, host traffic data, and network design parameter information, the traffic data of the network device and the traffic data of the host are time sequence data sorted by time, and the data granularity of the time sequence data is in the minute level.

In the technical scheme, three different types of data are used to collect the parameters related to network equipment flow. The equipment itself and the interference of various factors such as the network are fully considered, so that the factors considered are more comprehensive and diverse, which improves the accuracy at the later stage.

As a further improvement of the present invention, in the data collecting step, the network design parameter information includes a local area network topology structure, a maximum load and/or a carrying capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.

In the technical scheme, various topological structures are fully considered in the network design parameters, and factors such as conformity, bearing capacity, safety and stability are also integrated, so that the whole system is more integrated in consideration of the factors, and the comprehensiveness of the network design parameters is improved.

As a further improvement of the invention, the data processing specifically comprises index filtering, data preprocessing and threshold solving to obtain the distribution condition of the evaluation index in each type of data set.

In the technical scheme, in the data processing process, preliminary noise reduction and impurity removal are carried out through filtering; then, through data preprocessing, the data is subjected to processing such as screening according to requirements and the like; and in the later stage, the data after being processed by denoising and the like is analyzed to find out the evaluation index of the core, so that the subsequent calculation is reduced, and the data is simplified.

As a further improvement of the invention, the index filtering is to screen out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.

In the technical scheme, screening, recognition recovery processing, analysis and the like are utilized, so that the data have a complete processing process from beginning to end, the noise is low and the calculated amount is low during later-stage data application, and meanwhile, data support is provided for early-stage data screening and the like.

As a further improvement of the present invention, the data feature processing of the partial data set specifically includes: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.

In the technical scheme, continuous data is volatile and subject to instantaneous changes, which strongly affects model training and accuracy, so a filtering algorithm, such as a Kalman filtering algorithm, is required to smooth the data. Different evaluation indexes often have different dimensions and units, which affects the result of data analysis; to eliminate the influence of dimension among the indexes, the data is generally normalized, although not all models require normalization.
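As an illustration of the smoothing mentioned above, the following is a minimal sketch of a scalar Kalman filter. It assumes a constant-level (random walk) state model; the function name `kalman_smooth`, the noise variances and the traffic readings are hypothetical values chosen for the example, not parameters given by the invention.

```python
def kalman_smooth(measurements, process_var=1e-2, measurement_var=1.0):
    """Scalar Kalman filter with a constant-level (random walk) state model."""
    estimate = measurements[0]    # initial state estimate
    error_var = 1.0               # initial uncertainty of the estimate
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the level is assumed unchanged, so only uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed

traffic = [100.0, 102.0, 98.0, 180.0, 101.0, 99.0, 103.0]   # made-up readings with one spike
smoothed = kalman_smooth(traffic)
```

A larger `measurement_var` relative to `process_var` makes the filter trust individual readings less, which damps spikes more strongly at the cost of reacting more slowly to genuine changes in level.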

As a further improvement of the invention, the denoising process comprises a smoothing process and parameter threshold filtering, and the smoothing process adopts a filtering algorithm.

In the technical scheme, through various denoising and the like, the volatility and the instantaneity of the method can be reduced, so that the influence on the precision in later-stage operation is reduced, and the whole precision is improved.

As a further improvement of the present invention, the model is constructed by repeatedly drawing b training sample sets from the data sets of the plurality of types, constructing b regression trees from the b training sample sets, taking the cases not drawn each time to form b sets of out-of-bag data as test sample sets, constructing a random forest, and obtaining a prediction result by using the test sample sets.

According to the technical scheme, the training sample sets and the testing sample sets which are equal in number are selected, so that the testing precision can be improved, and an optimal prediction structure can be obtained.

Drawings

FIG. 1 is a flowchart of a method for predicting network device traffic according to the present invention;

FIG. 2 is a flow chart of a random forest regression method provided by the present invention;

FIG. 3 is a flow chart of network device traffic prediction according to the present invention;

FIG. 4 is a schematic circuit diagram of traffic prediction of a network device according to the present invention.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

Example 1

In this embodiment, a main module of a network device traffic prediction system based on a random forest algorithm is mainly introduced.

Referring to fig. 4, the invention further discloses a network device traffic prediction system based on a random forest algorithm, which comprises a data acquisition module, a data processing module and a traffic prediction module, wherein the data acquisition module is used for acquiring relevant data related to the network device in operation;

In this embodiment, corresponding module units are set as required, and the module units implement model construction and subsequent data input according to functions and relevance with other modules, so as to obtain a predicted effect.

In this embodiment, a random forest algorithm is selected. Like a traditional regression model, random forest regression can explain the influence of a plurality of independent variables on a dependent variable. Suppose that the dependent variable Y has n observations, i.e., n cases, and that k independent variables influence the observed values. When constructing each regression tree, the random forest randomly draws part of the observations of the dependent variable Y by the Bootstrap resampling method, and randomly selects a specified number of variables from the k independent variables to determine the nodes of the tree. Because of this randomness, the regression tree obtained may differ for each construction. On this basis, a random forest typically generates hundreds or even thousands of trees; for classification the most frequent output among the trees is taken as the final result, and for regression the outputs of the trees are averaged.

Specifically, the system further comprises an updating module: when the persisted model is evaluated as no longer meeting the standard, evaluation continues through the model evaluation module until a new model meeting the standard is obtained, and the non-compliant old model is replaced by the new model.

In this embodiment, when an old model is eliminated, a corresponding update replacement can be performed through the update module, so as to achieve the effects of persistence, permanence, and update.

Example 2

In this embodiment, a flow of a prediction method of a network device traffic prediction system based on a random forest algorithm is mainly described.

Referring to fig. 1-4, the method includes the following steps:

In the embodiment, a model is constructed by utilizing multiple types of data, so that the influence of multiple factors on the flow can be fully considered, and the prediction accuracy is improved; and the post-processing of the data set, the association with the evaluation index and the like provide a relatively accurate basis for the subsequent further evaluation.

The random forest in this embodiment is a classifier that is created in a random manner and includes a plurality of decision trees. The class of its output is determined by the mode of the class output by the respective tree.

Its advantages are as follows:

3. After training, it can indicate which features are important

5. Training is fast and easy to parallelize

7. It is relatively simple to implement

8. For unbalanced data sets, it can balance the error.

In the embodiment, the later stage also involves model evaluation and persistence, so that whether the model is qualified can be monitored; a qualified model is retained for subsequent reuse and verification, which improves accuracy.

Example 3

In this embodiment, related data related to the operation of the network device is mainly introduced.

Specifically, the data at least includes 3 types, which are respectively network device traffic data, host traffic data, and network design parameter information, where the traffic data of the network device and the traffic data of the host are time sequence data sorted by time, and the data granularity of the time sequence data is in the minute level.

In this embodiment, three different types of data are used to collect the parameters related to network equipment flow. The equipment itself and the interference of various factors such as the network are fully considered, so that the factors considered are more comprehensive and diverse, which improves the accuracy at the later stage.

Further, in the data collection step, the network design parameter information includes a local area network topology structure, a maximum load and/or a bearing capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.

In this embodiment, the network design parameters fully consider various topological structures, and also integrate the factors such as compliance, bearing capacity, safety and stability, so that the whole system is more comprehensive in consideration of the factors, and the comprehensiveness of the network design parameters is improved.

Example 4

In this embodiment, other key steps are introduced.

Referring to the attached drawings, the data processing specifically includes index filtering, data preprocessing and threshold solving to obtain distribution conditions of the evaluation indexes in each type of data set.

In the embodiment, in the data processing process, preliminary noise reduction and impurity removal are carried out through filtering; then, through data preprocessing, the data is subjected to processing such as screening according to requirements and the like; and in the later stage, the data after being processed by denoising and the like is analyzed to find out the evaluation index of the core, so that the subsequent calculation is reduced, and the data is simplified.

Further, the index filtering specifically comprises screening out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.

In the embodiment, screening, recognition, recovery, processing, analysis and the like are utilized, so that the data have a complete processing process from beginning to end, the noise is low and the calculated amount is low during later-stage data application, and meanwhile, data support is provided for early-stage data screening and the like.

Further, the data feature processing of the partial data set specifically includes: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.

In this embodiment, continuous data is volatile and subject to instantaneous changes, which strongly affects model training and accuracy, so a filtering algorithm, such as a Kalman filtering algorithm, needs to be adopted to smooth the data. Different evaluation indexes often have different dimensions and units, which affects the result of data analysis; to eliminate the influence of dimension among the indexes, the data is generally normalized, although not all models require normalization.

Further, the denoising process comprises a smoothing process and parameter threshold filtering, and the smoothing process adopts a filtering algorithm.

In the embodiment, through various denoising and the like, the volatility and the instantaneity of the method can be reduced, so that the influence on the precision in the later operation is reduced, and the whole precision is improved.

Specifically, the model is constructed by repeatedly drawing b training sample sets from the data sets of the plurality of types, constructing b regression trees from the b training sample sets, taking the cases not drawn each time to form b sets of out-of-bag data as test sample sets, constructing a random forest, and obtaining a prediction result by using the test sample sets.

In this embodiment, the training sample sets and the test sample sets with the same number are selected, so that the test accuracy can be improved to obtain the optimal prediction structure.

Example 5

In this embodiment, specific applications are mainly described.

Referring to fig. 1-4, the method includes the following steps:

Step 1: data preparation

The data sets mainly required for constructing the model comprise 3 types, namely network equipment flow data, host flow data and network design parameter information. The network design parameter information mainly includes information such as a local area network topology structure, a maximum load or carrying capacity of network flow of each link, and a load for ensuring the safety and stability of a network environment. The network equipment and the host flow data are time sequence data, and the data granularity is in the level of minutes.

Step 2: data processing

The main work of the data processing stage comprises index filtering, data preprocessing and threshold solving. Index filtering screens the indexes by correlation analysis and mechanism knowledge according to the collected information. Data preprocessing identifies and processes abnormal values and missing values in the data set; commonly used methods for outlier identification include the 3-sigma criterion and the quantile method (boxplot). Threshold solving explores the distribution of the key index values, thereby providing data support for data screening.
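The two outlier identification methods named above can be sketched with the Python standard library. The function names, the boxplot multiplier `k=1.5` and the sample readings are assumptions for illustration; the invention does not specify these values.

```python
import statistics

def three_sigma_outliers(values):
    """Flag values more than 3 standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > 3 * std]

def boxplot_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# 19 ordinary readings between 100 and 104, plus one abnormal reading of 1000.
samples = [100 + (i % 5) for i in range(19)] + [1000]
by_sigma = three_sigma_outliers(samples)
by_boxplot = boxplot_outliers(samples)
```

The 3-sigma criterion is itself sensitive to extreme values, because they inflate the standard deviation; the quantile method is generally more robust on small or heavily contaminated samples.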

Step 3: feature engineering

The characteristic engineering process mainly comprises 3 parts of contents: feature calculation, data filtering and data normalization processing. The network flow data is time sequence data, and the total flow demand in the whole network has continuity, so that the time lag mode is adopted for expression in the characteristic calculation process; the main process operations of data filtering are flow data smoothing processing and parameter threshold filtering. Particularly, network traffic has volatility and instantaneity, so that the influence on model training and accuracy is large, and therefore, a filtering algorithm, such as a kalman filtering algorithm, needs to be adopted to perform smoothing processing on the network traffic. Different evaluation indexes often have different dimensions and dimension units, which affect the result of data analysis, and in order to eliminate the dimension influence among the indexes, the data is generally normalized, but not all models need to be normalized.
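The time-lag expression and the normalization described above can be sketched as follows. The function names, the choice of two lags and the use of min-max scaling are assumptions made for the example; the invention does not state which normalization method is used.

```python
def make_lag_features(series, n_lags):
    """Return (X, y): each row of X holds the n_lags values that precede y."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])   # traffic at t-n_lags ... t-1
        y.append(series[t])              # traffic at time t (the target)
    return X, y

def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

series = [10.0, 20.0, 30.0, 40.0, 50.0]   # made-up per-minute traffic values
X, y = make_lag_features(series, n_lags=2)
scaled = min_max_normalize(series)
```

With two lags, the five readings yield three training rows: `[10, 20] -> 30`, `[20, 30] -> 40` and `[30, 40] -> 50`. Tree-based models such as random forests are insensitive to monotonic rescaling of the inputs, which is consistent with the remark that not all models need normalization.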

Step 4: model construction

For the flow prediction, a random forest algorithm is selected in this project. Like a traditional regression model, random forest regression can explain the influence of a plurality of independent variables on a dependent variable. Suppose that the dependent variable Y has n observations, i.e., n cases, and that k independent variables influence the observed values. When constructing each regression tree, the random forest randomly draws part of the observations of the dependent variable Y by the Bootstrap resampling method, and randomly selects a specified number of variables from the k independent variables to determine the nodes of the tree. Because of this randomness, the regression tree obtained may differ for each construction. On this basis, a random forest typically generates hundreds or even thousands of trees. The regression trees form a combined model $\{h(X, \theta_j),\ j = 1, 2, \ldots, b\}$, where $\theta_j$ denotes the random parameters of the j-th tree, and the predicted value of the random forest regression model is the average of the b regression trees: $\bar{h}(X) = \frac{1}{b}\sum_{j=1}^{b} h(X, \theta_j)$.

The model satisfies the following condition: the training sets forming the random forest are independent of each other, so the mean square error of the predictor $h(X)$ is $E_{X,Y}\left(Y - h(X)\right)^2$.

The flow of the random forest regression algorithm is as follows:

(1) From the n cases of the original data set, b training sample sets are drawn with replacement by the Bootstrap method, and b regression trees are constructed accordingly. Each time a training sample set is drawn, the cases that were not drawn form the out-of-bag (OOB) data; the resulting b OOB sets serve as the test sample sets.

(2) When a regression tree is constructed, $m_{try}$ independent variables ($m_{try} < k$) are randomly selected from the k independent variables at each branch node of the tree as candidate branching variables, and the optimal branch among them is then determined according to the branching goodness criterion.

(3) Each regression tree recursively branches from top to bottom and grows continuously, and the number n of the trees is set as a termination condition for the growth of the regression tree;

(4) The generated b regression trees form the random forest regression model. The estimation effect of the model is evaluated by the accuracy of its prediction on the out-of-bag (OOB) data, i.e., it is measured by the mean square error of the test set. If the number of samples of the out-of-bag data is m, the mean square error and the goodness of fit of the random forest regression model are:

$$MSE_{OOB} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2, \qquad R_{RF}^2 = 1 - \frac{MSE_{OOB}}{\hat{\sigma}_y^2}$$

where $y_i$ represents the true value of the dependent variable in the OOB data, $\hat{y}_i$ represents the predicted value obtained with the random forest regression model, and $\hat{\sigma}_y^2$ represents the variance of the OOB prediction.
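The flow of the random forest regression algorithm can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation of the invention: the function names (`build_tree`, `fit_forest`, `predict_forest`), the toy data, the use of the sum of squared errors as the branching goodness criterion, and the minimum leaf size `min_leaf` as the stopping rule are all assumptions made for the example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)

def build_tree(X, y, m_try, min_leaf, rng):
    """Grow one regression tree; each node tries only m_try random features."""
    if len(y) < 2 * min_leaf or len(set(y)) == 1:
        return mean(y)                                   # leaf: predict the mean
    best = None
    for f in rng.sample(range(len(X[0])), m_try):        # candidate branching variables
        for threshold in sorted(set(row[f] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= threshold]
            right = [i for i, row in enumerate(X) if row[f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:
        return mean(y)
    _, f, threshold, left, right = best
    return (f, threshold,
            build_tree([X[i] for i in left], [y[i] for i in left], m_try, min_leaf, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], m_try, min_leaf, rng))

def predict_tree(node, row):
    while isinstance(node, tuple):                       # descend until a leaf value
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def fit_forest(X, y, b=30, m_try=1, min_leaf=2, seed=0):
    """Fit b bootstrap trees and return them with the out-of-bag MSE."""
    rng = random.Random(seed)
    n = len(y)
    trees, oob_preds = [], [[] for _ in range(n)]
    for _ in range(b):
        bag = [rng.randrange(n) for _ in range(n)]       # bootstrap sample
        tree = build_tree([X[i] for i in bag], [y[i] for i in bag], m_try, min_leaf, rng)
        trees.append(tree)
        for i in set(range(n)) - set(bag):               # cases not drawn: OOB
            oob_preds[i].append(predict_tree(tree, X[i]))
    scored = [(y[i], mean(p)) for i, p in enumerate(oob_preds) if p]
    oob_mse = mean([(t - p) ** 2 for t, p in scored])
    return trees, oob_mse

def predict_forest(trees, row):
    return mean([predict_tree(t, row) for t in trees])   # average over the trees

# Toy data (assumption): two features, target y = 3*x1 + x2.
X = [[float(i % 8), float(i // 8)] for i in range(40)]
y = [3.0 * a + c for a, c in X]
trees, oob_mse = fit_forest(X, y)
prediction = predict_forest(trees, [4.0, 2.0])
```

Each case is scored only by the trees whose bootstrap sample did not contain it, so `oob_mse` estimates the prediction error without a separate test set.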

Step 5: model evaluation and persistence

Referring to the evaluation standard of the regression prediction model in the industry and academic community, the evaluation indexes aiming at the regression prediction model comprise:

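(1) Mean Square Error (MSE):

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$

(2) Root Mean Square Error (RMSE):

$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$

(3) Mean Absolute Error (MAE):

$$MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$

where m is the number of samples being evaluated, $y_i$ is the true value and $\hat{y}_i$ is the predicted value.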
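The three evaluation indexes can be computed as in the following sketch, which uses their standard definitions. The function names and the sample values are chosen only for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of the absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]    # errors: 1, 0, -2, 0
scores = {"MSE": mse(y_true, y_pred),
          "RMSE": rmse(y_true, y_pred),
          "MAE": mae(y_true, y_pred)}
```

For these values the MSE is 1.25 and the MAE is 0.75; the MSE penalizes the single error of 2 more heavily than the MAE does.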

In the model training process, the data set is divided into a training set and a verification set at a ratio of 8:2. The training set is used to train the model, and the verification set is used to test the model effect after training is finished. The model is evaluated with the above evaluation indexes and, if every evaluation index meets the requirement, the model is stored persistently for subsequent application.
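The 8:2 split, evaluation and conditional persistence can be sketched as follows. This is a sketch under stated assumptions: a trivial mean predictor stands in for the trained random forest, `pickle` is used as one possible persistence mechanism, and the acceptance limits on RMSE and MAE are hypothetical values; the invention does not specify any of these.

```python
import math
import os
import pickle
import random
import tempfile

def split_8_2(X, y, seed=0):
    """Shuffle the cases and split them 8:2 into training and verification sets."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * 8 // 10
    train, valid = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in valid], [y[i] for i in valid])

def fit_mean_model(y_train):
    """Placeholder model (assumption): always predicts the training mean."""
    return {"mean": sum(y_train) / len(y_train)}

def predict(model, X):
    return [model["mean"] for _ in X]

def evaluate(model, X, y):
    errors = [t - p for t, p in zip(y, predict(model, X))]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}

def persist_if_acceptable(model, metrics, limits, path):
    """Pickle the model only if every limited index is within its limit."""
    if all(metrics[name] <= limit for name, limit in limits.items()):
        with open(path, "wb") as fh:
            pickle.dump(model, fh)
        return True
    return False

X = [[float(i)] for i in range(50)]            # made-up inputs
y = [100.0 + (i % 3) for i in range(50)]       # made-up traffic values
X_tr, y_tr, X_va, y_va = split_8_2(X, y)
model = fit_mean_model(y_tr)
metrics = evaluate(model, X_va, y_va)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "traffic_model.pkl")
    saved = persist_if_acceptable(model, metrics, {"RMSE": 5.0, "MAE": 5.0}, path)
    restored = None
    if saved:
        with open(path, "rb") as fh:           # reload as the input-output module would
            restored = pickle.load(fh)
```

For time sequence data, a chronological split (the earliest 80% for training, the latest 20% for verification) may be preferable to a shuffled one, since shuffling lets the model see cases that are later in time than those it is verified on.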

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The network equipment flow prediction system based on the random forest algorithm is characterized by comprising a data acquisition module, a data processing module, a model construction module, a model evaluation module and an input-output module, wherein the data acquisition module is used for acquiring relevant data associated with network equipment in operation;

2. The system of claim 1, further comprising an updating module that, when the persisted model is evaluated to be non-compliant, continues to be evaluated by the model evaluation module to evaluate a new compliant model and replaces the non-compliant old model with the new model.

3. The system for predicting network equipment traffic based on the random forest algorithm according to claim 1, wherein the method for predicting the network equipment traffic based on the random forest algorithm comprises the following steps:

4. The system of claim 3, wherein the data includes at least 3 types, which are network device traffic data, host traffic data, and network design parameter information, the network device traffic data and the host traffic data are time sequence data sorted in time, and the data granularity of the time sequence data is in the order of minutes.

5. The system of claim 4, wherein in the data collection step, the network design parameter information includes a topology structure of a local area network, a maximum load and/or a bearing capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.

6. The system for predicting network equipment traffic based on the random forest algorithm according to claim 3, wherein the data processing specifically comprises index filtering, data preprocessing and threshold solving, so as to obtain the distribution of the evaluation indexes in each type of data set.

7. The system for predicting the network equipment flow based on the random forest algorithm according to claim 6, wherein the index filtering is to screen out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.

8. The system for predicting network device traffic based on the random forest algorithm according to claim 4, wherein the data feature processing of the partial data set specifically comprises: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.

9. The system of claim 8, wherein the denoising process comprises a smoothing process and a parameter threshold filtering, and the smoothing process adopts a filtering algorithm.

10. The system as claimed in claim 4, wherein the model is constructed by repeatedly extracting b training sample sets from a plurality of classes of data sets, constructing b regression trees based on the b training sample sets, selecting non-extracted cases to form b out-of-bag data as test sample sets, constructing a random forest, and obtaining a prediction result by using the test sample sets.