CN113726558A - Network equipment flow prediction system based on random forest algorithm - Google Patents
- Publication number
- CN113726558A (application number CN202110910221.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- random forest
- evaluation
- network equipment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/147—Network analysis or design for predicting network behaviour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/145—Network analysis or design involving simulating, designing, planning or modelling of a network
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a network equipment flow prediction system based on a random forest algorithm, comprising: a data acquisition module, which acquires relevant data associated with network equipment in operation; a data processing module, which processes the collected data to obtain the relevant evaluation indexes and their distribution; a model construction module, which constructs a random forest regression model through the random forest algorithm; a model evaluation module, which evaluates the random forest regression model according to the evaluation standard of the regression prediction model, using the collected data set and the evaluation indexes and evaluation index distribution obtained by the data processing module, and persistently stores the model that meets the requirements after evaluation; and an input-output module, which performs calculation using the relevant input data and the persistently stored model and outputs a predicted value of the network equipment flow.
Description
Technical Field
The invention belongs to the technical field of network equipment flow prediction, and particularly relates to a network equipment flow prediction system based on a random forest algorithm.
Background
When a network service is needed, it is difficult to judge the total amount of flow that will be required. Too much flow causes waste, while too little affects normal office work and everyday use. A method and a device capable of predicting the flow of equipment in a network are therefore needed, so that the approximate total flow can be predicted and logistics personnel and others can make a preliminary judgment when purchasing flow.
Disclosure of Invention
The invention aims to provide a network equipment flow prediction system based on a random forest algorithm that can roughly predict the total flow in a later period from existing data. On the one hand, this is convenient for logistics and user management and allows the flow to be updated in time; on the other hand, it supports rough budgeting, lets large systems and companies that need network monitoring learn the approximate flow in time, and retains data for later statistics.
In order to achieve the technical effects, the invention is realized by the following technical scheme.
The network equipment flow prediction system based on the random forest algorithm comprises:
a data acquisition module: acquiring relevant data associated with network equipment in operation;
a data processing module: processing the collected data to obtain the relevant evaluation indexes and their distribution;
a model construction module: constructing a random forest regression model through the random forest algorithm;
a model evaluation module: evaluating the random forest regression model according to the evaluation standard of the regression prediction model, using the collected data set and the evaluation indexes and evaluation index distribution obtained by the data processing module, and persistently storing the model that meets the requirements after evaluation;
an input-output module: performing calculation using the relevant input data and the persistently stored model, and outputting a predicted value of the network equipment flow.
In the technical scheme, a random forest algorithm is selected. Like a traditional regression model, random forest regression can explain the influence of several independent variables on a dependent variable. Suppose that the dependent variable Y has n observations (n cases) and that k independent variables influence it. When constructing each regression tree, the random forest draws a random sample of the observations of Y using the Bootstrap resampling method, and randomly selects a specified number of the k independent variables to determine the tree nodes. Because of this randomness, each constructed regression tree may be different. On this basis, a random forest typically generates hundreds or even thousands of trees at random, and the result that recurs most often among the trees is taken as the final result (for regression, the outputs of the trees are usually averaged instead).
In the technical scheme, the module units are set up as required; according to their functions and their relationships with the other modules, they carry out model construction and the subsequent data input to obtain the prediction.
In the technical scheme, the model is evaluated using the evaluation indexes; if it meets the requirements, it is persistently stored for subsequent use.
As a further improvement of the invention, the system also comprises an updating module: when the persistently stored model no longer meets the standard, the model evaluation module continues to evaluate models until a new model that meets the standard is found, and the old model that no longer meets the standard is replaced by the new model.
In the technical scheme, when the old model is retired, the updating module performs the corresponding update and replacement, so that the system remains continuous, durable and up to date.
As a further improvement of the invention, the invention also comprises a prediction method of the network equipment flow prediction system based on the random forest algorithm, which comprises the following steps:
data collection: according to network flow data associated with the operation of the network equipment, acquiring related data associated with the network equipment in the operation of the network equipment, and classifying the related data to obtain a data set of a plurality of types of data;
data processing: filtering, processing and analyzing each type of data set to obtain the distribution condition of the evaluation index in each type of data set;
the data processing also includes data feature processing of some of the data sets: according to the type of the screened data, noise is removed while the curve features of the data set are preserved, and the evaluation index ratio of the data is obtained by normalization;
constructing a model: selecting the relevant influence factors of the various data with a random forest algorithm, constructing the model from the training set in combination with the core index values and the evaluation index ratio, and testing it with the test set;
model evaluation, persistence and prediction: dividing the data set into a training set and a test set, training the model with the training set, validating it with the validation set, and evaluating it according to the evaluation indexes; if the evaluation indexes meet the requirements, the model is persisted and used to predict the flow of the network equipment.
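The evaluate-then-persist step above can be sketched as follows. The patent does not name its evaluation indexes or a storage format, so the use of RMSE and R² as metrics, the `min_r2` threshold, `pickle` as the persistence mechanism, and the function names are all assumptions made for illustration.

```python
import math
import pickle


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted flow."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination (assumes y_true is not constant)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate_and_persist(model, y_true, y_pred, path, min_r2=0.8):
    """Store the model on disk only if it meets the (assumed) R^2 requirement.

    Returns (score, saved) so the caller can decide what to do next.
    """
    score = r2_score(y_true, y_pred)
    if score >= min_r2:
        with open(path, "wb") as fh:
            pickle.dump(model, fh)  # hypothetical persistence format
        return score, True
    return score, False
```

A model that fails the threshold is simply not written, which leaves any previously stored model untouched.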
In the technical scheme, a model is constructed by utilizing multiple types of data, so that the influence of multiple factors on the flow can be fully considered, and the prediction accuracy is improved; and the post-processing of the data set, the association with the evaluation index and the like provide a relatively accurate basis for the subsequent further evaluation.
The random forest in the technical scheme is a classifier that is built in a random manner and comprises a plurality of decision trees. Its output class is the mode of the classes output by the individual trees.
Randomness is mainly reflected in two aspects: (1) when each tree is trained, a data set of size N is drawn with replacement from all N training samples (bootstrap sampling); (2) at each node, a random subset of all the features is selected for computing the optimal split.
Its advantages are as follows:
1. It performs well and has a clear advantage over many other algorithms on many current data sets.
2. It can process very high-dimensional data (many features) without requiring feature selection.
3. After training, it can report which features are important.
4. When a random forest is built, an unbiased estimate of the generalization error is used, and the model generalizes well.
5. Training is fast and easy to parallelize.
6. During training, interactions between features can be detected.
7. It is relatively simple to implement.
8. For unbalanced data sets, it can balance the error.
9. Accuracy can still be maintained if a significant portion of the features are missing.
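The two sources of randomness described before this list (bootstrap sampling of N rows and a random feature subset at each node) can be illustrated with a minimal standard-library sketch. The function names and the subset size `m` are hypothetical; the patent does not specify how many features are drawn per node.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; also return the rows never drawn."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def random_feature_subset(k, m, rng):
    """Pick m distinct feature indices out of k for evaluating one split."""
    return sorted(rng.sample(range(k), m))


rng = random.Random(0)            # fixed seed so the sketch is repeatable
in_bag, oob = bootstrap_indices(10, rng)
features = random_feature_subset(5, 2, rng)
```

Rows that are never drawn form the out-of-bag set for that tree and can later serve as test cases for it.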
In the technical scheme, the later stage also involves model evaluation and persistence, so whether the model is qualified can be monitored; a qualified model can then be reused and verified in subsequent runs, which improves accuracy.
As a further improvement of the present invention, the data at least includes 3 types, which are respectively network device traffic data, host traffic data, and network design parameter information, the traffic data of the network device and the traffic data of the host are time sequence data sorted by time, and the data granularity of the time sequence data is in the minute level.
In the technical scheme, three different types of data are used to collect the parameters related to network equipment flow. This takes into account the equipment itself as well as interference from factors such as the network, so that the factors considered are more comprehensive and diverse, which improves later accuracy.
As a further improvement of the present invention, in the data collecting step, the network design parameter information includes a local area network topology structure, a maximum load and/or a carrying capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.
In the technical scheme, the network design parameters take the various topological structures fully into account and also incorporate factors such as compliance, bearing capacity, safety and stability, so that the system considers these factors as a whole and the network design parameters are more comprehensive.
As a further improvement of the invention, the data processing specifically comprises index filtering, data preprocessing and threshold solving to obtain the distribution condition of the evaluation index in each type of data set.
In the technical scheme, during data processing, filtering first performs preliminary noise reduction and removal of impurities; data preprocessing then screens and otherwise processes the data as required; finally, the denoised data is analyzed to find the core evaluation indexes, which reduces subsequent calculation and simplifies the data.
As a further improvement of the invention, the index filtering is to screen out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.
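As one possible reading of index filtering by correlation analysis, the sketch below keeps only the candidate indexes whose Pearson correlation with the target flow series is strong enough. The choice of Pearson correlation, the `min_abs_corr` threshold and the index names are assumptions; the mechanism-knowledge part of the screening is not modelled here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no linear information
    return cov / (sx * sy)


def filter_indexes(candidates, target, min_abs_corr=0.5):
    """Return the names of candidate indexes correlated with the target."""
    return [
        name
        for name, values in candidates.items()
        if abs(pearson(values, target)) >= min_abs_corr
    ]
```

For example, `filter_indexes({"link_util": [2, 4, 6, 8, 10], "const": [3] * 5}, [1, 2, 3, 4, 5])` keeps only `link_util`, because the constant series has zero correlation with the target.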
In the technical scheme, screening, identification and repair, and analysis give the data a complete processing pipeline from beginning to end, so that later use of the data involves little noise and little calculation, while also providing data support for the earlier data screening.
As a further improvement of the present invention, the data feature processing of the partial data set specifically includes: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.
In the technical scheme, continuous data is volatile and transient, which strongly affects model training and accuracy, so a filtering algorithm such as the Kalman filter is needed for smoothing. Different evaluation indexes often have different dimensions and units, which affects the results of data analysis; to eliminate the influence of dimension among the indexes, the data is generally normalized, although not all models require normalization.
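To make the smoothing step concrete, here is a minimal scalar Kalman filter that treats the underlying flow level as a slowly drifting constant (a random-walk state model). The state model, the variances `process_var` and `measurement_var`, and the initial error variance are assumptions chosen for illustration; the patent only names the Kalman filter as one possible filtering algorithm.

```python
def kalman_filter_1d(series, process_var=1e-2, measurement_var=1.0):
    """Forward Kalman filter for a scalar random-walk state.

    series: noisy flow measurements ordered by time.
    Returns the filtered estimate at each time step.
    """
    if not series:
        return []
    estimate = float(series[0])   # start from the first measurement
    error_var = 1.0               # assumed initial uncertainty
    filtered = [estimate]
    for z in series[1:]:
        # Predict: the state is assumed unchanged, uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement by the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        filtered.append(estimate)
    return filtered
```

A larger `measurement_var` relative to `process_var` makes the filter trust individual measurements less, so short spikes are damped more strongly.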
As a further improvement of the invention, the denoising process comprises a smoothing process and parameter threshold filtering, and the smoothing process adopts a filtering algorithm.
In the technical scheme, the various denoising steps reduce the volatility and transience of the data, which reduces their effect on precision in later operations and improves overall precision.
As a further improvement of the present invention, the model is constructed as follows: b training sample sets are drawn with replacement from the data sets, and b regression trees are built from them; the cases not drawn in each round form b out-of-bag data sets that serve as test sample sets; the random forest is then constructed, and a prediction result is obtained using the test sample sets.
In the technical scheme, because equal numbers of training sample sets and test sample sets are selected, test accuracy can be improved and an optimal prediction result obtained.
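The construction just described (b bootstrap training sets, b trees, and b out-of-bag sets used for testing) can be sketched with the standard library alone. To stay short, each "tree" here is a single-split regression stump rather than a full regression tree, and the forest averages the stump outputs; both simplifications, along with every function name, are assumptions and not the patent's implementation.

```python
import random


def fit_stump(X, y, feature_ids):
    """Best single split (lowest squared error) over the given features."""
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # no split possible: predict the sample mean
        mean = sum(y) / len(y)
        return (feature_ids[0], float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, thr, left_mean, right_mean = stump
    return left_mean if row[f] <= thr else right_mean


def fit_forest(X, y, b, m, seed=0):
    """Build b stumps, each on a bootstrap sample with m random features."""
    rng = random.Random(seed)
    n, k = len(X), len(X[0])
    forest = []
    for _ in range(b):
        bag = [rng.randrange(n) for _ in range(n)]
        chosen = set(bag)
        oob = {i for i in range(n) if i not in chosen}  # out-of-bag cases
        feats = rng.sample(range(k), m)
        stump = fit_stump([X[i] for i in bag], [y[i] for i in bag], feats)
        forest.append((stump, oob))
    return forest


def predict(forest, row):
    """Regression forest output: average of the individual predictions."""
    return sum(predict_stump(s, row) for s, _ in forest) / len(forest)


def oob_mse(forest, X, y):
    """Mean squared error using, for each case, only stumps that never saw it."""
    errors = []
    for i, (row, target) in enumerate(zip(X, y)):
        preds = [predict_stump(s, row) for s, oob in forest if i in oob]
        if preds:
            errors.append((sum(preds) / len(preds) - target) ** 2)
    return sum(errors) / len(errors) if errors else float("nan")
```

Because every case is scored only by trees whose bootstrap sample excluded it, the out-of-bag error acts as a test-set estimate without holding out a separate data set.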
Drawings
FIG. 1 is a flowchart of a method for predicting network device traffic according to the present invention;
FIG. 2 is a flow chart of a random forest regression method provided by the present invention;
FIG. 3 is a flow chart of network device traffic prediction according to the present invention;
FIG. 4 is a schematic circuit diagram of traffic prediction of a network device according to the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
Example 1
In this embodiment, the main modules of a network device traffic prediction system based on a random forest algorithm are introduced.
Referring to FIG. 4, the invention further discloses a network device traffic prediction system based on a random forest algorithm, which comprises:
a data acquisition module: acquiring relevant data associated with the network device in operation;
a data processing module: processing the collected data to obtain the relevant evaluation indexes and their distribution;
a model construction module: constructing a random forest regression model through the random forest algorithm;
a model evaluation module: evaluating the random forest regression model according to the evaluation standard of the regression prediction model, using the collected data set and the evaluation indexes and evaluation index distribution obtained by the data processing module, and persistently storing the model that meets the requirements after evaluation;
an input-output module: performing calculation using the relevant input data and the persistently stored model, and outputting a predicted value of the network device traffic.
In this embodiment, corresponding module units are set as required, and the module units implement model construction and subsequent data input according to functions and relevance with other modules, so as to obtain a predicted effect.
In this embodiment, a random forest algorithm is selected. Like a traditional regression model, random forest regression can explain the influence of several independent variables on a dependent variable. Suppose that the dependent variable Y has n observations (n cases) and that k independent variables influence it. When constructing each regression tree, the random forest draws a random sample of the observations of Y using the Bootstrap resampling method, and randomly selects a specified number of the k independent variables to determine the tree nodes. Because of this randomness, each constructed regression tree may be different. On this basis, a random forest typically generates hundreds or even thousands of trees at random, and the result that recurs most often among the trees is taken as the final result (for regression, the outputs of the trees are usually averaged instead).
Specifically, the system further comprises an updating module: when the persisted model no longer meets the standard, the model evaluation module continues to evaluate models until a new model that meets the standard is found, and the old model that no longer meets the standard is replaced by the new model.
In this embodiment, when an old model is retired, the updating module performs the corresponding update and replacement, so that the system remains persistent, durable and up to date.
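The replacement rule of the updating module can be sketched as a small selection function. The `score` callable, the `min_score` threshold and the policy of choosing the best-scoring qualified candidate are hypothetical; the patent only states that a model failing the standard is replaced by a new one that meets it.

```python
def select_model(current, candidates, score, min_score):
    """Decide which model should stay in persistent storage.

    current:    the stored model, or None if nothing is stored yet.
    candidates: newly trained models to consider as replacements.
    score:      callable returning an evaluation score (higher is better).
    """
    if current is not None and score(current) >= min_score:
        return current                      # stored model still qualifies
    qualified = [c for c in candidates if score(c) >= min_score]
    if qualified:
        return max(qualified, key=score)    # replace with the best new model
    return current                          # nothing better yet: keep evaluating
```

Returning the unchanged `current` model when no candidate qualifies mirrors the text: evaluation simply continues until a compliant model appears.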
Example 2
In this embodiment, a flow of a prediction method of a network device traffic prediction system based on a random forest algorithm is mainly described.
Referring to FIGS. 1-4, the method includes the following steps:
data collection: according to network flow data associated with the operation of the network equipment, acquiring related data associated with the network equipment in the operation of the network equipment, and classifying the related data to obtain a data set of a plurality of types of data;
data processing: filtering, processing and analyzing each type of data set to obtain the distribution condition of the evaluation index in each type of data set;
the data processing also includes data feature processing of some of the data sets: according to the type of the screened data, noise is removed while the curve features of the data set are preserved, and the evaluation index ratio of the data is obtained by normalization;
constructing a model: selecting the relevant influence factors of the various data with a random forest algorithm, constructing the model from the training set in combination with the core index values and the evaluation index ratio, and testing it with the test set;
model evaluation, persistence and prediction: dividing the data set into a training set and a test set, training the model with the training set, validating it with the validation set, and evaluating it according to the evaluation indexes; if the evaluation indexes meet the requirements, the model is persisted and used to predict the flow of the network equipment.
In the embodiment, a model is constructed by utilizing multiple types of data, so that the influence of multiple factors on the flow can be fully considered, and the prediction accuracy is improved; and the post-processing of the data set, the association with the evaluation index and the like provide a relatively accurate basis for the subsequent further evaluation.
The random forest in this embodiment is a classifier that is created in a random manner and includes a plurality of decision trees. The class of its output is determined by the mode of the class output by the respective tree.
Randomness is mainly reflected in two aspects: (1) when each tree is trained, a data set of size N is drawn with replacement from all N training samples (bootstrap sampling); (2) at each node, a random subset of all the features is selected for computing the optimal split.
Its advantages are as follows:
1. It performs well and has a clear advantage over many other algorithms on many current data sets.
2. It can process very high-dimensional data (many features) without requiring feature selection.
3. After training, it can report which features are important.
4. When a random forest is built, an unbiased estimate of the generalization error is used, and the model generalizes well.
5. Training is fast and easy to parallelize.
6. During training, interactions between features can be detected.
7. It is relatively simple to implement.
8. For unbalanced data sets, it can balance the error.
9. Accuracy can still be maintained if a significant portion of the features are missing.
In this embodiment, the later stage also involves model evaluation and persistence, so whether the model is qualified can be monitored; a qualified model can then be reused and verified in subsequent runs, which improves accuracy.
Example 3
In this embodiment, related data related to the operation of the network device is mainly introduced.
Specifically, the data at least includes 3 types, which are respectively network device traffic data, host traffic data, and network design parameter information, where the traffic data of the network device and the traffic data of the host are time sequence data sorted by time, and the data granularity of the time sequence data is in the minute level.
In this embodiment, three different types of data are used to collect the parameters related to network device traffic. This takes into account the device itself as well as interference from factors such as the network, so that the factors considered are more comprehensive and diverse, which improves later accuracy.
Further, in the data collection step, the network design parameter information includes a local area network topology structure, a maximum load and/or a bearing capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.
In this embodiment, the network design parameters fully consider various topological structures, and also integrate the factors such as compliance, bearing capacity, safety and stability, so that the whole system is more comprehensive in consideration of the factors, and the comprehensiveness of the network design parameters is improved.
Example 4
In this embodiment, other key steps are introduced.
Referring to the attached drawings, the data processing specifically includes index filtering, data preprocessing and threshold solving to obtain distribution conditions of the evaluation indexes in each type of data set.
In this embodiment, during data processing, filtering first performs preliminary noise reduction and removal of impurities; data preprocessing then screens and otherwise processes the data as required; finally, the denoised data is analyzed to find the core evaluation indexes, which reduces subsequent calculation and simplifies the data.
Further, the index filtering specifically comprises screening out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.
In this embodiment, screening, identification, repair, processing and analysis give the data a complete processing pipeline from beginning to end, so that later use of the data involves little noise and little calculation, while also providing data support for the earlier data screening.
Further, the data feature processing of the partial data set specifically includes: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.
In this embodiment, continuous data is volatile and transient, which strongly affects model training and accuracy, so a filtering algorithm such as the Kalman filter is needed to smooth the data. Different evaluation indexes often have different dimensions and units, which affects the results of data analysis; to eliminate the influence of dimension among the indexes, the data is generally normalized, although not all models require normalization.
Further, the denoising process comprises a smoothing process and parameter threshold filtering, and the smoothing process adopts a filtering algorithm.
In the embodiment, through various denoising and the like, the volatility and the instantaneity of the method can be reduced, so that the influence on the precision in the later operation is reduced, and the whole precision is improved.
Specifically, the model is constructed by repeatedly drawing b training sample sets from the several types of data sets, constructing b regression trees from the b training sample sets, taking the cases that were not drawn as b sets of out-of-bag data that serve as test sample sets, constructing a random forest, and obtaining a prediction result by using the test sample sets.
In this embodiment, the same number of training sample sets and test sample sets is selected, so that the test accuracy can be improved and the best prediction result obtained.
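As a brief worked note (a standard property of sampling with replacement, added here for illustration and not stated in the source), the reason that undrawn cases are always available as a test sample set can be seen from the probability that a given case is never drawn when n cases are drawn with replacement from n cases:

```latex
P(\text{case } i \text{ is out of bag})
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow[\,n \to \infty\,]{}\; e^{-1} \approx 0.368
```

Under this assumption, each of the b draws leaves roughly 36.8% of the cases out of the bag, which is why b out-of-bag test sample sets accompany the b training sample sets.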
Example 5
This embodiment mainly describes a specific application.
Referring to fig. 1-4, the method includes the following steps:
Step 1: Data preparation
The data sets required for constructing the model fall into 3 types, namely network equipment flow data, host flow data and network design parameter information. The network design parameter information mainly includes the local area network topology, the maximum load or carrying capacity of network flow on each link, and the load required to keep the network environment safe and stable. The network equipment flow data and the host flow data are time-series data with minute-level granularity.
Step 2: data processing
The main work of the data processing stage comprises index filtering, data preprocessing and threshold solving. Index filtering screens the indexes by using correlation analysis and mechanism knowledge according to the collected information. Data preprocessing identifies and handles abnormal values and missing values in the data set; commonly used methods for outlier identification include the 3-sigma criterion and the quantile (box plot) method. Threshold solving explores the distribution of the key index values, thereby providing data support for data screening.
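The two outlier identification methods named above may be sketched as follows, purely by way of example. The function names, the whisker factor of 1.5 and the sample values are assumptions made for illustration; the embodiment does not prescribe them.

```python
import statistics
from typing import List


def three_sigma_outliers(values: List[float], k: float = 3.0) -> List[int]:
    """Return indexes of values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) > k * std]


def iqr_outliers(values: List[float], whisker: float = 1.5) -> List[int]:
    """Return indexes of values outside the box-plot whiskers (quantile method)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical minute-level traffic with one abnormal spike at the end.
traffic = [98.0, 102.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 101.0, 99.0,
           100.0, 102.0, 98.0, 101.0, 100.0, 99.0, 103.0, 97.0, 100.0, 1000.0]
sigma_flags = three_sigma_outliers(traffic)
iqr_flags = iqr_outliers(traffic)
```

The 3-sigma criterion can only flag a point when the sample is reasonably large, because a single extreme value also inflates the standard deviation; the quantile method is less sensitive to this effect.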
Step 3: Feature engineering
The feature engineering process mainly comprises 3 parts: feature calculation, data filtering and data normalization. Network flow data are time-series data, and the total flow demand across the whole network is continuous, so a time-lag representation is adopted in feature calculation. The main operations of data filtering are smoothing of the flow data and parameter threshold filtering. In particular, network traffic is volatile and transient, which strongly affects model training and accuracy, so a filtering algorithm, such as the Kalman filtering algorithm, needs to be adopted to smooth the network traffic. Different evaluation indexes often have different dimensions and units, which affects the result of data analysis; to eliminate the dimensional influence among the indexes, the data are generally normalized, although not all models require normalization.
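By way of a hedged example of the smoothing operation, a scalar Kalman filter with a random-walk state model may be sketched as below. The function name `kalman_smooth`, the process and measurement variances and the sample readings are hypothetical values chosen for illustration; the embodiment names the Kalman filtering algorithm only as one possible filtering algorithm and gives no parameters.

```python
from typing import List


def kalman_smooth(measurements: List[float],
                  process_var: float = 1e-2,
                  measurement_var: float = 1.0) -> List[float]:
    """Smooth a traffic series with a one-dimensional Kalman filter.

    The state is the underlying traffic level, assumed to drift slowly
    (random walk) while each reading carries measurement noise.
    """
    if not measurements:
        return []
    estimate = measurements[0]  # initial state taken from the first reading
    error_var = 1.0             # initial uncertainty of the estimate (assumed)
    smoothed = [estimate]
    for z in measurements[1:]:
        # Prediction step: the level is assumed unchanged, uncertainty grows.
        error_var += process_var
        # Update step: blend prediction and new reading using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed


# Hypothetical bursty readings.
raw = [100.0, 140.0, 90.0, 130.0, 95.0, 135.0]
smoothed = kalman_smooth(raw)
```

A larger `measurement_var` relative to `process_var` yields heavier smoothing; the balance between the two would have to be tuned on real traffic data.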
Step 4: Model construction
For the flow prediction, a random forest algorithm is selected in this project. Like a traditional regression model, random forest regression can explain the influence of several independent variables on the dependent variable. Suppose that the dependent variable Y has n observations, i.e., n cases, and that k independent variables influence the observed values. In the process of constructing a regression tree, the random forest randomly draws part of the observed values of the dependent variable Y by the Bootstrap resampling method, and randomly selects a specified number of variables from the k independent variables to determine the nodes of the tree. Because of this randomness, the regression tree may differ at each construction. On this basis, a random forest generally generates hundreds or even thousands of trees at random and combines their outputs into the final result. The regression trees form a combined model $\{h(X, \theta_j),\ j = 1, 2, \ldots, b\}$, and the predicted value of the random forest regression model is the average of the b regression trees $h(X, \theta_j)$.
The model satisfies the condition that the training sets forming the random forest are independent of each other, so the mean square error of the prediction vector $h(X)$ is $E_{X,Y}(Y - h(X))^2$.
The flow of the random forest regression algorithm is as follows:
(1) From the n cases of the original data set, b training sample sets are repeatedly drawn by the Bootstrap method, and b regression trees are constructed accordingly. Each time a training sample set is drawn, the cases that were not drawn form out-of-bag (OOB) data, giving b OOB sets that serve as test sample sets.
(2) When a regression tree is constructed, $m_{try}$ independent variables ($m_{try} < k$) are randomly selected from the k independent variables at each branch node of the tree as candidate branching variables, and the optimal branch among them is then determined according to the branching goodness criterion.
(3) Each regression tree recursively branches from top to bottom and grows continuously, and the number n of the trees is set as a termination condition for the growth of the regression tree;
(4) The b generated regression trees form the random forest regression model. The estimation performance of the model is evaluated by the prediction accuracy on the out-of-bag (OOB) data, that is, it is measured by the mean square error of the test set. If the number of samples of the out-of-bag data is m, then:
$$\mathrm{MSE}_{OOB} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2, \qquad R_{RF}^2 = 1 - \frac{\mathrm{MSE}_{OOB}}{\hat{\sigma}_y^2}$$
where $y_i$ represents the true value of the dependent variable in the OOB data, $\hat{y}_i$ represents the predicted value obtained with the random forest regression model, and $\hat{\sigma}_y^2$ represents the variance of the OOB prediction.
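The flow in steps (1) to (4) may be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of this embodiment: the branching goodness criterion is assumed to be the sum of squared errors, growth is assumed to stop at a minimum leaf size `min_leaf`, and all function names, parameter values and the synthetic data are hypothetical.

```python
import random
from statistics import fmean, pvariance


def sse(values):
    """Sum of squared deviations from the mean (assumed branching criterion)."""
    mu = fmean(values)
    return sum((v - mu) ** 2 for v in values)


def build_tree(X, y, mtry, min_leaf, rng):
    """Grow one regression tree; every node tries mtry randomly chosen variables."""
    if len(y) < 2 * min_leaf or len(set(y)) == 1:
        return fmean(y)  # leaf: mean of the targets reaching this node
    best = None
    for f in rng.sample(range(len(X[0])), mtry):
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= thr]
            right = [i for i, row in enumerate(X) if row[f] > thr]
            if min(len(left), len(right)) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    if best is None:
        return fmean(y)
    _, f, thr, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx], mtry, min_leaf, rng)

    return (f, thr, grow(left), grow(right))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def fit_forest(X, y, b, mtry, min_leaf=2, seed=0):
    """Draw b bootstrap samples, grow b trees, and remember each tree's OOB cases."""
    rng = random.Random(seed)
    n = len(y)
    forest = []
    for _ in range(b):
        bag = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        oob = set(range(n)) - set(bag)              # cases never drawn
        tree = build_tree([X[i] for i in bag], [y[i] for i in bag], mtry, min_leaf, rng)
        forest.append((tree, oob))
    return forest


def predict_forest(forest, row):
    """Forest prediction: average of the individual regression trees."""
    return fmean([predict_tree(tree, row) for tree, _ in forest])


def oob_mse(forest, X, y):
    """Mean square error using, for each case, only trees that did not train on it."""
    errors = []
    for i, row in enumerate(X):
        preds = [predict_tree(tree, row) for tree, oob in forest if i in oob]
        if preds:
            errors.append((y[i] - fmean(preds)) ** 2)
    return fmean(errors)


# Synthetic data: k = 3 independent variables, of which the third is irrelevant.
X = [[float(i), float(i % 4), float((i * 7) % 5)] for i in range(40)]
y = [3.0 * row[0] + row[1] for row in X]
forest = fit_forest(X, y, b=30, mtry=2)
mse = oob_mse(forest, X, y)
r2 = 1.0 - mse / pvariance(y)
```

Because every case is scored only by trees whose bootstrap sample excluded it, the OOB error behaves like a test-set error without requiring a separate hold-out set.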
Step 5: Model evaluation and persistence
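With reference to the evaluation standards for regression prediction models used in industry and academia, the evaluation indexes for the regression prediction model comprise the following, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value and $N$ is the number of evaluated samples:
(1) Mean Square Error (MSE): $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$
(2) Root Mean Square Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$
(3) Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$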
In the model training process, the data set is divided into a training set and a verification set in a ratio of 8:2. The training set is used for training the model, and the verification set is used for testing the model effect after training is finished. The model is evaluated with the above evaluation indexes and, if every evaluation index meets the requirement, it is stored persistently for subsequent application.
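The evaluation and persistence step may be illustrated by the following sketch. The chronological 8:2 split, the acceptance limits, the file name and the use of `pickle` for persistent storage are all assumptions made for the example; the embodiment specifies only the 8:2 ratio, the three evaluation indexes and that a model meeting the requirements is stored.

```python
import math
import os
import pickle
import tempfile


def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def split_8_2(rows):
    """Split rows 8:2, keeping time order (first 80% train, last 20% verification)."""
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


def persist_if_acceptable(model, scores, limits, path):
    """Store the model only if every evaluation index is within its limit."""
    if all(scores[name] <= limits[name] for name in limits):
        with open(path, "wb") as fh:
            pickle.dump(model, fh)
        return True
    return False


# Hypothetical verification-set values and acceptance limits.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
scores = {"mse": mse(y_true, y_pred), "rmse": rmse(y_true, y_pred), "mae": mae(y_true, y_pred)}
limits = {"mse": 2.0, "rmse": 1.5, "mae": 1.0}

train, valid = split_8_2(list(range(10)))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf_model.pkl")  # hypothetical file name
    saved = persist_if_acceptable({"trees": []}, scores, limits, path)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)
```

A stored model can later be reloaded by the input and output stage to compute predicted values without retraining.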
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (10)
1. A network equipment flow prediction system based on the random forest algorithm, characterized by comprising:
a data acquisition module: used for acquiring relevant data associated with the network equipment in operation;
a data processing module: according to the collected data, corresponding data processing is carried out, and then relevant evaluation index distribution and evaluation indexes are obtained;
a model construction module: constructing a random forest regression model through a random forest algorithm;
a model evaluation module: according to the evaluation standard of the regression prediction model, evaluating the random forest regression model by using the collected data set and the evaluation index distribution and evaluation index obtained in the data processing module, and performing persistent storage on the model meeting the requirements after evaluation;
an input-output module: and the input and output module calculates by using the relevant input data and a persistence stored model and outputs a predicted value of the network equipment flow.
2. The system of claim 1, further comprising an updating module, wherein when the persisted model is evaluated as not meeting the requirements, the model evaluation module continues the evaluation until a new model meeting the requirements is obtained, and the old model that does not meet the requirements is replaced with the new model.
3. The system for predicting network equipment traffic based on the random forest algorithm according to claim 1, wherein the method for predicting the network equipment traffic based on the random forest algorithm comprises the following steps:
data collection: according to network flow data associated with the operation of the network equipment, acquiring related data associated with the network equipment in the operation of the network equipment, and classifying the related data to obtain a data set of a plurality of types of data;
data processing: filtering, processing and analyzing each type of data set to obtain the distribution condition of the evaluation index in each type of data set;
data feature processing is further performed on part of the data sets among the several types: according to the type of the screened data, noise is removed while the curve features of the data set are preserved, and the evaluation index ratio of the data is obtained through normalization;
constructing a model: selecting relevant influence factors of various data by adopting a random forest algorithm, and carrying out model construction by combining a core index value and an evaluation index ratio by using a training set, and testing by using a testing set;
model evaluation, persistence and prediction: dividing the data set into a training set and a testing set, training the model by using the training set, verifying the model by using the verification set, evaluating the model according to the evaluation index, and if the evaluation index meets the requirement, predicting the flow of the network equipment according to the model.
4. The system of claim 3, wherein the data includes at least 3 types, which are network device traffic data, host traffic data, and network design parameter information, the network device traffic data and the host traffic data are time sequence data sorted in time, and the data granularity of the time sequence data is in the order of minutes.
5. The system of claim 4, wherein in the data collection step, the network design parameter information includes a topology structure of a local area network, a maximum load and/or a bearing capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.
6. The system for predicting network equipment traffic based on the random forest algorithm according to claim 3, wherein the data processing specifically comprises index filtering, data preprocessing and threshold solving so as to obtain each type of data set and evaluate the distribution condition of indexes.
7. The system for predicting the network equipment flow based on the random forest algorithm according to claim 6, wherein the index filtering is to screen out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.
8. The system for predicting network device traffic based on the random forest algorithm according to claim 4, wherein the data feature processing of the partial data set specifically comprises: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.
9. The system of claim 8, wherein the denoising process comprises a smoothing process and a parameter threshold filtering, and the smoothing process adopts a filtering algorithm.
10. The system as claimed in claim 4, wherein the model is constructed by repeatedly extracting b training sample sets from a plurality of classes of data sets, constructing b regression trees based on the b training sample sets, selecting non-extracted cases to form b out-of-bag data as test sample sets, constructing a random forest, and obtaining a prediction result by using the test sample sets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110910221.2A CN113726558A (en) | 2021-08-09 | 2021-08-09 | Network equipment flow prediction system based on random forest algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113726558A true CN113726558A (en) | 2021-11-30 |
Family
ID=78675229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110910221.2A Pending CN113726558A (en) | 2021-08-09 | 2021-08-09 | Network equipment flow prediction system based on random forest algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113726558A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||