CN113726558A - Network equipment flow prediction system based on random forest algorithm - Google Patents

Network equipment flow prediction system based on random forest algorithm

Info

Publication number
CN113726558A
Authority
CN
China
Prior art keywords
data
model
random forest
evaluation
network equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110910221.2A
Other languages
Chinese (zh)
Inventor
吴飞
李霆
吴树霖
沈立翔
肖传奇
孔美美
吴珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Fujian Electric Power Co Ltd
Original Assignee
State Grid Fujian Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Fujian Electric Power Co Ltd filed Critical State Grid Fujian Electric Power Co Ltd
Priority to CN202110910221.2A priority Critical patent/CN113726558A/en
Publication of CN113726558A publication Critical patent/CN113726558A/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a network equipment flow (traffic) prediction system based on a random forest algorithm, comprising: a data acquisition module, which acquires data associated with network equipment in operation; a data processing module, which processes the collected data to obtain the evaluation indexes and their distribution; a model construction module, which constructs a random forest regression model using a random forest algorithm; a model evaluation module, which evaluates the random forest regression model against the evaluation standard of the regression prediction model, using the collected data set and the evaluation indexes and evaluation index distribution obtained by the data processing module, and persistently stores a model that meets the requirements; and an input-output module, which computes with the relevant input data and the persistently stored model and outputs a predicted value of the network equipment traffic.

Description

Network equipment flow prediction system based on random forest algorithm
Technical Field
The invention belongs to the technical field of network equipment flow prediction, and particularly relates to a network equipment flow prediction system based on a random forest algorithm.
Background
When a network service is required, it is difficult to judge the total amount of traffic that will be needed. Buying too much is wasteful, while buying too little disrupts normal office work and everyday use. A method and device capable of predicting the traffic of equipment in a network are therefore needed, so that the approximate total traffic can be predicted and logistics personnel can make a preliminary judgment when purchasing traffic.
Disclosure of Invention
The invention aims to provide a network equipment traffic prediction system based on a random forest algorithm that can roughly predict the total future traffic from existing data. On the one hand, this makes logistics and user management easier and allows traffic allocations to be updated in time. On the other hand, it supports rough budgeting: large systems and companies that need network monitoring can learn the approximate traffic in time and retain the data for later statistics.
In order to achieve the technical effects, the invention is realized by the following technical scheme.
The network equipment traffic prediction system based on the random forest algorithm comprises:
a data acquisition module: acquiring data associated with the network equipment in operation;
a data processing module: processing the collected data to obtain the evaluation indexes and the evaluation index distribution;
a model construction module: constructing a random forest regression model through a random forest algorithm;
a model evaluation module: evaluating the random forest regression model against the evaluation standard of the regression prediction model, using the collected data set and the evaluation indexes and evaluation index distribution obtained by the data processing module, and persistently storing a model that meets the requirements after evaluation;
an input-output module: computing with the relevant input data and the persistently stored model, and outputting a predicted value of the network equipment traffic.
In the technical scheme, a random forest algorithm is selected. Like a traditional regression model, random forest regression can explain the influence of several independent variables on a dependent variable. Suppose that the dependent variable Y has n observations (n cases) and that k independent variables influence it. When constructing each regression tree, the random forest draws a sample of the observations of Y by Bootstrap resampling and, at each node, randomly selects a specified number of the k independent variables as candidates for the split. Because of this randomness, each constructed regression tree may be different. A random forest typically generates hundreds or even thousands of such trees. For classification, the class output by the most trees is taken as the final result; for regression, as used here, the predictions of the trees are averaged.
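As an illustration only, the two random draws described above (resampling the n cases and picking a subset of the k independent variables at a node) can be sketched as follows. This is a minimal sketch and not the implementation of the invention; the function names `bootstrap_indices` and `candidate_features`, the parameter name `m_try`, and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n case indices with replacement (one Bootstrap resample)."""
    return [rng.randrange(n) for _ in range(n)]


def candidate_features(k, m_try, rng):
    """Pick m_try distinct variable indices out of k for one split node."""
    if not 0 < m_try <= k:
        raise ValueError("m_try must be between 1 and k")
    return sorted(rng.sample(range(k), m_try))


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
rows = bootstrap_indices(10, rng)                    # cases used by one tree
feats = candidate_features(k=6, m_try=2, rng=rng)    # candidates at one node
print(rows, feats)
```

Because both draws are repeated for every tree (and the variable draw for every node), two trees built from the same data will usually differ.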
In the technical scheme, the module units are configured as required; each unit performs its own function and cooperates with the other modules to construct the model and process subsequent input data, yielding the prediction.
In the technical scheme, the evaluation indexes are used for evaluation, and if the evaluation indexes meet requirements, the model is subjected to persistent storage for subsequent application.
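The evaluate-then-persist step can be sketched as below. The source does not name the evaluation indexes or a storage format, so the metrics (MAE, RMSE, R²), the RMSE threshold `max_rmse`, the file name, and the use of `pickle` are all assumptions chosen for illustration; the dictionary stands in for a trained model.

```python
import math
import os
import pickle
import tempfile


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def persist_if_acceptable(model, metrics, path, max_rmse):
    """Write the model to disk only when it meets the assumed RMSE threshold."""
    if metrics["rmse"] > max_rmse:
        return False
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    return True


metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")  # hypothetical file name
saved = persist_if_acceptable({"demo": "model"}, metrics, path, max_rmse=1.0)
print(metrics, saved)
```

A model that fails the threshold is simply not written, so the stored file always holds a model that passed evaluation.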
As a further improvement of the invention, the system also comprises an updating module: when the persistently stored model no longer meets the standard, the model evaluation module continues to evaluate models until a new model that meets the standard is found, and the old model that no longer meets the standard is replaced by the new model.
In the technical scheme, when the old model is eliminated, the corresponding updating and replacing can be carried out through the updating module so as to realize the effects of continuity, durability and updating.
As a further improvement of the invention, the invention also comprises a prediction method of the network equipment flow prediction system based on the random forest algorithm, which comprises the following steps:
data collection: according to network flow data associated with the operation of the network equipment, acquiring related data associated with the network equipment in the operation of the network equipment, and classifying the related data to obtain a data set of a plurality of types of data;
data processing: filtering, processing and analyzing each type of data set to obtain the distribution condition of the evaluation index in each type of data set;
data feature processing is also applied to part of the data sets among the several types: according to the type of the screened data, noise is removed while the curve features of the data set are preserved, and the evaluation index ratio of the data is obtained through normalization;
constructing a model: selecting the relevant influence factors of each type of data by adopting a random forest algorithm, constructing the model on a training set by combining the core index values and the evaluation index ratio, and testing it on a test set;
model evaluation, solidification and prediction: dividing the data set into a training set and a test set, training the model with the training set, validating it with the test set, and evaluating it according to the evaluation index; if the evaluation index meets the requirement, the model is persistently stored and used to predict the traffic of the network equipment.
In the technical scheme, a model is constructed by utilizing multiple types of data, so that the influence of multiple factors on the flow can be fully considered, and the prediction accuracy is improved; and the post-processing of the data set, the association with the evaluation index and the like provide a relatively accurate basis for the subsequent further evaluation.
The random forest in the technical scheme is a classifier, built in a randomized way, that comprises a plurality of decision trees. Its output class is the mode of the classes output by the individual trees.
Randomness is mainly reflected in two aspects: (1) when each tree is trained, a data set of size N is drawn with replacement from all N training samples (bootstrap sampling); (2) at each node, a subset of the features is randomly selected for computing the optimal split.
Its advantages are as follows:
1. It performs well on many current data sets and has great advantages over other algorithms.
2. It can process data with very high dimensionality (many features) without requiring feature selection.
3. After training, it can report which features are important.
4. When a random forest is created, an unbiased estimate of the generalization error is used, and the model generalizes well.
5. Training is fast and easy to parallelize.
6. During training, interactions between features can be detected.
7. It is relatively simple to implement.
8. For unbalanced data sets, it can balance the error.
9. Accuracy can still be maintained when a significant portion of the features is missing.
In the technical scheme, the later stage also involves model evaluation and solidification (persistent storage), so whether the model remains qualified can be monitored; a qualified model can be reused and verified later, which improves accuracy.
As a further improvement of the present invention, the data at least includes 3 types, which are respectively network device traffic data, host traffic data, and network design parameter information, the traffic data of the network device and the traffic data of the host are time sequence data sorted by time, and the data granularity of the time sequence data is in the minute level.
In the technical scheme, three different types of data are used to collect traffic parameters of the relevant network equipment. This takes into account the equipment as well as interference from factors such as the network itself, so the factors considered are more comprehensive and diverse, which improves the accuracy of the later stages.
As a further improvement of the present invention, in the data collecting step, the network design parameter information includes a local area network topology structure, a maximum load and/or a carrying capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.
In the technical scheme, the network design parameters take the various topological structures fully into account and also incorporate factors such as compliance, carrying capacity, safety and stability, so the system as a whole considers these factors in a more integrated way and the network design parameters are more comprehensive.
As a further improvement of the invention, the data processing specifically comprises index filtering, data preprocessing and threshold solving to obtain the distribution condition of the evaluation index in each type of data set.
In the technical scheme, the data processing first performs preliminary noise reduction and removal of impurities through filtering; the data is then screened and otherwise processed as required during data preprocessing; finally, the denoised data is analyzed to find the core evaluation indexes, which reduces subsequent computation and simplifies the data.
As a further improvement of the invention, the index filtering is to screen out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.
In the technical scheme, screening, identification, repair and analysis give the data a complete processing chain from start to finish, so that later use of the data involves little noise and little computation, while also providing data support for the earlier data screening.
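The correlation-analysis part of index filtering might look like the sketch below. The source does not specify a correlation measure or cut-off, so the Pearson coefficient, the threshold `min_abs_corr`, and the index names in the example are assumptions; screening by mechanism knowledge is a manual step and is not modeled here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / (sx * sy)


def filter_indexes(candidates, target, min_abs_corr=0.5):
    """Keep the indexes whose absolute correlation with the target is high enough."""
    return [name for name, series in candidates.items()
            if abs(pearson(series, target)) >= min_abs_corr]


target = [10, 20, 30, 40, 50]          # e.g. observed traffic
candidates = {                         # hypothetical candidate indexes
    "link_utilization": [1, 2, 3, 4, 5],
    "fan_speed": [3, 3, 3, 3, 3],
    "active_hosts": [5, 4, 3, 2, 1],
}
kept = filter_indexes(candidates, target)
print(kept)
```

The absolute value is used so that strongly negatively correlated indexes are kept as well.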
As a further improvement of the present invention, the data feature processing of the partial data set specifically includes: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.
In the technical scheme, continuous data is volatile and bursty, which strongly affects model training and accuracy, so a filtering algorithm such as a Kalman filter is needed to smooth it. Different evaluation indexes often have different dimensions and units, which affects the results of data analysis; to eliminate the effect of differing units among the indexes, the data is generally normalized, although not all models require normalization.
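A scalar Kalman filter is one way to carry out the smoothing mentioned above. The sketch below assumes a constant-level (random walk) state model; the noise variances `process_var` and `measurement_var`, the initial error variance, and the sample readings are illustrative assumptions, not values from the source.

```python
def kalman_filter_1d(values, process_var=1e-2, measurement_var=1.0):
    """Filter a series with a scalar Kalman filter (random-walk state model)."""
    estimate = float(values[0])   # start from the first measurement
    error_var = 1.0               # assumed initial uncertainty
    filtered = [estimate]
    for z in values[1:]:
        # Predict: the level is assumed unchanged, so only uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        filtered.append(estimate)
    return filtered


noisy = [100, 130, 90, 120, 95, 125, 100, 128]   # made-up traffic readings
smooth = kalman_filter_1d(noisy)
print([round(v, 1) for v in smooth])
```

A smaller `process_var` relative to `measurement_var` gives heavier smoothing, at the cost of reacting more slowly to genuine shifts in traffic.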
As a further improvement of the invention, the denoising process comprises a smoothing process and parameter threshold filtering, and the smoothing process adopts a filtering algorithm.
In the technical scheme, the various denoising steps reduce the volatility and burstiness of the data, which lessens their effect on precision in later computation and improves overall precision.
As a further improvement of the present invention, the model is constructed by repeatedly drawing b training sample sets from the data sets, constructing b regression trees from the b training sample sets, taking the cases not drawn each time to form b out-of-bag data sets as test sample sets, constructing the random forest, and obtaining a prediction result using the test sample sets.
In the technical scheme, equal numbers of training sample sets and test sample sets are selected, which improves the test precision and yields an optimal prediction result.
Drawings
Fig. 1 is a flowchart of a method for predicting network device traffic according to the present invention;
Fig. 2 is a flowchart of a random forest regression method provided by the present invention;
Fig. 3 is a flowchart of network device traffic prediction according to the present invention;
Fig. 4 is a schematic circuit diagram of traffic prediction of a network device according to the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
Example 1
In this embodiment, the main modules of the network device traffic prediction system based on a random forest algorithm are introduced.
Referring to Fig. 4, the network device traffic prediction system based on a random forest algorithm comprises:
a data acquisition module: acquiring data associated with the network device in operation;
a data processing module: processing the collected data to obtain the evaluation indexes and the evaluation index distribution;
a model construction module: constructing a random forest regression model through a random forest algorithm;
a model evaluation module: evaluating the random forest regression model against the evaluation standard of the regression prediction model, using the collected data set and the evaluation indexes and evaluation index distribution obtained by the data processing module, and persistently storing a model that meets the requirements after evaluation;
an input-output module: computing with the relevant input data and the persistently stored model, and outputting a predicted value of the network device traffic.
In this embodiment, the module units are configured as required; each unit performs its own function and cooperates with the other modules to construct the model and process subsequent input data, yielding the prediction.
In this embodiment, a random forest algorithm is selected. Like a traditional regression model, random forest regression can explain the influence of several independent variables on a dependent variable. Suppose that the dependent variable Y has n observations (n cases) and that k independent variables influence it. When constructing each regression tree, the random forest draws a sample of the observations of Y by Bootstrap resampling and, at each node, randomly selects a specified number of the k independent variables as candidates for the split. Because of this randomness, each constructed regression tree may be different. A random forest typically generates hundreds or even thousands of such trees. For classification, the class output by the most trees is taken as the final result; for regression, as used here, the predictions of the trees are averaged.
Specifically, the system further comprises an updating module: when the persisted model no longer meets the standard, the model evaluation module continues to evaluate models until a new model that meets the standard is found, and the old model that no longer meets the standard is replaced by the new model.
In this embodiment, when an old model is eliminated, a corresponding update replacement can be performed through the update module, so as to achieve the effects of persistence, permanence, and update.
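The replace-on-failure behavior of the updating module can be sketched as a small registry. The class names `StoredModel` and `ModelRegistry`, the RMSE-based standard `max_rmse`, and the string placeholders used as models are all hypothetical; the source does not prescribe how the standard is expressed.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StoredModel:
    model: Any
    rmse: float


class ModelRegistry:
    """Holds the currently persisted model and swaps it when a better one qualifies."""

    def __init__(self, max_rmse: float):
        self.max_rmse = max_rmse            # assumed acceptance standard
        self.current: Optional[StoredModel] = None

    def needs_update(self, fresh_rmse: float) -> bool:
        """True when nothing is stored or the stored model fails on fresh data."""
        return self.current is None or fresh_rmse > self.max_rmse

    def offer(self, candidate: Any, candidate_rmse: float) -> bool:
        """Replace the stored model only if the candidate meets the standard."""
        if candidate_rmse <= self.max_rmse:
            self.current = StoredModel(candidate, candidate_rmse)
            return True
        return False


registry = ModelRegistry(max_rmse=5.0)
registry.offer("model_v1", 3.2)              # initial model passes
if registry.needs_update(fresh_rmse=7.5):    # v1 no longer meets the standard
    registry.offer("model_v2", 9.0)          # rejected, v1 stays for now
    registry.offer("model_v3", 4.1)          # accepted, replaces v1
print(registry.current)
```

Keeping the old model until a qualifying replacement arrives means the input-output module always has some model to predict with.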
Example 2
In this embodiment, a flow of a prediction method of a network device traffic prediction system based on a random forest algorithm is mainly described.
Referring to fig. 1-4, the method includes the following steps:
data collection: according to network flow data associated with the operation of the network equipment, acquiring related data associated with the network equipment in the operation of the network equipment, and classifying the related data to obtain a data set of a plurality of types of data;
data processing: filtering, processing and analyzing each type of data set to obtain the distribution condition of the evaluation index in each type of data set;
data feature processing is also applied to part of the data sets among the several types: according to the type of the screened data, noise is removed while the curve features of the data set are preserved, and the evaluation index ratio of the data is obtained through normalization;
constructing a model: selecting the relevant influence factors of each type of data by adopting a random forest algorithm, constructing the model on a training set by combining the core index values and the evaluation index ratio, and testing it on a test set;
model evaluation, solidification and prediction: dividing the data set into a training set and a test set, training the model with the training set, validating it with the test set, and evaluating it according to the evaluation index; if the evaluation index meets the requirement, the model is persistently stored and used to predict the traffic of the network device.
In the embodiment, a model is constructed by utilizing multiple types of data, so that the influence of multiple factors on the flow can be fully considered, and the prediction accuracy is improved; and the post-processing of the data set, the association with the evaluation index and the like provide a relatively accurate basis for the subsequent further evaluation.
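The division into a training set and a test set in the last step could be done as sketched below. The source does not say how the split is made; splitting in time order (rather than shuffling) and the 20% test share are assumptions, chosen here because the traffic data is a time series and a chronological split keeps future samples out of training.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows so the most recent ones form the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


samples = list(range(10))              # stand-in for 10 time-ordered samples
train, test = chronological_split(samples)
print(train, test)
```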
The random forest in this embodiment is a classifier, built in a randomized way, that comprises a plurality of decision trees. Its output class is the mode of the classes output by the individual trees.
Randomness is mainly reflected in two aspects: (1) when each tree is trained, a data set of size N is drawn with replacement from all N training samples (bootstrap sampling); (2) at each node, a subset of the features is randomly selected for computing the optimal split.
Its advantages are as follows:
1. It performs well on many current data sets and has great advantages over other algorithms.
2. It can process data with very high dimensionality (many features) without requiring feature selection.
3. After training, it can report which features are important.
4. When a random forest is created, an unbiased estimate of the generalization error is used, and the model generalizes well.
5. Training is fast and easy to parallelize.
6. During training, interactions between features can be detected.
7. It is relatively simple to implement.
8. For unbalanced data sets, it can balance the error.
9. Accuracy can still be maintained when a significant portion of the features is missing.
In the embodiment, the later stage also involves model evaluation and solidification (persistent storage), so whether the model remains qualified can be monitored; a qualified model can be reused and verified later, which improves accuracy.
Example 3
In this embodiment, related data related to the operation of the network device is mainly introduced.
Specifically, the data at least includes 3 types, which are respectively network device traffic data, host traffic data, and network design parameter information, where the traffic data of the network device and the traffic data of the host are time sequence data sorted by time, and the data granularity of the time sequence data is in the minute level.
In this embodiment, three different types of data are used to collect traffic parameters of the relevant network equipment. This takes into account the equipment as well as interference from factors such as the network itself, so the factors considered are more comprehensive and diverse, which improves the accuracy of the later stages.
Further, in the data collection step, the network design parameter information includes a local area network topology structure, a maximum load and/or a bearing capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.
In this embodiment, the network design parameters fully consider various topological structures, and also integrate the factors such as compliance, bearing capacity, safety and stability, so that the whole system is more comprehensive in consideration of the factors, and the comprehensiveness of the network design parameters is improved.
Example 4
In this embodiment, other key steps are introduced.
Referring to the attached drawings, the data processing specifically includes index filtering, data preprocessing and threshold solving to obtain distribution conditions of the evaluation indexes in each type of data set.
In the embodiment, the data processing first performs preliminary noise reduction and removal of impurities through filtering; the data is then screened and otherwise processed as required during data preprocessing; finally, the denoised data is analyzed to find the core evaluation indexes, which reduces subsequent computation and simplifies the data.
Further, the index filtering specifically comprises screening out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.
In the embodiment, screening, identification, repair and analysis give the data a complete processing chain from start to finish, so that later use of the data involves little noise and little computation, while also providing data support for the earlier data screening.
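Handling of missing values during data preprocessing could, for example, use linear interpolation. The source does not state which repair method is used, so interpolation, the use of `None` to mark a missing reading, and the function name `fill_missing` are assumptions for this sketch.

```python
def fill_missing(series):
    """Linearly interpolate None gaps; gaps at the edges copy the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


repaired = fill_missing([None, 10.0, None, None, 40.0, None])
print(repaired)
```

Interpolation suits minute-level traffic series with short gaps; longer outages would likely need a different treatment.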
Further, the data feature processing of the partial data set specifically includes: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.
In this embodiment, continuous data is volatile and bursty, which strongly affects model training and accuracy, so a filtering algorithm such as a Kalman filter is needed to smooth the data. Different evaluation indexes often have different dimensions and units, which affects the results of data analysis; to eliminate the effect of differing units among the indexes, the data is generally normalized, although not all models require normalization.
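Expressing continuous time series data in time-lag form means using the previous few readings as the input features for predicting the next one. A minimal sketch, assuming a window of `n_lags` past values (the source does not give the window length) and made-up readings:

```python
def make_lag_features(series, n_lags):
    """Build (features, targets): each target is predicted from the n_lags values before it."""
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(series[t - n_lags:t])   # values at t-n_lags .. t-1
        targets.append(series[t])               # value at t
    return features, targets


traffic = [12, 15, 14, 18, 21, 19]    # hypothetical per-minute traffic readings
X, y = make_lag_features(traffic, n_lags=3)
for row, target in zip(X, y):
    print(row, "->", target)
```

Each row of `X` together with the matching entry of `y` forms one training case for the regression model.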
Further, the denoising process comprises a smoothing process and parameter threshold filtering, and the smoothing process adopts a filtering algorithm.
In the embodiment, the various denoising steps reduce the volatility and burstiness of the data, which lessens their effect on precision in later computation and improves overall precision.
Specifically, the model is constructed by repeatedly drawing b training sample sets from the data sets, constructing b regression trees from the b training sample sets, taking the cases not drawn each time to form b out-of-bag data sets as test sample sets, constructing the random forest, and obtaining a prediction result using the test sample sets.
In this embodiment, equal numbers of training sample sets and test sample sets are selected, which improves the test accuracy and yields an optimal prediction result.
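The pairing of each training sample set with its out-of-bag test set can be sketched as follows. The function name `bootstrap_with_oob`, the data set size of 20, b = 5, and the seed are assumptions for the example.

```python
import random


def bootstrap_with_oob(n, rng):
    """Return (in-bag indices drawn with replacement, out-of-bag indices never drawn)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)                                   # seed fixed for repeatability
splits = [bootstrap_with_oob(20, rng) for _ in range(5)]  # b = 5 trees
for in_bag, out_of_bag in splits:
    print(len(set(in_bag)), "distinct in-bag cases,", len(out_of_bag), "out-of-bag cases")
```

Each tree is trained on its in-bag cases and tested on its own out-of-bag cases, which is why the number of training sample sets and test sample sets is the same.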
Example 5
In this embodiment, specific applications are mainly described.
Referring to fig. 1-4, the method includes the following steps:
Step 1: data preparation
Three types of data set are mainly required to construct the model: network equipment traffic data, host traffic data, and network design parameter information. The network design parameter information mainly includes the local area network topology, the maximum load or carrying capacity of network traffic on each link, and the load for ensuring the safety and stability of the network environment. The network equipment and host traffic data are time series data with minute-level granularity.
Step 2: data processing
The main work of the data processing stage comprises index filtering, data preprocessing and threshold solving. Index filtering screens the indexes using correlation analysis and mechanism knowledge, based on the collected information. Data preprocessing identifies and handles abnormal values and missing values in the data set; commonly used methods for outlier identification include the 3-sigma criterion and the quantile method (boxplot). Threshold solving explores the distribution of the key index values, thereby providing data support for data screening.
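The two outlier identification methods named above can be sketched with the standard library. The boxplot multiplier `k = 1.5` is the conventional choice and is an assumption here, as are the sample readings; the population standard deviation is used for the 3-sigma rule.

```python
import statistics


def three_sigma_outliers(values):
    """Values lying more than three standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > 3 * std]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [48, 50, 52, 49, 51] * 4 + [500]   # 20 normal readings and one spike
print(three_sigma_outliers(readings), iqr_outliers(readings))
```

The 3-sigma rule needs a reasonably large sample to flag anything, since a single extreme value also inflates the standard deviation; the quantile method is less sensitive to that effect.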
Step 3: feature engineering
The feature engineering process mainly comprises three parts: feature calculation, data filtering and data normalization. The network traffic data is time series data and the total traffic demand in the whole network is continuous, so the features are expressed in time-lag form during feature calculation. Data filtering mainly consists of smoothing the traffic data and filtering by parameter thresholds. In particular, network traffic is volatile and bursty, which strongly affects model training and accuracy, so a filtering algorithm such as a Kalman filter is needed to smooth the traffic data. Different evaluation indexes often have different dimensions and units, which affects the results of data analysis; to eliminate the effect of differing units among the indexes, the data is generally normalized, although not all models require normalization.
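Normalization can be illustrated with min-max scaling, which maps every index onto the range 0 to 1 regardless of its unit. The source does not say which normalization method is used, so min-max scaling and the sample values are assumptions.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant index: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


link_load_mbps = [200, 350, 500]       # hypothetical index measured in Mbit/s
scaled = min_max_normalize(link_load_mbps)
print(scaled)
```

Tree-based models split on thresholds of individual variables, so they are largely insensitive to this kind of rescaling, which is consistent with the remark that not all models require normalization.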
Step 4: model construction
For flow prediction, the random forest algorithm is selected in this project. Like a traditional regression model, random forest regression can explain the influence of several independent variables on a dependent variable. Suppose that the dependent variable Y has n observations, i.e., n cases, and that k independent variables influence the observed values. When constructing a regression tree, the random forest randomly draws part of the observations of the dependent variable Y by the Bootstrap resampling method and randomly selects a specified number of variables from the k independent variables to determine the nodes of the tree. Because of this randomness, the regression tree may differ on each construction. On this basis, a random forest can generally generate hundreds or even thousands of classification trees at random, and the tree with the highest degree of repetition is selected as the final result. The regression trees form a combined model $\{h(X, \theta_j),\ j = 1, 2, \ldots, b\}$, and the predicted value of the random forest regression model is the average of the b regression trees $h(X, \theta_j)$.
The model satisfies the condition that the training sets forming the random forest are independent of each other, so the mean square error of the prediction vector $h(X)$ is $E_{X,Y}(Y - h(X))^2$.
The flow of the random forest regression algorithm is as follows:
(1) From the n cases of the original data set, b training sample sets are drawn repeatedly by the Bootstrap method, and b regression trees are constructed from them. Each time a training sample set is drawn, the cases that were not drawn form the out-of-bag (OOB) data, giving b OOB sets that serve as test sample sets.
(2) When a regression tree is constructed, $m_{try}$ independent variables ($m_{try} < k$) are randomly selected from the k independent variables at each branch node of the tree as candidate branching variables, and the optimal branch among them is then determined according to the branching goodness criterion.
(3) Each regression tree recursively branches from top to bottom and grows continuously, and the number n of the trees is set as a termination condition for the growth of the regression tree;
(4) The b regression trees generated form the random forest regression model. The estimation effect of the model is evaluated by the accuracy of its prediction on the out-of-bag (OOB) data, i.e., it is measured by the mean square error of the test set. If the number of samples in the out-of-bag data is m, then for the random forest regression model:
$$MSE_{OOB} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$
$$R_{RF}^2 = 1 - \frac{MSE_{OOB}}{\hat{\sigma}_y^2}$$
where $y_i$ represents the true value of the dependent variable in the OOB data, $\hat{y}_i$ represents the predicted value obtained with the random forest regression model, and $\hat{\sigma}_y^2$ represents the variance of the OOB prediction.
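The flow of the random forest regression algorithm can be illustrated with a small pure-Python sketch: Bootstrap sampling, out-of-bag sets, random selection of candidate variables at each node, averaging over the b trees, and the OOB mean square error. It is a simplified illustration, not the implementation of the patent. In particular, the split criterion (sum of squared errors), the `min_leaf` stopping rule and all function and parameter names are assumptions of this example.

```python
import random
import statistics

def sum_sq(vals):
    """Sum of squared deviations from the mean (the split criterion assumed here)."""
    m = statistics.fmean(vals)
    return sum((v - m) ** 2 for v in vals)

def build_tree(X, y, mtry, min_leaf, rng):
    """Grow one regression tree top-down; a leaf stores the mean of its y values."""
    if len(y) < 2 * min_leaf or len(set(y)) == 1:
        return statistics.fmean(y)
    best = None  # (sse, feature index, threshold)
    for f in rng.sample(range(len(X[0])), mtry):  # m_try candidate variables
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[f] <= thr]
            right = [yi for row, yi in zip(X, y) if row[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            sse = sum_sq(left) + sum_sq(right)
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    if best is None:  # no admissible split: stop growing this branch
        return statistics.fmean(y)
    _, f, thr = best
    lo = [i for i, row in enumerate(X) if row[f] <= thr]
    hi = [i for i, row in enumerate(X) if row[f] > thr]
    return (f, thr,
            build_tree([X[i] for i in lo], [y[i] for i in lo], mtry, min_leaf, rng),
            build_tree([X[i] for i in hi], [y[i] for i in hi], mtry, min_leaf, rng))

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node

def fit_forest(X, y, b=50, mtry=1, min_leaf=2, seed=0):
    """Build b trees on Bootstrap samples; also return each tree's OOB indices."""
    rng = random.Random(seed)
    n = len(y)
    trees, oob_sets = [], []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        drawn = set(idx)
        oob_sets.append([i for i in range(n) if i not in drawn])
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                mtry, min_leaf, rng))
    return trees, oob_sets

def predict_forest(trees, row):
    """Forest prediction: the average of the individual tree predictions."""
    return statistics.fmean([predict_tree(t, row) for t in trees])

def oob_mse(trees, oob_sets, X, y):
    """MSE over cases predicted only by trees that did not train on them."""
    preds = {}
    for tree, oob in zip(trees, oob_sets):
        for i in oob:
            preds.setdefault(i, []).append(predict_tree(tree, X[i]))
    return statistics.fmean([(y[i] - statistics.fmean(p)) ** 2
                             for i, p in preds.items()])

# Hypothetical data: one lag feature and a linear target.
X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
trees, oob_sets = fit_forest(X, y, b=30)
print(predict_forest(trees, [5.0]), oob_mse(trees, oob_sets, X, y))
```

Because each case is left out of roughly a third of the Bootstrap samples, the OOB error gives an estimate of generalization error without holding out a separate test set.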
Step 5: model evaluation and persistence
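With reference to the evaluation standards for regression prediction models in industry and academia, the evaluation indexes for the regression prediction model comprise the following, where N is the number of evaluated samples, $y_i$ is the true value and $\hat{y}_i$ is the predicted value:
(1) Mean square error (MSE):
$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
(2) Root mean square error (RMSE):
$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
(3) Mean absolute error (MAE):
$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$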
In the model training process, the data set is divided into a training set and a verification set at a ratio of 8:2. The training set is used to train the model; after training is finished, the verification set is used to test the model effect, which is assessed with the evaluation indexes above. If every evaluation index meets the requirement, the model is stored persistently for subsequent application.
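The evaluation and persistence step can be sketched as below. The three metrics follow their standard definitions. The chronological split (earliest 80% for training), the acceptance thresholds `max_rmse` and `max_mae`, the use of `pickle` for persistence and all function names are assumptions made for this illustration; the patent only states the 8:2 ratio and that the model is persisted when the indexes meet the requirement.

```python
import math
import pickle

def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def split_train_validation(samples):
    """8:2 split that keeps time order: earliest 80% train, latest 20% validate."""
    cut = len(samples) * 8 // 10
    return samples[:cut], samples[cut:]

def persist_if_acceptable(model, y_true, y_pred, path, max_rmse, max_mae):
    """Pickle the model to `path` only if both (hypothetical) thresholds are met."""
    scores = {"mse": mse(y_true, y_pred),
              "rmse": rmse(y_true, y_pred),
              "mae": mae(y_true, y_pred)}
    accepted = scores["rmse"] <= max_rmse and scores["mae"] <= max_mae
    if accepted:
        with open(path, "wb") as fh:
            pickle.dump(model, fh)
    return accepted, scores

# Hypothetical validation values and predictions.
train, valid = split_train_validation(list(range(10)))
print(len(train), len(valid))
print(mse([0.0, 0.0], [3.0, 4.0]), mae([0.0, 0.0], [3.0, 4.0]))
```

Splitting in time order rather than at random avoids training on observations that come later than the ones used for validation, which matters for lagged time sequence features.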
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A network equipment flow prediction system based on the random forest algorithm, characterized by comprising a data acquisition module: the data acquisition module is used for acquiring relevant data associated with the network equipment in operation;
a data processing module: according to the collected data, corresponding data processing is carried out, and then relevant evaluation index distribution and evaluation indexes are obtained;
a model construction module: constructing a random forest regression model through a random forest algorithm;
a model evaluation module: according to the evaluation standard of the regression prediction model, evaluating the random forest regression model by using the collected data set and the evaluation index distribution and evaluation index obtained in the data processing module, and performing persistent storage on the model meeting the requirements after evaluation;
an input-output module: and the input and output module calculates by using the relevant input data and a persistence stored model and outputs a predicted value of the network equipment flow.
2. The system of claim 1, further comprising an updating module, wherein when the persistently stored model is evaluated as not meeting the requirements, evaluation by the model evaluation module continues until a new model meeting the requirements is obtained, and the old non-compliant model is replaced with the new model.
3. The system for predicting network equipment traffic based on the random forest algorithm according to claim 1, wherein the method for predicting the network equipment traffic based on the random forest algorithm comprises the following steps:
data collection: according to network flow data associated with the operation of the network equipment, acquiring related data associated with the network equipment in the operation of the network equipment, and classifying the related data to obtain a data set of a plurality of types of data;
data processing: filtering, processing and analyzing each type of data set to obtain the distribution condition of the evaluation index in each type of data set;
data feature processing: feature processing is also performed on part of the data sets among the plurality of types; according to the type of the screened data, noise is removed while the curve features of the data set are preserved, and the evaluation index ratio of the data is obtained through normalization;
constructing a model: selecting relevant influence factors of various data by adopting a random forest algorithm, and carrying out model construction by combining a core index value and an evaluation index ratio by using a training set, and testing by using a testing set;
model evaluation, persistence and prediction: dividing the data set into a training set and a test set, training the model with the training set, verifying it with the test set, evaluating it according to the evaluation indexes, and, if the evaluation indexes meet the requirements, predicting the flow of the network equipment according to the model.
4. The system of claim 3, wherein the data includes at least 3 types, which are network device traffic data, host traffic data, and network design parameter information, the network device traffic data and the host traffic data are time sequence data sorted in time, and the data granularity of the time sequence data is in the order of minutes.
5. The system of claim 4, wherein in the data collection step, the network design parameter information includes a topology structure of a local area network, a maximum load and/or a bearing capacity of network traffic of each link, and compliance information for ensuring safety and stability of a network environment.
6. The system for predicting network equipment traffic based on the random forest algorithm according to claim 3, wherein the data processing specifically comprises index filtering, data preprocessing and threshold solving, so as to obtain the distribution of the evaluation indexes in each type of data set.
7. The system for predicting the network equipment flow based on the random forest algorithm according to claim 6, wherein the index filtering is to screen out qualified indexes from a plurality of types of data through correlation analysis and mechanism knowledge; the data preprocessing is to identify and process abnormal values and missing values in qualified indexes in a data set so as to realize the integrity of the data; and the threshold solution is to optimize the data classes to be collected by exploring the distribution of the qualified index values, thereby providing a basis for data screening.
8. The system for predicting network device traffic based on the random forest algorithm according to claim 4, wherein the data feature processing of the partial data set specifically comprises: and expressing the time sequence data with continuity in a time lag mode, and obtaining the evaluation index ratio of the data by denoising and normalizing.
9. The system of claim 8, wherein the denoising process comprises a smoothing process and a parameter threshold filtering, and the smoothing process adopts a filtering algorithm.
10. The system as claimed in claim 4, wherein the model is constructed by repeatedly extracting b training sample sets from a plurality of classes of data sets, constructing b regression trees based on the b training sample sets, selecting non-extracted cases to form b out-of-bag data as test sample sets, constructing a random forest, and obtaining a prediction result by using the test sample sets.
CN202110910221.2A 2021-08-09 2021-08-09 Network equipment flow prediction system based on random forest algorithm Pending CN113726558A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110910221.2A CN113726558A (en) 2021-08-09 2021-08-09 Network equipment flow prediction system based on random forest algorithm

Publications (1)

Publication Number Publication Date
CN113726558A true CN113726558A (en) 2021-11-30

Family

ID=78675229

Country Status (1)

Country Link
CN (1) CN113726558A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106323466A (en) * 2016-08-09 2017-01-11 西北农林科技大学 Leaf nitrogen content high spectral evaluation method for continuous wavelet transformation analysis
CN109597968A (en) * 2018-12-29 2019-04-09 西安电子科技大学 Paste solder printing Performance Influence Factor analysis method based on SMT big data
CN111294812A (en) * 2018-12-10 2020-06-16 中兴通讯股份有限公司 Method and system for resource capacity expansion planning
CN112187752A (en) * 2020-09-18 2021-01-05 湖北大学 Intrusion detection classification method and device based on random forest
CN112929215A (en) * 2021-02-04 2021-06-08 博瑞得科技有限公司 Network flow prediction method, system, computer equipment and storage medium
CN113038302A (en) * 2019-12-25 2021-06-25 中国电信股份有限公司 Flow prediction method and device and computer storage medium
CN113067724A (en) * 2021-03-11 2021-07-02 西安电子科技大学 Periodic flow prediction method based on random forest

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048874A (en) * 2022-08-16 2022-09-13 北京航空航天大学 Aircraft design parameter estimation method based on machine learning
CN115048874B (en) * 2022-08-16 2023-01-24 北京航空航天大学 Aircraft design parameter estimation method based on machine learning
CN117874654A (en) * 2024-03-13 2024-04-12 杭州小策科技有限公司 Risk monitoring method and system based on random forest algorithm
CN117874654B (en) * 2024-03-13 2024-05-24 杭州小策科技有限公司 Risk monitoring method and system based on random forest algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination