CN110926524B - Method for predicting coupling relation between network resources and environment

Info

Publication number: CN110926524B
Application number: CN201911039383.2A
Authority: CN
Inventors: 罗涛; 刘颖; 徐艳; 雷鹏; 张洁
Original assignee: CETC 7 Research Institute
Current assignee: CETC 7 Research Institute
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2022-01-04
Anticipated expiration: 2039-10-29
Also published as: CN110926524A

Abstract

The invention discloses a method for predicting a coupling relation between network resources and an environment, comprising the following steps: preprocessing an original incomplete parameter set by a missing value processing method based on multi-dimensional fuzzy mapping and a service expansion method to obtain a preliminary parameter set; performing feature construction on the preliminary parameter set by a feature construction method based on multi-dimensional environment parameters to obtain data with stronger representation capability; reducing the dimension of the features of the constructed data by combining a forward sequence search algorithm, a backward sequence search algorithm and a simulated annealing algorithm, thereby reducing the variable space of each dimension and the complexity of training the multi-dimensional representation model; and training, learning and predicting on the data by a model training method based on a decision tree, thereby achieving an accurate description of network resources under the constraints of a complex environment. The method can complete missing values quickly and accurately, and can improve prediction accuracy while also improving model efficiency.

Description

Method for predicting coupling relation between network resources and environment

Technical Field

The invention relates to the technical field of network communication, in particular to a method for predicting a coupling relation between network resources and an environment.

Background

In actual network communication, equipment failures, sudden environmental changes and similar problems cause the environment sensing equipment to acquire incomplete or inaccurate data, which in turn makes the characterization of the network resource state abnormal. Whether the network resource state is characterized accurately has a great influence on the utilization rate of the network resources.

If the incomplete or inaccurate data collected by the environment sensing equipment is not filled, it enters the analysis model as a default value of 0. These 0 inputs strongly distort the response of the model to normal data in the training or testing stage, so that the characterization result is inaccurate.

The existing missing value processing methods are mean filling, median filling and direct discarding. Mean filling is convenient and fast, and suits data with small variance; its disadvantage is that filling in the mean directly is bound to introduce some error. Median filling is suited to text-type fields, because text fields cannot be averaged. For more critical fields, mean filling should not be used, in order to avoid introducing too much error. In addition, if a critical field is of numeric type, median filling is not appropriate either. Direct discarding is mainly applied to unimportant fields, so important fields cannot be handled this way.

The quality of the features determines the upper limit of the model effect, and both the number of feature dimensions and the strength of their representation capability affect the final learning effect of the model. A model is typically built on the raw features, and the data is learned after the features are constructed. The fitting model is generally linear regression, logistic regression or the like. Linear regression is simple, but fits nonlinear data poorly. Logistic regression has a more complex form than linear regression, but its learning ability is limited, it is sensitive to missing values, and it requires the data to be normalized. Moreover, logistic regression can only find linear separations and is not suitable for nonlinear separation.

Based on the above analysis, the prior art for predicting network resources has the following disadvantages:

1. Owing to equipment failures, sudden environmental changes and similar problems, the environment sensing equipment may collect incomplete or inaccurate data. The missing data may be intrinsically related to the other data, but the prior art does not fully mine this intrinsic relation and use it to fill the missing values.

2. The effective feature dimension of the original data set is too low to learn the characterization relation between the network resources and the multi-dimensional environment accurately, and the prior art does not consider expanding the feature dimension.

3. Because the influence of the multi-dimensional environment on resources is nonlinear, conventional fitting methods predict the multi-dimensional nonlinear function with low accuracy.

Disclosure of Invention

The invention provides a method for predicting the coupling relation between network resources and an environment. It aims to solve the following problems: in actual network communication, equipment failures, sudden environmental changes and the like cause the environment sensing equipment to acquire incomplete or inaccurate data; the prior art handles missing data by directly deleting the missing value or by filling it with the previous row of data or with a mean value; and conventional fitting methods predict a multi-dimensional nonlinear function with low accuracy.

To achieve the above object, the technical solution is as follows: a method for predicting the coupling relation between network resources and an environment, comprising the following steps:

S1: acquiring original data through environment sensing equipment, and preprocessing the original data by a missing value processing method based on multi-dimensional fuzzy mapping and a service expansion method to obtain a preliminary parameter set;

S2: performing feature construction on the preliminary parameter set by a feature construction method based on multi-dimensional environment parameters to obtain data with stronger representation capability;

S3: reducing the dimension of the features of the data obtained in step S2 by combining a forward sequence search algorithm, a backward sequence search algorithm and a simulated annealing algorithm, thereby reducing the variable space of each dimension and the complexity of training the multi-dimensional representation model;

S4: training, learning and predicting on the data obtained in step S3 by a model training method based on a decision tree, thereby achieving an accurate description of network resources under the constraints of a complex environment.

Preferably, in step S1, the missing value processing method based on multi-dimensional fuzzy mapping includes the following steps:

S101: selecting the complete features, excluding the field to be filled, as a mapping subset;

S102: dividing the data into a training set and a test set according to whether the field to be filled is missing, wherein the data in which the field is complete is used as the training set and the data in which the field is missing is used as the test set;

S103: training a model to learn the mapping relation between the mapping subset and the field to be filled;

S104: predicting the field to be filled with the trained model to obtain a predicted value, thereby completing the multi-dimensional fuzzy mapping from the mapping subset to the field;

S105: filling the missing value with the predicted value.
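Steps S101 to S105 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented model: the source does not name the learner, so a k-nearest-neighbour regressor written in plain Python stands in for it, and the names `knn_predict`, `impute_field`, `target` and `k` are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Stand-in model (assumption): average the targets of the k nearest rows."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def impute_field(rows, target, k=3):
    """Fill missing values (None) in column `target` using the other columns."""
    # S101: the mapping subset is every column except the field to be filled.
    subset = [c for c in range(len(rows[0])) if c != target]

    def features(row):
        return [row[c] for c in subset]

    # S102: rows where the field is present form the training set;
    # rows where it is missing (None) are the ones to predict.
    train = [r for r in rows if r[target] is not None]
    # S103: for k-NN, "training" amounts to storing the training rows.
    train_x = [features(r) for r in train]
    train_y = [r[target] for r in train]
    # S104 and S105: predict each missing value and write it back.
    filled = [list(r) for r in rows]
    for r in filled:
        if r[target] is None:
            r[target] = knn_predict(train_x, train_y, features(r), k)
    return filled


# Toy data (made up): the last column plays the role of the field to fill.
rows = [[1.0, 2.0, 10.0], [2.0, 3.0, 20.0], [3.0, 4.0, 30.0], [2.0, 3.0, None]]
completed = impute_field(rows, target=2, k=1)
```

Any regressor that accepts the mapping subset as input could replace `knn_predict`; the split into complete and incomplete rows is the part taken from the steps above.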

Preferably, the service expansion method uses a third-party service to fill specific fields; the specific fields include weather and terrain; the weather can be filled by looking up a weather table by time and place, and the terrain can be filled by locating the position by longitude and latitude.
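The lookup described above can be sketched as follows. The source does not specify the third-party service, so two in-memory tables with made-up entries stand in for it; `WEATHER_TABLE`, `TERRAIN_GRID`, `fill_from_services` and the record keys are all hypothetical.

```python
import math

# Hypothetical tables standing in for a third-party service (made-up entries).
WEATHER_TABLE = {("2020-01-01", "site_a"): "rain"}
TERRAIN_GRID = {(23, 113): "plain"}  # keyed by integer-degree (lat, lon) cell


def fill_from_services(record):
    """Fill missing 'weather' and 'terrain' fields; unknown keys stay None."""
    filled = dict(record)
    if filled.get("weather") is None:
        # Weather: look up the weather table by time and place.
        filled["weather"] = WEATHER_TABLE.get((filled["date"], filled["place"]))
    if filled.get("terrain") is None:
        # Terrain: locate the grid cell by latitude and longitude.
        cell = (math.floor(filled["lat"]), math.floor(filled["lon"]))
        filled["terrain"] = TERRAIN_GRID.get(cell)
    return filled


record = {"date": "2020-01-01", "place": "site_a", "lat": 23.1, "lon": 113.3,
          "weather": None, "terrain": None}
result = fill_from_services(record)
```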

Further, in step S2, the feature construction method based on multi-dimensional environment parameters includes the following steps:

S201: randomly selecting two numerical features from the preliminary parameter set;

S202: applying addition, subtraction, multiplication, division, square, mean and variance operations to the values of each pair of features, thereby combining the features pairwise.
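The pairwise combination in steps S201 and S202 can be sketched as follows. The source does not say how "square", "mean" and "variance" are applied to a pair, so this sketch assumes the sum of squares, the mean of the two values and their population variance; it also enumerates all pairs instead of sampling them. The function name and the output key format are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance


def pairwise_features(row):
    """Build combined features for every pair of numeric features in `row`.

    `row` maps a feature name to a numeric value.
    """
    out = {}
    for (na, a), (nb, b) in combinations(row.items(), 2):
        out[f"{na}+{nb}"] = a + b
        out[f"{na}-{nb}"] = a - b
        out[f"{na}*{nb}"] = a * b
        if b != 0:  # assumption: skip the ratio when the divisor is zero
            out[f"{na}/{nb}"] = a / b
        out[f"sq({na},{nb})"] = a * a + b * b      # assumption: sum of squares
        out[f"mean({na},{nb})"] = mean([a, b])
        out[f"var({na},{nb})"] = pvariance([a, b])  # assumption: population variance
    return out


combined = pairwise_features({"x": 4.0, "y": 2.0})
```

With n numeric features this produces up to 7 · n(n−1)/2 combined features, which is why a feature reduction step follows.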

Still further, in step S3, the simulated annealing algorithm specifically includes the following steps:

A1: initialization: initial temperature T, initial solution state S, and number of iterations L for each value of T;

A2: for k = 1, …, L, executing steps A3 to A6;

A3: generating a new solution S';

A4: calculating the increment ΔT = C(S') - C(S), where C(S) is the evaluation function;

A5: if ΔT < 0, accepting S' as the new current solution; otherwise, accepting S' as the new current solution with probability exp(-ΔT/T);

A6: if several consecutive new solutions are not accepted, outputting the current solution as the optimal solution and ending the program;

A7: gradually decreasing T, with T → 0, and then returning to step A2.
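Steps A1 to A7 can be sketched as follows. The cooling schedule, the rejection limit and the stopping temperature are not given in the source, so the geometric cooling factor `cooling`, the limit `max_rejects` and the floor `t_min` are assumptions, as are all function and parameter names.

```python
import math
import random


def simulated_annealing(initial, cost, neighbour, t0=1.0, cooling=0.9,
                        iters_per_t=20, max_rejects=50, t_min=1e-6, seed=0):
    """Minimise `cost`, following the A1 to A7 loop structure."""
    rng = random.Random(seed)
    current, temp, rejects = initial, t0, 0            # A1: initialisation
    while temp > t_min:
        for _ in range(iters_per_t):                   # A2: L iterations per T
            candidate = neighbour(current, rng)        # A3: new solution S'
            delta = cost(candidate) - cost(current)    # A4: increment
            # A5: accept improvements, else accept with probability exp(-delta/T)
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                current, rejects = candidate, 0
            else:
                rejects += 1
            if rejects >= max_rejects:                 # A6: stop after repeated rejections
                return current
        temp *= cooling                                # A7: lower T towards 0
    return current


# Toy problem (made up): minimise (x - 3)^2 over the integers.
best_x = simulated_annealing(
    initial=0,
    cost=lambda x: (x - 3) ** 2,
    neighbour=lambda x, rng: x + rng.choice((-1, 1)),
)
```

As in step A6, the sketch returns the current solution when the run ends, not the best solution seen so far.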

Still further, in step S3, combining the forward sequence search, the backward sequence search and the simulated annealing algorithm to reduce the dimension of the features includes the following steps:

S301: combining forward sequence search and backward sequence search with the cross validation (CV) score as the evaluation standard; in the forward search, one variable is added at a time, and the variable that raises the CV score the most is selected;

S302: after the forward search is finished, performing a backward sequence search that removes the retained feature items one by one, and alternately executing steps S301 and S302 until the CV score no longer improves, at which point the cross validation score is optimal;

S303: adding a preselected latent variable by the simulated annealing algorithm, the latent variable being the combined feature with the highest CV score in the model training stage, and alternately executing steps S301 and S302 again;

S304: after the forward sequence search, the backward sequence search and the simulated annealing algorithm are completed, randomly selecting 4 to 8 unused 2-level combined features on the basis of the combined features already obtained and performing a random sequence search; if the CV score improves, returning to step S301 and restarting the search; once the CV score no longer improves, it is considered to be the highest, and the corresponding features are selected as the best feature combination.
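The alternation of steps S301 and S302 can be sketched as follows. The sketch covers only the forward and backward passes; the simulated annealing perturbation of S303 and the random search of S304 are left out. The callable `cv_score`, which maps a feature set to a cross validation score, is hypothetical and is replaced here by a made-up toy function.

```python
def forward_backward_select(candidates, cv_score):
    """Alternate forward additions and backward removals until CV stops rising."""
    selected = frozenset()
    best = cv_score(selected)
    improved = True
    while improved:
        improved = False
        # S301: try each unused feature and keep the one that lifts CV the most.
        additions = [(cv_score(selected | {f}), f)
                     for f in candidates if f not in selected]
        if additions:
            score, feat = max(additions)
            if score > best:
                selected, best, improved = selected | {feat}, score, True
        # S302: try dropping each retained feature and keep the best removal.
        removals = [(cv_score(selected - {f}), f) for f in selected]
        if removals:
            score, feat = max(removals)
            if score > best:
                selected, best, improved = selected - {feat}, score, True
    return selected, best


# Toy score (made up): features "a" and "b" help, anything else costs 0.1.
USEFUL = {"a", "b"}


def toy_cv_score(features):
    return len(features & USEFUL) - 0.1 * len(features - USEFUL)


chosen, chosen_score = forward_backward_select(["a", "b", "c"], toy_cv_score)
```

Because the score must strictly increase for a change to be kept, the loop always terminates; it can still stop at a local optimum, which is the weakness the simulated annealing step is meant to address.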

Still further, in step S4, training, learning and predicting on the data by the model training method based on a decision tree includes the following steps:

S401: to define the complexity of the tree, the tree is first written as f_t(x) = w_{q(x)}, w ∈ R^T, q: R^d → {1, 2, ..., T}, which splits the tree into a structure function q and a leaf weight vector w; the structure function q maps an input to the index of a leaf, w gives the score of the leaf with each index, and T is the number of leaves;

S402: the complexity of each tree is defined as:

wherein T represents the number of leaves, with γ as its coefficient,

and the L2 regular term is the squared norm of w;

formula (1) thus combines the number of leaf nodes in the tree with the L2 regular term of the output score of each leaf node;

S403: under formula (1), the objective function is rewritten as:

where I_j represents the set of samples on the j-th leaf: I_j = {i | q(x_i) = j}; formula (2) contains T independent univariate quadratic functions;

S404: taking the derivative of formula (2) with respect to w_j and setting the derivative to 0 gives:

where Obj represents the largest reduction of the objective achievable for a given tree structure;

S405: the scoring function is calculated as:

the smaller the score, the better the structure of the tree;

S406: enumerating tree structures continuously, using formula (5) to find the optimal tree structure, and adding it to the model; using a greedy method, one split is added to an existing leaf each time, and nodes are split according to the gain of the split; the gain of each splitting scheme is calculated as:

wherein the terms of the formula represent the scores of the left subtree and of the right subtree, and γ represents the complexity cost introduced by adding a new leaf node; if Gain < 0, the node should not be split into left and right branches.
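The formulas referred to in steps S402 to S406 are not reproduced in the text above. Assuming they follow the standard second-order derivation for gradient-boosted trees, which the surrounding wording matches, they would plausibly read as below. The symbols λ, g_i, h_i, G_j and H_j are assumptions that do not appear in the source, and only the numbers (1) and (2) are confirmed by the text; the structure score Obj* is presumably what the text calls formula (5).

```latex
\begin{align*}
\Omega(f_t) &= \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2} \tag{1}\\
Obj^{(t)} &\approx \sum_{j=1}^{T}\Big[\, G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T,
\qquad G_j = \sum_{i \in I_j} g_i,\quad H_j = \sum_{i \in I_j} h_i \tag{2}\\
w_j^{*} &= -\frac{G_j}{H_j + \lambda}\\
Obj^{*} &= -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T\\
Gain &= \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda}
 - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma
\end{align*}
```

Here g_i and h_i would be the first and second derivatives of the loss for sample i, and the subscripts L and R denote the left and right subtrees produced by a split. Substituting w_j* into formula (2) yields Obj*, which is consistent with the statement that formula (2) consists of T independent quadratic functions in w_j.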

The invention has the following beneficial effects:

1. To address the problem that equipment failures, sudden environmental changes and the like cause the environment sensing equipment to acquire incomplete or inaccurate data in actual network communication, the invention provides a completion method that processes missing values by multi-dimensional fuzzy mapping and service expansion, which can complete missing values quickly and accurately and improve the utilization rate of network resources.

2. By performing feature construction based on multi-dimensional environment parameters, the invention generates combined features spanning multiple dimensions and increases the characterization capability of the data, so that the characterization relation between the network resources and the multi-dimensional environment is learned more accurately.

3. Because the influence of the multi-dimensional environment on resources is nonlinear, conventional fitting methods predict the multi-dimensional nonlinear function with low accuracy. The invention uses a model training method based on a decision tree to learn the mapping of the multi-dimensional nonlinear function and characterizes the resource state as a multivariate function of the multi-dimensional environmental factors, which increases the learning ability and generalization ability of the model.

Drawings

FIG. 1 is a flowchart of the steps of the prediction method according to embodiment 1.

FIG. 2 is a flowchart of the missing value processing method based on multi-dimensional fuzzy mapping according to embodiment 1.

FIG. 3 is a schematic diagram of the fuzzy mapping constructed for the 'd (km)' field in embodiment 1.

FIG. 4 is a block diagram of the feature construction method based on multi-dimensional environment parameters according to embodiment 1.

FIG. 5 is a schematic diagram of the execution process of the simulated annealing algorithm in embodiment 1.

FIG. 6 is a flowchart of the fused feature reduction method based on forward sequence search, backward sequence search and the simulated annealing algorithm according to embodiment 1.

Detailed Description

The invention is described in detail below with reference to the drawings and the detailed description.

Embodiment 1

As shown in FIG. 1, a method for predicting a coupling relation between network resources and an environment comprises the following steps:

As shown in FIG. 2, the missing value processing method based on multi-dimensional fuzzy mapping includes the following steps:

S105: filling the missing value with the predicted value.

Specifically, for the 'd (km)' field, the 'interference', 'transmission point', 'reception point', 'transmission point altitude (m)', 'reception point altitude (m)', 'transmission clear angle (degree)', 'reception clear angle (degree)' and 'occurrence probability (%)' fields may be used as the mapping subset. The rows in which the 'd (km)' field is not missing form the training set, the rows in which it is missing form the test set, and the missing values are finally predicted with the trained model. The mapping structure is shown in FIG. 3.

Specific fields, including weather and terrain, are filled by a third-party service. For example, the weather is filled by looking up a weather table by time and place, and the terrain is filled by locating the position by longitude and latitude. For these specific fields, service expansion fills the missing values more accurately.

A mapping from complete fields to missing fields can be established from fields with a complete history, so that the missing fields can be filled accurately.

S2: to enhance the characterization capability of the environment resources, this embodiment performs feature construction on the preliminary parameter set by the feature construction method based on multi-dimensional environment parameters, so that the characterization relation between the network resources and the multi-dimensional environment is learned more accurately and data with stronger characterization capability is obtained.

As shown in FIG. 4, the feature construction method based on multi-dimensional environment parameters includes the following steps:

Cross-combining different features links them and lets them interact, so that they express nonlinearity that no single feature possesses. Cross features are constructed by addition, subtraction, multiplication, division, square, mean, variance and similar operations. For numerical features, the values of each pair of features are combined by addition, subtraction, multiplication and division, as well as by mean and variance operations.

S3: to screen out the redundant features generated by the feature construction method based on multi-dimensional environment parameters, this embodiment combines forward sequence search, backward sequence search and the simulated annealing algorithm to reduce the dimension of the features of the data obtained in step S2, thereby reducing the variable space of each dimension and the complexity of training the multi-dimensional representation model.

As shown in FIG. 5, the simulated annealing algorithm specifically includes the following steps:

A2: for k = 1, …, L, executing steps A3 to A6;

A3: generating a new solution S';

A7: gradually decreasing T, with T → 0, and then returning to step A2.

As shown in FIG. 6, in step S3, combining the forward sequence search, the backward sequence search and the simulated annealing algorithm to reduce the dimension of the features includes the following steps:

S302: after the forward search is completed, performing a backward sequence search that removes the retained feature items one by one, and alternately executing steps S301 and S302 until the CV score no longer increases, so that the cross validation score is optimal without adding redundant items;

S303: because forward sequence search and backward sequence search can become trapped in local minima, a preselected latent variable is added by the simulated annealing algorithm, the latent variable being the combined feature with the highest CV score in the model training stage, and steps S301 and S302 are alternately executed again after the latent variable is added;

S304: combinations of variables often work well in a way that single variables cannot, so a random sequence search is performed. After the forward sequence search, the backward sequence search and the simulated annealing algorithm are completed, 4 to 8 unused 2-level combined features are randomly selected on the basis of the combined features already obtained and a random sequence search is performed; if the CV score improves, the method returns to step S301 and restarts the search; once the CV score no longer improves, it is considered to be the highest, and the corresponding features are selected as the best feature combination.

The embodiment effectively overcomes the defect that the sequence search algorithm is easy to fall into a local optimal value by introducing the simulated annealing algorithm. By fusing forward sequence search, backward sequence search and simulated annealing algorithm, the embodiment increases the effectiveness of feature selection and feature reduction, thereby retaining excellent features.

S402: the complexity of each tree is defined as:

wherein T represents the number of leaves, with γ as its coefficient,

and the L2 regular term is the squared norm of w;

S403: under formula (1), the objective function is rewritten as:

S405: the scoring function is calculated as:

the smaller the score, the better the structure of the tree;

S406: enumerating tree structures continuously, using formula (5) to find the optimal tree structure, and adding it to the model based on the decision tree; using a greedy method, one split is added to an existing leaf each time, and nodes are split according to the gain of the split; the gain of each splitting scheme is calculated as:

wherein the terms of the formula represent the scores of the left subtree and of the right subtree.

To enhance the learning ability and generalization ability of the model, this embodiment provides a model training method based on a decision tree. Because the influence of the multi-dimensional environment on the network resources is nonlinear, conventional fitting methods predict the multi-dimensional nonlinear function with low accuracy. To solve this problem, this embodiment uses a decision tree method to learn the mapping of the multi-dimensional nonlinear function and characterizes the resource state as a multivariate function of the multi-dimensional environmental factors.
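The split decision used when growing such a tree can be sketched as follows. The gain formula itself is not reproduced in this text, so the sketch assumes the usual form for gradient-boosted trees: each side of a split is scored by G²/(H + λ), where G and H are the sums of first and second loss derivatives of the samples on that side. The names `g_left`, `h_left`, `g_right`, `h_right`, `lam` and the function names are hypothetical; `gamma` is the per-leaf complexity cost mentioned in the description.

```python
def leaf_score(g_sum, h_sum, lam):
    """Score of one leaf under the assumed form G^2 / (H + lambda)."""
    return g_sum * g_sum / (h_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting one leaf into a left and a right child."""
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    # Subtract gamma: the complexity cost of the extra leaf created by the split.
    return 0.5 * (children - parent) - gamma


def should_split(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """A node is split only when the gain is positive."""
    return split_gain(g_left, h_left, g_right, h_right, lam, gamma) > 0


# Made-up gradient statistics: the two sides pull in opposite directions.
useful = split_gain(-4.0, 3.0, 4.0, 3.0, lam=1.0, gamma=0.5)
useless = split_gain(1.0, 1.0, 1.0, 1.0, lam=1.0, gamma=1.0)
```

A greedy tree builder would evaluate `split_gain` for every candidate threshold of every feature and apply the split with the largest positive gain.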

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A method for predicting a coupling relation between network resources and an environment, characterized in that the prediction method comprises the following steps:

S4: training, learning and predicting on the data obtained in step S3 by a model training method based on a decision tree, thereby achieving an accurate description of network resources under the constraints of a complex environment;

in step S1, the missing value processing method based on multi-dimensional fuzzy mapping includes the following steps:

S105: filling the missing value with the predicted value.

2. The method for predicting the coupling relation between network resources and an environment according to claim 1, wherein: the service expansion method uses a third-party service to fill specific fields; the specific fields include weather and terrain; the weather is filled by looking up a weather table by time and place, and the terrain is filled by locating the position by longitude and latitude.

3. The method for predicting the coupling relation between network resources and an environment according to claim 1 or 2, wherein: in step S2, the feature construction method based on multi-dimensional environment parameters includes the following steps:

4. The method for predicting the coupling relation between network resources and an environment according to claim 3, wherein: in step S3, the simulated annealing algorithm specifically includes the following steps:

A2: for k = 1, …, L, executing steps A3 to A6;

A3: generating a new solution S';

A7: gradually decreasing T, with T → 0, and then returning to step A2.

5. The method for predicting the coupling relation between network resources and an environment according to claim 4, wherein: in step S3, combining the forward sequence search, the backward sequence search and the simulated annealing algorithm to reduce the dimension of the features comprises the following steps:

S302: after the forward search is finished, performing a backward sequence search that removes the retained feature items one by one, and alternately executing steps S301 and S302 until the CV score no longer increases;