Disclosure of Invention
The invention aims to provide a sensing network anomaly detection method and device based on feature selection, which can greatly reduce the overhead of anomaly detection and can be well fused with the traditional sensing network diagnostic tool.
In order to achieve the purpose, the invention adopts the following technical scheme:
a sensing network anomaly detection method based on feature selection comprises the following steps:
sorting the collected feature attributes according to a criterion based on the correlation coefficient;
selecting a representative feature attribute set according to a feature selection calculation result based on the correlation coefficient;
and verifying the reliability of the selected representative characteristic attribute set used for representing the running state of the system through cross validation.
Said sorting the collected feature attributes according to a correlation coefficient based criterion further comprises:
preprocessing the attribute values of the collected characteristic attributes by negative value removal;
calculating correlation coefficients among different characteristic attributes by using the preprocessed attribute values;
and sorting the characteristic attributes according to the correlation coefficient and a preset sorting criterion.
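The correlation coefficient between different feature attributes is
$$\rho(f_i, f_j) = \frac{\operatorname{cov}(f_i, f_j)}{\sqrt{\operatorname{var}(f_i)\,\operatorname{var}(f_j)}}$$
where $f_i$ and $f_j$ denote two feature attributes, cov denotes covariance, and var denotes variance, so that $\sqrt{\operatorname{var}(\cdot)}$ is the standard deviation.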
The selecting of the representative feature attribute set according to the feature selection calculation result based on the correlation coefficient further comprises:
taking the first k sequenced characteristic attributes as an initial characteristic attribute set, and calculating correlation coefficients between the initial characteristic attribute set and the rest of characteristic attributes and correlation coefficients between the k characteristic attributes;
and calculating a correlation coefficient between each characteristic attribute in the characteristic attribute set and the characteristic attribute set, and deleting or adding each characteristic attribute according to a result.
The calculation formula for calculating the correlation coefficient between each feature attribute in the feature attribute set and the feature attribute set is as follows:
wherein the formula involves the average correlation coefficient between a feature attribute and the feature attribute set, and the average correlation coefficient between different feature attributes; F denotes the currently selected feature attribute set.
A sensing network anomaly detection device based on feature selection comprises:
a sorting module for sorting the collected characteristic attributes according to a criterion based on the correlation coefficient;
the selection module is used for selecting the representative characteristic attribute set according to the characteristic selection calculation result based on the correlation coefficient;
and the verification module is used for proving the reliability of the selected representative characteristic attribute set through cross verification.
The sorting module further comprises:
the preprocessing submodule is used for preprocessing the attribute values of the collected characteristic attributes by removing negative values;
the first operation submodule is used for calculating correlation coefficients among different characteristic attributes by utilizing the preprocessed attribute values;
and the sorting submodule is used for sorting the characteristic attributes according to the correlation coefficient and a preset sorting criterion.
The selecting module further comprises:
the second operation submodule is used for taking the first k sorted feature attributes as an initial feature attribute set, and calculating the correlation coefficients between the initial feature attribute set and the remaining feature attributes and the correlation coefficients among the k feature attributes;
and the characteristic selection submodule is used for calculating a correlation coefficient between each characteristic attribute in the characteristic attribute set and the characteristic attribute set, and deleting or adding each characteristic attribute according to a result.
By adopting the technical scheme of the invention, data anomaly detection in the sensor network is achieved efficiently; a feature attribute subset selection mechanism is provided, which ensures the correctness of anomaly detection; anomaly detection based on feature selection is simple, efficient, and easy to implement; and the scheme can be integrated with other sensor network diagnostic tools as a fusible component.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
As shown in fig. 1, a method for detecting an anomaly of a sensor network based on feature selection according to an embodiment of the present invention includes:
S101, sorting the collected feature attributes according to a criterion based on the correlation coefficient.
The collected feature attributes are all the state parameters fed back by each sensor through the sensing network; each parameter corresponds to one feature attribute, and the source data shown in fig. 1 consists of these feature attributes. A description of some of the feature attributes is shown in the following table:
After the feature attributes are collected, their attribute values are preprocessed. The preprocessing removes negative values, that is, only non-negative values are retained.
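The negative-value removal can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that a sample containing any negative attribute value is dropped as a whole, and the attribute names `voltage` and `queue_len` are hypothetical.

```python
from typing import Dict, List


def remove_negative_samples(samples: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep only the samples in which every attribute value is non-negative.

    Assumption: one negative value invalidates the whole sample.
    """
    return [s for s in samples if all(v >= 0 for v in s.values())]


# Hypothetical readings reported by sensor nodes.
readings = [
    {"voltage": 3.1, "queue_len": 4.0},
    {"voltage": -1.0, "queue_len": 5.0},  # negative value, removed
    {"voltage": 3.0, "queue_len": 6.0},
]
clean = remove_negative_samples(readings)
```

Dropping the whole sample keeps the remaining attribute columns aligned, which the later pairwise correlation step relies on.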
The correlation coefficients among different feature attributes are then calculated using the preprocessed attribute values. The way in which the correlation coefficients between different feature attributes are calculated is shown in fig. 2. The correlation coefficient between different feature attributes is
$$\rho(f_i, f_j) = \frac{\operatorname{cov}(f_i, f_j)}{\sqrt{\operatorname{var}(f_i)\,\operatorname{var}(f_j)}}$$
Because $\rho(f_i, f_j) = \rho(f_j, f_i)$, the matrix of correlation coefficients among the feature attributes is symmetric and can be represented as a triangular matrix.
Here $f_i$ and $f_j$ denote two feature attributes, cov denotes covariance, and var denotes variance, so that $\sqrt{\operatorname{var}(\cdot)}$ is the standard deviation. The selection step additionally uses the average correlation coefficient between a feature attribute and the feature attribute set, and the average correlation coefficient between different feature attributes; F denotes the currently selected feature attribute set.
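For a feature attribute $f_i$ with observed values $x_i$ and a feature attribute $f_j$ with observed values $y_j$, the correlation coefficient is estimated as:
$$\hat{\rho}(f_i, f_j) = \frac{\sum_{m=1}^{n} (x_{i,m} - \bar{x}_i)(y_{j,m} - \bar{y}_j)}{\sqrt{\sum_{m=1}^{n} (x_{i,m} - \bar{x}_i)^2 \, \sum_{m=1}^{n} (y_{j,m} - \bar{y}_j)^2}}$$
where $n$ is the number of samples, $x_{i,m}$ and $y_{j,m}$ are the $m$-th observed values, and $\bar{x}_i$ and $\bar{y}_j$ are the sample means.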
The user can define what counts as high correlation according to the settings of the user's own system. For example, a threshold of 0.95 may be chosen for fully correlated attribute values: if two attribute values are correlated at or above this level, one of them may be considered redundant information. A correlation between 0.75 and 0.95 is considered relatively high, but whether the data is redundant still needs to be determined further.
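The pairwise estimate and the example thresholds can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the absolute value of the coefficient is compared against the thresholds, and a constant attribute is treated as uncorrelated. Only one triangle of the symmetric matrix is stored.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Sample Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant attribute is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def correlation_table(data: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Store only the upper triangle, since rho(a, b) == rho(b, a)."""
    names = list(data)
    return {
        (a, b): pearson(data[a], data[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def classify(rho: float, redundant: float = 0.95, high: float = 0.75) -> str:
    """Map a coefficient to the three bands of the example thresholds."""
    r = abs(rho)  # assumption: strong negative correlation also counts
    if r >= redundant:
        return "redundant"
    if r >= high:
        return "high"
    return "low"


data = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0], "c": [4.0, 1.0, 3.0, 2.0]}
table = correlation_table(data)
labels = {pair: classify(rho) for pair, rho in table.items()}
```

Here `b` is an exact multiple of `a`, so the pair falls in the "redundant" band, while `c` is only weakly related to either.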
The feature attributes are then sorted according to the correlation coefficients and a preset sorting criterion.
There are two sorting criteria:
1) a feature attribute $f_i$ that has a high correlation coefficient with other feature attributes is ranked higher;
2) if two or more feature attributes have the same high correlation coefficient value, the one with the larger average correlation coefficient value is ranked higher.
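The two criteria can be sketched as a sort with a two-part key. This is one possible reading, stated as an assumption: criterion 1 is taken to mean the largest absolute correlation coefficient a feature attribute has with any other attribute, and criterion 2 breaks ties by the mean absolute coefficient. The function name and the sample coefficients are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_features(names: List[str], corr: Dict[Tuple[str, str], float]) -> List[str]:
    """Sort feature attributes by (highest correlation, average correlation)."""

    def key(name: str) -> Tuple[float, float]:
        values = [abs(r) for pair, r in corr.items() if name in pair]
        if not values:
            return (0.0, 0.0)
        return (max(values), sum(values) / len(values))

    # Descending order: the strongest correlations come first.
    return sorted(names, key=key, reverse=True)


# Hypothetical pairwise coefficients (one triangle of the symmetric matrix).
corr = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.6}
ranking = rank_features(["a", "b", "c"], corr)
```

In this example `a` and `b` share the same highest coefficient, so the tie is resolved by the average, which places `b` first.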
S102, selecting the representative feature attribute set according to the feature selection calculation result based on the correlation coefficient.
The top k feature attributes after sorting according to the sorting criterion are taken as the initial feature attribute set. The correlation coefficients between this feature attribute set and the remaining feature attributes, as well as the correlation coefficients among the k feature attributes, can then be calculated.
The correlation between a feature attribute and the attribute set behaves as follows: it increases as the correlation coefficient between the feature attribute and the attribute set containing the k feature attributes increases, and decreases as that correlation coefficient decreases.
A correlation coefficient between each feature attribute and the feature attribute set is then calculated, and each feature attribute is deleted or retained according to the result.
Two averages are defined: the average correlation coefficient between a feature attribute and the feature attribute set (the selected feature attribute set capable of representing the system operating state), and the average correlation coefficient between different feature attributes. The correlation coefficient that measures the correlation between the feature attribute sets is calculated as:
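One form consistent with the two averages defined here is the composite correlation used in correlation-based feature selection. It is given as an assumption, and the symbols $k$, $\overline{\rho_{fF}}$ and $\overline{\rho_{ff}}$ are illustrative notation rather than symbols confirmed by the text:

```latex
\rho_{F} = \frac{k\,\overline{\rho_{fF}}}{\sqrt{k + k\,(k-1)\,\overline{\rho_{ff}}}}
```

Here $k$ is the number of feature attributes in $F$, $\overline{\rho_{fF}}$ is the average correlation coefficient between the feature attribute under test and the members of $F$, and $\overline{\rho_{ff}}$ is the average correlation coefficient between different feature attributes in $F$. For $k = 1$ the expression reduces to the plain pairwise coefficient.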
In the feature selection method based on the correlation coefficient, the correlation coefficient between a feature attribute group (a single feature attribute can also serve as a feature attribute group) and the feature attribute set is calculated according to the formula. If the correlation coefficient is greater than or equal to the threshold set by the user, the feature attribute is deleted; if it is smaller than the threshold, the feature attribute is added. A feature attribute group whose correlation coefficient is extremely low, below the lower threshold set by the user, is ignored because it has no correlation with the feature attribute set.
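The delete, add, or ignore decision can be sketched as below. This is a sketch under explicit assumptions: the set-level coefficient is computed with the composite form k·r_cf / sqrt(k + k(k−1)·r_ff), which is assumed rather than confirmed by the text, and the names `set_correlation`, `decide`, `threshold` and `floor`, together with their default values, are hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Sequence


def set_correlation(candidate: str, selected: Sequence[str],
                    corr: Callable[[str, str], float]) -> float:
    """Correlation between one feature attribute and a feature attribute set.

    Assumed form: k * mean_cf / sqrt(k + k * (k - 1) * mean_ff).
    """
    k = len(selected)
    if k == 0:
        raise ValueError("the selected set must not be empty")
    # Average correlation between the candidate and each member of the set.
    mean_cf = sum(abs(corr(candidate, f)) for f in selected) / k
    # Average correlation between different members of the set.
    pairs = list(combinations(selected, 2))
    mean_ff = sum(abs(corr(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


def decide(score: float, threshold: float = 0.95, floor: float = 0.05) -> str:
    """Apply the user thresholds: redundant, unrelated, or worth adding."""
    if score >= threshold:
        return "delete"  # carries the same information as the set
    if score < floor:
        return "ignore"  # no correlation with the set at all
    return "add"


# Hypothetical pairwise coefficients, looked up symmetrically.
pairwise = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.9, frozenset(("b", "c")): 0.9,
    frozenset(("a", "d")): 0.1, frozenset(("b", "d")): 0.3,
}
lookup = lambda x, y: pairwise[frozenset((x, y))]

score_c = set_correlation("c", ["a", "b"], lookup)
score_d = set_correlation("d", ["a", "b"], lookup)
```

With a user threshold of 0.9, `decide(score_c, threshold=0.9)` marks `c` for deletion, while `decide(score_d)` keeps `d` as an addition.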
In addition, the feature attribute search in this step may be performed with a best-first search strategy to guide the selection of the feature attribute subset. Best-first search is a heuristic search strategy that always expands the candidate predicted to be closest to the best solution; this particular type of search is called greedy best-first search.
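A greedy best-first search over feature attribute subsets can be sketched with a priority queue. This is an illustrative sketch: the scoring function is supplied by the caller, and the stopping rule (`max_stale` consecutive expansions without improvement) and the toy additive score are assumptions, not details taken from the embodiment.

```python
import heapq
from itertools import count
from typing import Callable, FrozenSet, Sequence, Tuple


def best_first_search(features: Sequence[str],
                      score: Callable[[FrozenSet[str]], float],
                      start: FrozenSet[str] = frozenset(),
                      max_stale: int = 5) -> Tuple[FrozenSet[str], float]:
    """Always expand the highest-scoring subset seen so far."""
    tie = count()  # tie-breaker so the heap never compares frozensets
    best, best_score = start, score(start)
    frontier = [(-best_score, next(tie), start)]
    visited = {start}
    stale = 0
    while frontier and stale < max_stale:
        neg_score, _, subset = heapq.heappop(frontier)
        if -neg_score > best_score:
            best, best_score, stale = subset, -neg_score, 0
        else:
            stale += 1
        # Expand: try adding each feature attribute not yet in the subset.
        for f in features:
            child = subset | {f}
            if child not in visited:
                visited.add(child)
                heapq.heappush(frontier, (-score(child), next(tie), child))
    return best, best_score


# Toy score: a sum of per-attribute weights (hypothetical values).
weights = {"a": 3, "b": 2, "c": -1}
toy_score = lambda subset: sum(weights[f] for f in subset)
best_subset, best_value = best_first_search(["a", "b", "c"], toy_score)
```

In practice the score would be derived from the correlation coefficients described in this step; the additive toy score only makes the search behaviour easy to follow.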
S103, verifying the reliability of the selected representative feature attribute set through cross validation.
Leave-one-out cross validation is performed on the selected feature attribute subset: in each round, only one feature attribute in the set is used as validation data, and the remaining feature attributes are used as training data. This continues until every feature attribute has been used once as validation data. The procedure is equivalent to k-fold cross validation with k equal to the number of feature attributes in the subset.
The leave-one-out cross validation determines whether the output representative feature attribute subset satisfies the sorting criteria in S101. The representative feature attribute set is used to represent the operating state of the system.
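The leave-one-out splitting can be sketched as a small generator. This is a minimal illustration; the function name and the feature attribute labels are hypothetical, and the check applied to each split is left to the caller.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_out(items: Sequence[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (validation item, training items) once for every item."""
    for i, held_out in enumerate(items):
        yield held_out, [x for j, x in enumerate(items) if j != i]


splits = list(leave_one_out(["f1", "f2", "f3"]))
```

Each feature attribute appears exactly once as the validation item, so a subset of size k produces k splits.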
Many sensor network diagnostic tools include a diagnostic data collection process as part of network diagnosis. The representative feature attribute subset selection of this embodiment can be integrated into the diagnostic data collection stage of such tools as a fusible component, reducing the amount of diagnostic information collected and thus the network overhead.
Correspondingly, the embodiment of the invention provides a sensing network anomaly detection device based on feature selection. As shown in fig. 3, the apparatus includes: the device comprises a sorting module, a selecting module and a verifying module. Wherein:
a sorting module for sorting the collected characteristic attributes according to a criterion based on the correlation coefficient;
the selection module is used for selecting the representative characteristic attribute set according to the characteristic selection calculation result based on the correlation coefficient;
and the verification module is used for proving the reliability of the selected representative characteristic attribute set through cross verification.
The sorting module further comprises: the device comprises a preprocessing submodule, a first operation submodule and a sequencing submodule. Wherein:
the preprocessing submodule is used for preprocessing the attribute values of the collected characteristic attributes by removing negative values;
the first operation submodule is connected with the preprocessing submodule and used for calculating correlation coefficients among different characteristic attributes by using the attribute values after preprocessing;
and the sorting submodule is connected with the first operation submodule and is used for sorting the characteristic attributes according to the correlation coefficient and a preset sorting criterion.
The selecting module further comprises: a second operation submodule and a feature selection submodule. Wherein:
the second operation submodule is connected with the sorting submodule and is used for taking the first k sorted feature attributes as an initial feature attribute set and calculating the correlation coefficients between the initial feature attribute set and the remaining feature attributes and the correlation coefficients among the k feature attributes;
and the feature selection submodule is connected with the second operation submodule and the verification module and is used for calculating a correlation coefficient between each feature attribute in the feature attribute set and the feature attribute set, and deleting or adding each feature attribute according to the result.
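The first operation submodule and the second operation submodule calculate the correlation coefficient between different feature attributes as
$$\rho(f_i, f_j) = \frac{\operatorname{cov}(f_i, f_j)}{\sqrt{\operatorname{var}(f_i)\,\operatorname{var}(f_j)}}$$
where $f_i$ and $f_j$ denote two feature attributes, cov denotes covariance, and var denotes variance, so that $\sqrt{\operatorname{var}(\cdot)}$ is the standard deviation.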
The feature selection submodule calculates the correlation coefficient between each feature attribute in the feature attribute set and the feature attribute set according to a calculation formula:
wherein the formula involves the average correlation coefficient between a feature attribute and the feature attribute set, and the average correlation coefficient between different feature attributes; F denotes the currently selected feature attribute set.
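The way the three modules connect can be sketched as a small pipeline. This is only an architectural illustration: the class name, the callable signatures, and the stand-in stages in the example are hypothetical and do not implement the actual sorting, selection, or verification logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Data = Dict[str, List[float]]  # feature attribute name -> observed values


@dataclass
class FeatureSelectionPipeline:
    """Chains the sorting, selection, and verification modules."""
    sort: Callable[[Data], List[str]]
    select: Callable[[List[str], Data], List[str]]
    verify: Callable[[List[str], Data], bool]

    def run(self, data: Data) -> List[str]:
        ranked = self.sort(data)             # sorting module
        chosen = self.select(ranked, data)   # selection module
        if not self.verify(chosen, data):    # verification module
            raise ValueError("selected feature attribute set failed verification")
        return chosen


# Stand-in stages, used only to show how the modules are wired together.
pipeline = FeatureSelectionPipeline(
    sort=lambda data: sorted(data),
    select=lambda ranked, data: ranked[:2],
    verify=lambda chosen, data: len(chosen) > 0,
)
representative = pipeline.run({"b": [1.0], "a": [2.0], "c": [3.0]})
```

Keeping each stage behind a plain callable mirrors the module boundaries of the device and lets any stage be replaced independently.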
By adopting the technical scheme of the invention, data anomaly detection in the sensor network is achieved efficiently; a feature attribute subset selection mechanism is provided, which ensures the correctness of anomaly detection; anomaly detection based on feature selection is simple, efficient, and easy to implement; and the scheme can also be integrated with other sensor network diagnostic tools as a fusible component.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.