CN112004204B

CN112004204B - High-dimensional data anomaly detection method based on layered processing in industrial Internet of things

Info

Publication number: CN112004204B
Application number: CN202010805928.2A
Authority: CN
Inventors: 韩光洁; 屠隽弢; 刘立
Original assignee: Changzhou Campus of Hohai University
Current assignee: Changzhou Campus of Hohai University
Priority date: 2020-08-12
Filing date: 2020-08-12
Publication date: 2022-09-23
Anticipated expiration: 2040-08-12
Also published as: CN112004204A

Abstract

The invention discloses a high-dimensional data anomaly detection method based on layered processing in the industrial Internet of Things. First, in the data preprocessing stage, trust selection is performed on the data collected by monitoring equipment: a trust verification model is constructed from mutual information according to spatio-temporal correlation, eliminating disturbance caused by industrial noise and machine aging. Second, a single-source anomaly detection algorithm is executed to obtain a local anomaly detection result. The method takes into account the differences among heterogeneous devices in type, start time and transmission time, and makes full use of the characteristics of time-series data in the industrial environment. Timestamped data is received and transmitted through a data buffer queue model, which reduces data transmission overhead and processing delay between networks of different granularities. Finally, a multi-source anomaly detection algorithm runs on the edge nodes and analyzes the data situation to obtain a global data anomaly detection result. The method meets the low-load, low-latency requirements of Internet of Things devices in the context of industrial big data, and improves the accuracy and reliability of data anomaly detection.

Description

High-dimensional data anomaly detection method based on hierarchical processing in industrial Internet of things

Technical Field

The invention belongs to the technical field of industrial Internet of things safety protection, and particularly relates to a method for identifying and protecting industrial sensitive data in an industrial Internet of things sensing layer and a network layer.

Background

The fourth industrial revolution, accelerated by the Industrial Internet of Things (IIoT), has triggered a global boom. With the continuous development of the traditional sensor network architecture, IIoT has made great progress. Edge computing combines core functions such as networking, computing, storage and applications into an open platform that eliminates redundant data, extracts key information and relieves transmission pressure. By tracking changes in edge objects through an edge intelligent sensing network, IIoT gains the capabilities of sensing, computing, decision making and transmission, so intelligent edge integration has broad application prospects in IIoT. IIoT supported by cognitive technology realizes semantic representation, sensing data association and AI modeling in the network plane, and improves the comprehension capability of the network. However, traditional cognitive technology lacks a theoretical basis for high-level decisions and cannot recommend an optimal operation scheme based on IIoT data. With the rapid development of information science, machine learning based on intelligent computing methods plays an important role in edge-intelligent IIoT.

The industrial Internet of things network can be mainly divided into a sensing layer, a network layer and an application layer. A large number of heterogeneous IIoT nodes are deployed in a sensing layer and are responsible for collecting data of peripheral equipment, and mainly complete sensing and specific lightweight computing tasks. The edge server in the network layer integrates machine learning, processes data information and provides reliable judgment for high-level decision making. Machine learning mainly includes several processes: sensing, understanding, learning, judging, and reasoning. In order to obtain reliable decision making effect, the machine learning algorithm based on data driving puts high demands on data quality.

Practical industrial Internet of Things deployments face several limitations. First, data quality plays a crucial role in event detection, yet background noise and the irreversible aging of equipment in an industrial environment cause monitored data to drift. In addition to severe resource constraints and environmental problems, IIoT devices are vulnerable to external attackers, which makes the collected industrial data unreliable and unable to meet detection requirements. Most traditional anomaly detection methods consider only ideal conditions and neglect the necessary screening of input data in the preprocessing stage, which causes obvious detection errors and can even invert the results. Second, industrial production places very high demands on timeliness. Perception-layer equipment often cannot execute complex calculations, so data is processed offline in a centralized manner by an edge server, which leads to high latency and high resource occupancy and leaves emergencies without timely and effective early warning. Third, the wide deployment of heterogeneous equipment produces massive high-dimensional data; because sensor nodes differ in type, start time and transmission period, the actual data flow is not a time-continuous matrix, which wastes channel resources.

Existing data detection methods generally cannot reflect the overall anomaly situation: data is simply marked as 'normal' or 'abnormal', with no analysis or prediction of the abnormal results. Data anomalies can be caused by equipment faults, operating state changes, external intrusion, emergencies and transmission disturbance; the information value of these different anomaly results differs and directly influences the decisions made by the system. When key nodes in a sensitive area are damaged, false information may be transmitted in place of important data and the control center loses the support of the related data; the whole industrial production system may stall, or irreversible serious accidents and disasters may occur, causing huge economic and property losses.

Therefore, a more complete high-dimensional industrial data anomaly detection method is needed, one that makes maximum use of the computing capability of the IIoT equipment itself, improves detection reliability and accuracy, reduces processing delay and communication cost, and provides strong data support for system decisions.

Disclosure of Invention

To address the above problems, the invention provides a high-dimensional data anomaly detection method based on layered processing in the industrial Internet of Things. Under actual industrial environment conditions, a trusted data verification model based on a multivariate Gaussian distribution is constructed, comprehensively considering the differences among heterogeneous devices in spatio-temporal distribution, noise interference, drift caused by equipment aging, and data falsification by unknown attackers. A time window T and a data buffer queue are added to remove the limitation imposed by the uneven dynamic distribution of high-dimensional industrial data. Local and global anomaly information are obtained by single-source and multi-source anomaly detection algorithms respectively, achieving low-latency, high-precision detection and ultimately providing strong data support for system decisions.

To achieve the above technical purpose and technical effect, the invention is realized through the following technical scheme:

A high-dimensional data anomaly detection method based on hierarchical processing in the industrial Internet of Things comprises the following steps:

(1) building trusted data verification model

Dividing the whole industrial production area into a plurality of sub-areas according to different work tasks; each subarea is provided with a plurality of heterogeneous devices for monitoring the running condition of the machine, each heterogeneous device is provided with a plurality of sensor nodes, and sensing data D of each type are collected to provide decision basis for the control system; by dividing a time window T, the equipment exchanges data information with a neighbor node in a communication range of the equipment in a wireless transmission mode; constructing a trusted data verification model, calculating the credibility of data, reducing noise and disturbance caused by instrument aging by using a real state updating mechanism, and obtaining data with high monitoring quality;

(2) local data anomaly detection

The heterogeneous equipment of the perception layer has light-weight computing capacity and performs local anomaly detection on data which passes trust verification; setting corresponding normal intervals by combining historical data samples according to the difference of working environments in different areas, and analyzing an abnormal detection result by adopting a fuzzy theory to obtain the data abnormal degree distribution condition of a single source; the time window T is added into the data buffer area queue, so that the problem of uneven dynamic distribution of high-dimensional industrial data is solved, the communication overhead in the data transmission process is reduced, and the detection efficiency is improved;

(3) global data anomaly detection

The edge server can obtain all monitoring data information in the area where the edge server is located, detect the local abnormal labeling data in the time window T again, execute a multi-source abnormal detection algorithm, and avoid result deviation caused by single data detection; and distinguishing isolated anomalies and aggregation anomalies, analyzing the variation trend of the anomalous data through the functional relation in the time domain, predicting the causes of the anomalies, retaining data containing sensitive information, effectively evaluating the global data anomaly result and finally transmitting the global data anomaly result to a control center.

In step (1), the area is divided into a plurality of sub-regions according to different tasks, and the whole region is represented in set form as $Z = \{Z_1, Z_2, \ldots, Z_e\}$, where $e$ denotes the specific number assigned to each sub-region; the sensing devices within sub-region $Z_e$ are each assigned a unique ID and a pair of encryption keys to maintain security during data collection.

In step (1), the expression of the various types of sensing data collected by the heterogeneous equipment is

where $n \le W_m$, $D_j$ represents the data matrix collected by device $j$, $W_m$ represents the number of all data types, and $X_n$ represents the data-flow vector of a given attribute type, expressed as

where $x_t$ represents the data point collected at time $t$.
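As an illustration of the data layout described above, the following minimal sketch aligns the per-attribute streams of one device to the slots of a time window. It is not the expression used in the patent; the function name `build_data_matrix`, the attribute names and the use of `None` for slots with no sample are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def build_data_matrix(
    streams: Dict[str, List[Tuple[int, float]]], window: int
) -> Dict[str, List[Optional[float]]]:
    """Align each attribute's (t, value) samples to slots 1..window.

    Slots with no sample stay None, which makes visible that the raw
    flow of a heterogeneous device is not a time-continuous matrix.
    """
    matrix: Dict[str, List[Optional[float]]] = {}
    for attribute, samples in streams.items():
        row: List[Optional[float]] = [None] * window
        for t, value in samples:
            if 1 <= t <= window:  # ignore samples outside the window
                row[t - 1] = value
        matrix[attribute] = row
    return matrix


# Hypothetical device j with two attribute streams X_1 and X_2.
streams = {
    "temperature": [(1, 71.2), (2, 71.5), (3, 71.4), (4, 72.0)],
    "vibration": [(2, 0.31), (4, 0.35)],  # later start, longer period
}
D_j = build_data_matrix(streams, window=4)
print(D_j["vibration"])
```

Each row corresponds to one data-flow vector, and the number of rows cannot exceed the number of data types.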

In step (1), the credibility of the data collected by the heterogeneous equipment in different states is obtained by constructing a trusted data verification model. When the credibility of the data falls below a tolerable confidence interval, the data point is judged untrustworthy and discarded. The credibility $S_k$ is calculated by the formula

where $S_k$ represents the trust value of sensing device $k$, the other symbols in the formula denote the actual observed data, the estimated true value of the measurement, and the mean of the data, and $d$ represents the distance function.

The distance function $d$ is calculated by the following formula:

where $\mathrm{std}_m$ represents the standard deviation of the actual observations.
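The trust check can be sketched as follows. The trust and distance formulas are not reproduced in this text, so the sketch substitutes assumed forms: the distance is the absolute deviation of an observation from the estimated true value, scaled by the standard deviation of the observations, and the trust value is `exp(-d)`. The function names and the threshold of 0.1 are likewise hypothetical.

```python
import math
from statistics import pstdev
from typing import List


def distance(observed: float, estimated: float, std: float) -> float:
    """Assumed distance: deviation from the estimate in units of std."""
    if std == 0:
        return 0.0 if observed == estimated else math.inf
    return abs(observed - estimated) / std


def trust_score(observed: float, estimated: float, std: float) -> float:
    """Assumed trust value in (0, 1]: 1 at the estimate, decaying with distance."""
    return math.exp(-distance(observed, estimated, std))


def filter_trusted(
    readings: List[float], estimated: float, threshold: float = 0.1
) -> List[float]:
    """Discard readings whose trust value falls below the threshold."""
    std = pstdev(readings)
    return [x for x in readings if trust_score(x, estimated, std) >= threshold]


readings = [20.1, 19.9, 20.0, 20.2, 35.0]
kept = filter_trusted(readings, estimated=20.0)
print(kept)
```

With these assumed forms, the reading of 35.0 lies about 2.5 standard deviations from the estimate, scores below the threshold and is dropped, while the four readings near 20 are kept.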

In the step (1), a real state updating mechanism is used to reduce noise and interference caused by instrument aging, and the real state value updating process is as follows:

The expression of the data normal interval Tr in step (2) is

where the two endpoints represent, respectively, the lower bound and the upper bound of the class-i data collected by device j. The anomaly degree is calculated as follows:

where $1 \le t \le T$ and $T$ represents the length of the time window. Because industrial environments differ, dividing the anomaly degree into only the two values 1 and 0 cannot reflect the actual situation, so the anomaly degree formula is rewritten as:

In step (2), the local data within the time window T is uploaded to the edge server in queue order by means of the data buffer queue, where the processing delay ω of the data stored in the queue must satisfy the following condition:

where $\mathrm{Cnt}(Q_j)$ represents the maximum queue length of device $j$ and $p$ represents the processing delay of a single data item.

In step (3), the multi-source anomaly detection algorithm executed on the edge server adds a spatial dimension to the single-source anomaly detection algorithm. Its discriminant is:

The condition must satisfy the following:

where $H(i, j)$ represents a highly correlated data set, $\Psi$ represents the effective detection coefficient, $m$ and $n$ represent the IDs of the related heterogeneous devices, and $t$ represents the time at which the data was acquired.

In step (3), isolated anomalies and aggregate anomalies are distinguished by the timestamps of the data. An isolated anomaly is one whose neighboring data points are normal; an aggregate anomaly is one in which abnormal values appear continuously. The cause of the anomaly is predicted from the partial derivatives of the abnormal data curve before and after the inflection point and from the position of the inflection point, and sensitive data information is retained.
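The distinction between isolated and aggregate anomalies can be illustrated by grouping consecutive abnormal timestamps into runs. This is a minimal sketch, not the patent's procedure; the function name `classify_anomalies` and the rule that a run of at least `min_run` consecutive abnormal points counts as aggregate are assumptions.

```python
from typing import List, Tuple


def classify_anomalies(
    flags: List[bool], min_run: int = 2
) -> List[Tuple[int, int, str]]:
    """Group consecutive abnormal samples (ordered by timestamp) into runs.

    Returns (start_index, end_index, kind) for each run, where kind is
    "isolated" for a run shorter than min_run and "aggregate" otherwise.
    """
    runs: List[Tuple[int, int, str]] = []
    start = None
    # A trailing False closes any run that reaches the end of the window.
    for i, flag in enumerate(flags + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            kind = "aggregate" if i - start >= min_run else "isolated"
            runs.append((start, i - 1, kind))
            start = None
    return runs


flags = [False, True, False, False, True, True, True, False]
runs = classify_anomalies(flags)
print(runs)
```

In this example the abnormal point at index 1 has normal neighbors and is labeled isolated, while indices 4 to 6 form one aggregate anomaly.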

The invention has the beneficial effects that:

Trust selection is performed on the data in the preprocessing stage: a trust verification model based on a multivariate Gaussian distribution is applied according to spatio-temporal correlation to eliminate disturbance caused by industrial noise and machine aging, which guarantees high-quality industrial data acquisition. Perception-layer equipment executes a lightweight single-source anomaly detection algorithm to obtain local anomaly results. A data buffer queue model is added for receiving and sending time-series data, which avoids the problem of uneven dynamic distribution of industrial data. The multi-source anomaly detection algorithm running on the edge nodes infers the data situation and finally yields the global data anomaly detection result. The method effectively reduces processing delay and computing load, provides reliable data support for system decisions, and is of great significance for protecting the ecological environment and security of the whole IIoT.

Drawings

FIG. 1 is a diagram of a network model according to an embodiment of the present invention;

FIG. 2 is a data buffer queue model according to an embodiment of the present invention;

FIG. 3 is a diagram of a multi-source data anomaly analysis according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.

The invention aims to guarantee the quality of input monitoring data under actual industrial conditions, to avoid disturbance caused by noise interference and equipment aging, and to handle the uneven distribution of dynamic industrial data and the differences among high-dimensional data. It also aims to improve the efficiency and precision of online learning, to predict the cause of anomalies accurately, and to satisfy as far as possible IIoT constraints such as low latency, low energy consumption, low resource occupancy and wide-range reliable transmission. To this end, the invention provides a data anomaly detection method based on hierarchical processing in the industrial Internet of Things, comprising the following steps:

Step one, building a trusted data verification model

As shown in FIG. 1, the whole industrial production area is divided into a plurality of sub-areas according to different work tasks. Each sub-area is provided with a plurality of heterogeneous devices for monitoring the running condition of the machines; each heterogeneous device carries various sensors and collects various types of sensing data D to provide an important decision basis for the control system. Within each time window T, a device exchanges data with the neighbor nodes in its communication range by wireless transmission. A trusted data verification model is constructed to calculate the credibility of the data, and a real state updating mechanism reduces the disturbance caused by noise and instrument aging, so that data of high monitoring quality is obtained.

To process industrial data collected over a wide range effectively, reduce processing delay and computational redundancy, and improve detection efficiency, IIoT devices in the same area can exchange data. Each sub-area is assigned its own edge server. The edge servers are active equipment deployed in the IIoT network layer; they process the industrial data uploaded by perception-layer equipment, can communicate through ISDN gateways, and feed global anomaly results back to the control center.

The whole industrial area is represented in set form as $Z = \{Z_1, Z_2, \ldots, Z_e\}$, where $e$ denotes the specific number assigned to each sub-area; the sensing devices within sub-area $Z_e$ are each assigned a unique ID and a pair of encryption keys to maintain security during data collection.

The expression of the various types of sensing data collected by the heterogeneous equipment is

where $n \le W_m$, $D_j$ represents the data matrix collected by device $j$, $W_m$ represents the number of all data types, and $X_n$ represents the data-flow vector of a given attribute type, expressed as

where $x_t$ represents the data point collected at time $t$. Perception-layer devices communicate wirelessly with the nodes within their single-hop range and transmit to each other the data acquired within the time window T.

An external attacker can destroy the authenticity of the original data and hijack IIoT devices to send false information. Confidence calculations are therefore required in the preprocessing stage to keep low-quality data out of the input. The credibility of the data collected by the heterogeneous equipment in different states is obtained by constructing a trusted data verification model; when the credibility of the data falls below a tolerable confidence interval, the data point is judged untrustworthy and discarded. The credibility $S_k$ is calculated as follows:

where $S_k$ represents the trust value of sensing device $k$, the other symbols in the formula denote the actual observed data, the estimated true value of the measurement, and the mean of the data, and $d$ represents the distance function.

The distance function $d$ is calculated as follows:

where $\mathrm{std}_m$ represents the standard deviation of the actual observations.

A real state updating mechanism is used to reduce noise and the interference caused by instrument aging. The real state value is updated as follows:

Step two, local data anomaly detection

The equipment in the perception layer has lightweight computing capability and performs local anomaly detection on data that has passed trust verification. Given the limitations of the equipment itself, normal intervals are set according to the differences in working environment between areas and in combination with historical data samples, and the anomaly detection results are analyzed with fuzzy theory to obtain the distribution of the anomaly degree of single-source data.
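The difference between a binary label and a graded anomaly degree can be sketched as below. The patent's own anomaly degree formula is not reproduced in this text; the linear membership function used here, the function names and the interval values are assumptions chosen only to illustrate the idea of a fuzzy degree relative to a normal interval.

```python
def binary_degree(x: float, lower: float, upper: float) -> int:
    """Crisp label: 0 inside the normal interval, 1 outside."""
    return 0 if lower <= x <= upper else 1


def fuzzy_degree(x: float, lower: float, upper: float) -> float:
    """Assumed graded degree in [0, 1].

    0 inside the interval; outside, the overshoot divided by the
    interval width, capped at 1.
    """
    width = upper - lower
    if width <= 0:
        raise ValueError("upper bound must exceed lower bound")
    if lower <= x <= upper:
        return 0.0
    overshoot = lower - x if x < lower else x - upper
    return min(1.0, overshoot / width)


# Hypothetical normal interval for one data class of one device.
lower, upper = 60.0, 80.0
for x in (70.0, 82.0, 110.0):
    print(x, binary_degree(x, lower, upper), fuzzy_degree(x, lower, upper))
```

A reading of 82 and a reading of 110 both receive the binary label 1, but their graded degrees differ, which is the information a two-valued label discards.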

As shown in FIG. 2, the heterogeneous devices differ in start time, monitoring period and transmission time slot, so the original transmission mode cannot meet the latency requirement. Using the time window T together with a new data buffer queue solves the problem of uneven dynamic distribution of high-dimensional industrial data, reduces the communication overhead of data transmission and improves detection efficiency.
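A windowed buffer queue of this kind might look like the following sketch. The class name `BufferQueue`, its parameters and the rule that a batch is released when a sample arrives at or after the end of the current window are assumptions. The `delay_ok` check is only an assumed stand-in for the delay condition, which is not reproduced in this text: it requires that processing every queued item at `per_item_delay` each fits within one window.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp, value)


class BufferQueue:
    def __init__(self, window: float, max_len: int, per_item_delay: float):
        self.window = window
        self.per_item_delay = per_item_delay
        # deque(maxlen=...) silently drops the oldest item when full.
        self.items: Deque[Sample] = deque(maxlen=max_len)
        self.window_start: Optional[float] = None

    def push(self, timestamp: float, value: float) -> Optional[List[Sample]]:
        """Queue a sample; return the finished batch if a window just closed."""
        batch = None
        if self.window_start is None:
            self.window_start = timestamp
        elif timestamp - self.window_start >= self.window:
            batch = self.flush()
            self.window_start = timestamp
        self.items.append((timestamp, value))
        return batch

    def flush(self) -> List[Sample]:
        """Hand over everything queued so far, in arrival order."""
        batch = list(self.items)
        self.items.clear()
        return batch

    def delay_ok(self) -> bool:
        """Assumed check: total processing time fits within one window."""
        return len(self.items) * self.per_item_delay <= self.window


queue = BufferQueue(window=10, max_len=100, per_item_delay=0.5)
results = [queue.push(t, v) for t, v in [(0, 1.0), (3, 1.1), (7, 1.2), (10, 1.3)]]
print(results[-1])
```

Samples that arrive at irregular times are uploaded as one ordered batch per window, so the edge server receives a regular sequence of batches regardless of each device's start time or period.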

The expression of the data normal interval Tr is

where the lower endpoint represents the lower bound of the class-i data collected by device j.

The data buffer queue uploads the local data within the time window T to the edge server in queue order, where the processing delay ω of the data stored in the queue must satisfy the following condition:

Step three, global data anomaly detection

The edge server can obtain all monitoring data for its region. It first re-examines the data labeled locally as abnormal within the time window T and executes a multi-source anomaly detection algorithm, which avoids the result deviation caused by detection on a single data source. It then distinguishes isolated anomalies from aggregate anomalies, analyzes the trend of the abnormal data through its functional relation in the time domain, predicts the cause of the anomaly, retains data containing sensitive information, evaluates the global data anomaly result and finally transmits that result to the control center.

The multi-source anomaly detection algorithm executed by the edge server adds a spatial dimension to the single-source anomaly detection algorithm. Its discriminant is as follows:

The condition must satisfy the following:

where $H(i, j)$ represents a highly correlated data set, $\Psi$ represents the effective detection coefficient, $m$ and $n$ represent the IDs of the related heterogeneous devices, and $t$ represents the time at which the data was acquired.
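One way to picture the added spatial dimension is a cross-check against correlated devices at the same time index. The discriminant itself is not reproduced in this text, so the rule below is an assumption: a local anomaly on one device is confirmed only if the fraction of its highly correlated devices that are also abnormal at time t reaches a coefficient `psi`. The function name, the device IDs and the value 0.5 are hypothetical.

```python
from typing import Dict, List


def confirm_anomaly(
    device: str,
    t: int,
    local_flags: Dict[str, List[bool]],
    correlated: Dict[str, List[str]],
    psi: float = 0.5,
) -> bool:
    """Re-check a local anomaly using correlated devices at the same time t."""
    if not local_flags[device][t]:
        return False  # nothing was flagged locally
    neighbours = correlated.get(device, [])
    if not neighbours:
        return True  # no spatial evidence available, keep the local result
    agreeing = sum(1 for n in neighbours if local_flags[n][t])
    return agreeing / len(neighbours) >= psi


# Local single-source results for three devices over three time steps.
local_flags = {
    "m": [False, True, True],
    "n": [False, False, True],
    "k": [False, False, True],
}
correlated = {"m": ["n", "k"]}  # hypothetical highly correlated set for m

print(confirm_anomaly("m", 1, local_flags, correlated))
print(confirm_anomaly("m", 2, local_flags, correlated))
```

At time 1 only device m is abnormal, so the flag is not confirmed; at time 2 both correlated devices agree and the anomaly is confirmed.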

As shown in FIG. 3, the anomalies are aggregate anomalies, also called continuous anomalies. An emergency event spreads and persists, causing abnormal changes in the data over a certain period; the data on either side of an inflection point show different trends, and after the event ends, data still detected as anomalous converges rapidly to the normal region.

Isolated anomalies and aggregate anomalies are distinguished by the timestamps of the data. An isolated anomaly is one whose neighboring data points are normal; an aggregate anomaly is one in which abnormal values appear continuously. The cause of the anomaly is predicted from the partial derivatives of the abnormal data curve before and after the inflection point and from the position of the inflection point, and sensitive data information is retained.
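The trend analysis can be approximated on sampled data with finite differences. In this sketch the inflection point is interpreted as the sample where the slope changes sign, so that the data on either side show different trends; that interpretation, the function names and the sample values are assumptions, not the patent's formula.

```python
from typing import List


def slopes(values: List[float]) -> List[float]:
    """First differences, a discrete stand-in for the derivative."""
    return [b - a for a, b in zip(values, values[1:])]


def turning_points(values: List[float]) -> List[int]:
    """Indices where the slope changes sign between neighboring steps."""
    d = slopes(values)
    return [i for i in range(1, len(d)) if d[i - 1] * d[i] < 0]


# Hypothetical aggregate anomaly: values rise during an event, then fall back.
values = [70.0, 74.0, 81.0, 90.0, 86.0, 79.0, 72.0]
points = turning_points(values)
print(slopes(values))
print(points)
```

Here the slope is positive up to index 3 and negative afterwards, so index 3 marks the point where the abnormal data stops diverging and starts converging toward the normal region.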

In summary, the following steps:

The invention discloses a high-dimensional data anomaly detection method based on layered processing in the industrial Internet of Things. Under actual industrial environment conditions, sub-regions are divided according to different production tasks, and trusted nodes are selected by constructing a trusted data verification model based on a multivariate Gaussian distribution, which guarantees the quality of the input data. The method comprehensively considers the differences among heterogeneous devices in spatio-temporal distribution, noise interference, disturbance caused by equipment aging, data tampering by unknown attackers and the hijacking of IIoT devices. A time window T and a data buffer queue are added to reduce the communication overhead of data transmission and to remove the limitation imposed by the uneven dynamic distribution of high-dimensional industrial data. Using fuzzy evidence theory, anomaly detection algorithms are executed in the sensing layer and the network layer respectively to obtain local and global anomaly information, achieving low-latency, high-precision detection. Finally, strong data support is provided for system decisions, ensuring the normal operation of the whole industrial production line.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A high-dimensional data anomaly detection method based on hierarchical processing in an industrial Internet of things is characterized by comprising the following steps:

(1) building trusted data verification model

Dividing the whole industrial production area into a plurality of sub-areas according to different work tasks; each subarea is provided with a plurality of heterogeneous devices for monitoring the running condition of the machine, each heterogeneous device is provided with a plurality of sensor nodes, and sensing data D of each type are collected to provide decision basis for the control system; by dividing a time window T, the heterogeneous equipment exchanges data information with a neighbor node in a communication range of the heterogeneous equipment in a wireless transmission mode; constructing a trusted data verification model, calculating the credibility of data, and reducing noise and disturbance caused by instrument aging by using a real state updating mechanism to obtain data with high monitoring quality;

(2) local data anomaly detection

The heterogeneous equipment of the perception layer has light computing power and performs local anomaly detection on data which passes trust verification; setting corresponding normal data intervals by combining historical data samples according to the difference of working environments in different regions, and analyzing an abnormal detection result by adopting a fuzzy theory to obtain the distribution condition of the abnormal degree of the data of a single source; the time window T is added into the data buffer queue, so that the problem of uneven dynamic distribution of high-dimensional industrial data is solved, the communication overhead in the data transmission process is reduced, and the detection efficiency is improved;

(3) global data anomaly detection

The edge server can obtain all monitoring data information in the area where the edge server is located, detect the local abnormal labeling data in the time window T again, execute a multi-source abnormal detection algorithm, and avoid result deviation caused by single data detection; distinguishing isolated anomalies and aggregate anomalies, analyzing the variation trend of the abnormal data through the functional relation in the time domain, predicting the reasons of the anomalies, reserving the data containing sensitive information, effectively evaluating the global data anomaly result and finally transmitting the global data anomaly result to a control center;

in step (1), the area is divided into a plurality of sub-regions according to different tasks, and the whole region is represented in set form as $Z = \{Z_1, Z_2, \ldots, Z_e\}$, where $e$ denotes the specific number assigned to each sub-region; the sensing devices in sub-region $Z_e$ are each given a unique ID and a pair of encryption keys to maintain security during data collection;

in step (1), the expression of the various types of sensing data collected by the heterogeneous equipment is

where $D_j$ represents the data matrix collected by device $j$, $W_m$ represents the number of all data types, and $X_n$ represents the data-flow vector of a given attribute type, expressed as

where $x_t$ represents the data point collected at time $t$;

where $S_k$ represents the trust value of sensing device $k$, the other symbols in the formula denote the actual observed data, the estimated true value of the measurement, and the mean of the data, and $d$ represents the distance function;

the calculation formula of the distance function $d$ is

where $\mathrm{std}_m$ represents the standard deviation of the actual observed values;

the expression of the data normal interval Tr in step (2) is

where the two endpoints represent, respectively, the lower bound and the upper bound of the class-i data collected by device j; the calculation formula of the anomaly degree is

where $1 \le t \le T$ and $T$ represents the length of the time window; because industrial environment differences affect the calculation of the anomaly degree, dividing the data anomaly result into only 1 or 0 cannot reflect the actual situation, so the anomaly degree calculation formula is rewritten as

in step (2), the local data within the time window T is uploaded to the edge server in queue order by means of a data buffer queue, where the processing delay ω of the data stored in the queue must satisfy the following condition:

where $\mathrm{Cnt}(Q_j)$ represents the maximum queue length of device $j$, and $p$ represents the processing delay of a single data item;

the multi-source anomaly detection algorithm executed on the edge server in step (3) adds a spatial dimension to the single-source anomaly detection algorithm, and its discriminant is

the condition must satisfy the following

where $H(i, j)$ represents a highly correlated data set, $\Psi$ represents the effective detection coefficient, $m$ and $n$ represent the IDs of the related heterogeneous devices, and $t$ represents the time at which the data was acquired;

in step (3), isolated anomalies and aggregate anomalies are distinguished by the timestamps of the data; an isolated anomaly is one whose neighboring data points are normal, and an aggregate anomaly is one in which abnormal values appear continuously; the cause of the anomaly is predicted from the partial derivatives of the abnormal data curve and the position of the inflection point, and the data containing sensitive information is retained.