CN112949735A

CN112949735A - Liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining

Info

Publication number: CN112949735A
Application number: CN202110273839.2A
Authority: CN
Inventors: 薛善良; 彭振峰; 韦青燕; 肖雪
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-03-15
Filing date: 2021-03-15
Publication date: 2021-06-11

Abstract

A method for discovering abnormal volatilization concentrations of liquid hazardous chemicals based on outlier data mining. First, a de-one partition information entropy is introduced to determine the weights of the outlier attributes. A density-based clustering algorithm is then used to screen the original data set collected by the sensors and obtain a preliminary outlier data set, which improves the operating efficiency of the algorithm. Next, the reachable distance in the local outlier factor algorithm is replaced by the P weight. Finally, the newly defined P-weight-based local outlier factor LOFBP is used to calculate the degree of outlierness of the objects in the preliminary outlier data set. By processing a large amount of gas concentration sensor data with data mining techniques, the invention improves the reliability of the data from a single gas concentration sensor and allows the data from multiple gas concentration sensor arrays to be treated as a whole when estimating the gas concentration in a space, thereby helping enterprises that produce and process hazardous chemicals to better identify production safety risks and prevent production accidents.

Description

Liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining

Technical Field

The invention relates to an outlier data mining method, and in particular to a method for discovering abnormal volatilization concentrations of liquid hazardous chemicals based on an improved local outlier factor algorithm.

Background

The storage and transportation of liquid hazardous chemicals have always borne on the safety of people's lives and property in China. Different liquid hazardous chemicals have different properties, but volatility is common to most of them: gasoline, LNG, liquid ammonia, alcohols and benzene are all common volatile hazardous chemicals. In practice, petrochemical enterprises must frequently monitor a number of safety indicators during storage and transportation to ensure that no accidental leakage occurs during production. Monitoring the gas volatilized by liquid hazardous chemicals is an important basis for detecting accidental leakage. In open working environments, a single deployed gas sensor is limited and often produces errors or even false alarms. Enterprises therefore also deploy large numbers of sensors as an array; the larger number of sensors reduces false alarms and improves accuracy, but the large volume of raw sensor data must be preprocessed before use. Efficiently preprocessing these data, accurately identifying individual sensors with accidental errors, and mining outlying sensor data are therefore of considerable significance and practical value.

Traditional outlier data mining algorithms are poorly suited to monitoring data from liquid hazardous chemical gas concentration sensors and take too long to run. The data collected by a gas concentration sensor depend on actual conditions and show strong randomness and unpredictability, and there may be multiple causes behind outlying data. Safety problems often impose strict timeliness requirements, so the outlier data mining algorithm must execute quickly and locate outlying data points accurately in order to provide timely and accurate data for subsequent analysis. The invention applies data mining to outlier detection in gas concentration data and provides a method for discovering abnormal volatilization concentrations of liquid hazardous chemicals based on an improved local outlier factor algorithm, which helps subsequent algorithms carry out more targeted batch processing of large amounts of sensor data.

Disclosure of Invention

The invention aims to solve the problems that existing outlier data mining algorithms are poorly suited to, and take too long to run on, the large batches of monitoring data collected by gas concentration sensor arrays, and provides a liquid hazardous chemical substance volatilization concentration abnormity discovery method based on outlier data mining.

The technical scheme of the invention is as follows:

The liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining performs outlier mining on the monitoring data of liquid hazardous chemical gas concentration sensors and improves the efficiency of processing such monitoring data. The method is characterized in that, first, a de-one partition information entropy is introduced to determine the weights of the outlier attributes; then, the original data set acquired by the gas concentration sensors is screened with the OPTICS clustering algorithm to obtain a preliminary outlier data set, which improves the operating efficiency of the algorithm; the reachable distance in the LOF algorithm is replaced by the P weight; and finally, the degree of outlierness of the objects in the preliminary outlier data set is calculated with a newly defined P-weight-based local outlier factor (LOFBP), which improves execution efficiency while maintaining the detection accuracy of the algorithm.

To address the excessive running time of existing outlier data mining algorithms when detecting outlying sensor data, the LOFBP algorithm uses OPTICS as a preprocessing step and redefines the outlier factor of the local outlier factor algorithm. To reduce the time complexity of outlier mining, the data set is reduced without affecting the final analysis result, which improves mining efficiency. To remedy the shortcomings of the local outlier factor algorithm, the LOFBP algorithm introduces the de-one partition information entropy into the distance measure, replaces the reachable distance of the traditional LOF algorithm with the P weight, and redefines the local outlier factor, which can greatly improve detection accuracy.

The method specifically comprises the following steps:

Step 1: reading an original data set S acquired by the gas sensors;

Step 2: calculating the de-one partition information entropy increment Δ(N_i) for all attributes in the data set;

a) To improve the quality of outlier detection, the distance between data objects in the OPTICS algorithm is measured as a weighted distance, with the attribute weights determined by the de-one partition information entropy increment. Information entropy measures how much information a system contains, so the value E(x) measures the uncertainty of a data set. It is defined as:

E(x)＝-[p(x₁)·log p(x₁)]-[p(x₂)·log p(x₂)]…-[p(x_n)·log p(x_n)] (1)

in formula (1), x is a random variable whose set of possible values is S(x) = {x₁, x₂, …, x_n};

p(x_i) represents the probability that x takes the value x_i.
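
As a minimal sketch of formula (1), the entropy of one attribute could be computed as follows. Two points are assumptions not fixed by the text: the probabilities p(x_i) are estimated from the relative frequencies of the observed values, and the logarithm is taken to base 2. The function name `information_entropy` is illustrative only.

```python
import math
from collections import Counter


def information_entropy(values):
    """Shannon entropy E(x) of a sequence of discrete values, per formula (1).

    Assumptions: p(x_i) is the relative frequency of x_i in `values`,
    and the logarithm is base 2.
    """
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Two equally likely readings carry one bit of uncertainty.
print(information_entropy(["low", "high", "low", "high"]))  # 1.0
```

Continuous sensor readings would have to be discretized (for example, binned) before a frequency-based estimate like this one is meaningful.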

b) To highlight the outlier attributes, the weight of an attribute is defined by the change in entropy after that attribute is removed. Let the attribute set be N = {N₁, N₂, …, N_m}. Taking N_i (i = 1, 2, …, m) divides N into two parts, {N_i} and {N − N_i}, denoted P = {P₁, P₂}, where P₁ = {N_i} and P₂ = {N₁, N₂, …, N_{i−1}, N_{i+1}, …, N_m}. The de-one partition information entropy increment Δ(N_i) is defined by formula (2); the larger its value, the more the uncertainty of the data set is reduced by removing N_i:

Δ(N_i)＝E(N)-E(P) (2)

in formula (2), Δ(N_i) represents the change in information entropy after N_i is removed from the set N;

E(N) represents the information entropy of the attribute set N;

the calculation formula of E (P) is:

c) if two data objects are p = {p₁, p₂, …, p_m} and p′ = {p′₁, p′₂, …, p′_m}, and the weighted distance between them is denoted dist(p, p′), then the weighted distance based on the de-one partition information entropy increment is defined as:

dist(p，p′)＝[Δ(N₁)×d(p₁，p′₁)]+[Δ(N₂)×d(p₂，p′₂)]+…+[Δ(N_m)×d(p_m，p′_m)] (4)
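
Formula (4) can be sketched in a few lines. This sketch assumes that the increments Δ(N_i) have already been computed and are passed in as `weights`, and that the per-attribute distance d is the absolute difference; the text does not specify d, so that choice is an assumption. The function name and the sample values are hypothetical.

```python
def weighted_distance(p, q, weights):
    """Weighted distance of formula (4): sum over i of weights[i] * d(p[i], q[i]).

    Assumptions: weights[i] holds the precomputed increment for attribute i,
    and d is the absolute difference between the two attribute values.
    """
    if not (len(p) == len(q) == len(weights)):
        raise ValueError("p, q and weights must have the same length")
    return sum(w * abs(a - b) for w, a, b in zip(weights, p, q))


# Hypothetical objects with attributes (concentration, temperature) and made-up weights.
print(weighted_distance([3.0, 20.0], [1.0, 24.0], [0.5, 0.25]))  # 2.0
```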

Step 3: calculating the reachable distance of all objects in the data set;

if p is a core object, the larger of the core distance of p and the distance between o and p is defined as the reachable distance of o with respect to p; if p is not a core object, the reachable distance is undefined. Thus, for objects p, o ∈ S, the reachable distance is defined as follows:

reachDist(o, p) = max{coreDist(p), dist(p, o)} if p is a core object, and is undefined otherwise (5)

in formula (5), reachDist(o, p) is the reachable distance of o with respect to p;

the reachable distance is calculated if and only if p is a core object.
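
The rule above can be written as a short sketch. It assumes that the core distance of p has been computed beforehand and is represented as `None` when p is not a core object; that representation and the function name `reach_dist` are illustrative choices, not part of the method as described.

```python
def reach_dist(dist_po, core_dist_p):
    """Reachable distance of o with respect to p.

    dist_po: distance between p and o.
    core_dist_p: core distance of p, or None when p is not a core object.
    Returns None (undefined) when p is not a core object.
    """
    if core_dist_p is None:
        return None
    return max(core_dist_p, dist_po)


print(reach_dist(0.8, 1.5))   # 1.5  (o lies inside the core distance of p)
print(reach_dist(2.0, 1.5))   # 2.0  (o lies outside the core distance of p)
print(reach_dist(2.0, None))  # None (p is not a core object)
```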

Step 4: obtaining a preliminary outlier data set S₂ using the OPTICS algorithm;

Step 4.1: after the points in the neighborhood have been added to the unordered queue, the whole queue does not need to be sorted; the point with the smallest reachable distance is simply found by comparison and stored in a temporary variable. When the next point in the unordered queue needs to be processed, only the minimum point stored in the temporary variable needs to be taken out. The reachability plot is obtained in this way.
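
One way to picture Step 4.1 is a small container that keeps the candidates unsorted and caches only the current minimum. The class below is a hypothetical sketch: the name `UnorderedSeeds` and its methods are invented for illustration, and the linear rescan after a removal is an assumption, since the text does not say how the next minimum is found once the cached one has been taken out.

```python
class UnorderedSeeds:
    """Unsorted candidate set that caches the entry with the smallest reachable distance."""

    def __init__(self):
        self._reach = {}     # point id -> current reachable distance
        self._min_id = None  # "temporary variable" holding the current minimum

    def add(self, point_id, reach):
        """Insert a point, or lower its reachable distance; one comparison updates the minimum."""
        old = self._reach.get(point_id)
        if old is not None and old <= reach:
            return  # the stored distance is already at least as small
        self._reach[point_id] = reach
        if self._min_id is None or reach < self._reach[self._min_id]:
            self._min_id = point_id

    def pop_min(self):
        """Remove and return (point_id, reach) for the cached minimum."""
        if self._min_id is None:
            raise IndexError("pop from empty seed set")
        point_id = self._min_id
        reach = self._reach.pop(point_id)
        # Assumption: the next minimum is found by one linear scan, not by sorting.
        self._min_id = min(self._reach, key=self._reach.get) if self._reach else None
        return point_id, reach

    def __len__(self):
        return len(self._reach)


seeds = UnorderedSeeds()
for pid, r in [("a", 0.9), ("b", 0.4), ("c", 0.7)]:
    seeds.add(pid, r)
seeds.add("c", 0.2)     # a shorter reachable distance to c was found
print(seeds.pop_min())  # ('c', 0.2)
print(seeds.pop_min())  # ('b', 0.4)
```

The order in which points are popped, together with their reachable distances, is what the reachability plot is drawn from.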

Step 5: calculating the k-distance and k-distance neighborhood of every object in the preliminary outlier data set, and calculating the P weight;

The P weight is a distance-based measure for discovering outlying data: it measures the degree of outlierness of an object in the data set. The sum of the distances between an arbitrary object p and its k nearest objects is called the P weight and is calculated as follows:

W_k(p)＝d₁(p，nb₁(p))+d₂(p，nb₂(p))+…+d_k(p，nb_k(p)) (6)

in formula (6), W_k(p) is the P weight;

nb_i(p) represents the i-th neighbor of p;

d_k(p, nb_k(p)) represents the distance from point p to its k-th nearest object.
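
A brute-force sketch of formula (6) follows. It assumes the distance function is supplied by the caller (in the method described here it would be the weighted distance of formula (4)); the helper `abs_diff` and the one-dimensional sample readings are hypothetical and used only to keep the example small.

```python
def p_weight(index, points, k, dist):
    """P weight W_k(p) of formula (6): sum of the distances from points[index]
    to its k nearest other objects, found here by brute force."""
    p = points[index]
    distances = sorted(dist(p, q) for j, q in enumerate(points) if j != index)
    if k > len(distances):
        raise ValueError("k must not exceed the number of other objects")
    return sum(distances[:k])


def abs_diff(a, b):
    """Hypothetical one-dimensional distance used only for this example."""
    return abs(a - b)


readings = [1.0, 1.1, 1.2, 5.0]
weights = [round(p_weight(i, readings, 2, abs_diff), 2) for i in range(len(readings))]
print(weights)  # the isolated reading 5.0 gets by far the largest P weight
```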

Step 6: calculating local density Ldp based on the P weight;

Measuring the degree of outlierness of an object with the P weight is simple to compute, but it can only find outliers in data of a single density, so the algorithm borrows the idea of the local outlier factor algorithm to improve the P weight. In the LOF algorithm, for a core point p, the reachable distance from any point o to p is defined as the larger of d(p, o) and coreDist(p); here, the reachable distance in the local outlier factor algorithm is replaced with the P weight of the core object. Since the distance between any two points has already been calculated in the OPTICS algorithm, dist_i(p, nb_i(p)) uses the weighted distances already computed during OPTICS clustering. Given a data set S and any point p in it, the local density Ldp_k(p) based on the P weight proposed herein may be represented by the following formula:

in the formula (7), N_k(p) represents a k-distance neighborhood of object p;

dist_i(p，nb_i(p)) is a weighted distance based on the de-one partition information entropy increment.
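
Formula (7) itself is not reproduced in this text, so the sketch below rests on an explicit assumption: that Ldp_k(p) is the number of objects in N_k(p) divided by the sum of the weighted distances dist_i(p, nb_i(p)), by analogy with the local reachability density of LOF with the P weight in place of the reachable distances. It illustrates the idea only and should not be read as the exact formula of the invention.

```python
def local_density(neighbor_distances):
    """Assumed form of Ldp_k(p): |N_k(p)| divided by the sum of neighbor distances.

    neighbor_distances: weighted distances from p to every object in N_k(p).
    This form is an assumption; formula (7) is not available in the text.
    """
    total = sum(neighbor_distances)
    if total == 0:
        return float("inf")  # p coincides with all of its neighbors
    return len(neighbor_distances) / total


print(local_density([0.5, 0.5]))  # 2.0   (close neighbors, high density)
print(local_density([4.0, 4.0]))  # 0.25  (distant neighbors, low density)
```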

Step 7: calculating the local outlier factor LOFBP based on the P weight;

Following the definition of the local outlier factor in the LOF algorithm, the P-weight-based local outlier factor of LOFBP is defined by analogy from Ldp_k(p). It is the mean of the ratios of the densities of the objects in the ε-neighborhood of object p to the density of p, and is denoted LOFBP_k(p). The closer LOFBP_k(p) is to 1, the less the density of p differs from that of the objects in N_k(p), and the more likely p and those objects belong to the same cluster; the further LOFBP_k(p) falls below 1, the more the density of p exceeds that of the objects in N_k(p); conversely, the further it rises above 1, the more likely p is an outlier. The local outlier factor of object p is defined as follows:

in formula (8), LOFBP_k(p) is the P-weight-based local outlier factor of point p, which serves as the index of how outlying the data point is;

Step 8: outputting the LOFBP values in descending order to obtain the outlier data.
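
The last two steps can be sketched as follows, assuming the local densities have already been computed and the neighborhood of each object is given as a list of indices. The score is taken, as the prose describes, to be the mean ratio of each neighbor's density to the density of the object itself. The function names, the `neighbors` mapping and the sample densities are hypothetical.

```python
def lofbp(index, densities, neighbors):
    """Mean ratio of each neighbor's density to the density of object `index`."""
    ratios = [densities[j] / densities[index] for j in neighbors[index]]
    return sum(ratios) / len(ratios)


def rank_outliers(densities, neighbors):
    """Return (index, score) pairs sorted by score in descending order (Step 8)."""
    scores = [(i, lofbp(i, densities, neighbors)) for i in range(len(densities))]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical densities: objects 0-2 sit in a dense cluster, object 3 is isolated.
densities = [2.0, 2.0, 2.0, 0.25]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [1, 2]}
print(rank_outliers(densities, neighbors)[0])  # (3, 8.0)
```

Objects inside the cluster score 1.0 here, while the isolated object scores well above 1 and therefore appears first in the descending output.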

The invention has the beneficial effects that:

(1) The invention overcomes the low execution efficiency of traditional outlier mining algorithms when they are used for anomaly discovery on sensor array data. Outlying data points are mined by analyzing the raw gas concentration sensor data. The method can assist subsequent algorithms in processing large amounts of sensor data and improve execution efficiency.

(2) The LOFBP algorithm provided by the invention uses OPTICS to preprocess the original gas concentration sensor data set, introduces the de-one partition information entropy into the distance measure, replaces the reachable distance with the P weight, and redefines the local outlier factor, thereby effectively improving the accuracy of outlier mining while maintaining efficiency.

(3) Compared with the LOF, P-weight and LODCD algorithms, the LOFBP algorithm provided by the invention offers the best combined performance in mining effect and operating efficiency: it mines outlying data points effectively while consuming very little time.

By analyzing the raw gas concentration sensor data, the method mines outlying data points more efficiently and accurately, thereby assisting the execution of a subsequent spatial concentration estimation algorithm or other practically useful methods, improving the execution efficiency of those algorithms on the same hardware, and improving the ability to identify production safety risks.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a schematic of a simulated data set for use with the present invention.

FIG. 3 is a graph comparing the detection accuracy of the algorithm of the present invention with three other algorithms on a simulated data set.

FIG. 4 is a graph comparing the detection accuracy of the algorithm of the present invention with three other algorithms on an Iris dataset.

FIG. 5 is a graph comparing the detection accuracy of the algorithm of the present invention with that of the other three algorithms on the Breast-cancer data set.

FIG. 6 is a run time comparison of the algorithm of the present invention with three other algorithms.

Detailed Description

The invention is explained in more detail below with reference to the drawings and the examples.

The invention aims to solve the problems that existing outlier data mining algorithms cannot effectively detect outlying gas concentration sensor data points and take too long to run. First, the weights of the various attributes, such as sensor deployment position, wind speed, temperature and concentration, are determined using the de-one partition information entropy. Then, the original gas concentration sensor data set is screened with the OPTICS clustering algorithm to obtain a preliminary outlier data set, and the reachable distance in the local outlier factor algorithm is replaced by the P weight. Finally, the degree of outlierness of the objects in the preliminary outlier data set is calculated with the newly defined P-weight-based local outlier factor LOFBP, and the outlying data points are mined.

Fig. 1 is a flow chart of the present invention, and the specific implementation process is as follows:

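Step 1: reading an original data set S acquired by the gas sensors;

Step 2: calculating the de-one partition information entropy increment Δ(N_i) for all attributes in the data set;

Step 3: calculating the reachable distance reachDist of all objects in the data set;

Step 4: obtaining a preliminary outlier data set S₂ using the OPTICS algorithm;

Step 5: calculating the k-distance and k-distance neighborhood of every object in the preliminary outlier data set, and calculating the P weight;

Step 6: calculating the local density based on the P weight;

Step 7: calculating the local outlier factor LOFBP based on the P weight;

Step 8: outputting the LOFBP values in descending order to obtain the outlier data.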

In the experiments, one simulated data set and two UCI data sets were used to verify the effectiveness of the method and analyze its efficiency. The gas concentration sensor data used in the implementation are real experimental data, freely available from UCI. The simulated data set is shown in FIG. 2; it contains two clusters of different densities together with outliers. There are 500 cluster data points, represented by black dots, and 10 outliers, represented by black triangles. Statistical analysis was performed on the first 10 entries of each run's output: the number R₀ of true outliers among the first 10 entries was counted, and from R₀ the detection accuracy of the LOF, P-weight, LODCD and LOFBP algorithms was calculated for different values of k. The comparison is shown in FIG. 3. FIGS. 4 and 5 show the detection accuracy of the four algorithms on the Iris and Breast-cancer data sets, respectively. As can be seen from FIGS. 3 to 5, the overall detection accuracy of the LOFBP algorithm is the highest. FIG. 6 compares the running times of the four algorithms on the Breast-cancer data set: the P-weight algorithm has the shortest running time, the LODCD algorithm the longest, and the LOFBP and LOF algorithms run about equally fast. The results show that the LOFBP algorithm provided by the invention not only mines data more effectively but also maintains high operating efficiency.
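
The evaluation based on R₀ can be illustrated with a short sketch. The text does not give the accuracy formula, so the definition used here, R₀ divided by the number of entries inspected, is an assumption; the function name, the ranking and the labels are hypothetical and are not experimental results.

```python
def top_n_accuracy(ranked_ids, true_outlier_ids, n=10):
    """R0 / n, where R0 counts the true outliers among the first n ranked objects.

    Assumption: detection accuracy is the fraction of the top n entries
    that are true outliers.
    """
    r0 = sum(1 for i in ranked_ids[:n] if i in true_outlier_ids)
    return r0 / n


# Hypothetical ranking and labels, for illustration only.
ranked = [500, 501, 7, 502, 503, 504, 505, 506, 507, 508]
truth = set(range(500, 510))
print(top_n_accuracy(ranked, truth))  # 0.9
```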

The gas concentration sensor data used in the implementation take ethanol as a typical liquid hazardous chemical; the experimental environment was windless, with a temperature of 22.4 °C and a humidity of 68.92%. The data are the concentration readings of 72 metal-oxide gas concentration sensors, divided into nine groups of eight, sampled at 100 Hz for 20 seconds. After deducting the time consumed by the device for sampling, data were collected 1928 times in total, giving 72 × 1928 = 138,816 sampled data points. Data processing and outlier analysis with the algorithm of the invention quickly locate the outlying sampled data points. Sorting the sampled data points by outlier index facilitates various kinds of subsequent processing. For example: 1. because the gas concentration can hardly jump under fast sampling, a data point whose outlier factor exceeds a certain threshold can be preliminarily judged to be noise caused by equipment failure and can be eliminated; 2. if the data points collected by one group of sensors generally have high outlier indices, it can be checked whether the equipment is operating in the required normal working environment, since the sampling result may be affected by the working temperature or the integrated hardware environment; 3. if large amounts of data from different groups form similar clusters, the relative position of an individual sensor within the sensor array may be affecting the sampling results; and so on.

In practice, outlier analysis of a large number of data indices makes it possible, during feature engineering, to eliminate noise points promptly, to analyze the distribution of the data points more effectively, or to discover factors that were not fully considered when designing the experiment or production process. It also speeds up subsequent exception handling algorithms and is therefore of great practical significance.

Parts not covered by the present invention are the same as, or can be implemented using, the prior art.

Claims

1. A liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining, characterized in that: first, a de-one partition information entropy is introduced to determine the weights of the outlier attributes; then, the original data set acquired by the gas concentration sensors is screened with the OPTICS clustering algorithm to obtain a preliminary outlier data set, which improves the operating efficiency of the algorithm; the reachable distance in the LOF algorithm is replaced by the P weight; and finally, the degree of outlierness of the objects in the preliminary outlier data set is calculated with a newly defined P-weight-based local outlier factor (LOFBP), which improves execution efficiency while maintaining the detection accuracy of the algorithm.

2. Method according to claim 1, characterized in that it comprises the following steps:

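Step 1: reading the raw gas sensor data set;

Step 2: calculating the de-one partition information entropy increment of all attributes in the data set;

Step 3: calculating the reachable distance of all objects in the data set;

Step 4: obtaining a preliminary outlier data set using the OPTICS algorithm;

Step 5: calculating the k-distance and k-distance neighborhood of all objects in the preliminary outlier data set, and calculating the P weight;

Step 6: calculating the local density based on the P weight;

Step 7: calculating the local outlier factor LOFBP based on the P weight;

Step 8: outputting the LOFBP values in descending order to obtain the outlier data.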

3. The method of claim 2, wherein obtaining the preliminary outlier data set using the OPTICS algorithm comprises: after the points in the neighborhood have been added to the unordered queue, the whole queue does not need to be sorted, and the point with the smallest reachable distance is found and stored in a temporary variable simply by comparing each newly added point with the previous minimum point; when the next point in the unordered queue needs to be processed, only the minimum point stored in the temporary variable needs to be taken out; the reachability plot is obtained in this way.

4. The method of claim 2, wherein calculating the de-one partition information entropy increment Δ(N_i) for all attributes in the data set comprises:

a) to improve the quality of outlier detection, the distance between data objects in the OPTICS algorithm is measured as a weighted distance, with the attribute weights determined by the de-one partition information entropy increment; information entropy measures how much information a system contains, so the value E(x) measures the uncertainty of a data set; it is defined as:

E(x)＝-[p(x₁)·log p(x₁)]-[p(x₂)·log p(x₂)]…-[p(x_n)·log p(x_n)] (1)

in formula (1), x is a random variable whose set of possible values is S(x) = {x₁, x₂, …, x_n};

p(x_i) represents the probability that x takes the value x_i;

b) to highlight the outlier attributes, the weight of an attribute is defined by the change in entropy after that attribute is removed; let the attribute set be N = {N₁, N₂, …, N_m}; taking N_i (i = 1, 2, …, m) divides N into two parts, {N_i} and {N − N_i}, denoted P = {P₁, P₂}, where P₁ = {N_i} and P₂ = {N₁, N₂, …, N_{i−1}, N_{i+1}, …, N_m}; the de-one partition information entropy increment Δ(N_i) is defined by formula (2), and the larger its value, the more the uncertainty of the data set is reduced by removing N_i:

Δ(N_i)＝E(N)-E(P) (2)

E(N) represents the information entropy of the attribute set N;

the calculation formula of E (P) is:

c) if two data objects are p = {p₁, p₂, …, p_m} and p′ = {p′₁, p′₂, …, p′_m}, and the weighted distance between them is denoted dist(p, p′), then the weighted distance based on the de-one partition information entropy increment is defined as:

dist(p，p′)＝[Δ(N₁)×d(p₁，p′₁)]+[Δ(N₂)×d(p₂，p′₂)]+…+[Δ(N_m)×d(p_m，p′_m)] (4).

5. The method of claim 2, wherein, in calculating the reachable distance of all objects in the data set, if p is a core object, the larger of the core distance of p and the distance between o and p is defined as the reachable distance of o with respect to p; if p is not a core object, the reachable distance is undefined; thus, for objects p, o ∈ S, the reachable distance is defined as follows:

reachDist(o, p) = max{coreDist(p), dist(p, o)} if p is a core object, and is undefined otherwise (5)

in formula (5), reachDist(o, p) is the reachable distance of o with respect to p; the reachable distance is calculated if and only if p is a core object.

6. The method of claim 2, wherein the k-distance and k-distance neighborhood of every object in the preliminary outlier data set are calculated, and the P weight is calculated; the P weight is a distance-based measure for discovering outlying data, which measures the degree of outlierness of an object in the data set; the sum of the distances between an arbitrary object p and its k nearest objects is called the P weight and is calculated as follows:

W_k(p)＝d₁(p，nb₁(p))+d₂(p，nb₂(p))+…+d_k(p，nb_k(p)) (6)

in formula (6), W_k(p) is the P weight;

nb_i(p) represents the i-th neighbor of p.

7. The method as claimed in claim 2, wherein, in calculating the local density Ldp based on the P weight: measuring the degree of outlierness of an object with the P weight is simple to compute but can only find outliers in data of a single density, so the algorithm borrows the idea of the local outlier factor algorithm to improve the P weight; in the LOF algorithm, for a core point p, the reachable distance from any point o to p is defined as the larger of d(p, o) and coreDist(p), and here the reachable distance in the local outlier factor algorithm is replaced by the P weight of the core object; since the distance between any two points has already been calculated in the OPTICS algorithm, dist_i(p, nb_i(p)) uses the weighted distances already computed during OPTICS clustering; given a data set S and any point p in it, the local density Ldp_k(p) based on the P weight may be represented by the following formula:

in the formula (7), N_k(p) represents a k-distance neighborhood of object p; dist_i(p,nb_i(p)) is a weighted distance based on the de-one partition information entropy increment.

8. The method of claim 2, wherein, in calculating the local outlier factor LOFBP based on the P weight, the P-weight-based local outlier factor of LOFBP is defined by analogy from Ldp_k(p), following the definition of the local outlier factor in the LOF algorithm; it is the mean of the ratios of the densities of the objects in the ε-neighborhood of object p to the density of p, and is denoted LOFBP_k(p); the closer LOFBP_k(p) is to 1, the less the density of p differs from that of the objects in N_k(p), and the more likely p and those objects belong to the same cluster; the further LOFBP_k(p) falls below 1, the more the density of p exceeds that of the objects in N_k(p); conversely, the further it rises above 1, the more likely p is an outlier; the local outlier factor of object p is defined as follows:

in formula (8), LOFBP_k(p) is the P-weight-based local outlier factor of point p, which serves as the index of how outlying the data point is.