CN112949735A - Liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining - Google Patents
- Publication number
- CN112949735A (application CN202110273839.2A)
- Authority
- CN
- China
- Prior art keywords
- outlier
- weight
- distance
- algorithm
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2433—Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method for discovering abnormal volatile concentrations of liquid hazardous chemicals based on outlier data mining. First, leave-one-out partition information entropy is introduced to determine the weights of the outlier attributes. A density-based clustering algorithm then screens the original data set collected by the sensors to obtain a preliminary outlier data set, which improves the running efficiency of the algorithm. Next, the P weight replaces the reachable distance in the local outlier factor algorithm. Finally, the outlier degree of the objects in the preliminary outlier data set is calculated with the newly defined P-weight-based local outlier factor, LOFBP. By processing a large amount of gas concentration sensor data with data mining technology, the invention improves the reliability of the data from a single gas concentration sensor and lets the data from multiple gas concentration sensor arrays be combined to estimate the gas concentration in a space. This helps enterprises that produce and process hazardous chemicals to identify production safety risks and prevent production accidents.
Description
Technical Field
The invention relates to an outlier data mining method, and in particular to a method for discovering abnormal volatile concentrations of liquid hazardous chemicals based on an improved local outlier factor algorithm.
Background
The storage and transportation of liquid hazardous chemicals have always affected the safety of people's lives and property in China. Different liquid hazardous chemicals have different properties, but most of them are volatile; gasoline, LNG, liquid ammonia, alcohols and benzene are common volatile hazardous chemicals. In practice, petrochemical enterprises must frequently monitor a number of safety indicators during storage and transportation to ensure that no accidental leakage occurs during production. Monitoring the gas volatilized from liquid hazardous chemicals is an important basis for judging whether an accidental leak has occurred. In open working environments, a single gas sensor often produces errors or even false alarms because of the limitations of its deployment. Enterprises therefore also deploy large numbers of sensors as an array. The larger number of sensors reduces false alarms and improves accuracy, but the large volume of raw sensor data must be preprocessed before use. Efficiently preprocessing these data, accurately identifying individual sensors with accidental errors, and mining outlying sensor data are therefore of considerable significance and practical value.
Traditional outlier data mining algorithms are poorly suited to the monitoring data of gas concentration sensors for liquid hazardous chemicals and take too long to run. The data collected by a gas concentration sensor depend on actual conditions and are hard to predict, and outliers in the data may have multiple causes. Safety problems usually have strict timeliness requirements, so the outlier data mining algorithm must execute quickly and locate outlier data points accurately in order to provide timely and accurate data for subsequent analysis. The invention applies data mining technology to outlier detection in gas concentration data and provides a method for discovering abnormal volatile concentrations of liquid hazardous chemicals based on an improved local outlier factor algorithm, which helps subsequent algorithms to batch-process large amounts of sensor data in a more targeted way.
Disclosure of Invention
The invention aims to solve the problems that existing outlier data mining algorithms are poorly suited to the large volume of monitoring data collected in batches by a gas concentration sensor array and take too long to run on it, and provides a method for discovering abnormal volatile concentrations of liquid hazardous chemicals based on outlier data mining.
The technical scheme of the invention is as follows:
The method performs outlier mining on the monitoring data of gas concentration sensors for liquid hazardous chemicals and improves the efficiency of processing the volatile gas concentration monitoring data. The method is characterized in that, first, leave-one-out partition information entropy is introduced to determine the weights of the outlier attributes; then the OPTICS clustering algorithm screens the original data set acquired by the gas concentration sensors to obtain a preliminary outlier data set, which improves the running efficiency of the algorithm; the P weight then replaces the reachable distance in the LOF algorithm; finally, the outlier degree of the objects in the preliminary outlier data set is calculated with the newly defined P-weight-based local outlier factor (LOFBP), which improves execution efficiency while preserving the detection accuracy of the algorithm.
To address the excessive running time of existing outlier data mining algorithms when detecting outlying sensor data, the LOFBP algorithm uses OPTICS as a preprocessing step and redefines the outlier factor of the local outlier factor algorithm. To reduce the time complexity of outlier mining, the data set is reduced, which improves mining efficiency without affecting the final analysis result. To overcome the shortcomings of the local outlier factor algorithm, the LOFBP algorithm introduces leave-one-out partition information entropy into the distance measure, replaces the reachable distance of the traditional LOF algorithm with the P weight, and redefines the local outlier factor, which improves detection accuracy.
The method specifically comprises the following steps:
Step 1: read the original data set $S$ acquired by the gas sensors;
Step 2: calculate the leave-one-out partition information entropy increment $\Delta(N_i)$ for all attributes in the data set;
a) To improve the quality of outlier detection, the distance between data objects in the OPTICS algorithm is measured by a weighted distance, and the weight of each attribute is determined by the leave-one-out partition information entropy increment. Information entropy measures how much information a system contains, so the value $E(x)$ measures the uncertainty of a data set. It is defined as:
$$E(x) = -[p(x_1)\log p(x_1)] - [p(x_2)\log p(x_2)] - \cdots - [p(x_n)\log p(x_n)] \tag{1}$$
In formula (1), $x$ is a random variable whose set of possible values is $S(x) = \{x_1, x_2, \ldots, x_n\}$;
$p(x)$ represents the probability of taking the value $x$.
b) To highlight outlier attributes, the weight of an attribute is defined by the change in entropy after that attribute is removed. Let the attribute set be $N = \{N_1, N_2, \ldots, N_m\}$. Taking $N_i$ ($i = 1, 2, \ldots, m$) divides $N$ into two parts, $\{N_i\}$ and $\{N - N_i\}$, denoted $P = \{P_1, P_2\}$, where $P_1 = \{N_i\}$ and $P_2 = \{N_1, N_2, \ldots, N_{i-1}, N_{i+1}, \ldots, N_m\}$. The leave-one-out partition information entropy increment $\Delta(N_i)$ is defined by formula (2); the larger its value, the more the uncertainty of the data set is reduced by removing $N_i$:
$$\Delta(N_i) = E(N) - E(P) \tag{2}$$
In formula (2), $\Delta(N_i)$ represents the change in information entropy after $N_i$ is removed from the set $N$;
$E(N)$ represents the information entropy of the attribute set $N$;
the calculation formula of $E(P)$ is:
c) If two data objects are $p = \{p_1, p_2, \ldots, p_m\}$ and $p' = \{p'_1, p'_2, \ldots, p'_m\}$, and the weighted distance between them is denoted $dist(p, p')$, then the weighted distance based on the leave-one-out partition information entropy increment is defined as:
$$dist(p, p') = [\Delta(N_1) \times d(p_1, p'_1)] + [\Delta(N_2) \times d(p_2, p'_2)] + \cdots + [\Delta(N_m) \times d(p_m, p'_m)] \tag{4}$$
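As an illustration of formulas (1) and (4), the following is a minimal Python sketch, not the patent's implementation. It assumes discrete attribute values, the natural logarithm (the text does not state the base), and an absolute difference for the per-attribute distance $d$. The attribute weights are passed in as an argument because the formula for $E(P)$ is not available in this text. The function names `entropy` and `weighted_distance` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy of a sequence of discrete values, as in formula (1).

    Assumption: probabilities are estimated by relative frequency and the
    natural logarithm is used.
    """
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def weighted_distance(p, q, weights):
    """Weighted distance of formula (4): sum of weight_i * d(p_i, q_i).

    Assumption: d is the absolute difference between attribute values.
    `weights` plays the role of the increments Delta(N_i).
    """
    return sum(w * abs(a - b) for w, a, b in zip(weights, p, q))
```

For example, `weighted_distance([1, 2], [4, 6], [0.5, 0.25])` evaluates 0.5 × 3 + 0.25 × 4.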
Step 3: calculate the reachable distance of all objects in the data set;
If $p$ is a core object, the larger of the core distance of $p$ and the distance from $o$ to $p$ is defined as the reachable distance of $o$ with respect to $p$; if $p$ is not a core object, the reachable distance is undefined. Thus for objects $p, o \in S$, the reachable distance is defined as follows:
$$reachDist(o, p) = \begin{cases} \text{undefined}, & \text{if } p \text{ is not a core object} \\ \max(coreDist(p),\ dist(o, p)), & \text{if } p \text{ is a core object} \end{cases} \tag{5}$$
In formula (5), $reachDist$ is the reachable distance of $o$ with respect to $p$;
the reachable distance is calculated if and only if $p$ is a core object.
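The definition of the reachable distance can be sketched in Python as follows. This is a minimal sketch under assumptions: the neighborhood radius `eps` and the threshold `min_pts` are the usual OPTICS parameters (the text does not name them), a point is not counted as its own neighbor, and `None` stands for "undefined". The function names are hypothetical.

```python
import math


def core_distance(i, points, eps, min_pts, dist=math.dist):
    """Distance from points[i] to its min_pts-th nearest neighbour within eps.

    Returns None when points[i] is not a core object.
    """
    ds = sorted(dist(points[i], points[j])
                for j in range(len(points)) if j != i)
    within = [d for d in ds if d <= eps]
    return within[min_pts - 1] if len(within) >= min_pts else None


def reach_dist(o, p, points, eps, min_pts, dist=math.dist):
    """Reachable distance of points[o] with respect to points[p].

    Undefined (None) unless points[p] is a core object; otherwise the larger
    of the core distance of p and the distance between o and p.
    """
    cd = core_distance(p, points, eps, min_pts, dist)
    if cd is None:
        return None
    return max(cd, dist(points[o], points[p]))
```

In the method described here, `dist` would be the entropy-weighted distance of Step 2 rather than the Euclidean default used in this sketch.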
Step 4: obtain a preliminary outlier data set $S_2$ using the OPTICS algorithm;
Step 4.1: after the points in the neighborhood are added to the unordered queue, the whole queue does not need to be sorted; only the point with the minimum reachable distance is found by comparison and stored in a temporary variable. When a new point in the unordered queue needs to be processed, only the minimum point stored in the temporary variable needs to be taken out. The reachability plot is obtained in this way.
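One possible reading of Step 4.1 is sketched below in Python. It is an interpretation, not the patent's code: the class name `UnorderedSeeds` and its methods are hypothetical, and the text does not say how the next minimum is found after the cached one is removed, so this sketch assumes a single linear scan (still without sorting the queue).

```python
class UnorderedSeeds:
    """Unsorted candidate queue that caches the minimum-reachability point."""

    def __init__(self):
        self._reach = {}      # point id -> smallest reachable distance seen
        self._min_id = None   # the "temporary variable" holding the minimum

    def __len__(self):
        return len(self._reach)

    def add(self, pid, reach):
        """Insert or improve a point; only one comparison with the cached minimum."""
        old = self._reach.get(pid)
        if old is not None and old <= reach:
            return  # existing reachable distance is already at least as good
        self._reach[pid] = reach
        if self._min_id is None or reach < self._reach[self._min_id]:
            self._min_id = pid

    def pop_min(self):
        """Remove and return (point id, reachable distance) of the cached minimum."""
        if self._min_id is None:
            raise IndexError("pop from empty seed queue")
        pid = self._min_id
        reach = self._reach.pop(pid)
        # Assumption: refresh the cache with one linear scan, no sorting.
        self._min_id = min(self._reach, key=self._reach.get) if self._reach else None
        return pid, reach
```

Points would be processed by calling `pop_min()` repeatedly; the order in which they come out, together with their reachable distances, gives the reachability plot.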
Step 5: calculate the k-distance and k-distance neighborhood of all objects in the preliminary outlier data set, and calculate the P weight;
The P weight is a distance-based method for finding outlier data: it measures the outlier degree of an object in the data set. The sum of the distances between an arbitrary object $p$ and its $k$ nearest objects is called the P weight, and is calculated as follows:
$$W_k(p) = d_1(p, nb_1(p)) + d_2(p, nb_2(p)) + \cdots + d_k(p, nb_k(p)) \tag{6}$$
In formula (6), $W_k(p)$ is the P weight;
$nb_i(p)$ represents the $i$-th nearest neighbor of $p$;
$d_k(p, nb_k(p))$ represents the distance from point $p$ to its $k$-th nearest object.
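Formula (6) can be illustrated with a short Python sketch. It uses a brute-force neighbor search and the Euclidean distance as a default, both of which are assumptions made for brevity; the function name `p_weight` is hypothetical.

```python
import math


def p_weight(i, points, k, dist=math.dist):
    """P weight W_k of points[i]: sum of distances to its k nearest neighbours."""
    ds = sorted(dist(points[i], points[j])
                for j in range(len(points)) if j != i)
    return sum(ds[:k])
```

A point far from all others gets a large P weight: for the points (0,0), (1,0), (3,0), (10,0) and k = 2, the isolated point (10,0) has a weight of 7 + 9, while (0,0) has 1 + 3.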
Step 6: calculate the local density $Ldp$ based on the P weight;
Measuring the outlier degree of an object with the P weight is simple, but it can only find outliers in data of a single density, so the algorithm borrows the idea of the local outlier factor algorithm to improve the P weight. In the LOF algorithm, for a core point $p$, the reachable distance from any point $o$ to $p$ is defined as the larger of $d(p, o)$ and $coreDist(p)$; here, the reachable distance of the local outlier factor algorithm is replaced with the P weight of the core object. Since the distance between any two points has already been calculated in the OPTICS algorithm, the distances $dist_i(p, nb_i(p))$ reuse the weighted distances calculated during OPTICS clustering. Given a data set $S$ and any point $p$ in it, the P-weight-based local density $Ldp_k(p)$ proposed herein is represented by the following formula:
In formula (7), $N_k(p)$ represents the k-distance neighborhood of object $p$;
$dist_i(p, nb_i(p))$ is the weighted distance based on the leave-one-out partition information entropy increment.
Step 7: calculate the local reachable density LOFBP based on the P weight;
Following the definition of the local outlier factor in the LOF algorithm, the P-weight-based local outlier factor of LOFBP is defined by analogy from $Ldp_k(p)$. It is the mean of the ratios of the density of each object in the ε-neighborhood of object $p$ to the density of $p$, and this value is denoted $LOFBP_k(p)$. The closer $LOFBP_k(p)$ is to 1, the smaller the difference between the density of $p$ and that of the objects in $N_k(p)$, and the more likely $p$ and the objects in $N_k(p)$ belong to the same cluster. The smaller $LOFBP_k(p)$ is, the higher the density of $p$ relative to the objects in $N_k(p)$; conversely, the larger it is, the more likely $p$ is an outlier. The local outlier factor of object $p$ is defined as follows:
In formula (8), $LOFBP_k(p)$ is the local reachable density of point $p$, which can be used as an index of the outlier degree of the data;
Step 8: output the local reachable density LOFBP in descending order to obtain the outlier data.
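Steps 5 to 8 can be sketched end to end in Python. Formulas (7) and (8) are not available in this text, so the sketch rests on two explicit assumptions that may differ from the patent: the local density is taken as $k / W_k(p)$ (the inverse of the mean distance to the $k$ nearest neighbors), and LOFBP is taken as the mean ratio of each neighbor's density to the density of $p$, following the verbal description. All function names are hypothetical, and the Euclidean default stands in for the entropy-weighted distance.

```python
import math


def knn(i, points, k, dist):
    """Indices of the k nearest neighbours of points[i] (brute force)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return order[:k]


def ldp(i, points, k, dist):
    """ASSUMED local density: k divided by the P weight of points[i].

    Duplicate points would give a zero P weight and must be handled upstream.
    """
    w = sum(dist(points[i], points[j]) for j in knn(i, points, k, dist))
    return k / w


def lofbp(i, points, k, dist=math.dist):
    """ASSUMED LOFBP: mean of neighbour density / own density."""
    neigh = knn(i, points, k, dist)
    own = ldp(i, points, k, dist)
    return sum(ldp(j, points, k, dist) for j in neigh) / (len(neigh) * own)


def rank_outliers(points, k, dist=math.dist):
    """Step 8: (score, index) pairs sorted by score in descending order."""
    scores = [(lofbp(i, points, k, dist), i) for i in range(len(points))]
    return sorted(scores, reverse=True)
```

Under these assumptions, points inside a uniform cluster score close to 1 and an isolated point scores well above 1, which agrees with the interpretation of $LOFBP_k(p)$ given in Step 7.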
The invention has the beneficial effects that:
(1) The invention overcomes the low execution efficiency of traditional outlier mining algorithms when they process sensor array data for anomaly discovery. Outlier data points are mined by analyzing the raw gas concentration sensor data. The method helps subsequent algorithms to process large amounts of sensor data and improves execution efficiency.
(2) The LOFBP algorithm provided by the invention uses OPTICS to preprocess the original data set of the gas concentration sensors, introduces leave-one-out partition information entropy into the distance measure, replaces the reachable distance with the P weight, and redefines the local outlier factor, thereby improving the accuracy of outlier mining while maintaining efficiency.
(3) Compared with the LOF, P-weight and LODCD algorithms, the LOFBP algorithm provided by the invention has the best combined performance in mining effect and running efficiency: it mines outlier data points effectively while consuming little time.
By analyzing the raw data of the gas concentration sensors, the method mines outlier data points more efficiently and accurately. This assists the execution of a subsequent spatial concentration estimation algorithm or other practical methods, improves the execution efficiency of the algorithm under the same hardware conditions, and improves the ability to identify production safety risks.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic of a simulated data set for use with the present invention.
FIG. 3 is a graph comparing the detection accuracy of the algorithm of the present invention with three other algorithms on a simulated data set.
FIG. 4 is a graph comparing the detection accuracy of the algorithm of the present invention with three other algorithms on an Iris dataset.
FIG. 5 is a graph comparing the detection accuracy of the algorithm of the present invention with three other algorithms on the Breast-cancer data set.
FIG. 6 is a run time comparison of the algorithm of the present invention with three other algorithms.
Detailed Description
The invention is explained in more detail below with reference to the drawings and the examples.
The invention aims to solve the problems that existing outlier data mining algorithms cannot effectively detect outlying gas concentration sensor data points and take too long to run. First, the weights of the attributes, such as sensor deployment position, wind speed, temperature and concentration, are determined using leave-one-out partition information entropy. Then the OPTICS clustering algorithm screens the original data set of the gas concentration sensors to obtain a preliminary outlier data set, and the P weight replaces the reachable distance in the local outlier factor algorithm. Finally, the outlier degree of the objects in the preliminary outlier data set is calculated with the newly defined P-weight-based local outlier factor LOFBP, and the outlier data points are mined.
Fig. 1 is a flow chart of the present invention, and the specific implementation process is as follows:
Step 1: read the original data set $S$ acquired by the gas sensors;
Step 2: calculate the leave-one-out partition information entropy increment $\Delta(N_i)$ for all attributes in the data set;
Step 3: calculate the reachable distance reachDist of all objects in the data set;
Step 4: obtain a preliminary outlier data set $S_2$ using the OPTICS algorithm;
Step 5: calculate the k-distance and k-distance neighborhood of all objects in the preliminary outlier data set, and calculate the P weight;
Step 6: calculate the local density based on the P weight;
Step 7: calculate the local reachable density based on the P weight;
Step 8: output the local reachable density in descending order to obtain the outlier data.
In the experiments, one simulated data set and two UCI data sets are used to verify the effectiveness of the method and analyze its efficiency. The gas concentration sensor data used in the implementation are real experimental data from UCI that can be used without authorization. The simulated data set is shown in FIG. 2; it contains two clusters of different densities and a set of outliers. There are 500 cluster data points, represented by black dots, and 10 outliers, represented by black triangles. Statistical analysis is performed on the first 10 data points of the run results: the number $R_0$ of true outliers among the first 10 data points is counted, and the detection accuracy of the LOF, P-weight, LODCD and LOFBP algorithms under different values of k is calculated from $R_0$. The comparison is shown in FIG. 3. FIG. 4 and FIG. 5 show the detection accuracy of the four algorithms on the Iris and Breast-cancer data sets, respectively. As FIG. 3 to FIG. 5 show, the LOFBP algorithm has the highest overall detection accuracy. FIG. 6 compares the running time of the four algorithms on the Breast-cancer data set: the P-weight algorithm has the lowest running time, the LODCD algorithm has the highest, and the LOFBP and LOF algorithms are roughly equal. The results show that the LOFBP algorithm provided by the invention achieves a better data mining effect while maintaining high running efficiency.
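The evaluation based on $R_0$ can be sketched as follows. The text does not give the formula that turns $R_0$ into a detection accuracy, so this sketch assumes the accuracy is $R_0$ divided by the number of inspected results (10 in the experiment). The function names and the sample identifiers are hypothetical.

```python
def top_n_hits(ranked_ids, true_outliers, n=10):
    """R0: number of true outliers among the first n ranked results."""
    return sum(1 for i in ranked_ids[:n] if i in true_outliers)


def detection_accuracy(ranked_ids, true_outliers, n=10):
    """ASSUMED accuracy: R0 divided by n (the source does not state the formula)."""
    return top_n_hits(ranked_ids, true_outliers, n) / n
```

Here `ranked_ids` would be the object identifiers sorted by descending outlier score, and `true_outliers` the set of labelled outliers in the data set.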
The gas concentration sensor data used in the implementation take ethanol as a typical liquid hazardous chemical. The experimental environment is windless, with a temperature of 22.4 °C and a humidity of 68.92%. The data are concentration readings from 72 metal oxide gas concentration sensors, divided into nine groups of eight, sampled at 100 Hz for 20 seconds. After subtracting the time consumed by the sampling device, 1928 samples were collected in total, giving 72 × 1928 = 138816 sampled data points. Data processing and outlier analysis with the algorithm of the invention quickly locate the outlying sampled data points. Sorting the sampled data points by outlier index makes several kinds of subsequent processing easier. For example: 1. Because the gas concentration can hardly jump under fast sampling, a data point whose outlier factor exceeds a certain threshold can be preliminarily judged to be noise caused by equipment failure and can be eliminated. 2. If the data points collected by one group of sensors generally have a high outlier index, it can be checked whether the equipment is working in the required normal environment, since the sampling result may be affected by the working temperature or the integrated hardware environment. 3. If the data of many different groups form similar clusters, the relative position of a single sensor within the sensor array may be affecting the sampling results, and so on.
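The first two follow-up uses above can be sketched in Python. This is an illustration only: the threshold value is left to the caller because the text does not specify one, and the grouping assumes that each run of eight consecutive sensors forms one group, which the text does not confirm. The function names are hypothetical.

```python
def flag_noise(scores, threshold):
    """Indices of data points whose outlier factor exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def group_means(sensor_scores, group_size=8):
    """Mean outlier index per sensor group.

    Assumption: consecutive sensors belong to the same group.
    """
    groups = [sensor_scores[g:g + group_size]
              for g in range(0, len(sensor_scores), group_size)]
    return [sum(g) / len(g) for g in groups]


# Size of the data set described above: 72 sensors, 1928 samples each.
total_points = 72 * 1928
```

A group whose mean is clearly higher than the others would be a candidate for the environment check described in item 2.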
In practice, outlier analysis of a large number of data indicators allows noise points to be eliminated in time during feature engineering, data distribution patterns to be analyzed more effectively, and factors that were not fully considered when designing experiments or production practice to be discovered. It also speeds up subsequent anomaly handling algorithms, and therefore has great practical significance.
Parts not described in the present invention are the same as the prior art or can be implemented using the prior art.
Claims (8)
1. A method for discovering abnormal volatile concentrations of liquid hazardous chemicals based on outlier data mining, characterized in that, first, leave-one-out partition information entropy is introduced to determine the weights of the outlier attributes; then the OPTICS clustering algorithm screens the original data set acquired by the gas concentration sensors to obtain a preliminary outlier data set, which improves the running efficiency of the algorithm; the P weight then replaces the reachable distance in the LOF algorithm; and finally, the outlier degree of the objects in the preliminary outlier data set is calculated with the newly defined P-weight-based local outlier factor (LOFBP), which improves execution efficiency while preserving the detection accuracy of the algorithm.
2. The method according to claim 1, characterized in that it comprises the following steps:
Step 1: read the gas sensor raw data set;
Step 2: calculate the leave-one-out partition information entropy increment of all attributes in the data set;
Step 3: calculate the reachable distance of all objects in the data set;
Step 4: obtain a preliminary outlier data set using the OPTICS algorithm;
Step 5: calculate the k-distance and k-distance neighborhood of all objects in the preliminary outlier data set, and calculate the P weight;
Step 6: calculate the local density based on the P weight;
Step 7: calculate the local reachable density based on the P weight;
Step 8: output the local reachable density in descending order to obtain the outlier data.
3. The method of claim 2, wherein obtaining the preliminary outlier data set with the OPTICS algorithm comprises: after the points in the neighborhood are added to the unordered queue, the whole queue does not need to be sorted, and the point with the minimum reachable distance is found and stored in a temporary variable simply by comparing each newly added point with the previous minimum point; when a new point in the unordered queue needs to be processed, only the minimum point stored in the temporary variable needs to be taken out, and the reachability plot is obtained in this way.
4. The method of claim 2, wherein calculating the leave-one-out partition information entropy increment $\Delta(N_i)$ for all attributes in the data set comprises:
a) to improve the quality of outlier detection, the distance between data objects in the OPTICS algorithm is measured by a weighted distance, and the weight of each attribute is determined by the leave-one-out partition information entropy increment; information entropy measures how much information a system contains, so the value $E(x)$ measures the uncertainty of a data set; it is defined as:
$$E(x) = -[p(x_1)\log p(x_1)] - [p(x_2)\log p(x_2)] - \cdots - [p(x_n)\log p(x_n)] \tag{1}$$
in formula (1), $x$ is a random variable whose set of possible values is $S(x) = \{x_1, x_2, \ldots, x_n\}$;
$p(x)$ represents the probability of taking the value $x$;
b) to highlight outlier attributes, the weight of an attribute is defined by the change in entropy after that attribute is removed; let the attribute set be $N = \{N_1, N_2, \ldots, N_m\}$; taking $N_i$ ($i = 1, 2, \ldots, m$) divides $N$ into two parts, $\{N_i\}$ and $\{N - N_i\}$, denoted $P = \{P_1, P_2\}$, where $P_1 = \{N_i\}$ and $P_2 = \{N_1, N_2, \ldots, N_{i-1}, N_{i+1}, \ldots, N_m\}$; the leave-one-out partition information entropy increment $\Delta(N_i)$ is defined by formula (2), and the larger its value, the more the uncertainty of the data set is reduced by removing $N_i$:
$$\Delta(N_i) = E(N) - E(P) \tag{2}$$
in formula (2), $\Delta(N_i)$ represents the change in information entropy after $N_i$ is removed from the set $N$;
$E(N)$ represents the information entropy of the attribute set $N$;
the calculation formula of $E(P)$ is:
c) if two data objects are $p = \{p_1, p_2, \ldots, p_m\}$ and $p' = \{p'_1, p'_2, \ldots, p'_m\}$, and the weighted distance between them is denoted $dist(p, p')$, then the weighted distance based on the leave-one-out partition information entropy increment is defined as:
$$dist(p, p') = [\Delta(N_1) \times d(p_1, p'_1)] + [\Delta(N_2) \times d(p_2, p'_2)] + \cdots + [\Delta(N_m) \times d(p_m, p'_m)] \tag{4}$$
5. The method of claim 2, wherein, when calculating the reachable distance of all objects in the data set, if $p$ is a core object, the larger of the core distance of $p$ and the distance from $o$ to $p$ is defined as the reachable distance of $o$ with respect to $p$; if $p$ is not a core object, the reachable distance is undefined; thus for objects $p, o \in S$, the reachable distance is defined as follows:
$$reachDist(o, p) = \begin{cases} \text{undefined}, & \text{if } p \text{ is not a core object} \\ \max(coreDist(p),\ dist(o, p)), & \text{if } p \text{ is a core object} \end{cases} \tag{5}$$
in formula (5), $reachDist$ is the reachable distance of $o$ with respect to $p$; the reachable distance is calculated if and only if $p$ is a core object.
6. The method of claim 2, wherein the k-distance and k-distance neighborhood of all objects in the preliminary outlier data set are calculated, and the P weight is calculated; the P weight is a distance-based method for finding outlier data that measures the outlier degree of an object in the data set; the sum of the distances between an arbitrary object $p$ and its $k$ nearest objects is called the P weight, and is calculated as follows:
$$W_k(p) = d_1(p, nb_1(p)) + d_2(p, nb_2(p)) + \cdots + d_k(p, nb_k(p)) \tag{6}$$
in formula (6), $W_k(p)$ is the P weight;
$nb_i(p)$ represents the $i$-th nearest neighbor of $p$;
$d_k(p, nb_k(p))$ represents the distance from point $p$ to its $k$-th nearest object.
7. The method of claim 2, wherein, when calculating the local density $Ldp$ based on the P weight: measuring the outlier degree of an object with the P weight is simple, but it can only find outliers in data of a single density, so the algorithm borrows the idea of the local outlier factor algorithm to improve the P weight; in the LOF algorithm, for a core point $p$, the reachable distance from any point $o$ to $p$ is defined as the larger of $d(p, o)$ and $coreDist(p)$, and here the reachable distance of the local outlier factor algorithm is replaced by the P weight of the core object; since the distance between any two points has already been calculated in the OPTICS algorithm, the distances $dist_i(p, nb_i(p))$ reuse the weighted distances calculated during OPTICS clustering; given a data set $S$ and any point $p$ in it, the P-weight-based local density $Ldp_k(p)$ is represented by the following formula:
in formula (7), $N_k(p)$ represents the k-distance neighborhood of object $p$; $dist_i(p, nb_i(p))$ is the weighted distance based on the leave-one-out partition information entropy increment.
8. The method of claim 2, wherein calculating the local reachable density LOFBP based on P weight is based on the local outlier definition in LOF algorithm, which can be defined by Ldpk(P) analogizing the definition of local outlier factors based on P weight in LOFBP; the local outlier factor based on the P weight is found by the mean of the ratio of the object density in the epsilon neighborhood of object P to the object P density, and this value is denoted as LOFBPk(p); if LOFBPkThe closer the value of (p) is to 1, the more p and Nk(p) in which the density of the objects is not very different, p and Nk(p) the object may belong to a cluster; if LOFBPkThe smaller the value of (p) is, the higher the density of p is, the lower the value is, the lower the density is, thek(p) density of objects, whereas the more likely p is an outlier; the local outlier definition for object p is shown below:
In formula (8), LOFBP_k(p) is the local reachable density of point p, which can be used as an indicator of whether the data point is an outlier.
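The verbal definition in claim 8 (the mean ratio of each neighbor's density to the density of p) can be sketched in Python as below. This is a minimal sketch based on that wording only, since formula (8) is not reproduced in this text; the function name `lofbp`, the precomputed densities, and the neighborhoods are hypothetical illustration values.

```python
def lofbp(density, neighbors, p):
    """Mean ratio of each neighbor's density to the density of p."""
    ratios = [density[o] / density[p] for o in neighbors[p]]
    return sum(ratios) / len(ratios)

# Hypothetical precomputed local densities and k-distance neighborhoods.
density = {"a": 10.0, "b": 8.0, "c": 12.0, "x": 0.25}
neighbors = {"a": ["b", "c"], "x": ["a", "b"]}

score_a = lofbp(density, neighbors, "a")  # neighbors about as dense as a
score_x = lofbp(density, neighbors, "x")  # neighbors far denser than x
```

With these illustrative numbers, `score_a` comes out at about 1 (a sits inside a cluster), while `score_x` is far above 1, marking x as a likely outlier.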
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110273839.2A CN112949735A (en) | 2021-03-15 | 2021-03-15 | Liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112949735A (en) | 2021-06-11 |
Family
ID=76229759
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110273839.2A (Pending), published as CN112949735A (en) | Liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining | 2021-03-15 | 2021-03-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949735A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063733A (en) * | 2018-06-27 | 2018-12-21 | 西安理工大学 | An outlier detection method based on a two-parameter outlier factor |
Non-Patent Citations (1)
Title |
---|
XIAO Xue (肖雪) et al.: "Outlier data detection algorithm based on improved OPTICS clustering and LOPW", Computer Engineering & Science (《计算机工程与科学》), pages 885-892 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114943434A (en) * | 2022-05-16 | 2022-08-26 | 南京航空航天大学 | Liquid hazardous chemical substance loading and unloading crane position dynamic allocation method based on LOF outlier |
CN117272215A (en) * | 2023-11-21 | 2023-12-22 | 江苏达海智能系统股份有限公司 | Intelligent community safety management method and system based on data mining |
CN117272215B (en) * | 2023-11-21 | 2024-02-02 | 江苏达海智能系统股份有限公司 | Intelligent community safety management method and system based on data mining |
CN117436024A (en) * | 2023-12-19 | 2024-01-23 | 湖南翰文云机电设备有限公司 | Fault diagnosis method and system based on drilling machine operation data analysis |
CN117436024B (en) * | 2023-12-19 | 2024-03-08 | 湖南翰文云机电设备有限公司 | Fault diagnosis method and system based on drilling machine operation data analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109816031B (en) | Transformer state evaluation clustering analysis method based on data imbalance measurement | |
CN113092981B (en) | Wafer data detection method and system, storage medium and test parameter adjustment method | |
CN107679734A | A method and system for classification prediction of unlabeled data | |
US11640328B2 (en) | Predicting equipment fail mode from process trace | |
CN112949735A (en) | Liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining | |
CN112131575B (en) | Concept drift detection method based on classification error rate and consistency prediction | |
CN110543907A | Fault classification method based on microcomputer monitoring power curve | |
CN114004137A (en) | Multi-source meteorological data fusion and pretreatment method | |
CN110889441A (en) | Distance and point density based substation equipment data anomaly identification method | |
CN113298162A (en) | Bridge health monitoring method and system based on K-means algorithm | |
CN111709668A (en) | Power grid equipment parameter risk identification method and device based on data mining technology | |
CN111400911A (en) | GNSS deformation information identification and early warning method based on EWMA control chart | |
CN117669394B (en) | Mountain canyon bridge long-term performance comprehensive evaluation method and system | |
Cui et al. | Analysis and prediction of pipeline corrosion defects based on data analytics of in-line inspection | |
CN112329868A (en) | CLARA clustering-based manufacturing and processing equipment group energy efficiency state evaluation method | |
CN113255810B (en) | Network model testing method based on key decision logic design test coverage rate | |
CN115659271A (en) | Sensor abnormality detection method, model training method, system, device, and medium | |
CN112765219B (en) | Stream data abnormity detection method for skipping steady region | |
CN114597886A (en) | Power distribution network operation state evaluation method based on interval type two fuzzy clustering analysis | |
JP2008258486A (en) | Distribution analysis method and system, abnormality facility estimation method and system, program for causing computer to execute its distribution analysis method or its abnormality facility estimation method, and recording medium readable by computer having its program recorded therein | |
CN117591836B (en) | Pipeline detection data analysis method and related device | |
CN112884167B (en) | Multi-index anomaly detection method based on machine learning and application system thereof | |
CN113190406B (en) | IT entity group anomaly detection method under cloud native observability | |
CN117113248B (en) | Gas volume data anomaly detection method based on data driving | |
CN117236572B (en) | Method and system for evaluating performance of dry powder extinguishing equipment based on data analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||