CN116738353A - Pharmaceutical workshop air filter element performance detection method based on data analysis

Info

Publication number: CN116738353A
Application number: CN202311020323.2A
Authority: CN
Inventors: 沈丹; 赵梅香; 祁帅剑
Original assignee: Antos Nano Technology Suzhou Co ltd
Current assignee: Wuxi Duoning Biotechnology Co ltd
Priority date: 2023-08-15
Filing date: 2023-08-15
Publication date: 2023-09-12
Anticipated expiration: 2043-08-15
Also published as: CN116738353B

Abstract

The application relates to the field of data processing and provides a data-analysis-based method for detecting the performance of air filter elements in a pharmaceutical workshop. The method comprises: determining a neighborhood range of a current data point based on the density change of a first scatter plot of the data points acquired before the current data point and the density change of a second scatter plot of the data points acquired after the current data point; determining an initial K value for the current data point based on the density of data points within that neighborhood range, and thereby determining an initial K value for each data point; determining a final K value for the final scatter plot based on the initial K value of each data point; and performing anomaly detection on the data points in the final scatter plot using the LOF anomaly detection algorithm with the final K value, and determining the performance of the air filter element from the anomaly detection result. The method improves the accuracy of abnormal-data detection and improves the effect of detecting the performance of the air filter element.

Description

Pharmaceutical workshop air filter element performance detection method based on data analysis

Technical Field

The application relates to the field of data processing, in particular to a pharmaceutical workshop air filter element performance detection method based on data analysis.

Background

In pharmaceutical workshops, control of air quality is critical to the production process. The air filter element is a key component for air purification in a pharmaceutical workshop, and its performance is very important to the control of workshop air quality. In the prior art, the performance of the air filter element is generally detected with the LOF (local outlier factor) abnormal-data detection algorithm: the local outlier factor of each data point is calculated, and whether the data point is abnormal is determined from that factor. When filter element performance is detected with the original LOF algorithm, the computation takes too long, and the minimum neighborhood is difficult to define during data analysis, that is, a proper K value cannot be selected. As a result, an abnormal point with a low degree of abnormality is easily treated as a normal data point, or a normal data point is misjudged as abnormal.

Disclosure of Invention

The application provides a pharmaceutical workshop air filter element performance detection method based on data analysis, which can improve the accuracy of abnormal data detection and improve the effect of detecting the air filter element performance.

In a first aspect, the application provides a pharmaceutical workshop air filter element performance detection method based on data analysis, which comprises the following steps:

determining a neighborhood range of the current data point based on a density change condition of a first scatter plot of the data points before the current data point and a density change condition of a second scatter plot of the data points after the current data point;

determining an initial K value for the current data point based on a density of data points of a neighborhood range of the current data point, thereby determining an initial K value for each data point;

determining a final K value for the final scatter plot based on the initial K value for each data point;

and carrying out anomaly detection on the data points in the final scatter diagram based on the final K value by using an LOF anomaly detection algorithm, and determining the performance of the air filter element according to the anomaly detection result.

In an alternative embodiment, determining the neighborhood of the current data point based on the density change of a first scatter plot of data points preceding the current data point and the density change of a second scatter plot of data points following the current data point includes:

calculating a first density change index of the first scatter plot;

calculating a second density change index of the second scatter plot;

and determining the neighborhood radius of the current data point based on the first density change index, the second density change index and the cluster of the current data point, thereby determining the neighborhood range of the current data point.

In an alternative embodiment, calculating a first density change index for a first scatter plot includes:

determining a first clustering center from a first scatter diagram, wherein the coordinates of the first clustering center are the same as the coordinates of the current data point in the current scatter diagram;

clustering is carried out based on the first clustering center, and a first clustering cluster corresponding to the first clustering center is obtained;

a first density change index of the first scatter plot is determined based on the first cluster.

In an alternative embodiment, determining a first density change index for the first scatter plot based on the first cluster includes:

vertically mapping the data points in the first cluster to a first reference straight line to obtain a first mapping point set, wherein the first reference straight line is a connecting line of the coordinate origin of the first scatter diagram and the first cluster center;

and calculating a first density change index of the first scatter diagram based on the distance variance from the mapping point in the first mapping point set to the coordinate origin of the first scatter diagram, the distance variance from the mapping point in the first mapping point set to the first clustering center and the number of the mapping points in the first mapping point set.

In an alternative embodiment, calculating a second density change index for a second scatter plot includes:

determining a second cluster center from the second scatter plot, the coordinates of the second cluster center being the same as the coordinates of the current data point in the current scatter plot;

clustering is carried out based on the second clustering center, and a second clustering cluster corresponding to the second clustering center is obtained;

a second density change index of the second scatter plot is determined based on the second cluster.

In an alternative embodiment, determining a second density change index for the second scatter plot based on the second cluster of clusters includes:

vertically mapping the data points in the second cluster to a second reference straight line to obtain a second mapping point set, wherein the second reference straight line is a connecting line of the coordinate origin of the second scatter diagram and the second cluster center;

and calculating a second density change index of the second scatter plot based on the variance of the distances from the mapping points in the second mapping point set to the coordinate origin of the second scatter plot, the variance of the distances from those mapping points to the second cluster center, and the number of mapping points in the second mapping point set.

In an alternative embodiment, determining the neighborhood radius of the current data point based on the first density variation index, the second density variation index, and the cluster of current data points includes:

clustering is carried out by taking the current data point as a clustering center, so as to obtain a current cluster corresponding to the current data point;

and determining the neighborhood radius of the current data point based on the Euclidean distance between the data point farthest from the current data point in the current cluster and the current data point, the first density change index and the second density change index.

In an alternative embodiment, determining the initial K value for the current data point based on the density of data points in a neighborhood of the current data point includes:

the initial K value of the current data point is determined based on the number of data points in the current scatter plot corresponding to the current data point, the density of the data points in the neighborhood range of the current data point, and the global density of the data points in the current scatter plot.

In an alternative embodiment, determining a final K value for the final scatter plot based on the initial K value for each data point includes:

the mode of the initial K values for all data points is determined as the final K value of the final scatter plot.

In an alternative embodiment, performing anomaly detection on the data points in the final scatter plot using the LOF anomaly detection algorithm with the final K value, and determining the performance of the air filter element from the anomaly detection result, includes:

calculating local outlier factor values of data points in the final scatter diagram based on the final K value by using an LOF anomaly detection algorithm;

if the local outlier factor of a data point is greater than a threshold, the data point is an outlier data point; if abnormal data points exist in the final scatter diagram, determining that the air filter element is unqualified in performance.

The application has the beneficial effects that the method is different from the prior art, and the method for detecting the performance of the air filter element of the pharmaceutical workshop based on data analysis comprises the following steps: determining a neighborhood range of the current data point based on a density change condition of a first scatter plot of the data points before the current data point and a density change condition of a second scatter plot of the data points after the current data point; determining an initial K value for the current data point based on a density of data points of a neighborhood range of the current data point, thereby determining an initial K value for each data point; determining a final K value for the final scatter plot based on the initial K value for each data point; and carrying out anomaly detection on the data points in the final scatter diagram based on the final K value by using an LOF anomaly detection algorithm, and determining the performance of the air filter element according to the anomaly detection result. According to the method, the initial K value of the data point is determined based on the density of the data points in the neighborhood range of the data point, and each data point corresponds to one initial K value, so that the final K value of the final scatter diagram is determined based on the initial K value of each data point, and anomaly monitoring is carried out, the problem that the anomaly point with lower anomaly degree is considered to be a normal data point or the normal data point is misjudged to be the anomaly point can be avoided, the detection precision of the anomaly data is improved, and the performance detection effect of the air filter element is improved.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of a method for detecting performance of air filter elements in a pharmaceutical workshop based on data analysis according to the present application;

FIG. 2 is a schematic diagram of one embodiment of a scatter plot;

FIG. 3 is a flowchart of an embodiment of step S11 in FIG. 1.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

For the pharmaceutical industry, GMP regulations require that, in dynamic testing, the dust particles in 1 cubic meter of air be measured. A laser dust particle counter with a flow rate of 28.3 L/min is used to detect the particles in the air, and, to ensure an accurate result, sampling must continue for 35 minutes, which yields the concentration and the size of the particles in the air of the pharmaceutical workshop. To better observe how particle size and concentration change, data are collected once every interval t; the empirical value of t is two hours, and the operator may adjust the collection interval according to actual conditions.
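As a quick check on the 35-minute sampling time, the following arithmetic relates the 28.3 L/min flow rate to the 1 cubic meter sample volume. It assumes only the conversion 1 m³ = 1000 L; the symbol t here denotes the sampling time, not the collection interval.

```latex
\[
t = \frac{1000\ \text{L}}{28.3\ \text{L/min}} \approx 35.3\ \text{min},
\qquad
28.3\ \text{L/min} \times 35\ \text{min} = 990.5\ \text{L} \approx 1\ \text{m}^3 .
\]
```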

The collected particulate data for the air of the pharmaceutical workshop are then analyzed and processed, and the performance of the air filter element is judged from the degree of abnormality of the data. The application uses the LOF abnormal-data detection algorithm to process the collected data. When abnormal data are detected with the LOF algorithm, the choice of the K value is the key problem: a suitable K value improves the accuracy and robustness of the LOF result.

The particulate data are analyzed with the particle size expressed in μm. In a pharmaceutical workshop, airborne particles are generally divided into two size classes, 0.5 μm and 5 μm, and the permitted number of particles of each size differs between cleanliness grades, as shown in the following table:

Particles of the two size classes, namely those with a radius greater than or equal to 0.5 μm and less than 5 μm and those with a radius greater than or equal to 5 μm, are analyzed: the corresponding two-dimensional data are established and a two-dimensional scatter plot is constructed. The ordinate of the scatter plot is the density of particles in the air, that is, the number of dust particles per cubic meter; the abscissa is the particle radius, in μm. FIG. 2 shows the scatter plot for particles with a radius greater than or equal to 0.5 μm and less than 5 μm; the scatter plots for other radius ranges are constructed in the same way.

The data are processed separately for the different areas of the workshop. The grade A area is the high-risk operation area, such as the filling area, the area where rubber stopper barrels, open ampoules and open penicillin bottles are placed, and the area for aseptic assembly or connection operations. The grade B area is the background area in which the grade A area is located and corresponds to class 100 in the table. The grade C and grade D areas are comparatively lower-grade: grade C corresponds to class 10,000 in the table and grade D to class 100,000. Different grades impose different requirements when the collected data are analyzed, so the accuracy of the critical value must be ensured when anomaly detection is performed.

The constructed scatter plot is then analyzed. In the original LOF algorithm the value of K is set manually, whereas the density of the data points in the scatter plot is not fixed. If the same K value is used without considering how the density of the data points varies, abnormal data are detected poorly, which in turn degrades the detection of air filter element performance. The application provides a pharmaceutical workshop air filter element performance detection method based on data analysis, which avoids treating an abnormal point with a low degree of abnormality as a normal data point or misjudging a normal data point as abnormal, improves the accuracy of abnormal-data detection, and improves the effect of detecting the performance of the air filter element.

The present application will be described in detail with reference to the accompanying drawings and examples.

Referring to fig. 1, fig. 1 is a flow chart of an embodiment of a method for detecting performance of an air filter element in a pharmaceutical workshop based on data analysis according to the present application, which specifically includes:

Step S11: The neighborhood range of the current data point is determined based on the density change of the first scatter plot of data points preceding the current data point and the density change of the second scatter plot of data points following the current data point.

In the application, the data points in the scatter plot may first be screened roughly: for each cleanliness grade, particulate data outside the corresponding allowable range are marked as abnormal data, and if such abnormal data occur, the air filter element in the workshop is judged unqualified.
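The rough screening described above amounts to a threshold check per cleanliness grade. A minimal Python sketch follows. The function names and the (radius, count) tuple layout are illustrative assumptions, and the limit of 3500 particles per cubic meter used in the usage note is the figure the description quotes for particles of at least 0.5 μm and less than 5 μm.

```python
from typing import List, Sequence, Tuple

# (particle radius in micrometers, particles per cubic meter); this layout is an assumption
Point = Tuple[float, float]


def out_of_range_points(points: Sequence[Point], max_count_per_m3: float) -> List[Point]:
    """Return the data points whose particle count exceeds the allowed limit."""
    return [p for p in points if p[1] > max_count_per_m3]


def passes_rough_screening(points: Sequence[Point], max_count_per_m3: float) -> bool:
    """True when no data point lies outside the allowable range."""
    return not out_of_range_points(points, max_count_per_m3)
```

For example, `passes_rough_screening(points, 3500)` would be evaluated first, and the adaptive K selection would only be needed when it returns True.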

If no data point in the scatter plot clearly violates the table, a data point is selected at random from the scatter plot for analysis, and a suitable K is chosen according to the density around that data point. The randomly selected data point is taken as the current data point o, with coordinates $(x_o, y_o)$ in the scatter plot, where $x_o$ is the particle radius corresponding to the current data point o and $y_o$ is the number of dust particles per cubic meter corresponding to the current data point o. Taking 0.5 μm as an example, under this grade standard the number of particles with a size greater than or equal to 0.5 μm and less than 5 μm must not exceed 3500 per cubic meter. Since the air filter element generally captures larger particles and lets smaller particles pass, smaller particles are distributed more densely and larger particles more sparsely in the corresponding scatter plot.

A suitable K value is selected according to the density within the neighborhood range of the current data point o. The neighborhood range of the current data point o is defined as the circle centered on o with radius R. The current data point o is analyzed to select a suitable radius R, the K value corresponding to o is determined from the density of the points within this neighborhood range, and the K value for the whole data set is then selected.

Data are collected once every interval t. Over the course of data collection, as the air filter element operates, the density of particles with a larger radius decreases at a higher rate and the density of particles with a smaller radius decreases at a lower rate. The acquisition in which the current data point o was collected is recorded as the n-th acquisition, and the scatter plots obtained from the (n-1)-th and (n+1)-th acquisitions are analyzed. The scatter plot from the (n-1)-th acquisition is the first scatter plot, of the data points before the current data point o; the scatter plot from the (n+1)-th acquisition is the second scatter plot, of the data points after the current data point o. If the current data point o belongs to the first or the last acquisition, only the scatter plot of the following or the preceding acquisition, respectively, is analyzed.
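The choice of the two neighboring acquisitions, including the boundary case for the first and last acquisition, can be sketched as below. This is a minimal illustration: the function name, the 0-based index and the use of None for a missing neighbor are assumptions, not part of the described method.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]      # (particle radius, particles per cubic meter)
ScatterPlot = List[Point]        # all data points of one acquisition


def adjacent_scatter_plots(
    acquisitions: Sequence[ScatterPlot], n: int
) -> Tuple[Optional[ScatterPlot], Optional[ScatterPlot]]:
    """Return (first, second) scatter plots around 0-based acquisition index n.

    The first scatter plot is acquisition n-1 and the second is n+1.
    A missing neighbor (first or last acquisition) is returned as None.
    """
    if not 0 <= n < len(acquisitions):
        raise IndexError("acquisition index out of range")
    first = acquisitions[n - 1] if n > 0 else None
    second = acquisitions[n + 1] if n + 1 < len(acquisitions) else None
    return first, second
```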

Referring to fig. 3, step S11 includes:

Step S21: A first density change index of the first scatter plot is calculated.

Specifically, the first scatter plot, formed by the data points of the (n-1)-th acquisition, is analyzed and a first cluster center is determined from it: the coordinates $(x_o, y_o)$ of the current data point o are located in the first scatter plot and taken as the first cluster center. The first scatter plot is then processed with DBSCAN clustering to obtain the first cluster, whose center is the coordinate point of the current data point o, and the first density change index of the first scatter plot is determined based on this first cluster. For the DBSCAN clustering, the empirical value of the neighborhood radius ε (epsilon) is 5 and the empirical value of MinPts is 8.
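One way to read "clustering with the current data point's coordinates as the cluster center" is to grow a single density-connected cluster outward from that coordinate, using the DBSCAN core-point rule. The sketch below does only that; it is not a full DBSCAN and not necessarily the procedure the application uses. The function names are hypothetical, the defaults 5 and 8 are the empirical ε and MinPts values from the description, and treating the seed coordinate like a core-point candidate even when it is not itself a data point of the first scatter plot is an assumption.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def region_query(points: Sequence[Point], center: Point, eps: float) -> List[int]:
    """Indices of all points within distance eps of center."""
    return [i for i, p in enumerate(points) if math.dist(p, center) <= eps]


def seeded_cluster(
    points: Sequence[Point], seed: Point, eps: float = 5.0, min_pts: int = 8
) -> List[Point]:
    """Grow one density-connected cluster outward from the seed coordinate."""
    neighbors = region_query(points, seed, eps)
    if len(neighbors) < min_pts:
        return []  # the seed does not lie in a dense region
    in_cluster = set(neighbors)
    queue = list(neighbors)
    while queue:
        i = queue.pop()
        nearby = region_query(points, points[i], eps)
        if len(nearby) >= min_pts:  # core point: its neighbors join the cluster
            for j in nearby:
                if j not in in_cluster:
                    in_cluster.add(j)
                    queue.append(j)
    return [points[i] for i in sorted(in_cluster)]
```

An empty result signals that the seed coordinate has too few neighbors in the other acquisition, a case the description does not discuss and that a caller would have to handle.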

In a specific embodiment, the coordinate origin of the first scatter plot is connected to the first cluster center to obtain a first reference line, denoted $L_1$. The data points in the first cluster, whose center is the current data point o, are projected perpendicularly onto the first reference line $L_1$, and the corresponding mapping points are marked to obtain a first mapping point set. The first density change index of the first scatter plot is calculated from the variance of the distances from the mapping points in the first mapping point set to the coordinate origin of the first scatter plot, the variance of the distances from those mapping points to the first cluster center, and the number of mapping points in the first mapping point set.

Taking the first density change index, denoted $\delta_{n-1}$, as an example, it is calculated as follows:

In the above formula, $a_i$ is the Euclidean distance from the i-th mapping point on the first reference line $L_1$, in the first cluster of the current data point o, to the coordinate origin of the first scatter plot, and $\bar{a}$ is the mean Euclidean distance from all mapping points to the coordinate origin of the first scatter plot; $b_i$ is the Euclidean distance from the i-th mapping point on $L_1$ to the current data point o (the first cluster center), and $\bar{b}$ is the mean Euclidean distance from all mapping points to the current data point o; the variance of the $a_i$ about $\bar{a}$ is the distance variance from the mapping points in the first mapping point set to the coordinate origin of the first scatter plot, the variance of the $b_i$ about $\bar{b}$ is the distance variance from those mapping points to the first cluster center, and m is the number of mapping points on $L_1$ in the first mapping point set.

The variance of the $a_i$ reflects how densely the mapping points are packed on the first reference line $L_1$: the denser the mapping points on $L_1$, the smaller the variance of the $a_i$. The variance of the $b_i$ reflects how far the mapping points on $L_1$ lie from the current data point o: the closer the mapping points are to o, the smaller the variance of the $b_i$ and, correspondingly, the closer the first density change index $\delta_{n-1}$ is to 1.
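The geometric part of this step, the perpendicular projection onto the reference line and the two distance variances, can be sketched as follows. The sketch stops at these three quantities (the two variances and the number of mapping points) and does not attempt the index formula itself. The function names are hypothetical, and the use of the population variance (division by m) is an assumption.

```python
import math
from statistics import pvariance
from typing import Sequence, Tuple

Point = Tuple[float, float]


def project_onto_line(point: Point, center: Point) -> Point:
    """Foot of the perpendicular from point onto the line through the origin and center."""
    cx, cy = center
    norm_sq = cx * cx + cy * cy
    if norm_sq == 0:
        raise ValueError("cluster center must differ from the coordinate origin")
    t = (point[0] * cx + point[1] * cy) / norm_sq
    return (t * cx, t * cy)


def projection_statistics(cluster: Sequence[Point], center: Point) -> Tuple[float, float, int]:
    """Return (variance of distances to origin, variance of distances to center, m)."""
    mapped = [project_onto_line(p, center) for p in cluster]
    to_origin = [math.hypot(x, y) for x, y in mapped]
    to_center = [math.dist(q, center) for q in mapped]
    # Population variance (divide by m) is an assumption of this sketch.
    return pvariance(to_origin), pvariance(to_center), len(mapped)
```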

Step S22: a second density change index of the second scattergram is calculated.

Further, determining a second cluster center from the second scatter plot, wherein the coordinates of the second cluster center are the same as the coordinates of the current data point in the current scatter plot; clustering is carried out based on the second clustering center, and a second clustering cluster corresponding to the second clustering center is obtained; a second density change index of the second scatter plot is determined based on the second cluster. Specifically, the data points in the second cluster are vertically mapped to a second reference straight line to obtain a second mapping point set, wherein the second reference straight line is a connecting line of the coordinate origin of the second scatter diagram and the second cluster center; and calculating a second density change index of the second scatter diagram based on the distance variance from the mapping point in the second mapping point set to the coordinate origin of the second scatter diagram, the distance variance from the mapping point in the second mapping point set to the second aggregation center and the number of the mapping points in the second mapping point set.

Second density change indexIs calculated by the method of (a) and the first density change index +.>The calculation method of (2) is the same and will not be described in detail herein.

Step S23: and determining the neighborhood radius of the current data point based on the first density change index, the second density change index and the cluster of the current data point, thereby determining the neighborhood range of the current data point.

Specifically, clustering is carried out by taking the current data point as a clustering center, so as to obtain a current clustering cluster corresponding to the current data point; and determining the neighborhood radius of the current data point based on the Euclidean distance between the data point farthest from the current data point in the current cluster and the current data point, the first density change index and the second density change index.

In one embodiment, the larger the difference between the first density change index $\delta_{n-1}$ of the current data point o at the (n-1)-th acquisition and the second density change index $\delta_{n+1}$ at the (n+1)-th acquisition, the larger the corresponding density change. This means that the air filter element filters particles of the radius corresponding to the current data point o more effectively. During data collection, particles of a size that is filtered more effectively become sparser in the scatter plot, and selecting a smaller K for them would impair the accuracy of data detection. In abnormal-data detection, therefore, a larger R, where R is the neighborhood radius of the current data point, is selected for a point with a larger density change, so that the density change of the point is considered within a larger neighborhood. For a point with a smaller density change, a smaller R is selected: because the density change is not pronounced, the degree of abnormality of the point is analyzed within a smaller neighborhood, which keeps the calculation from becoming excessively redundant. The corresponding formula is as follows:

In the formula, $r_{\max}$ is the maximum Euclidean distance between the current data point o and the data points in the current cluster, where the current cluster is obtained by clustering the scatter plot containing the current data point o with o as the cluster center; $\delta_{n+1}$ is the second density change index of the current data point o in the second scatter plot, from the (n+1)-th acquisition; and $\delta_{n-1}$ is the first density change index of the current data point o in the first scatter plot, from the (n-1)-th acquisition.

When the first and second density change indexes differ more, that is, when the difference between the first density change index $\delta_{n-1}$ in the first scatter plot of the (n-1)-th acquisition and the second density change index $\delta_{n+1}$ in the second scatter plot of the (n+1)-th acquisition is larger, the concentration of particles of the corresponding radius changes more markedly during filtering. To capture more information from the scatter plot, a larger R is selected when the density is analyzed in the neighborhood of the current data point o; conversely, in the opposite case, a smaller R should be selected.

The neighborhood radius of the current data point o is calculated through the formula, so that the neighborhood range of the current data point o can be determined. Further, the neighborhood range for each data point in the final scatter plot can be determined in the manner described above.
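The two inputs to the neighborhood radius, the distance to the farthest point of the current cluster and the size of the change between the two density change indexes, can be computed as sketched below. The sketch deliberately does not combine them into R. The function names are hypothetical, and measuring the change as an absolute difference is an assumption.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def farthest_distance(cluster: Sequence[Point], center: Point) -> float:
    """Largest Euclidean distance from center to any point of its cluster."""
    if not cluster:
        raise ValueError("cluster must not be empty")
    return max(math.dist(p, center) for p in cluster)


def density_change(index_prev: float, index_next: float) -> float:
    """Absolute difference between the (n-1)-th and (n+1)-th density change indexes."""
    return abs(index_next - index_prev)
```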

Step S12: an initial K value for the current data point is determined based on the density of the data points of the neighborhood of the current data point, thereby determining an initial K value for each data point.

Specifically, the initial K value of the current data point is determined based on the number of data points in the current scatter plot corresponding to the current data point, the density of data points in the neighborhood of the current data point, and the global density of data points in the current scatter plot.

In a specific embodiment, the density of data points is calculated within the neighborhood of radius R around the current data point o. If the density within this neighborhood is high, the K value of the current data point o should be reduced appropriately, which reduces the amount of calculation in the reachable neighborhood of o; if the density within the neighborhood is low, the K value of o should be increased appropriately, so that the information of the data points in a larger reachable neighborhood is taken into account and the accuracy of abnormal-data judgment increases. Specifically:

$$K_o = \left\lfloor \sqrt{Q} \cdot \frac{\rho_G}{\rho_R} \right\rfloor$$

where $K_o$ is the adaptive initial K value of the current data point o; Q is the total number of samples in the current scatter plot containing the current data point o (as a rule of thumb, in the LOF abnormal-data detection algorithm the K value is typically chosen as the square root of the total number of samples in the data set, i.e. $K = \sqrt{Q}$); $\rho_R$ is the density of data points within the neighborhood of radius R around the current data point o, i.e. the density of the neighborhood range; $\rho_G$ is the global density of the data points in the current scatter plot, i.e. the density of the scatter plot; and $\lfloor \cdot \rfloor$ is the floor (round-down) function, which ensures that the initial K value is an integer.

If the density of data points in the neighborhood of the current data point o is less than the global density, i.e. the value of $\rho_G / \rho_R$ is greater than 1, a larger $K_o$ should be chosen for the current data point o, so that the information of the data points in a larger reachable neighborhood is considered and the accuracy of abnormal-data judgment increases. If the density of data points in the neighborhood of the current data point o is greater than the global density, i.e. the value of $\rho_G / \rho_R$ is less than 1, a smaller $K_o$ should be chosen, which reduces the amount of calculation in the reachable neighborhood of the current data point o.

At this point the $K_o$ value corresponding to the current data point o has been obtained. In the same way, the adaptive initial K values of all data points in the scatter plot are obtained.
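A minimal Python sketch of this adaptive K selection follows. It assumes that the initial K is the floor of the square root of the sample count scaled by the ratio of global density to neighborhood density, which matches the behavior described above (ratio above 1 gives a larger K, below 1 a smaller K). The density definitions are also assumptions: points per unit area of the neighborhood circle, and points per unit area of the bounding box of the scatter plot. The lower bound of 1 on K is an added guard, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def adaptive_k(q: int, rho_local: float, rho_global: float) -> int:
    """floor(sqrt(q) * rho_global / rho_local), never below 1 (assumed form)."""
    if rho_local <= 0:
        raise ValueError("neighborhood density must be positive")
    return max(1, math.floor(math.sqrt(q) * rho_global / rho_local))


def local_density(points: Sequence[Point], center: Point, radius: float) -> float:
    """Points per unit area inside the circle of the given radius (assumed definition)."""
    inside = sum(1 for p in points if math.dist(p, center) <= radius)
    return inside / (math.pi * radius ** 2)


def global_density(points: Sequence[Point]) -> float:
    """Points per unit area of the axis-aligned bounding box (assumed definition)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if area <= 0:
        raise ValueError("points must span a non-degenerate bounding box")
    return len(points) / area


def initial_k(points: Sequence[Point], center: Point, radius: float) -> int:
    """Adaptive initial K for the data point at center with neighborhood radius."""
    return adaptive_k(len(points), local_density(points, center, radius), global_density(points))
```

With 100 samples, a neighborhood twice as dense as the plot overall gives `adaptive_k(100, 2.0, 1.0) == 5`, while a neighborhood half as dense gives `adaptive_k(100, 1.0, 2.0) == 20`.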

Step S13: a final K value for the final scatter plot is determined based on the initial K value for each data point.

Specifically, the mode of the initial K values for all data points is determined as the final K value of the final scatter plot.
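Taking the mode of the initial K values can be sketched as below. The function name is hypothetical, and breaking ties in favor of the smallest K is an assumption, since the description does not say what happens when several K values are equally frequent.

```python
from collections import Counter
from typing import Sequence


def final_k(initial_ks: Sequence[int]) -> int:
    """Most frequent initial K value; ties resolve to the smallest K (assumption)."""
    if not initial_ks:
        raise ValueError("at least one initial K value is required")
    counts = Counter(initial_ks)
    highest = max(counts.values())
    return min(k for k, c in counts.items() if c == highest)
```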

Step S14: Using the LOF anomaly detection algorithm, perform anomaly detection on the data points in the final scatter plot based on the final K value, and determine the performance of the air filter element according to the anomaly detection result.

The local outlier factor of each data point in the final scatter plot is calculated based on the final K value using the LOF anomaly detection algorithm. If the local outlier factor of a data point is greater than a threshold, which takes the empirical value 1, the data point is an abnormal data point. If any abnormal data point exists in the final scatter plot, the performance of the air filter element is determined to be unqualified.
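The decision step can be illustrated with a plain, quadratic-time version of the standard LOF computation (k-distance, reachability distance, local reachability density, then the factor itself). This is a sketch under stated assumptions rather than the implementation used in the application: the function names `lof_scores` and `filter_is_qualified` are hypothetical, points are assumed to be distinct coordinate tuples, and the threshold is left as a parameter with the empirical default of 1 from the text.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, computed by brute force."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # k-distance neighborhood: all points no farther than the k-th neighbor
        neighbors.append([j for d, j in others if d <= kd])

    lrd = []  # local reachability density
    for i in range(n):
        reach_sum = sum(max(k_dist[j], dist[i][j]) for j in neighbors[i])
        if reach_sum == 0:
            # Assumption of this sketch: duplicate points are not handled.
            raise ValueError("duplicate points are not supported")
        lrd.append(len(neighbors[i]) / reach_sum)

    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


def filter_is_qualified(points, final_k, threshold=1.0):
    """Qualified only if no point has a local outlier factor above the threshold."""
    return all(score <= threshold for score in lof_scores(points, final_k))


# Five clustered readings plus one reading far from the rest.
readings = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(readings, k=3)
print(scores[-1] > 1.0)                       # the distant reading exceeds the threshold
print(filter_is_qualified(readings, final_k=3))
```

Points inside a uniformly dense cluster score close to 1, so the threshold is exposed as a parameter in this sketch.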

In this method, a neighborhood radius of corresponding size is selected according to the degree of density change of the data points during acquisition, and a suitable K value is selected according to the density of data points within the neighborhood. This avoids the problem that, when the LOF algorithm performs anomaly detection on the data, abnormal points with a low degree of abnormality are easily treated as normal data points or normal data points are misjudged as abnormal points. As a result, the accuracy of abnormal data detection is improved, and the effect of detecting the performance of the air filter element is improved.

The foregoing describes only embodiments of the present application and does not thereby limit the scope of the present application. Any equivalent structure or equivalent process transformation made using the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of protection of the present application.

Claims

1. The method for detecting the performance of the air filter element of the pharmaceutical workshop based on data analysis is characterized by comprising the following steps of:

performing anomaly detection on data points in a final scatter diagram based on a final K value by using an LOF anomaly detection algorithm, and determining the performance of the air filter element according to an anomaly detection result;

determining an initial K value for the current data point based on a density of data points of a neighborhood range of the current data point, comprising:

2. The method of claim 1, wherein determining the neighborhood range of the current data point based on the density change of a first scatter plot of data points preceding the current data point and the density change of a second scatter plot of data points following the current data point comprises:

calculating a first density change index of the first scatter plot;

calculating a second density change index of the second scatter plot;

3. The method for detecting air filter performance in a pharmaceutical plant based on data analysis of claim 2, wherein calculating a first density change index for a first scatter plot comprises:

4. The method for detecting air filter performance in a pharmaceutical workshop based on data analysis according to claim 3, wherein determining a first density change index of the first scatter plot based on the first cluster comprises:

5. The method for detecting air filter performance in a pharmaceutical plant based on data analysis of claim 2, wherein calculating a second density change index for a second scatter plot comprises:

6. The method of claim 5, wherein determining a second density change index of the second scatter plot based on the second cluster comprises:

7. The method of claim 2, wherein determining a neighborhood radius for the current data point based on the first density change index, the second density change index, and the cluster of current data points comprises:

8. The method for detecting air filter performance in a pharmaceutical plant based on data analysis of claim 1, wherein determining a final K value for a final scatter plot based on the initial K value for each data point comprises:

9. The method for detecting the performance of an air filter element in a pharmaceutical workshop based on data analysis according to claim 1, wherein the abnormality detection of the data points in the final scatter diagram based on the final K value by using the LOF abnormality detection algorithm, and determining the performance of the air filter element according to the abnormality detection result, comprises: