CN114298147A - Abnormal sample detection method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN114298147A (CN202111396617.6A)
- Authority
- CN
- China
- Prior art keywords
- data sample
- target
- detected
- sample
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The embodiment of the invention provides a method and a device for detecting an abnormal sample, electronic equipment and a storage medium. The method comprises the following steps: acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm; determining a target distance between the data sample to be detected and the center point of the target clustering category, and looking up a target quantile corresponding to the target distance; and comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to the comparison result. In this way, abnormal samples can be separated.
Description
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a method and a device for detecting an abnormal sample, electronic equipment and a storage medium.
Background
In a sample data set, some sample data differ significantly from most of the other sample data; such sample data are generally referred to as outliers or singular samples. In order to ensure the data quality of the sample data set, it is generally necessary to perform anomaly detection on each sample in the sample data set, so as to remove abnormal samples.
In existing methods for detecting abnormal samples, because a data sample set contains multiple categories, the data characteristics cannot be summarized in a unified representation, and the abnormal samples therefore cannot be separated. The data sample set needs to be preliminarily characterized according to its complexity, and a method for detecting abnormal samples that can separate the abnormal samples is urgently needed.
Disclosure of Invention
In order to solve the technical problems that due to the fact that multiple categories exist in a data sample set, data characteristics cannot be summarized in a unified representation mode, and abnormal samples cannot be separated, the embodiment of the invention provides a method and a device for detecting the abnormal samples, electronic equipment and a storage medium.
In a first aspect of the embodiments of the present invention, there is provided a method for detecting an abnormal sample, the method including:
acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm;
determining a target distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target distance;
and comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to a comparison result.
In an optional embodiment, before performing the method, the method further comprises:
acquiring a data sample set, wherein the data sample set at least comprises one data sample;
clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories;
for any one data sample in the data sample set, determining the cluster category corresponding to the data sample, and determining the distance between the data sample and the cluster category central point;
for any one of the cluster categories, determining a distribution of the distances within the cluster category, and determining different quantiles corresponding to the distances within the cluster category.
In an optional embodiment, the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories includes:
acquiring N clustering categories specified by a user, or determining the N clustering categories according to inflection points of the elbow graph;
and clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories.
In an optional embodiment, the determining the distance between the data sample and the cluster category center point includes:
determining Euclidean distances between the data samples and the cluster category center points;
the determining, for any of the cluster categories, the distribution of the distances within the cluster category and the different quantiles corresponding to the distances within the cluster category, includes:
for any one of the cluster categories, determining the distribution of the Euclidean distances in the cluster category, and determining different quantiles corresponding to the Euclidean distances in the cluster category;
the determining the target distance between the data sample to be detected and the target clustering category central point and searching the target quantile corresponding to the target distance comprises the following steps:
determining a target Euclidean distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target Euclidean distance.
In an optional embodiment, the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories includes:
preprocessing all the data samples in the set of data samples, wherein the preprocessing at least includes missing value padding;
all the data samples in the preprocessed data sample set are normalized to obtain normalized data samples corresponding to all the data samples;
clustering all the standardized data samples by using a preset clustering algorithm to generate N clustering categories;
the determining the cluster category corresponding to the data sample and the distance between the data sample and the cluster category center point for any one of the data samples in the data sample set includes:
for any normalized data sample, determining the cluster category corresponding to the normalized data sample and the distance between the normalized data sample and the center point of the cluster category.
In an optional embodiment, the predicting, by using a preset clustering algorithm, a target clustering class corresponding to the data sample to be detected includes:
preprocessing the data sample to be detected, wherein the preprocessing at least comprises missing value filling;
normalizing the preprocessed data sample to be detected to obtain a standardized data sample to be detected;
predicting a target clustering category corresponding to the standardized data sample to be detected by using a preset clustering algorithm;
the determining the target distance between the data sample to be detected and the target cluster category center point includes:
and determining the target distance between the standardized data sample to be detected and the target clustering class central point.
In an optional embodiment, the determining whether the data sample to be detected is an abnormal sample according to the comparison result includes:
if the target quantile is larger than the preset threshold value, determining that the data sample to be detected is an abnormal sample;
and if the target quantile is less than or equal to the preset threshold, determining that the data sample to be detected is a non-abnormal sample.
In a second aspect of the embodiments of the present invention, there is provided an apparatus for detecting an abnormal sample, the apparatus including:
the sample acquisition module is used for acquiring a data sample to be detected;
the class prediction module is used for predicting a target cluster class corresponding to the data sample to be detected by using a preset clustering algorithm;
the distance determining module is used for determining a target distance between the data sample to be detected and the target clustering category central point;
the quantile searching module is used for searching a target quantile corresponding to the target distance;
the quantile comparison module is used for comparing the target quantile with a preset threshold value;
and the sample detection module is used for determining whether the data sample to be detected is an abnormal sample according to the comparison result.
In a third aspect of the embodiments of the present invention, there is further provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method for detecting an abnormal sample according to the first aspect when executing a program stored in a memory.
In a fourth aspect of the embodiments of the present invention, there is also provided a storage medium in which instructions are stored; when the instructions are run on a computer, they cause the computer to execute the method for detecting an abnormal sample described in the first aspect.
In a fifth aspect of the embodiments of the present invention, there is also provided a computer program product containing instructions, which when run on a computer, causes the computer to execute the method for detecting an abnormal sample described in the first aspect.
According to the technical scheme provided by the embodiment of the invention, the data sample to be detected is obtained, the target cluster category corresponding to the data sample to be detected is predicted by using a preset clustering algorithm, the target distance between the data sample to be detected and the center point of the target cluster category is determined, the target quantile corresponding to the target distance is looked up, the target quantile is compared with a preset threshold value, and whether the data sample to be detected is an abnormal sample is determined according to the comparison result, so that the abnormal sample can be separated.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart illustrating an implementation of a data sample processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating an implementation of a method for detecting an abnormal sample according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a target Euclidean distance shown in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an abnormal sample detection device shown in the embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device shown in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
FIG. 1 is a schematic flow chart of an implementation of a data sample processing method according to an embodiment of the present invention. The method is applied to a processor and specifically includes the following steps:
S101, a data sample set is obtained, wherein the data sample set comprises at least one data sample.
In this embodiment of the present invention, the data sample set includes at least one data sample, where the data sample may be any type of data sample, for example, an image type data sample, and this is not limited in this embodiment of the present invention.
For example, in the embodiment of the present invention, a data sample set is obtained, where the data sample set includes 1000 data samples, and the 1000 data samples may be data samples of an image type, and may also be data samples of a text type.
S102, clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories.
In the embodiment of the present invention, for all data samples in the sample data set, a preset clustering algorithm is used to cluster all data samples in the sample data set, so that N cluster categories, for example, 5 cluster categories, can be generated.
Since there may be defects in the data sample, for example, a certain sample feature in the sample data lacks a feature value, missing value padding processing needs to be performed on the data sample at this time, which means that preprocessing needs to be performed on the data sample.
Based on this, in the embodiment of the present invention, for all data samples in the sample data set, all data samples in the sample data set are preprocessed, where the preprocessing at least includes missing value padding.
The missing value padding may use the default padding method of the system, or a padding method set by the user. Given a data set and a feature space, all data samples in the sample data set are preprocessed according to either of these padding methods.
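The padding step can be sketched as follows. This is a minimal illustration only: the source does not name a specific padding method, so the per-feature mean used here, the `None` marker for a missing value, and the function name `fill_missing` are all assumptions.

```python
from statistics import mean

def fill_missing(samples, fill_values=None):
    """Replace None entries with a per-feature fill value.

    Assumption: missing values are marked with None and, unless the caller
    supplies fill_values, each feature is padded with its observed mean.
    """
    n_features = len(samples[0])
    if fill_values is None:
        fill_values = []
        for j in range(n_features):
            observed = [row[j] for row in samples if row[j] is not None]
            # Fall back to 0.0 when a feature is missing in every sample (assumption).
            fill_values.append(mean(observed) if observed else 0.0)
    filled = [
        [fill_values[j] if row[j] is None else row[j] for j in range(n_features)]
        for row in samples
    ]
    return filled, fill_values

data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled, fill_values = fill_missing(data)
print(filled)       # each None replaced by the mean of its column
print(fill_values)  # keep these so a sample to be detected is padded the same way
```

Returning `fill_values` lets the same padding be reused later for the data sample to be detected, by passing it back through the `fill_values` argument.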
In addition, in the embodiment of the present invention, so that the subsequent distance calculation is performed within a uniform scale range, the preprocessed data samples need to be standardized, where standardization may be understood as normalization.
Based on this, in the embodiment of the present invention, all the data samples in the preprocessed data sample set are standardized to obtain the standardized data samples corresponding to all the data samples, which keeps the subsequent distance calculation within a uniform scale range.
The normalization process may be a system default normalization process (e.g., max-min data normalization), or may be a normalization process set by a user, so that all data samples in the preprocessed data sample set are normalized.
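As an illustration of the max-min normalization mentioned above, the following sketch scales every feature to the range [0, 1] using the minimum and maximum observed in the data sample set. The function names and the handling of a constant feature are assumptions made for this example.

```python
def fit_min_max(samples):
    """Record the per-feature minimum and maximum of the data sample set."""
    n_features = len(samples[0])
    mins = [min(row[j] for row in samples) for j in range(n_features)]
    maxs = [max(row[j] for row in samples) for j in range(n_features)]
    return mins, maxs

def apply_min_max(sample, mins, maxs):
    """Scale one sample with previously fitted minima and maxima."""
    return [
        # A constant feature is mapped to 0.0 to avoid division by zero (assumption).
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, lo, hi in zip(sample, mins, maxs)
    ]

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
mins, maxs = fit_min_max(train)
scaled = [apply_min_max(row, mins, maxs) for row in train]
print(scaled)
```

The minima and maxima are fitted once on the data sample set and then reused for a data sample to be detected, so a new sample that lies outside the observed range may map outside [0, 1].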
Thus, through the preprocessing and the standardization, the standardized data samples corresponding to all the data samples can be obtained, and all the standardized data samples are clustered by using a preset clustering algorithm to generate N clustering categories.
It should be noted that, for the preset clustering algorithm, the preset clustering algorithm may specifically be a K-means clustering algorithm in the embodiment of the present invention, and certainly, other relatively mature clustering algorithms in the market may also be used, and the embodiment of the present invention does not limit this.
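A minimal pure-Python sketch of K-means clustering (Lloyd's iteration) is given below for orientation. The source does not specify how the centers are initialized; the deterministic farthest-first initialization, the function name `kmeans`, and the iteration cap are assumptions of this example, not part of the described method.

```python
import math

def kmeans(samples, k, max_iter=100):
    """Cluster samples into k categories; returns (centers, labels)."""
    # Farthest-first initialization (assumption): start from the first sample,
    # then repeatedly add the sample farthest from the centers chosen so far.
    centers = [list(samples[0])]
    while len(centers) < k:
        farthest = max(samples, key=lambda s: min(math.dist(s, c) for c in centers))
        centers.append(list(farthest))

    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        labels = [
            min(range(k), key=lambda i: math.dist(s, centers[i])) for s in samples
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for i in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return centers, labels

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(points, k=2)
print(labels)
```

In practice a library implementation would normally be used; the sketch only makes the assignment and update steps explicit.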
In addition, the number of clusters K (for example, 5 cluster categories) may be set by a user, or an optimal number of clusters may be selected automatically, so as to cluster the data samples; the embodiment of the present invention does not limit this.
Based on this, the N clustering categories specified by a user are obtained, or the N clustering categories are determined according to the inflection point of the elbow graph, so that all standardized data samples are clustered by using a preset clustering algorithm to generate the N clustering categories.
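One possible way to locate the inflection point of an elbow graph is sketched below. The source does not say how the inflection point is found; this example assumes that the within-cluster sum of squared distances has already been computed for several candidate values of K, and it picks the K with the largest discrete second difference. The function name `elbow_k` and the sample figures are hypothetical.

```python
def elbow_k(sse_by_k):
    """Pick the K at which the error curve bends most sharply.

    sse_by_k maps each candidate K to the within-cluster sum of squared
    distances obtained with that K. Heuristic (assumption): choose the K
    with the largest second difference of the curve.
    """
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three candidate values of K")
    best_k, best_curvature = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        curvature = sse_by_k[prev_k] - 2 * sse_by_k[k] + sse_by_k[next_k]
        if curvature > best_curvature:
            best_k, best_curvature = k, curvature
    return best_k

# Hypothetical error values: the curve drops steeply until K = 3, then flattens.
sse = {1: 1000.0, 2: 500.0, 3: 120.0, 4: 100.0, 5: 90.0}
chosen_k = elbow_k(sse)
print(chosen_k)
```

Other knee-detection heuristics exist, and reading the elbow graph by eye remains a common alternative.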
S103, aiming at any one data sample in the data sample set, determining the cluster category corresponding to the data sample, and determining the distance between the data sample and the cluster category central point.
In the embodiment of the invention, for any data sample in the data sample set, the cluster category corresponding to the data sample is determined, and the distance between the data sample and the center point of the cluster category is determined.
For all the data samples in the data sample set, the normalized data samples corresponding to the data samples can be obtained through the preprocessing and the normalization, so that the cluster category corresponding to the normalized data samples is determined for any normalized data sample, and the distance between the normalized data sample and the center point of the cluster category is determined.
For example, for a normalized data sample 1, the cluster class a corresponding to the normalized data sample is determined, and the distance between the normalized data sample 1 and the center point of the cluster class a is determined, and the processing for the remaining normalized data samples is similar to the processing for the normalized data sample 1, so that the distance between each normalized data sample and the center point of its cluster class can be obtained, as shown in table 1 below.
TABLE 1
It should be noted that, in the embodiment of the present invention, the distance may specifically refer to a euclidean distance, so that for any normalized data sample, a cluster class corresponding to the normalized data sample is determined, and a euclidean distance between the normalized data sample and a center point of the cluster class is determined.
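The following sketch shows how the cluster category and the Euclidean distance to its center point might be obtained for one normalized data sample. The helper name `nearest_center` and the example center points are assumptions for illustration.

```python
import math

def nearest_center(sample, centers):
    """Return (cluster index, Euclidean distance to that cluster's center)."""
    distances = [math.dist(sample, center) for center in centers]
    label = min(range(len(centers)), key=distances.__getitem__)
    return label, distances[label]

# Hypothetical center points of two cluster categories in normalized space.
centers = [[0.0, 0.0], [1.0, 1.0]]
label, distance = nearest_center([0.3, 0.4], centers)
print(label, distance)
```

With K-means, the cluster category of a sample is the one whose center point is nearest, so the same helper serves both the data sample set and a new data sample to be detected.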
S104, aiming at any one clustering category, determining the distribution of the distances in the clustering category and determining different quantiles corresponding to the distances in the clustering category.
In the embodiment of the present invention, for any one of the N cluster categories, the distribution of the distances within the cluster category is determined, where a distance refers to the distance between a standardized data sample within the cluster category and the center point of the cluster category.
In addition, for any one of the N cluster categories, the distances within the cluster category are mapped to different quantiles, so that the different quantiles corresponding to the distances within the cluster category can be determined.
For example, for 5 cluster categories, for any one of the 5 cluster categories, the distance inside the cluster category is mapped with 100 quantiles, so that different quantiles corresponding to the distance inside the cluster category can be determined.
In the embodiment of the present invention, the distance refers to a euclidean distance, and for any one of the N clustering categories, the distribution of the euclidean distances within the clustering category is determined, and different quantiles corresponding to the euclidean distances within the clustering category are determined, thereby completing the processing of the data sample set.
For example, for 5 cluster classes, for any one of the 5 cluster classes, a distribution of euclidean distances within the cluster class is determined, where euclidean distance refers to the euclidean distance between each normalized data sample within the cluster class and the center point of the cluster class.
In addition, for 5 cluster categories, aiming at any one of the 5 cluster categories, the Euclidean distance inside the cluster category is mapped with 100 quantiles, so that different quantiles corresponding to the Euclidean distance inside the cluster category can be determined, and the processing of the data sample set is completed.
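The mapping of distances to quantiles inside each cluster category can be sketched as a table of cut points per cluster. The source does not state how the quantiles are computed; the nearest-rank definition used here and the function name `build_quantile_tables` are assumptions.

```python
from collections import defaultdict

def build_quantile_tables(labels, distances, n_quantiles=100):
    """For each cluster, list the distance value at each of n_quantiles quantiles.

    Nearest-rank definition (assumption): the q-th cut point is the smallest
    distance such that at least q / n_quantiles of the cluster's samples lie
    at or below it.
    """
    by_cluster = defaultdict(list)
    for label, distance in zip(labels, distances):
        by_cluster[label].append(distance)

    tables = {}
    for label, ds in by_cluster.items():
        ds.sort()
        n = len(ds)
        tables[label] = [
            # Integer ceiling of q * n / n_quantiles, converted to a 0-based index.
            ds[(q * n + n_quantiles - 1) // n_quantiles - 1]
            for q in range(1, n_quantiles + 1)
        ]
    return tables

labels = [0, 0, 0, 0, 1, 1]
distances = [0.4, 0.1, 0.3, 0.2, 2.0, 1.0]
tables = build_quantile_tables(labels, distances, n_quantiles=4)
print(tables)
```

The example uses 4 quantiles to keep the output short; the description above uses 100.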
After the above processing, FIG. 2 shows a schematic flow chart of an implementation of the method for detecting an abnormal sample according to an embodiment of the present invention. The method is applied to a processor and specifically includes the following steps:
S201, acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm.
In the embodiment of the invention, the data sample to be detected is obtained, wherein the data sample to be detected is a new data sample, and the target cluster category corresponding to the data sample to be detected is predicted by using a preset clustering algorithm, wherein the target cluster category can be any one of the N cluster categories.
In the embodiment of the present invention, the data sample to be detected is preprocessed, where the preprocessing at least includes missing value padding, and specifically, the preprocessing flow is similar to that described above, and details of the embodiment of the present invention are not repeated here.
In addition, normalization processing is performed on the preprocessed data sample to be detected to obtain a standardized data sample to be detected, and specifically, the normalization processing is similar to that described above. Therefore, the target clustering category corresponding to the standardized data sample to be detected can be predicted by utilizing a preset clustering algorithm.
For example, in the embodiment of the present invention, the data sample to be detected is preprocessed, where the preprocessing at least includes missing value padding, and specifically, the data sample to be detected is preprocessed according to a missing value padding manner set by a user from a given data set and a feature space.
The preprocessed data sample to be detected is then normalized by using max-min data normalization to obtain the normalized data sample to be detected, so that the target clustering category (clustering category A) corresponding to the normalized data sample to be detected can be predicted by using the K-means clustering algorithm.
S202, determining a target distance between the data sample to be detected and the target clustering class central point, and searching a target quantile corresponding to the target distance.
In the embodiment of the invention, for the data sample to be detected, the target distance between the data sample to be detected and the target clustering class central point is determined, and the target quantile corresponding to the target distance is searched.
If the target distance is a Euclidean distance, the target Euclidean distance between the data sample to be detected and the center point of the target cluster category is determined, and the target quantile corresponding to the target Euclidean distance is looked up.
In addition, the standardized data sample to be detected is obtained after the data sample to be detected is preprocessed and standardized, so that the target Euclidean distance between the standardized data sample to be detected and the target clustering class central point can be determined, and the target quantile corresponding to the target Euclidean distance is searched.
For example, the data sample to be detected is preprocessed and normalized to obtain a normalized data sample to be detected, a target euclidean distance between the normalized data sample to be detected and a center point of a target cluster type (cluster type a) is determined, as shown in fig. 3, and a target quantile corresponding to the target euclidean distance is searched.
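Looking up the target quantile can be sketched as a binary search over the sorted cut points of the target cluster category. The convention chosen here (the first quantile whose cut point is not smaller than the target distance, capped at the highest quantile) and the name `lookup_quantile` are assumptions, since the source does not define the lookup rule.

```python
from bisect import bisect_left

def lookup_quantile(distance, cut_points):
    """Return the 1-based quantile whose cut point first reaches the distance.

    cut_points must be sorted in ascending order. A distance larger than every
    cut point is assigned the highest quantile (assumption).
    """
    position = bisect_left(cut_points, distance) + 1
    return min(position, len(cut_points))

# Hypothetical cut points of one cluster category, using 4 quantiles for brevity.
cut_points = [0.1, 0.2, 0.3, 0.4]
print(lookup_quantile(0.25, cut_points))
print(lookup_quantile(0.90, cut_points))
```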
S203, comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to the comparison result.
In the embodiment of the invention, a threshold value can be preset by a user, the target quantile corresponding to the target Euclidean distance is compared with the threshold value set by the user, and whether the data sample to be detected is an abnormal sample or not is determined according to the comparison result.
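If the target quantile is greater than the preset threshold, the data sample to be detected is determined to be an abnormal sample; if the target quantile is less than or equal to the preset threshold, the data sample to be detected is determined not to be an abnormal sample, namely a normal sample.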
For example, for a target quantile corresponding to the target euclidean distance, if the target quantile is greater than 5, the data sample to be detected is determined to be an abnormal sample, and if the target quantile is less than or equal to 5, the data sample to be detected is determined to be a non-abnormal sample, i.e., a normal sample.
Through the above description of the technical scheme provided by the embodiment of the invention, the data sample to be detected is obtained, the target cluster category corresponding to the data sample to be detected is predicted by using the preset clustering algorithm, the target distance between the data sample to be detected and the center point of the target cluster category is determined, the target quantile corresponding to the target distance is looked up, the target quantile is compared with the preset threshold value, and whether the data sample to be detected is an abnormal sample is determined according to the comparison result, so that the abnormal sample can be separated.
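Steps S201 to S203 can be put together in one short sketch. It assumes the minima and maxima, the cluster center points and the per-cluster quantile cut points were all produced beforehand from the data sample set; missing value padding is omitted for brevity. Every name and number below is hypothetical and chosen only for illustration.

```python
import math
from bisect import bisect_left

def detect(sample, mins, maxs, centers, tables, threshold):
    """Run S201-S203 for one data sample that has no missing values."""
    # S201: normalize, then predict the target cluster category (nearest center).
    x = [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(sample, mins, maxs)
    ]
    distances = [math.dist(x, center) for center in centers]
    label = min(range(len(centers)), key=distances.__getitem__)
    # S202: target distance and the quantile it falls into within that cluster.
    cut_points = tables[label]
    quantile = min(bisect_left(cut_points, distances[label]) + 1, len(cut_points))
    # S203: abnormal if the target quantile exceeds the preset threshold.
    return {
        "cluster": label,
        "distance": distances[label],
        "quantile": quantile,
        "abnormal": quantile > threshold,
    }

mins, maxs = [0.0, 0.0], [10.0, 10.0]
centers = [[0.1, 0.1], [0.9, 0.9]]
cuts = [0.02, 0.04, 0.06, 0.08, 0.10]       # 5 quantiles per cluster, for brevity
tables = {0: cuts, 1: cuts}

near = detect([1.0, 1.0], mins, maxs, centers, tables, threshold=4)
far = detect([3.0, 1.0], mins, maxs, centers, tables, threshold=4)
print(near["abnormal"], far["abnormal"])
```

The first sample sits on the center point of its cluster and falls into the lowest quantile; the second lies farther from the center than any cut point and is flagged as abnormal.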
Corresponding to the above method embodiment, an embodiment of the present invention further provides an apparatus for detecting an abnormal sample. As shown in FIG. 4, the apparatus may include: a sample acquisition module 410, a category prediction module 420, a distance determination module 430, a quantile search module 440, a quantile comparison module 450 and a sample detection module 460.
A sample obtaining module 410, configured to obtain a data sample to be detected;
the category prediction module 420 is configured to predict a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm;
a distance determining module 430, configured to determine a target distance between the data sample to be detected and the target cluster category center point;
a quantile search module 440, configured to search for a target quantile corresponding to the target distance;
a quantile comparison module 450, configured to compare the target quantile with a preset threshold;
and the sample detection module 460 is configured to determine whether the data sample to be detected is an abnormal sample according to the comparison result.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 51, a communication interface 52, a memory 53, and a communication bus 54, where the processor 51, the communication interface 52, and the memory 53 communicate with one another through the communication bus 54;
the memory 53 is configured to store a computer program;
the processor 51 is configured to implement the following steps when executing the program stored in the memory 53:
acquiring a data sample to be detected, and predicting a target cluster category corresponding to the data sample to be detected by using a preset clustering algorithm; determining a target distance between the data sample to be detected and the center point of the target cluster category, and searching for a target quantile corresponding to the target distance; and comparing the target quantile with a preset threshold, and determining whether the data sample to be detected is an abnormal sample according to the comparison result.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, there is also provided a storage medium having instructions stored therein, which when run on a computer, cause the computer to execute the method for detecting an abnormal sample according to any one of the above embodiments.
In yet another embodiment, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform the method for detecting an abnormal sample as described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a storage medium or transmitted from one storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described briefly; for relevant details, refer to the corresponding description of the method embodiment.
The above description is merely a preferred embodiment of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for detecting an abnormal sample, the method comprising:
acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm;
determining a target distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target distance;
and comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to a comparison result.
2. The method of claim 1, further comprising, prior to performing the method:
acquiring a data sample set, wherein the data sample set comprises at least one data sample;
clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories;
for any one data sample in the data sample set, determining the cluster category corresponding to the data sample, and determining the distance between the data sample and the cluster category central point;
for any one of the cluster categories, determining a distribution of the distances within the cluster category, and determining different quantiles corresponding to the distances within the cluster category.
3. The method of claim 2, wherein the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N cluster categories comprises:
acquiring the number N of clustering categories specified by a user, or determining the number N of clustering categories according to the inflection point of an elbow graph;
and clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories.
4. The method of claim 2, wherein determining the distance between the data sample and the cluster class center point comprises:
determining Euclidean distances between the data samples and the cluster category center points;
the determining, for any of the cluster categories, the distribution of the distances within the cluster category and the different quantiles corresponding to the distances within the cluster category, includes:
for any one of the cluster categories, determining the distribution of the Euclidean distances in the cluster category, and determining different quantiles corresponding to the Euclidean distances in the cluster category;
the determining the target distance between the data sample to be detected and the target clustering category central point and searching the target quantile corresponding to the target distance comprises the following steps:
determining a target Euclidean distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target Euclidean distance.
5. The method according to any one of claims 2 to 4, wherein the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N cluster categories comprises:
preprocessing all the data samples in the data sample set, wherein the preprocessing at least comprises missing value filling;
standardizing all the data samples in the preprocessed data sample set to obtain standardized data samples corresponding to all the data samples;
clustering all the standardized data samples by using a preset clustering algorithm to generate N clustering categories;
the determining the cluster category corresponding to the data sample and the distance between the data sample and the cluster category center point for any one of the data samples in the data sample set includes:
for any one standardized data sample, determining the cluster category corresponding to the standardized data sample and the distance between the standardized data sample and the cluster category center point.
6. The method according to claim 5, wherein the predicting the target cluster category corresponding to the data sample to be detected by using a preset clustering algorithm comprises:
preprocessing the data sample to be detected, wherein the preprocessing at least comprises missing value filling;
standardizing the preprocessed data sample to be detected to obtain a standardized data sample to be detected;
predicting a target clustering category corresponding to the standardized data sample to be detected by using a preset clustering algorithm;
the determining the target distance between the data sample to be detected and the target cluster category center point includes:
and determining the target distance between the standardized data sample to be detected and the target clustering class central point.
7. The method according to claim 1, wherein the determining whether the data sample to be detected is an abnormal sample according to the comparison result comprises:
if the target quantile is larger than the preset threshold value, determining that the data sample to be detected is an abnormal sample;
and if the target quantile is less than or equal to the preset threshold, determining that the data sample to be detected is a non-abnormal sample.
8. An apparatus for detecting an abnormal sample, the apparatus comprising:
the sample acquisition module is used for acquiring a data sample to be detected;
the class prediction module is used for predicting a target cluster class corresponding to the data sample to be detected by using a preset clustering algorithm;
the distance determining module is used for determining a target distance between the data sample to be detected and the target clustering category central point;
the quantile searching module is used for searching a target quantile corresponding to the target distance;
the quantile comparison module is used for comparing the target quantile with a preset threshold value;
and the sample detection module is used for determining whether the data sample to be detected is an abnormal sample according to the comparison result.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 7 when executing a program stored on a memory.
10. A storage medium on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
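For orientation, the preparation stage recited in claims 2 to 6 (missing value filling, standardization, clustering into N categories, and per-category distance quantiles) can be sketched in plain Python. This is a minimal sketch under stated assumptions: missing values are filled with the column mean, standardization is a z-score, the preset clustering algorithm is taken to be a basic k-means seeded with the first N rows, and the quantile levels 0.5, 0.9, 0.95, and 0.99 are arbitrary. None of these choices, nor any of the function names, is fixed by the claims.

```python
import math
import statistics


def fill_missing(rows):
    """Replace None entries with the column mean (assumed filling strategy)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(v for v in col if v is not None) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def standardize(rows):
    """Z-score each column; the returned means and stds must be reused at detection time."""
    cols = list(zip(*rows))
    means = [statistics.fmean(col) for col in cols]
    stds = [statistics.pstdev(col) or 1.0 for col in cols]  # avoid division by zero
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def assign(rows, centers):
    """Label each row with the index of its nearest center (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda k: math.dist(r, centers[k])) for r in rows]


def kmeans(rows, n_clusters, n_iter=50):
    """Basic k-means; assumes len(rows) >= n_clusters and uses the first rows as seeds."""
    centers = [list(r) for r in rows[:n_clusters]]
    for _ in range(n_iter):
        labels = assign(rows, centers)
        new_centers = []
        for k in range(n_clusters):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            # An empty cluster keeps its previous center.
            new_centers.append([statistics.fmean(c) for c in zip(*members)] if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(rows, centers)


def distance_quantiles(rows, centers, labels, probs=(0.5, 0.9, 0.95, 0.99)):
    """Per cluster, map each quantile level to the member distance at that level."""
    table = []
    for k, center in enumerate(centers):
        dists = sorted(math.dist(r, center) for r, lab in zip(rows, labels) if lab == k)
        table.append({p: dists[min(len(dists) - 1, int(p * len(dists)))] for p in probs} if dists else {})
    return table


# Hypothetical toy data: two well-separated groups of three points each.
data = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(data, n_clusters=2)
table = distance_quantiles(data, centers, labels)
```

In a full pipeline, `fill_missing` and `standardize` would run before `kmeans`, and a sample to be detected would be filled and scaled with the same column means and standard deviations before its distance is compared against the stored quantiles.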
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111396617.6A CN114298147A (en) | 2021-11-23 | 2021-11-23 | Abnormal sample detection method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111396617.6A CN114298147A (en) | 2021-11-23 | 2021-11-23 | Abnormal sample detection method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114298147A true CN114298147A (en) | 2022-04-08 |
Family
ID=80966229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111396617.6A (Pending) CN114298147A (en) | Abnormal sample detection method and device, electronic equipment and storage medium | 2021-11-23 | 2021-11-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114298147A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3156941A1 (en) * | 2015-10-12 | 2017-04-19 | Siemens Aktiengesellschaft | System, method and a computer program product for analyzing data |
CN109978070A (en) * | 2019-04-03 | 2019-07-05 | 北京市天元网络技术股份有限公司 | A kind of improved K-means rejecting outliers method and device |
CN111814523A (en) * | 2019-04-12 | 2020-10-23 | 北京京东尚科信息技术有限公司 | Human body activity recognition method and device |
CN111797887A (en) * | 2020-04-16 | 2020-10-20 | 中国电力科学研究院有限公司 | Anti-electricity-stealing early warning method and system based on density screening and K-means clustering |
CN112001409A (en) * | 2020-07-01 | 2020-11-27 | 中国电力科学研究院有限公司 | Power distribution network line loss abnormity diagnosis method and system based on K-means clustering algorithm |
CN111814910A (en) * | 2020-08-12 | 2020-10-23 | 中国工商银行股份有限公司 | Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium |
CN112905412A (en) * | 2021-01-29 | 2021-06-04 | 清华大学 | Method and device for detecting abnormity of key performance index data |
Non-Patent Citations (1)
Title |
---|
Cheng Mingchang et al.: "Dynamic K-means Algorithm Based on Quantile Radius", Journal of Nanjing University (Natural Science) * |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110008080B (en) | Business index anomaly detection method and device based on time sequence and electronic equipment | |
CN110083475B (en) | Abnormal data detection method and device | |
CN111538642B (en) | Abnormal behavior detection method and device, electronic equipment and storage medium | |
CN107798047B (en) | Repeated work order detection method, device, server and medium | |
CN108829715B (en) | Method, apparatus, and computer-readable storage medium for detecting abnormal data | |
CN112818066A (en) | Time sequence data anomaly detection method and device, electronic equipment and storage medium | |
CN107911397B (en) | Threat assessment method and device | |
CN111062013A (en) | Account filtering method and device, electronic equipment and machine-readable storage medium | |
CN108399115B (en) | Operation and maintenance operation detection method and device and electronic equipment | |
CN109740621B (en) | Video classification method, device and equipment | |
CN111339137A (en) | Data verification method and device | |
CN114662602A (en) | Outlier detection method and device, electronic equipment and storage medium | |
CN113918438A (en) | Method and device for detecting server abnormality, server and storage medium | |
CN111814557A (en) | Action flow detection method, device, equipment and storage medium | |
CN114463345A (en) | Multi-parameter mammary gland magnetic resonance image segmentation method based on dynamic self-adaptive network | |
CN112307086B (en) | Automatic data verification method and device in fire service | |
CN112995765A (en) | Network resource display method and device | |
CN115932144B (en) | Chromatograph performance detection method, chromatograph performance detection device, chromatograph performance detection equipment and computer medium | |
CN114298147A (en) | Abnormal sample detection method and device, electronic equipment and storage medium | |
CN116661954A (en) | Virtual machine abnormality prediction method, device, communication equipment and storage medium | |
CN110795308A (en) | Server inspection method, device, equipment and storage medium | |
US20230245421A1 (en) | Face clustering method and apparatus, classification storage method, medium and electronic device | |
CN114429177A (en) | Equipment fingerprint feature screening method and device, electronic equipment and storage medium | |
CN113763305B (en) | Method and device for calibrating defect of article and electronic equipment | |
CN114398228A (en) | Method and device for predicting equipment resource use condition and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20220408 |