CN114298147A - Abnormal sample detection method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114298147A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111396617.6A
Other languages
Chinese (zh)
Inventor
林建明
杨懿宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wuyu Technology Co ltd
Original Assignee
Shenzhen Wuyu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wuyu Technology Co ltd
Priority to CN202111396617.6A
Publication of CN114298147A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a method and a device for detecting an abnormal sample, electronic equipment and a storage medium. The method comprises the following steps: acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm; determining a target distance between the data sample to be detected and the central point of the target clustering category, and searching a target quantile corresponding to the target distance; and comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to the comparison result. Because the data sample to be detected is first assigned to a clustering category and its distance to the central point of that category is then mapped to a quantile and compared with the preset threshold value, the abnormal sample can be separated.

Description

Abnormal sample detection method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a method and a device for detecting an abnormal sample, electronic equipment and a storage medium.
Background
In a sample data set, some samples differ significantly from most of the other samples; such samples are generally called outliers or abnormal samples. To ensure the data quality of the sample data set, anomaly detection is generally performed on each sample in the set so that abnormal samples can be removed.
In existing methods for detecting abnormal samples, a data sample set may contain multiple categories, so its data characteristics cannot be summarized in a single unified representation and the abnormal samples cannot be separated. The data sample set therefore needs a preliminary identification that takes its complexity into account, and a method for detecting abnormal samples that can separate them is urgently needed.
Disclosure of Invention
In order to solve the technical problems that due to the fact that multiple categories exist in a data sample set, data characteristics cannot be summarized in a unified representation mode, and abnormal samples cannot be separated, the embodiment of the invention provides a method and a device for detecting the abnormal samples, electronic equipment and a storage medium.
In a first aspect of the embodiments of the present invention, there is provided a method for detecting an abnormal sample, the method including:
acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm;
determining a target distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target distance;
and comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to a comparison result.
In an optional embodiment, before performing the method, the method further comprises:
acquiring a data sample set, wherein the data sample set at least comprises one data sample;
clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories;
for any one data sample in the data sample set, determining the cluster category corresponding to the data sample, and determining the distance between the data sample and the cluster category central point;
for any one of the cluster categories, determining a distribution of the distances within the cluster category, and determining different quantiles corresponding to the distances within the cluster category.
In an optional embodiment, the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories includes:
acquiring N clustering categories specified by a user, or determining the N clustering categories according to inflection points of the elbow graph;
and clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories.
In an optional embodiment, the determining the distance between the data sample and the cluster category center point includes:
determining Euclidean distances between the data samples and the cluster category center points;
the determining, for any of the cluster categories, the distribution of the distances within the cluster category and the different quantiles corresponding to the distances within the cluster category, includes:
for any one of the cluster categories, determining the distribution of the Euclidean distances in the cluster category, and determining different quantiles corresponding to the Euclidean distances in the cluster category;
the determining the target distance between the data sample to be detected and the target clustering category central point and searching the target quantile corresponding to the target distance comprises the following steps:
determining a target Euclidean distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target Euclidean distance.
In an optional embodiment, the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories includes:
preprocessing all the data samples in the set of data samples, wherein the preprocessing at least includes missing value padding;
normalizing all the data samples in the preprocessed data sample set to obtain standardized data samples corresponding to all the data samples;
clustering all the standardized data samples by using a preset clustering algorithm to generate N clustering categories;
the determining the cluster category corresponding to the data sample and the distance between the data sample and the cluster category center point for any one of the data samples in the data sample set includes:
for any standardized data sample, determining the cluster category corresponding to the standardized data sample and the distance between the standardized data sample and the center point of the cluster category.
In an optional embodiment, the predicting, by using a preset clustering algorithm, a target clustering class corresponding to the data sample to be detected includes:
preprocessing the data sample to be detected, wherein the preprocessing at least comprises missing value filling;
normalizing the preprocessed data sample to be detected to obtain a standardized data sample to be detected;
predicting a target clustering category corresponding to the standardized data sample to be detected by using a preset clustering algorithm;
the determining the target distance between the data sample to be detected and the target cluster category center point includes:
and determining the target distance between the standardized data sample to be detected and the target clustering class central point.
In an optional embodiment, the determining whether the data sample to be detected is an abnormal sample according to the comparison result includes:
if the target quantile is larger than the preset threshold value, determining that the data sample to be detected is an abnormal sample;
and if the target quantile is less than or equal to the preset threshold, determining that the data sample to be detected is a non-abnormal sample.
In a second aspect of the embodiments of the present invention, there is provided an apparatus for detecting an abnormal sample, the apparatus including:
the sample acquisition module is used for acquiring a data sample to be detected;
the class prediction module is used for predicting a target cluster class corresponding to the data sample to be detected by using a preset clustering algorithm;
the distance determining module is used for determining a target distance between the data sample to be detected and the target clustering category central point;
the quantile searching module is used for searching a target quantile corresponding to the target distance;
the quantile comparison module is used for comparing the target quantile with a preset threshold value;
and the sample detection module is used for determining whether the data sample to be detected is an abnormal sample according to the comparison result.
In a third aspect of the embodiments of the present invention, there is further provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method for detecting an abnormal sample according to the first aspect when executing a program stored in a memory.
In a fourth aspect of the embodiments of the present invention, there is also provided a storage medium, in which instructions are stored, and when the storage medium is run on a computer, the storage medium causes the computer to execute the method for detecting an abnormal sample described in the first aspect.
In a fifth aspect of the embodiments of the present invention, there is also provided a computer program product containing instructions, which when run on a computer, causes the computer to execute the method for detecting an abnormal sample described in the first aspect.
According to the technical scheme provided by the embodiment of the invention, the data sample to be detected is obtained, the target cluster category corresponding to the data sample to be detected is predicted by using a preset clustering algorithm, the target distance between the data sample to be detected and the center point of the target cluster category is determined, the target quantile corresponding to the target distance is searched, the target quantile is compared with a preset threshold value, and whether the data sample to be detected is an abnormal sample is determined according to the comparison result. Predicting a target cluster category corresponding to a data sample to be detected through a preset clustering algorithm, determining a target distance between the data sample to be detected and a target cluster category central point, searching a target quantile corresponding to the target distance, comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to a comparison result, so that the abnormal sample can be separated.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic flow chart illustrating an implementation of a data sample processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating an implementation of a method for detecting an abnormal sample according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a target Euclidean distance shown in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an abnormal sample detection device shown in the embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device shown in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
FIG. 1 is a schematic flow chart of an implementation of a data sample processing method according to an embodiment of the present invention. The method is applied to a processor and specifically includes the following steps:
S101, acquiring a data sample set, wherein the data sample set comprises at least one data sample.
In this embodiment of the present invention, the data sample set includes at least one data sample, where the data sample may be any type of data sample, for example, an image type data sample, and this is not limited in this embodiment of the present invention.
For example, in the embodiment of the present invention, a data sample set is obtained, where the data sample set includes 1000 data samples, and the 1000 data samples may be data samples of an image type, and may also be data samples of a text type.
S102, clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories.
In the embodiment of the present invention, all data samples in the data sample set are clustered by using a preset clustering algorithm, so that N cluster categories, for example 5 cluster categories, are generated.
A data sample may be defective; for example, a certain sample feature in the data sample may lack a feature value. In that case missing value padding needs to be performed on the data sample, which means that the data sample needs to be preprocessed.
Based on this, in the embodiment of the present invention, all data samples in the data sample set are preprocessed, where the preprocessing at least includes missing value padding.
The missing value padding may use the default padding method of the system or a padding method set by the user, so that, given the data set and the feature space, all data samples in the data sample set are preprocessed according to either of these padding methods.
In addition, in the embodiment of the present invention, so that the subsequent distance calculation is carried out within a uniform scale range, the preprocessed data samples need to be standardized, where standardization may be understood as normalization.
Based on this, in the embodiment of the present invention, all the data samples in the preprocessed data sample set are normalized to obtain the standardized data samples corresponding to all the data samples, which keeps the subsequent distance calculation within a uniform scale range.
The normalization may be the default normalization of the system (e.g., max-min data normalization) or a normalization set by the user, so that all data samples in the preprocessed data sample set are normalized.
Thus, through the preprocessing and the standardization, the standardized data samples corresponding to all the data samples can be obtained, and all the standardized data samples are clustered by using a preset clustering algorithm to generate N clustering categories.
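The preprocessing and standardization described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the source leaves the padding method open, so filling a missing value with the column mean is an assumption made here, and the function names fill_missing, fit_min_max and min_max_scale are hypothetical. The scaling itself is the max-min normalization named in the text.

```python
from typing import List, Optional, Tuple

def fill_missing(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Replace None with the mean of the observed values in the same column.

    Assumption: mean imputation stands in for the unspecified padding method.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def fit_min_max(rows: List[List[float]]) -> List[Tuple[float, float]]:
    """Learn the per-column (min, max) from the data sample set."""
    return [(min(col), max(col)) for col in zip(*rows)]

def min_max_scale(rows: List[List[float]], bounds: List[Tuple[float, float]]) -> List[List[float]]:
    """Scale each column to [0, 1]; a constant column maps to 0.0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(r, bounds)]
        for r in rows
    ]

samples = [[1.0, 10.0], [None, 20.0], [3.0, 30.0]]
filled = fill_missing(samples)        # the missing value becomes 2.0
bounds = fit_min_max(filled)          # [(1.0, 3.0), (10.0, 30.0)]
scaled = min_max_scale(filled, bounds)
print(scaled)                         # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Keeping the learned bounds separate from the scaling step allows the same bounds to be reused later for a data sample to be detected, so that it lands on the same scale as the data sample set.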
It should be noted that, for the preset clustering algorithm, the preset clustering algorithm may specifically be a K-means clustering algorithm in the embodiment of the present invention, and certainly, other relatively mature clustering algorithms in the market may also be used, and the embodiment of the present invention does not limit this.
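For readers unfamiliar with K-means, a compact sketch of the standard iteration (assign each sample to its nearest center point, then move each center point to the mean of its members) is given below. It is illustrative only: seeding the center points with the first samples is a simplification assumed here, and the names nearest_center and k_means are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def nearest_center(x: Sequence[float], centers: List[List[float]]) -> int:
    """Index of the center point closest to x in Euclidean distance."""
    return min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))

def k_means(points: List[List[float]], n_clusters: int,
            n_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd iteration; returns (center points, cluster label per sample)."""
    # Assumption: the first n_clusters samples seed the center points.
    centers = [list(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [nearest_center(p, centers) for p in points]
        new_centers = []
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: no center point moved
            break
        centers = new_centers
    return centers, labels

points = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.2], [1.0, 0.8]]
centers, labels = k_means(points, n_clusters=2)
print(labels)  # [0, 1, 0, 1]
```

The nearest_center helper is also what "predicting" a cluster category for a new sample amounts to once the center points are fixed.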
In addition, the number of clusters K (for example, K = 5) may be set by a user, or an optimal number of clusters may be selected automatically, so as to implement the clustering of the data samples, which is not limited in the embodiment of the present invention.
Based on this, the N clustering categories specified by the user are obtained, or N is determined according to the inflection point of the elbow graph, and all standardized data samples are then clustered by using the preset clustering algorithm to generate the N clustering categories.
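The source does not say how the inflection point of the elbow graph is located. One common heuristic, assumed here purely for illustration, is to take the number of clusters at which the curve of within-cluster sum of squared distances bends most sharply, measured by the largest second difference. The function name elbow_k and the inertia values below are made up for the example.

```python
from typing import List

def elbow_k(ks: List[int], inertias: List[float]) -> int:
    """Return the K where the inertia curve has its largest second difference.

    Assumption: ks is evenly spaced and inertias[i] is the within-cluster
    sum of squared distances obtained with ks[i] clusters.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (K, inertia) pairs")
    best_k, best_bend = ks[1], float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Toy values for illustration only, not measured results.
ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 700.0, 200.0, 180.0, 170.0, 165.0]
print(elbow_k(ks, inertias))  # 3
```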
S103, for any one data sample in the data sample set, determining the cluster category corresponding to the data sample, and determining the distance between the data sample and the cluster category central point.
In the embodiment of the invention, for any data sample in the data sample set, the cluster category corresponding to the data sample is determined, and the distance between the data sample and the center point of the cluster category is determined.
For all the data samples in the data sample set, the normalized data samples corresponding to the data samples can be obtained through the preprocessing and the normalization, so that the cluster category corresponding to the normalized data samples is determined for any normalized data sample, and the distance between the normalized data sample and the center point of the cluster category is determined.
For example, for a normalized data sample 1, the cluster class a corresponding to the normalized data sample is determined, and the distance between the normalized data sample 1 and the center point of the cluster class a is determined, and the processing for the remaining normalized data samples is similar to the processing for the normalized data sample 1, so that the distance between each normalized data sample and the center point of its cluster class can be obtained, as shown in table 1 below.
TABLE 1 (table image in the original publication; contents not reproduced in this text)
It should be noted that, in the embodiment of the present invention, the distance may specifically refer to a euclidean distance, so that for any normalized data sample, a cluster class corresponding to the normalized data sample is determined, and a euclidean distance between the normalized data sample and a center point of the cluster class is determined.
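Written out, the Euclidean distance referred to above takes the usual form. The notation is introduced here for illustration and does not appear in the source: x is a standardized data sample with m feature values, and c_k is the center point of its cluster category k.

```latex
d(x, c_k) = \lVert x - c_k \rVert_2 = \sqrt{\sum_{j=1}^{m} \left( x_j - c_{k,j} \right)^2}
```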
S104, for any one clustering category, determining the distribution of the distances in the clustering category and determining different quantiles corresponding to the distances in the clustering category.
In the embodiment of the present invention, for any one of the N cluster categories, the distribution of distances within the cluster category is determined, where a distance refers to the distance between each standardized data sample within the cluster category and the center point of the cluster category.
In addition, for any one of the N cluster categories, the distances inside the cluster category are mapped to different quantiles, so that the different quantiles corresponding to the distances inside the cluster category can be determined.
For example, with 5 cluster categories, for any one of the 5 cluster categories the distances inside the cluster category are mapped to 100 quantiles, so that the different quantiles corresponding to the distances inside the cluster category can be determined.
In the embodiment of the present invention, the distance refers to a euclidean distance, and for any one of the N clustering categories, the distribution of the euclidean distances within the clustering category is determined, and different quantiles corresponding to the euclidean distances within the clustering category are determined, thereby completing the processing of the data sample set.
For example, for 5 cluster classes, for any one of the 5 cluster classes, a distribution of euclidean distances within the cluster class is determined, where euclidean distance refers to the euclidean distance between each normalized data sample within the cluster class and the center point of the cluster class.
In addition, for any one of the 5 cluster categories, the Euclidean distances inside the cluster category are mapped to 100 quantiles, so that the different quantiles corresponding to the Euclidean distances inside the cluster category can be determined, and the processing of the data sample set is completed.
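One way to realize the per-category mapping to 100 quantiles is sketched below, using statistics.quantiles from the Python standard library. Note that dividing a distribution into 100 equal-probability intervals yields 99 cut points. The function name quantile_tables and the choice of the "inclusive" interpolation method are assumptions made for this sketch, and each cluster category is assumed to contain at least two samples.

```python
import statistics
from collections import defaultdict
from typing import Dict, List

def quantile_tables(distances: List[float], labels: List[int],
                    n_quantiles: int = 100) -> Dict[int, List[float]]:
    """For each cluster label, return the n_quantiles - 1 distance cut points."""
    by_cluster: Dict[int, List[float]] = defaultdict(list)
    for d, lab in zip(distances, labels):
        by_cluster[lab].append(d)
    return {
        lab: statistics.quantiles(ds, n=n_quantiles, method="inclusive")
        for lab, ds in by_cluster.items()
    }

# Toy input: one cluster whose distances are 0.00, 0.01, ..., 1.00.
distances = [i / 100 for i in range(101)]
labels = [0] * len(distances)
tables = quantile_tables(distances, labels)
print(len(tables[0]))  # 99
```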
After the above processing, FIG. 2 shows a schematic flow chart of an implementation of the method for detecting an abnormal sample according to the embodiment of the present invention. The method is applied to a processor and specifically includes the following steps:
S201, acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm.
In the embodiment of the invention, the data sample to be detected is obtained, wherein the data sample to be detected is a new data sample, and the target cluster category corresponding to the data sample to be detected is predicted by using a preset clustering algorithm, wherein the target cluster category can be any one of the N cluster categories.
In the embodiment of the present invention, the data sample to be detected is preprocessed, where the preprocessing at least includes missing value padding, and specifically, the preprocessing flow is similar to that described above, and details of the embodiment of the present invention are not repeated here.
In addition, normalization processing is performed on the preprocessed data sample to be detected to obtain a standardized data sample to be detected, and specifically, the normalization processing is similar to that described above. Therefore, the target clustering category corresponding to the standardized data sample to be detected can be predicted by utilizing a preset clustering algorithm.
For example, in the embodiment of the present invention, the data sample to be detected is preprocessed, where the preprocessing at least includes missing value padding; specifically, given the data set and the feature space, the data sample to be detected is preprocessed according to the missing value padding method set by the user.
The preprocessed data sample to be detected is then normalized by using max-min data normalization to obtain the standardized data sample to be detected, so that the target clustering category (clustering category A) corresponding to the standardized data sample to be detected can be predicted by using the K-means clustering algorithm.
S202, determining a target distance between the data sample to be detected and the target clustering class central point, and searching a target quantile corresponding to the target distance.
In the embodiment of the invention, for the data sample to be detected, the target distance between the data sample to be detected and the target clustering class central point is determined, and the target quantile corresponding to the target distance is searched.
If the target distance is a Euclidean distance, the target Euclidean distance between the data sample to be detected and the target cluster category central point is determined, and the target quantile corresponding to the target Euclidean distance is searched.
In addition, the standardized data sample to be detected is obtained after the data sample to be detected is preprocessed and standardized, so that the target Euclidean distance between the standardized data sample to be detected and the target clustering class central point can be determined, and the target quantile corresponding to the target Euclidean distance is searched.
For example, the data sample to be detected is preprocessed and normalized to obtain a standardized data sample to be detected, the target Euclidean distance between the standardized data sample to be detected and the center point of the target cluster category (cluster category A) is determined, as shown in FIG. 3, and the target quantile corresponding to the target Euclidean distance is searched.
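Searching the target quantile can be sketched as a binary search over the sorted cut points of the target cluster category. The source does not specify the lookup mechanism, so the use of bisect, the 1-based numbering of quantiles and the function name lookup_quantile are all assumptions of this sketch.

```python
import bisect
from typing import List

def lookup_quantile(distance: float, cut_points: List[float]) -> int:
    """Return the 1-based quantile interval that contains the distance.

    With 99 sorted cut points the result ranges from 1 (closest to the
    center point) to 100 (farther than every cut point).
    """
    return bisect.bisect_left(cut_points, distance) + 1

# Toy cut points 0.01, 0.02, ..., 0.99 for one cluster category.
cut_points = [i / 100 for i in range(1, 100)]
print(lookup_quantile(0.005, cut_points))  # 1
print(lookup_quantile(0.955, cut_points))  # 96
print(lookup_quantile(2.0, cut_points))    # 100
```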
S203, comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to the comparison result.
In the embodiment of the invention, a threshold value can be preset by a user, the target quantile corresponding to the target Euclidean distance is compared with the threshold value set by the user, and whether the data sample to be detected is an abnormal sample is determined according to the comparison result.
If the target quantile is greater than the preset threshold, the data sample to be detected is determined to be an abnormal sample; if the target quantile is smaller than or equal to the preset threshold, the data sample to be detected is determined to be a non-abnormal sample, namely a normal sample.
For example, for a target quantile corresponding to the target euclidean distance, if the target quantile is greater than 5, the data sample to be detected is determined to be an abnormal sample, and if the target quantile is less than or equal to 5, the data sample to be detected is determined to be a non-abnormal sample, i.e., a normal sample.
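Steps S201 to S203 can be put together in a short sketch, assuming the center points and per-category cut points have already been produced from the data sample set and the sample to be detected has already been preprocessed and standardized. The function name detect, the toy center points and cut points, and the threshold value 95 used in the demonstration are illustrative choices made here, not values from the source.

```python
import bisect
import math
from typing import Dict, List, Sequence, Tuple

def detect(sample: Sequence[float], centers: List[List[float]],
           cut_points_by_cluster: Dict[int, List[float]],
           threshold: int) -> Tuple[bool, int, float, int]:
    """Return (is_abnormal, target cluster, target distance, target quantile)."""
    # S201: predict the target cluster category as the nearest center point.
    cluster = min(range(len(centers)), key=lambda k: math.dist(sample, centers[k]))
    # S202: target Euclidean distance and its 1-based quantile.
    distance = math.dist(sample, centers[cluster])
    quantile = bisect.bisect_left(cut_points_by_cluster[cluster], distance) + 1
    # S203: abnormal only if the quantile is strictly greater than the threshold.
    return quantile > threshold, cluster, distance, quantile

centers = [[0.0, 0.0], [1.0, 1.0]]
cuts = {k: [i / 200 for i in range(1, 100)] for k in (0, 1)}  # 0.005 .. 0.495

near = detect([0.012, 0.0], centers, cuts, threshold=95)
far = detect([0.5, 0.4], centers, cuts, threshold=95)
print(near[0], near[3])  # False 3
print(far[0], far[3])    # True 100
```

The comparison is strict, matching the rule that a target quantile equal to the threshold is treated as a normal sample.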
Through the above description of the technical scheme provided by the embodiment of the invention, the data sample to be detected is obtained, the target cluster category corresponding to the data sample to be detected is predicted by using the preset clustering algorithm, the target distance between the data sample to be detected and the center point of the target cluster category is determined, the target quantile corresponding to the target distance is searched, the target quantile is compared with the preset threshold value, and whether the data sample to be detected is an abnormal sample is determined according to the comparison result.
Predicting a target cluster category corresponding to a data sample to be detected through a preset clustering algorithm, determining a target distance between the data sample to be detected and a target cluster category central point, searching a target quantile corresponding to the target distance, comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to a comparison result, so that the abnormal sample can be separated.
Corresponding to the above method embodiment, an embodiment of the present invention further provides an apparatus for detecting an abnormal sample. As shown in FIG. 4, the apparatus may include: a sample acquisition module 410, a category prediction module 420, a distance determination module 430, a quantile search module 440, a quantile comparison module 450 and a sample detection module 460.
the sample acquisition module 410 is configured to acquire a data sample to be detected;
the category prediction module 420 is configured to predict a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm;
a distance determining module 430, configured to determine a target distance between the data sample to be detected and the target cluster category center point;
a quantile search module 440, configured to search for a target quantile corresponding to the target distance;
a quantile comparison module 450, configured to compare the target quantile with a preset threshold;
and the sample detection module 460 is configured to determine whether the data sample to be detected is an abnormal sample according to the comparison result.
An embodiment of the present invention further provides an electronic device, as shown in FIG. 5, including a processor 51, a communication interface 52, a memory 53 and a communication bus 54, where the processor 51, the communication interface 52, and the memory 53 communicate with one another through the communication bus 54;
a memory 53 for storing a computer program;
the processor 51 is configured to implement the following steps when executing the program stored in the memory 53:
acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm; determining a target distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target distance; and comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to a comparison result.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, there is also provided a storage medium having instructions stored therein, which when run on a computer, cause the computer to execute the method for detecting an abnormal sample according to any one of the above embodiments.
In yet another embodiment, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform the method for detecting an abnormal sample as described in any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a storage medium or transmitted from one storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)).
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in an interrelated manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
The above description is merely the preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for detecting an abnormal sample, the method comprising:
acquiring a data sample to be detected, and predicting a target clustering category corresponding to the data sample to be detected by using a preset clustering algorithm;
determining a target distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target distance;
and comparing the target quantile with a preset threshold value, and determining whether the data sample to be detected is an abnormal sample according to a comparison result.
2. The method of claim 1, further comprising, prior to performing the method:
acquiring a data sample set, wherein the data sample set comprises at least one data sample;
clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories;
for any one data sample in the data sample set, determining the cluster category corresponding to the data sample, and determining the distance between the data sample and the cluster category central point;
for any one of the cluster categories, determining a distribution of the distances within the cluster category, and determining different quantiles corresponding to the distances within the cluster category.
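The preparatory stage recited above can be illustrated with the sketch below. It is only one possible reading: a plain Lloyd's k-means stands in for the "preset clustering algorithm", the initial centers are supplied by the caller, and the names `assign`, `fit_reference`, and `quartiles` are hypothetical. The sorted distances of each cluster serve as its empirical distance distribution, from which any quantile can be read.

```python
import math
import statistics


def assign(samples, centers):
    """Index of the nearest center (Euclidean distance) for every sample."""
    return [min(range(len(centers)), key=lambda k: math.dist(s, centers[k]))
            for s in samples]


def fit_reference(samples, init_centers, iterations=20):
    """Cluster the sample set and record each cluster's distance distribution."""
    centers = [tuple(float(v) for v in c) for c in init_centers]
    for _ in range(iterations):
        labels = assign(samples, centers)
        for k in range(len(centers)):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = assign(samples, centers)
    # Ascending distances to the center, one list per cluster category.
    sorted_distances = [
        sorted(math.dist(s, centers[k]) for s, lab in zip(samples, labels) if lab == k)
        for k in range(len(centers))
    ]
    return centers, labels, sorted_distances


# Hypothetical data sample set with two well-separated groups.
samples = [(0, 0), (0, 2), (2, 0), (2, 2), (10, 10), (10, 12), (12, 10), (12, 12)]
centers, labels, sorted_distances = fit_reference(samples, [(0, 0), (10, 10)])

# Quartile cut points of the distances in each cluster (needs >= 2 distances).
quartiles = [statistics.quantiles(d, n=4) for d in sorted_distances]
print(centers, labels, quartiles)
```

Storing the full sorted distance list rather than a few fixed quantiles is a design choice of this sketch; it lets the quantile of an arbitrary new distance be looked up later by binary search.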
3. The method of claim 2, wherein the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N cluster categories comprises:
acquiring N clustering categories specified by a user, or determining the N clustering categories according to inflection points of the elbow graph;
and clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N clustering categories.
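The claim does not state how the inflection point of the elbow graph is located. One common heuristic, shown here purely as an assumption, is to take the candidate N whose point on the curve lies farthest below the straight line joining the first and last points. The function name `elbow_k` and the within-cluster sum-of-squares values are hypothetical.

```python
def elbow_k(ks, inertias):
    """Pick the k whose (k, inertia) point is farthest below the chord
    joining the first and last points of the elbow curve.

    Assumes ks is increasing and inertias decreases from first to last.
    """
    x0, x1 = ks[0], ks[-1]
    y0, y1 = inertias[0], inertias[-1]
    best_k, best_gap = ks[0], -1.0
    for k, w in zip(ks, inertias):
        # Rescale both axes to [0, 1] so that neither scale dominates.
        nx = (k - x0) / (x1 - x0)
        ny = (w - y1) / (y0 - y1)
        # The chord is the line nx + ny = 1; the gap below it is
        # proportional to 1 - nx - ny.
        gap = 1.0 - nx - ny
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Hypothetical within-cluster sums of squares for N = 1..6.
candidate_ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 400.0, 120.0, 100.0, 90.0, 85.0]
chosen_n = elbow_k(candidate_ks, inertias)
print(chosen_n)
```

In practice the inertia values would come from running the preset clustering algorithm once per candidate N; a user-specified N bypasses this step entirely.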
4. The method of claim 2, wherein determining the distance between the data sample and the cluster class center point comprises:
determining Euclidean distances between the data samples and the cluster category center points;
the determining, for any of the cluster categories, the distribution of the distances within the cluster category and the different quantiles corresponding to the distances within the cluster category, includes:
for any one of the cluster categories, determining the distribution of the Euclidean distances in the cluster category, and determining different quantiles corresponding to the Euclidean distances in the cluster category;
the determining the target distance between the data sample to be detected and the target clustering category central point and searching the target quantile corresponding to the target distance comprises the following steps:
determining a target Euclidean distance between the data sample to be detected and the target clustering category central point, and searching a target quantile corresponding to the target Euclidean distance.
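For reference, the quantities in this claim can be written as below. The symbols are assumptions introduced for illustration: x is the data sample to be detected with m features, c_k is the center point of the target cluster category k, d_i^{(k)} are the Euclidean distances of the n_k data samples of category k to c_k, and q(x) is one possible reading of the "target quantile" as the empirical cumulative share of those distances.

```latex
d(x, c_k) = \lVert x - c_k \rVert_2 = \sqrt{\sum_{j=1}^{m} \left( x_j - c_{k,j} \right)^2},
\qquad
q(x) = \frac{\left| \{\, i : d_i^{(k)} \le d(x, c_k) \,\} \right|}{n_k}
```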
5. The method according to any one of claims 2 to 4, wherein the clustering all the data samples in the data sample set by using a preset clustering algorithm to generate N cluster categories comprises:
preprocessing all the data samples in the data sample set, wherein the preprocessing at least includes missing value filling;
standardizing all the data samples in the preprocessed data sample set to obtain standardized data samples corresponding to all the data samples;
clustering all the standardized data samples by using a preset clustering algorithm to generate N clustering categories;
the determining the cluster category corresponding to the data sample and the distance between the data sample and the cluster category center point for any one of the data samples in the data sample set includes:
for any standardized data sample, determining the cluster category corresponding to the standardized data sample and the distance between the standardized data sample and the center point of the cluster category.
6. The method according to claim 5, wherein the predicting the target cluster category corresponding to the data sample to be detected by using a preset clustering algorithm comprises:
preprocessing the data sample to be detected, wherein the preprocessing at least comprises missing value filling;
standardizing the preprocessed data sample to be detected to obtain a standardized data sample to be detected;
predicting a target clustering category corresponding to the standardized data sample to be detected by using a preset clustering algorithm;
the determining the target distance between the data sample to be detected and the target cluster category center point includes:
and determining the target distance between the standardized data sample to be detected and the target clustering class central point.
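The preprocessing and standardization recited in claims 5 and 6 can be sketched as follows. The claims do not fix the filling rule or the standardization formula; this sketch assumes mean filling and z-score standardization, represents a missing value as `None`, and uses the hypothetical names `fit_preprocessor` and `transform`. The statistics are learned from the data sample set and then reused unchanged for the data sample to be detected.

```python
import statistics


def fit_preprocessor(samples):
    """Learn per-feature mean and standard deviation, ignoring missing values."""
    means, stds = [], []
    for col in zip(*samples):
        present = [v for v in col if v is not None]
        means.append(statistics.fmean(present))
        # A constant feature has zero deviation; fall back to 1.0 to avoid
        # division by zero.
        stds.append(statistics.pstdev(present) or 1.0)
    return means, stds


def transform(sample, means, stds):
    """Fill missing values with the feature mean, then apply z-score scaling."""
    filled = [m if v is None else v for v, m in zip(sample, means)]
    return [(v - m) / s for v, m, s in zip(filled, means, stds)]


# Hypothetical data sample set; the second sample has a missing feature.
train = [(1.0, 10.0), (3.0, None), (5.0, 30.0)]
means, stds = fit_preprocessor(train)
standardized_train = [transform(s, means, stds) for s in train]

# The same statistics are applied to the data sample to be detected.
standardized_new = transform((None, 40.0), means, stds)
print(standardized_train, standardized_new)
```

Reusing the training statistics, rather than recomputing them from the single sample to be detected, keeps the new sample in the same coordinate system as the cluster center points.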
7. The method according to claim 1, wherein the determining whether the data sample to be detected is an abnormal sample according to the comparison result comprises:
if the target quantile is larger than the preset threshold value, determining that the data sample to be detected is an abnormal sample;
and if the target quantile is less than or equal to the preset threshold, determining that the data sample to be detected is a non-abnormal sample.
8. An apparatus for detecting an abnormal sample, the apparatus comprising:
the sample acquisition module is used for acquiring a data sample to be detected;
the class prediction module is used for predicting a target cluster class corresponding to the data sample to be detected by using a preset clustering algorithm;
the distance determining module is used for determining a target distance between the data sample to be detected and the target clustering category central point;
the quantile searching module is used for searching a target quantile corresponding to the target distance;
the quantile comparison module is used for comparing the target quantile with a preset threshold value;
and the sample detection module is used for determining whether the data sample to be detected is an abnormal sample according to the comparison result.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 7 when executing a program stored on a memory.
10. A storage medium on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111396617.6A 2021-11-23 2021-11-23 Abnormal sample detection method and device, electronic equipment and storage medium Pending CN114298147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111396617.6A CN114298147A (en) 2021-11-23 2021-11-23 Abnormal sample detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114298147A (en) 2022-04-08

Family

ID=80966229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111396617.6A Pending CN114298147A (en) 2021-11-23 2021-11-23 Abnormal sample detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114298147A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3156941A1 (en) * 2015-10-12 2017-04-19 Siemens Aktiengesellschaft System, method and a computer program product for analyzing data
CN109978070A (en) * 2019-04-03 2019-07-05 北京市天元网络技术股份有限公司 A kind of improved K-means rejecting outliers method and device
CN111797887A (en) * 2020-04-16 2020-10-20 中国电力科学研究院有限公司 Anti-electricity-stealing early warning method and system based on density screening and K-means clustering
CN111814910A (en) * 2020-08-12 2020-10-23 中国工商银行股份有限公司 Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium
CN111814523A (en) * 2019-04-12 2020-10-23 北京京东尚科信息技术有限公司 Human body activity recognition method and device
CN112001409A (en) * 2020-07-01 2020-11-27 中国电力科学研究院有限公司 Power distribution network line loss abnormity diagnosis method and system based on K-means clustering algorithm
CN112905412A (en) * 2021-01-29 2021-06-04 清华大学 Method and device for detecting abnormity of key performance index data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cheng Mingchang et al.: "Dynamic K-means algorithm based on quantile radius", Journal of Nanjing University (Natural Science) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20220408