CN115171125A - Data anomaly detection method - Google Patents

Data anomaly detection method

Info

Publication number
CN115171125A
Authority
CN
China
Prior art keywords
abnormal
sample
samples
test sample
data
Prior art date
Legal status
Pending
Application number
CN202210632838.7A
Other languages
Chinese (zh)
Inventor
王鹏飞
龙如蛟
杨志博
姚聪
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210632838.7A
Publication of CN115171125A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/19 Recognition using electronic means
    • G06V 30/19007 Matching; Proximity measures
    • G06V 30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Abstract

One or more embodiments of the present specification provide a data anomaly detection method, the method including: for any test sample in a test set, obtaining a preset number of neighbor samples similar to the test sample from a support set, wherein the test set is a set of test samples to be checked for anomalies and the support set is a set of labeled samples that have been manually labeled in advance; establishing a probability distribution function from the data corresponding to a target field in the neighbor samples; calculating, through the probability distribution function, the confidence of the to-be-detected data corresponding to the target field in the test sample; and judging, from the confidence, whether the target field in the test sample is abnormal.

Description

Data anomaly detection method
Technical Field
One or more embodiments of the present disclosure relate to the field of data processing technologies, and in particular, to a method for detecting data anomalies.
Background
With the arrival of the digital intelligence era, more and more data processing can be carried out by deep learning algorithms without manual involvement. Such algorithms require a large amount of training data to build an adequate model; if the training data do not cover all the scenarios in which the model may be used, or the objects the model must process come in many different forms, the model may produce abnormal output results. In the prior art, abnormal model outputs are discovered mainly through manual intervention: for example, user feedback is analyzed by hand, and manually selected target samples are processed to generate new training data with which the model is iteratively optimized.
Because this optimization process requires a great deal of manual work, errors may occur throughout the abnormal-data handling stage, and because anomalies are only discovered through user feedback, they are not handled in a timely manner. How to find, by means of a specific algorithm, the samples that the existing model cannot process correctly, and then use those samples for model iteration, is therefore an urgent problem to be solved.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a method for detecting data anomalies.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
According to a first aspect of one or more embodiments of the present specification, a data anomaly detection method is provided, including:
for any test sample in a test set, obtaining a preset number of neighbor samples similar to the test sample from a support set, wherein the test set is a set of test samples to be checked for anomalies and the support set is a set of labeled samples that have been manually labeled in advance;
establishing a probability distribution function from the data corresponding to a target field in the neighbor samples;
calculating, through the probability distribution function, the confidence of the to-be-detected data corresponding to the target field in the test sample;
and judging, from the confidence, whether the target field in the test sample is abnormal.
According to a second aspect of one or more embodiments of the present specification, a data anomaly detection method is provided, applied to anomaly detection of recognized cards/tickets output by an OCR recognition model, the method including:
for any recognized card/ticket in a test set, obtaining a preset number of neighbor cards/tickets similar to the recognized card/ticket from a support set, wherein the test set is a set of recognized cards/tickets to be checked for anomalies and the support set is a set of labeled cards/tickets that have been manually labeled in advance;
establishing a probability distribution function from the data corresponding to a target field in the neighbor cards/tickets;
calculating, through the probability distribution function, the confidence of the to-be-detected data corresponding to the target field in the recognized card/ticket;
and judging, from the confidence, whether the target field in the recognized card/ticket is abnormal.
According to a third aspect of one or more embodiments of the present description, there is provided a computer readable storage medium, on which a computer program is stored, which program, when executed by a processor, performs the steps of the method according to the first or second aspect.
According to a fourth aspect of one or more embodiments of the present description, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first or second aspect when executing the program.
In the technical solutions provided in this specification, obtaining neighbor samples similar to the test sample from the support set prevents labeled samples with low similarity to the test sample from influencing the test sample's anomaly detection result, which improves detection accuracy; and because anomaly detection is performed field by field and a confidence is obtained for each field, the detection result is output at field granularity.
Drawings
FIG. 1 is a schematic diagram of an architecture of a data anomaly detection device according to an exemplary embodiment of the present specification;
FIG. 2 is a schematic flow chart of a data anomaly detection method provided in an exemplary embodiment of the present specification;
FIG. 3 is a schematic view of a support set provided by an exemplary embodiment of the present description;
FIG. 4 is a schematic flow chart diagram illustrating another data anomaly detection method provided in an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a data anomaly detection apparatus provided in an exemplary embodiment of the present specification;
FIG. 7 is a schematic diagram of another data anomaly detection device provided in an exemplary embodiment of the present specification.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
With the arrival of the digital intelligence era, more and more data processing can be carried out by deep learning algorithms. For example, when OCR (Optical Character Recognition) technology is used to recognize the characters in an image and generate text, no one needs to read the image manually: the image is fed to a pre-trained machine learning model and the text on the image is obtained from the model's output. However, training such a machine learning model depends on a large amount of training data, and even with weakly supervised learning some samples still have to be labeled manually. To reduce the amount of training data without degrading the model's performance, one must consider how many samples need manual labeling; and because each sample contributes differently to the model, one must also consider how to screen the samples worth labeling.
In the prior art, the samples that need manual labeling are usually screened by hand: during application of the machine learning model, feedback about recognition anomalies is collected from users, the distribution of the abnormal feedback is compiled manually, target samples are then selected by hand according to that distribution and processed to produce manual labels, and the resulting new training data are added to the training of the machine learning model to complete its iterative optimization.
Because this optimization process requires a great deal of manual work, errors may occur throughout the abnormal-data handling stage, and because anomalies are only discovered through user feedback, they are not handled in a timely manner. How to find, by means of a specific algorithm, the samples that the existing model cannot process correctly, and then use those samples for model iteration, is therefore an urgent problem to be solved.
To address this, the present specification provides a data anomaly detection method. A set of labeled samples that have already been manually labeled is used as a support set. When a test sample taken from a test set is checked for anomalies, only the neighbor samples in the support set that are similar to the test sample are used as the basis for comparison, rather than all labeled samples in the support set. This reduces the number of comparison samples and, at the same time, prevents labeled samples whose layout differs greatly from the test sample, and whose similarity to it is low, from affecting the test sample's anomaly detection result, thereby improving detection accuracy. The method can also quickly locate the abnormal field, which makes it easy to handle the field in which the anomaly occurs.
The following describes a method for detecting data abnormality set forth in this specification. Referring to fig. 1, fig. 1 is a schematic structural diagram of a data anomaly detection device according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, the data abnormality detecting apparatus may include: a server 11, a network 12 and at least one terminal 13.
The server 11 may be a physical server comprising an independent host, or a virtual server carried by a host cluster. In operation, the server 11 may be configured with a data anomaly detection apparatus, implemented in software and/or hardware, that provides a data anomaly detection service: by comparing a test sample from the test set with the neighbor samples similar to it in the support set, it detects whether any field in the test sample is anomalous.
The terminal 13 is an electronic device available to a user that can initiate a data anomaly detection request for a test sample image. The electronic device may be a mobile phone, a desktop computer, a tablet device, a notebook computer or a handheld computer (PDA), a wearable device (such as smart glasses or VR glasses), a scanner or digital camera with an OCR function, or a dedicated card/ticket OCR recognizer, for example a passport reader or a driver's license reader, which one or more embodiments of the present disclosure do not limit. The electronic device may include an image capturing device for capturing an image of a target object to be OCR-recognized, or it may receive, over a communication connection, an image of a target object captured by another device and perform OCR recognition on it.
In an exemplary embodiment of the present specification, the terminal 13 may implement OCR recognition of a target object. Specifically, the terminal 13 does so based on a machine learning model for OCR recognition deployed on it, for example: OCR recognition of train tickets, extraction of ticket-face information, and formatted output of records. The terminal sends the output of the machine learning model to the server as a test sample, and the server 11 performs anomaly detection. The server 11 then optimizes the OCR function based on the detection result. For example, data anomalies may be caused by a newly issued train-ticket layout, or by a scene that the original machine learning model cannot recognize accurately even though the layout is unchanged; in such cases, test samples are manually labeled and the labeled samples are used to iteratively update the recognition model, so that model iteration is triggered dynamically and proactively, and the model is continuously optimized and adapts to layout changes. Of course, the data anomaly detection method proposed in this specification may also be applied to data anomaly detection in application scenarios other than OCR recognition, which this specification does not limit.
The network 12 used for interaction between the server 11 and the terminal 13 may include various types of wired or wireless networks.
A method for detecting data abnormality provided in the present specification will be described below with reference to fig. 2. Fig. 2 is a schematic flowchart of a method for detecting data anomalies according to an exemplary embodiment. As shown in fig. 2, the method may include the steps of:
s201, aiming at any test sample in a test set, obtaining a preset number of neighbor samples similar to the test sample in a support set, wherein the test set is the test sample set to be subjected to abnormal detection, and the support set is a labeled sample set which is labeled manually in advance.
The test sample is a sample needing data anomaly detection, and the adjacent sample selected from the marked samples in the support set is used as a comparison basis of the anomaly detection. According to any test sample in the test set, a preset number of neighbor samples similar to the test sample are obtained in the support set. The purpose of obtaining the neighbor samples is to avoid the marked samples with larger differences and lower similarities with the test sample form in the support set from influencing the data anomaly detection result of the test sample in the subsequent comparison process, and reduce the accuracy of anomaly detection.
In an exemplary embodiment of the present disclosure, the selection of neighbor samples may be accomplished by the following steps:
generating corresponding feature vectors from the field data contained in the test sample and in the labeled samples;
and calculating the distance between the feature vector of the test sample and the feature vectors of the labeled samples, and obtaining the preset number of neighbor samples closest to the test sample.
In this method, the test sample and the labeled samples are first represented as feature vectors, where a feature vector may be generated from the field data contained in the sample.
For example, in an exemplary embodiment of the present specification, assume the test sample is a card/ticket output by a machine learning model for OCR recognition. Because a card/ticket contains data in the structured form of key-value pairs, a test sample will contain data for a number of fields, where the data of any one field includes category information, location information, and recognition information. For instance, for the field "gender: female", the category information is "gender"; the location information indicates where this field sits in the test sample, for example that the field is on the second line of the text generated for the test sample, or its position coordinates in the original object before OCR recognition; and the recognition information is "female". To make comparison easier, all data contained in the test sample and the labeled samples are grouped by category, the data of each category are arranged in a predetermined order, and a feature vector is generated for each sample in this way. For example, for samples whose data categories include gender, name, age, and document number, a feature vector representing the sample may be formed in the order gender, name, age, document number.
After the samples in the test set and the support set are expressed as feature vectors, the similarity between the feature vector of the test sample and the feature vectors of the labeled samples in the support set is calculated. In an exemplary embodiment of the present specification, a cosine similarity algorithm may be used to compute the cosine distance between the feature vector of the test sample and that of a labeled sample in the support set as a measure of the similarity between the two: the closer the cosine distance, the higher the similarity between the test sample and that labeled sample. In this way, the preset number of neighbor samples with the highest similarity to the test sample can be obtained from the support set. Of course, other methods may also be used to calculate the similarity between the feature vector of the test sample and the feature vectors of the labeled samples in the support set, and this specification is not limited in this respect.
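As a concrete illustration of the neighbor-selection step just described, the following Python sketch builds a per-category feature vector for each sample and picks the most similar labeled samples by cosine distance. It is only a minimal sketch under assumed data structures (a sample as a dict mapping field category to its recognized value); the category list, the character-count encoding, and the helper names build_feature_vector and find_neighbors are illustrative assumptions, not taken from the patent.

```python
import numpy as np

# Assumed fixed category order; real samples may have more or different fields.
CATEGORIES = ["gender", "name", "age", "document_number"]

def build_feature_vector(sample: dict) -> np.ndarray:
    # One slot per category in a predetermined order; as a simple surrogate
    # feature we use the number of characters recognized for that category
    # (0 if the field is absent). Richer per-field encodings are possible.
    return np.array([len(str(sample.get(c, ""))) for c in CATEGORIES], dtype=float)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 if denom == 0 else 1.0 - float(np.dot(a, b) / denom)

def find_neighbors(test_sample: dict, support_set: list[dict], k: int) -> list[dict]:
    # Return the k labeled samples whose feature vectors have the smallest
    # cosine distance to (i.e. the highest similarity with) the test sample.
    tv = build_feature_vector(test_sample)
    ranked = sorted(support_set, key=lambda s: cosine_distance(tv, build_feature_vector(s)))
    return ranked[:k]
```

For example, find_neighbors(test_sample, support_set, k=1000) would return the 1000 neighbor samples used as the comparison basis in the embodiments below.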
After the neighbor samples of the test sample have been obtained, the next step is carried out.
S202, establish a probability distribution function from the data corresponding to the target field in the neighbor samples.
S203, calculate, through the probability distribution function, the confidence of the to-be-detected data corresponding to the target field in the test sample.
For example, assume the target field is the "gender" field. The distribution of the data of the "gender" field across the neighbor samples is computed, and a probability distribution function for that field is generated. Substituting the test sample into the probability distribution function obtained from the neighbor samples yields the confidence of the target field in the test sample. The higher the confidence, the better the target field conforms to the distribution pattern derived from the neighbor samples, i.e. the less likely the field is abnormal; the lower the confidence, the more the target field deviates from the distribution of the corresponding field in the neighbor samples, and the more likely it is abnormal.
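For a categorical field such as "gender", S202 and S203 can be realized very simply by using the empirical frequency of each recognized value among the neighbor samples as the probability distribution and reading the test sample's confidence from it. The sketch below assumes the same dict-based sample representation as above and is illustrative only.

```python
from collections import Counter

def field_value_distribution(neighbors: list[dict], field: str) -> dict[str, float]:
    # Empirical probability of each recognized value of `field` among the neighbors.
    values = [str(n.get(field, "")) for n in neighbors]
    counts = Counter(values)
    total = sum(counts.values())
    return {} if total == 0 else {v: c / total for v, c in counts.items()}

def field_confidence(test_sample: dict, neighbors: list[dict], field: str) -> float:
    # Confidence of the test sample's value for `field`: its probability under
    # the distribution estimated from the neighbor samples (0.0 if never seen).
    dist = field_value_distribution(neighbors, field)
    return dist.get(str(test_sample.get(field, "")), 0.0)
```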
In an exemplary embodiment of the present specification, the target field has different description data in several dimensions. When the probability distribution function is established, a probability distribution function may be established for each dimension based on the description data of the target field in that dimension. In this specification, the description data of the target field in a given dimension is called the criterion of the target field in that dimension. Taking the gender field of an identity card as an example, the "gender" field has a criterion based on the dimension of the number of occurrences of the category and a criterion based on the dimension of the number of characters in the recognition result. A probability distribution function is established for each of these criteria, and the confidence of the target field in the test sample for each dimension is obtained from the corresponding probability distribution function. Whether the target field is abnormal is then judged comprehensively on the basis of the several criteria. When the target field is abnormal, its description data in one dimension may happen to conform to the probability distribution function that the neighbor samples generate in that dimension, while its description data in other dimensions do not. For example, suppose the erroneous field "gender = good person" is recognized in the test sample. The anomaly clearly cannot be detected in the dimension of the number of occurrences of the category, but in the dimension of the number of characters in the recognition result the result has changed from a single character to two characters, so the anomaly can be detected there. Establishing multi-dimensional probability distribution functions for the same target field in this way effectively improves the accuracy of data anomaly detection.
In an exemplary embodiment of the present specification, the probability distribution function may take the form of a Gaussian distribution, a Poisson distribution, or another form of distribution suited to the application, and this specification is not limited in this respect.
Specifically, suppose that for a certain test sample the number of occurrences of the gender field and the number of characters in its recognition result are obtained for the preset number of neighbor samples in the support set, and Gaussian statistics are computed over these values. Because the gender field appears exactly once on an identity card and its recognition result is "male" or "female", each a single character, the number of characters in the recognition result is 1. Against this background, when the preset number of neighbor samples are highly similar, the occurrence-count dimension of the gender field may fit a Gaussian with mean 1.0 and variance 0.0, and the dimension of the number of characters in the recognition result may likewise fit a Gaussian with mean 1.0 and variance 0.0. Together, the probability distribution functions of the two dimensions indicate that the gender field should appear only once and that its recognition result should contain only one character. The number of occurrences of the gender field in the test sample and the number of characters in its recognition result are then substituted into the corresponding probability distribution functions, and the confidence under each criterion is calculated to judge whether the field is abnormal.
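The Gaussian variant described above can be sketched as follows: fit a Gaussian to each criterion (occurrence count, character count of the recognition result) over the neighbor samples and score the test sample by the normalized density at its own value. The helper name and the small standard-deviation floor are assumptions added so that the degenerate mean 1.0, variance 0.0 case remains computable.

```python
import numpy as np

def gaussian_confidence(neighbor_values: list[float], test_value: float,
                        min_std: float = 1e-6) -> float:
    # Fit a Gaussian to the criterion values observed in the neighbor samples
    # and return the density at the test value, normalized so that the mode
    # maps to confidence 1.0. The std floor keeps the variance-0.0 case finite.
    mu = float(np.mean(neighbor_values))
    sigma = max(float(np.std(neighbor_values)), min_std)
    return float(np.exp(-0.5 * ((test_value - mu) / sigma) ** 2))

# e.g. occurrence count of the gender field in four neighbor samples vs. the test sample:
# gaussian_confidence([1, 1, 1, 1], 1)  -> 1.0   (conforms to the neighbors)
# gaussian_confidence([1, 1, 1, 1], 2)  -> ~0.0  (likely abnormal)
```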
In an exemplary embodiment of the present specification, an average confidence for the target field may be computed from the confidences obtained for the target field under the criteria of the different dimensions; alternatively, each criterion may be assigned a reasonable weight reflecting its importance, and a weighted confidence for the target field may be computed from those weights and the confidences obtained under each criterion. Whether the target field is abnormal is then judged from the average or weighted confidence.
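Combining the per-criterion confidences into an average or weighted confidence could then look like the following sketch; the criterion names and example weights are placeholders, not values prescribed by the specification.

```python
def combined_confidence(per_criterion: dict[str, float],
                        weights: dict[str, float] | None = None) -> float:
    # Average the per-criterion confidences, or weight them by importance.
    if not per_criterion:
        return 0.0
    if weights is None:
        return sum(per_criterion.values()) / len(per_criterion)
    total_w = sum(weights.get(k, 0.0) for k in per_criterion)
    if total_w == 0.0:
        return 0.0
    return sum(c * weights.get(k, 0.0) for k, c in per_criterion.items()) / total_w

# combined_confidence({"char_type": 0.0, "char_count": 1.0},
#                     weights={"char_type": 0.8, "char_count": 0.2})  -> 0.2
```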
S204, judge, from the confidence, whether the target field in the test sample is abnormal.
In an exemplary embodiment of the present specification, a confidence anomaly threshold may be determined empirically; when the confidence of the target field is lower than the threshold, the target field is judged to be abnormal.
In another exemplary embodiment of the present specification, a comprehensive confidence of the test sample is generated from the confidences of the to-be-detected data corresponding to the individual fields in the test sample, and whether the test sample as a whole is abnormal is judged according to the comprehensive confidence. For a test sample represented by feature vectors, the comprehensive confidence can be computed from the feature matrix formed by the feature vectors of all the fields.
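A comprehensive, whole-sample confidence can likewise be derived from the per-field confidences, for example by taking their mean, or their minimum if the most suspicious field should dominate. A minimal sketch under that assumption, with the 0.5 threshold as an arbitrary placeholder for the empirically chosen value:

```python
def sample_confidence(field_confidences: dict[str, float], how: str = "mean") -> float:
    # Aggregate per-field confidences into one score for the whole test sample.
    if not field_confidences:
        return 0.0
    vals = list(field_confidences.values())
    return min(vals) if how == "min" else sum(vals) / len(vals)

def is_sample_abnormal(field_confidences: dict[str, float], threshold: float = 0.5) -> bool:
    # The sample as a whole is flagged when its comprehensive confidence falls
    # below an empirically chosen threshold (0.5 is only a placeholder here).
    return sample_confidence(field_confidences) < threshold
```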
In an exemplary embodiment of the present specification, other information such as location information may also be added as criteria in further dimensions to improve the accuracy of data anomaly detection. Moreover, when data of the same category appear several times in the same sample, introducing location information makes it possible to distinguish data of the same category appearing in different fields.
In an exemplary embodiment of the present specification, the samples in the test set come from the outputs of a pre-trained machine learning model that require data anomaly detection.
For example, suppose the pre-trained machine learning model is a structure detection model for OCR recognition, used to recognize the characters in a preset object and output the corresponding text data. When the structure detection model performs OCR on an image, its output may be abnormal because the carriers of the target object differ (for example, a paper document versus an electronic picture) or because the same information is recorded in different forms (for example, at different positions in different objects). To detect whether an OCR output is abnormal and handle the recognized object accordingly, the OCR recognition results are placed in the test set as test samples.
In the above embodiment, assume the support set contains labeled samples of several layouts of the preset object. For example, when the preset object is a ticket, tickets come in a red layout and a blue layout whose field data differ: a blue-layout ticket contains the pinyin of the departure station, while a red-layout ticket does not. If an averaged statistic over all labeled samples were used during anomaly detection and the test sample were a blue-layout ticket, the probability distribution function for the departure-station pinyin field would differ greatly from the true one because of the influence of the red-layout tickets; when the support set contains many red-layout ticket samples, the distribution may even be biased toward the conclusion that the departure-station pinyin field should not exist. The blue-layout ticket used as the test sample would then be judged to have abnormal data. In the present specification, by obtaining neighbor samples from the support set, the influence of labeled samples whose layout differs greatly from the test sample on the test sample's detection result is effectively avoided.
In an exemplary embodiment of the present specification, the samples in the test set may themselves be labeled samples that have already been manually labeled. In that case, the above data anomaly detection method can be used to correct errors in the labeled samples: when a labeled sample used as a test sample is judged abnormal by the method, its labeling information may be erroneous, and the manual labeling of that sample needs to be revised. Placing labeled samples in the test set in this way detects samples whose manual labeling is wrong, and revising the mislabeled samples improves the quality of the labeled data.
After data anomaly detection is completed, a large number of abnormal samples may be obtained, many of which reflect homogeneous problems caused by the same model defect. To avoid repeated labeling and the extra cost it brings, an active learning method can be used to screen out a representative subset of the abnormal samples for manual labeling. The screened abnormal samples may be those whose anomaly types are easiest to distinguish; after they are manually labeled, the labeled samples are placed in the support set, which extends the range of cases the data anomaly detection method provided in this specification can handle.
Alternatively, in another exemplary embodiment of the present specification, the manually labeled abnormal samples may also be used for iterative training of a machine learning model applicable to the support set. For example, for a machine learning model for OCR recognition, labeling the abnormal samples and using them in the training of the model likewise extends the range of cases the data anomaly detection method provided in this specification can handle.
In an exemplary embodiment of the present disclosure, the active learning method screens the abnormal samples as follows:
classify the abnormal samples according to which fields in them are abnormal; obtain the anomaly proportion of each category from the number of abnormal samples in each category; and extract abnormal samples for manual labeling according to those proportions.
For example, suppose the abnormal samples found in a certain test set fall into three categories: samples whose "name" and "gender" fields are abnormal, samples whose "age" field is abnormal, and samples whose "gender" field is abnormal, with the numbers of abnormal samples in the three categories in the ratio 1:2:2 and 10000 abnormal samples in total. If 1000 samples now have to be screened from these 10000 for manual labeling, then 200 abnormal samples with abnormal "name + gender" fields, 400 with an abnormal "age" field, and 400 with an abnormal "gender" field are extracted in that ratio for manual labeling.
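The proportional screening just described (200/400/400 out of 1000 for a 1:2:2 ratio) amounts to stratified sampling by anomaly category. A minimal sketch, with random.sample standing in for whichever concrete selection rule is actually used and category_of assumed to map a sample to its abnormal-field category:

```python
import random
from collections import defaultdict
from typing import Callable

def stratified_screen(abnormal_samples: list[dict], budget: int,
                      category_of: Callable[[dict], str]) -> list[dict]:
    # Group abnormal samples by which field(s) are abnormal, then draw from each
    # group in proportion to its share of all abnormal samples. Rounding means
    # the total picked may differ from `budget` by a sample or two.
    groups: dict[str, list[dict]] = defaultdict(list)
    for s in abnormal_samples:
        groups[category_of(s)].append(s)
    total = len(abnormal_samples)
    selected: list[dict] = []
    for members in groups.values():
        quota = round(budget * len(members) / total)
        selected.extend(random.sample(members, min(quota, len(members))))
    return selected
```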
Of course, other query strategies may also be used to sample the abnormal samples for manual labeling, for example least-confidence sampling, margin sampling, or entropy sampling. Taking least-confidence sampling as an example, this strategy computes the minimum confidence of each sample as its measure of uncertainty. For instance, the comprehensive confidence of each test sample from the above embodiments can be used: a sample with a smaller comprehensive confidence has lower similarity to, and differs more from, its neighbor samples in the support set, and therefore is in greater need of manual labeling. The abnormal samples are sorted by comprehensive confidence, and when there are 10000 abnormal samples in total and 1000 of them must be screened for manual labeling, the 1000 abnormal samples with the lowest comprehensive confidence are extracted for manual labeling. Different query strategies can be chosen for different application scenarios to sample the abnormal samples, and this specification is not limited in this respect.
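Least-confidence sampling over the same pool is even simpler: sort the abnormal samples by comprehensive confidence and take the lowest-scoring ones. A sketch assuming each sample record already carries the comprehensive confidence computed earlier:

```python
def least_confidence_screen(abnormal_samples: list[dict], budget: int) -> list[dict]:
    # Pick the `budget` abnormal samples whose comprehensive confidence is the
    # lowest, i.e. those least similar to their neighbor samples in the support set.
    ranked = sorted(abnormal_samples, key=lambda s: s["comprehensive_confidence"])
    return ranked[:budget]

# With 10000 abnormal samples and budget=1000, this returns the 1000 samples
# with the lowest comprehensive confidence for manual labeling.
```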
For ease of understanding, this specification gives the following concrete example. The test samples in the test set are the outputs produced when a machine learning model for OCR recognition recognizes one type of certificate. The support set contains 10000 labeled samples covering two different layouts of the certificate, layout 1 and layout 2, as shown in fig. 3. Certificates of both layouts contain the fields name, gender, and age, each of which appears once.
Assume that test sample 1 is closer to layout 1, but the data of its age field is "age = man". The test set also contains a test sample 2 that is closer to layout 2, except that test sample 2 additionally contains a "year, month and day of birth" field: test sample 2 is a new layout of the certificate.
When the age field in test sample 1 is used as the target field for anomaly detection, 1000 neighbor samples are obtained from the support set for the target field. Based on the dimensions "character type of the recognition result" and "number of characters in the recognition result", probability distribution functions of the target field in these two dimensions are established from the neighbor samples, shown as probability distribution function 1 and probability distribution function 2 in fig. 3.
Because age in layout 1 is written in Arabic numerals, while the recognition result of the target field in test sample 1 is "man", a Chinese character rather than a numeral, the confidence of the target field in test sample 1 in the dimension "character type of the recognition result" is 0, assuming ideally that the labeled samples in the support set contain no errors. As for the dimension "number of characters in the recognition result", because an age may be written with anywhere from one to three digits, the probability distribution function established for the target field in this dimension from the neighbor samples has little reference value for judging data anomalies. In this case, other dimensions may be chosen to establish corresponding probability distribution functions, or the weight of "number of characters in the recognition result" may be reduced when the weighted confidence of the target field is calculated. Since the age field of test sample 1 is abnormal, test sample 1 is an abnormal sample.
For test sample 2, its neighbor samples in the support set belong to layout 2, but when the "year, month and day of birth" field is used as the target field, that field does not exist in layout 2, so for this field the confidence is 0 in every dimension, again assuming ideally that the labeled samples in the support set contain no errors. Test sample 2 is therefore also an abnormal sample.
Assume that the test set contains only the above two types of anomaly, that the total number of test samples is 10000, and that anomaly type 1 and anomaly type 2 occur in the ratio 1:9. After anomaly detection over the test set, 1000 abnormal samples with an abnormal "age" field and 9000 abnormal samples with an abnormal "year, month and day of birth" field are obtained. If 1000 samples are to be selected for manual labeling, 100 abnormal samples with an abnormal "age" field and 900 abnormal samples with an abnormal "year, month and day of birth" field are screened for manual labeling.
For the abnormal samples whose "year, month and day of birth" field is abnormal, the anomaly arises because the layout has been updated to layout 3. After these abnormal samples are manually labeled, they can be placed in the support set as new labeled samples. When a test sample 3 belonging to layout 3 is later checked for anomalies, its neighbor samples will be the labeled samples of layout 3 newly added to the support set, so when test sample 3 is checked against those neighbor samples it will be judged, on the basis of layout 3, to be free of anomalies. For the support set of the data anomaly detection method, this process introduces layout 3, which is equivalent to extending the range of cases the method can handle; test samples belonging to layout 3 will no longer be judged abnormal, which to a great extent avoids repeated labeling of samples with the same anomaly type.
Meanwhile, the manually labeled abnormal samples can also be used for iterative training of the machine learning model for OCR recognition. Because the new labeled samples include the new layout, using this part of the samples as training data adds a user scenario to the training of the model, and the updated machine learning model obtained after iterative training is better suited to recognizing certificates of layout 3.
An exemplary embodiment of the present specification further provides a method for detecting data anomalies in recognized cards/tickets output by an OCR recognition model, as shown in fig. 4, including the following steps:
S401, for any recognized card/ticket in the test set, obtain a preset number of neighbor cards/tickets similar to the recognized card/ticket from a support set, where the test set is a set of recognized cards/tickets to be checked for anomalies and the support set is a set of labeled cards/tickets that have been manually labeled in advance;
S402, establish a probability distribution function from the data corresponding to the target field in the neighbor cards/tickets;
S403, calculate, through the probability distribution function, the confidence of the to-be-detected data corresponding to the target field in the recognized card/ticket, and judge, from the confidence, whether the target field in the recognized card/ticket is abnormal.
In an exemplary embodiment of the present specification, obtaining the preset number of neighbor cards/tickets similar to the recognized card/ticket from the support set may be implemented by the following steps:
generating corresponding feature vectors from the field data contained in the recognized card/ticket and in the labeled cards/tickets;
and calculating the distance between the feature vector of the recognized card/ticket and the feature vectors of the labeled cards/tickets, and obtaining the preset number of neighbor cards/tickets closest to the recognized card/ticket.
In an exemplary embodiment of the present specification, a probability distribution function may be established for each dimension based on the description data of the target field in that dimension, and the confidence of the to-be-detected data corresponding to the target field in the recognized card/ticket may be obtained for each dimension through the corresponding probability distribution function.
In another exemplary embodiment of the present specification, a comprehensive confidence of the recognized card/ticket may also be generated from the confidences of the to-be-detected data corresponding to the individual fields in the recognized card/ticket, and whether the recognized card/ticket as a whole is abnormal may be judged according to the comprehensive confidence.
In an exemplary embodiment of the present specification, a card/ticket containing an abnormal field may also be determined to be an abnormal card/ticket, and at least part of the abnormal cards/tickets are selected for manual labeling; the manually labeled abnormal cards/tickets are then placed in the support set, or are used for iterative training of the model for OCR recognition. In this embodiment, after the cards/tickets found to be abnormal by the above method have been manually labeled, they can be placed in the support set as additional manually labeled cards/tickets, increasing their number in the support set and thereby improving the accuracy of the data anomaly detection method. Alternatively, the manually labeled abnormal cards/tickets can be used for iterative training of the model for OCR recognition, so that the model is iteratively updated and its understanding of cards/tickets is further improved.
Selecting at least part of the abnormal cards/tickets for manual labeling may be implemented by the following steps: classify the abnormal cards/tickets according to which fields in them are abnormal; obtain the anomaly proportion of each category from the number of abnormal cards/tickets in each category; and extract abnormal cards/tickets for manual labeling according to those proportions.
The specific embodiments of the data anomaly detection method applied to recognized cards/tickets output by an OCR recognition model listed above can be understood with reference to the specific embodiments of the method shown in fig. 2, and are not described again here.
Fig. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present specification. Referring to fig. 5, at the hardware level the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile memory 510; it may of course also include hardware required for other functions. The processor 502 reads the corresponding computer program from the non-volatile memory 510 into the memory 508 and runs it, forming a data anomaly detection apparatus at the logical level. Besides this software implementation, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the executing body of the following processing flow is not limited to logical units and may also be hardware or logic devices.
In accordance with an embodiment of the foregoing method, the present specification further provides a data anomaly detection apparatus, which may include, as shown in fig. 6:
a first obtaining unit 610, configured to obtain, for any test sample in a test set, a preset number of neighboring samples similar to the test sample in a support set, where the test set is a test sample set to be subjected to anomaly detection, and the support set is a labeled sample set that is manually labeled in advance;
a first establishing unit 620, configured to establish a probability statistic function according to data corresponding to a target field in the neighbor sample;
a first calculating unit 630, configured to calculate, through the probability statistic function, a confidence of the to-be-detected data in the test sample corresponding to the target field;
the first judging unit 640 is configured to judge whether a target field in the test sample is abnormal according to the confidence.
Optionally, the first obtaining unit 610 may be specifically configured to:
generating corresponding feature vectors from the field data contained in the test sample and in the labeled samples;
and calculating the distance between the feature vector of the test sample and the feature vectors of the labeled samples, and obtaining the preset number of neighbor samples closest to the test sample.
Optionally, the target field has different description data in a plurality of dimensions, and the first establishing unit 620 may be specifically configured to:
establish, for each dimension, a probability distribution function corresponding to that dimension based on the description data of the target field in that dimension;
the first calculating unit 630 may be specifically configured to:
obtain, through the probability distribution function corresponding to each dimension, the confidence of the to-be-detected data corresponding to the target field in the test sample for that dimension.
Optionally, the apparatus further comprises:
the first generating unit 650 is configured to generate a comprehensive confidence of the test sample according to the confidence of the to-be-detected data corresponding to each field in the test sample;
and a first whole judging unit 660, configured to judge whether the whole test sample is abnormal according to the comprehensive confidence.
Optionally, the test samples in the test set include at least one of: test samples output by a pre-trained machine learning model that require data anomaly detection, and labeled samples that have been manually labeled in advance.
Optionally, the machine learning model is a model for OCR recognition of a preset object, and the support set contains labeled samples of multiple layouts of the preset object.
Optionally, the apparatus further comprises:
and the first selecting unit 670 is configured to determine a sample with an exception in at least one field as an exception sample, and select at least a part of the exception sample for manual tagging.
A first abnormal sample processing unit 680, configured to place the artificially labeled abnormal sample into the support set; alternatively, the artificially labeled anomaly samples are used for iterative training of a machine learning model applicable to the support set.
Optionally, the first selecting unit 670 may be specifically configured to:
classifying the abnormal samples according to the abnormal fields in the abnormal samples;
acquiring an abnormal proportion corresponding to each category according to the number of the abnormal samples of each category;
and extracting an abnormal sample according to the abnormal proportion for manual marking.
In accordance with the above method, the present specification further provides another data anomaly detection apparatus, applied to anomaly detection of recognized cards/tickets output by an OCR recognition model. As shown in fig. 7, the apparatus may include:
a second obtaining unit 710, configured to obtain, for any recognized card/ticket in a test set, a preset number of neighbor cards/tickets similar to the recognized card/ticket from a support set, where the test set is a set of recognized cards/tickets to be checked for anomalies and the support set is a set of labeled cards/tickets that have been manually labeled in advance;
a second establishing unit 720, configured to establish a probability distribution function from the data corresponding to the target field in the neighbor cards/tickets;
a second calculating unit 730, configured to calculate, through the probability distribution function, the confidence of the to-be-detected data corresponding to the target field in the recognized card/ticket;
and a second judging unit 740, configured to judge, from the confidence, whether the target field in the recognized card/ticket is abnormal.
Optionally, the second obtaining unit 710 may be specifically configured to:
generate corresponding feature vectors from the field data contained in the recognized card/ticket and in the labeled cards/tickets;
and calculate the distance between the feature vector of the recognized card/ticket and the feature vectors of the labeled cards/tickets, and obtain the preset number of neighbor cards/tickets closest to the recognized card/ticket.
Optionally, the apparatus may further include:
a second selecting unit 750, configured to determine a card/ticket containing an abnormal field to be an abnormal card/ticket and select at least part of the abnormal cards/tickets for manual labeling;
and a second abnormal sample processing unit 760, configured to place the manually labeled abnormal cards/tickets into the support set, or to use the manually labeled abnormal cards/tickets for iterative training of the model for OCR recognition.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
In one or more embodiments of the present specification, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (13)

1. A method for detecting data anomalies, comprising:
for any test sample in a test set, acquiring a preset number of neighbor samples similar to the test sample in a support set, wherein the test set is a set of test samples to be subjected to anomaly detection, and the support set is a set of labeled samples which are manually labeled in advance;
establishing a probability distribution function according to data corresponding to a target field in the neighbor samples;
calculating, through the probability distribution function, a confidence of to-be-detected data corresponding to the target field in the test sample;
and judging, according to the confidence, whether the target field in the test sample is abnormal.
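The procedure of claim 1 amounts to a nearest-neighbor confidence test. A minimal Python sketch follows, for illustration only; the Euclidean distance, the Gaussian model of the neighbors' field values, and the 0.05 threshold are assumptions, not fixed by the claim.

```python
import numpy as np
from math import erf, sqrt

def detect_field_anomaly(test_vec, test_value, support_vecs, support_values,
                         k=5, threshold=0.05):
    """Decide whether the target field of one test sample is abnormal.

    test_vec       : feature vector of the test sample
    test_value     : value of the target field in the test sample
    support_vecs   : (N, d) array of feature vectors of the labeled support set
    support_values : (N,) array of target-field values of the support samples
    """
    # 1. acquire the k labeled neighbor samples closest to the test sample
    dists = np.linalg.norm(support_vecs - test_vec, axis=1)
    idx = np.argsort(dists)[:k]

    # 2. establish a probability distribution (here a Gaussian) from the
    #    neighbors' target-field values
    mu, sigma = support_values[idx].mean(), support_values[idx].std() + 1e-6

    # 3. confidence of the test value under that distribution
    #    (two-sided tail probability of the normal distribution)
    z = abs(test_value - mu) / sigma
    confidence = 1.0 - erf(z / sqrt(2.0))

    # 4. judge the field as abnormal when the confidence is too low
    return confidence < threshold, confidence
```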
2. The method of claim 1, wherein the acquiring a preset number of neighbor samples similar to the test sample in a support set comprises:
generating corresponding feature vectors according to field data contained in the test sample and in the labeled samples;
and calculating distances between the feature vector of the test sample and the feature vectors of the labeled samples, and acquiring the preset number of neighbor samples closest to the test sample.
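Claim 2 leaves open how field data become feature vectors. One possible sketch, assuming each sample is a dict of string-valued fields and using two simple statistics per field (length and digit ratio); the field names are hypothetical:

```python
import numpy as np

FIELDS = ["name", "date", "amount"]   # hypothetical field names, not fixed by the claim

def featurize(sample):
    """Turn the field data of one sample (a dict of strings) into a feature vector."""
    feats = []
    for f in FIELDS:
        s = str(sample.get(f, ""))
        digit_ratio = sum(c.isdigit() for c in s) / max(len(s), 1)
        feats.extend([len(s), digit_ratio])
    return np.array(feats, dtype=float)

def nearest_neighbors(test_sample, support_samples, k=5):
    """Return the k labeled support samples closest to the test sample in feature space."""
    t = featurize(test_sample)
    dists = [np.linalg.norm(featurize(s) - t) for s in support_samples]
    order = np.argsort(dists)[:k]
    return [support_samples[i] for i in order]
```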
3. The method of claim 1, wherein the target field has respective description data in a plurality of dimensions, and the establishing a probability distribution function according to data corresponding to the target field in the neighbor samples comprises:
respectively establishing, based on the description data of the target field in each dimension, a probability distribution function corresponding to that dimension;
and the calculating, through the probability distribution function, a confidence of the to-be-detected data corresponding to the target field in the test sample comprises:
respectively acquiring, through the probability distribution functions corresponding to the dimensions, confidences of the to-be-detected data corresponding to the target field in the test sample in the respective dimensions.
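Claim 3 builds one distribution per descriptive dimension of the target field. A sketch, assuming each dimension is modeled by an independent Gaussian fitted to the neighbors' description data:

```python
import numpy as np
from math import erf, sqrt

def per_dimension_confidences(test_desc, neighbor_descs):
    """test_desc     : {dimension name: value} describing the target field of the test sample
       neighbor_descs: list of such dicts taken from the neighbor samples
       returns         {dimension name: confidence of the test value in that dimension}"""
    confidences = {}
    for dim, value in test_desc.items():
        vals = np.array([d[dim] for d in neighbor_descs], dtype=float)
        mu, sigma = vals.mean(), vals.std() + 1e-6     # one distribution per dimension
        z = abs(value - mu) / sigma
        confidences[dim] = 1.0 - erf(z / sqrt(2.0))    # tail probability as confidence
    return confidences
```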
4. The method of claim 1, further comprising:
generating a comprehensive confidence of the test sample according to the confidences of the to-be-detected data corresponding to the fields in the test sample;
and judging, according to the comprehensive confidence, whether the test sample as a whole is abnormal.
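Claim 4 does not specify how the per-field confidences are combined into a comprehensive confidence; the sketch below uses a geometric mean, with the minimum being another natural choice.

```python
import numpy as np

def sample_confidence(field_confidences, threshold=0.05):
    """field_confidences: {field name: confidence of that field's to-be-detected data}
       returns (is_abnormal, comprehensive confidence of the whole test sample)"""
    vals = np.array(list(field_confidences.values()), dtype=float)
    comprehensive = float(np.exp(np.log(vals + 1e-12).mean()))   # geometric mean
    return comprehensive < threshold, comprehensive
```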
5. The method of claim 1, wherein the test samples in the test set include at least one of: a test sample which is output by a machine learning model obtained through pre-training and needs data anomaly detection, and a labeled sample which is manually labeled in advance.
6. The method of claim 5, wherein the machine learning model is a model for OCR recognition of a preset object, and the support set contains labeled samples of a plurality of versions of the preset object.
7. The method of claim 1, further comprising:
determining a sample in which at least one field is abnormal as an abnormal sample, and selecting at least a part of the abnormal samples for manual labeling;
and placing the manually labeled abnormal samples into the support set; or, using the manually labeled abnormal samples for iterative training of a machine learning model applicable to the support set.
8. The method of claim 7, wherein the selecting at least a portion of the abnormal samples for manual labeling comprises:
classifying the abnormal samples according to the abnormal fields in the abnormal samples;
acquiring an abnormal proportion corresponding to each category according to the number of the abnormal samples of each category;
and extracting abnormal samples according to the abnormal proportions for manual labeling.
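Claims 7 and 8 describe the feedback loop: abnormal samples are grouped by their abnormal field, a per-category proportion is derived from the category sizes, and the samples sent for manual labeling are drawn accordingly. A sketch, assuming a fixed labeling budget:

```python
import random
from collections import defaultdict

def select_for_labeling(abnormal_samples, budget=100, seed=0):
    """abnormal_samples: list of (sample, abnormal_field_name) pairs
       budget          : total number of samples to send for manual labeling
       returns samples drawn per category in proportion to the category sizes"""
    if not abnormal_samples:
        return []
    random.seed(seed)
    by_field = defaultdict(list)
    for sample, field in abnormal_samples:            # classify by the abnormal field
        by_field[field].append(sample)

    total = len(abnormal_samples)
    selected = []
    for field, samples in by_field.items():
        share = len(samples) / total                  # abnormal proportion of this category
        n = min(len(samples), max(1, round(budget * share)))
        selected.extend(random.sample(samples, n))
    return selected
```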
9. A data anomaly detection method, applied to a scene of data anomaly detection on recognized card bills output by an OCR recognition model, the method comprising:
for any recognized card bill in a test set, acquiring a preset number of neighbor card bills similar to the recognized card bill in a support set, wherein the test set is a set of recognized card bills to be subjected to anomaly detection, and the support set is a set of labeled card bills which are manually labeled in advance;
establishing a probability distribution function according to data corresponding to a target field in the neighbor card bills;
calculating, through the probability distribution function, a confidence of to-be-detected data corresponding to the target field in the recognized card bill;
and judging, according to the confidence, whether the target field in the recognized card bill is abnormal.
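As a small worked example of claim 9, the confidence of a single recognized field can be checked against the same field in the neighbor card bills; the field name and values below are hypothetical.

```python
import numpy as np
from math import erf, sqrt

# Hypothetical OCR outputs: each recognized card bill is a dict of field values.
neighbor_bills = [{"amount": 120.0}, {"amount": 118.5}, {"amount": 121.7},
                  {"amount": 119.2}, {"amount": 120.8}]
recognized_bill = {"amount": 940.0}        # a suspicious OCR reading of the amount field

vals = np.array([b["amount"] for b in neighbor_bills])
mu, sigma = vals.mean(), vals.std() + 1e-6
z = abs(recognized_bill["amount"] - mu) / sigma
confidence = 1.0 - erf(z / sqrt(2.0))
print("amount field abnormal:", confidence < 0.05)   # -> True for this reading
```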
10. The method of claim 9, wherein the acquiring a preset number of neighbor card bills similar to the recognized card bill in a support set comprises:
generating corresponding feature vectors according to field data contained in the recognized card bill and in the labeled card bills;
and calculating distances between the feature vector of the recognized card bill and the feature vectors of the labeled card bills, and acquiring the preset number of neighbor card bills closest to the recognized card bill.
11. The method of claim 9, further comprising:
determining a card bill in which a field is abnormal as an abnormal card bill, and selecting at least a part of the abnormal card bills for manual labeling;
and placing the manually labeled abnormal card bills into the support set; or, using the manually labeled abnormal card bills for iterative training of the model for OCR recognition.
12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8 or 9 to 11.
13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 8 or 9 to 11.
CN202210632838.7A 2022-06-06 2022-06-06 Data anomaly detection method Pending CN115171125A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210632838.7A CN115171125A (en) 2022-06-06 2022-06-06 Data anomaly detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210632838.7A CN115171125A (en) 2022-06-06 2022-06-06 Data anomaly detection method

Publications (1)

Publication Number Publication Date
CN115171125A true CN115171125A (en) 2022-10-11

Family

ID=83485098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210632838.7A Pending CN115171125A (en) 2022-06-06 2022-06-06 Data anomaly detection method

Country Status (1)

Country Link
CN (1) CN115171125A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392408A (en) * 2022-10-31 2022-11-25 山东鲁抗医药集团赛特有限责任公司 Method and system for detecting abnormal operation of electronic particle counter
CN115392408B (en) * 2022-10-31 2023-04-07 山东鲁抗医药集团赛特有限责任公司 Method and system for detecting abnormal operation of electronic tablet counting machine
CN115598455A (en) * 2022-11-15 2023-01-13 西安弘捷电子技术有限公司(Cn) Automatic test system and test method for electronic information equipment
CN115598455B (en) * 2022-11-15 2023-04-07 西安弘捷电子技术有限公司 Automatic test system and test method for electronic information equipment

Similar Documents

Publication Publication Date Title
CN107239786B (en) Character recognition method and device
US9626555B2 (en) Content-based document image classification
US11816138B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
US20200184210A1 (en) Multi-modal document feature extraction
US20220004878A1 (en) Systems and methods for synthetic document and data generation
US8315465B1 (en) Effective feature classification in images
CN115171125A (en) Data anomaly detection method
CN109447080B (en) Character recognition method and device
CN110717366A (en) Text information identification method, device, equipment and storage medium
CN113837151B (en) Table image processing method and device, computer equipment and readable storage medium
CN113963147B (en) Key information extraction method and system based on semantic segmentation
CN111325156A (en) Face recognition method, device, equipment and storage medium
CN111881923A (en) Bill element extraction method based on feature matching
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN113297411B (en) Method, device and equipment for measuring similarity of wheel-shaped atlas and storage medium
JP2004171316A (en) Ocr device, document retrieval system and document retrieval program
CN111539576B (en) Risk identification model optimization method and device
CN111931229B (en) Data identification method, device and storage medium
CN110827261B (en) Image quality detection method and device, storage medium and electronic equipment
CN112287763A (en) Image processing method, apparatus, device and medium
CN113591657A (en) OCR (optical character recognition) layout recognition method and device, electronic equipment and medium
CN110781812A (en) Method for automatically identifying target object by security check instrument based on machine learning
CN113657378B (en) Vehicle tracking method, vehicle tracking system and computing device
CN115082919B (en) Address recognition method, electronic device and storage medium
CN113780131B (en) Text image orientation recognition method, text content recognition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination