CN111325260B - Data processing method and device, electronic equipment and computer readable medium - Google Patents

Data processing method and device, electronic equipment and computer readable medium

Info

Publication number
CN111325260B
CN111325260B (application CN202010093525.XA)
Authority
CN
China
Prior art keywords
data
marked
feature extraction
marked data
outlier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010093525.XA
Other languages
Chinese (zh)
Other versions
CN111325260A (en)
Inventor
苏业
冷家冰
管超
任思可
黄锋
李旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010093525.XA priority Critical patent/CN111325260B/en
Publication of CN111325260A publication Critical patent/CN111325260A/en
Application granted granted Critical
Publication of CN111325260B publication Critical patent/CN111325260B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The present disclosure provides a data processing method, including: acquiring a data detection request, where the data detection request includes a plurality of pieces of marked data; for each piece of marked data, predicting the marked data with at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier for the marked data, where the prediction result of the marked data includes information for determining whether the marked data is outlier data; and determining whether the marked data is outlier data according to the prediction result of the at least one anomaly detection classifier for the marked data. The present disclosure also provides a data processing apparatus, an electronic device, and a computer readable medium.

Description

Data processing method and device, electronic equipment and computer readable medium
Technical Field
The embodiment of the disclosure relates to the technical field of data processing, in particular to a data processing method and device, electronic equipment and a computer readable medium.
Background
When artificial intelligence algorithms are deployed in practice, data annotation and data cleaning are usually the links that consume the most effort from annotators and algorithm engineers. At the same time, data quality gradually becomes the upper bound of algorithm performance: problems such as erroneous labels, missing labels, and ambiguous labels greatly interfere with the optimization process and make it difficult for the algorithm to clearly distinguish positive samples from negative samples.
At present, after data is marked, the marked data is checked and adjusted manually by experts. Manual review of marked data is inefficient, and the quality of the marked data is difficult to guarantee.
Disclosure of Invention
The embodiment of the disclosure provides a data processing method and device, electronic equipment and a computer readable medium.
In a first aspect, an embodiment of the present disclosure provides a data processing method, including:
acquiring a data detection request, wherein the data detection request comprises a plurality of marked data;
for each piece of marked data, predicting the marked data by utilizing at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier on the marked data, wherein the prediction result of the marked data comprises information for determining whether the marked data is outlier data;
and determining whether the marked data is outlier data according to the prediction result of at least one anomaly detection classifier on the marked data.
In some embodiments, before the predicting, for each piece of marked data, the marked data by utilizing at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier on the marked data, the method further includes:
For each marked data, carrying out feature extraction on the marked data by utilizing at least one preset feature extraction algorithm to obtain a feature vector corresponding to the marked data;
and taking the feature vectors corresponding to the marked data as input variables, and performing unsupervised model training by using at least one preset anomaly detection algorithm to obtain the anomaly detection classifier corresponding to each anomaly detection algorithm, wherein the output variables of the anomaly detection classifier are variables for determining whether the marked data are outlier data.
In some embodiments, for each marked data, feature extraction is performed on the marked data by using at least one preset feature extraction algorithm to obtain a feature vector corresponding to the marked data, including:
and carrying out feature extraction on the marked data by utilizing a preset feature extraction algorithm aiming at each marked data to obtain a feature vector corresponding to the marked data.
In some embodiments, for each marked data, feature extraction is performed on the marked data by using at least one preset feature extraction algorithm to obtain a feature vector corresponding to the marked data, including:
For each marked data, respectively carrying out feature extraction on the marked data by utilizing a plurality of preset feature extraction algorithms to obtain feature extraction results which correspond to the marked data and are extracted based on each feature extraction algorithm;
and generating the feature vector corresponding to the marked data according to a plurality of feature extraction results corresponding to the marked data.
In some embodiments, in the case of predicting the marked data using one pre-trained anomaly detection classifier, determining whether the marked data is outlier data according to the prediction result of the at least one anomaly detection classifier on the marked data comprises:
when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is outlier data, determining that the marked data is outlier data;
when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is non-outlier data, the marked data is determined to be the non-outlier data.
In some embodiments, in the case of predicting the marked data using a plurality of pre-trained anomaly detection classifiers, determining whether the marked data is outlier data according to the prediction result of the at least one anomaly detection classifier on the marked data comprises:
And determining whether the marked data is outlier data or not based on a preset voting decision mechanism according to the prediction results of the plurality of anomaly detection classifiers for predicting the marked data.
In some embodiments, the determining, based on a preset voting decision mechanism, whether the marked data is outlier data according to the prediction results of predicting the marked data by the plurality of anomaly detection classifiers, includes:
counting the quantity of information used for determining the marked data as outlier data in the prediction results of the marked data predicted by a plurality of anomaly detection classifiers;
judging whether the ratio of the number to the total number of the prediction results is greater than or equal to a preset threshold value;
and when the ratio of the number to the total number of the prediction results is larger than or equal to a preset threshold value, determining that the marked data is outlier data.
In some embodiments, the at least one preset feature extraction algorithm comprises: a flatten feature extraction algorithm, a color moment feature extraction algorithm, a histogram of oriented gradients feature extraction algorithm, a local binary pattern feature extraction algorithm, and a SIFT feature extraction algorithm.
In some embodiments, the at least one preset anomaly detection algorithm comprises: an ABOD algorithm, an isolation forest algorithm, an HBOS algorithm, a KNN algorithm, a PCA algorithm, and an MCD algorithm.
In a second aspect, embodiments of the present disclosure provide a data processing apparatus, including:
the acquisition module is used for acquiring a data detection request, wherein the data detection request comprises a plurality of marked data;
the prediction module is used for predicting, for each piece of marked data, the marked data by utilizing at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier on the marked data, wherein the prediction result of the marked data comprises information for determining whether the marked data is outlier data;
and the determining module is used for determining whether the marked data is outlier data according to the prediction result of at least one anomaly detection classifier on the marked data.
In some embodiments, the apparatus further comprises a training module, and the training module comprises a feature extraction sub-module and a training sub-module;
the feature extraction sub-module is used for extracting features of the marked data by utilizing at least one preset feature extraction algorithm aiming at each marked data to obtain feature vectors corresponding to the marked data;
the training sub-module is used for performing unsupervised model training by using at least one preset anomaly detection algorithm by taking the feature vectors corresponding to the marked data as input variables, so as to obtain the anomaly detection classifier corresponding to each anomaly detection algorithm, wherein the output variable of the anomaly detection classifier is a variable used for determining whether the marked data is outlier data.
In some embodiments, the feature extraction sub-module is specifically configured to perform feature extraction on each piece of marked data by using a preset feature extraction algorithm, so as to obtain a feature vector corresponding to the marked data.
In some embodiments, the feature extraction sub-module is specifically configured to, for each labeled data, respectively perform feature extraction on the labeled data by using a plurality of preset feature extraction algorithms, so as to obtain a feature extraction result, corresponding to the labeled data, extracted based on each feature extraction algorithm; and generating the feature vector corresponding to the marked data according to a plurality of feature extraction results corresponding to the marked data.
In some embodiments, where the prediction module predicts the marked data using one pre-trained anomaly detection classifier,
the determining module is specifically configured to determine that the marked data is outlier data when the prediction result of the anomaly detection classifier on the marked data includes information for determining that the marked data is outlier data; when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is non-outlier data, the marked data is determined to be the non-outlier data.
In some embodiments, where the prediction module predicts the marked data using a plurality of pre-trained anomaly detection classifiers,
the determining module is specifically configured to determine whether the marked data is outlier data based on a preset voting decision mechanism according to a prediction result of predicting the marked data by the plurality of anomaly detection classifiers.
In some embodiments, the determining module is specifically configured to count the number of information that is used to determine that the marked data is outlier data in the prediction results that are predicted by the plurality of anomaly detection classifiers; judging whether the ratio of the number to the total number of the prediction results is greater than or equal to a preset threshold value; and when the ratio of the number to the total number of the prediction results is larger than or equal to a preset threshold value, determining that the marked data is outlier data.
In a third aspect, embodiments of the present disclosure provide an electronic device, comprising:
one or more processors;
a memory having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the data processing method as described in any of the embodiments above;
One or more I/O interfaces coupled between the processor and the memory configured to enable information interaction of the processor with the memory.
In a fourth aspect, an embodiment of the present disclosure provides a computer readable medium having a computer program stored thereon, where the computer program when executed implements the data processing method according to any of the above embodiments.
According to the data processing method and apparatus, the electronic device, and the computer readable medium provided by the embodiments of the present disclosure, after marked data are collected, the marked data are classified with a pre-trained anomaly detection classifier so that outlier data among all the marked data are screened out. This effectively reduces the time needed to review the marked data, improves review efficiency, helps ensure the accuracy of the quality review of the marked data, and improves the quality of the marked data.
Drawings
The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of one embodiment of step 13 of FIG. 1;
FIG. 3 is a flow chart of another data processing method provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of one specific implementation of step 111 in FIG. 3;
fig. 5 is a schematic view of an application scenario of a data processing method according to an embodiment of the disclosure;
FIG. 6 is a block diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of another data processing apparatus provided in an embodiment of the present disclosure;
fig. 8 is a schematic view of an application scenario of a data processing apparatus according to an embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order to better understand the technical solutions of the present disclosure, the following describes in detail a data processing method and apparatus, an electronic device, and a computer readable medium provided in the present disclosure with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the method may be performed by a data processing apparatus; the apparatus may be implemented in software and/or hardware and may be integrated into an electronic device such as a server. The data processing method comprises the following steps:
step 11, acquiring a data detection request, wherein the data detection request comprises a plurality of marked data.
In the embodiments of the present disclosure, in the context of big data mining and artificial intelligence algorithms, data needs to be marked after data acquisition is completed. After the data has been marked according to expert knowledge, a quality review of the marked data is carried out: first, in step 11, a data detection request is acquired, where the data detection request includes a plurality of pieces of marked data.
Step 12, predicting, for each piece of marked data, the marked data by utilizing at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier on the marked data.
For each piece of marked data, the prediction result of the marked data includes information for determining whether the marked data is outlier data.
In the embodiments of the present disclosure, at least one anomaly detection classifier is trained in advance. The input of each anomaly detection classifier is the feature vector of a piece of marked data, and the output is information for determining whether that marked data is outlier data. With an anomaly detection classifier it can be predicted whether marked data is outlier data; that is, the classifier has two possible output results (prediction results): outlier data and non-outlier data (normal data). In other words, the anomaly detection classifier classifies all the marked data into outlier data or non-outlier data.
In some embodiments, in step 12, for each piece of marked data, the marked data is predicted with one pre-trained anomaly detection classifier to obtain the prediction result of the anomaly detection classifier on the marked data. Specifically, in step 12, the feature vector of the marked data obtained in advance is input into the anomaly detection classifier for outlier prediction, and a prediction result is output. When the anomaly detection classifier predicts that the marked data is outlier data, the prediction result includes information for determining that the marked data is outlier data; when the anomaly detection classifier predicts that the marked data is non-outlier data, the prediction result includes information for determining that the marked data is non-outlier data.
When the marked data is predicted with one pre-trained anomaly detection classifier, the anomaly detection classifier may be obtained in advance by performing unsupervised model training on a plurality of pieces of marked data based on one preset anomaly detection algorithm. The preset anomaly detection algorithm may be any one of an angle-based outlier detection (ABOD) algorithm, an isolation forest algorithm, a histogram-based outlier score (HBOS) algorithm, a k-nearest-neighbor (KNN) algorithm, a principal component analysis (PCA) algorithm, and a minimum covariance determinant (MCD) algorithm; it should be noted, however, that the anomaly detection algorithm in the embodiments of the present disclosure is not limited to the algorithms listed above, and other unsupervised anomaly detection algorithms may also be employed.
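As an illustrative, non-limiting sketch of this single-classifier case (assuming Python and the PyOD library as one possible implementation; the present disclosure does not name a concrete library), the unsupervised training and the subsequent prediction step might look as follows:

```python
# Minimal sketch: trains one unsupervised anomaly detection classifier on the
# feature vectors of the marked data, then predicts one sample. PyOD and its
# KNN detector are implementation assumptions.
import numpy as np
from pyod.models.knn import KNN  # k-nearest-neighbor outlier detector

def train_single_classifier(feature_vectors: np.ndarray, contamination: float = 0.05):
    """Unsupervised training: only feature vectors are needed, no labels."""
    clf = KNN(contamination=contamination)
    clf.fit(feature_vectors)          # fit on the feature vectors of all marked data
    return clf

def predict_outlier(clf, feature_vector: np.ndarray) -> bool:
    """Returns True when the classifier predicts the marked datum is outlier data."""
    # PyOD's predict() outputs 1 for outliers and 0 for inliers (non-outliers).
    return bool(clf.predict(feature_vector.reshape(1, -1))[0] == 1)
```

The contamination parameter is a PyOD convention for the assumed proportion of outliers and is not part of the embodiments described above.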
In some embodiments, in step 12, for each labeled data, a plurality of pre-trained anomaly detection classifiers are used to predict the labeled data to obtain a plurality of prediction results corresponding to the labeled data. Specifically, in step 12, the feature vectors of the labeled data obtained in advance are input into each anomaly detection classifier, so that each anomaly detection classifier performs outlier prediction on the labeled data, and outputs a corresponding prediction result. For each anomaly detection classifier, when the anomaly detection classifier predicts that the marked data is outlier data, the prediction result correspondingly output by the anomaly detection classifier comprises information for determining that the marked data is outlier data, and when the anomaly detection classifier predicts that the marked data is non-outlier data, the prediction result correspondingly output by the anomaly detection classifier comprises information for determining that the marked data is non-outlier data.
When the marked data is predicted with a plurality of pre-trained anomaly detection classifiers, the plurality of anomaly detection classifiers may be obtained in advance based on a plurality of preset, mutually different anomaly detection algorithms: for each anomaly detection algorithm, unsupervised model training is performed with the plurality of pieces of marked data to obtain a corresponding anomaly detection classifier, so that the plurality of anomaly detection classifiers differ from one another. The plurality of preset different anomaly detection algorithms include, but are not limited to, at least two of an ABOD algorithm, an isolation forest algorithm, an HBOS algorithm, a KNN algorithm, a PCA algorithm, and an MCD algorithm.
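A minimal sketch of this multi-classifier case, assuming a dictionary of anomaly detection classifiers that have already been trained as described above (the function name and data layout are illustrative only):

```python
import numpy as np

def predict_with_all(classifiers: dict, feature_vector: np.ndarray) -> dict:
    """classifiers: name -> already-trained anomaly detection classifier.

    Returns one prediction result per classifier for the same marked datum,
    encoded as 1 (outlier data) or 0 (non-outlier data), following the 0/1
    labelling convention assumed in the sketch above.
    """
    x = feature_vector.reshape(1, -1)
    return {name: int(clf.predict(x)[0]) for name, clf in classifiers.items()}
```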
And step 13, determining whether the marked data is outlier data according to the prediction result of at least one anomaly detection classifier on the marked data.
In some embodiments, in the case that the marked data is predicted using one pre-trained anomaly detection classifier in step 12, it is determined in step 13 whether the marked data is outlier data according to the prediction result of that anomaly detection classifier on the marked data.
Specifically, in the case of predicting the marked data with one anomaly detection classifier, when the prediction result of the anomaly detection classifier on the marked data includes information for determining that the marked data is outlier data, the marked data is determined to be outlier data. In other words, when the anomaly detection classifier predicts that the marked data is outlier data, the marked data is determined to be outlier data, that is, abnormal data such as wrongly labeled data, ambiguously labeled data, or data with missing labels.
When the prediction result of the anomaly detection classifier on the marked data includes information for determining that the marked data is non-outlier data, the marked data is determined to be non-outlier data. In other words, when the anomaly detection classifier predicts that the marked data is non-outlier data, the marked data is determined to be non-outlier data, that is, normal, correctly marked data.
In some embodiments, in step 12, in the case that the labeled data is predicted by using a plurality of different anomaly detection classifiers that are trained in advance, in step 13, it is determined whether the labeled data is outlier data according to the prediction results of the labeled data by the plurality of anomaly detection classifiers. In some embodiments, in the case where the labeled data is predicted using a plurality of anomaly detection classifiers that are trained in advance in step 12, step 13 includes: and determining whether the marked data is outlier data or not based on a preset voting decision mechanism according to the prediction results of the plurality of anomaly detection classifiers for predicting the marked data.
FIG. 2 is a flowchart of one implementation of step 13 in FIG. 1. In some embodiments, in the case where the marked data is predicted using a plurality of pre-trained anomaly detection classifiers in step 12, as shown in FIG. 2, step 13 specifically includes:
Step 1321, counting, among the prediction results of the plurality of anomaly detection classifiers on the marked data, the number of prediction results that include information for determining that the marked data is outlier data.
For example, assume that four anomaly detection classifiers are trained in advance: anomaly detection classifiers A, B, C, and D. In step 12, classifier A produces prediction result a for the marked data, classifier B produces prediction result b, classifier C produces prediction result c, and classifier D produces prediction result d. Then, in step 1321, the number S of prediction results among a, b, c, and d that include information for determining that the marked data is outlier data is counted. For example, if prediction results a, b, and c each include information for determining that the marked data is outlier data, while prediction result d includes information for determining that the marked data is non-outlier data, the number counted in step 1321 is S = 3.
Step 1322, calculating a ratio of the number of pieces of information used to determine that the labeled data is outlier data to the total number of prediction results corresponding to the labeled data.
For example, if the number of pre-trained anomaly detection classifiers is 4, the total number of prediction results corresponding to the marked data obtained in step 12 is 4. With the number of prediction results indicating outlier data counted in step 1321 being S = 3, the ratio calculated in step 1322 is 3/4 = 75%.
Step 1323, judging whether the ratio corresponding to the marked data is greater than or equal to a preset threshold; if so, determining that the marked data is outlier data, and otherwise determining that the marked data is non-outlier data.
In some embodiments, the preset threshold may be set according to actual needs, e.g., the preset threshold may be set to 50%, 60%, etc.
In step 1323, when the ratio corresponding to the marked data is judged to be greater than or equal to the preset threshold, most of the prediction results of the anomaly detection classifiers indicate that the marked data is outlier data, that is, the marked data is highly likely to be outlier data. The marked data is therefore determined to be outlier data and flagged as abnormally marked data; it is then fed back to the annotator for manual secondary review, or discarded.
When the ratio corresponding to the marked data is judged to be smaller than the preset threshold, most of the prediction results of the anomaly detection classifiers indicate that the marked data is non-outlier data, that is, the marked data is highly likely to be non-outlier data. The marked data is therefore determined to be non-outlier data, flagged as normally marked data, and not processed further.
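A short sketch of the voting decision in steps 1321 to 1323 (the function and variable names are illustrative; the 50% threshold matches the example value mentioned above):

```python
def is_outlier_by_vote(predictions: dict, threshold: float = 0.5) -> bool:
    """predictions: classifier name -> 1 (outlier data) / 0 (non-outlier data)."""
    outlier_votes = sum(1 for label in predictions.values() if label == 1)  # step 1321
    ratio = outlier_votes / len(predictions)                                # step 1322
    return ratio >= threshold                                               # step 1323

# Example matching the text: classifiers A, B and C predict outlier data,
# classifier D predicts non-outlier data, so the ratio is 3/4 = 75% >= 50%.
votes = {"a": 1, "b": 1, "c": 1, "d": 0}
assert is_outlier_by_vote(votes, threshold=0.5)
```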
In some embodiments, when the marked data is predicted using a plurality of different anomaly detection classifiers in step 12, it is determined in step 13, through steps 1321 to 1323, whether the marked data is outlier data using a majority-rule voting decision mechanism in which the minority yields to the majority.
In some embodiments, in the case of predicting the marked data using a plurality of pre-trained anomaly detection classifiers in step 12, the voting decision mechanism used to determine whether the marked data is outlier data may also be a voting decision mechanism based on a neural network algorithm, a machine learning algorithm, a boosting algorithm, a support vector machine (SVM), or the like.
In some embodiments, when the marked data is determined to be outlier data, the outlier data is recorded and flagged as abnormally marked data. The abnormally marked data is fed back to the annotator so that the annotator can manually re-check it and adjust its annotation, and the adjusted marked data returned by the annotator is recorded. In some embodiments, when the marked data is determined to be outlier data, the marked data is removed instead.
In some embodiments, when it is determined that the marked data is non-outlier data, the non-outlier data is recorded and flagged as normally marked data.
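Purely as a hypothetical sketch of this post-decision handling (review_queue, normal_store, and the flag strings are stand-ins and do not appear in the present disclosure):

```python
def route_marked_data(item: dict, is_outlier: bool,
                      review_queue: list, normal_store: list) -> None:
    if is_outlier:
        item["flag"] = "abnormal annotation"   # flag as abnormally marked data
        review_queue.append(item)              # send back to the annotator for re-check
    else:
        item["flag"] = "normal annotation"     # record as normally marked data
        normal_store.append(item)
```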
In the embodiments of the present disclosure, all data, including the marked data, the outlier data, the non-outlier data, and the adjusted marked data returned by the annotator, need to be stored in a preset database so that the data can later be used for model iteration of the anomaly detection classifier.
According to the data processing method provided by the embodiments of the present disclosure, after marked data are collected, the marked data are classified with a pre-trained anomaly detection classifier so that outlier data among all the marked data are screened out. This effectively reduces the time needed to review the marked data, improves review efficiency, helps ensure the accuracy of the quality review of the marked data, and improves the quality of the marked data.
Fig. 3 is a flowchart of another data processing method according to an embodiment of the disclosure, and in some embodiments, as shown in fig. 3, the data processing method is different from the data processing method according to any of the foregoing embodiments in that: the data processing method further comprises step 111 and step 112 before step 12. The following description is only about step 111 and step 112, and other descriptions may be specifically referred to the description of the data processing method in any of the foregoing embodiments, which is not repeated herein.
Step 111, for each marked data, performing feature extraction on the marked data by using at least one preset feature extraction algorithm to obtain a feature vector corresponding to the marked data.
In some embodiments, step 111 comprises: performing, for each piece of marked data, feature extraction on the marked data with one preset feature extraction algorithm to obtain the feature vector corresponding to the marked data. Specifically, in step 111, the preset feature extraction algorithm is used to perform feature engineering on the marked data so as to extract the feature vector corresponding to the marked data. In this case, the feature extraction algorithm may employ any one of the following: a flatten feature extraction algorithm, a color moment feature extraction algorithm, a histogram of oriented gradients (HOG) feature extraction algorithm, a local binary pattern (LBP) feature extraction algorithm, or a SIFT feature extraction algorithm. It should be noted that the feature extraction algorithm of the embodiments of the present disclosure is not limited to the above; any other suitable feature extraction algorithm may also be employed, and they are not listed here one by one.
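As a minimal sketch of this single-algorithm case in step 111, using the flatten feature extraction named above (NumPy is an implementation assumption); any of the other algorithms (HOG, LBP, color moment, SIFT) could be substituted at the same place:

```python
import numpy as np

def flatten_features(image: np.ndarray) -> np.ndarray:
    """Flattens an H x W (x C) image-like marked datum into one 1-D feature vector."""
    return image.astype(np.float32).ravel()
```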
In some embodiments, step 111 comprises: performing, for each piece of marked data, feature extraction on the marked data with each of a plurality of preset feature extraction algorithms to obtain the feature vector corresponding to the marked data. Fig. 4 is a flowchart of a specific implementation of step 111 in Fig. 3. As shown in Fig. 4, in the case where feature extraction is performed on the marked data with a plurality of preset feature extraction algorithms, step 111 specifically includes step 1111 and step 1112.
Step 1111, performing, for each piece of marked data, feature extraction on the marked data with each of a plurality of preset feature extraction algorithms to obtain, for the marked data, the feature extraction result extracted based on each feature extraction algorithm.
Specifically, in step 1111, feature engineering is performed on the labeled data by using a plurality of preset feature extraction algorithms, so as to obtain feature extraction results extracted based on each feature extraction algorithm, thereby obtaining a plurality of feature extraction results corresponding to the labeled data, where each feature extraction result corresponding to the labeled data is a feature vector. That is, in step 1111, a plurality of feature vectors corresponding to the labeled data may be obtained.
In this case, the plurality of preset feature extraction algorithms are different from one another, and the plurality of different feature extraction algorithms include at least two of: a flatten algorithm, a color moment algorithm, a histogram of oriented gradients (HOG) algorithm, a local binary pattern (LBP) algorithm, and a SIFT algorithm. It should be noted that the feature extraction algorithms of the embodiments of the present disclosure are not limited to these; other suitable feature extraction algorithms may also be used and are not listed here one by one.
Step 1112, generating the feature vector corresponding to the marked data according to the feature extraction results corresponding to the marked data.
Specifically, feature fusion is performed on a plurality of feature extraction results corresponding to the marked data, so that a final feature vector corresponding to the marked data is generated.
In some embodiments, feature engineering is performed on the marked data with multiple feature extraction algorithms, the feature extraction results are fused to obtain the feature vector of the marked data, and the feature vector is fed to the anomaly detection algorithm for model training and prediction; this can effectively improve the recognition rate and prediction accuracy of the trained model.
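An illustrative sketch of steps 1111 and 1112, assuming image-like marked data and the scikit-image implementations of HOG and LBP (an assumption; the present disclosure does not prescribe a library). Each extractor yields one feature extraction result, and the results are concatenated into the final feature vector:

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

def color_moments(image: np.ndarray) -> np.ndarray:
    """First two color moments (mean and standard deviation) per channel."""
    channels = image.reshape(-1, image.shape[-1]).astype(np.float32)
    return np.concatenate([channels.mean(axis=0), channels.std(axis=0)])

def fused_feature_vector(gray: np.ndarray, color: np.ndarray) -> np.ndarray:
    hog_vec = hog(gray, pixels_per_cell=(16, 16))            # step 1111: HOG result
    lbp_map = local_binary_pattern(gray, P=8, R=1.0)         # step 1111: LBP result
    lbp_hist, _ = np.histogram(lbp_map, bins=10, density=True)
    return np.concatenate([hog_vec, lbp_hist, color_moments(color)])  # step 1112: fusion
```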
And 112, taking the feature vectors corresponding to the marked data as input variables, and performing unsupervised model training by using at least one preset anomaly detection algorithm to obtain an anomaly detection classifier corresponding to each anomaly detection algorithm, wherein the output variables of the anomaly detection classifier are variables for determining whether the marked data are outlier data.
After the feature vectors corresponding to the marked data are obtained in the step 111, in step 112, an unsupervised model training is performed by using at least one preset anomaly detection algorithm to obtain an anomaly detection classifier corresponding to each anomaly detection algorithm, where the output variable of the anomaly detection classifier is a variable for determining whether the marked data are outlier data.
In some embodiments, in step 112, the feature vectors corresponding to the marked data are used as input variables, and unsupervised model training is performed with one preset anomaly detection algorithm to obtain the anomaly detection classifier corresponding to that algorithm, where the output variable of the anomaly detection classifier is a variable for determining whether the marked data is outlier data. In this case, the preset anomaly detection algorithm may be any one of an ABOD algorithm, an isolation forest algorithm, an HBOS algorithm, a KNN algorithm, a PCA algorithm, and an MCD algorithm; it should be noted that the anomaly detection algorithm in the embodiments of the present disclosure is not limited to the algorithms listed above, and other unsupervised anomaly detection algorithms may also be employed. In this case, in step 12, for each piece of marked data, the marked data is predicted with the trained anomaly detection classifier corresponding to that anomaly detection algorithm.
In some embodiments, in step 112, the feature vectors corresponding to the marked data are used as input variables, and unsupervised model training is performed with a plurality of preset anomaly detection algorithms to obtain the anomaly detection classifier corresponding to each anomaly detection algorithm, where the output variable of each anomaly detection classifier is a variable for determining whether the marked data is outlier data. In this case, the plurality of preset anomaly detection algorithms are different from one another and may include at least two of an ABOD algorithm, an isolation forest algorithm, an HBOS algorithm, a KNN algorithm, a PCA algorithm, and an MCD algorithm. In this case, in step 12, for each piece of marked data, the marked data is predicted with the plurality of trained anomaly detection classifiers.
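A sketch of step 112 for the multi-algorithm case, again assuming the PyOD detector classes as stand-ins for the algorithms listed above; one unsupervised anomaly detection classifier is trained per algorithm on the fused feature vectors of all the marked data:

```python
import numpy as np
from pyod.models.abod import ABOD
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.mcd import MCD
from pyod.models.pca import PCA

def train_classifiers(feature_matrix: np.ndarray) -> dict:
    """feature_matrix: one row per piece of marked data (its fused feature vector)."""
    algorithms = {"ABOD": ABOD, "iForest": IForest, "HBOS": HBOS,
                  "KNN": KNN, "PCA": PCA, "MCD": MCD}
    classifiers = {}
    for name, cls in algorithms.items():
        clf = cls()                 # default parameters; no labels are required
        clf.fit(feature_matrix)     # unsupervised model training
        classifiers[name] = clf
    return classifiers
```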
In the embodiments of the present disclosure, the unsupervised anomaly detection algorithms have the advantages of high computation speed, no need for pre-training, low resource occupation, and the like.
Fig. 5 is a schematic view of an application scenario of a data processing method according to an embodiment of the present disclosure; the data processing method provided by the embodiments of the present disclosure is applicable to any algorithm application scenario driven by manually annotated data. As shown in Fig. 5, in this application scenario, for each piece of marked data to be detected, feature engineering is first performed on the marked data with multiple feature extraction algorithms (HOG, LBP, color moment, SIFT, and the like); the feature extraction results of the multiple feature extraction algorithms are then fused to obtain the feature vector corresponding to the marked data; unsupervised model training is performed, based on multiple anomaly detection algorithms, with the feature vectors corresponding to a plurality of pieces of marked data to obtain the anomaly detection classifiers corresponding to the multiple anomaly detection algorithms (HBOS, iForest, KNN, ABOD, and the like); next, for each piece of marked data, the fused feature vector corresponding to the marked data is fed into each anomaly detection classifier for outlier prediction to obtain the prediction result corresponding to each anomaly detection classifier; finally, whether the marked data is outlier data is determined according to a preset voting decision mechanism.
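Under the same assumptions, the Fig. 5 scenario can be tied together in one short, illustrative pipeline (extract_fused and train_classifiers stand for the feature-fusion and ensemble-training sketches given earlier):

```python
import numpy as np

def detect_outliers(marked_items, extract_fused, train_classifiers, threshold=0.5):
    """marked_items: iterable of marked data; extract_fused: datum -> fused feature vector."""
    features = np.vstack([extract_fused(item) for item in marked_items])
    classifiers = train_classifiers(features)            # unsupervised ensemble training
    flags = []
    for row in features:
        x = row.reshape(1, -1)
        votes = [int(clf.predict(x)[0]) for clf in classifiers.values()]
        flags.append(sum(votes) / len(votes) >= threshold)   # voting decision
    return flags                                          # True = outlier data
```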
Fig. 6 is a block diagram of a data processing apparatus according to an embodiment of the present disclosure, and as shown in fig. 6, the data processing apparatus is configured to implement the above-mentioned data processing method, and the data processing apparatus includes: an acquisition module 201, a prediction module 202 and a determination module 203.
The acquiring module 201 is configured to acquire a data detection request, where the data detection request includes a plurality of marked data.
The prediction module 202 is configured to predict, for each labeled data, the labeled data by using at least one anomaly detection classifier trained in advance, so as to obtain a prediction result of each anomaly detection classifier on the labeled data, where the prediction result of the labeled data includes information for determining whether the labeled data is outlier data.
The determining module 203 is configured to determine whether the marked data is outlier data according to a prediction result of the marked data by at least one anomaly detection classifier.
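Purely for illustration, the three modules can be pictured as the following skeleton (the class names mirror the module names and are not part of the present disclosure; the classifier objects are assumed to expose a PyOD-style predict method):

```python
class AcquisitionModule:
    def acquire(self, request: dict) -> list:
        return request["marked_data"]                  # the marked data in the request

class PredictionModule:
    def __init__(self, classifiers: dict):
        self.classifiers = classifiers                 # pre-trained anomaly detection classifiers
    def predict(self, feature_vector) -> dict:
        x = feature_vector.reshape(1, -1)
        return {name: int(clf.predict(x)[0]) for name, clf in self.classifiers.items()}

class DeterminationModule:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
    def is_outlier(self, predictions: dict) -> bool:
        return sum(predictions.values()) / len(predictions) >= self.threshold
```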
Fig. 7 is a block diagram illustrating another data processing apparatus according to an embodiment of the disclosure, and in some embodiments, as shown in fig. 7, the data processing apparatus further includes a training module 204, where the training module 204 includes a feature extraction submodule 2041 and a training submodule 2042.
The feature extraction submodule 2041 is configured to perform feature extraction on each piece of marked data by using at least one preset feature extraction algorithm, so as to obtain a feature vector corresponding to the marked data.
The training submodule 2042 is configured to perform unsupervised model training by using at least one preset anomaly detection algorithm with feature vectors corresponding to each labeled data as input variables, to obtain an anomaly detection classifier corresponding to each anomaly detection algorithm, where output variables of the anomaly detection classifier are variables for determining whether the labeled data is outlier data.
In some embodiments, the feature extraction submodule 2041 is specifically configured to perform feature extraction on each piece of marked data by using a preset feature extraction algorithm, so as to obtain a feature vector corresponding to the marked data.
In some embodiments, the feature extraction submodule 2041 is specifically configured to, for each piece of labeled data, perform feature extraction on the labeled data by using a plurality of preset feature extraction algorithms, so as to obtain a feature extraction result, corresponding to the labeled data, extracted based on each feature extraction algorithm; and generating a feature vector corresponding to the marked data according to a plurality of feature extraction results corresponding to the marked data.
In some embodiments, in the case that the prediction module 202 predicts the marked data by using a pre-trained anomaly detection classifier, the determining module 203 is specifically configured to determine that the marked data is outlier data when the prediction result of the anomaly detection classifier on the marked data includes information for determining that the marked data is outlier data; when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is non-outlier data, the marked data is determined to be the non-outlier data.
In some embodiments, in the case that the prediction module 202 predicts the marked data by using a plurality of anomaly detection classifiers that are trained in advance, the determining module 203 is specifically configured to determine, based on a preset voting decision mechanism, whether the marked data is outlier data according to the prediction results of the plurality of anomaly detection classifiers that respectively predict the marked data.
In some embodiments, in the case that the prediction module 202 predicts the marked data by using a plurality of anomaly detection classifiers that are trained in advance, the determining module 203 is specifically configured to count the amount of information that is used to determine that the marked data is outlier data in the prediction results of the marked data predicted by the plurality of anomaly detection classifiers; judging whether the ratio of the number to the total number of the prediction results is greater than or equal to a preset threshold value; and when the ratio of the number to the total number of the prediction results is larger than or equal to a preset threshold value, determining that the marked data is outlier data.
Fig. 8 is a schematic diagram of an application scenario of a data processing apparatus according to an embodiment of the present disclosure; the data processing apparatus of the embodiments of the present disclosure is applicable to any algorithm application scenario driven by manually annotated data. As shown in Fig. 8, in this application scenario, the data processing apparatus further includes, in addition to the modules described above, a data acquisition module 205, a data annotation module 206, an algorithm library 208, a feature library 209, a feedback module 210, and a database 211. The data acquisition module 205 is configured to acquire data. After the data acquisition module 205 completes data acquisition, the annotator annotates the data through the data annotation module 206 according to expert knowledge, and the annotated data is then sent as a data detection request (query) to the acquisition module 201 through the data annotation module 206. The acquisition module 201 can perform load balancing according to the running state of each server and send the data detection request to an optimal server carrying the training module 204. After receiving the data detection request, the training module 204 starts: it first invokes an appropriate feature extraction algorithm from the feature library 209 to perform feature extraction on the marked data, and then invokes an appropriate unsupervised anomaly detection algorithm from the algorithm library 208 to perform unsupervised model training and obtain the anomaly detection classifier. The prediction module 202 then performs outlier prediction on the marked data based on the anomaly detection classifier, and the determining module 203 determines whether the marked data is outlier data according to the prediction result. When the determining module 203 determines that the marked data is outlier data, the feedback module 210 records the outlier data, flags it as abnormally marked data, and feeds it back to the annotator for manual secondary review and adjustment; when the determining module 203 determines that the marked data is non-outlier data, the feedback module 210 records the non-outlier data and flags it as normally marked data. The database 211 is used to store the marked data, the outlier data and non-outlier data output by the anomaly detection classifier, and the data returned by the annotator after manual secondary review and adjustment; the data stored in the database 211 can be used for model iteration of the anomaly detection classifier.
The algorithm library 208 contains a variety of unsupervised anomaly detection algorithms, and the feature library 209 contains a variety of feature extraction algorithms applicable to data such as text, speech, and images; they can be called by the system according to the user's accuracy and speed requirements.
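A sketch of the algorithm library / feature library idea as plain registries that the training module could look up by name; the registry contents and names shown are assumptions, not an inventory of modules 208 and 209:

```python
import numpy as np
from pyod.models.hbos import HBOS       # fast, histogram-based
from pyod.models.iforest import IForest
from pyod.models.knn import KNN         # distance-based, slower on large data

ALGORITHM_LIBRARY = {"HBOS": HBOS, "iForest": IForest, "KNN": KNN}

FEATURE_LIBRARY = {
    # extractor name -> callable(marked datum) -> 1-D feature vector
    "flatten": lambda img: np.asarray(img, dtype=np.float32).ravel(),
}

def select(algorithm_name: str, feature_name: str):
    """Instantiates one detector and looks up one extractor, e.g. for a speed/accuracy trade-off."""
    return ALGORITHM_LIBRARY[algorithm_name](), FEATURE_LIBRARY[feature_name]
```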
In addition, the data processing apparatus provided in the embodiments of the present disclosure is specifically configured to implement the foregoing data processing method, and the description of the foregoing data processing method may be specifically referred to, which is not repeated herein.
Fig. 9 is a block diagram of an electronic device according to an embodiment of the disclosure, as shown in fig. 9, where the electronic device includes: one or more processors 501; a memory 502 having one or more programs stored thereon, which when executed by the one or more processors 501 cause the one or more processors 501 to implement the data processing method described above; one or more I/O interfaces 503 coupled between the processor 501 and the memory 502 are configured to enable information interaction of the processor 501 with the memory 502.
The disclosed embodiments also provide a computer readable storage medium having a computer program stored thereon, wherein the computer program when executed implements the aforementioned data processing method.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (16)

1. A data processing method, comprising:
acquiring a data detection request, wherein the data detection request comprises a plurality of marked data, and the marked data comprises one or more of text data, voice data and image data;
for each piece of marked data, predicting the marked data by utilizing at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier on the marked data, wherein the prediction result of the marked data comprises information for determining whether the marked data is outlier data, the anomaly detection classifier is used for predicting whether the marked data is outlier data, the input of the anomaly detection classifier is a feature vector of the marked data, and the output is information for determining whether the marked data is outlier data;
And determining whether the marked data is outlier data according to the prediction result of at least one anomaly detection classifier on the marked data.
2. The data processing method according to claim 1, wherein before the predicting, for each piece of marked data, the marked data by utilizing at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier on the marked data, the method further comprises:
for each piece of marked data, performing feature extraction on the marked data by using at least one preset feature extraction algorithm to obtain a feature vector corresponding to the marked data, wherein the at least one preset feature extraction algorithm comprises: a flatten feature extraction algorithm, a color moment feature extraction algorithm, a histogram of oriented gradients feature extraction algorithm, a local binary pattern feature extraction algorithm, and a SIFT feature extraction algorithm;
taking the feature vectors corresponding to the marked data as input variables, performing unsupervised model training by using at least one preset anomaly detection algorithm to obtain the anomaly detection classifier corresponding to each anomaly detection algorithm, wherein the output variables of the anomaly detection classifier are variables for determining whether the marked data are outlier data, and the at least one preset anomaly detection algorithm comprises: an ABOD algorithm, an isolation forest algorithm, an HBOS algorithm, a KNN algorithm, a PCA algorithm, and an MCD algorithm.
3. The data processing method according to claim 2, wherein the performing feature extraction on the labeled data by using at least one preset feature extraction algorithm for each labeled data to obtain a feature vector corresponding to the labeled data includes:
and carrying out feature extraction on the marked data by utilizing a preset feature extraction algorithm aiming at each marked data to obtain a feature vector corresponding to the marked data.
4. The data processing method according to claim 2, wherein the performing, for each marked data, feature extraction on the marked data by using at least one preset feature extraction algorithm to obtain the feature vector corresponding to the marked data comprises:
for each marked data, performing feature extraction on the marked data by using a plurality of preset feature extraction algorithms respectively, to obtain a feature extraction result which corresponds to the marked data and is extracted based on each feature extraction algorithm; and
generating the feature vector corresponding to the marked data according to the plurality of feature extraction results corresponding to the marked data.
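Purely as an illustration of claim 4 (not part of the claims), the sketch below concatenates the outputs of several of the listed feature extraction algorithms into one feature vector. scikit-image is an assumed implementation, SIFT is omitted for brevity, and the color moments are simplified to per-channel mean and standard deviation.

```python
# Illustrative sketch: building one feature vector for a piece of marked image
# data by concatenating the results of several feature extraction algorithms
# (Flatten, color moments, HOG, LBP histogram).
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog, local_binary_pattern

def extract_feature_vector(image):
    """image: H x W x 3 RGB array with values in [0, 1]."""
    gray = rgb2gray(image)

    flatten_feat = gray.ravel()                                   # Flatten

    channels = image.reshape(-1, 3)                               # color moments
    color_moments = np.concatenate([channels.mean(axis=0),        # (mean, std only)
                                    channels.std(axis=0)])

    hog_feat = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                   cells_per_block=(2, 2))                        # HOG

    lbp = local_binary_pattern(gray, P=8, R=1.0, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10),
                               density=True)                      # LBP histogram

    # concatenate the per-algorithm results into a single feature vector
    return np.concatenate([flatten_feat, color_moments, hog_feat, lbp_hist])
```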
5. The data processing method according to claim 1, wherein, in the case where the marked data is predicted by using one pre-trained anomaly detection classifier,
the determining whether the marked data is outlier data according to the prediction result of the at least one anomaly detection classifier on the marked data comprises:
when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is outlier data, determining that the marked data is outlier data; and
when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is non-outlier data, determining that the marked data is non-outlier data.
6. The data processing method according to claim 1, wherein, in the case where the marked data is predicted by using a plurality of pre-trained anomaly detection classifiers,
the determining whether the marked data is outlier data according to the prediction result of the at least one anomaly detection classifier on the marked data comprises:
determining whether the marked data is outlier data, based on a preset voting decision mechanism, according to the prediction results of the plurality of anomaly detection classifiers on the marked data.
7. The data processing method according to claim 6, wherein the determining whether the marked data is outlier data, based on the preset voting decision mechanism, according to the prediction results of the plurality of anomaly detection classifiers on the marked data comprises:
counting, among the prediction results of the plurality of anomaly detection classifiers on the marked data, the number of prediction results comprising information for determining that the marked data is outlier data;
judging whether the ratio of the number to the total number of the prediction results is greater than or equal to a preset threshold; and
when the ratio of the number to the total number of the prediction results is greater than or equal to the preset threshold, determining that the marked data is outlier data.
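The voting decision mechanism of claim 7 reduces to a few lines. The sketch below is illustrative only; it assumes 0/1 predictions per classifier and a default threshold of 0.5, both of which the claims leave as preset parameters.

```python
# Illustrative sketch of the preset voting decision mechanism of claim 7.
# `predictions` holds one 0/1 prediction per anomaly detection classifier for a
# single piece of marked data (1 = classifier predicts outlier data).
def is_outlier_by_vote(predictions, threshold=0.5):
    outlier_votes = sum(1 for p in predictions if p == 1)  # count outlier results
    ratio = outlier_votes / len(predictions)               # ratio to total results
    return ratio >= threshold                              # compare to preset threshold
```

For example, with predictions [1, 0, 1, 1, 0, 1] the ratio is 4/6, which is greater than 0.5, so the marked data would be determined to be outlier data.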
8. A data processing apparatus comprising:
an acquisition module configured to acquire a data detection request, wherein the data detection request comprises a plurality of marked data, and the marked data comprises one or more of text data, voice data and image data;
a prediction module configured to predict, for each marked data, the marked data by using at least one pre-trained anomaly detection classifier to obtain a prediction result of each anomaly detection classifier on the marked data, wherein the prediction result of the marked data comprises information for determining whether the marked data is outlier data, the anomaly detection classifier is configured to predict whether the marked data is outlier data, an input of the anomaly detection classifier is a feature vector of the marked data, and an output of the anomaly detection classifier is the information for determining whether the marked data is outlier data; and
a determining module configured to determine whether the marked data is outlier data according to the prediction result of the at least one anomaly detection classifier on the marked data.
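To make the module split of claim 8 concrete (an illustrative sketch only; the class name, method names and request format are assumptions, not anything defined in the patent), the apparatus could be organised as a small Python class whose methods mirror the acquisition, prediction and determining modules:

```python
# Illustrative sketch of the apparatus of claim 8 as a Python class.
import numpy as np

class DataProcessingApparatus:
    def __init__(self, detectors, vote_threshold=0.5):
        self.detectors = detectors            # pre-trained anomaly detection classifiers
        self.vote_threshold = vote_threshold  # preset threshold for the voting mechanism

    def acquire(self, request):
        """Acquisition module: pull the marked data (here: feature vectors)
        out of a data detection request represented as a dict."""
        return np.asarray(request["marked_data_features"])

    def predict(self, X):
        """Prediction module: one 0/1 prediction per classifier per item."""
        return np.stack([d.predict(X) for d in self.detectors], axis=1)

    def determine(self, predictions):
        """Determining module: vote per marked data item."""
        ratios = predictions.mean(axis=1)
        return ratios >= self.vote_threshold  # True where the item is outlier data

    def handle(self, request):
        X = self.acquire(request)
        return self.determine(self.predict(X))
```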
9. The data processing apparatus according to claim 8, further comprising a training module, wherein the training module comprises a feature extraction sub-module and a training sub-module;
the feature extraction sub-module is configured to perform, for each marked data, feature extraction on the marked data by using at least one preset feature extraction algorithm to obtain a feature vector corresponding to the marked data, wherein the at least one preset feature extraction algorithm comprises: a Flatten feature extraction algorithm, a color moment feature extraction algorithm, a histogram of oriented gradients (HOG) feature extraction algorithm, a local binary pattern (LBP) feature extraction algorithm, and a SIFT feature extraction algorithm; and
the training sub-module is configured to perform unsupervised model training by using at least one preset anomaly detection algorithm, with the feature vectors corresponding to the marked data as input variables, to obtain the anomaly detection classifier corresponding to each anomaly detection algorithm, wherein an output variable of the anomaly detection classifier is a variable for determining whether the marked data is outlier data, and the at least one preset anomaly detection algorithm comprises: ABOD, isolation forest, HBOS, KNN, PCA, and MCD.
10. The data processing apparatus according to claim 9, wherein the feature extraction sub-module is specifically configured to perform, for each marked data, feature extraction on the marked data by using one preset feature extraction algorithm to obtain the feature vector corresponding to the marked data.
11. The data processing apparatus according to claim 9, wherein the feature extraction sub-module is specifically configured to: for each marked data, perform feature extraction on the marked data by using a plurality of preset feature extraction algorithms respectively, to obtain a feature extraction result which corresponds to the marked data and is extracted based on each feature extraction algorithm; and generate the feature vector corresponding to the marked data according to the plurality of feature extraction results corresponding to the marked data.
12. The data processing apparatus according to claim 8, wherein, in the case where the prediction module predicts the marked data by using one pre-trained anomaly detection classifier,
the determining module is specifically configured to: determine that the marked data is outlier data when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is outlier data; and determine that the marked data is non-outlier data when the prediction result of the anomaly detection classifier on the marked data comprises information for determining that the marked data is non-outlier data.
13. The data processing apparatus according to claim 8, wherein, in the case where the prediction module predicts the marked data by using a plurality of pre-trained anomaly detection classifiers,
the determining module is specifically configured to determine whether the marked data is outlier data, based on a preset voting decision mechanism, according to the prediction results of the plurality of anomaly detection classifiers on the marked data.
14. The data processing apparatus according to claim 13, wherein the determining module is specifically configured to: count, among the prediction results of the plurality of anomaly detection classifiers on the marked data, the number of prediction results comprising information for determining that the marked data is outlier data; judge whether the ratio of the number to the total number of the prediction results is greater than or equal to a preset threshold; and when the ratio of the number to the total number of the prediction results is greater than or equal to the preset threshold, determine that the marked data is outlier data.
15. An electronic device, comprising:
one or more processors;
a memory having one or more programs stored thereon, which, when executed by the one or more processors, cause the one or more processors to implement the data processing method according to any one of claims 1-7; and
one or more I/O interfaces, coupled between the one or more processors and the memory, configured to enable information exchange between the one or more processors and the memory.
16. A computer readable medium, on which a computer program is stored, wherein the computer program, when executed, implements the data processing method according to any of claims 1-7.
CN202010093525.XA 2020-02-14 2020-02-14 Data processing method and device, electronic equipment and computer readable medium Active CN111325260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010093525.XA CN111325260B (en) 2020-02-14 2020-02-14 Data processing method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN111325260A (en) 2020-06-23
CN111325260B (en) 2023-10-27

Family

ID=71172655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010093525.XA Active CN111325260B (en) 2020-02-14 2020-02-14 Data processing method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN111325260B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538991B (en) * 2020-07-09 2020-11-03 鹏城实验室 Countermeasure sample detection method, apparatus and computer readable storage medium
CN115943372A (en) * 2021-05-31 2023-04-07 京东方科技集团股份有限公司 Data processing method and device
CN114116688B (en) * 2021-10-14 2024-05-28 北京百度网讯科技有限公司 Data processing and quality inspection method and device and readable storage medium
CN114323145A (en) * 2021-12-31 2022-04-12 华南农业大学 Orchard terrain modeling method and system based on multi-sensor information fusion
CN114615051A (en) * 2022-03-09 2022-06-10 黄河水利职业技术学院 Network security detection method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7296018B2 (en) * 2004-01-02 2007-11-13 International Business Machines Corporation Resource-light method and apparatus for outlier detection

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2374256A1 (en) * 2008-12-31 2011-10-12 Telecom Italia S.p.A. Anomaly detection for packet-based networks
CN107291911A (en) * 2017-06-26 2017-10-24 北京奇艺世纪科技有限公司 A kind of method for detecting abnormality and device
CN107733921A (en) * 2017-11-14 2018-02-23 深圳中兴网信科技有限公司 Network flow abnormal detecting method, device, computer equipment and storage medium
CN109242499A (en) * 2018-09-19 2019-01-18 中国银行股份有限公司 A kind of processing method of transaction risk prediction, apparatus and system
CN109635110A (en) * 2018-11-30 2019-04-16 北京百度网讯科技有限公司 Data processing method, device, equipment and computer readable storage medium
CN110008976A (en) * 2018-12-05 2019-07-12 阿里巴巴集团控股有限公司 A kind of network behavior classification method and device
CN109739904A (en) * 2018-12-30 2019-05-10 北京城市网邻信息技术有限公司 A kind of labeling method of time series, device, equipment and storage medium
CN109934354A (en) * 2019-03-12 2019-06-25 北京信息科技大学 Abnormal deviation data examination method based on Active Learning
CN110059775A (en) * 2019-05-22 2019-07-26 湃方科技(北京)有限责任公司 Rotary-type mechanical equipment method for detecting abnormality and device
CN110245716A (en) * 2019-06-20 2019-09-17 杭州睿琪软件有限公司 Sample labeling auditing method and device
CN110647913A (en) * 2019-08-15 2020-01-03 中国平安财产保险股份有限公司 Abnormal data detection method and device based on clustering algorithm
CN110647937A (en) * 2019-09-23 2020-01-03 北京百度网讯科技有限公司 Method and device for training label model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
New Approaches for Outlier Detection on Imperfect Data Labels using Support Vector Data Description; Rohit U. Pawar et al.; International Journal of Modern Trends in Engineering and Research; Vol. 03, No. 04; 313-321 *
Research on Anomaly Detection Methods Based on Big Data; Yang Xiansheng et al.; Computer Engineering and Science; Vol. 40, No. 07; 1180-1186 *
Outlier Detection for Time Series Data Based on Heteroscedastic Gaussian Processes; Yan Hong et al.; Journal of Computer Applications; Vol. 38, No. 05; 1346-1352 *
Research on Anomaly Detection Algorithms Based on Label Propagation; Zhao Man et al.; Journal of Data Acquisition and Processing; Vol. 34, No. 02; 331-340 *
A Detection Method for Mislabeled Data Based on Sparse Reconstruction Weights; Wu Jingsheng et al.; Computer Engineering and Science; Vol. 39, No. 11; 2115-2121 *
A Survey of Data Annotation Research; Cai Li et al.; Journal of Software; Vol. 31, No. 02; 302-320 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant