WO2021244583A1

WO2021244583A1 - Data cleaning method, apparatus and device, program, and storage medium

Info

Publication number: WO2021244583A1
Application number: PCT/CN2021/097992
Authority: WO
Inventors: 许江浩; 任国焘; 陈杰
Original assignee: 杭州海康威视数字技术股份有限公司
Priority date: 2020-06-03
Filing date: 2021-06-02
Publication date: 2021-12-09
Also published as: CN113762519A

Abstract

A data cleaning method, apparatus and device, a program, and a storage medium. The method comprises: obtaining a data set, the data set comprising a plurality of pieces of initial training data (101); according to feature information of each piece of initial training data in the data set, determining a score value of each piece of initial training data, the score value being used for representing the training effectiveness of the initial training data (102); according to the score value of each piece of initial training data, selecting target training data from the data set (103); and carrying out data cleaning according to the target training data (104). The solution can increase data cleaning efficiency, reduce invalid input of redundant data, and improve the utilization rate of cleaning resources.

Description

Data cleaning method, device, equipment, program and storage medium

Technical field

This application relates to the field of image processing technology, in particular to a data cleaning method, device and equipment, program and storage medium.

Background

Machine learning is a way to realize artificial intelligence. It is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other subjects. Machine learning studies how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning focuses on algorithm design, so that computers can automatically learn rules from data and use those rules to make predictions on unknown data.

Machine learning is widely used, for example in data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, stock market analysis, DNA sequencing, speech and handwriting recognition, strategy games and robotics.

In order to implement machine learning, it is necessary to obtain a large amount of initial training data, perform data cleaning on these initial training data, obtain cleaned training data, and implement machine learning based on the cleaned training data.

However, the above approach requires data cleaning of all the initial training data and cannot filter the initial training data. As a result, training data with a poor training effect also takes part in machine learning, and the learning effect is poor.

Summary of the invention

This application provides a data cleaning method, which includes:

acquiring a data set, the data set including a plurality of initial training data;

determining the score value of each initial training data according to the feature information of each initial training data in the data set, where the score value is used to represent the training effect of the initial training data;

selecting target training data from the data set according to the score value of each initial training data;

performing data cleaning according to the target training data.
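The four steps of the method can be sketched as a small pipeline. This is only an illustrative sketch: the function names `clean_data_set` and `top_k` are hypothetical, selecting the k highest-scoring items is just one assumed selection rule, and the per-item `clean_fn` stands in for whatever cleaning operation is actually applied.

```python
def top_k(k):
    """Return a selection function that keeps the k highest-scoring items (assumed rule)."""
    def select(data_set, scores):
        order = sorted(range(len(data_set)), key=lambda i: scores[i], reverse=True)
        return [data_set[i] for i in order[:k]]
    return select


def clean_data_set(data_set, score_fn, select_fn, clean_fn):
    """Score every item, select the target training data, then clean only the targets."""
    scores = [score_fn(item) for item in data_set]   # determine the score values
    targets = select_fn(data_set, scores)            # select the target training data
    return [clean_fn(item) for item in targets]      # clean only the selected data


# Toy usage: strings stand in for training data, upper-casing stands in for cleaning.
toy_scores = {"a": 1, "b": 3, "c": 2}
cleaned = clean_data_set(["a", "b", "c"], toy_scores.get, top_k(2), str.upper)
```

Only the selected items reach `clean_fn`, which is the point of the method: cleaning effort is spent on the data expected to train well.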

The present application provides a data cleaning device, which includes:

An acquisition module for acquiring a data set, the data set including a plurality of initial training data;

The determining module is configured to determine the score value of each initial training data according to the feature information of each initial training data in the data set, and the score value is used to represent the training effect of the initial training data;

The selection module is used to select target training data from the data set according to the score value of each initial training data;

The cleaning module is used for data cleaning according to the target training data.

The present application provides a data cleaning device, including: a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions that can be executed by the processor;

The processor is used to execute machine executable instructions to implement the following steps:

Perform data cleaning according to the target training data.

The present application provides a computer program, which is stored in a machine-readable storage medium, and when a processor executes the computer program, it causes the processor to implement the method in the above first aspect.

The present application provides a machine-readable storage medium that stores machine-executable instructions. When called and executed by a processor, the machine-executable instructions cause the processor to execute the method of the first aspect.

It can be seen from the above technical solutions that, in the embodiments of this application, the score value of each initial training data is determined according to the feature information of the initial training data, and the score value is used to represent the training effect of the initial training data. Target training data is selected from the initial training data according to the score values, and data cleaning is performed on the target training data rather than on all the initial training data, which improves the efficiency of data cleaning and reduces the invalid input of redundant data. Because the target training data with a good training effect (that is, a high score value) is cleaned, the most effective data is provided for training, so that training data with a better effect takes part in machine learning, the effect of machine learning is better, and the utilization of cleaning resources is improved.

Description of the drawings

In order to explain the embodiments of this application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments of this application. Those of ordinary skill in the art can obtain other drawings based on these drawings.

FIG. 1 is a flowchart of a data cleaning method in an embodiment of the present application;

FIG. 2 is a schematic diagram of an application scenario in an embodiment of the present application;

FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application;

FIG. 4 is a structural diagram of a data cleaning device in an embodiment of the present application;

FIG. 5 is a structural diagram of a data cleaning device in an embodiment of the present application.

Detailed description

The terms used in the embodiments of the present application are only for the purpose of describing specific embodiments, rather than limiting the present application. The singular forms of "a", "said" and "the" used in this application and claims are also intended to include plural forms, unless the context clearly indicates other meanings. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations of one or more associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of this application, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information. In addition, depending on the context, the word "if" as used herein can be interpreted as "when", "upon" or "in response to determining".

Machine learning is a way to realize artificial intelligence. It studies how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. A neural network is one specific implementation of machine learning. This document uses a neural network as an example to introduce the implementation of machine learning; other types of machine learning algorithms are handled similarly.

Exemplarily, the neural network may include, but is not limited to: Convolutional Neural Network (abbreviated as CNN), Recurrent Neural Network (abbreviated as RNN), fully connected network, etc. The structural units of the neural network may include, but are not limited to: Convolutional Layer (Conv), Pooling Layer (Pool), Excitation Layer, Fully Connected Layer (FC), etc., which are not limited.

In the convolutional layer, data features are enhanced by performing a convolution operation with a convolution kernel over a spatial range. The convolution kernel can be a matrix of size m*n; convolving the input of the convolutional layer with the convolution kernel yields the output of the convolutional layer. The convolution operation is in effect a filtering process: the data is convolved with the convolution kernel w(x, y) to obtain multiple convolution features, which are the output of the convolutional layer and can be provided to the pooling layer.
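As a minimal sketch of the operation just described, the following slides an m*n kernel over a two-dimensional input. The function name `conv2d_valid` is hypothetical, and the sketch assumes no padding and a stride of 1. Like most neural network frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(data, kernel):
    """Slide an m*n kernel over a 2-D input (no padding, stride 1, kernel not flipped)."""
    m, n = len(kernel), len(kernel[0])
    rows, cols = len(data), len(data[0])
    output = []
    for i in range(rows - m + 1):
        out_row = []
        for j in range(cols - n + 1):
            # Weighted sum of the m*n window whose top-left corner is (i, j).
            acc = 0
            for u in range(m):
                for v in range(n):
                    acc += data[i + u][j + v] * kernel[u][v]
            out_row.append(acc)
        output.append(out_row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
diagonal_difference = [[1, 0],
                       [0, -1]]
features = conv2d_valid(image, diagonal_difference)
```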

The pooling layer in effect performs down-sampling. By taking the maximum, minimum or average of multiple convolution features (that is, the output of the convolutional layer), the amount of computation can be reduced while feature invariance is maintained. The pooling layer can use the principle of local correlation to sub-sample the data, which reduces the amount of data to be processed while retaining the useful information in the data.
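Pooling can be illustrated with a short sketch. The function name `pool2d` and the choice of non-overlapping square windows are assumptions made for this example; passing `max`, `min` or an averaging function selects among the three reductions mentioned above.

```python
def pool2d(features, size=2, reduce=max):
    """Down-sample a 2-D feature map with non-overlapping size*size windows."""
    rows, cols = len(features), len(features[0])
    output = []
    for i in range(0, rows - size + 1, size):
        out_row = []
        for j in range(0, cols - size + 1, size):
            window = [features[i + u][j + v]
                      for u in range(size) for v in range(size)]
            out_row.append(reduce(window))
        output.append(out_row)
    return output


def average(values):
    return sum(values) / len(values)


feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]
max_pooled = pool2d(feature_map, 2, max)
min_pooled = pool2d(feature_map, 2, min)
avg_pooled = pool2d(feature_map, 2, average)
```

A 4*4 input becomes a 2*2 output, so later layers process a quarter of the values.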

In the excitation layer, an activation function (such as a non-linear function) can be used to map the features output by the pooling layer, thereby introducing non-linearity so that the neural network gains expressive power through non-linear combination. The activation function of the excitation layer can include, but is not limited to, the ReLU (Rectified Linear Units) function. Taking the ReLU function as an example, among all the features output by the pooling layer, features less than 0 are set to 0 and features greater than 0 remain unchanged.
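The ReLU behaviour described above amounts to a one-line mapping. The sketch below is illustrative only and assumes the features are given as a flat list of numbers.

```python
def relu(features):
    """Set features below 0 to 0 and leave features above 0 unchanged."""
    return [x if x > 0 else 0 for x in features]


activated = relu([-2, 0, 3.5, -0.1, 7])
```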

The fully connected layer performs fully connected processing on all the features input to it to obtain a feature vector, which may include multiple features.

In practical applications, one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully connected layers can be combined to construct a neural network according to different needs.

Exemplarily, before a neural network is used for business processing, it needs to be trained. In the training process, a large amount of initial training data can be obtained and cleaned to obtain cleaned training data, and the cleaned training data can be used to train the neural network parameters, such as convolutional layer parameters (for example, convolution kernel parameters), pooling layer parameters, excitation layer parameters and fully connected layer parameters, which is not limited here. Once training is complete, the neural network can be used for business processing: input data is provided to the neural network, the neural network processes the input data using its parameters to obtain output data, and the business processing, such as face detection or vehicle detection, is thereby completed.

In related technologies, data cleaning is performed on all the initial training data, and all the cleaned training data is used to train the neural network. However, this training data may include unusable training data, duplicate training data and training data with a poor training effect. When such training data is provided to the neural network, the training effect of the neural network is poor, that is, the reliability of the neural network is reduced; for example, the accuracy of face detection or vehicle detection drops significantly.

In view of the above findings, in the embodiments of the present application, the score value of each initial training data can be determined. The score value is used to represent the training effect of the initial training data, that is, the higher the score value, the better the training effect. Therefore, the part of the initial training data with high score values can be used as target training data, the target training data can be cleaned, and the cleaned target training data can be used to train the neural network. Because the target training data is training data with high score values, that is, training data with a better training effect, providing it to the neural network gives a better training effect, that is, the reliability of the neural network is improved; for example, the accuracy of face detection or vehicle detection increases.

The technical solutions of the embodiments of the present application will be described below in conjunction with specific embodiments.

Refer to Figure 1, which is a schematic flow diagram of a data cleaning method. The method may include:

Step 101: Obtain a data set, where the data set may include multiple initial training data.

Exemplarily, when training data needs to be used to train the neural network, the training data may be obtained first. For the convenience of distinction, the training data is called initial training data. For example, the initial training data can be obtained from a certain device, or the initial training data input by the user can be received, and there is no restriction on this.

Exemplarily, for a large amount of acquired initial training data, these initial training data may be classified, and each type of initial training data is added to a data set. For example, the initial training data for face detection is added to data set 1, and the initial training data for vehicle detection is added to data set 2, and so on, and there is no restriction on this classification method. In summary, at least one data set can be obtained, and each data set includes multiple initial training data. Since the processing procedure of each data set is the same, the following takes the processing procedure of a data set as an example for description.
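The classification into data sets can be sketched as a simple grouping. The record layout (dictionaries with a hypothetical `task` field) and the function name `build_data_sets` are assumptions made for illustration; the classification method itself is not limited.

```python
from collections import defaultdict


def build_data_sets(samples):
    """Group initial training data by category; each group forms one data set."""
    data_sets = defaultdict(list)
    for sample in samples:
        data_sets[sample["task"]].append(sample)
    return dict(data_sets)


samples = [
    {"id": 1, "task": "face_detection"},
    {"id": 2, "task": "vehicle_detection"},
    {"id": 3, "task": "face_detection"},
]
data_sets = build_data_sets(samples)
```

Each resulting data set can then be processed independently in the same way.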

Step 102: Determine the score value of each initial training data according to the feature information of each initial training data in the data set. The score value is used to indicate the training effect of the initial training data. For example, the higher the score value of the initial training data, the better its training effect, and the lower the score value, the worse its training effect.

The feature information of the initial training data can characterize the training effect of the initial training data. When the feature information indicates that the training effect of the initial training data is good, the score value of the initial training data is higher; when the feature information indicates that the training effect is poor, the score value is lower. In summary, the score value of the initial training data can be determined according to the feature information of the initial training data.

Exemplarily, when the feature information of at least two initial training data is the same, the score values of these initial training data may be the same, and the score values of these initial training data may also be different.

Exemplarily, the following manners may be used to determine the score value of each initial training data:

Manner 1. For each initial training data in the data set, the pre-configured mapping relationship is queried through the characteristic information of the initial training data to obtain the score value of the initial training data.

Exemplarily, for Manner 1, the mapping relationship can be pre-configured. The mapping relationship can include, but is not limited to, the correspondence between feature information and score values, which can be configured based on experience and is not limited here. For example, when feature information a1 indicates that the training effect of the initial training data is good, the score value corresponding to feature information a1 is higher. For another example, when feature information a2 indicates that the training effect of the initial training data is poor, the score value corresponding to feature information a2 is lower.

Refer to Table 1, which is an example of the mapping relationship. The mapping relationship is used to record the corresponding relationship between the feature information and the score value. The score value can adopt a percentage system or other score values, and there is no restriction on this. Table 1 shows the mapping relationship in the form of a table. Of course, other data structures can also be used to represent the mapping relationship, as long as it includes the corresponding relationship between the feature information and the score value, which is not limited.

Table 1

Feature information	Score value
Feature information a1	100
Feature information a2	95
Feature information a3	90
Feature information a4	85
……	……

Exemplarily, for each initial training data in the data set, feature information of the initial training data can be obtained. For example, the initial training data may include feature information. Therefore, the feature information of the initial training data can be obtained directly from the initial training data. For another example, a certain algorithm (such as a deep learning algorithm) can be used to analyze the initial training data to obtain the characteristic information of the initial training data. This analysis process is not limited, as long as the characteristic information of the initial training data can be obtained.

After the feature information of the initial training data is obtained, the mapping relationship shown in Table 1 can be looked up through the feature information of the initial training data to obtain the score value of the initial training data. For example, if the feature information of the initial training data is feature information a3, the score value of the initial training data is 90.

In summary, for each initial training data in the data set, the mapping relationship shown in Table 1 can be inquired from the feature information of the initial training data to obtain the score value of the initial training data.
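Manner 1 is essentially a table lookup. The sketch below uses the example values from Table 1; the dictionary layout, the `id` and `feature` field names, and the default score of 0 for feature information missing from the mapping are assumptions made for this illustration.

```python
# Mapping from feature information to score value, using the Table 1 examples.
SCORE_BY_FEATURE = {"a1": 100, "a2": 95, "a3": 90, "a4": 85}


def score_by_lookup(samples, mapping=SCORE_BY_FEATURE, default=0):
    """Look up each item's score from its feature information (Manner 1)."""
    # `default` covers feature information absent from the mapping (an assumption).
    return {s["id"]: mapping.get(s["feature"], default) for s in samples}


samples = [
    {"id": "d1", "feature": "a3"},
    {"id": "d2", "feature": "a1"},
    {"id": "d3", "feature": "zz"},
]
scores = score_by_lookup(samples)
```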

Manner 2. Sort all the initial training data according to the important priority of the feature information of each initial training data in the data set, and determine the score value of each initial training data according to the sorting result.

Exemplarily, for Manner 2, the important priority of each kind of feature information can be pre-configured based on experience, which is not limited here. For example, when feature information a1 indicates that the training effect of the initial training data is good and feature information a2 indicates that the training effect is poor, the important priority of feature information a1 may be higher than that of feature information a2.

Refer to Table 2, which is an example of the important priority of feature information. The higher the value of the important priority, the greater the important priority. Table 2 expresses important priorities in a tabular manner, and other data structures may also be used to express important priorities, as long as the important priorities of the feature information are included, and there is no restriction on this.

Table 2

Feature information	Important priority
Feature information a1	10
Feature information a2	9
Feature information a3	8
Feature information a4	7
……	……

Exemplarily, for each initial training data in the data set, the characteristic information of the initial training data can be acquired. For the acquisition method, please refer to the foregoing embodiment, and will not be repeated here. After the feature information of the initial training data is obtained, Table 2 can be looked up through the feature information of the initial training data to obtain the important priority of the initial training data. Then, according to the important priority of the feature information of each initial training data, all the initial training data are sorted, and the score value of each initial training data is determined according to the sorting result.

For example, assuming that the important priority of the feature information of initial training data 1 > the important priority of the feature information of initial training data 2 > the important priority of the feature information of initial training data 3, the sorting result is initial training data 1, initial training data 2, initial training data 3. Therefore, the score value of initial training data 1 is greater than that of initial training data 2, and the score value of initial training data 2 is greater than that of initial training data 3. For example, the score value of initial training data 1 is 100, the score value of initial training data 2 is 99, and the score value of initial training data 3 is 98. Of course, these score values are just an example.

In summary, for each initial training data in the data set, by sorting the initial training data, the score value of each initial training data can be determined according to the sorting result.
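Manner 2 can be sketched as a sort followed by rank-based scoring. The priorities come from the Table 2 examples. Starting at 100 and decreasing by 1 per rank mirrors the 100, 99, 98 example above but is otherwise an arbitrary assumption, as are the field names.

```python
# Important priority of each kind of feature information, using the Table 2 examples.
PRIORITY_BY_FEATURE = {"a1": 10, "a2": 9, "a3": 8, "a4": 7}


def score_by_ranking(samples, priorities=PRIORITY_BY_FEATURE, top_score=100, step=1):
    """Sort by important priority (highest first) and assign decreasing scores (Manner 2)."""
    ranked = sorted(samples, key=lambda s: priorities[s["feature"]], reverse=True)
    return {s["id"]: top_score - rank * step for rank, s in enumerate(ranked)}


samples = [
    {"id": "d1", "feature": "a2"},
    {"id": "d2", "feature": "a1"},
    {"id": "d3", "feature": "a3"},
]
scores = score_by_ranking(samples)
```

Because Python's sort is stable, items with the same priority keep their input order and therefore receive different scores here; assigning equal scores to ties would be an equally valid choice.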

Of course, Manner 1 and Manner 2 above are only two examples in this application, which is not limited to them, as long as the score value of the initial training data can be determined according to the feature information of the initial training data.

In a possible implementation manner, the feature information may include, but is not limited to, application scenarios and/or data quality. There is no restriction on the feature information, and all information that can characterize the training effect can be used as the feature information. The application scenario is used to represent the scenario information of the initial training data, such as daytime, night, sunny day, rainy day, etc. Of course, the above are just a few examples of the application scenario, and there is no restriction on this. Data quality is used to indicate the quality information of the initial training data, such as resolution, etc. The higher the resolution, the better the data quality and the clearer the data. Of course, the above are only examples of data quality, and there is no restriction on this.

The following describes the implementation process of application scenarios and/or data quality in combination with specific conditions.

Case 1. If the feature information includes the application scenario, the scene score of each initial training data is determined according to the application scenario of each initial training data in the data set. The scene score is used to represent the training effect of the initial training data; for example, the higher the scene score of the initial training data, the better its training effect, and the lower the scene score, the worse its training effect. Then, the score value of the initial training data is determined according to its scene score; for example, the scene score of the initial training data is used directly as its score value.

The application scenario of the initial training data can characterize the training effect of the initial training data. When the application scenario indicates that the training effect of the initial training data is good, the scene score of the initial training data is higher; when the application scenario indicates that the training effect is poor, the scene score is lower. In summary, the scene score of the initial training data can be determined according to the application scenario of the initial training data. For example, for application scenarios such as night and rainy days, the training effect is better when the initial training data of these application scenarios is used for training, so the scene score of the initial training data is higher. For application scenarios such as daytime and sunny days, the training effect is poorer when the initial training data of these application scenarios is used for training, so the scene score of the initial training data is lower.

Case 2. If the feature information includes the data quality, the quality score of each initial training data is determined according to the data quality of each initial training data in the data set. The quality score is used to represent the training effect of the initial training data; for example, the higher the quality score of the initial training data, the better its training effect, and the lower the quality score, the worse its training effect. Then, the score value of the initial training data is determined according to its quality score; for example, the quality score of the initial training data is used directly as its score value.

The data quality of the initial training data can characterize the training effect of the initial training data. When the data quality indicates that the training effect of the initial training data is good, the quality score of the initial training data is higher; when the data quality indicates that the training effect is poor, the quality score is lower. In summary, the quality score of the initial training data can be determined according to the data quality of the initial training data.

For example, when initial training data with a lower resolution (that is, poorer data quality) is used for training, the training effect is better, so the quality score of the initial training data is higher. When initial training data with a higher resolution is used for training, the training effect is poorer, so the quality score of the initial training data is lower.

Case 3. If the feature information includes both the application scenario and the data quality, the scene score of each initial training data is determined according to the application scenario of each initial training data in the data set, where the scene score is used to represent the training effect of the initial training data; and the quality score of each initial training data is determined according to the data quality of each initial training data in the data set, where the quality score is used to represent the training effect of the initial training data. Then, for each initial training data, the score value of the initial training data is determined according to the scene score and scene weight value of the initial training data, together with its quality score and quality weight value.

Exemplarily, the scene weight value and the quality weight value can be configured based on experience and can be set arbitrarily, which is not limited here. For example, the sum of the scene weight value and the quality weight value can be 1. If the user cares more about the application scenario, the scene weight value is greater than the quality weight value; for example, the scene weight value is 0.7 and the quality weight value is 0.3, or the scene weight value is 0.6 and the quality weight value is 0.4. If the user cares more about data quality, the quality weight value is greater than the scene weight value; for example, the scene weight value is 0.3 and the quality weight value is 0.7, or the scene weight value is 0.4 and the quality weight value is 0.6. In addition, both the scene weight value and the quality weight value can be set to 0.5. Of course, the above are just a few examples of scene weight values and quality weight values.
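One natural way to combine the two scores in Case 3 is a weighted sum. The text above does not fix the exact formula, so the weighted sum, the function name `combined_score` and the example scores of 90 and 80 are assumptions made for illustration.

```python
import math


def combined_score(scene_score, quality_score, scene_weight=0.5, quality_weight=0.5):
    """Weighted sum of scene score and quality score (assumed combination rule)."""
    return scene_score * scene_weight + quality_score * quality_weight


# A user who cares more about the application scenario.
scene_focused = combined_score(90, 80, scene_weight=0.7, quality_weight=0.3)
# A user who cares more about data quality.
quality_focused = combined_score(90, 80, scene_weight=0.3, quality_weight=0.7)
# The results are floats; compare them with a tolerance rather than with ==.
scene_wins = math.isclose(scene_focused, 87.0)
```

With weights that sum to 1, the combined score stays on the same scale as the two input scores.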

In Case 1 and Case 3, the scene score of each initial training data needs to be determined according to the application scenario of each initial training data in the data set. For example, for each initial training data in the data set, a pre-configured mapping relationship (which includes the correspondence between application scenarios and scene scores) is queried using the application scenario of the initial training data to obtain the scene score of the initial training data. For the specific implementation, refer to Manner 1 above, replacing the feature information with the application scenario and the score value with the scene score; details are not repeated here. For another example, all the initial training data are sorted according to the important priority of the application scenario of each initial training data in the data set, and the scene score of each initial training data is determined according to the sorting result. For the specific implementation, refer to Manner 2 above; details are not repeated here.

In Case 2 and Case 3, the quality score of each initial training data needs to be determined according to the data quality of each initial training data in the data set. For example, for each initial training data in the data set, a pre-configured mapping relationship (which includes the correspondence between data quality and quality scores) is queried using the data quality of the initial training data to obtain the quality score of the initial training data. For the specific implementation, refer to Manner 1 above, replacing the feature information with the data quality and the score value with the quality score; details are not repeated here. For another example, all the initial training data are sorted according to the important priority of the data quality of each initial training data in the data set, and the quality score of each initial training data is determined according to the sorting result. For the specific implementation, refer to Manner 2 above; details are not repeated here.

In Case 1 and Case 3, when the application scenarios of at least two initial training data are the same, the scenario scores of these initial training data may be the same or different. In case 2 and case 3, when the data quality of at least two initial training data is the same, the quality scores of these initial training data may be the same or different.
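The mapping-based lookup described above can be sketched as follows. This is a minimal illustration only: the scenario names, quality levels, score values, and the helper name lookup_score are hypothetical assumptions and do not come from this application.

```python
# Hypothetical pre-configured mapping relationships. The keys and scores
# are illustrative assumptions, not values from the application.
SCENARIO_SCORES = {"intersection": 90, "parking_lot": 70, "corridor": 50}
QUALITY_SCORES = {"high": 100, "medium": 60, "low": 20}


def lookup_score(mapping, key, default=0):
    """Query a pre-configured mapping relationship for a score."""
    return mapping.get(key, default)


# Two made-up initial training data records.
samples = [
    {"id": 1, "scenario": "intersection", "quality": "medium"},
    {"id": 2, "scenario": "corridor", "quality": "high"},
]

for sample in samples:
    sample["scenario_score"] = lookup_score(SCENARIO_SCORES, sample["scenario"])
    sample["quality_score"] = lookup_score(QUALITY_SCORES, sample["quality"])
```

In this sketch, two records with the same application scenario always receive the same scenario score; an implementation in which they may differ would need additional inputs beyond the mapping.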

In summary, the score value of each initial training data can be determined according to the characteristic information of each initial training data in the data set. After the score value of each initial training data is obtained, in a possible implementation manner, this score value can be used directly as the final score value of the initial training data. In another possible implementation manner, the score value of the initial training data can also be corrected, and the corrected score value is used as the score value of the initial training data. The correction process of the score value is as follows: determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, keep the score value of one initial training data unchanged, and reduce the score value of the other initial training data.

Exemplarily, the preset similarity threshold can be configured based on experience, and there is no restriction on this. When the similarity between two initial training data is greater than the preset similarity threshold, the two initial training data are very close and are considered the same or similar, that is, they are duplicates of each other.

Exemplarily, regarding the method of determining similarity, the Euclidean distance, the cosine similarity, or the Pearson correlation coefficient can be used to determine the similarity between two initial training data. Of course, the above are just a few examples; there is no restriction on the determination method, and any similarity algorithm can be used.

For example, compare the similarity between the initial training data 1 and the initial training data 2. If the similarity is greater than the preset similarity threshold, keep the score value of the initial training data 1 unchanged and reduce the score value of the initial training data 2, for example, to 0. If the similarity is not greater than the preset similarity threshold, the score values of the initial training data 1 and the initial training data 2 are both kept unchanged. Then, the similarity between the initial training data 1 and the initial training data 3 is compared, and so on, so that the similarity of any two initial training data can be compared.

Exemplarily, when the similarity is greater than the preset similarity threshold, the score value of the initial training data with a high score value may be kept unchanged, or the score value of the initial training data with a low score value may be kept unchanged.

Exemplarily, after the score value of a certain initial training data is reduced, the initial training data does not participate in the subsequent comparison process, that is, the similarity between the initial training data and other initial training data is no longer compared.
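The correction process above can be sketched as follows. This is a minimal sketch under stated assumptions: each initial training data is represented by a hypothetical numeric feature vector, cosine similarity is the chosen measure, the threshold 0.95 is an arbitrary illustrative value, and the later item of a similar pair is the one whose score is reduced to 0.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def correct_scores(features, scores, threshold=0.95):
    """Return corrected scores: for each similar pair, keep the earlier
    item's score and reduce the later item's score to 0."""
    corrected = list(scores)
    reduced = set()  # items already reduced no longer take part in comparisons
    for i in range(len(features)):
        if i in reduced:
            continue
        for j in range(i + 1, len(features)):
            if j in reduced:
                continue
            if cosine_similarity(features[i], features[j]) > threshold:
                corrected[j] = 0
                reduced.add(j)
    return corrected


# Hypothetical feature vectors: items 0 and 1 are near-duplicates.
features = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
scores = [80, 75, 60]
corrected = correct_scores(features, scores)
```

Keeping the higher-scoring item of each pair instead would only require choosing which index to reduce based on the two score values.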

Step 103: Select target training data from the data set according to the score value of each initial training data.

Exemplarily, since the score value is used to represent the training effect of the initial training data, the higher the score value, the better the training effect of the initial training data, and the lower the score value, the worse the training effect. Therefore, based on the score value of each initial training data, the initial training data with high score values can be used as the target training data. In this way, the initial training data with a better training effect is used as the target training data.

Exemplarily, the target training data can be selected from the data set in the following manner:

Method 1: For each initial training data in the data set, if the score value of the initial training data is greater than the preset score threshold, the initial training data can be determined as the target training data.

Exemplarily, the preset score threshold can be configured based on experience, and there is no restriction on this. When the score value is greater than the preset score threshold, it indicates that the training effect of the initial training data is good, and the initial training data can be used as the target training data. When the score value is not greater than the preset score threshold, it indicates that the training effect of the initial training data is poor, and the initial training data does not need to be used as the target training data.

For example, assuming that the score value of the initial training data 1 is greater than the preset score threshold, the initial training data 1 may be determined as the target training data. For another example, assuming that the score value of the initial training data 2 is not greater than the preset score threshold, the initial training data 2 is not determined as the target training data, and so on.
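A minimal sketch of this threshold-based selection is given below. The identifiers, score values, and threshold are hypothetical and chosen only for illustration; note that a score equal to the threshold is not selected, matching the "greater than" condition in the text.

```python
def select_by_threshold(scores, threshold):
    """Return the ids whose score value is strictly greater than the threshold."""
    return [data_id for data_id, score in scores.items() if score > threshold]


# Hypothetical score values for three initial training data.
scores = {"data_1": 85, "data_2": 60, "data_3": 40}
targets = select_by_threshold(scores, threshold=60)
```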

Method 2: Sort all the initial training data according to the score value of each initial training data in the data set, and select multiple initial training data as the target training data according to the sorting result.

For example, based on the score value of each initial training data in the data set, all the initial training data are sorted in the order of the score value from high to low. Based on the ranking result, starting from the initial training data with a high score value, multiple initial training data with the highest ranking are selected as the target training data.

Exemplarily, the data cleaning time interval (indicating that data cleaning is performed in this time interval) may be divided into multiple statistical periods, and the duration of each statistical period is the same. In each statistical period, you can start with the initial training data with a high score, and select multiple initial training data with the highest ranking as the target training data. For example, the ranking result is initial training data 1-initial training data 100. In the first statistical period, initial training data 1-initial training data 10 are selected as the target training data, and in the second statistical period, initial training data 11- The initial training data 20 is used as the target training data, and so on.
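The per-period selection in the example above can be sketched as follows, assuming a fixed batch size of 10 per statistical period and hypothetical score values arranged so that initial training data 1 ranks first and initial training data 100 ranks last.

```python
def ranked_ids(scores):
    """Sort data ids by score value, from high to low."""
    return sorted(scores, key=scores.get, reverse=True)


def batch_for_period(ranking, period_index, batch_size):
    """Return the slice of the ranking assigned to a zero-based period index."""
    start = period_index * batch_size
    return ranking[start:start + batch_size]


# Hypothetical scores: data 1 scores 100, data 2 scores 99, and so on.
scores = {i: 101 - i for i in range(1, 101)}
ranking = ranked_ids(scores)
first_period = batch_for_period(ranking, 0, 10)
second_period = batch_for_period(ranking, 1, 10)
```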

In a possible implementation manner, the number M to be cleaned in the next statistical period may be determined first, and M may be a positive integer, that is, a natural number. In the next statistical cycle, based on the sorting result, starting from the initial training data with a high score value, M initial training data can be selected in turn as the target training data.

The value of M can be configured based on experience, and there is no restriction on this. For example, when all operating nodes together can perform data cleaning on 10 target training data in a statistical period, M can be 10 or slightly greater than 10. Assuming that the target training data to be cleaned are pictures, if the value of M is 0, it can be considered that the number of pictures to be cleaned in the next statistical period is 0, that is, all pictures have been cleaned.

Since the number of operating nodes may change, and the number of target training data that the operating nodes clean in different statistical periods may also change, M can also be determined in the following way: the number M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating nodes, where the cleaning efficiency represents the amount of cleaning completed by the operating nodes (that is, all operating nodes) in the current statistical period.

In summary, the data cleaning time interval can be divided into multiple statistical periods, and the duration of each statistical period is the same. In the first statistical period, 10 initial training data are first selected as the target training data and added to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on it. If the operating node can perform data cleaning on 15 target training data in the first statistical period, then 5 more initial training data need to be selected as target training data within the first statistical period and added to the list to be cleaned; the operating node obtains these target training data from the list to be cleaned and performs data cleaning on them.

Since the operating node performs data cleaning on a total of 15 target training data in the first statistical period, the cleaning efficiency is 15, and the number M to be cleaned in the second statistical period is determined to be 15. In the second statistical period, 15 initial training data are first selected as the target training data and added to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on it. If the operating node can perform data cleaning on only 12 target training data in the second statistical period, there is no need to add new target training data to the list to be cleaned.

Obviously, since the operating node performs data cleaning on a total of 12 target training data in the second statistical period, the cleaning efficiency can be 12, and it is determined that the number M to be cleaned in the third statistical period is 12.

In the third statistical period, 9 initial training data are first selected as the target training data and added to the list to be cleaned. Since 3 target training data still remain in the list to be cleaned, the list then holds a total of 12 target training data. The operating node can obtain target training data from the list to be cleaned and perform data cleaning on the target training data, and so on.
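The three-period example above can be reproduced with a small simulation. This is only a sketch: it tracks counts rather than actual data, it assumes the data set never runs out, the function name simulate is hypothetical, and the capacity of 12 in the third period is an assumed value that the text does not state.

```python
def simulate(initial_m, capacities):
    """Simulate the list to be cleaned over several statistical periods.

    initial_m:  M configured for the first period.
    capacities: number of target training data the operating nodes can
                actually clean in each period.
    """
    pending = 0  # target training data left in the list to be cleaned
    m = initial_m
    log = []
    for capacity in capacities:
        selected = max(m - pending, 0)  # top the list up to M
        pending += selected
        extra = max(capacity - pending, 0)  # nodes finish early: select more
        selected += extra
        pending += extra
        cleaned = min(capacity, pending)
        pending -= cleaned
        log.append({"m": m, "selected": selected,
                    "cleaned": cleaned, "left": pending})
        m = cleaned  # cleaning efficiency becomes M of the next period
    return log


log = simulate(10, [15, 12, 12])
for entry in log:
    print(entry)
```

Under these assumptions, the simulation selects 15 items in total in the first period (10 plus 5 more), leaves 3 items pending after the second period, and selects 9 new items in the third period, as in the text.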

Step 104: Perform data cleaning according to the target training data.

Exemplarily, the target training data and the cleaning parameters may be sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters, which may also be referred to as data labeling.

Exemplarily, the initial training data/target training data may be picture data, audio data, video data, text data, etc., and there is no restriction on the type of the initial training data/target training data.

Exemplarily, performing data cleaning on the target training data refers to performing at least one of operations such as classifying, drawing a frame, annotating, and marking (that is, attaching a label indicating a certain attribute) on the target training data. There is no restriction on the data cleaning method, and all data cleaning methods related to neural networks are applicable.

Exemplarily, the cleaning parameters indicate how to clean the target training data, for example, parameters specifying how to perform classification, how to draw frames, how to annotate, and how to mark. Therefore, the operating node can perform data cleaning on the target training data based on the cleaning parameters.

In a possible implementation manner, the number of operation nodes can be dynamically adjusted according to the number of target training data. For example, with respect to the above method 1, initial training data with a score greater than a preset score threshold may be determined as the target training data. Assuming that there are 48 target training data, and each operation node can complete the data cleaning of 5 target training data, 10 operation nodes need to be deployed. Based on this, in step 104, 48 target training data and cleaning parameters can be sent to 10 operating nodes, so that these operating nodes perform data cleaning on the target training data according to the cleaning parameters.
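The node count in this example follows from rounding up the ratio of target training data to per-node capacity. A minimal sketch, with a hypothetical function name:

```python
import math


def nodes_needed(num_targets, per_node_capacity):
    """Number of operating nodes needed so that every target is cleaned."""
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    return math.ceil(num_targets / per_node_capacity)


required = nodes_needed(48, 5)  # 48 / 5 = 9.6, rounded up to 10
```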

In another possible implementation manner, the amount of target training data can be dynamically adjusted according to the cleaning efficiency of the operating node. For example, for method 2 above, the number M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating node, and M initial training data are selected as the target training data in the next statistical period. For example, when the cleaning efficiency of the operating node is 10, the number M to be cleaned in the next statistical period is determined to be 10. Based on this, in step 104, 10 target training data and the cleaning parameters are sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.

The above technical solutions are described below in combination with specific application scenarios. As shown in FIG. 2, which is a schematic diagram of the application scenario of the embodiment of this application, the control center module 21, the data import module 22, the active learning module 23, and the cleaning control module 24 can be deployed on the same device or on different devices.

In the above application scenario, referring to Figure 3, the data cleaning method may include:

In step 301, the control center module 21 creates a cleaning task. The cleaning task may include a data cleaning time interval (indicating that data cleaning is performed in this time interval), cleaning parameters, and the like.

In step 302, the control center module 21 sends a work instruction to the data import module 22.

In step 303, the data import module 22 obtains a data set, which includes a plurality of initial training data. Exemplarily, after the data import module 22 receives the work instruction, it starts to work. In the working process, initial training data can be obtained from historical data, and/or initial training data can be obtained from real-time data, and there is no restriction on this. Regarding the obtained large amount of initial training data, the data importing module 22 imports the same type of initial training data into the same data set, thereby obtaining at least one data set.

In step 304, the data import module 22 returns a data import success message to the control center module 21. The data import success message indicates that the data import module 22 has completed the data import work, that is, the data set has been obtained, and the data import success message may also carry the amount of initial training data in the data set.

In step 305, the control center module 21 sends a work instruction to the active learning module 23.

In step 306, the active learning module 23 determines the score value of each initial training data according to the characteristic information of each initial training data in the data set. Exemplarily, the active learning module 23 starts to work after receiving the work instruction. In the working process, the data set is obtained from the data import module 22, and the score value of each initial training data is determined according to the characteristic information of each initial training data in the data set.

In a possible implementation, after the active learning module 23 receives the work instruction, it can obtain all the initial training data in the data set from the data import module 22, that is, obtain all the initial training data at one time, and determine the score value of each initial training data according to the characteristic information of each initial training data in the data set. For a specific method, please refer to method 1 or method 2 of step 102, which will not be repeated here.

In another possible implementation manner, after receiving the work instruction, the active learning module 23 may obtain part of the initial training data in the data set from the data import module 22, and determine the score values of this part of the initial training data according to its characteristic information. After the score value determination is completed, another part of the initial training data in the data set is obtained from the data import module 22, and so on, until all the initial training data in the data set has been obtained from the data import module 22 and the score value determination is completed.

For example, the active learning module 23 obtains 10 initial training data from the data import module 22 and, for each initial training data, queries the pre-configured mapping relationship using the feature information of the initial training data to obtain the score value of the initial training data. Then, another 10 initial training data are obtained from the data import module 22, and so on, until the score values of all initial training data are determined.

For another example, the active learning module 23 obtains the initial training data 1-10 from the data import module 22, sorts the initial training data 1-10 according to the important priority of the feature information of the initial training data 1-10, and determines the score values of the initial training data 1-10 according to the sorting result. Then, it obtains the initial training data 11-20 from the data import module 22, sorts the initial training data 1-20 according to the important priority of the feature information of the initial training data 1-20, and determines the score values of the initial training data 1-20 according to the sorting result.

Since the score values of the initial training data 1-10 have been re-determined, the score values of the initial training data 1-10 need to be corrected, that is, the re-determined score values of the initial training data 1-10 are used.

Then, the initial training data 21-30 are obtained from the data import module 22, the initial training data 1-30 are sorted according to the important priority of the feature information of the initial training data 1-30, and the score values of the initial training data 1-30 are determined according to the sorting result. Since the score values of the initial training data 1-20 have been re-determined, the score values of the initial training data 1-20 need to be corrected, that is, the re-determined score values of the initial training data 1-20 are used, and so on, until the score values of all initial training data are determined.

Exemplarily, after the initial training data 11-20 are obtained from the data import module 22, the score values of the initial training data 1-10 need to be corrected for the following reason. When the initial training data 1-10 are sorted according to the important priority of their feature information, suppose the initial training data 5 ranks first, so that its score value is 100. However, when the initial training data 1-20 are sorted according to the important priority of their feature information, the initial training data 5 may no longer rank first. If it ranks sixth, its score value is 95; that is, the score value of the initial training data 5 has changed and therefore needs to be corrected.
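The batch-wise re-scoring above can be sketched as follows. The scoring rule used here (100 for first place, one point less per rank) is an assumption inferred from the single example in the text (first place 100, sixth place 95); the numeric priority values and the function name are likewise hypothetical.

```python
def scores_from_ranking(priorities, top_score=100):
    """Sort ids by priority (high to low) and assign rank-based score values.

    Assumed rule: first place gets top_score, each later rank one point less.
    """
    ranking = sorted(priorities, key=priorities.get, reverse=True)
    return {data_id: top_score - pos for pos, data_id in enumerate(ranking)}


# Hypothetical priorities for initial training data 1-10; data 5 ranks first.
batch_1 = {i: float(i) for i in range(1, 11)}
batch_1[5] = 50.0

# Hypothetical priorities for data 11-20; data 11-15 outrank data 5.
batch_2 = {i: 40.0 + i for i in range(11, 16)}
batch_2.update({i: i / 100 for i in range(16, 21)})

known = dict(batch_1)
scores_after_first = scores_from_ranking(known)

known.update(batch_2)  # the second batch arrives; re-sort data 1-20 together
scores_after_second = scores_from_ranking(known)

print(scores_after_first[5], scores_after_second[5])
```

With these made-up priorities, initial training data 5 is first among data 1-10 and sixth among data 1-20, so its score value is corrected from 100 to 95.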

After the active learning module 23 determines the score value of each initial training data according to the characteristic information of each initial training data in the data set, it can also determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, it keeps the score value of one initial training data unchanged and reduces the score value of the other initial training data. For example, if there are duplicate initial training data, the score value of the first initial training data is kept unchanged, and the score values of the other initial training data are set to 0.

Of course, the active learning module 23 may also perform the similarity comparison process before determining the score values of the initial training data. For example, it first determines the similarity between the initial training data. If the similarity is greater than the preset similarity threshold, one initial training data is kept in the data set, the score values of the remaining initial training data are set to 0, and the remaining initial training data are not kept in the data set. Based on this, the active learning module 23 can determine the score value of each initial training data according to the feature information of each initial training data in the data set (excluding the initial training data whose score value is set to 0).

Exemplarily, the active learning module 23 supports querying initial training data according to conditions, for example, the number of initial training data whose score value is greater than a certain value, the distribution of different score value intervals, and so on.

In step 307, the active learning module 23 sends a scoring completion message to the control center module 21. The scoring completion message indicates that the active learning module 23 has scored all the initial training data.

In step 308, the control center module 21 sends a work instruction to the cleaning control module 24.

In step 309, the cleaning control module 24 determines the quantity M to be cleaned, and sends the quantity M to be cleaned to the active learning module 23. Exemplarily, the cleaning control module 24 starts to work after receiving the work instruction. In the working process, the quantity M to be cleaned is determined first, and the quantity M to be cleaned is sent to the active learning module 23.

Exemplarily, the number M1 to be cleaned in the first statistical period can be configured based on experience. The number M2 to be cleaned in the second statistical period is determined based on the cleaning efficiency of all operating nodes in the first statistical period. The number M3 to be cleaned in the third statistical period is determined based on the cleaning efficiency of all operating nodes in the second statistical period, and so on. In summary, the cleaning control module 24 can determine the quantity M to be cleaned in each statistical period, and send the quantity M to be cleaned to the active learning module 23.

Exemplarily, when the cleaning efficiency of individual operating nodes increases or decreases in a statistical period, and/or the number of operating nodes increases or decreases, the cleaning efficiency of all operating nodes changes, and the quantity M to be cleaned changes accordingly. In this way, the quantity M to be cleaned can be dynamically adjusted.

Exemplarily, the cleaning control module 24 can count the cleaning efficiency of each operating node, that is, the number of target training data completed by the operating node in the current statistical period. Then, the cleaning efficiency of all operating nodes is determined, and the number M to be cleaned is determined based on the cleaning efficiency of all operating nodes.

In step 310, the active learning module 23 sorts all the initial training data according to the score value of each initial training data. Based on the sorting result, starting from the initial training data with the highest score value, it selects the first M initial training data as the target training data and sends the target training data to the cleaning control module 24.

In step 311, the cleaning control module 24 adds the target training data to the list to be cleaned.

For example, in the first statistical period, the active learning module 23 uses M1 initial training data as target training data and sends the M1 target training data to the cleaning control module 24, and the cleaning control module 24 adds the M1 target training data to the list to be cleaned. In the second statistical period, the active learning module 23 uses M2 initial training data as target training data and sends the M2 target training data to the cleaning control module 24, and the cleaning control module 24 adds the M2 target training data to the list to be cleaned, and so on.

In step 312, the cleaning control module 24 sends the target training data to the operating node, so that the operating node performs data cleaning on the target training data. For example, the target training data and cleaning parameters are sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.

For example, when the operating node can process new target training data, it can send a request message to the cleaning control module 24. The request message requests N target training data, indicating that the operating node can perform data cleaning on N target training data, where N can be a positive integer. After receiving the request message, the cleaning control module 24 determines whether there are N target training data in the list to be cleaned. If so, it sends N target training data directly to the operating node. If not, it obtains (N-a) target training data from the active learning module 23, where a denotes the number of target training data already in the list to be cleaned, so that N target training data are available, and sends the N target training data to the operating node.
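The request handling above can be sketched as follows. The class name, method names, and the stub standing in for the active learning module are hypothetical; the sketch only illustrates the rule of fetching the (N-a) missing target training data when the list to be cleaned holds fewer than N.

```python
from collections import deque
from itertools import islice


class CleaningControl:
    """Hypothetical stand-in for the cleaning control module."""

    def __init__(self, fetch_targets):
        self.pending = deque()  # the list to be cleaned
        self.fetch_targets = fetch_targets  # callable(count) -> list of targets

    def handle_request(self, n):
        """Serve a request for n target training data from an operating node."""
        a = len(self.pending)  # targets already in the list to be cleaned
        if a < n:
            self.pending.extend(self.fetch_targets(n - a))
        count = min(n, len(self.pending))  # fewer if the source ran out
        return [self.pending.popleft() for _ in range(count)]


# Stub for the active learning module: hands out ids 1..100 in ranked order.
ranked = iter(range(1, 101))
fetched_counts = []


def fetch_from_active_learning(count):
    fetched_counts.append(count)
    return list(islice(ranked, count))


control = CleaningControl(fetch_from_active_learning)
control.pending.extend(fetch_from_active_learning(3))  # list holds a = 3
batch = control.handle_request(5)  # N = 5, so N - a = 2 more are fetched
```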

Exemplarily, the operating node may also be referred to as a cleaning node. The operating node may be a machine or a human operator. There is no restriction on this, as long as the target training data can be cleaned.

In step 313, the cleaning control module 24 feeds back the task execution status to the control center module 21.

It can be seen from the above technical solutions that in the embodiments of the present application, data cleaning can be performed on the target training data instead of on all initial training data, thereby improving the efficiency of data cleaning and reducing the invalid input of redundant data. In addition, data cleaning can be performed on target training data with good training effects (that is, high score values), and the most effective data can be provided for training, so that training data with better effects participates in machine learning, the effect of machine learning is better, and the utilization of cleaning resources is improved.

For example, suppose there are currently 2 operating nodes available for data cleaning, and each operating node can perform data cleaning on 100 target training data per day. Assuming there are 1000 initial training data, 200 initial training data whose score values are greater than n can be selected from the 1000 initial training data and used as the target training data. Then, 100 target training data can be provided to one operating node, and the remaining 100 target training data can be provided to the other operating node. In this way, the two operating nodes can perform data cleaning on the above 200 target training data.

In the data cleaning process, if an operating node is added, 100 initial training data with high scores can be selected from the remaining 800 initial training data and used as target training data, and these target training data are provided to the newly added operating node.

For another example, there are currently 2 operating nodes available for data cleaning. If, during the data cleaning process, initial training data with high scores start to accumulate, that is, the number of operating nodes is insufficient, the number of operating nodes deployed can be dynamically adjusted based on the quantity of accumulated high-score initial training data.

For another example, if it is determined from the historically accumulated cleaning efficiency that 1000 target training data can be completed in one statistical period, the cleaning control module 24 obtains 1000 target training data from the active learning module 23 and puts them into cleaning. If, during the data cleaning process, the actual cleaning efficiency turns out to be higher and 1100 target training data can be completed in one statistical period, the cleaning control module 24 obtains another 100 target training data from the active learning module 23 and puts them into cleaning. In the next statistical period, 1100 target training data are obtained from the active learning module 23 and put into cleaning. By analogy, the amount of target training data can be dynamically adjusted.

Based on the same application concept as the above method, an embodiment of the present application also proposes a data cleaning device. As shown in FIG. 4, it is a structural diagram of the data cleaning device, and the device includes:

The obtaining module 41 is used to obtain a data set, and the data set includes a plurality of initial training data; the determining module 42 is used to determine the score value of each initial training data according to the characteristic information of each initial training data in the data set, where the score value is used to indicate the training effect of the initial training data; the selection module 43 is used to select target training data from the data set according to the score value of each initial training data; the cleaning module 44 is used to perform data cleaning according to the target training data.

The determining module 42 is specifically configured to: for each initial training data in the data set, query a pre-configured mapping relationship through the feature information of the initial training data to obtain the score value of the initial training data, wherein the mapping relationship includes the corresponding relationship between the feature information and the score value; or,

According to the important priority of the feature information of each initial training data in the data set, all initial training data are sorted, and the score value of each initial training data is determined according to the sorting result.

The characteristic information includes application scenarios and/or data quality, and the determining module 42 is specifically configured to:

If the feature information includes application scenarios and data quality, determine the scenario score of each initial training data according to the application scenario of each initial training data in the data set; according to each initial training data in the data set The data quality of the data, determine the quality score of each initial training data;

For each initial training data, the score value of the initial training data is determined according to the scene score and the scene weight value of the initial training data, as well as the quality score and the quality weight value.

Exemplarily, the determining module 42 is further configured to: after determining the score value of each initial training data according to the characteristic information of each initial training data in the data set, determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, keep the score value of one initial training data unchanged, and reduce the score value of the other initial training data.

The selection module 43 is specifically configured to: for each initial training data, if the score value of the initial training data is greater than a preset score threshold, determine the initial training data as target training data; or, sort all the initial training data according to the score value of each initial training data in the data set, and select multiple initial training data as the target training data according to the sorting result.

When the selection module 43 selects a plurality of initial training data as target training data according to the sorting result, it is specifically used to: determine the quantity M to be cleaned in the next statistical period;

In the next statistical period, based on the sorting result, starting from the initial training data with a high score value, M initial training data are sequentially selected as the target training data.

When the selection module 43 determines the quantity M to be cleaned in the next statistical period, it is specifically used for:

The quantity M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating node, and the cleaning efficiency represents the completed cleaning quantity of the operating node in the current statistical period.
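Exemplarily, the determination of M and the subsequent selection might be sketched as follows. The sketch assumes that M for the next statistical period is simply the total quantity completed by all operating nodes in the current statistical period; this application only states that M is determined according to the cleaning efficiency, so other rules (for example, smoothing over several periods) are equally possible. The node names are hypothetical.

```python
from typing import Dict, List


def next_period_quantity(completed_per_node: Dict[str, int]) -> int:
    """Estimate M as the total quantity cleaned in the current statistical period."""
    return sum(completed_per_node.values())


def select_for_next_period(ranked_ids: List[str], m: int) -> List[str]:
    """Take the first M items from a list already sorted by descending score."""
    return ranked_ids[:m]


# Usage: two hypothetical operating nodes completed 3 and 2 items this period.
m = next_period_quantity({"node-1": 3, "node-2": 2})
targets = select_for_next_period(["d1", "d2", "d3", "d4", "d5", "d6", "d7"], m)
```

Tying M to observed throughput keeps the amount of selected target training data in line with what the operating nodes can actually clean, so low-scoring data is not sent for cleaning merely because it is available.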

The cleaning module 44 is specifically configured to send the target training data and cleaning parameters to an operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
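Exemplarily, the information sent to an operating node could be packaged as a single task message. The message layout, the field names, the JSON encoding, and the example cleaning parameter below are all hypothetical; this application does not prescribe a transport or a message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CleaningTask:
    """Hypothetical task sent to one operating node."""
    node_id: str            # identifier of the operating node (assumed)
    data_ids: List[str]     # identifiers of the target training data
    cleaning_params: dict   # cleaning parameters, content is an assumption


def build_task_message(task: CleaningTask) -> str:
    """Serialize the task as JSON so it can be sent over any transport."""
    return json.dumps(asdict(task), sort_keys=True)


# Usage with made-up values.
message = build_task_message(
    CleaningTask("node-1", ["img_001", "img_002"], {"label_set": ["car", "person"]})
)
```

Sending the cleaning parameters together with the target training data lets the operating node perform the cleaning without a second lookup.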

Based on the same application concept as the above method, an embodiment of this application also proposes a data cleaning device. From the hardware level, a schematic diagram of the hardware architecture of the data cleaning device provided in the embodiment of this application is shown in Figure 5. The data cleaning device may include: a processor 51 and a machine-readable storage medium 52, where the machine-readable storage medium 52 stores machine-executable instructions that can be executed by the processor 51; the processor 51 is used to execute the machine-executable instructions to implement the methods disclosed in the above examples of this application. For example, the processor 51 is used to execute machine-executable instructions to implement the following steps:

Perform data cleaning according to the target training data.

Based on the same application concept as the above method, an embodiment of this application also provides a machine-readable storage medium, wherein a number of computer instructions are stored on the machine-readable storage medium, and when the computer instructions are executed by a processor, the method disclosed in the above examples of this application is implemented.

For example, when the computer instructions are executed by a processor, the following steps can be implemented:

Perform data cleaning according to the target training data.

Exemplarily, the foregoing machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information such as executable instructions and data. For example, the machine-readable storage medium can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard drive), a solid state drive, any type of storage disk (such as a CD or DVD), a similar storage medium, or a combination of them.

The systems, devices, modules, or units explained in the foregoing embodiments may be specifically implemented by computer chips or entities, or implemented by products with certain functions. A typical implementation device is a computer. The specific form of the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For the convenience of description, the above device is described with its functions divided into various units. Of course, when implementing this application, the functions of the units can be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application may adopt the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each process and/or block in the flowchart and/or block diagram, and any combination of processes and/or blocks in the flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

Moreover, these computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

The above descriptions are only examples of this application, and are not intended to limit this application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims

A data cleaning method, characterized in that the method includes:

Acquiring a data set, the data set including a plurality of initial training data;

Determine the score value of each initial training data according to the feature information of each initial training data in the data set, where the score value is used to represent the training effect of the initial training data;

Selecting target training data from the data set according to the score value of each initial training data;

Perform data cleaning according to the target training data.
The method according to claim 1, wherein the determining the score value of each initial training data according to the feature information of each initial training data in the data set comprises:

For each initial training data in the data set, a pre-configured mapping relationship is queried using the feature information of the initial training data to obtain the score value of the initial training data, wherein the mapping relationship includes the correspondence between feature information and score values; or,

According to the importance priority of the feature information of each initial training data in the data set, all initial training data are sorted, and the score value of each initial training data is determined according to the sorting result.
The method according to claim 1 or 2, characterized in that:

The feature information includes application scenarios and/or data quality, and determining the score value of each initial training data according to the feature information of each initial training data in the data set includes:

If the feature information includes application scenarios and data quality, determine the scene score of each initial training data according to the application scenario of each initial training data in the data set; and determine the quality score of each initial training data according to the data quality of each initial training data in the data set;

For each initial training data, the score value of the initial training data is determined according to the scene score and the scene weight value of the initial training data, as well as the quality score and the quality weight value.
The method according to claim 1 or 2, characterized in that:

After determining the score value of each initial training data according to the feature information of each initial training data in the data set, the method further includes:

Determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, keep the score value of one initial training data unchanged, and reduce the score value of the other initial training data.
The method according to claim 1, wherein the selecting target training data from the data set according to the score value of each initial training data comprises:

For each initial training data, if the score value of the initial training data is greater than the preset score threshold, the initial training data is determined as the target training data; or,

According to the score value of each initial training data in the data set, all initial training data are sorted, and multiple initial training data are selected as target training data according to the sorting result.
The method of claim 5, wherein:

The selecting multiple initial training data as target training data according to the sorting result includes:

Determine the quantity M to be cleaned in the next statistical period;

In the next statistical period, based on the sorting result, starting from the initial training data with a high score value, M initial training data are sequentially selected as the target training data, where M is a natural number.
The method of claim 6, wherein:

The determining the quantity M to be cleaned in the next statistical period includes:

The quantity M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating node, and the cleaning efficiency represents the completed cleaning quantity of the operating node in the current statistical period.
The method of claim 1, wherein:

The performing data cleaning according to the target training data includes:

The target training data and the cleaning parameter are sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameter.
A data cleaning device, characterized in that the device includes:

An acquisition module for acquiring a data set, the data set including a plurality of initial training data;

The determining module is configured to determine the score value of each initial training data according to the feature information of each initial training data in the data set, and the score value is used to represent the training effect of the initial training data;

The selection module is used to select target training data from the data set according to the score value of each initial training data;

The cleaning module is used for data cleaning according to the target training data.
A data cleaning device, characterized by comprising: a processor and a machine-readable storage medium, the machine-readable storage medium storing machine executable instructions that can be executed by the processor;

The processor is used to execute machine executable instructions to implement the following steps:

Acquiring a data set, the data set including a plurality of initial training data;

Determine the score value of each initial training data according to the feature information of each initial training data in the data set, where the score value is used to represent the training effect of the initial training data;

Selecting target training data from the data set according to the score value of each initial training data;

Perform data cleaning according to the target training data.
A computer program, which is stored in a machine-readable storage medium and which, when executed by a processor, causes the processor to implement the method according to any one of claims 1-8.
A machine-readable storage medium, wherein the machine-readable storage medium stores machine-executable instructions which, when called and executed by a processor, cause the processor to execute the method according to any one of claims 1-8.