WO2021244583A1 - Data cleaning method, apparatus and device, program, and storage medium - Google Patents
- Publication number
- WO2021244583A1 PCT/CN2021/097992 CN2021097992W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- training data
- data
- initial training
- score value
- initial
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- This application relates to the field of image processing technology, in particular to a data cleaning method, device and equipment, program and storage medium.
- Machine learning is a way to realize artificial intelligence. It is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. Machine learning studies how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning emphasizes algorithm design, so that computers can automatically learn rules from data and use those rules to make predictions on unknown data.
- Machine learning has been widely applied, for example in data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, stock market analysis, DNA sequencing, speech and handwriting recognition, strategy games, and robotics.
- the above method requires data cleaning of all initial training data, and the initial training data cannot be filtered.
- as a result, training data with a poor training effect also takes part in machine learning, and the learning effect suffers.
- This application provides a data cleaning method, which includes:
- the present application provides a data cleaning device, which includes:
- an acquisition module, used to acquire a data set, the data set including a plurality of initial training data;
- a determining module, used to determine the score value of each initial training data according to the feature information of each initial training data in the data set, the score value being used to represent the training effect of the initial training data;
- a selection module, used to select target training data from the data set according to the score value of each initial training data;
- a cleaning module, used to perform data cleaning according to the target training data.
- the present application provides a data cleaning device, including: a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions that can be executed by the processor;
- the processor is used to execute machine executable instructions to implement the following steps:
- the present application provides a computer program, which is stored in a machine-readable storage medium, and when a processor executes the computer program, it causes the processor to implement the method in the above first aspect.
- the present application provides a machine-readable storage medium that stores machine-executable instructions. When called and executed by a processor, the machine-executable instructions cause the processor to execute the method of the first aspect.
- the score value of the initial training data is determined according to the characteristic information of the initial training data.
- the score value is used to represent the training effect of the initial training data. The target training data is selected from the initial training data, and data cleaning is performed on the target training data rather than on all initial training data, which improves the efficiency of data cleaning and reduces wasted effort on redundant data. Target training data with a good training effect (that is, a high score value) can be cleaned to provide the most effective data for training, so that training data with a better effect takes part in machine learning, the effect of machine learning improves, and cleaning resources are better utilized.
- FIG. 1 is a flowchart of a data cleaning method in an embodiment of the present application.
- FIG. 2 is a schematic diagram of an application scenario in an embodiment of the present application.
- FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application.
- FIG. 4 is a structural diagram of a data cleaning device in an embodiment of the present application.
- FIG. 5 is a structural diagram of a data cleaning device in an embodiment of the present application.
- although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
- for example, first information may also be referred to as second information, and similarly, the second information may also be referred to as first information.
- the word "if" as used can be interpreted as "when", "upon", or "in response to determining".
- Machine learning is a way to realize artificial intelligence. It is used to study how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills, and reorganize the existing knowledge structure to continuously improve its own performance.
- A neural network is one specific implementation of machine learning. This document uses a neural network as an example to introduce the implementation of machine learning; other types of machine learning algorithms are handled similarly.
- the neural network may include, but is not limited to: Convolutional Neural Network (abbreviated as CNN), Recurrent Neural Network (abbreviated as RNN), fully connected network, etc.
- the structural units of the neural network may include, but are not limited to: Convolutional Layer (Conv), Pooling Layer (Pool), Excitation Layer, Fully Connected Layer (FC), etc., which are not limited.
- the data features are enhanced by using the convolution kernel to perform the convolution operation.
- the convolution layer uses the convolution kernel to perform the convolution operation in the space range.
- the convolution kernel can be a matrix of m*n size.
- the input of the convolutional layer is convolved with the convolution kernel, and the output of the convolutional layer can be obtained.
- the convolution operation is actually a filtering process.
- the data is convolved with the convolution kernel w(x, y) to obtain multiple convolution features. These convolution features are the output of the convolutional layer and can be provided to the pooling layer.
- the pooling layer actually performs a down-sampling process: by taking the maximum value, minimum value, or average value of multiple convolution features (that is, the output of the convolutional layer), the amount of calculation can be reduced.
- the principle of local correlation can be used to sub-sample the data, which reduces the amount of data to process while retaining the useful information in the data.
- an activation function (such as a non-linear function) can be used to map the output characteristics of the pooling layer, so as to introduce non-linear factors, so that the neural network can enhance the expression ability through non-linear combination.
- the activation function of the excitation layer can include, but is not limited to, the ReLU (Rectified Linear Units) function.
- the ReLU function sets every feature output by the pooling layer that is less than 0 to 0, and leaves features greater than 0 unchanged.
- the fully connected layer is used to perform fully connected processing on all the features input to the fully connected layer, thereby obtaining a feature vector, and the feature vector may include multiple features.
- one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully connected layers can be combined to construct a neural network according to different needs.
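The layer operations described above can be illustrated with a minimal pure-Python sketch. The function names `conv2d`, `max_pool`, and `relu` are hypothetical, the stride-1, no-padding convention is an assumption, and the "convolution" is computed without flipping the kernel, as is common in neural network practice; the source does not specify any of these details.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(data: Matrix, kernel: Matrix) -> Matrix:
    """Slide an m*n kernel over the data (stride 1, no padding, no kernel flip)."""
    m, n = len(kernel), len(kernel[0])
    rows, cols = len(data) - m + 1, len(data[0]) - n + 1
    return [
        [
            sum(data[i + a][j + b] * kernel[a][b] for a in range(m) for b in range(n))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def max_pool(features: Matrix, size: int = 2) -> Matrix:
    """Down-sample with non-overlapping size*size windows, keeping the maximum.

    Trailing rows or columns that do not fill a whole window are dropped.
    """
    return [
        [
            max(features[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(features[0]) - size + 1, size)
        ]
        for i in range(0, len(features) - size + 1, size)
    ]


def relu(features: Matrix) -> Matrix:
    """Set features below 0 to 0 and keep the others unchanged."""
    return [[max(0.0, value) for value in row] for row in features]
```

Following the order given in the text, a layer stack could be applied as `relu(max_pool(conv2d(data, kernel)))`; replacing `max` with `min` or an average inside `max_pool` gives the other pooling variants mentioned.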
- the neural network needs to be trained first.
- a large amount of initial training data can be obtained and cleaned to obtain cleaned training data, and the cleaned training data can be used to train the parameters of the neural network, such as convolutional layer parameters (for example, convolution kernel parameters), pooling layer parameters, excitation layer parameters, and fully connected layer parameters; there is no restriction on this.
- the neural network can be used for business processing. For example, input data is provided to the neural network, which processes the input data using the trained neural network parameters to obtain output data, thereby completing business processing such as face detection or vehicle detection.
- the score value of each initial training data can be determined, and the score value is used to represent the training effect of the initial training data, that is, the higher the score value, the better the training effect of the training data. Therefore, part of the initial training data with high scores can be used as target training data, the target training data can be cleaned, and the cleaned target training data can be used to train the neural network.
- since the target training data is training data with high scores, that is, training data with a better training effect, the training effect of the neural network will be better. In other words, the reliability of the neural network improves; for example, the accuracy of face detection and vehicle detection increases.
- the method may include:
- Step 101: Obtain a data set, where the data set may include multiple initial training data.
- when training data is needed to train the neural network, the training data may be obtained first.
- the training data is called initial training data.
- the initial training data can be obtained from a certain device, or the initial training data input by the user can be received, and there is no restriction on this.
- these initial training data may be classified, and each type of initial training data is added to a data set.
- the initial training data for face detection is added to data set 1
- the initial training data for vehicle detection is added to data set 2, and so on, and there is no restriction on this classification method.
- at least one data set can be obtained, and each data set includes multiple initial training data. Since the processing procedure of each data set is the same, the following takes the processing procedure of a data set as an example for description.
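The classification of initial training data into data sets in Step 101 can be sketched as follows. The function name `build_data_sets` and the representation of each sample as a `(category, sample_id)` pair are assumptions made for illustration; the source places no restriction on the classification method.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_data_sets(samples: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (category, sample_id) pairs into one data set per category."""
    data_sets: Dict[str, List[str]] = defaultdict(list)
    for category, sample_id in samples:
        data_sets[category].append(sample_id)
    return dict(data_sets)


# Hypothetical example: face detection data goes to one data set,
# vehicle detection data to another.
example = build_data_sets(
    [("face", "a.jpg"), ("vehicle", "b.jpg"), ("face", "c.jpg")]
)
```

Each resulting data set can then be processed independently by the subsequent steps.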
- Step 102: Determine the score value of each initial training data according to the feature information of each initial training data in the data set.
- the score value is used to indicate the training effect of the initial training data. For example, a higher score value means that the training effect of the initial training data is better, and a lower score value means that the training effect of the initial training data is worse.
- the feature information of the initial training data can characterize the training effect of the initial training data.
- when the feature information indicates that the training effect of the initial training data is better, the score value of the initial training data is higher; when the feature information indicates that the training effect is worse, the score value of the initial training data is lower.
- the score value of the initial training data can be determined according to the characteristic information of the initial training data.
- the score values of these initial training data may be the same, and the score values of these initial training data may also be different.
- the following method may be used to determine the score value of each initial training data:
- the mapping relationship can be pre-configured, and the mapping relationship can include but is not limited to the correspondence relationship between feature information and score value.
- the correspondence between feature information and score value can be configured based on experience, and there is no limitation on this. For example, when the feature information a1 indicates a better training effect of the initial training data, the score value corresponding to the feature information a1 is higher. As another example, when the feature information a2 indicates that the training effect of the initial training data is poor, the score value corresponding to the feature information a2 is lower.
- mapping relationship is used to record the corresponding relationship between the feature information and the score value.
- the score value can adopt a percentage system or other score values, and there is no restriction on this.
- Table 1 shows the mapping relationship in the form of a table.
- other data structures can also be used to represent the mapping relationship, as long as it includes the corresponding relationship between the feature information and the score value, which is not limited.
- feature information of the initial training data can be obtained.
- the initial training data may include feature information. Therefore, the feature information of the initial training data can be obtained directly from the initial training data.
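- alternatively, a certain algorithm (such as a deep learning algorithm) can be used to analyze the initial training data and obtain its feature information.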
- This analysis process is not limited, as long as the characteristic information of the initial training data can be obtained.
- the mapping relationship shown in Table 1 can be looked up through the feature information of the initial training data to obtain the score value of the initial training data. For example, if the feature information of the initial training data is feature information a3, the score value of the initial training data is 90.
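The lookup of a score value from a pre-configured mapping relationship can be sketched as a dictionary query. The dictionary below is a hypothetical stand-in for Table 1: only the entry mapping feature information a3 to 90 comes from the text, and the other values, as well as the name `score_from_mapping`, are assumptions.

```python
from typing import Dict, Optional

# Hypothetical stand-in for Table 1; only "a3" -> 90 is taken from the text.
SCORE_BY_FEATURE: Dict[str, int] = {"a1": 100, "a2": 40, "a3": 90}


def score_from_mapping(
    feature: str, mapping: Dict[str, int] = SCORE_BY_FEATURE
) -> Optional[int]:
    """Return the configured score for the feature information, or None if absent."""
    return mapping.get(feature)
```

Returning `None` for unknown feature information is one possible design choice; the source does not say how unmapped feature information should be handled.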
- Manner 2: Sort all the initial training data according to the importance priority of the feature information of each initial training data in the data set, and determine the score value of each initial training data according to the sorting result.
- the importance priority of the feature information can be pre-configured based on experience, and there is no restriction on it. For example, when the feature information a1 indicates that the training effect of the initial training data is good, and the feature information a2 indicates that the training effect is poor, the importance priority of the feature information a1 may be greater than that of the feature information a2.
- Table 2 is an example of the importance priority of feature information. The higher the value, the greater the importance priority. Table 2 expresses the importance priorities in tabular form; other data structures may also be used, as long as they include the importance priority of the feature information, and there is no restriction on this.
- the feature information of the initial training data can be acquired.
- Table 2 can be looked up using the feature information of the initial training data to obtain the importance priority of that feature information. Then all the initial training data are sorted according to the importance priority of the feature information of each initial training data, and the score value of each initial training data is determined according to the sorting result.
- the ranking result is initial training data 1, initial training data 2, and initial training data 3. Therefore, the score value of the initial training data 1 is greater than the score value of the initial training data 2, and the score value of the initial training data 2 is greater than the score value of the initial training data 3.
- for example, the score value of the initial training data 1 is 100, the score value of the initial training data 2 is 99, and the score value of the initial training data 3 is 98.
- the above score value is just an example.
- the score value of each initial training data can be determined according to the sorting result.
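The ranking-based scoring of Manner 2 can be sketched as follows. The priority values, the name `scores_from_priority`, and the rule of subtracting the rank from a top score of 100 are assumptions chosen to reproduce the 100, 99, 98 example in the text; the source only requires that a higher rank yields a higher score value.

```python
from typing import Dict


def scores_from_priority(
    feature_of: Dict[str, str],
    priority: Dict[str, int],
    top_score: int = 100,
) -> Dict[str, int]:
    """Rank samples by the priority of their feature information, highest first,
    then assign top_score, top_score - 1, ... in rank order.

    sorted() is stable, so samples with equal priority keep their input order.
    """
    ranked = sorted(feature_of, key=lambda s: priority[feature_of[s]], reverse=True)
    return {sample: top_score - rank for rank, sample in enumerate(ranked)}


# Hypothetical priorities standing in for Table 2.
example_scores = scores_from_priority(
    {"data1": "a1", "data2": "a3", "data3": "a2"},
    {"a1": 3, "a3": 2, "a2": 1},
)
```

With this rule, samples that share a priority still receive different scores; a variant that gives equal scores to equal priorities would be equally consistent with the text.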
- the feature information may include, but is not limited to, application scenarios and/or data quality.
- the application scenario is used to represent the scenario information of the initial training data, such as daytime, night, sunny day, rainy day, etc.
- Data quality is used to indicate the quality information of the initial training data, such as resolution, etc. The higher the resolution, the better the data quality and the clearer the data.
- the above are only examples of data quality, and there is no restriction on this.
- Case 1: If the feature information includes application scenarios, determine the scene score of each initial training data according to the application scenario of each initial training data in the data set. The scene score is used to represent the training effect of the initial training data: the higher the scene score, the better the training effect, and the lower the scene score, the worse the training effect. Then, the score value of the initial training data is determined according to its scene score; for example, the scene score is directly used as the score value of the initial training data.
- the application scenario of the initial training data can characterize the training effect of the initial training data.
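- when the application scenario indicates that the training effect of the initial training data is better, the scene score of the initial training data is higher; when the application scenario indicates that the training effect is worse, the scene score of the initial training data is lower.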
- the scene score of the initial training data can be determined according to the application scene of the initial training data. For example, for nights, rainy days, etc., when the initial training data of these application scenarios is used for training, the training effect is better, and the initial training data has a higher scene score. For daytime, sunny days, etc., when the initial training data of these application scenarios is used for training, the training effect is poor, and the scene score of the initial training data is low.
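- Case 2: If the feature information includes data quality, determine the quality score of each initial training data according to the data quality of each initial training data in the data set; the quality score is used to represent the training effect of the initial training data.
- then, the score value of the initial training data is determined according to the quality score of the initial training data; for example, the quality score is directly used as the score value of the initial training data.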
- the data quality of the initial training data can characterize the training effect of the initial training data.
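- when the data quality indicates that the training effect of the initial training data is better, the quality score of the initial training data is higher; when the data quality indicates that the training effect is worse, the quality score of the initial training data is lower.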
- the quality score of the initial training data can be determined according to the data quality of the initial training data.
- the training effect is better, and the quality score of the initial training data is higher.
- the training effect is poor, and the quality score of the initial training data is lower.
- Case 3: If the feature information includes application scenarios and data quality, determine the scene score of each initial training data according to the application scenario of each initial training data in the data set; the scene score is used to represent the training effect of the initial training data. In addition, determine the quality score of each initial training data according to the data quality of each initial training data in the data set; the quality score is used to represent the training effect of the initial training data. Then, for each initial training data, the score value is determined according to its scene score and scene weight value, together with its quality score and quality weight value.
- the scene weight value and the quality weight value can be configured according to experience; there is no restriction on them.
- the sum of the scene weight value and the quality weight value can be 1. If the user cares more about the application scenario, the scene weight value is greater than the quality weight value; for example, the scene weight value is 0.7 and the quality weight value is 0.3, or the scene weight value is 0.6 and the quality weight value is 0.4. If the user cares more about data quality, the quality weight value is greater than the scene weight value; for example, the scene weight value is 0.3 and the quality weight value is 0.7, or the scene weight value is 0.4 and the quality weight value is 0.6. Both weight values can also be set to 0.5. Of course, the above are just a few examples of scene weight values and quality weight values.
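One natural reading of Case 3 is a weighted sum of the two scores, sketched here. The text only says that the score value is determined from the two scores and their weight values, so the weighted-sum formula and the name `combined_score` are assumptions.

```python
def combined_score(
    scene_score: float,
    quality_score: float,
    scene_weight: float = 0.5,
    quality_weight: float = 0.5,
) -> float:
    """Weighted sum of the scene score and the quality score.

    The weights are not forced to sum to 1, because the text only says
    that their sum "can be" 1.
    """
    return scene_score * scene_weight + quality_score * quality_weight


# A user who cares more about the application scenario (weights 0.7 / 0.3).
scene_focused = combined_score(80, 60, scene_weight=0.7, quality_weight=0.3)
```

With equal weights of 0.5 the score value is simply the average of the scene score and the quality score.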
- in Case 1 and Case 3, it is necessary to determine the scene score of each initial training data according to the application scenario of each initial training data in the data set.
- the pre-configured mapping relationship is queried through the application scenario of the initial training data (the mapping relationship includes the corresponding relationship between the application scenario and the scenario score), and the scenario score of the initial training data is obtained.
- for the specific implementation, refer to Manner 1 above, replacing the feature information with the application scenario and the score value with the scene score; details are not repeated here.
- for the specific implementation, refer to Manner 2 above; details are not repeated here.
- the data quality of the initial training data is used to query the pre-configured mapping relationship (the mapping relationship includes the corresponding relationship between the data quality and the quality score), and the quality score of the initial training data is obtained.
- for the specific implementation, refer to Manner 1 above, replacing the feature information with the data quality and the score value with the quality score; details are not repeated here.
- all initial training data are sorted, and the quality score of each initial training data is determined according to the sorting result.
- for the specific implementation, refer to Manner 2 above; details are not repeated here.
- in Case 1 and Case 3, when the application scenarios of at least two initial training data are the same, the scene scores of these initial training data may be the same or different. In Case 2 and Case 3, when the data quality of at least two initial training data is the same, the quality scores of these initial training data may be the same or different.
- the score value of each initial training data can be determined according to the feature information of each initial training data in the data set. After the score value of each initial training data is obtained, in one possible implementation, it can be used directly as the score value of the initial training data. In another possible implementation, the score value can be corrected, and the corrected score value is used as the score value of the initial training data. The correction process is as follows: determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, keep the score value of one initial training data unchanged and reduce the score value of the other initial training data.
- the preset similarity threshold can be configured based on experience, and there is no restriction on this.
- if the similarity between the two initial training data is greater than the preset similarity threshold, the two initial training data are very close and are considered to be the same or similar, that is, duplicates of each other.
- Euclidean distance can be used to determine the similarity between two initial training data
- cosine similarity can be used to determine the similarity between two initial training data
- the Pearson correlation coefficient can be used to determine the similarity between two initial training data.
- the score value of the initial training data with a high score value may be kept unchanged, or the score value of the initial training data with a low score value may be kept unchanged.
- the initial training data does not participate in the subsequent comparison process, that is, the similarity between the initial training data and other initial training data is no longer compared.
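The score correction for near-duplicate data can be sketched with cosine similarity, one of the measures named above. Several details are assumptions not fixed by the source: each initial training data is represented by a numeric feature vector, the higher-scored item of a similar pair is the one kept unchanged, the reduction is a fixed `penalty`, and the item whose score was reduced is the one excluded from later comparisons. The names `cosine_similarity` and `correct_scores` are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correct_scores(
    vectors: Dict[str, Sequence[float]],
    scores: Dict[str, float],
    threshold: float = 0.95,
    penalty: float = 10.0,
) -> Dict[str, float]:
    """Reduce the score of one item in every pair whose similarity exceeds threshold."""
    corrected = dict(scores)
    reduced = set()  # items already penalized; they skip later comparisons
    ids = list(vectors)
    for i, first in enumerate(ids):
        if first in reduced:
            continue
        for second in ids[i + 1:]:
            if second in reduced:
                continue
            if cosine_similarity(vectors[first], vectors[second]) > threshold:
                # Assumption: keep the higher-scored item unchanged.
                loser = second if corrected[first] >= corrected[second] else first
                corrected[loser] -= penalty
                reduced.add(loser)
                if loser == first:
                    break
    return corrected
```

Euclidean distance or the Pearson correlation coefficient could be substituted for `cosine_similarity`, with the threshold comparison adapted accordingly (for a distance, a smaller value means more similar).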
- Step 103: Select target training data from the data set according to the score value of each initial training data.
- the score value is used to represent the training effect of the initial training data: the higher the score value, the better the training effect, and the lower the score value, the worse the training effect. Therefore, based on the score value of each initial training data, the initial training data with a high score value can be used as the target training data. In this way, the initial training data with a better training effect can be used as the target training data.
- the target training data can be selected from the data set in the following manner:
- the preset score threshold can be configured based on experience, and there is no restriction on this.
- if the score value is greater than the preset score threshold, the training effect of the initial training data is good, and the initial training data can be used as the target training data.
- if the score value is not greater than the preset score threshold, the training effect of the initial training data is poor, and the initial training data does not need to be used as the target training data.
- the initial training data 1 may be determined as the target training data.
- the initial training data 2 is not determined as the target training data, and so on.
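The threshold-based selection described above can be sketched as a simple filter. The name `select_by_threshold` and the example scores are hypothetical; the strict "greater than" comparison follows the text.

```python
from typing import Dict, List


def select_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep the samples whose score value is strictly greater than the threshold."""
    return [sample for sample, score in scores.items() if score > threshold]


# Hypothetical scores with a preset score threshold of 80.
targets = select_by_threshold({"data1": 95, "data2": 60, "data3": 80}, 80)
```

A sample whose score equals the threshold is not selected, because the text treats "not greater than" as a poor training effect.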
- Method 2: Sort all the initial training data according to the score value of each initial training data in the data set, and select multiple initial training data as the target training data according to the sorting result.
- all the initial training data are sorted in the order of the score value from high to low. Based on the ranking result, starting from the initial training data with a high score value, multiple initial training data with the highest ranking are selected as the target training data.
- the data cleaning time interval (indicating that data cleaning is performed in this time interval) may be divided into multiple statistical periods, and the duration of each statistical period is the same.
- the ranking result is initial training data 1-initial training data 100.
- initial training data 1 to initial training data 10 are selected as the target training data
- initial training data 11 to initial training data 20 are selected as the target training data, and so on.
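The sort-then-select behaviour of Method 2 can be sketched as below: the data are ranked once by score, and each statistical period takes the next slice of the ranking. The function name `batches_by_rank` and the synthetic scores are assumptions made for illustration only.

```python
def batches_by_rank(scores, batch_size):
    """Yield successive batches of ids, highest score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]


# Hypothetical scores for 100 items, arranged so that item 1 scores highest.
scores = {i: 101 - i for i in range(1, 101)}
periods = list(batches_by_rank(scores, batch_size=10))
```

With these inputs the first period receives items 1 to 10 and the second receives items 11 to 20, as in the example above.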
- the number M to be cleaned in the next statistical period may be determined first, and M may be a natural number (0 is allowed, as in the example below).
- M initial training data can be selected in turn as the target training data.
- M can be configured based on experience, and there is no restriction on this. For example, when all operating nodes can perform data cleaning on 10 target training data in a statistical period, M can be 10 or slightly greater than 10. Assuming that the target training data to be cleaned are several pictures, if the value of M is 0, it can be considered that the number of pictures to be cleaned in the next statistical period is 0, and all pictures have been cleaned.
- M can also be determined in the following way: determine the number M to be cleaned in the next statistical period according to the cleaning efficiency of the operating nodes, where the cleaning efficiency represents the completed cleaning amount of the operating nodes (that is, all operating nodes) in the current statistical period.
- the data cleaning time interval can be divided into multiple statistical periods, and the duration of each statistical period is the same.
- first select 10 initial training data as the target training data and add these 10 target training data to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on the target training data.
- If the operating node can perform data cleaning on 15 target training data in the first statistical period, then in the first statistical period it is also necessary to select 5 more initial training data as the target training data and add these 5 target training data to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on the target training data.
- the cleaning efficiency can be 15, and it is determined that the number M to be cleaned in the second statistical period is 15.
- first select 15 initial training data as the target training data and add these 15 target training data to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on the target training data.
- If the operating node can perform data cleaning on only 12 target training data in the second statistical period, there is no need to add new target training data to the list to be cleaned.
- the cleaning efficiency can be 12, and it is determined that the number M to be cleaned in the third statistical period is 12.
- the operating node can obtain target training data from the list to be cleaned, and perform data cleaning on the target training data, and so on.
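The period-by-period adjustment of M described above can be simulated with a short sketch. It assumes a single aggregate capacity per period and a to-be-cleaned list that carries over between periods; `run_period`, `pending` and `cursor` are hypothetical names, not terms from the source.

```python
def run_period(ranked, cursor, pending, m, capacity):
    """Simulate one statistical period and return (new_cursor, completed_count).

    ranked:   ids sorted by score, highest first
    cursor:   index of the next not-yet-selected id in `ranked`
    pending:  the list to be cleaned (mutated in place)
    m:        number to be cleaned that was decided for this period
    capacity: how many items the operating nodes can actually clean
    """
    # Select M items at the start of the period.
    selected = ranked[cursor:cursor + m]
    pending.extend(selected)
    cursor += len(selected)
    # If the nodes can clean more than is pending, top the list up.
    shortfall = capacity - len(pending)
    if shortfall > 0:
        extra = ranked[cursor:cursor + shortfall]
        pending.extend(extra)
        cursor += len(extra)
    # The nodes clean as much as they can; the rest stays pending.
    done = min(capacity, len(pending))
    del pending[:done]
    return cursor, done


ranked = list(range(1, 101))   # hypothetical ranking of 100 items
pending, cursor, m = [], 0, 10  # M for the first period is configured by experience
history = []
for capacity in (15, 12):      # capacities taken from the example above
    cursor, done = run_period(ranked, cursor, pending, m, capacity)
    history.append((m, done))
    m = done                   # cleaning efficiency becomes the next period's M
```

After the two periods, `history` records M = 10 with 15 completed, then M = 15 with 12 completed, and M for the third period is 12. In this sketch the three items left uncleaned in the second period remain in `pending`.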
- Step 104 Perform data cleaning according to the target training data.
- the target training data and the cleaning parameters may be sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters, which may also be referred to as data labeling.
- the initial training data/target training data may be picture data, audio data, video data, text data, etc., and there is no restriction on the type of the initial training data/target training data.
- performing data cleaning on the target training data refers to at least one of operations such as classifying, drawing a frame, annotating, and marking (that is, a label indicating a certain attribute) on the target training data.
- the method of cleaning this data is not restricted, and all data cleaning methods related to neural networks are applicable.
- the cleaning parameter indicates how to clean the target training data, for example, parameters for how to classify, how to draw a frame, how to annotate, how to mark, etc. Therefore, the operating node can perform data cleaning on the target training data based on the cleaning parameter.
- the number of operation nodes can be dynamically adjusted according to the number of target training data. For example, with respect to the above method 1, initial training data with a score greater than a preset score threshold may be determined as the target training data. Assuming that there are 48 target training data, and each operation node can complete the data cleaning of 5 target training data, 10 operation nodes need to be deployed. Based on this, in step 104, 48 target training data and cleaning parameters can be sent to 10 operating nodes, so that these operating nodes perform data cleaning on the target training data according to the cleaning parameters.
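The node count in this example follows from a ceiling division: 48 items at 5 items per node require 10 nodes. The sketch below also shows one possible way to split the items among nodes; the contiguous chunking and both function names are assumptions, since the source does not say how items are distributed.

```python
import math


def nodes_needed(num_targets, per_node_capacity):
    """Smallest number of operating nodes that can cover all target training data."""
    return math.ceil(num_targets / per_node_capacity)


def assign_to_nodes(targets, per_node_capacity):
    """Split targets into contiguous chunks, one chunk per operating node."""
    return [targets[i:i + per_node_capacity]
            for i in range(0, len(targets), per_node_capacity)]


node_count = nodes_needed(48, 5)
chunks = assign_to_nodes(list(range(48)), 5)
```

The last node receives only 3 items, which is why 9 nodes would not be enough.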
- the amount of target training data can be dynamically adjusted according to the cleaning efficiency of the operating node. For example, for the second method above, the number M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating node, and M initial training data are selected as the target training data in the next statistical period. For example, when the cleaning efficiency of the operating node is 10, the number M to be cleaned in the next statistical period is determined to be 10. Based on this, in step 104, 10 target training data and cleaning parameters are sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
- the score value of the initial training data is determined according to the characteristic information of the initial training data.
- the score value is used to represent the training effect of the initial training data. The target training data is selected from the initial training data, and data cleaning is performed on the target training data instead of on all initial training data, thereby improving the efficiency of data cleaning and reducing the invalid input of redundant data. Target training data with a good training effect (that is, a high score value) can be cleaned to provide the most effective data for training, so that training data with a better effect participates in machine learning, the effect of machine learning is better, and the utilization of cleaning resources is improved.
- FIG. 2 is a schematic diagram of the application scenario of the embodiment of this application
- the control center module 21, the data import module 22, the active learning module 23, and the cleaning control module 24 can be deployed on the same device or on different devices.
- the data cleaning method may include:
- the control center module 21 creates a cleaning task.
- the cleaning task may include a data cleaning time interval (indicating that data cleaning is performed in this time interval), cleaning parameters, and the like.
- step 302 the control center module 21 sends a work instruction to the data import module 22.
- the data import module 22 obtains a data set, which includes a plurality of initial training data.
- initial training data can be obtained from historical data, and/or initial training data can be obtained from real-time data, and there is no restriction on this.
- the data importing module 22 imports the same type of initial training data into the same data set, thereby obtaining at least one data set.
- step 304 the data import module 22 returns a data import success message to the control center module 21.
- the data import success message indicates that the data import module 22 has completed the data import work, that is, the data set has been obtained, and the data import success message may also carry the amount of initial training data in the data set.
- step 305 the control center module 21 sends a work instruction to the active learning module 23.
- the active learning module 23 determines the score value of each initial training data according to the characteristic information of each initial training data in the data set.
- the active learning module 23 starts to work after receiving the work instruction.
- the data set is obtained from the data import module 22, and the score value of each initial training data is determined according to the characteristic information of each initial training data in the data set.
- the active learning module 23 can obtain all the initial training data in the data set from the data import module 22, that is, obtain all the initial training data at one time, and determine the score value of each initial training data according to the characteristic information of each initial training data in the data set.
- for the specific method, please refer to method 1 or method 2 of step 102, which will not be repeated here.
- the active learning module 23 may obtain part of the initial training data in the data set from the data import module 22, and determine the score value of this part of the initial training data according to the characteristic information of this part of the initial training data. After the score value determination is completed, another part of the initial training data in the data set is obtained from the data import module 22, and so on, until all the initial training data in the data set has been obtained from the data import module 22 and the score value determination is completed.
- the active learning module 23 obtains 10 initial training data from the data importing module 22, and for each initial training data, query the pre-configured mapping relationship through the feature information of the initial training data to obtain the score value of the initial training data. Then, 10 pieces of initial training data are obtained from the data import module 22, and so on, until the score values of all initial training data are determined.
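Batch-wise scoring through the pre-configured mapping relationship can be sketched as below. The mapping values come from the feature-information table in this document (a1 to 100, a2 to 95, a3 to 90, a4 to 85); the function name, the `(id, feature)` tuple layout, and the default score of 0 for unknown feature information are assumptions.

```python
# Mapping relationship between feature information and score value,
# taken from the table in this document.
SCORE_BY_FEATURE = {"a1": 100, "a2": 95, "a3": 90, "a4": 85}


def score_in_batches(items, batch_size=10, default=0):
    """Score (data_id, feature) pairs batch by batch via the mapping lookup.

    `default` is a hypothetical fallback for feature information that is
    missing from the mapping; the source does not specify this case.
    """
    scores = {}
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]  # fetch the next batch
        for data_id, feature in batch:
            scores[data_id] = SCORE_BY_FEATURE.get(feature, default)
    return scores


items = [(1, "a1"), (2, "a3"), (3, "a2"), (4, "a9")]
scores = score_in_batches(items, batch_size=2)
```

Because each lookup depends only on the item's own feature information, earlier scores never need to be revised when a later batch arrives.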
- the active learning module 23 obtains the initial training data 1-10 from the data import module 22, sorts the initial training data 1-10 according to the important priority of the feature information of the initial training data 1-10, and determines the score values of the initial training data 1-10 according to the sorting result. Then, it obtains the initial training data 11-20 from the data import module 22, sorts the initial training data 1-20 according to the important priority of the feature information of the initial training data 1-20, and determines the score values of the initial training data 1-20.
- since the score values of the initial training data 1-10 have been re-determined, the previous score values of the initial training data 1-10 need to be corrected, that is, the revised score values of the initial training data 1-10 are used.
- then, it obtains the initial training data 21-30 from the data import module 22, sorts the initial training data 1-30 according to the important priority of the feature information of the initial training data 1-30, and determines the score values of the initial training data 1-30. Since the score values of the initial training data 1-20 have been re-determined, the previous score values of the initial training data 1-20 need to be corrected, that is, the revised score values of the initial training data 1-20 are used, and so on, until the score values of all initial training data have been determined.
- the score value of the initial training data 1-10 needs to be corrected.
- the reason is that the important priority is based on the feature information of the initial training data 1-10.
- the score value of the initial training data 5 is 100.
- the initial training data 5 may not be in the first place. If it is in the sixth place, the score value of the initial training data 5 is 95, that is, the score value of the initial training data 5 has changed, so the score value of the initial training data 5 needs to be corrected.
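The need to correct earlier scores can be seen in a small sketch of rank-based scoring over a growing pool. The rule `score = 100 - (rank - 1)` is an assumption inferred from the example above (first place gives 100, sixth place gives 95); the priorities and identifiers are hypothetical.

```python
def rescore(pool):
    """Rank every item in the pool by priority and derive a score from its rank.

    pool: dict mapping data id -> important priority of its feature information.
    Assumed rule: first place scores 100, and each lower place scores 1 less.
    """
    ranked = sorted(pool, key=pool.get, reverse=True)
    return {data_id: 100 - position for position, data_id in enumerate(ranked)}


pool = {}

# First batch: data_5 has the highest priority seen so far.
pool.update({"data_5": 5, "data_1": 4})
first = rescore(pool)

# Second batch brings items with higher priority, so data_5 drops in rank.
pool.update({"data_11": 9, "data_12": 8})
second = rescore(pool)
```

Here `data_5` scores 100 after the first batch and 98 after the second, so the score recorded earlier has to be replaced by the revised one.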
- after the active learning module 23 determines the score value of each initial training data according to the characteristic information of each initial training data in the data set, it can also determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, it keeps the score value of one initial training data unchanged and reduces the score value of the other initial training data. For example, if there are repeated initial training data, it keeps the score value of the first initial training data unchanged and sets the score value of the other initial training data to 0.
- the active learning module 23 may also perform a similarity comparison process before determining the score value of the initial training data. For example, first determine the similarity between the initial training data. If the similarity is greater than the preset similarity threshold, keep one initial training data in the data set, set the score value of the remaining initial training data to 0, and do not keep the remaining initial training data in the data set. Based on this, the active learning module 23 can determine the score value of each initial training data according to the feature information of each initial training data (the initial training data whose score value is set to 0 is not included) in the data set.
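The similarity-based score reduction can be sketched as follows for text data. The source does not name a similarity measure, so `difflib.SequenceMatcher` from the Python standard library is swapped in purely as an example; the function names, the threshold of 0.9, and the sample contents are assumptions.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Example similarity measure for text: a ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def suppress_duplicates(items, scores, threshold, similarity=text_similarity):
    """Keep the score of the first item in each group of similar items; zero the rest.

    items:  list of (data_id, content) in processing order
    scores: dict mapping data_id -> score (not modified)
    """
    adjusted = dict(scores)
    kept = []  # contents whose score stays unchanged
    for data_id, content in items:
        if any(similarity(content, other) > threshold for other in kept):
            adjusted[data_id] = 0
        else:
            kept.append(content)
    return adjusted


items = [("d1", "red car on road"), ("d2", "red car on road"), ("d3", "blue sky")]
scores = {"d1": 90, "d2": 90, "d3": 70}
adjusted = suppress_duplicates(items, scores, threshold=0.9)
```

The comparison is pairwise against every kept item, so the cost grows quadratically with the number of distinct items; a large data set would likely need an indexed or hashed comparison instead.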
- the active learning module 23 supports querying initial training data according to conditions, for example, the number of initial training data whose score value is greater than a certain value, the distribution of different score value intervals, and so on.
- step 307 the active learning module 23 sends a scoring completion message to the control center module 21.
- the scoring completion message indicates that the active learning module 23 has scored all the initial training data.
- step 308 the control center module 21 sends a work instruction to the cleaning control module 24.
- the cleaning control module 24 determines the quantity M to be cleaned, and sends the quantity M to be cleaned to the active learning module 23.
- the cleaning control module 24 starts to work after receiving the work instruction. In the working process, the quantity M to be cleaned is determined first, and the quantity M to be cleaned is sent to the active learning module 23.
- the number M1 to be cleaned in the first statistical period can be configured based on experience.
- the number M2 to be cleaned in the second statistical period is determined based on the cleaning efficiency of all operating nodes in the first statistical period.
- the number M3 to be cleaned in the third statistical period is determined based on the cleaning efficiency of all operating nodes in the second statistical period, and so on.
- the cleaning control module 24 can determine the quantity M to be cleaned in each statistical period, and send the quantity M to be cleaned to the active learning module 23.
- if the cleaning efficiency of the operating nodes increases or decreases in a statistical period, and/or the number of operating nodes increases or decreases, the cleaning efficiency of all operating nodes changes, that is, the quantity M to be cleaned changes, so that the quantity M to be cleaned can be dynamically adjusted.
- the cleaning control module 24 can count the cleaning efficiency of each operating node, that is, the number of target training data completed by the operating node in the current statistical period. Then, the cleaning efficiency of all operating nodes is determined, and the number M to be cleaned is determined based on the cleaning efficiency of all operating nodes.
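One simple reading of this step is that the overall cleaning efficiency is the sum of the per-node completed counts, and that M for the next period equals that sum. Both points are assumptions, as are the function name and the node names in the sketch below.

```python
def next_quantity_to_clean(completed_by_node):
    """Derive M for the next period from per-node completed counts in the current one."""
    return sum(completed_by_node.values())


# Two hypothetical nodes in the current statistical period.
m_two_nodes = next_quantity_to_clean({"node_a": 6, "node_b": 4})

# Adding a node raises the overall efficiency, and therefore M.
m_three_nodes = next_quantity_to_clean({"node_a": 6, "node_b": 4, "node_c": 5})
```

This makes M respond both to changes in individual node throughput and to nodes joining or leaving.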
- Step 310 the active learning module 23 sorts all the initial training data according to the score value of each initial training data. Based on the ranking result, starting from the initial training data with the higher score value, it selects the first M initial training data as the target training data and sends the target training data to the cleaning control module 24.
- step 311 the cleaning control module 24 adds the target training data to the list to be cleaned.
- the active learning module 23 uses M1 initial training data as target training data, sends M1 target training data to the cleaning control module 24, and the cleaning control module 24 adds M1 target training data to the list to be cleaned.
- the active learning module 23 uses M2 initial training data as target training data, sends M2 target training data to the cleaning control module 24, and the cleaning control module 24 adds M2 target training data to the list to be cleaned. And so on.
- the cleaning control module 24 sends the target training data to the operating node, so that the operating node performs data cleaning on the target training data.
- the target training data and cleaning parameters are sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
- when the operating node can process new target training data, it can send a request message to the cleaning control module 24.
- the request message is used to request N target training data, indicating that the operating node can perform data cleaning on N target training data. N can be a positive integer.
- the cleaning control module 24 determines whether there are N target training data in the list to be cleaned. If so, it directly sends N target training data to the operating node. If not, it obtains (N-a) target training data from the active learning module 23, where a represents the number of target training data that already exist in the list to be cleaned, so that N target training data are available, and sends the N target training data to the operating node.
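The request handling above can be sketched as a single function. The active learning module is modelled here as a ranked list plus a cursor, which is an assumption; `serve_request`, `pending` and `cursor` are hypothetical names.

```python
def serve_request(pending, ranked, cursor, n):
    """Serve a request for n target training data and return (batch, new_cursor).

    pending: the list to be cleaned (mutated in place)
    ranked:  ids held by the active learning module, highest score first
    cursor:  index of the next id in `ranked` not yet handed over
    """
    a = len(pending)  # items already in the list to be cleaned
    if a < n:
        # Fetch the missing (n - a) items from the active learning module.
        extra = ranked[cursor:cursor + (n - a)]
        pending.extend(extra)
        cursor += len(extra)
    batch = pending[:n]
    del pending[:n]
    return batch, cursor


ranked = list(range(1, 11))

# The list holds 2 items and the node asks for 5: 3 more are fetched.
pending = [1, 2]
batch_a, cursor_a = serve_request(pending, ranked, cursor=2, n=5)

# The list holds 3 items and the node asks for 2: nothing is fetched.
pending_b = [6, 7, 8]
batch_b, cursor_b = serve_request(pending_b, ranked, cursor=8, n=2)
```

If the active learning module has fewer than (N-a) items left, this sketch returns a shorter batch; the source does not describe that case.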
- the operation node may also be referred to as a cleaning node.
- the operation node may be a machine or a manual operation. There is no restriction on this, as long as the target training data can be cleaned.
- step 313 the cleaning control module 24 feeds back the task execution status to the control center module 21.
- data cleaning can be performed on the target training data instead of data cleaning on all initial training data, thereby improving the efficiency of data cleaning and reducing the invalid input of redundant data.
- data cleaning can be performed on target training data with good training effects (that is, high score values), and the most effective data can be provided for training, so that training data with better effects can participate in machine learning, and the effect of machine learning is better. Improve the utilization of cleaning resources.
- Each operation node can perform data cleaning on 100 target training data every day. Assuming there are 1000 initial training data, the 200 initial training data whose score value is greater than n can be selected from the 1000 initial training data, and these 200 initial training data are used as target training data. Then, 100 target training data can be provided to one operating node, and the remaining 100 target training data can be provided to another operating node. In this way, two operating nodes can perform data cleaning on the above 200 target training data.
- 100 initial training data with high scores can be selected from the remaining 800 initial training data, and these 100 initial training data are used as the target training data. These target training data are provided to newly added operation nodes.
- the cleaning control module 24 obtains 1000 target training data from the active learning module 23 and invests in cleaning.
- the cleaning control module 24 obtains 100 target training data from the active learning module 23 and invests in cleaning.
- 1100 target training data are acquired from the active learning module 23 and used for cleaning.
- FIG. 4 is a structural diagram of the data cleaning device, and the device includes:
- the obtaining module 41 is used to obtain a data set, and the data set includes a plurality of initial training data; the determining module 42 is used to determine the score value of each initial training data according to the characteristic information of each initial training data in the data set.
- the score value is used to indicate the training effect of the initial training data; the selection module 43 is used to select target training data from the data set according to the score value of each initial training data; the cleaning module 44 is used to perform data cleaning according to the target training data.
- the determining module 42 is specifically configured to: for each initial training data in the data set, query a pre-configured mapping relationship through the feature information of the initial training data to obtain the score value of the initial training data; wherein, The mapping relationship includes the corresponding relationship between the feature information and the score value; or,
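- according to the important priority of the feature information of each initial training data in the data set, all initial training data are sorted, and the score value of each initial training data is determined according to the sorting result.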
- the characteristic information includes application scenarios and/or data quality, and the determining module 42 is specifically configured to:
- if the feature information includes application scenarios and data quality, determine the scenario score of each initial training data according to the application scenario of each initial training data in the data set; determine the quality score of each initial training data according to the data quality of each initial training data in the data set;
- the score value of the initial training data is determined according to the scene score and the scene weight value of the initial training data, as well as the quality score and the quality weight value.
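One plausible way to combine the two scores is a weighted sum, sketched below. The source only says that the score value is determined from the scene score, scene weight value, quality score and quality weight value, so the additive formula, the function name and the example weights are all assumptions.

```python
def combined_score(scene_score, quality_score, scene_weight, quality_weight):
    """Weighted sum of scene score and quality score (assumed combination rule)."""
    return scene_score * scene_weight + quality_score * quality_weight


# Hypothetical values: the scene counts for 60% and quality for 40%.
score = combined_score(scene_score=80, quality_score=100,
                       scene_weight=0.6, quality_weight=0.4)
```

With these values the result is approximately 88. Keeping the weights summing to 1 keeps the combined score on the same scale as its inputs.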
- the determining module 42 is further configured to: after determining the score value of each initial training data according to the characteristic information of each initial training data in the data set, determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, the score value of one initial training data is kept unchanged, and the score value of the other initial training data is reduced.
- the selection module 43 is specifically configured to: for each initial training data, if the score value of the initial training data is greater than a preset score threshold, determine the initial training data as target training data; or, sort all initial training data according to the score value of each initial training data in the data set, and select multiple initial training data as the target training data according to the sorting result.
- when the selection module 43 selects a plurality of initial training data as target training data according to the sorting result, it is specifically used to: determine the quantity M to be cleaned in the next statistical period;
- M initial training data are sequentially selected as the target training data.
- when the selection module 43 determines the quantity M to be cleaned in the next statistical period, it is specifically used for:
- the quantity M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating node, and the cleaning efficiency represents the completed cleaning quantity of the operating node in the current statistical period.
- the cleaning module 44 is specifically configured to send the target training data and cleaning parameters to an operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
- an embodiment of this application also proposes a data cleaning device.
- the data cleaning device may include: a processor 51 and a machine-readable storage medium 52, where the machine-readable storage medium 52 stores machine-executable instructions that can be executed by the processor 51; the processor 51 is used to execute the machine-executable instructions to implement the methods disclosed in the above examples of this application.
- the processor 51 is used to execute machine executable instructions to implement the following steps:
- an embodiment of the application also provides a machine-readable storage medium, wherein a number of computer instructions are stored on the machine-readable storage medium, and when the computer instructions are executed by a processor, the method disclosed in the above examples of this application is implemented.
- the foregoing machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information, such as executable instructions, data, and so on.
- the machine-readable storage medium can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, storage drives (such as hard drives), solid state drives, any type of storage disk (such as CD, DVD, etc.), or similar storage media, or a combination of them.
- a typical implementation device is a computer.
- the specific form of the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email receiving and sending device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
- the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application may adopt the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
- these computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device,
- the instruction device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
- These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, so that the instructions executed on the computer or other programmable equipment provide steps for implementing functions specified in a flow or multiple flows in the flowchart and/or a block or multiple blocks in the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Feature information | Score value |
Feature information a1 | 100 |
Feature information a2 | 95 |
Feature information a3 | 90 |
Feature information a4 | 85 |
…… | …… |
Feature information | Important priority |
Feature information a1 | 10 |
Feature information a2 | 9 |
Feature information a3 | 8 |
Feature information a4 | 7 |
…… | …… |
Claims (12)
- 一种数据清洗方法,其特征在于,所述方法包括:A data cleaning method, characterized in that the method includes:获取数据集合,所述数据集合包括多个初始训练数据;Acquiring a data set, the data set including a plurality of initial training data;根据所述数据集合中的每个初始训练数据的特征信息,确定每个初始训练数据的分数值,所述分数值用于表示初始训练数据的训练效果;Determine the score value of each initial training data according to the feature information of each initial training data in the data set, where the score value is used to represent the training effect of the initial training data;根据每个初始训练数据的分数值从所述数据集合中选取目标训练数据;Selecting target training data from the data set according to the score value of each initial training data;根据所述目标训练数据进行数据清洗。Perform data cleaning according to the target training data.
- 根据权利要求1所述的方法,其特征在于,所述根据所述数据集合中的每个初始训练数据的特征信息,确定每个初始训练数据的分数值,包括:The method according to claim 1, wherein the determining the score value of each initial training data according to the characteristic information of each initial training data in the data set comprises:针对所述数据集合中的每个初始训练数据,通过所述初始训练数据的特征信息查询预先配置的映射关系,得到所述初始训练数据的分数值;其中,所述映射关系包括特征信息与分数值的对应关系;或者,For each initial training data in the data set, the pre-configured mapping relationship is queried through the feature information of the initial training data to obtain the score value of the initial training data; wherein, the mapping relationship includes feature information and score Correspondence of values; or,根据所述数据集合中的每个初始训练数据的特征信息的重要优先级,对所有初始训练数据进行排序,根据排序结果确定每个初始训练数据的分数值。According to the important priority of the feature information of each initial training data in the data set, all initial training data are sorted, and the score value of each initial training data is determined according to the sorting result.
- 根据权利要求1或2所述的方法,其特征在于,The method according to claim 1 or 2, characterized in that:所述特征信息包括应用场景和/或数据质量,所述根据所述数据集合中的每个初始训练数据的特征信息,确定每个初始训练数据的分数值,包括:The feature information includes application scenarios and/or data quality, and determining the score value of each initial training data according to the feature information of each initial training data in the data set includes:若所述特征信息包括应用场景和数据质量,则根据所述数据集合中的每个初始训练数据的应用场景,确定每个初始训练数据的场景分;根据所述数据集合中的每个初始训练数据的数据质量,确定每个初始训练数据的质量分;If the feature information includes application scenarios and data quality, determine the scenario score of each initial training data according to the application scenario of each initial training data in the data set; according to each initial training data in the data set The data quality of the data, determine the quality score of each initial training data;针对每个初始训练数据,根据所述初始训练数据的场景分和场景权重值,以及质量分和质量权重值,确定所述初始训练数据的分数值。For each initial training data, the score value of the initial training data is determined according to the scene score and the scene weight value of the initial training data, as well as the quality score and the quality weight value.
- 根据权利要求1或2所述的方法,其特征在于,The method according to claim 1 or 2, characterized in that:所述根据所述数据集合中的每个初始训练数据的特征信息,确定每个初始训练数据的分数值之后,所述方法还包括:After determining the score value of each initial training data according to the characteristic information of each initial training data in the data set, the method further includes:确定两个初始训练数据之间的相似度;若所述相似度大于预设相似度阈值,则保持一个初始训练数据的分数值不变,并降低另一个初始训练数据的分数值。Determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, keep the score value of one initial training data unchanged, and reduce the score value of the other initial training data.
- The method according to claim 1, wherein selecting target training data from the data set according to the score value of each piece of initial training data comprises: for each piece of initial training data, if its score value is greater than a preset score threshold, determining that initial training data to be target training data; or, sorting all of the initial training data according to the score value of each piece of initial training data in the data set, and selecting multiple pieces of initial training data as target training data according to the sorting result.
- The method according to claim 5, wherein selecting multiple pieces of initial training data as target training data according to the sorting result comprises: determining a quantity M to be cleaned in the next statistical period; and, in the next statistical period, based on the sorting result and starting from the initial training data with the highest score value, selecting M pieces of initial training data in order as target training data, where M is a natural number.
- The method according to claim 6, wherein determining the quantity M to be cleaned in the next statistical period comprises: determining the quantity M to be cleaned in the next statistical period according to a cleaning efficiency of an operating node, the cleaning efficiency representing the amount of cleaning completed by the operating node in the current statistical period.
- The method according to claim 1, wherein performing data cleaning according to the target training data comprises: sending the target training data and a cleaning parameter to an operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameter.
- A data cleaning apparatus, comprising: an acquisition module, configured to acquire a data set, the data set comprising a plurality of pieces of initial training data; a determining module, configured to determine a score value of each piece of initial training data according to feature information of each piece of initial training data in the data set, the score value representing the training effect of the initial training data; a selection module, configured to select target training data from the data set according to the score value of each piece of initial training data; and a cleaning module, configured to perform data cleaning according to the target training data.
- A data cleaning device, comprising a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor, wherein the processor is configured to execute the machine-executable instructions to implement the following steps: acquiring a data set, the data set comprising a plurality of pieces of initial training data; determining a score value of each piece of initial training data according to feature information of each piece of initial training data in the data set, the score value representing the training effect of the initial training data; selecting target training data from the data set according to the score value of each piece of initial training data; and performing data cleaning according to the target training data.
- A computer program, stored in a machine-readable storage medium, which, when executed by a processor, causes the processor to implement the method according to any one of claims 1-8.
- A machine-readable storage medium storing machine-executable instructions which, when called and executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
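The score-determination claims above (mapping lookup, weighted scenario and quality scores, and the similarity-based reduction) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the claims: the lookup tables, the weights 0.6 and 0.4, the use of a weighted sum, the placeholder similarity function, and the halving factor.

```python
from dataclasses import dataclass

# Hypothetical pre-configured mappings from feature information to scores.
SCENARIO_SCORES = {"highway": 90, "parking_lot": 60, "indoor": 40}
QUALITY_SCORES = {"high": 100, "medium": 70, "low": 30}


@dataclass
class Sample:
    name: str
    scenario: str
    quality: str
    score: float = 0.0


def weighted_score(sample, scenario_weight=0.6, quality_weight=0.4):
    """Look up both sub-scores and combine them as a weighted sum (assumed form)."""
    scenario_score = SCENARIO_SCORES.get(sample.scenario, 0)
    quality_score = QUALITY_SCORES.get(sample.quality, 0)
    return scenario_weight * scenario_score + quality_weight * quality_score


def toy_similarity(a, b):
    """Placeholder similarity: 1.0 if all features match, otherwise 0.0."""
    return 1.0 if (a.scenario, a.quality) == (b.scenario, b.quality) else 0.0


def penalize_similar(samples, similarity=toy_similarity, threshold=0.9, factor=0.5):
    """For each pair above the threshold, keep the first score and lower the second.

    A sample that is similar to several earlier samples is lowered once per pair.
    """
    for i, first in enumerate(samples):
        for second in samples[i + 1:]:
            if similarity(first, second) > threshold:
                second.score *= factor


samples = [
    Sample("img_001", "highway", "high"),
    Sample("img_002", "highway", "high"),  # same features as img_001
    Sample("img_003", "indoor", "medium"),
]
for s in samples:
    s.score = weighted_score(s)
penalize_similar(samples)
```

In this sketch the second of two similar samples loses half of its score, so near-duplicates fall down the ranking without being discarded outright.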
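The two selection alternatives in the claims above (a score threshold, or taking the top M from a sorted list) might look like the following Python sketch. The function names and the rule that M simply equals the amount cleaned in the current period are assumptions for illustration; the claims only state that M is determined according to the cleaning efficiency.

```python
def select_by_threshold(scores, threshold):
    """Return every item whose score value is strictly greater than the threshold."""
    return [name for name, score in scores.items() if score > threshold]


def select_top_m(scores, m):
    """Sort items by score value, highest first, and return the first m."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:m]


def next_period_quantity(completed_this_period, minimum=1):
    """Assumed rule: plan as many items as the operating node finished this period."""
    return max(minimum, completed_this_period)


scores = {"img_001": 94.0, "img_002": 47.0, "img_003": 52.0, "img_004": 70.0}
above_threshold = select_by_threshold(scores, 60)
m = next_period_quantity(completed_this_period=3)
top_m = select_top_m(scores, m)
```

Tying M to recent throughput keeps the queue handed to an operating node close to what it can actually finish in one statistical period.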
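The four modules of the apparatus claim, together with the hand-off to an operating node, can be sketched as one Python function that chains them. All names, the callable-based interface, and the `{"task": "relabel"}` cleaning parameter are hypothetical; the claims do not specify how the operating node is reached or what a cleaning parameter contains.

```python
def run_cleaning(acquire, score, select, operating_node, cleaning_params):
    """Acquire data, score it, select targets, and hand them to an operating node."""
    data_set = list(acquire())                          # acquisition module
    scores = {item: score(item) for item in data_set}   # determining module
    targets = select(scores)                            # selection module
    return operating_node(targets, cleaning_params)     # cleaning module


def example_node(targets, params):
    """Stand-in for an operating node: pairs each item with the requested task."""
    return [(item, params["task"]) for item in targets]


result = run_cleaning(
    acquire=lambda: ["a.jpg", "b.jpg", "c.jpg"],
    score=lambda item: {"a.jpg": 94, "b.jpg": 47, "c.jpg": 52}[item],
    select=lambda scores: [k for k, v in scores.items() if v > 50],
    operating_node=example_node,
    cleaning_params={"task": "relabel"},
)
```

Passing each stage in as a callable keeps the sketch neutral about scoring and selection strategy, so either claimed alternative can be plugged in.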
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010495705.0 | 2020-06-03 | ||
CN202010495705.0A CN113762519B (en) | 2020-06-03 | Data cleaning method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021244583A1 (en) | 2021-12-09 |
Family
Family ID: 78783341
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/097992 WO2021244583A1 (en) | Data cleaning method, apparatus and device, program, and storage medium | 2020-06-03 | 2021-06-02 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021244583A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116501829A (en) * | 2023-06-29 | 2023-07-28 | 北京法伯宏业科技发展有限公司 | Data management method and system based on artificial intelligence large language model platform |
CN117171153A (en) * | 2023-09-11 | 2023-12-05 | 北京三维天地科技股份有限公司 | Visual data cleaning method and system supporting custom cleaning flow |
CN117891812A (en) * | 2024-03-18 | 2024-04-16 | 北京数字一百信息技术有限公司 | Big data cleaning method and system based on artificial intelligence |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109961098A (en) * | 2019-03-22 | 2019-07-02 | 中国科学技术大学 | A kind of training data selection method of machine learning |
US20190311300A1 (en) * | 2018-04-09 | 2019-10-10 | Veda Data Solutions, Inc. | Scheduling Machine Learning Tasks, and Applications Thereof |
CN110866658A (en) * | 2019-12-05 | 2020-03-06 | 国网江苏省电力有限公司南通供电分公司 | Method for predicting medium and long term load of urban power grid |
- 2021-06-02: WO application PCT/CN2021/097992 (WO2021244583A1), status: active, Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116501829A (en) * | 2023-06-29 | 2023-07-28 | 北京法伯宏业科技发展有限公司 | Data management method and system based on artificial intelligence large language model platform |
CN116501829B (en) * | 2023-06-29 | 2023-09-19 | 北京法伯宏业科技发展有限公司 | Data management method and system based on artificial intelligence large language model platform |
CN117171153A (en) * | 2023-09-11 | 2023-12-05 | 北京三维天地科技股份有限公司 | Visual data cleaning method and system supporting custom cleaning flow |
CN117891812A (en) * | 2024-03-18 | 2024-04-16 | 北京数字一百信息技术有限公司 | Big data cleaning method and system based on artificial intelligence |
CN117891812B (en) * | 2024-03-18 | 2024-05-24 | 北京数字一百信息技术有限公司 | Big data cleaning method and system based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN113762519A (en) | 2021-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021244583A1 (en) | Data cleaning method, apparatus and device, program, and storage medium | |
WO2022151649A1 (en) | Deep interest network-based topic recommendation method and apparatus | |
CN108427738B (en) | Rapid image retrieval method based on deep learning | |
JP6874757B2 (en) | Learning equipment, learning methods and programs | |
CN110781819A (en) | Image target detection method, system, electronic equipment and storage medium | |
WO2022042123A1 (en) | Image recognition model generation method and apparatus, computer device and storage medium | |
CN111914085A (en) | Text fine-grained emotion classification method, system, device and storage medium | |
Barnaghi et al. | A comparative study for various methods of classification | |
WO2020233709A1 (en) | Model compression method, and device | |
Li et al. | Deep representation via convolutional neural network for classification of spatiotemporal event streams | |
CN116089883B (en) | Training method for improving classification degree of new and old categories in existing category increment learning | |
JP2022117941A (en) | Image searching method and device, electronic apparatus, and computer readable storage medium | |
CN115115825B (en) | Method, device, computer equipment and storage medium for detecting object in image | |
CN111783997A (en) | Data processing method, device and equipment | |
CN114118207B (en) | Incremental learning image identification method based on network expansion and memory recall mechanism | |
WO2021253938A1 (en) | Neural network training method and apparatus, and video recognition method and apparatus | |
Mithun et al. | Generating diverse image datasets with limited labeling | |
US10909167B1 (en) | Systems and methods for organizing an image gallery | |
JP6991960B2 (en) | Image recognition device, image recognition method and program | |
CN112529078A (en) | Service processing method, device and equipment | |
US8660974B2 (en) | Inference over semantic network with some links omitted from indexes | |
CN114187465A (en) | Method and device for training classification model, electronic equipment and storage medium | |
CN114821248B (en) | Point cloud understanding-oriented data active screening and labeling method and device | |
CN116543250A (en) | Model compression method based on class attention transmission | |
CN113762519B (en) | Data cleaning method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21818151; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 21818151; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.06.2023) |