WO2021244583A1 - Data cleaning method, apparatus and device, program, and storage medium - Google Patents

Data cleaning method, apparatus and device, program, and storage medium

Publication number
WO2021244583A1
Authority
WO
WIPO (PCT)
Prior art keywords
training data
data
initial training
score value
initial
Prior art date
Application number
PCT/CN2021/097992
Other languages
French (fr)
Chinese (zh)
Inventor
许江浩
任国焘
陈杰
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010495705.0A external-priority patent/CN113762519B/en
Application filed by 杭州海康威视数字技术股份有限公司 filed Critical 杭州海康威视数字技术股份有限公司
Publication of WO2021244583A1 publication Critical patent/WO2021244583A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of image processing technology, in particular to a data cleaning method, device and equipment, program and storage medium.
  • Machine learning is a way to realize artificial intelligence. It is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. Machine learning studies how computers simulate or implement human learning behaviors in order to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning focuses on algorithm design, so that computers can automatically learn rules from data and use those rules to make predictions on unknown data.
  • Machine learning is widely used, for example in data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, stock market analysis, DNA sequencing, speech and handwriting recognition, strategy games, and robotics.
  • The above method requires data cleaning of all the initial training data and cannot filter the initial training data.
  • As a result, training data with a poor training effect also takes part in machine learning, and the learning effect is poor.
  • This application provides a data cleaning method, which includes:
  • The present application provides a data cleaning device, which includes:
  • an acquisition module, configured to acquire a data set, the data set including a plurality of initial training data;
  • a determining module, configured to determine the score value of each initial training data according to the feature information of each initial training data in the data set, the score value being used to represent the training effect of the initial training data;
  • a selection module, configured to select target training data from the data set according to the score value of each initial training data;
  • a cleaning module, configured to perform data cleaning according to the target training data.
  • The present application provides a data cleaning device, including a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions that can be executed by the processor;
  • the processor is configured to execute the machine-executable instructions to implement the following steps:
  • The present application provides a computer program, which is stored in a machine-readable storage medium; when a processor executes the computer program, the processor is caused to implement the method of the above first aspect.
  • The present application provides a machine-readable storage medium that stores machine-executable instructions. When called and executed by a processor, the machine-executable instructions cause the processor to execute the method of the first aspect.
  • The score value of the initial training data is determined according to the feature information of the initial training data.
  • The score value is used to represent the training effect of the initial training data. Target training data is selected from the initial training data, and data cleaning is performed on the target training data rather than on all the initial training data, which improves the efficiency of data cleaning and reduces the wasted effort spent on redundant data. Target training data with a good training effect (that is, a high score value) can be cleaned to provide the most effective data for training, so that training data with a better effect participates in machine learning, the effect of machine learning is better, and the utilization of cleaning resources is improved.
  • FIG. 1 is a flowchart of a data cleaning method in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application scenario in an embodiment of the present application.
  • FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application.
  • FIG. 4 is a structural diagram of a data cleaning device in an embodiment of the present application.
  • FIG. 5 is a structural diagram of a data cleaning device in an embodiment of the present application.
  • Although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
  • Depending on the context, the word "if" as used here can be interpreted as "upon", "when", or "in response to determining".
  • Machine learning is a way to realize artificial intelligence. It studies how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance.
  • A neural network is one specific implementation of machine learning. This document uses a neural network as an example to introduce the implementation of machine learning; other types of machine learning algorithms are handled similarly.
  • The neural network may include, but is not limited to, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a fully connected network, etc.
  • The structural units of the neural network may include, but are not limited to, a Convolutional Layer (Conv), a Pooling Layer (Pool), an excitation layer, a Fully Connected Layer (FC), etc.
  • In the convolutional layer, the data features are enhanced by performing a convolution operation with a convolution kernel.
  • The convolutional layer uses the convolution kernel to perform the convolution operation over a spatial range.
  • The convolution kernel can be a matrix of size m*n.
  • Convolving the input of the convolutional layer with the convolution kernel yields the output of the convolutional layer.
  • The convolution operation is in effect a filtering process.
  • The data is convolved with the convolution kernel w(x, y) to obtain multiple convolution features. These convolution features are the output of the convolutional layer and can be provided to the pooling layer.
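  • The filtering view of the convolutional layer can be illustrated with a minimal sketch. The function name `convolve2d`, the 3x3 input, and the 2x2 kernel values are illustrative assumptions, not taken from the application; the sketch uses the sliding-window form (no kernel flip) that neural network layers commonly use.

```python
def convolve2d(data, kernel):
    """Slide an m*n kernel over the data ("valid" positions only) and
    return the resulting convolution features."""
    m, n = len(kernel), len(kernel[0])
    rows, cols = len(data), len(data[0])
    features = []
    for i in range(rows - m + 1):
        row = []
        for j in range(cols - n + 1):
            acc = 0
            # Weighted sum of the m*n window under the kernel.
            for a in range(m):
                for b in range(n):
                    acc += data[i + a][j + b] * kernel[a][b]
            row.append(acc)
        features.append(row)
    return features


# Hypothetical input data and kernel w(x, y).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
features = convolve2d(image, kernel)  # [[-4, -4], [-4, -4]]
```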
  • The pooling layer in effect performs a down-sampling process.
  • By taking the maximum value, minimum value, or average value of multiple convolution features (that is, the output of the convolutional layer), the amount of calculation can be reduced.
  • The principle of local correlation can be used to sub-sample the data, which reduces the amount of data to process while retaining the useful information in the data.
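  • The down-sampling performed by the pooling layer can be sketched as follows. The helper names `pool2d` and `mean`, the non-overlapping 2x2 window, and the sample values are assumptions chosen for illustration.

```python
def pool2d(features, size, reduce_fn=max):
    """Down-sample by applying reduce_fn to each non-overlapping size*size window."""
    rows, cols = len(features), len(features[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        row = []
        for j in range(0, cols - size + 1, size):
            window = [features[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 convolution features.
conv_features = [[1, 3, 2, 4],
                 [5, 7, 6, 8],
                 [9, 11, 10, 12],
                 [13, 15, 14, 16]]
max_pooled = pool2d(conv_features, 2, max)   # [[7, 8], [15, 16]]
min_pooled = pool2d(conv_features, 2, min)   # [[1, 2], [9, 10]]
avg_pooled = pool2d(conv_features, 2, mean)  # [[4.0, 5.0], [12.0, 13.0]]
```

Each variant reduces 16 values to 4, which is the reduction in calculation the text refers to.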
  • In the excitation layer, an activation function (such as a non-linear function) can be used to map the features output by the pooling layer, thereby introducing non-linear factors so that the neural network gains expressive power through non-linear combination.
  • The activation function of the excitation layer can include, but is not limited to, the ReLU (Rectified Linear Unit) function.
  • Among all the features output by the pooling layer, the ReLU function sets the features that are less than 0 to 0 and leaves the features that are greater than 0 unchanged.
  • The fully connected layer performs fully connected processing on all the features input to it, thereby obtaining a feature vector, which may include multiple features.
  • One or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully connected layers can be combined to construct a neural network according to different needs.
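  • As a minimal sketch of the ReLU behavior described above (the function name `relu` and the input values are illustrative assumptions):

```python
def relu(features):
    """Set features below 0 to 0 and leave non-negative features unchanged."""
    return [max(0.0, x) for x in features]


activated = relu([-2.0, -0.5, 0.0, 1.5, 3.0])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```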
  • The neural network needs to be trained first.
  • A large amount of initial training data can be obtained and cleaned to obtain cleaned training data, and the cleaned training data can be used to train the neural network parameters, such as convolutional layer parameters (for example, convolution kernel parameters), pooling layer parameters, excitation layer parameters, and fully connected layer parameters; there is no restriction on this.
  • The trained neural network can then be used for business processing: for example, input data is provided to the neural network, which processes the input data using the neural network parameters to obtain output data, thereby completing business processing such as face detection or vehicle detection.
  • The score value of each initial training data can be determined. The score value is used to represent the training effect of the initial training data, that is, the higher the score value, the better the training effect of the training data. Therefore, the part of the initial training data with high score values can be used as target training data, the target training data can be cleaned, and the cleaned target training data can be used to train the neural network.
  • Since the target training data is training data with high score values, that is, training data with a better training effect, the training effect of the neural network will be better. In other words, the reliability of the neural network is improved, for example through higher accuracy in face detection and vehicle detection.
  • The method may include:
  • Step 101: Obtain a data set, where the data set may include multiple initial training data.
  • When training data needs to be used to train the neural network, the training data may be obtained first.
  • For ease of distinction, this training data is called initial training data.
  • The initial training data can be obtained from a certain device, or initial training data input by the user can be received; there is no restriction on this.
  • The initial training data may be classified, and each type of initial training data is added to one data set.
  • For example, the initial training data for face detection is added to data set 1, the initial training data for vehicle detection is added to data set 2, and so on; there is no restriction on the classification method.
  • In this way, at least one data set can be obtained, and each data set includes multiple initial training data. Since every data set is processed in the same way, the following description takes the processing of one data set as an example.
  • Step 102: Determine the score value of each initial training data according to the feature information of each initial training data in the data set.
  • The score value is used to indicate the training effect of the initial training data. For example, a higher score value means that the training effect of the initial training data is better, and a lower score value means that the training effect of the initial training data is worse.
  • The feature information of the initial training data can characterize the training effect of the initial training data.
  • When the feature information indicates that the training effect of the initial training data is better, the score value of the initial training data is higher; when the feature information indicates that the training effect is worse, the score value of the initial training data is lower.
  • Therefore, the score value of the initial training data can be determined according to the feature information of the initial training data.
  • When the feature information of at least two initial training data is the same, the score values of these initial training data may be the same or different.
  • The following manners may be used to determine the score value of each initial training data:
  • Manner 1: Query a pre-configured mapping relationship with the feature information of each initial training data to obtain its score value. The mapping relationship can include, but is not limited to, the correspondence between feature information and score value.
  • The correspondence between feature information and score value can be configured based on experience, and there is no limitation on this. For example, when feature information a1 indicates that the training effect of the initial training data is better, the score value corresponding to feature information a1 is higher. As another example, when feature information a2 indicates that the training effect of the initial training data is poor, the score value corresponding to feature information a2 is lower.
  • The mapping relationship is used to record the correspondence between the feature information and the score value.
  • The score value can use a 100-point scale or another scale, and there is no restriction on this.
  • Table 1 shows the mapping relationship in the form of a table.
  • Other data structures can also be used to represent the mapping relationship, as long as they include the correspondence between the feature information and the score value.
  • The feature information of the initial training data can be obtained.
  • The initial training data may itself include the feature information, in which case the feature information can be obtained directly from the initial training data.
  • Alternatively, a certain algorithm (such as a deep learning algorithm) can be used to analyze the initial training data. This analysis process is not limited, as long as the feature information of the initial training data can be obtained.
  • The mapping relationship shown in Table 1 can then be looked up with the feature information of the initial training data to obtain its score value. For example, if the feature information of the initial training data is feature information a3, the score value of the initial training data is 90.
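  • The mapping lookup of Manner 1 can be sketched as a dictionary query. Table 1 is not reproduced in this text, so only the pair a3 -> 90 comes from the description; the other entries, the sample records, and the names `SCORE_BY_FEATURE` and `score_by_mapping` are hypothetical.

```python
# Hypothetical stand-in for Table 1 (feature information -> score value).
# Only "a3": 90 is stated in the description; the rest is assumed.
SCORE_BY_FEATURE = {"a1": 95, "a2": 40, "a3": 90}


def score_by_mapping(samples, mapping, default=0):
    """Look up each sample's feature information in the mapping relationship."""
    return {s["id"]: mapping.get(s["feature"], default) for s in samples}


samples = [
    {"id": 1, "feature": "a3"},
    {"id": 2, "feature": "a2"},
]
scores = score_by_mapping(samples, SCORE_BY_FEATURE)  # {1: 90, 2: 40}
```

Falling back to a default score for unknown feature information is a design choice of this sketch; the description does not say how unmapped feature information is handled.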
  • Manner 2: Sort all the initial training data according to the importance priority of the feature information of each initial training data in the data set, and determine the score value of each initial training data according to the sorting result.
  • The importance priority of the feature information can be pre-configured based on experience, and there is no restriction on it. For example, when feature information a1 indicates that the training effect of the initial training data is good and feature information a2 indicates that the training effect is poor, the importance priority of feature information a1 may be greater than that of feature information a2.
  • Table 2 is an example of the importance priority of feature information. The higher the value, the greater the importance priority. Table 2 expresses importance priorities in tabular form; other data structures may also be used, as long as they include the importance priority of the feature information.
  • The feature information of the initial training data can be acquired.
  • Table 2 can be looked up with the feature information of the initial training data to obtain the importance priority of the initial training data. Then all the initial training data are sorted according to the importance priority of their feature information, and the score value of each initial training data is determined according to the sorting result.
  • For example, suppose the ranking result is initial training data 1, initial training data 2, and initial training data 3. Then the score value of initial training data 1 is greater than that of initial training data 2, and the score value of initial training data 2 is greater than that of initial training data 3.
  • For example, the score value of initial training data 1 is 100, the score value of initial training data 2 is 99, and the score value of initial training data 3 is 98.
  • The above score values are just an example.
  • In this way, the score value of each initial training data can be determined according to the sorting result.
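  • The ranking-based scoring of Manner 2 can be sketched as follows. Table 2 is not reproduced in this text, so the priority values, the sample records, and the names `PRIORITY_BY_FEATURE` and `score_by_ranking` are hypothetical; the 100, 99, 98 pattern follows the example in the description.

```python
# Hypothetical stand-in for Table 2 (feature information -> priority value);
# a higher value means a greater priority.
PRIORITY_BY_FEATURE = {"a1": 3, "a2": 1, "a3": 2}


def score_by_ranking(samples, priority, top_score=100):
    """Sort samples by the priority of their feature information (highest first)
    and assign descending score values: top_score, top_score - 1, ..."""
    ranked = sorted(samples, key=lambda s: priority[s["feature"]], reverse=True)
    return {s["id"]: top_score - rank for rank, s in enumerate(ranked)}


samples = [
    {"id": 3, "feature": "a2"},
    {"id": 1, "feature": "a1"},
    {"id": 2, "feature": "a3"},
]
scores = score_by_ranking(samples, PRIORITY_BY_FEATURE)  # {1: 100, 2: 99, 3: 98}
```

Decreasing the score by one per rank is only one possible rule; the description requires only that a higher rank yields a higher score value.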
  • The feature information may include, but is not limited to, the application scenario and/or the data quality.
  • The application scenario represents the scenario information of the initial training data, such as daytime, night, sunny day, rainy day, etc.
  • The data quality represents the quality information of the initial training data, such as resolution. The higher the resolution, the better the data quality and the clearer the data.
  • The above are only examples of data quality, and there is no restriction on this.
  • Case 1: If the feature information includes the application scenario, determine the scene score of each initial training data according to the application scenario of each initial training data in the data set. The scene score is used to represent the training effect of the initial training data: the higher the scene score, the better the training effect, and the lower the scene score, the worse the training effect. Then the score value of the initial training data is determined according to its scene score; for example, the scene score is used directly as the score value.
  • The application scenario of the initial training data can characterize the training effect of the initial training data.
  • When the application scenario indicates that the training effect of the initial training data is better, the scene score of the initial training data is higher; when the application scenario indicates that the training effect is worse, the scene score is lower.
  • Therefore, the scene score of the initial training data can be determined according to its application scenario. For example, when initial training data from application scenarios such as night or rainy days is used for training, the training effect is better, so the scene score of such initial training data is higher. When initial training data from application scenarios such as daytime or sunny days is used for training, the training effect is poor, so the scene score of such initial training data is lower.
  • Case 2: If the feature information includes the data quality, the quality score of each initial training data is determined according to its data quality, and the score value of the initial training data is then determined according to its quality score; for example, the quality score is used directly as the score value.
  • The data quality of the initial training data can characterize the training effect of the initial training data.
  • When the data quality indicates that the training effect of the initial training data is better, the quality score of the initial training data is higher; when the data quality indicates that the training effect is worse, the quality score is lower.
  • Therefore, the quality score of the initial training data can be determined according to its data quality.
  • When initial training data of good data quality is used for training, the training effect is better, and the quality score of the initial training data is higher.
  • When initial training data of poor data quality is used for training, the training effect is poor, and the quality score of the initial training data is lower.
  • Case 3: If the feature information includes both the application scenario and the data quality, determine the scene score of each initial training data according to its application scenario, and determine the quality score of each initial training data according to its data quality; both scores are used to represent the training effect of the initial training data. Then, for each initial training data, the score value is determined according to the scene score and the scene weight value together with the quality score and the quality weight value.
  • The scene weight value and the quality weight value can be configured according to experience, and there is no restriction on this.
  • The sum of the scene weight value and the quality weight value can be 1. If the user cares more about the application scenario, the scene weight value is greater than the quality weight value: for example, the scene weight value is 0.7 and the quality weight value is 0.3, or the scene weight value is 0.6 and the quality weight value is 0.4. If the user cares more about data quality, the quality weight value is greater than the scene weight value: for example, the scene weight value is 0.3 and the quality weight value is 0.7, or the scene weight value is 0.4 and the quality weight value is 0.6. Both weight values can also be set to 0.5. The above are just a few examples of scene weight values and quality weight values.
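  • One natural reading of Case 3 is a weighted sum of the two scores. The description does not spell out the formula, so the weighted sum below, the function name `combined_score`, and the example scores 80 and 60 are assumptions; the weight pairs come from the examples above.

```python
def combined_score(scene_score, quality_score,
                   scene_weight=0.5, quality_weight=0.5):
    """Assumed combination rule: weighted sum of scene score and quality score."""
    return scene_weight * scene_score + quality_weight * quality_score


# Hypothetical sample with scene score 80 and quality score 60.
scene_focused = combined_score(80, 60, scene_weight=0.7, quality_weight=0.3)
quality_focused = combined_score(80, 60, scene_weight=0.3, quality_weight=0.7)
balanced = combined_score(80, 60)
```

With these inputs the scene-focused weighting gives about 74, the quality-focused weighting about 66, and the balanced weighting 70, so the same sample ranks differently depending on what the user cares about.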
  • In Case 1 and Case 3, the scene score of each initial training data needs to be determined according to the application scenario of each initial training data in the data set.
  • For example, the pre-configured mapping relationship (which includes the correspondence between application scenarios and scene scores) is queried with the application scenario of the initial training data to obtain the scene score of the initial training data.
  • For the specific implementation, refer to Manner 1 above, replacing the feature information with the application scenario and the score value with the scene score; details are not repeated here.
  • Alternatively, for the specific implementation, refer to Manner 2 above; details are not repeated here.
  • Similarly, the pre-configured mapping relationship (which includes the correspondence between data quality and quality scores) is queried with the data quality of the initial training data to obtain the quality score of the initial training data.
  • For the specific implementation, refer to Manner 1 above, replacing the feature information with the data quality and the score value with the quality score; details are not repeated here.
  • Alternatively, all initial training data are sorted, and the quality score of each initial training data is determined according to the sorting result.
  • For the specific implementation, refer to Manner 2 above; details are not repeated here.
  • In Case 1 and Case 3, when the application scenarios of at least two initial training data are the same, the scene scores of these initial training data may be the same or different. In Case 2 and Case 3, when the data quality of at least two initial training data is the same, the quality scores of these initial training data may be the same or different.
  • In summary, the score value of each initial training data can be determined according to the feature information of each initial training data in the data set. In one possible implementation, this score value is used directly as the final score value of the initial training data. In another possible implementation, the score value is corrected, and the corrected score value is used as the score value of the initial training data. The correction process is as follows: determine the similarity between two initial training data; if the similarity is greater than a preset similarity threshold, keep the score value of one initial training data unchanged and reduce the score value of the other.
  • The preset similarity threshold can be configured based on experience, and there is no restriction on this.
  • If the similarity between two initial training data is greater than the preset similarity threshold, the two are very close and are considered the same or similar, that is, duplicates of each other.
  • For example, the Euclidean distance, the cosine similarity, or the Pearson correlation coefficient can be used to determine the similarity between two initial training data.
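  • The three measures named above can be sketched in plain Python, assuming each initial training data has already been reduced to a numeric feature vector (the description does not say how). Note that the Euclidean distance is a distance, so a smaller value means more similar, whereas for the other two a larger value means more similar.

```python
import math


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical feature vectors.
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])         # 5.0
cosine = cosine_similarity([1.0, 0.0], [0.0, 1.0])            # 0.0
pearson = pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The sketch does not guard against zero-length or constant vectors, for which the cosine and Pearson formulas divide by zero.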
  • Either the score value of the initial training data with the higher score value or the score value of the initial training data with the lower score value may be kept unchanged.
  • The initial training data does not participate in the subsequent comparison process, that is, its similarity to other initial training data is no longer compared.
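  • The score correction can be sketched as below. Several choices are assumptions of this sketch rather than statements of the application: cosine similarity is used as the measure, the higher-scoring item of a similar pair is the one kept, the reduction is a fixed penalty, and an item whose score has been reduced is the one excluded from later comparisons.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def correct_scores(vectors, scores, threshold, penalty):
    """Reduce the score of near-duplicate samples, keeping the higher-scoring one."""
    corrected = dict(scores)
    reduced = set()
    # Visit samples from highest to lowest score so the kept item is the better one.
    order = sorted(vectors, key=lambda k: scores[k], reverse=True)
    for i, kept in enumerate(order):
        if kept in reduced:
            continue  # already penalized: no further comparisons for this item
        for other in order[i + 1:]:
            if other in reduced:
                continue
            if cosine_similarity(vectors[kept], vectors[other]) > threshold:
                corrected[other] = scores[other] - penalty
                reduced.add(other)
    return corrected


# Hypothetical feature vectors and score values.
vectors = {"x": [1.0, 0.0], "y": [2.0, 0.0], "z": [0.0, 1.0]}
scores = {"x": 90, "y": 80, "z": 70}
corrected = correct_scores(vectors, scores, threshold=0.95, penalty=50)
# "y" duplicates "x" (cosine similarity 1.0), so only "y" is reduced.
```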
  • Step 103: Select target training data from the data set according to the score value of each initial training data.
  • The score value is used to represent the training effect of the initial training data: the higher the score value, the better the training effect, and the lower the score value, the worse the training effect. Therefore, based on the score value of each initial training data, the initial training data with high score values can be used as the target training data. In this way, the initial training data with a better training effect is used as the target training data.
  • The target training data can be selected from the data set in the following manners:
  • Method 1: Compare the score value of each initial training data with a preset score threshold. The preset score threshold can be configured based on experience, and there is no restriction on this.
  • If the score value is greater than the preset score threshold, the training effect of the initial training data is good, and the initial training data can be used as target training data.
  • If the score value is not greater than the preset score threshold, the training effect of the initial training data is poor, and the initial training data does not need to be used as target training data.
  • For example, if the score value of initial training data 1 is greater than the preset score threshold, initial training data 1 may be determined as target training data; if the score value of initial training data 2 is not greater than the preset score threshold, initial training data 2 is not determined as target training data, and so on.
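  • The threshold comparison can be sketched as a simple filter. The function name `select_by_threshold`, the score values, and the threshold of 80 are illustrative assumptions; the strict "greater than" comparison follows the description.

```python
def select_by_threshold(scores, threshold):
    """Return the ids whose score value is strictly greater than the threshold."""
    return [sample_id for sample_id, score in scores.items() if score > threshold]


# Hypothetical score values per initial training data id.
scores = {1: 95, 2: 60, 3: 80}
targets = select_by_threshold(scores, threshold=80)  # [1]
```

Sample 3 is not selected because its score value equals the threshold and is therefore not greater than it.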
  • Method 2: Sort all the initial training data according to the score value of each initial training data in the data set, and select multiple initial training data as the target training data according to the sorting result.
  • All the initial training data are sorted by score value from high to low. Based on the ranking result, starting from the initial training data with the highest score value, the top-ranked initial training data are selected as the target training data.
  • The data cleaning time interval (the interval in which data cleaning is performed) may be divided into multiple statistical periods of equal duration.
  • For example, suppose the ranking result is initial training data 1 to initial training data 100. In the first statistical period, initial training data 1 to initial training data 10 are selected as the target training data; in the second statistical period, initial training data 11 to initial training data 20 are selected as the target training data, and so on.
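  • The per-period selection in ranking order can be sketched as slicing the ranked list into batches. The function name `batches_by_rank` and the fixed batch size of 10 are assumptions matching the example above.

```python
def batches_by_rank(scores, batch_size):
    """Yield ids in descending score order, one batch per statistical period."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]


# Hypothetical scores: id 1 has the highest score value, id 100 the lowest.
scores = {i: 101 - i for i in range(1, 101)}
batches = list(batches_by_rank(scores, batch_size=10))
first_period = batches[0]    # ids 1..10
second_period = batches[1]   # ids 11..20
```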
  • The number M to be cleaned in the next statistical period may be determined first, where M may be a positive integer, that is, a natural number.
  • Based on the ranking result, M initial training data can be selected in turn as the target training data.
  • M can be configured based on experience, and there is no restriction on this. For example, when all operating nodes together can perform data cleaning on 10 target training data in one statistical period, M can be 10 or slightly greater than 10. Assuming that the target training data to be cleaned consists of pictures, a value of 0 for M means that the number of pictures in the next statistical period is 0 and all pictures have been cleaned.
  • M can also be determined as follows: determine the number M to be cleaned in the next statistical period according to the cleaning efficiency of the operating nodes, where the cleaning efficiency represents the amount of cleaning completed by the operating nodes (that is, all operating nodes) in the current statistical period.
  • The data cleaning time interval can be divided into multiple statistical periods of equal duration.
  • For example, in the first statistical period, first select 10 initial training data as the target training data and add these 10 target training data to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on it. If the operating node can perform data cleaning on 15 target training data in the first statistical period, then in the first statistical period another 5 initial training data need to be selected as target training data and added to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on it.
  • the cleaning efficiency can be 15, and it is determined that the number M to be cleaned in the second statistical period is 15.
  • first select 15 initial training data as the target training data and add these 15 target training data to the list to be cleaned; the operating node obtains the target training data from the list to be cleaned and performs data cleaning on the target training data.
  • If the operating node can perform data cleaning on only 12 target training data in the second statistical period, there is no need to add new target training data to the list to be cleaned.
  • the cleaning efficiency can be 12, and it is determined that the number M to be cleaned in the third statistical period is 12.
  • the operating node can obtain target training data from the list to be cleaned, and perform data cleaning on the target training data, and so on.
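The period-by-period adjustment in this example can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the claimed implementation: the function names `run_period` and `next_quantity` are hypothetical, and M for the next period is assumed to equal the number of items completed in the current period, as in the 10, 15, 12 example above.

```python
from typing import List, Tuple


def next_quantity(completed_in_period: int) -> int:
    """Assumption: M for the next period equals the cleaning efficiency,
    i.e. the number of target training data completed in the current period."""
    return completed_in_period


def run_period(m: int, capacity: int) -> Tuple[int, int]:
    """Simulate one statistical period.

    m        -- items placed in the to-be-cleaned list at the start of the period
    capacity -- items the operating nodes are able to clean in this period
    Returns (completed, topped_up), where topped_up is the number of extra
    items selected mid-period because the nodes finished the first m early.
    """
    topped_up = max(0, capacity - m)
    completed = capacity  # the nodes clean as many items as they are able to
    return completed, topped_up


m = 10  # M for the first period, configured from experience
history: List[Tuple[int, int, int]] = []
for capacity in (15, 12):  # hypothetical per-period capacities from the example
    completed, topped_up = run_period(m, capacity)
    history.append((m, completed, topped_up))
    m = next_quantity(completed)

print(history, m)
```

In the first period 5 extra items are topped up; in the second period nothing is added, and M for the third period becomes 12.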
  • Step 104 Perform data cleaning according to the target training data.
  • the target training data and the cleaning parameters may be sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters, which may also be referred to as data labeling.
  • the initial training data/target training data may be picture data, audio data, video data, text data, etc., and there is no restriction on the type of the initial training data/target training data.
  • performing data cleaning on the target training data refers to at least one of operations such as classifying, drawing a frame, annotating, and marking (that is, a label indicating a certain attribute) on the target training data.
  • the method of data cleaning is not restricted; all data cleaning methods related to neural networks are applicable.
  • the cleaning parameter indicates how to clean the target training data, for example, parameters for how to classify, how to draw frames, how to annotate, and how to mark. Therefore, the operating node can perform data cleaning on the target training data based on the cleaning parameter.
  • the number of operation nodes can be dynamically adjusted according to the number of target training data. For example, with respect to the above method 1, initial training data with a score greater than a preset score threshold may be determined as the target training data. Assuming that there are 48 target training data, and each operation node can complete the data cleaning of 5 target training data, 10 operation nodes need to be deployed. Based on this, in step 104, 48 target training data and cleaning parameters can be sent to 10 operating nodes, so that these operating nodes perform data cleaning on the target training data according to the cleaning parameters.
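The node count in this example is a ceiling division. The short Python sketch below shows the arithmetic and one possible way to split the target training data among nodes; `nodes_needed` and `assign` are hypothetical helper names, and the even, in-order split is an assumption.

```python
import math
from typing import List


def nodes_needed(num_targets: int, per_node_capacity: int) -> int:
    """Number of operating nodes required to clean all target training data."""
    return math.ceil(num_targets / per_node_capacity)


def assign(targets: List[int], per_node_capacity: int) -> List[List[int]]:
    """Split the targets into consecutive batches, one batch per operating node."""
    return [
        targets[start:start + per_node_capacity]
        for start in range(0, len(targets), per_node_capacity)
    ]


targets = list(range(48))          # 48 target training data
node_count = nodes_needed(48, 5)   # each node can clean 5 items
batches = assign(targets, 5)
print(node_count, len(batches), len(batches[-1]))
```

With 48 items and a capacity of 5, nine nodes receive 5 items each and the tenth receives the remaining 3.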
  • the amount of target training data can be dynamically adjusted according to the cleaning efficiency of the operating node. For example, for the second method above, the number M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating node, and M initial training data are selected as the target training data in the next statistical period. For example, when the cleaning efficiency of the operating node is 10, the number M to be cleaned in the next statistical period is determined to be 10. Based on this, in step 104, 10 target training data and cleaning parameters are sent to the operating node, so that the operating node Perform data cleaning on the target training data according to the cleaning parameters.
  • the score value of the initial training data is determined according to the characteristic information of the initial training data.
  • the score value is used to represent the training effect of the initial training data. Target training data are selected from the initial training data, and data cleaning is performed on the target training data rather than on all initial training data, which improves data cleaning efficiency and reduces wasted effort on redundant data. Target training data with a good training effect (that is, a high score value) are cleaned, so that the most effective data are provided for training, better training data participate in machine learning, the machine learning effect improves, and the utilization of cleaning resources increases.
  • FIG. 2 is a schematic diagram of the application scenario of the embodiment of this application
  • the control center module 21, the data import module 22, the active learning module 23, and the cleaning control module 24 can be deployed on the same device or on different devices.
  • the data cleaning method may include:
  • the control center module 21 creates a cleaning task.
  • the cleaning task may include a data cleaning time interval (indicating that data cleaning is performed in this time interval), cleaning parameters, and the like.
  • step 302 the control center module 21 sends a work instruction to the data import module 22.
  • the data import module 22 obtains a data set, which includes a plurality of initial training data.
  • initial training data can be obtained from historical data, and/or initial training data can be obtained from real-time data, and there is no restriction on this.
  • the data importing module 22 imports the same type of initial training data into the same data set, thereby obtaining at least one data set.
  • step 304 the data import module 22 returns a data import success message to the control center module 21.
  • the data import success message indicates that the data import module 22 has completed the data import work, that is, the data set has been obtained, and the data import success message may also carry the amount of initial training data in the data set.
  • step 305 the control center module 21 sends a work instruction to the active learning module 23.
  • the active learning module 23 determines the score value of each initial training data according to the characteristic information of each initial training data in the data set.
  • the active learning module 23 starts to work after receiving the work instruction.
  • the data set is obtained from the data import module 22, and the score value of each initial training data is determined according to the characteristic information of each initial training data in the data set.
  • the active learning module 23 can obtain all the initial training data in the data set from the data import module 22 at one time, and determine the score value of each initial training data according to the characteristic information of each initial training data in the data set.
  • for the specific method, refer to method 1 or method 2 of step 102, which is not repeated here.
  • the active learning module 23 may obtain part of the initial training data in the data set from the data import module 22, and determine the score value of that part of the initial training data according to its characteristic information. After the score values are determined, another part of the initial training data in the data set is obtained from the data import module 22, and so on, until all the initial training data in the data set have been obtained from the data import module 22 and their score values have been determined.
  • the active learning module 23 obtains 10 initial training data from the data import module 22 and, for each initial training data, queries the pre-configured mapping relationship with the feature information of the initial training data to obtain the score value of the initial training data. Then, another 10 initial training data are obtained from the data import module 22, and so on, until the score values of all initial training data are determined.
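The batch-wise mapping lookup can be sketched as follows. The contents of `SCORE_MAP`, the feature tuples, and the helper names are hypothetical and chosen only for illustration; assigning 0 to feature information that is missing from the mapping is also an assumption.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical feature information: (application scenario, data quality).
Features = Tuple[str, str]

# Hypothetical pre-configured mapping from feature information to score value.
SCORE_MAP: Dict[Features, int] = {
    ("road", "clear"): 90,
    ("road", "blurred"): 40,
    ("indoor", "clear"): 70,
    ("indoor", "blurred"): 20,
}


def in_batches(items: List[Tuple[str, Features]], size: int) -> Iterator[List[Tuple[str, Features]]]:
    """Yield the data set in consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_data_set(items: List[Tuple[str, Features]], batch_size: int = 10) -> Dict[str, int]:
    """Score every item by looking up its feature information, one batch at a time."""
    scores: Dict[str, int] = {}
    for batch in in_batches(items, batch_size):
        for sample_id, features in batch:
            # Assumption: feature information absent from the mapping scores 0.
            scores[sample_id] = SCORE_MAP.get(features, 0)
    return scores


samples = [
    (f"data-{i}", ("road", "clear") if i % 2 else ("indoor", "blurred"))
    for i in range(1, 26)
]
scores = score_data_set(samples)
print(scores["data-1"], scores["data-2"], len(scores))
```

Because each score depends only on the item's own feature information, scores from earlier batches never need to be revised under this method.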
  • the active learning module 23 obtains initial training data 1-10 from the data import module 22, sorts initial training data 1-10 according to the importance priority of their feature information, and determines the score values of initial training data 1-10 according to the sorting result. Then, it obtains initial training data 11-20 from the data import module 22, sorts initial training data 1-20 according to the importance priority of their feature information, and determines the score values of initial training data 1-20.
  • since the score values of initial training data 1-10 have been re-determined, they need to be corrected, that is, the revised score values of initial training data 1-10 are used.
  • then, initial training data 21-30 are obtained from the data import module 22, initial training data 1-30 are sorted according to the importance priority of their feature information, and the score values of initial training data 1-30 are determined. Since the score values of initial training data 1-20 have been re-determined, they need to be corrected, that is, the revised score values of initial training data 1-20 are used, and so on, until the score values of all initial training data are determined.
  • the score values of initial training data 1-10 need to be corrected because the first sorting by importance priority was based only on the feature information of initial training data 1-10.
  • for example, after initial training data 1-10 are sorted, initial training data 5 may be in first place, and its score value is 100.
  • after initial training data 1-20 are sorted, initial training data 5 may no longer be in first place. If it is in sixth place, its score value is 95; that is, the score value of initial training data 5 has changed and needs to be corrected.
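The score correction in this example can be reproduced with a short Python sketch. The numeric importance priorities are invented for illustration, and the rule "score = 101 - rank" is an assumption chosen only because it matches the two data points in the text (first place scores 100, sixth place scores 95); the source does not state the actual formula.

```python
from typing import Dict


def rank_scores(priorities: Dict[int, float]) -> Dict[int, int]:
    """Sort items by importance priority (highest first) and score them by rank.

    Assumption: score = 101 - rank, so rank 1 -> 100 and rank 6 -> 95.
    """
    ordered = sorted(priorities, key=lambda sample_id: priorities[sample_id], reverse=True)
    return {sample_id: 101 - rank for rank, sample_id in enumerate(ordered, start=1)}


# First batch: initial training data 1-10, with item 5 having the top priority.
first = {i: i / 100 for i in range(1, 11)}
first[5] = 0.90

# Second batch: initial training data 11-20; items 11-15 outrank item 5.
second = {i: 0.20 + i / 1000 for i in range(11, 21)}
for i in range(11, 16):
    second[i] = 0.95 + i / 1000

scores: Dict[int, int] = {}
scores.update(rank_scores(first))
score_after_first_batch = scores[5]

seen = {**first, **second}
scores.update(rank_scores(seen))   # re-rank items 1-20 and overwrite old scores
score_after_second_batch = scores[5]

print(score_after_first_batch, score_after_second_batch)
```

Overwriting the stored scores after every re-ranking is what the text calls correcting the score values.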
  • after the active learning module 23 determines the score value of each initial training data according to the characteristic information of each initial training data in the data set, it can also determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, the score value of one initial training data is kept unchanged and the score value of the other initial training data is reduced. For example, if there are repeated initial training data, the score value of the first initial training data is kept unchanged, and the score values of the other initial training data are set to 0.
  • the active learning module 23 may also perform the similarity comparison before determining the score values of the initial training data. For example, it first determines the similarity between the initial training data. If the similarity is greater than the preset similarity threshold, one initial training data is kept in the data set, the score values of the remaining initial training data are set to 0, and the remaining initial training data are not kept in the data set. Based on this, the active learning module 23 can determine the score value of each initial training data in the data set (excluding the initial training data whose score value is set to 0) according to its feature information.
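A minimal sketch of the similarity-based score suppression is given below. The source does not specify how similarity is computed, so Jaccard similarity over hypothetical feature tags is used here purely as a stand-in; the function names and the 0.8 threshold are likewise assumptions.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two tag sets (stand-in for the unspecified similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suppress_similar(
    features: Dict[str, Set[str]],
    scores: Dict[str, int],
    threshold: float = 0.8,
) -> Dict[str, int]:
    """Keep the score of the first item in each group of similar items; set the rest to 0."""
    adjusted = dict(scores)
    kept: List[str] = []
    for sample_id in features:  # dict order: the first occurrence is the one kept
        if any(jaccard(features[sample_id], features[k]) > threshold for k in kept):
            adjusted[sample_id] = 0
        else:
            kept.append(sample_id)
    return adjusted


features = {
    "a": {"car", "night", "rain"},
    "b": {"car", "night", "rain"},   # duplicate of "a"
    "c": {"person", "day"},
}
scores = {"a": 90, "b": 88, "c": 70}
adjusted = suppress_similar(features, scores)
print(adjusted)
```

Setting the duplicate's score to 0 pushes it to the end of any score-based ranking, so it is selected for cleaning last, if at all.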
  • the active learning module 23 supports querying initial training data according to conditions, for example, the number of initial training data whose score value is greater than a certain value, the distribution of different score value intervals, and so on.
  • step 307 the active learning module 23 sends a scoring completion message to the control center module 21.
  • the scoring completion message indicates that the active learning module 23 has scored all the initial training data.
  • step 308 the control center module 21 sends a work instruction to the cleaning control module 24.
  • the cleaning control module 24 determines the quantity M to be cleaned, and sends the quantity M to be cleaned to the active learning module 23.
  • the cleaning control module 24 starts to work after receiving the work instruction. In the working process, the quantity M to be cleaned is determined first, and the quantity M to be cleaned is sent to the active learning module 23.
  • the number M1 to be cleaned in the first statistical period can be configured based on experience.
  • the number M2 to be cleaned in the second statistical period is determined based on the cleaning efficiency of all operating nodes in the first statistical period.
  • the number M3 to be cleaned in the third statistical period is determined based on the cleaning efficiency of all operating nodes in the second statistical period, and so on.
  • the cleaning control module 24 can determine the quantity M to be cleaned in each statistical period, and send the quantity M to be cleaned to the active learning module 23.
  • if the cleaning efficiency of the operating nodes increases or decreases in a statistical period, and/or the number of operating nodes increases or decreases, the cleaning efficiency of all operating nodes changes, that is, the quantity M to be cleaned changes, so that the quantity M to be cleaned can be dynamically adjusted.
  • the cleaning control module 24 can count the cleaning efficiency of each operating node, that is, the number of target training data completed by the operating node in the current statistical period. Then, the cleaning efficiency of all operating nodes is determined, and the number M to be cleaned is determined based on the cleaning efficiency of all operating nodes.
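As a small illustration, the aggregate cleaning efficiency can be computed from per-node counts. Summing the per-node counts and using the sum directly as M is an assumption consistent with the earlier examples (an efficiency of 10 gives M = 10); the node names are hypothetical.

```python
from typing import Dict


def quantity_to_clean(completed_per_node: Dict[str, int]) -> int:
    """Assumption: M equals the total number of items all nodes completed this period."""
    return sum(completed_per_node.values())


period_1 = {"node-1": 6, "node-2": 4}
period_2 = {"node-1": 6, "node-2": 4, "node-3": 5}   # a node was added

m_after_period_1 = quantity_to_clean(period_1)
m_after_period_2 = quantity_to_clean(period_2)
print(m_after_period_1, m_after_period_2)
```

Adding a node, or a node becoming faster or slower, changes the sum and therefore the M used for the next period.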
  • Step 310 the active learning module 23 sorts all the initial training data according to the score value of each initial training data. Based on the sorting result, starting from the initial training data with the highest score value, it selects the first M initial training data as the target training data and sends the target training data to the cleaning control module 24.
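Selecting the M highest-scoring items can be sketched with the standard library. The function name `select_top_m` and the sample scores are hypothetical.

```python
import heapq
from typing import Dict, List


def select_top_m(scores: Dict[str, int], m: int) -> List[str]:
    """Return the identifiers of the M items with the highest score values, best first."""
    return heapq.nlargest(m, scores, key=scores.get)


scores = {"a": 95, "b": 60, "c": 88, "d": 72}
targets = select_top_m(scores, 2)
print(targets)
```

`heapq.nlargest` avoids sorting the whole data set when M is much smaller than the number of initial training data; a full `sorted(...)[:m]` gives the same result.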
  • step 311 the cleaning control module 24 adds the target training data to the list to be cleaned.
  • the active learning module 23 uses M1 initial training data as target training data and sends the M1 target training data to the cleaning control module 24, and the cleaning control module 24 adds the M1 target training data to the list to be cleaned.
  • the active learning module 23 uses M2 initial training data as target training data, sends M2 target training data to the cleaning control module 24, and the cleaning control module 24 adds M2 target training data to the list to be cleaned. And so on.
  • the cleaning control module 24 sends the target training data to the operating node, so that the operating node performs data cleaning on the target training data.
  • the target training data and cleaning parameters are sent to the operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
  • when the operating node can process new target training data, it can send a request message to the cleaning control module 24.
  • the request message is used to request N target training data, indicating that the operating node can perform data cleaning on N target training data. N can be a positive integer.
  • the cleaning control module 24 determines whether there are N target training data in the list to be cleaned. If so, it directly sends N target training data to the operating node. If not, it obtains (N-a) target training data from the active learning module 23, where a represents the number of target training data that already exist in the list to be cleaned, so that N target training data are available, and sends the N target training data to the operating node.
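The request handling described above can be sketched as follows. The class `CleaningControl` and its attributes are hypothetical; in particular, a pre-ranked queue stands in for the active learning module, and returning fewer than N items when the pool is exhausted is an assumption the source does not address.

```python
from collections import deque
from typing import Deque, Iterable, List


class CleaningControl:
    """Toy stand-in for the cleaning control module and its to-be-cleaned list."""

    def __init__(self, ranked_pool: Iterable[str]) -> None:
        # Stand-in for the active learning module: remaining items, highest score first.
        self.pool: Deque[str] = deque(ranked_pool)
        self.to_clean: Deque[str] = deque()

    def handle_request(self, n: int) -> List[str]:
        """Serve a node's request for n target training data."""
        a = len(self.to_clean)
        if a < n:
            # Fetch the missing (n - a) items from the active learning stand-in.
            for _ in range(min(n - a, len(self.pool))):
                self.to_clean.append(self.pool.popleft())
        count = min(n, len(self.to_clean))
        return [self.to_clean.popleft() for _ in range(count)]


ctrl = CleaningControl(f"d{i}" for i in range(1, 9))
ctrl.to_clean.extend(["t1", "t2"])   # a = 2 items already in the list
sent = ctrl.handle_request(5)        # N = 5, so 3 more items are fetched
print(sent, len(ctrl.pool))
```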
  • the operation node may also be referred to as a cleaning node.
  • the operation node may be a machine or a manual operation. There is no restriction on this, as long as the target training data can be cleaned.
  • step 313 the cleaning control module 24 feeds back the task execution status to the control center module 21.
  • data cleaning can be performed on the target training data instead of data cleaning on all initial training data, thereby improving the efficiency of data cleaning and reducing the invalid input of redundant data.
  • data cleaning can be performed on target training data with a good training effect (that is, a high score value), and the most effective data can be provided for training, so that better training data participate in machine learning, the machine learning effect improves, and the utilization of cleaning resources increases.
  • Each operation node can perform data cleaning on 100 target training data every day. Assuming there are 1000 initial training data, 200 initial training data whose score value is greater than n can be selected from the 1000 initial training data and used as target training data. Then, 100 target training data can be provided to one operating node, and the remaining 100 target training data can be provided to another operating node. In this way, two operating nodes can perform data cleaning on the above 200 target training data.
  • 100 initial training data with high scores can be selected from the remaining 800 initial training data, and these 100 initial training data are used as the target training data. These target training data are provided to newly added operation nodes.
  • the cleaning control module 24 obtains 1000 target training data from the active learning module 23 and puts them into cleaning.
  • the cleaning control module 24 obtains 100 target training data from the active learning module 23 and puts them into cleaning.
  • 1100 target training data are acquired from the active learning module 23 and put into cleaning.
  • FIG. 4 is a structural diagram of the data cleaning device, and the device includes:
  • the obtaining module 41 is used to obtain a data set, and the data set includes a plurality of initial training data; the determining module 42 is used to determine the score value of each initial training data according to the characteristic information of each initial training data in the data set, where the score value is used to indicate the training effect of the initial training data; the selection module 43 is used to select target training data from the data set according to the score value of each initial training data; the cleaning module 44 is used to perform data cleaning on the target training data.
  • the determining module 42 is specifically configured to: for each initial training data in the data set, query a pre-configured mapping relationship through the feature information of the initial training data to obtain the score value of the initial training data; wherein, The mapping relationship includes the corresponding relationship between the feature information and the score value; or,
  • all initial training data are sorted, and the score value of each initial training data is determined according to the sorting result.
  • the characteristic information includes application scenarios and/or data quality, and the determining module 42 is specifically configured to:
  • if the feature information includes application scenarios and data quality, determine the scenario score of each initial training data according to the application scenario of each initial training data in the data set, and determine the quality score of each initial training data according to the data quality of each initial training data in the data set;
  • the score value of the initial training data is determined according to the scene score and the scene weight value of the initial training data, as well as the quality score and the quality weight value.
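One plausible way to combine the two scores is a weighted sum, sketched below. The source only says the score value is determined from the scene score, scene weight value, quality score, and quality weight value; the weighted-sum formula, the function name, and the default weights of 0.6 and 0.4 are all assumptions.

```python
def score_value(
    scene_score: float,
    quality_score: float,
    scene_weight: float = 0.6,
    quality_weight: float = 0.4,
) -> float:
    """Assumed combination: score = scene_score * scene_weight + quality_score * quality_weight."""
    return scene_score * scene_weight + quality_score * quality_weight


# Hypothetical item: scene score 80, quality score 50.
combined = score_value(80, 50)
print(round(combined, 2))
```

Raising the scene weight relative to the quality weight makes the application scenario dominate the ranking, and vice versa.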
  • the determining module 42 is further configured to: after determining the score value of each initial training data according to the characteristic information of each initial training data in the data set, determine the similarity between two initial training data; if the similarity is greater than the preset similarity threshold, keep the score value of one initial training data unchanged and reduce the score value of the other initial training data.
  • the selection module 43 is specifically configured to: for each initial training data, if the score value of the initial training data is greater than a preset score threshold, determine the initial training data as target training data; or, according to the data The score value of each initial training data in the set is sorted, and multiple initial training data are selected as the target training data according to the sorting result.
  • when the selection module 43 selects a plurality of initial training data as target training data according to the sorting result, it is specifically used to: determine the quantity M to be cleaned in the next statistical period;
  • M initial training data are sequentially selected as the target training data.
  • when the selection module 43 determines the quantity M to be cleaned in the next statistical period, it is specifically used for:
  • the quantity M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating node, and the cleaning efficiency represents the completed cleaning quantity of the operating node in the current statistical period.
  • the cleaning module 44 is specifically configured to send the target training data and cleaning parameters to an operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
  • an embodiment of this application also proposes a data cleaning device.
  • the data cleaning device may include: a processor 51 and a machine-readable storage medium 52, where the machine-readable storage medium 52 stores machine-executable instructions that can be executed by the processor 51; the processor 51 is used to execute the machine-executable instructions to implement the methods disclosed in the above examples of this application.
  • the processor 51 is used to execute machine executable instructions to implement the following steps:
  • an embodiment of the application also provides a machine-readable storage medium, wherein a number of computer instructions are stored on the machine-readable storage medium, and when the computer instructions are executed by a processor, the method disclosed in the above examples of this application is implemented.
  • the foregoing machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information, such as executable instructions, data, and so on.
  • the machine-readable storage medium can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard drive), a solid state drive, any type of storage disk (such as a CD or DVD), or a similar storage medium, or a combination of them.
  • a typical implementation device is a computer.
  • the specific form of the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application may adopt the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • these computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data cleaning method, apparatus and device, a program, and a storage medium. The method comprises: obtaining a data set, the data set comprising a plurality of pieces of initial training data (101); according to feature information of each piece of initial training data in the data set, determining a score value of each piece of initial training data, the score value being used for representing the training effectiveness of the initial training data (102); according to the score value of each piece of initial training data, selecting target training data from the data set (103); and carrying out data cleaning according to the target training data (104). The solution can increase data cleaning efficiency, reduce invalid input of redundant data, and improve the utilization rate of cleaning resources.

Description

Data cleaning method, apparatus and device, program, and storage medium
Technical field
This application relates to the field of image processing technology, and in particular to a data cleaning method, apparatus and device, program, and storage medium.
Background
Machine learning is one way to realize artificial intelligence. It is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. Machine learning studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning places emphasis on algorithm design, so that computers can automatically learn rules from data and use those rules to make predictions on unknown data.
Machine learning is already very widely used, for example in data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, securities market analysis, DNA sequencing, speech and handwriting recognition, strategy games, and robotics.
To implement machine learning, a large amount of initial training data must be obtained, data cleaning is performed on the initial training data to obtain cleaned training data, and machine learning is implemented based on the cleaned training data.
However, the above approach requires data cleaning of all initial training data and cannot filter the initial training data. As a result, training data with a poor training effect also participate in machine learning, and the learning effect is poor.
Summary of the invention
This application provides a data cleaning method, and the method includes:
obtaining a data set, the data set including a plurality of initial training data;
determining the score value of each initial training data according to the feature information of each initial training data in the data set, where the score value is used to represent the training effect of the initial training data;
selecting target training data from the data set according to the score value of each initial training data;
performing data cleaning according to the target training data.
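The four steps of the method can be sketched end to end in Python. This is only an illustration under stated assumptions: `run_cleaning`, the scoring and cleaning callbacks, and the sample data are hypothetical, and target selection uses the score-threshold variant described elsewhere in the description.

```python
from typing import Any, Callable, Dict


def run_cleaning(
    data_set: Dict[str, Dict[str, Any]],
    score_fn: Callable[[Dict[str, Any]], float],
    score_threshold: float,
    clean_fn: Callable[[str], Any],
) -> Dict[str, Any]:
    """data_set is the obtained data set: identifier -> feature information."""
    # Determine a score value for every initial training data from its feature information.
    scores = {sample_id: score_fn(features) for sample_id, features in data_set.items()}
    # Select target training data: items whose score value exceeds the threshold.
    targets = [sample_id for sample_id, score in scores.items() if score > score_threshold]
    # Perform data cleaning (for example, labeling) on the target training data only.
    return {sample_id: clean_fn(sample_id) for sample_id in targets}


data_set = {
    "img-1": {"quality": 90},
    "img-2": {"quality": 30},
    "img-3": {"quality": 75},
}
labels = run_cleaning(
    data_set,
    score_fn=lambda features: features["quality"],   # hypothetical scoring rule
    score_threshold=60,
    clean_fn=lambda sample_id: f"label-for-{sample_id}",  # placeholder for real cleaning
)
print(labels)
```

Only two of the three items are cleaned here, which is the point of the method: cleaning effort goes to the data expected to help training most.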
本申请提供一种数据清洗装置,所述装置包括:The present application provides a data cleaning device, which includes:
获取模块,用于获取数据集合,所述数据集合包括多个初始训练数据;An acquisition module for acquiring a data set, the data set including a plurality of initial training data;
确定模块,用于根据所述数据集合中的每个初始训练数据的特征信息,确定每个初始训练数据的分数值,所述分数值用于表示初始训练数据的训练效果;The determining module is configured to determine the score value of each initial training data according to the feature information of each initial training data in the data set, and the score value is used to represent the training effect of the initial training data;
选取模块,用于根据每个初始训练数据的分数值从所述数据集合中选取目标训练数据;The selection module is used to select target training data from the data set according to the score value of each initial training data;
清洗模块,用于根据所述目标训练数据进行数据清洗。The cleaning module is used for data cleaning according to the target training data.
The present application provides a data cleaning device, including a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions that can be executed by the processor;
the processor is configured to execute the machine-executable instructions to implement the following steps:
acquiring a data set, the data set including a plurality of pieces of initial training data;
determining a score value of each piece of initial training data according to feature information of each piece of initial training data in the data set, the score value representing the training effect of the initial training data;
selecting target training data from the data set according to the score value of each piece of initial training data;
performing data cleaning according to the target training data.
The present application provides a computer program. The computer program is stored in a machine-readable storage medium and, when executed by a processor, causes the processor to implement the method of the first aspect above.
The present application provides a machine-readable storage medium storing machine-executable instructions which, when called and executed by a processor, cause the processor to perform the method of the first aspect.
As can be seen from the above technical solutions, in the embodiments of the present application, the score value of each piece of initial training data is determined according to its feature information, the score value representing the training effect of that initial training data. Target training data is selected from all the initial training data according to the score values, and data cleaning is performed on the target training data rather than on all the initial training data, which improves data cleaning efficiency and reduces wasted effort on redundant data. Data cleaning can thus be performed on target training data with a good training effect (that is, a high score value), so that the most effective data is provided for training. Because training data with a better effect takes part in machine learning, the machine learning result is better and the utilization of cleaning resources is improved.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present application; a person of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 is a flowchart of a data cleaning method in an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario in an embodiment of the present application;
FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application;
FIG. 4 is a structural diagram of a data cleaning apparatus in an embodiment of the present application;
FIG. 5 is a structural diagram of a data cleaning device in an embodiment of the present application.
Detailed Description
The terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. The singular forms "a", "said" and "the" used in the present application and the claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third and so on may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. In addition, depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
Machine learning is one way of realizing artificial intelligence. It studies how computers can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. A neural network is one concrete implementation of machine learning. This document uses a neural network as the example for describing how machine learning is implemented; other types of machine learning algorithms are handled similarly.
By way of example, neural networks may include, but are not limited to, convolutional neural networks (CNN), recurrent neural networks (RNN), fully connected networks and the like. The structural units of a neural network may include, but are not limited to, a convolutional layer (Conv), a pooling layer (Pool), an activation layer, a fully connected layer (FC) and the like, without limitation.
In the convolutional layer, data features are enhanced by convolving the data with a convolution kernel. The convolutional layer applies the kernel over the spatial extent of its input; the kernel may be a matrix of size m*n, and convolving the input of the convolutional layer with the kernel yields the output of the convolutional layer. The convolution operation is in effect a filtering process: the data is convolved with the kernel w(x, y) to obtain multiple convolution features, which are the output of the convolutional layer and can be provided to the pooling layer.
The pooling layer is in effect a down-sampling process. By taking the maximum, minimum, average or a similar statistic over multiple convolution features (that is, the output of the convolutional layer), it reduces the amount of computation while preserving feature invariance. The pooling layer can exploit local correlation to sub-sample the data, which reduces the amount of data to be processed while retaining the useful information in the data.
In the activation layer, an activation function (such as a non-linear function) can be used to map the features output by the pooling layer, thereby introducing non-linearity so that the neural network gains expressive power through non-linear combination. The activation function of the activation layer may include, but is not limited to, the ReLU (Rectified Linear Unit) function. Taking ReLU as an example, among all features output by the pooling layer, ReLU sets features less than 0 to 0 and leaves features greater than 0 unchanged.
The fully connected layer performs fully connected processing on all the features input to it, thereby obtaining a feature vector, which may include multiple features.
In practical applications, one or more convolutional layers, one or more pooling layers, one or more activation layers and one or more fully connected layers can be combined according to different requirements to construct a neural network.
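The layer descriptions above can be sketched with NumPy as follows. This is a minimal illustration only, not the implementation of the present application: the function names, the 6*6 input, the 3*3 kernel and the random fully connected weights are all assumptions made for the example, and the "convolution" is written as the sliding-window cross-correlation that deep learning libraries commonly use.

```python
import numpy as np

def conv2d(x, w):
    """Slide an m*n kernel w over x (no padding) and sum the element-wise products."""
    m, n = w.shape
    rows, cols = x.shape[0] - m + 1, x.shape[1] - n + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(x[i:i + m, j:j + n] * w)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows, cols = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:rows * size, :cols * size]
    return trimmed.reshape(rows, size, cols, size).max(axis=(1, 3))

def relu(x):
    """Set features below 0 to 0 and keep the rest unchanged."""
    return np.maximum(x, 0.0)

def fully_connected(x, weights, bias):
    """Flatten all input features and map them to one feature vector."""
    return weights @ x.ravel() + bias

# Hypothetical input and parameters, chosen only to make the example concrete.
x = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)          # simple horizontal-gradient filter
features = relu(max_pool(conv2d(x, kernel)))        # conv -> pool -> activation
rng = np.random.default_rng(0)
vector = fully_connected(features, rng.normal(size=(3, features.size)), np.zeros(3))
```

The layers are chained in the order in which the text describes them (convolution, pooling, activation, fully connected); a real network would stack several such units and learn the kernel and weights during training.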
By way of example, before a neural network is used for service processing, it must first be trained. During training, a large amount of initial training data can be acquired and cleaned to obtain cleaned training data, and the cleaned training data is used to train the parameters of the neural network, such as convolutional layer parameters (for example, convolution kernel parameters), pooling layer parameters, activation layer parameters and fully connected layer parameters, without limitation. Once training is complete, the neural network can be used for service processing: input data is provided to the neural network, which processes it (for example, using the trained parameters) to obtain output data, and the service processing, such as face detection or vehicle detection, is thereby completed.
In the related art, data cleaning is performed on all the initial training data, and all the cleaned training data is used to train the neural network. However, this training data may include unusable training data, duplicate training data and training data with a poor training effect. Providing all of it to the neural network leads to a poor training result, that is, the reliability of the neural network is reduced; for example, the accuracy of face detection or vehicle detection drops considerably.
In view of the above, in the embodiments of the present application, a score value can be determined for each piece of initial training data. The score value represents the training effect of the initial training data: the higher the score value, the better the training effect. On this basis, the portion of the initial training data with high score values can be taken as target training data, data cleaning is performed on the target training data, and the cleaned target training data is used to train the neural network. Because the target training data has high score values, that is, a good training effect, providing it to the neural network yields a better training result, that is, the reliability of the neural network improves; for example, the accuracy of face detection or vehicle detection increases.
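As a rough illustration of taking "the portion of the initial training data with high score values" as target training data, the sketch below keeps the K highest-scoring items. The top-K rule, the function name `select_target_data` and the sample identifiers are assumptions made for this example; the text above does not fix a particular selection rule.

```python
def select_target_data(scored_items, keep_count):
    """Return the ids of the keep_count items with the highest score values.

    scored_items: list of (item_id, score_value) pairs.
    """
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in ranked[:keep_count]]

# Hypothetical scored data set.
scored = [("img_a", 90), ("img_b", 100), ("img_c", 85), ("img_d", 95)]
targets = select_target_data(scored, keep_count=2)
```

Only the items in `targets` would then be passed on to data cleaning, which is where the saving over cleaning the whole data set comes from.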
The technical solutions of the embodiments of the present application are described below with reference to specific embodiments.
FIG. 1 is a schematic flowchart of the data cleaning method. The method may include the following steps.
Step 101: acquire a data set, where the data set may include a plurality of pieces of initial training data.
By way of example, when training data is needed to train a neural network, the training data can be acquired first. For ease of distinction, this training data is referred to as initial training data. The initial training data may, for example, be obtained from a device or be received as user input, without limitation.
By way of example, a large amount of acquired initial training data can be classified, and the initial training data of each type is added to one data set. For example, initial training data used for face detection is added to data set 1, initial training data used for vehicle detection is added to data set 2, and so on; the classification method is not limited. In this way at least one data set is obtained, each including a plurality of pieces of initial training data. Since every data set is processed in the same way, the processing of a single data set is used as the example below.
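The classification into data sets described above can be sketched as a simple grouping step. The `task` field and the sample records are hypothetical; the text only says that each type of initial training data goes into its own data set.

```python
from collections import defaultdict

def group_into_data_sets(samples):
    """Group initial training data by task type; each group is one data set."""
    data_sets = defaultdict(list)
    for sample in samples:
        data_sets[sample["task"]].append(sample)
    return dict(data_sets)

# Hypothetical initial training data records.
samples = [
    {"id": 1, "task": "face_detection"},
    {"id": 2, "task": "vehicle_detection"},
    {"id": 3, "task": "face_detection"},
]
data_sets = group_into_data_sets(samples)
```

Each value in `data_sets` can then be scored, filtered and cleaned independently, matching the statement that every data set is processed in the same way.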
Step 102: determine the score value of each piece of initial training data according to the feature information of each piece of initial training data in the data set. The score value represents the training effect of the initial training data: the higher the score value, the better the training effect of that initial training data, and the lower the score value, the worse the training effect.
The feature information of initial training data can characterize its training effect. When the feature information indicates a good training effect, the score value of the initial training data is high; when the feature information indicates a poor training effect, the score value is low. The score value of initial training data can therefore be determined from its feature information.
By way of example, when at least two pieces of initial training data have the same feature information, their score values may be the same or may be different.
By way of example, the score value of each piece of initial training data may be determined in the following manners.
Manner 1: for each piece of initial training data in the data set, query a pre-configured mapping relationship using the feature information of that initial training data to obtain its score value.
By way of example, for Manner 1, the mapping relationship can be configured in advance. It may include, but is not limited to, a correspondence between feature information and score values, which can be configured based on experience, without limitation. For example, when feature information a1 indicates a good training effect, the score value corresponding to feature information a1 is high; when feature information a2 indicates a poor training effect, the score value corresponding to feature information a2 is low.
Table 1 shows an example of the mapping relationship, which records the correspondence between feature information and score values. The score values may be on a 100-point scale or any other scale, without limitation. Table 1 expresses the mapping relationship as a table; other data structures may also be used, as long as they contain the correspondence between feature information and score values.
Table 1
Feature information | Score value
Feature information a1 | 100
Feature information a2 | 95
Feature information a3 | 90
Feature information a4 | 85
By way of example, for each piece of initial training data in the data set, its feature information can be obtained. For example, the initial training data may itself include the feature information, in which case the feature information can be read directly from the initial training data. Alternatively, an algorithm (such as a deep learning algorithm) can be used to analyze the initial training data to obtain its feature information; the analysis process is not limited, as long as the feature information can be obtained.
After the feature information of the initial training data is obtained, the mapping relationship shown in Table 1 can be queried with that feature information to obtain the score value of the initial training data. For example, if the feature information of the initial training data is feature information a3, its score value is 90.
In summary, for each piece of initial training data in the data set, the mapping relationship shown in Table 1 can be queried with its feature information to obtain its score value.
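Manner 1 amounts to a table lookup, which can be sketched as below. The score values are those of Table 1; the dictionary representation, the function name `score_by_lookup` and the sample records are assumptions made for illustration.

```python
# Mapping relationship from Table 1: feature information -> score value.
SCORE_BY_FEATURE = {"a1": 100, "a2": 95, "a3": 90, "a4": 85}

def score_by_lookup(samples, mapping=SCORE_BY_FEATURE):
    """Return {sample id: score value} by querying the pre-configured mapping."""
    return {sample["id"]: mapping[sample["feature"]] for sample in samples}

# Hypothetical initial training data, each already carrying its feature information.
samples = [
    {"id": "data1", "feature": "a3"},
    {"id": "data2", "feature": "a1"},
]
scores = score_by_lookup(samples)
```

With this rule, initial training data that shares the same feature information always receives the same score value.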
Manner 2: sort all the initial training data according to the importance priority of the feature information of each piece of initial training data in the data set, and determine the score value of each piece of initial training data according to the sorting result.
By way of example, for Manner 2, the importance priority of the feature information can be configured in advance based on experience; the importance priority is not limited. For example, when feature information a1 indicates a good training effect and feature information a2 indicates a poor training effect, the importance priority of feature information a1 may be higher than that of feature information a2.
Table 2 shows an example of the importance priorities of feature information; a larger value denotes a higher importance priority. Table 2 expresses the importance priorities as a table; other data structures may also be used, as long as they contain the importance priority of the feature information.
Table 2
Feature information | Importance priority
Feature information a1 | 10
Feature information a2 | 9
Feature information a3 | 8
Feature information a4 | 7
By way of example, for each piece of initial training data in the data set, its feature information can be obtained in the manner described in the foregoing embodiment, which is not repeated here. After the feature information is obtained, Table 2 can be queried with it to obtain the importance priority of the initial training data. All the initial training data is then sorted according to the importance priority of the feature information of each piece, and the score value of each piece of initial training data is determined according to the sorting result.
For example, suppose the importance priority of the feature information of initial training data 1 is higher than that of initial training data 2, which in turn is higher than that of initial training data 3. The sorting result is then initial training data 1, initial training data 2, initial training data 3, so the score value of initial training data 1 is greater than that of initial training data 2, and the score value of initial training data 2 is greater than that of initial training data 3. For example, the score value of initial training data 1 is 100, that of initial training data 2 is 99, and that of initial training data 3 is 98. These score values are only examples.
In summary, by sorting the initial training data in the data set, the score value of each piece of initial training data can be determined according to the sorting result.
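Manner 2 can be sketched as a sort followed by rank-based scoring. The priorities are those of Table 2, and the 100, 99, 98 pattern follows the example above; the rule "top score minus rank", the function name `score_by_ranking` and the sample records are assumptions made for illustration.

```python
# Importance priorities from Table 2: feature information -> priority.
PRIORITY_BY_FEATURE = {"a1": 10, "a2": 9, "a3": 8, "a4": 7}

def score_by_ranking(samples, priority=PRIORITY_BY_FEATURE, top_score=100):
    """Sort by importance priority (highest first) and score each item by its rank."""
    ranked = sorted(samples, key=lambda s: priority[s["feature"]], reverse=True)
    # Rank 0 gets top_score, rank 1 gets top_score - 1, and so on.
    return {s["id"]: top_score - rank for rank, s in enumerate(ranked)}

# Hypothetical initial training data, deliberately listed out of order.
samples = [
    {"id": "data3", "feature": "a3"},
    {"id": "data1", "feature": "a1"},
    {"id": "data2", "feature": "a2"},
]
scores = score_by_ranking(samples)
```

Because Python's `sorted` is stable, items with equal priority keep their input order and therefore receive different score values under this rule, which is one way initial training data with identical feature information can end up with different scores.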
Manner 1 and Manner 2 above are only two examples of the present application and are not limiting; any approach may be used as long as the score value of initial training data can be determined from its feature information.
In a possible implementation, the feature information may include, but is not limited to, an application scenario and/or data quality. The feature information is not limited, and any information that can characterize the training effect can serve as feature information. The application scenario represents scene information of the initial training data, such as daytime, night, sunny or rainy; these are only a few examples of application scenarios and are not limiting. The data quality represents quality information of the initial training data, such as resolution: the higher the resolution, the better the data quality and the clearer the data. This is only an example of data quality and is not limiting.
The following describes, case by case, how the application scenario and/or the data quality are used.
Case 1: if the feature information includes the application scenario, a scene score of each piece of initial training data is determined according to the application scenario of each piece of initial training data in the data set. The scene score represents the training effect of the initial training data: the higher the scene score, the better the training effect, and the lower the scene score, the worse the training effect. The score value of the initial training data is then determined from its scene score, for example, by directly using the scene score as the score value.
The application scenario of initial training data can characterize its training effect. When the application scenario indicates a good training effect, the scene score of the initial training data is high; when the application scenario indicates a poor training effect, the scene score is low. The scene score of initial training data can therefore be determined from its application scenario. For example, when initial training data from application scenarios such as night or rainy days is used for training, the training effect is good, so its scene score is high. When initial training data from application scenarios such as daytime or sunny days is used for training, the training effect is poor, so its scene score is low.
Case 2: if the feature information includes the data quality, a quality score of each piece of initial training data is determined according to the data quality of each piece of initial training data in the data set. The quality score represents the training effect of the initial training data: the higher the quality score, the better the training effect, and the lower the quality score, the worse the training effect. The score value of the initial training data is then determined from its quality score, for example, by directly using the quality score as the score value.
The data quality of initial training data can characterize its training effect. When the data quality indicates a good training effect, the quality score of the initial training data is high; when the data quality indicates a poor training effect, the quality score is low. The quality score of initial training data can therefore be determined from its data quality.
For example, when initial training data with a low resolution (that is, poor data quality) is used for training, the training effect is good, so its quality score is high. When initial training data with a high resolution is used for training, the training effect is poor, so its quality score is low.
Case 3: if the feature information includes both the application scenario and the data quality, a scene score of each piece of initial training data is determined according to its application scenario, the scene score representing the training effect of the initial training data, and a quality score of each piece of initial training data is determined according to its data quality, the quality score likewise representing the training effect. Then, for each piece of initial training data, its score value is determined according to its scene score and a scene weight value together with its quality score and a quality weight value.
By way of example, the scene weight value and the quality weight value can be configured based on experience; they are not limited and can be set arbitrarily. For example, the sum of the scene weight value and the quality weight value may be 1. If the user cares more about the application scenario, the scene weight value is greater than the quality weight value, for example a scene weight value of 0.7 and a quality weight value of 0.3, or a scene weight value of 0.6 and a quality weight value of 0.4. If the user cares more about data quality, the quality weight value is greater than the scene weight value, for example a scene weight value of 0.3 and a quality weight value of 0.7, or a scene weight value of 0.4 and a quality weight value of 0.6. Both weight values may also be set to 0.5. These are only a few examples of scene weight values and quality weight values.
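One natural reading of Case 3 is a weighted sum of the two scores, sketched below. The weighted-sum formula itself is an assumption, since the text above only says the score value is determined "according to" the scores and weight values; the default weights 0.7 and 0.3 are taken from the examples above, and the function name is hypothetical.

```python
def combined_score(scene_score, quality_score, scene_weight=0.7, quality_weight=0.3):
    """Weighted sum of scene score and quality score (assumed combination rule)."""
    return scene_weight * scene_score + quality_weight * quality_score

# Hypothetical scores for one piece of initial training data.
scene_focused = combined_score(scene_score=90, quality_score=80)
quality_focused = combined_score(90, 80, scene_weight=0.3, quality_weight=0.7)
balanced = combined_score(90, 80, scene_weight=0.5, quality_weight=0.5)
```

When the two weight values sum to 1, the combined score stays on the same scale as the scene score and quality score, so it remains comparable with score values produced by Case 1 or Case 2.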
在情况1和情况3中,需要根据数据集合中的每个初始训练数据的应用场景,确定每个初始训练数据的场景分。例如,针对数据集合中的每个初始训练数据,通过该初始训练数据的应用场景查询预先配置的映射关系(该映射关系包括应用场景与场景分的对应关系),得到该初始训练数据的场景分,具体实现方式参见上述方式1,将特征信息替换为应用场景,将分数值替换为场景分即可,在此不再赘述。又例如,根据数据集合中的每个初始训练数据的应用场景的重要优先级,对所有初始训练数据进行排序,根据排序结果确定每个初始训练数据的场景分,具体实现方式参见上述方式2,在此不再赘述。In case 1 and case 3, it is necessary to determine the scene score of each initial training data according to the application scenario of each initial training data in the data set. For example, for each initial training data in the data set, the pre-configured mapping relationship is queried through the application scenario of the initial training data (the mapping relationship includes the corresponding relationship between the application scenario and the scenario score), and the scenario score of the initial training data is obtained. For the specific implementation method, refer to the above-mentioned method 1, replace the feature information with the application scenario, and replace the score value with the scenario score, which will not be repeated here. For another example, sort all the initial training data according to the important priority of the application scenario of each initial training data in the data set, and determine the scene score of each initial training data according to the sorting result. For the specific implementation method, refer to the above method 2. I won't repeat it here.
In case 2 and case 3, the quality score of each piece of initial training data must be determined according to the data quality of each piece of initial training data in the data set. For example, for each piece of initial training data in the data set, a preconfigured mapping relationship (which includes the correspondence between data quality and quality scores) is queried using the data quality of that initial training data to obtain its quality score. For the specific implementation, refer to manner 1 above, replacing the feature information with the data quality and the score value with the quality score; details are not repeated here. As another example, all initial training data are sorted according to the importance priority of the data quality of each piece of initial training data in the data set, and the quality score of each piece of initial training data is determined from the sorting result. For the specific implementation, refer to manner 2 above; details are not repeated here.
In case 1 and case 3, when at least two pieces of initial training data have the same application scenario, their scene scores may be the same or different. In case 2 and case 3, when at least two pieces of initial training data have the same data quality, their quality scores may be the same or different.
In summary, the score value of each piece of initial training data can be determined according to the feature information of each piece of initial training data in the data set. After the score value of each piece of initial training data is obtained, in one possible implementation, that score value may be used directly as the score value of the initial training data. In another possible implementation, the score value may further be corrected, and the corrected score value is used as the score value of the initial training data. The correction process is as follows: determine the similarity between two pieces of initial training data; if the similarity is greater than a preset similarity threshold, keep the score value of one of them unchanged and reduce the score value of the other.
For example, the preset similarity threshold may be configured empirically, without restriction. When the similarity between two pieces of initial training data is greater than the preset similarity threshold, the two are very close and can be regarded as identical or near-identical initial training data, that is, as duplicates.
For example, the similarity between two pieces of initial training data may be determined using the Euclidean distance, the cosine similarity, or the Pearson correlation coefficient. Of course, these are only a few examples; the manner of determination is not restricted, and any similarity algorithm may be used.
For example, the similarity between initial training data 1 and initial training data 2 is compared. If the similarity is greater than the preset similarity threshold, the score value of initial training data 1 is kept unchanged and the score value of initial training data 2 is reduced, e.g. to 0. If the similarity is not greater than the preset similarity threshold, the score values of both initial training data 1 and initial training data 2 are kept unchanged. Then the similarity between initial training data 1 and initial training data 3 is compared, and so on, so that the similarity of any two pieces of initial training data can be compared.
For example, when the similarity is greater than the preset similarity threshold, the score value of the piece of initial training data with the higher score value may be kept unchanged, or the score value of the piece with the lower score value may be kept unchanged.
For example, after the score value of a piece of initial training data has been reduced, that initial training data no longer takes part in subsequent comparisons; that is, its similarity to other initial training data is no longer compared.
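The score correction described above can be sketched as follows. This is only an illustration under stated assumptions: each piece of initial training data is represented by a numeric feature vector, cosine similarity is the chosen measure, the earlier item of a similar pair keeps its score, and the reduced score is 0. The names `cosine_similarity` and `correct_scores` and the threshold 0.95 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def correct_scores(features, scores, threshold=0.95):
    """Return a copy of scores in which near-duplicate items are set to 0."""
    corrected = list(scores)
    reduced = set()  # items whose score was lowered; they skip later comparisons
    for i in range(len(features)):
        if i in reduced:
            continue
        for j in range(i + 1, len(features)):
            if j in reduced:
                continue
            if cosine_similarity(features[i], features[j]) > threshold:
                corrected[j] = 0  # keep item i unchanged, reduce item j
                reduced.add(j)
    return corrected


example = correct_scores([[1, 0], [1, 0.01], [0, 1]], [90, 80, 70])
```

In this sketch the second vector is almost parallel to the first, so only its score is reduced; the third vector is compared with the first but not with the already reduced second one.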
Step 103: select target training data from the data set according to the score value of each piece of initial training data.
For example, the score value represents the training effect of the initial training data: the higher the score value, the better the training effect, and the lower the score value, the worse the training effect. Therefore, based on the score value of each piece of initial training data, the initial training data with high score values can be taken as the target training data, so that the initial training data with the better training effect become the target training data.
For example, the target training data may be selected from the data set in either of the following manners:
Manner one: for each piece of initial training data in the data set, if its score value is greater than a preset score threshold, that initial training data may be determined as target training data.
For example, the preset score threshold may be configured empirically, without restriction. A score value greater than the preset score threshold indicates that the training effect of the initial training data is good, and the initial training data can be used as target training data. A score value not greater than the preset score threshold indicates that the training effect of the initial training data is poor, and the initial training data need not be used as target training data.
For example, if the score value of initial training data 1 is greater than the preset score threshold, initial training data 1 may be determined as target training data. As another example, if the score value of initial training data 2 is not greater than the preset score threshold, initial training data 2 is not determined as target training data, and so on.
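Manner one amounts to a strict threshold filter, which can be sketched as follows. The function name `select_by_threshold`, the item identifiers and the sample values are hypothetical.

```python
def select_by_threshold(scores, threshold):
    """Return the items whose score value is strictly greater than the threshold."""
    return [item for item, score in scores.items() if score > threshold]


targets = select_by_threshold({"data1": 92, "data2": 60, "data3": 75}, threshold=75)
```

An item whose score value equals the threshold is not selected, matching the "not greater than" wording in the text.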
Manner two: sort all initial training data according to the score value of each piece of initial training data in the data set, and select multiple pieces of initial training data as target training data according to the sorting result.
For example, based on the score value of each piece of initial training data in the data set, all initial training data are sorted in descending order of score value. Based on the sorting result, starting from the initial training data with the highest score value, the top-ranked pieces of initial training data are selected as target training data.
For example, the data cleaning time interval (the interval within which data cleaning is performed) may be divided into multiple statistical periods of equal duration. In each statistical period, starting from the initial training data with the highest score value, the top-ranked pieces of initial training data are selected as target training data. For example, if the sorting result is initial training data 1 through initial training data 100, then initial training data 1 through 10 are selected as target training data in the first statistical period, initial training data 11 through 20 in the second statistical period, and so on.
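The ranking and per-period selection of manner two can be sketched as follows, assuming a fixed number of items per statistical period. The function name `periodic_batches` and the sample scores are hypothetical.

```python
def periodic_batches(scores, per_period):
    """Rank items by descending score value and cut the ranking into per-period batches."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[i:i + per_period] for i in range(0, len(ranked), per_period)]


batches = periodic_batches({"a": 5, "b": 9, "c": 7, "d": 1, "e": 3}, per_period=2)
```

Here `batches[0]` holds the target training data for the first statistical period, `batches[1]` for the second, and so on; the last batch may be shorter than the others.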
In one possible implementation, the number M to be cleaned in the next statistical period may be determined first, where M may be a positive integer (i.e. a natural number). In the next statistical period, based on the sorting result and starting from the initial training data with the highest score value, M pieces of initial training data are selected in turn as target training data.
The value of M may be configured empirically, without restriction. For example, when all operating nodes together can clean 10 pieces of target training data in one statistical period, M may be 10 or slightly greater than 10. Assuming the target training data to be cleaned are pictures, a value of 0 for M means that the number of pictures for the next statistical period is 0, i.e. all pictures have already been cleaned.
Because the number of operating nodes may change, and the number of pieces of target training data that the operating nodes clean in different statistical periods may also change, M may also be determined as follows: the number M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating nodes, where the cleaning efficiency denotes the amount of cleaning completed by the operating nodes (i.e. all operating nodes) in the current statistical period.
In summary, the data cleaning time interval may be divided into multiple statistical periods of equal duration. In the first statistical period, 10 pieces of initial training data are first selected as target training data and added to the to-be-cleaned list; the operating nodes obtain target training data from the to-be-cleaned list and clean them. If the operating nodes are able to clean 15 pieces of target training data in the first statistical period, then another 5 pieces of initial training data must be selected as target training data during the first statistical period and added to the to-be-cleaned list, from which the operating nodes obtain and clean them.
Since the operating nodes cleaned a total of 15 pieces of target training data in the first statistical period, the cleaning efficiency may be 15, and the number M to be cleaned in the second statistical period is determined to be 15. In the second statistical period, 15 pieces of initial training data are first selected as target training data and added to the to-be-cleaned list; the operating nodes obtain target training data from the to-be-cleaned list and clean them. If the operating nodes are able to clean only 12 pieces of target training data in the second statistical period, no new target training data need to be added to the to-be-cleaned list.
Since the operating nodes cleaned a total of 12 pieces of target training data in the second statistical period, the cleaning efficiency may be 12, and the number M to be cleaned in the third statistical period is determined to be 12.
In the third statistical period, 9 pieces of initial training data are first selected as target training data and added to the to-be-cleaned list. Because 3 pieces of target training data still remain in the to-be-cleaned list, the list then contains 12 pieces of target training data in total. The operating nodes obtain target training data from the to-be-cleaned list and clean them, and so on.
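The three-period walk-through above can be reproduced with a small bookkeeping sketch. It assumes that the M for the next period equals the amount cleaned in the current period, that leftover items stay in the to-be-cleaned list, and that the data set never runs out; the function name `simulate_periods` and the record layout are hypothetical.

```python
def simulate_periods(first_m, completed_per_period):
    """Return one (M, initial_pick, top_up, leftover) record per statistical period."""
    records = []
    m, leftover = first_m, 0
    for completed in completed_per_period:
        # Fill the to-be-cleaned list up to M, counting items left from last period.
        initial_pick = max(m - leftover, 0)
        queued = leftover + initial_pick
        # If the nodes clean more than was queued, select the difference on demand.
        top_up = max(completed - queued, 0)
        leftover = queued + top_up - completed
        records.append((m, initial_pick, top_up, leftover))
        m = completed  # cleaning efficiency becomes the next period's M
    return records


records = simulate_periods(first_m=10, completed_per_period=[15, 12, 12])
```

The first record shows 10 items selected up front plus 5 on demand, the second shows 15 selected with 3 left over, and the third shows only 9 new items being selected because 3 are already in the list.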
Step 104: perform data cleaning according to the target training data.
For example, the target training data and cleaning parameters may be sent to an operating node, so that the operating node cleans the target training data according to the cleaning parameters; this may also be called data labeling.
For example, the initial training data and target training data may be picture data, audio data, video data, text data, or data of another type; the type of the initial training data and target training data is not restricted.
For example, performing data cleaning on the target training data means performing at least one of the following operations on it: classification, drawing bounding boxes, annotation, and tagging (i.e. attaching a label that describes some attribute). The manner of data cleaning is not restricted, and all data cleaning manners related to neural networks are applicable.
For example, the cleaning parameters specify how the target training data are to be cleaned, e.g. parameters for classification, parameters for drawing bounding boxes, parameters for annotation, and parameters for tagging. The operating node can therefore clean the target training data according to the cleaning parameters.
In one possible implementation, the number of operating nodes may be dynamically adjusted according to the number of pieces of target training data. For example, under manner one above, the initial training data whose score values are greater than the preset score threshold are determined as target training data. Assuming there are 48 pieces of target training data and each operating node can clean 5 pieces of target training data, 10 operating nodes need to be deployed. On this basis, in step 104, the 48 pieces of target training data and the cleaning parameters are sent to the 10 operating nodes, so that these operating nodes clean the target training data according to the cleaning parameters.
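The node count in the example above is the ceiling of 48 / 5. A minimal sketch of that calculation and of one possible way to split the items across nodes follows; the function names `plan_nodes` and `assign` are hypothetical, and the even chunking is an assumption, since the text does not say how items are distributed.

```python
import math


def plan_nodes(num_targets, per_node_capacity):
    """Number of operating nodes needed to clean all target training data."""
    return math.ceil(num_targets / per_node_capacity)


def assign(targets, per_node_capacity):
    """Split the target training data into one chunk per operating node."""
    return [targets[i:i + per_node_capacity]
            for i in range(0, len(targets), per_node_capacity)]


nodes = plan_nodes(48, 5)
chunks = assign(list(range(48)), 5)
```

With 48 items and a capacity of 5, nine nodes receive 5 items each and the tenth receives the remaining 3.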
In another possible implementation, the number of pieces of target training data may be dynamically adjusted according to the cleaning efficiency of the operating nodes. For example, under manner two above, the number M to be cleaned in the next statistical period is determined according to the cleaning efficiency of the operating nodes, and M pieces of initial training data are selected as target training data in the next statistical period. For instance, when the cleaning efficiency of the operating nodes is 10, the number M to be cleaned in the next statistical period is determined to be 10. On this basis, in step 104, 10 pieces of target training data and the cleaning parameters are sent to the operating nodes, so that the operating nodes clean the target training data according to the cleaning parameters.
As can be seen from the above technical solution, in the embodiments of the present application, the score value of the initial training data is determined according to the feature information of the initial training data, the score value representing the training effect of the initial training data. Target training data are selected from all initial training data according to the score value of each piece of initial training data, and data cleaning is performed on the target training data rather than on all initial training data. This improves data cleaning efficiency and reduces wasted effort on redundant data. Because the target training data with a good training effect (i.e. a high score value) are the ones cleaned, the most effective data are provided for training; training data with a better effect take part in machine learning, the machine learning results are better, and the utilization of cleaning resources is improved.
The above technical solution is described below in connection with a specific application scenario. FIG. 2 is a schematic diagram of an application scenario of an embodiment of the present application. The control center module 21, the data import module 22, the active learning module 23 and the cleaning control module 24 may be deployed on the same device or on different devices.
In the above application scenario, as shown in FIG. 3, the data cleaning method may include:
Step 301: the control center module 21 creates a cleaning task. The cleaning task may include a data cleaning time interval (the interval within which data cleaning is performed), cleaning parameters, and other content.
Step 302: the control center module 21 sends a work instruction to the data import module 22.
Step 303: the data import module 22 obtains a data set that includes multiple pieces of initial training data. For example, the data import module 22 starts working after receiving the work instruction. During operation, initial training data may be obtained from historical data and/or from real-time data, without restriction. For the large amount of initial training data obtained, the data import module 22 imports initial training data of the same type into the same data set, thereby obtaining at least one data set.
Step 304: the data import module 22 returns a data import success message to the control center module 21. The data import success message indicates that the data import module 22 has completed the data import, i.e. the data set has been obtained. The message may also carry the number of pieces of initial training data in the data set.
Step 305: the control center module 21 sends a work instruction to the active learning module 23.
Step 306: the active learning module 23 determines the score value of each piece of initial training data according to the feature information of each piece of initial training data in the data set. For example, the active learning module 23 starts working after receiving the work instruction. During operation, it obtains the data set from the data import module 22 and determines the score value of each piece of initial training data according to its feature information.
In one possible implementation, after receiving the work instruction, the active learning module 23 may obtain all initial training data in the data set from the data import module 22 at once, and determine the score value of each piece of initial training data according to its feature information. For the specific manner, refer to manner 1 or manner 2 of step 102; details are not repeated here.
In another possible implementation, after receiving the work instruction, the active learning module 23 may obtain part of the initial training data in the data set from the data import module 22 and determine the score values of that part according to its feature information. Once those score values have been determined, it obtains another part of the initial training data in the data set from the data import module 22, and so on, until all initial training data in the data set have been obtained from the data import module 22 and their score values determined.
For example, the active learning module 23 obtains 10 pieces of initial training data from the data import module 22 and, for each of them, queries the preconfigured mapping relationship using the feature information of that initial training data to obtain its score value. It then obtains another 10 pieces of initial training data from the data import module 22, and so on, until the score values of all initial training data have been determined.
As another example, the active learning module 23 obtains initial training data 1 through 10 from the data import module 22, sorts initial training data 1 through 10 according to the importance priority of their feature information, and determines the score values of initial training data 1 through 10 from the sorting result. It then obtains initial training data 11 through 20 from the data import module 22, sorts initial training data 1 through 20 according to the importance priority of their feature information, and determines the score values of initial training data 1 through 20 from the sorting result.
Because the score values of initial training data 1 through 10 have been determined anew, they must be corrected; that is, the corrected score values of initial training data 1 through 10 are used.
Next, initial training data 21 through 30 are obtained from the data import module 22, initial training data 1 through 30 are sorted according to the importance priority of their feature information, and the score values of initial training data 1 through 30 are determined from the sorting result. Because the score values of initial training data 1 through 20 have been determined anew, they must be corrected; that is, the corrected score values of initial training data 1 through 20 are used. This continues until the score values of all initial training data have been determined.
For example, the score values of initial training data 1 through 10 must be corrected after initial training data 11 through 20 are obtained from the data import module 22 for the following reason. When initial training data 1 through 10 are sorted according to the importance priority of their feature information, suppose initial training data 5 ranks first, so that its score value is 100. When initial training data 1 through 20 are then sorted according to the importance priority of their feature information, initial training data 5 may no longer rank first; if it ranks sixth, for instance, its score value is 95. The score value of initial training data 5 has changed and must therefore be corrected.
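The batch-wise re-ranking described above can be sketched as follows. The rank-to-score rule is an assumption chosen to match the example (first place scores 100, sixth place scores 95, so each rank step costs one point), and the importance priority is modelled as a plain number; the names `rescore` and `score_in_batches` are hypothetical.

```python
def rescore(priorities, top_score=100):
    """Rank all items seen so far by descending priority; the top rank gets top_score."""
    ranked = sorted(priorities, key=priorities.get, reverse=True)
    return {item: top_score - position for position, item in enumerate(ranked)}


def score_in_batches(batches):
    """Fetch items batch by batch and re-rank everything fetched so far each time."""
    seen, scores = {}, {}
    for batch in batches:
        seen.update(batch)
        scores = rescore(seen)  # earlier items may receive corrected score values
    return scores


first_batch = {"data5": 50, "data3": 10}
second_batch = {"data11": 71, "data12": 72, "data13": 73, "data14": 74, "data15": 75}
after_first = rescore(first_batch)
after_second = score_in_batches([first_batch, second_batch])
```

After the first batch, `data5` ranks first; once five higher-priority items arrive in the second batch, it drops to sixth place and its score value is corrected accordingly.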
After determining the score value of each piece of initial training data according to the feature information of each piece of initial training data in the data set, the active learning module 23 may further determine the similarity between two pieces of initial training data. If the similarity is greater than the preset similarity threshold, the score value of one of them is kept unchanged and the score value of the other is reduced. For example, if there are duplicate pieces of initial training data, the score value of the first one is kept unchanged and the score values of the others are set to 0.
Of course, the active learning module 23 may also perform the similarity comparison before determining the score values of the initial training data. For example, the similarity between pieces of initial training data is determined first. If the similarity is greater than the preset similarity threshold, one piece of initial training data is retained in the data set, the score values of the remaining pieces are set to 0, and the remaining pieces are not retained in the data set. On this basis, the active learning module 23 determines the score value of each piece of initial training data according to the feature information of each piece of initial training data in the data set (which no longer includes the pieces whose score values were set to 0).
For example, the active learning module 23 supports querying initial training data by condition, such as the number of pieces of initial training data whose score value is greater than a given value, or the distribution across different score value ranges.
Step 307: the active learning module 23 sends a scoring completion message to the control center module 21. The scoring completion message indicates that the active learning module 23 has scored all initial training data.
Step 308: the control center module 21 sends a work instruction to the cleaning control module 24.
Step 309: the cleaning control module 24 determines the number M to be cleaned and sends it to the active learning module 23. For example, the cleaning control module 24 starts working after receiving the work instruction. During operation, it first determines the number M to be cleaned and then sends it to the active learning module 23.
For example, the number M1 to be cleaned in the first statistical period may be configured empirically. The number M2 to be cleaned in the second statistical period is determined based on the cleaning efficiency of all operating nodes in the first statistical period. The number M3 to be cleaned in the third statistical period is determined based on the cleaning efficiency of all operating nodes in the second statistical period, and so on. In summary, the cleaning control module 24 can determine the number M to be cleaned in each statistical period and send it to the active learning module 23.
For example, when the cleaning efficiency of an operating node in a statistical period rises or falls, and/or the number of operating nodes increases or decreases, the cleaning efficiency of all operating nodes changes, and so does the number M to be cleaned. The number M to be cleaned can thus be adjusted dynamically.
For example, the cleaning control module 24 may count the cleaning efficiency of each operating node, i.e. the number of pieces of target training data that the operating node completed in the current statistical period. It then determines the cleaning efficiency of all operating nodes and determines the number M to be cleaned based on that overall cleaning efficiency.
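One simple reading of the paragraph above is that the overall cleaning efficiency is the sum of the per-node counts and that M is set equal to it. The sketch below makes exactly that assumption; the function name `next_m` and the node identifiers are hypothetical.

```python
def next_m(completed_by_node):
    """Number M to be cleaned next period, assumed equal to the total completed this period."""
    return sum(completed_by_node.values())


m_three_nodes = next_m({"node1": 6, "node2": 5, "node3": 4})
m_after_node_removed = next_m({"node1": 6, "node2": 5})
```

Because M is recomputed from the per-node counts each period, adding or removing a node, or a change in any node's throughput, changes M without further configuration.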
Step 310: the active learning module 23 sorts all initial training data according to the score value of each initial training data item. Based on the sorting result, starting from the initial training data with the highest score value, it selects the first M initial training data items as target training data and sends the target training data to the cleaning control module 24.
Step 311: the cleaning control module 24 adds the target training data to the to-be-cleaned list.
For example, in the first statistical period, the active learning module 23 takes M1 initial training data items as target training data and sends the M1 target training data items to the cleaning control module 24, which adds them to the to-be-cleaned list. In the second statistical period, the active learning module 23 takes M2 initial training data items as target training data and sends the M2 target training data items to the cleaning control module 24, which adds them to the to-be-cleaned list, and so on.
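Steps 310 and 311 can be sketched as follows. This is an illustrative sketch only: the dictionary-based pool, the function name `take_top` and the removal of selected items from the pool are assumptions, not details given in the text.

```python
import heapq


def take_top(pool, m):
    """Select the m highest-scoring items from pool and remove them.

    pool maps an item identifier to its score value. The returned list is
    ordered from the highest score downwards. If fewer than m items remain,
    all remaining items are returned.
    """
    chosen = heapq.nlargest(m, pool, key=pool.get)
    for item_id in chosen:
        # Assumption: an item handed out for cleaning is not offered again.
        del pool[item_id]
    return chosen
```

Calling `take_top(pool, M1)` in the first statistical period and `take_top(pool, M2)` in the second mirrors the per-period selection described above.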
Step 312: the cleaning control module 24 sends the target training data to an operating node, so that the operating node performs data cleaning on the target training data. For example, the target training data and cleaning parameters are sent to the operating node, so that the operating node cleans the target training data according to the cleaning parameters.
For example, when an operating node is able to process new target training data, it may send a request message to the cleaning control module 24. The request message requests N target training data items, indicating that the operating node can clean N target training data items, where N is a positive integer. After receiving the request message, the cleaning control module 24 determines whether the to-be-cleaned list contains N target training data items. If so, it sends N target training data items directly to the operating node. If not, it obtains (N-a) target training data items from the active learning module 23, where a denotes the number of target training data items already in the to-be-cleaned list. In this way, N target training data items are obtained and sent to the operating node.
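The request handling in step 312 can be sketched as below. The class name, the `fetch_targets` callback standing in for the active learning module 23, and the use of a queue for the to-be-cleaned list are all hypothetical choices made for illustration.

```python
from collections import deque


class CleaningController:
    """Minimal stand-in for the cleaning control module."""

    def __init__(self, fetch_targets):
        self.pending = deque()              # the to-be-cleaned list
        self.fetch_targets = fetch_targets  # callable(count) -> list of items

    def handle_request(self, n):
        """Serve a request from an operating node for n items."""
        a = len(self.pending)
        if a < n:
            # Not enough items in the list: obtain the missing (n - a).
            self.pending.extend(self.fetch_targets(n - a))
        # The source may return fewer items than asked for, so cap the batch.
        batch_size = min(n, len(self.pending))
        return [self.pending.popleft() for _ in range(batch_size)]
```

When the list already holds at least n items, the callback is not invoked and the request is served directly from the list.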
For example, an operating node may also be called a cleaning node. An operating node may be a machine or a human operator; this is not limited, as long as it can perform data cleaning on the target training data.
Step 313: the cleaning control module 24 reports the task execution status to the control center module 21.
As can be seen from the above technical solutions, in the embodiments of the present application data cleaning is performed on the target training data rather than on all initial training data, which improves data cleaning efficiency and reduces effort wasted on redundant data. In addition, data cleaning can be performed on the target training data with a good training effect (that is, a high score value), so that the most effective data is provided for training. Training data with a better effect thus takes part in machine learning, the machine learning result is better, and the utilization of cleaning resources is improved.
For example, suppose 2 operating nodes are currently available for data cleaning, and each operating node can clean 100 target training data items per day. Assuming there are 1000 initial training data items, the 200 initial training data items whose score values are greater than n can be selected from the 1000 and used as target training data. Then 100 target training data items can be provided to one operating node and the remaining 100 to the other, so that the 2 operating nodes clean the 200 target training data items.
If an operating node is added during data cleaning, the 100 initial training data items with the highest score values can be selected from the remaining 800 initial training data items, used as target training data, and provided to the newly added operating node.
As another example, suppose 2 operating nodes are currently available for data cleaning. If, during data cleaning, initial training data with high score values is found to be accumulating, that is, the number of operating nodes is insufficient, the number of operating nodes deployed can be adjusted dynamically according to the amount of accumulated high-score initial training data.
As another example, suppose it is determined from the historically accumulated cleaning efficiency that 1000 target training data items can be completed in one statistical period. The cleaning control module 24 then obtains 1000 target training data items from the active learning module 23 for cleaning. If, during data cleaning, the actual cleaning efficiency turns out to be higher, so that 1100 target training data items can be completed in one statistical period, the cleaning control module 24 obtains a further 100 target training data items from the active learning module 23 for cleaning. In the next statistical period, it first obtains 1100 target training data items from the active learning module 23 for cleaning, and so on, so that the number of target training data items can be adjusted dynamically.
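The arithmetic of the last example (1000 planned, 1100 achievable) can be written as a small sketch. The function name and the rule that the next period's plan simply equals the observed capacity are assumptions for illustration.

```python
def plan_period(previous_plan, observed_capacity):
    """Return (extra_now, next_plan) for a statistical period.

    extra_now is how many additional items to fetch in the current period
    when the nodes turn out to be faster than planned; next_plan is the
    quantity to fetch at the start of the next period.
    """
    extra_now = max(0, observed_capacity - previous_plan)
    next_plan = observed_capacity
    return extra_now, next_plan
```

For the figures in the text, `plan_period(1000, 1100)` yields 100 extra items in the current period and a plan of 1100 for the next one.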
Based on the same concept as the above method, an embodiment of the present application further provides a data cleaning apparatus. FIG. 4 is a structural diagram of the data cleaning apparatus, and the apparatus includes:
an acquisition module 41, configured to acquire a data set, the data set including a plurality of initial training data; a determination module 42, configured to determine a score value of each initial training data according to feature information of each initial training data in the data set, the score value representing the training effect of the initial training data; a selection module 43, configured to select target training data from the data set according to the score value of each initial training data; and a cleaning module 44, configured to perform data cleaning according to the target training data.
The determination module 42 is specifically configured to: for each initial training data in the data set, query a pre-configured mapping relationship using the feature information of the initial training data to obtain the score value of the initial training data, where the mapping relationship includes a correspondence between feature information and score values; or,
sort all initial training data according to the importance priority of the feature information of each initial training data in the data set, and determine the score value of each initial training data according to the sorting result.
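The two scoring alternatives of the determination module 42 can be sketched as follows. The feature labels, the default score for unmapped features, and the linear rank-to-score rule are hypothetical; the text does not specify how a score is derived from the sorting result.

```python
def score_by_mapping(items, mapping, default=0.0):
    """Look up each item's score in a pre-configured feature-to-score mapping.

    items maps an item identifier to its feature information.
    """
    return {item_id: mapping.get(feature, default)
            for item_id, feature in items.items()}


def score_by_priority(items, priority):
    """Sort items by feature priority and derive a score from the rank.

    priority lists features from most to least important; every feature in
    items is assumed to appear in it. Assumption: the score falls linearly
    with the position in the sorted order.
    """
    ordered = sorted(items, key=lambda item_id: priority.index(items[item_id]))
    n = len(ordered)
    return {item_id: (n - rank) / n for rank, item_id in enumerate(ordered)}
```

The first function corresponds to the mapping-relationship alternative, the second to the alternative based on importance priority.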
The feature information includes an application scenario and/or data quality, and the determination module 42 is specifically configured to:
if the feature information includes an application scenario and data quality, determine a scenario score of each initial training data according to the application scenario of each initial training data in the data set, and determine a quality score of each initial training data according to the data quality of each initial training data in the data set;
for each initial training data, determine the score value of the initial training data according to the scenario score and a scenario weight value of the initial training data, and the quality score and a quality weight value of the initial training data.
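One plausible way to combine the scenario score and the quality score is a weighted sum, sketched below. The weighted-sum form and the default weight values are assumptions; this passage only states that the score value is determined from the two scores and their weight values.

```python
def combined_score(scenario_score, quality_score,
                   scenario_weight=0.6, quality_weight=0.4):
    """Combine the two partial scores into one score value.

    Assumption: a weighted sum; the default weights are illustrative only.
    """
    return scenario_score * scenario_weight + quality_score * quality_weight
```

For example, a scenario score of 80 and a quality score of 50 give 80 * 0.6 + 50 * 0.4 = 68 under these illustrative weights.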
For example, the determination module 42 is further configured to: after determining the score value of each initial training data according to the feature information of each initial training data in the data set, determine the similarity between two initial training data; and if the similarity is greater than a preset similarity threshold, keep the score value of one initial training data unchanged and reduce the score value of the other initial training data.
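The similarity-based score reduction can be sketched as below. Which of the two items keeps its score, and by how much the other is reduced, are not specified in this passage; penalising the lower-scored item by a fixed factor is an assumption, and `similarity` is a hypothetical caller-supplied function.

```python
from itertools import combinations


def penalize_near_duplicates(scores, similarity, threshold=0.9, penalty=0.5):
    """Lower the score of one item in every pair that is too similar.

    scores maps an item identifier to its score value; similarity(a, b)
    returns a number for a pair of identifiers. Assumption: the item with
    the lower current score is reduced, multiplied by penalty.
    """
    adjusted = dict(scores)
    for a, b in combinations(sorted(scores), 2):
        if similarity(a, b) > threshold:
            loser = a if adjusted[a] < adjusted[b] else b
            adjusted[loser] *= penalty
    return adjusted
```

Reducing the score of near-duplicates makes it less likely that redundant items are selected as target training data.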
The selection module 43 is specifically configured to: for each initial training data, if the score value of the initial training data is greater than a preset score threshold, determine the initial training data as target training data; or, sort all initial training data according to the score value of each initial training data in the data set, and select a plurality of initial training data as target training data according to the sorting result.
When selecting a plurality of initial training data as target training data according to the sorting result, the selection module 43 is specifically configured to: determine the quantity M to be cleaned in the next statistical period;
in the next statistical period, based on the sorting result and starting from the initial training data with the highest score value, select M initial training data in sequence as target training data.
When determining the quantity M to be cleaned in the next statistical period, the selection module 43 is specifically configured to:
determine the quantity M to be cleaned in the next statistical period according to the cleaning efficiency of the operating node, where the cleaning efficiency represents the amount of cleaning the operating node completed in the current statistical period.
The cleaning module 44 is specifically configured to: send the target training data and cleaning parameters to an operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
Based on the same concept as the above method, an embodiment of the present application further provides a data cleaning device. At the hardware level, a schematic diagram of the hardware architecture of the data cleaning device is shown in FIG. 5. The data cleaning device may include a processor 51 and a machine-readable storage medium 52, where the machine-readable storage medium 52 stores machine-executable instructions that can be executed by the processor 51, and the processor 51 is configured to execute the machine-executable instructions to implement the method disclosed in the above examples of the present application. For example, the processor 51 is configured to execute the machine-executable instructions to implement the following steps:
acquiring a data set, the data set including a plurality of initial training data;
determining a score value of each initial training data according to feature information of each initial training data in the data set, the score value representing the training effect of the initial training data;
selecting target training data from the data set according to the score value of each initial training data;
performing data cleaning according to the target training data.
Based on the same concept as the above method, an embodiment of the present application further provides a machine-readable storage medium storing a number of computer instructions which, when executed by a processor, implement the method disclosed in the above examples of the present application.
For example, when the computer instructions are executed by a processor, the following steps can be implemented:
acquiring a data set, the data set including a plurality of initial training data;
determining a score value of each initial training data according to feature information of each initial training data in the data set, the score value representing the training effect of the initial training data;
selecting target training data from the data set according to the score value of each initial training data;
performing data cleaning according to the target training data.
For example, the above machine-readable storage medium may be any electronic, magnetic, optical or other physical storage apparatus that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination thereof.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of separate units divided by function. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Furthermore, these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and changes to the present application will be apparent to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (12)

  1. A data cleaning method, characterized in that the method comprises:
    acquiring a data set, the data set comprising a plurality of initial training data;
    determining a score value of each initial training data according to feature information of each initial training data in the data set, the score value representing a training effect of the initial training data;
    selecting target training data from the data set according to the score value of each initial training data;
    performing data cleaning according to the target training data.
  2. The method according to claim 1, characterized in that determining the score value of each initial training data according to the feature information of each initial training data in the data set comprises:
    for each initial training data in the data set, querying a pre-configured mapping relationship using the feature information of the initial training data to obtain the score value of the initial training data, wherein the mapping relationship comprises a correspondence between feature information and score values; or,
    sorting all initial training data according to an importance priority of the feature information of each initial training data in the data set, and determining the score value of each initial training data according to the sorting result.
  3. The method according to claim 1 or 2, characterized in that
    the feature information comprises an application scenario and/or data quality, and determining the score value of each initial training data according to the feature information of each initial training data in the data set comprises:
    if the feature information comprises an application scenario and data quality, determining a scenario score of each initial training data according to the application scenario of each initial training data in the data set, and determining a quality score of each initial training data according to the data quality of each initial training data in the data set;
    for each initial training data, determining the score value of the initial training data according to the scenario score and a scenario weight value of the initial training data, and the quality score and a quality weight value of the initial training data.
  4. The method according to claim 1 or 2, characterized in that
    after determining the score value of each initial training data according to the feature information of each initial training data in the data set, the method further comprises:
    determining a similarity between two initial training data; and if the similarity is greater than a preset similarity threshold, keeping the score value of one initial training data unchanged and reducing the score value of the other initial training data.
  5. The method according to claim 1, characterized in that selecting target training data from the data set according to the score value of each initial training data comprises:
    for each initial training data, if the score value of the initial training data is greater than a preset score threshold, determining the initial training data as target training data; or,
    sorting all initial training data according to the score value of each initial training data in the data set, and selecting a plurality of initial training data as target training data according to the sorting result.
  6. The method according to claim 5, characterized in that
    selecting a plurality of initial training data as target training data according to the sorting result comprises:
    determining a quantity M to be cleaned in a next statistical period;
    in the next statistical period, based on the sorting result and starting from the initial training data with the highest score value, selecting M initial training data in sequence as target training data, wherein M is a natural number.
  7. The method according to claim 6, characterized in that
    determining the quantity M to be cleaned in the next statistical period comprises:
    determining the quantity M to be cleaned in the next statistical period according to a cleaning efficiency of an operating node, the cleaning efficiency representing the amount of cleaning the operating node completed in the current statistical period.
  8. The method according to claim 1, characterized in that
    performing data cleaning according to the target training data comprises:
    sending the target training data and cleaning parameters to an operating node, so that the operating node performs data cleaning on the target training data according to the cleaning parameters.
  9. A data cleaning apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire a data set, the data set comprising a plurality of initial training data;
    a determination module, configured to determine a score value of each initial training data according to feature information of each initial training data in the data set, the score value representing a training effect of the initial training data;
    a selection module, configured to select target training data from the data set according to the score value of each initial training data;
    a cleaning module, configured to perform data cleaning according to the target training data.
  10. A data cleaning device, characterized by comprising a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor;
    the processor being configured to execute the machine-executable instructions to implement the following steps:
    acquiring a data set, the data set comprising a plurality of initial training data;
    determining a score value of each initial training data according to feature information of each initial training data in the data set, the score value representing a training effect of the initial training data;
    selecting target training data from the data set according to the score value of each initial training data;
    performing data cleaning according to the target training data.
  11. A computer program, the computer program being stored in a machine-readable storage medium and, when executed by a processor, causing the processor to implement the method according to any one of claims 1-8.
  12. A machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
PCT/CN2021/097992 2020-06-03 2021-06-02 Data cleaning method, apparatus and device, program, and storage medium WO2021244583A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010495705.0 2020-06-03
CN202010495705.0A CN113762519B (en) 2020-06-03 Data cleaning method, device and equipment

Publications (1)

Publication Number Publication Date
WO2021244583A1 true WO2021244583A1 (en) 2021-12-09

Family

ID=78783341

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097992 WO2021244583A1 (en) 2020-06-03 2021-06-02 Data cleaning method, apparatus and device, program, and storage medium

Country Status (1)

Country Link
WO (1) WO2021244583A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501829A (en) * 2023-06-29 2023-07-28 北京法伯宏业科技发展有限公司 Data management method and system based on artificial intelligence large language model platform
CN117171153A (en) * 2023-09-11 2023-12-05 北京三维天地科技股份有限公司 Visual data cleaning method and system supporting custom cleaning flow
CN117891812A (en) * 2024-03-18 2024-04-16 北京数字一百信息技术有限公司 Big data cleaning method and system based on artificial intelligence

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961098A (en) * 2019-03-22 2019-07-02 中国科学技术大学 A kind of training data selection method of machine learning
US20190311300A1 (en) * 2018-04-09 2019-10-10 Veda Data Solutions, Inc. Scheduling Machine Learning Tasks, and Applications Thereof
CN110866658A (en) * 2019-12-05 2020-03-06 国网江苏省电力有限公司南通供电分公司 Method for predicting medium and long term load of urban power grid

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501829A (en) * 2023-06-29 2023-07-28 北京法伯宏业科技发展有限公司 Data management method and system based on artificial intelligence large language model platform
CN116501829B (en) * 2023-06-29 2023-09-19 北京法伯宏业科技发展有限公司 Data management method and system based on artificial intelligence large language model platform
CN117171153A (en) * 2023-09-11 2023-12-05 北京三维天地科技股份有限公司 Visual data cleaning method and system supporting custom cleaning flow
CN117891812A (en) * 2024-03-18 2024-04-16 北京数字一百信息技术有限公司 Big data cleaning method and system based on artificial intelligence
CN117891812B (en) * 2024-03-18 2024-05-24 北京数字一百信息技术有限公司 Big data cleaning method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN113762519A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
WO2021244583A1 (en) Data cleaning method, apparatus and device, program, and storage medium
WO2022151649A1 (en) Deep interest network-based topic recommendation method and apparatus
CN108427738B (en) Rapid image retrieval method based on deep learning
JP6874757B2 (en) Learning equipment, learning methods and programs
CN110781819A (en) Image target detection method, system, electronic equipment and storage medium
WO2022042123A1 (en) Image recognition model generation method and apparatus, computer device and storage medium
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
Barnaghi et al. A comparative study for various methods of classification
WO2020233709A1 (en) Model compression method, and device
Li et al. Deep representation via convolutional neural network for classification of spatiotemporal event streams
CN116089883B (en) Training method for improving classification degree of new and old categories in existing category increment learning
JP2022117941A (en) Image searching method and device, electronic apparatus, and computer readable storage medium
CN115115825B (en) Method, device, computer equipment and storage medium for detecting object in image
CN111783997A (en) Data processing method, device and equipment
CN114118207B (en) Incremental learning image identification method based on network expansion and memory recall mechanism
WO2021253938A1 (en) Neural network training method and apparatus, and video recognition method and apparatus
Mithun et al. Generating diverse image datasets with limited labeling
US10909167B1 (en) Systems and methods for organizing an image gallery
JP6991960B2 (en) Image recognition device, image recognition method and program
CN112529078A (en) Service processing method, device and equipment
US8660974B2 (en) Inference over semantic network with some links omitted from indexes
CN114187465A (en) Method and device for training classification model, electronic equipment and storage medium
CN114821248B (en) Point cloud understanding-oriented data active screening and labeling method and device
CN116543250A (en) Model compression method based on class attention transmission
CN113762519B (en) Data cleaning method, device and equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21818151

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21818151

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.06.2023)
