CN110490237B - Data processing method and device, storage medium and electronic equipment


Info

Publication number
CN110490237B
CN110490237B (granted publication of application CN201910713784.5A; earlier publication CN110490237A)
Authority
CN
China
Prior art keywords
data
model
data set
target
correct
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910713784.5A
Other languages
Chinese (zh)
Other versions
CN110490237A (en)
Inventor
罗彤
Current Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd and Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority application: CN201910713784.5A
Publication of CN110490237A
Application granted
Publication of CN110490237B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, a data processing device, a storage medium and an electronic device. The method comprises the following steps: acquiring a plurality of data, wherein the plurality of data carry the same class label; dividing the plurality of data into a first data set and a second data set; extracting features of each data in the first data set and the second data set; acquiring correctness information of the class label of each data in the second data set; training a preset binary classification model according to the correctness information of the class label of each data in the second data set and the features of each data, to obtain a target model; acquiring, by using the target model and the features of each data in the first data set, first target data in the first data set whose class labels are judged to be correct; and obtaining second target data according to the first target data and the data with correct class labels in the second data set. In this way, data cleaning efficiency can be improved.

Description

Data processing method, data processing device, storage medium and electronic equipment
Technical Field
The present application belongs to the field of data technologies, and in particular relates to a data processing method, a data processing device, a storage medium and an electronic device.
Background
Data cleaning refers to the process of reviewing and verifying data, and aims to remove erroneous information from a data set. Taking the data cleaning of classified pictures as an example, the process mainly checks whether the class label of each picture is correct and deletes the pictures whose class labels are wrong. However, in the related art, this data cleaning process is inefficient.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, a storage medium and electronic equipment, which can improve the efficiency of data cleaning.
An embodiment of the present application provides a data processing method, including:
acquiring a plurality of data, wherein the plurality of data carry the same class label;
dividing the plurality of data into a first data set and a second data set;
extracting features of each of the first data set and the second data set;
acquiring the correctness information of the category label of each data in the second data set;
training a preset binary classification model according to the correctness information of the class label of each data in the second data set and the features of each data, to obtain a target model;
acquiring first target data with a category label judged to be correct in the first data set by using the target model and the characteristics of each data in the first data set;
and obtaining second target data according to the first target data and the data with correct class labels in the second data set.
An embodiment of the present application provides a data processing apparatus, including:
a first acquisition module, used for acquiring a plurality of data, wherein the plurality of data carry the same category label;
a dividing module for dividing the plurality of data into a first data set and a second data set;
an extraction module for extracting features of each of the first and second data sets;
the second acquisition module is used for acquiring the correctness information of the category label of each data in the second data set;
a training module, used for training a preset binary classification model according to the correctness information of the class label of each data in the second data set and the features of each data, to obtain a target model;
a third obtaining module, configured to obtain, by using the target model and a feature of each data in the first data set, first target data in which a category label in the first data set is determined to be correct;
and the processing module is used for obtaining second target data according to the first target data and the data with correct class labels in the second data set.
The embodiment of the present application provides a storage medium, on which a computer program is stored, and when the computer program is executed on a computer, the computer is caused to execute the flow in the data processing method provided by the embodiment of the present application.
The embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the processor is configured to execute the flow in the data processing method provided in the embodiment of the present application by calling the computer program stored in the memory.
In the embodiment of the application, the electronic device can perform the data cleaning work by using the trained binary classification model. Because the trained model can quickly determine which data carry correct class labels, clean data can be obtained quickly. Compared with the related-art data cleaning mode of manually browsing data one by one to check whether their label information is wrong, the data cleaning efficiency can be improved.
Drawings
The technical solutions and advantages of the present application will become apparent from the following detailed description of specific embodiments of the present application when taken in conjunction with the accompanying drawings.
Fig. 1 is a first schematic flowchart of a data processing method according to an embodiment of the present application.
Fig. 2 is a second flowchart of a data processing method according to an embodiment of the present application.
Fig. 3 is a third flowchart illustrating a data processing method according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a fourth model provided in the embodiment of the present application.
Fig. 5 to fig. 7 are schematic scene diagrams of a data processing method according to an embodiment of the present application.
Fig. 8 is a fourth flowchart illustrating a data processing method according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Fig. 11 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present application are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the application and should not be taken as limiting the application with respect to other embodiments that are not detailed herein.
It is understood that the execution subject of the embodiment of the present application may be an electronic device such as a smart phone or a tablet computer or a desktop computer or a server.
Referring to fig. 1 and fig. 2, fig. 1 is a first schematic flow chart of a data processing method provided in an embodiment of the present application, and fig. 2 is a second schematic flow chart of the data processing method provided in the embodiment of the present application, where the flow chart may include:
101. and acquiring a plurality of data, wherein the plurality of data carry the same class label.
Data cleaning refers to the process of reviewing and verifying data, and aims to remove erroneous information from a data set. Taking the data cleaning of classified pictures as an example, the related art mainly cleans data by manual inspection: for example, pictures with wrong classification labels are deleted after manually checking whether the classification label of each picture is correct. However, this manual data cleaning process is inefficient.
In this embodiment, for example, the electronic device may first acquire a plurality of data, and the data may carry the same category label. It is understood that the plurality of data are data that need to be cleaned. For example, the electronic device may acquire a data set that requires data cleansing.
For example, the data to be cleaned is a picture set, and the pictures included in the picture set may be pictures with the same category label. For example, the category label of the pictures included in the picture set is a flower category or the like.
102. The plurality of data is divided into a first data set and a second data set.
For example, after data that needs to be cleaned is acquired, the electronic device may divide the data into a first data set and a second data set.
For example, the data that needs to be cleaned is 1000 pictures, and the electronic device may divide the 1000 pictures into a first data set (e.g., a first picture set) and a second data set (e.g., a second picture set).
103. Features of each of the data in the first data set and the second data set are extracted.
For example, after the first data set and the second data set are obtained by dividing, the electronic device may extract features of each data in the first data set and extract features of each data in the second data set.
For example, the electronic device may extract features of each picture in the first data set and extract features of each picture in the second data set.
104. And acquiring the correctness information of the category label of each data in the second data set.
For example, after obtaining the second data set, the electronic device may obtain correctness information of the category label of each data in the second data set. For example, the electronic device may obtain information whether the category label of each picture in the second data set is correct.
For example, 200 pictures are included in the second data set; it can then be determined, by manual inspection, whether the category label of each of the 200 pictures is correct. If the category label of a picture is correct, the inspector can, through the electronic device, mark the picture with corresponding information indicating that the category label is correct, such as the number "1" or the letter "T". If the category label of a picture is wrong, the inspector can, through the electronic device, mark the picture with corresponding information indicating that the category label is wrong, such as the number "0" or the letter "F". In this way, the electronic device can acquire the information about whether the category label of each of the 200 pictures is correct.
105. And training a preset two-classification model according to the correctness information of the class label of each data in the second data set and the characteristics of each data to obtain a target model.
For example, after the information about whether the category label of each data in the second data set is correct or not is acquired, the electronic device may perform learning training on a preset binary model according to the information about whether the category label of each data in the second data set is correct or not and the characteristics of each data in the second data set, so as to obtain a model subjected to learning training, that is, a target model.
For example, after acquiring the information on whether the category label of 200 pictures in the second data set is correct, the electronic device may input the information on whether the category label of 200 pictures is correct and the characteristics of 200 pictures as input data into a preset binary model to perform learning training on the binary model, so as to obtain the target model.
In one embodiment, for example, picture P_i is a picture in the second data set. Then the feature f_i of picture P_i and the information b_i of whether the class label of picture P_i is correct can be expressed in the form <f_i, b_i>, and <f_i, b_i> can be used as one learning sample of the binary classification model.
It can be understood that the target model obtained through the learning training is a model that can output information whether the class label of the picture is correct according to the picture feature.
106. And acquiring first target data with the category label judged to be correct in the first data set by using the target model and the characteristics of each data in the first data set.
For example, after obtaining the target model, the electronic device may use the target model and the features of each data in the first data set to obtain the data in the first data set whose class labels are determined to be correct, that is, the first target data.
For example, the first data set contains 800 pictures. Then, after obtaining the target model, the electronic device may input the features of each of the 800 pictures into the target model, output, by the target model, information whether the category label of each picture is correct according to the features of the picture, and determine the picture with the determined correct category label as the first target data. It is to be understood that the first target data may include a plurality of pictures.
107. And obtaining second target data according to the first target data and the data with correct class labels in the second data set.
For example, after obtaining the first target data, the electronic device may further obtain data with correct category labels in the second data set, and combine the first target data and the data with correct category labels in the second data set to obtain the second target data. It is understood that the second target data is clean data obtained after the data cleaning process.
For example, in 104, the electronic device acquires the information about whether the class labels of the 200 pictures in the second data set are correct, and the class labels of 190 of those 200 pictures are correct. In 106, the electronic device determines, using the target model, that the class labels of 790 of the 800 pictures in the first data set are correct. The electronic device may then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures. These 980 pictures can be regarded as the clean data obtained after data cleaning.
It is understood that, in the present embodiment, the electronic device may perform data cleansing work by using the learning-trained binary model. Due to the learning-trained binary model, correct data of the class labels can be quickly determined. Therefore, the present embodiment can quickly obtain clean data. Compared with a data cleaning mode in which whether the tag information of the inspection data is wrong or not is manually browsed one by one in the related art, the data cleaning efficiency can be improved.
In the embodiment of the application, the two classification models are introduced into the cleaning work of classification data. And training a binary model by using the known information about whether the class label of the data in the second data set is correct and the characteristics of the data in the second data set to obtain a target model, outputting the information about whether the class label of the data in the first data set is correct by using the target model, and finally obtaining the data with correct class labels, namely the clean data.
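As an illustrative sketch only (not the patent's own implementation), the flow of 101 to 107 can be strung together with scikit-learn. The random feature matrix and the simulated correctness labels below are hypothetical stand-ins for extracted picture features and manual inspection results:

```python
# Hypothetical end-to-end sketch of steps 101-107 (illustrative only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# 101-102: 1000 "pictures" with the same class label, split 800 / 200.
features = rng.normal(size=(1000, 16))          # 103: stand-in extracted features
first_set, second_set = features[:800], features[800:]

# 104: correctness info for the second set (1 = label correct, 0 = wrong);
# simulated here, obtained by manual inspection in the patent.
correct = (second_set[:, 0] > 0).astype(int)

# 105: train the preset binary classification model to get the target model.
target_model = SVC().fit(second_set, correct)

# 106: first target data = first-set items the target model judges correct.
judged = target_model.predict(first_set)
first_target = first_set[judged == 1]

# 107: second target data = first target data + correct items of second set.
second_target = np.vstack([first_target, second_set[correct == 1]])
print(second_target.shape)
```

The merged `second_target` plays the role of the clean data left after cleaning.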
Referring to fig. 3, fig. 3 is a third schematic flow chart of a data processing method according to an embodiment of the present application, where the flow chart may include:
201. the electronic device obtains a plurality of data, and the plurality of data carry the same category label.
For example, the electronic device may obtain 1000 pictures that need data cleaning. The 1000 pictures carry the same category label; for example, all of them have the same manually labeled flower category label. The 1000 pictures are denoted P_1, P_2, P_3, ..., P_1000.
202. The electronic device divides the plurality of data into a first data set and a second data set.
For example, after the 1000 pictures are acquired, the electronic device may divide the 1000 pictures into a first data set and a second data set. For example, the electronic device may randomly extract 800 pictures from the 1000 pictures into a first data set and group the remaining 200 pictures into a second data set.
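The random 800/200 split described above can be sketched as follows (an illustrative sketch; the 4:1 proportion is simply the example used in this description):

```python
import random

def split_data(data, first_size, seed=0):
    """Randomly split `data` into a first set of `first_size` items
    and a second set containing the remaining items."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:first_size], shuffled[first_size:]

pictures = [f"P{i}" for i in range(1, 1001)]   # P1 ... P1000
first_set, second_set = split_data(pictures, 800)
print(len(first_set), len(second_set))         # 800 200
```

Every picture lands in exactly one of the two sets, which is what the later merge in step 208 relies on.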
After dividing 1000 pictures into a first data set and a second data set, the electronic device may detect whether its current computing power is below a preset threshold.
If the current computing capability is detected to be lower than the preset threshold, the current computing capability of the electronic device can be considered to be weak. In this case, 203 may be entered.
If the current computing power is detected to be not lower than the preset threshold, the current computing power of the electronic equipment can be considered to be stronger. In this case, 204 may be entered.
203. When the computing power of the electronic device is lower than a preset threshold value, the electronic device extracts the features of each data in the first data set and the second data set by using a preset feature extraction model.
For example, if the electronic device detects that the current computing power is lower than a preset threshold, the electronic device may obtain a preset feature extraction model, and extract features of each picture in the first data set and the second data set by using the preset feature extraction model.
In an implementation manner, the embodiment of the present application may obtain the preset feature extraction model by:
the electronic equipment acquires a first model, wherein the first model is a ResNet model obtained according to ImageNet training;
the electronic equipment performs learning training on the ResNet model by using the data to obtain a second model;
and the electronic equipment removes the full connection layer positioned at the last layer of the second model to obtain a third model, and determines the third model as a preset feature extraction model.
For example, when the data is a picture, that is, the data that needs to be cleaned is a picture, the electronic device may first obtain a first model, where the first model is a ResNet model trained according to ImageNet.
It should be noted that the ImageNet project is a large visual database for visual object recognition research. More than 14 million image URLs have been manually annotated by ImageNet to indicate the objects in each picture. Since 2010, the ImageNet project has held an annual competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which software programs compete to correctly classify and detect objects and scenes.
ResNet (Residual Neural Network) successfully trained a 152-layer neural network using residual units and won the ILSVRC 2015 classification competition. The residual structure of ResNet greatly accelerates the training of deep neural networks and substantially improves model accuracy.
That is, ImageNet is an open, free, large-scale picture database containing classified pictures in about 22,000 categories, and ResNet is a picture classification model trained on the data in ImageNet.
For example, after obtaining the ResNet model, the electronic device may perform machine learning training on the ResNet model using the pictures that need data cleaning (such as the 1000 pictures above), so as to obtain a second model. After obtaining the second model, the electronic device may remove the fully connected layer located at the last layer of the second model, so as to obtain a third model, and determine the third model as the preset feature extraction model. It should be noted that the last layer of the ResNet model is a fully connected layer whose function is to classify pictures, while the other neural network layers of the ResNet model perform feature extraction; therefore, the network obtained by removing the last fully connected layer of the second model can serve as a feature extraction model. In addition, the ResNet model is retrained with the pictures that need data cleaning because ResNet is a relatively general classification model: retraining it on these pictures yields a second model that classifies them in a more targeted way, so that the third model extracts their features more accurately.
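The idea of "remove the final fully connected layer and keep the remaining layers as a feature extractor" can be illustrated with a toy two-layer network in plain NumPy. This is a conceptual sketch only; a real implementation would drop the final layer of a trained ResNet rather than use random weights:

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "trained" network standing in for the second model: a hidden
# layer followed by a final fully connected classification layer.
W_hidden = rng.normal(size=(64, 8))   # maps 64-dim input to 8-dim features
W_fc = rng.normal(size=(8, 2))        # final fully connected layer (classifier)

def full_model(x):
    features = np.maximum(x @ W_hidden, 0.0)   # ReLU hidden activations
    return features @ W_fc                     # class scores

def feature_extractor(x):
    # "Third model" = second model minus its last fully connected layer:
    # the penultimate activations serve as the picture's feature vector.
    return np.maximum(x @ W_hidden, 0.0)

picture = rng.normal(size=64)
print(feature_extractor(picture).shape)   # (8,)
```

The classifier and the extractor share all weights except the discarded last layer, mirroring how the third model is derived from the second model.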
204. And when the computing power of the electronic equipment is not lower than a preset threshold value, the electronic equipment acquires a fourth model, and extracts the features of each data in the first data set and the second data set by using the fourth model, wherein the feature extraction precision of the fourth model is higher than that of a preset feature extraction model.
For example, when the electronic device detects that the current computing power of the electronic device is not lower than the preset threshold, the electronic device may acquire a fourth model, and extract features of each picture in the first data set and the second data set by using the fourth model, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
For example, the fourth model may be a single model with a more complex structure than the ResNet model used in this embodiment, such as Inception-ResNet-v2. Alternatively, the fourth model may be a fusion (stacking) of multiple models; the structure of such a fourth model may be as shown in fig. 4. The picture data is input into a plurality of primary models (Level 1) simultaneously, the features extracted by the primary models are used as the input of a secondary model (Level 2), and the output of the secondary model is taken as the output features. Model 1, Model 2 and Model 3 can be chosen from common deep learning models such as ResNet, Inception and MobileNet, and Model 4 can be a simpler traditional machine learning model such as linear regression. The fusion of multiple models combines their individual advantages, so its feature extraction capability is stronger and the subsequent cleaning effect is better, but it consumes more resources; it is therefore suitable when the computing power of the electronic device is sufficient.
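The stacking arrangement of fig. 4 can be sketched schematically as follows. The primary models are replaced here by random-projection stand-ins; a real system would use trained ResNet, Inception and MobileNet backbones, and the secondary model's weights would come from fitting a model such as linear regression:

```python
import numpy as np

rng = np.random.default_rng(1)

# Level-1 "primary models": stand-ins for ResNet / Inception / MobileNet,
# each mapping a 64-dim picture representation to its own 8-dim output.
primary_weights = [rng.normal(size=(64, 8)) for _ in range(3)]

def primary_features(x):
    # Run all primary models on the same input and concatenate outputs.
    return np.concatenate([np.maximum(x @ W, 0.0) for W in primary_weights])

# Level-2 "secondary model": a simple linear map whose output is used
# as the fused feature vector of the picture.
W_secondary = rng.normal(size=(24, 8))

def fused_features(x):
    return primary_features(x) @ W_secondary

picture = rng.normal(size=64)
print(fused_features(picture).shape)   # (8,)
```

The extra compute cost is visible directly: every picture is pushed through all primary models before the secondary model runs, which is why the patent reserves this variant for devices with sufficient computing power.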
In one embodiment, the computing power of the electronic device may be measured, for example, by the CPU occupancy rate, and/or the amount of free operating memory, and/or the ratio of free operating memory to total operating memory.
205. The electronic equipment acquires the correctness information of the category label of each data in the second data set.
For example, after the first data set and the second data set are divided, the electronic device may obtain the correctness information of the category label of each picture in the second data set.
For example, 200 pictures are included in the second data set; it can then be determined, by manual inspection, whether the category label of each of the 200 pictures is correct. If the category label of a picture is correct, the inspector can, through the electronic device, mark the picture with corresponding information indicating that the category label is correct, such as the number "1" or the letter "T". If the category label of a picture is wrong, the inspector can, through the electronic device, mark the picture with corresponding information indicating that the category label is wrong, such as the number "0" or the letter "F". In this way, the electronic device can acquire the information about whether the category label of each of the 200 pictures is correct.
206. And according to the correctness information of the category label of each data in the second data set and the characteristics of each data, the electronic equipment trains a preset two-classification model to obtain a target model.
For example, after the information about the correctness of the category label of each picture in the second data set is obtained, the electronic device may perform learning training on a preset binary model according to the information about the correctness of the category label of each picture in the second data set and the characteristics of each picture in the second data set, so as to obtain a model subjected to learning training, that is, a target model.
For example, the preset binary classification model may be a Support Vector Machine (SVM). After the information about whether the category label of the 200 pictures in the second data set is correct or not is acquired, the electronic device may input the information about whether the category label of the 200 pictures is correct or not and the characteristics of the 200 pictures as input data into a preset SVM model to perform learning training on the SVM model, so as to obtain a target model.
In one embodiment, for example, picture P_i is a picture in the second data set. Then the feature f_i of picture P_i and the information b_i of whether the class label of picture P_i is correct can be expressed in the form <f_i, b_i>, and <f_i, b_i> can be used as one learning sample of the SVM model.
It can be understood that the target model obtained through the learning training is a model that can output information whether the class label of the picture is correct according to the picture feature.
In some embodiments, the preset binary classification model may also be a model such as a multi-layer perceptron (MLP), a decision tree (Decision Tree), or a random forest (Random Forest).
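Training the binary classification model on the <f_i, b_i> samples can be sketched with scikit-learn. This is an illustrative sketch: the feature matrix and correctness labels below are random stand-ins for the extracted picture features and the manual inspection results:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)

# <f_i, b_i> samples from the second data set: f_i is a picture's
# feature vector, b_i is 1 if its class label is correct, else 0.
f = rng.normal(size=(200, 16))
b = (f[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# "Preset binary classification model" becomes the "target model"
# after learning training on the 200 samples.
target_model = SVC(kernel="rbf").fit(f, b)

# The target model now predicts, from a feature vector alone, whether
# a picture's class label is correct (1) or wrong (0).
print(target_model.predict(f[:5]))
```

Swapping `SVC` for `MLPClassifier`, `DecisionTreeClassifier` or `RandomForestClassifier` covers the alternative models mentioned above without changing the rest of the flow.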
207. The electronic equipment obtains first target data, of which the class labels are judged to be correct, in the first data set by using the target model and the characteristics of each data in the first data set.
For example, after obtaining the target model, the electronic device may obtain, by using the target model and the features of each picture in the first data set, a picture in the first data set, where the category label is determined to be correct, that is, the first target data.
For example, the first data set contains 800 pictures. Then, after obtaining the target model, the electronic device may input the features of each of the 800 pictures into the target model, output, by the target model, information whether the category label of each picture is correct according to the features of the picture, and determine the picture with the determined correct category label as the first target data. It is to be understood that the first target data may include a plurality of pictures.
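Selecting the first target data with the target model can be sketched as follows. The snippet is self-contained and illustrative: it trains a stand-in target model on simulated second-set data before filtering the first set:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Stand-in target model trained on the second data set (cf. step 206).
second_f = rng.normal(size=(200, 16))
second_b = (second_f[:, 0] > 0).astype(int)
target_model = SVC().fit(second_f, second_b)

# Step 207: feed each first-set picture's features to the target model
# and keep the pictures whose class label it judges to be correct.
first_f = rng.normal(size=(800, 16))
judged_correct = target_model.predict(first_f) == 1
first_target_data = first_f[judged_correct]
print(first_target_data.shape[1])   # 16
```

The boolean mask keeps the mapping from judgments back to pictures explicit, which also makes the complementary "third target data" (step described later) a one-line negation.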
208. And according to the first target data and the correct data of the class label in the second data set, the electronic equipment obtains second target data.
For example, after obtaining the first target data, the electronic device may further obtain a picture with a correct category label in the second data set, and obtain the second target data according to the first target data and the picture with the correct category label in the second data set. It can be understood that the second target data is a clean picture obtained after the data cleaning process.
For example, in 205, the electronic device acquires the information about whether the class labels of the 200 pictures in the second data set are correct, and the class labels of 190 of those 200 pictures are correct. In 207, the electronic device determines, using the target model, that the class labels of 790 of the 800 pictures in the first data set are correct. The electronic device may then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures. These 980 pictures can be regarded as the clean pictures obtained after data cleaning.
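The merge in step 208 is a simple union of the two clean subsets, as in this sketch (the 190/790 sizes follow the example above; the picture names are placeholders):

```python
# Clean pictures from the second set (190, verified by manual inspection)
# and from the first set (790, judged correct by the target model).
second_correct = [f"S{i}" for i in range(190)]
first_target = [f"F{i}" for i in range(790)]

# Step 208: second target data = union of both clean subsets.
second_target = first_target + second_correct
print(len(second_target))   # 980
```

Because the first and second data sets are disjoint by construction (step 202), the concatenation contains no duplicates.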
In some embodiments, the embodiments of the present application may further include:
the electronic device acquires third target data, whose category labels are judged to be incorrect, in the first data set by using the target model and the features of each data in the first data set;
and the electronic device deletes the third target data and the data with incorrect category labels in the second data set.
For example, after obtaining the target model, the electronic device may input the feature of each of 800 pictures in the first data set into the target model, output, by the target model, information on whether the category label of the picture is correct according to the feature of each picture, and determine, as the third target data, a picture with a determined category label that is incorrect. It is to be understood that the third target data may include a plurality of pictures.
After the third target data is obtained, the electronic device may further obtain the pictures with incorrect category labels in the second data set. The electronic device may then delete the third target data and the pictures with incorrect category labels in the second data set. It can be understood that the third target data and the mislabeled data in the second data set can be regarded as the "dirty data" removed by the data cleaning process, that is, data whose category label does not match its actual category. For example, if a picture of a tree is incorrectly labeled with the "flower" category, that picture of a tree is dirty data.
In some embodiments, when the application divides the first data set and the second data set, the first data set and the second data set may satisfy the following condition:
the ratio of the amount of data contained in the first data set to the amount contained in the second data set is a preset ratio, and the first data set contains more data than the second data set.
For example, the preset ratio of the amount of data in the first data set to that in the second data set may be 8:2, 9:1, 7.5:2.5, and so on, with the first data set containing the larger share.
In another embodiment, the number of data with correct category labels and the number of data with incorrect category labels in the second data set may both satisfy a numerical condition: each may be greater than or equal to 100. For example, the second data set contains no fewer than 100 pictures with correct category labels and no fewer than 100 pictures with incorrect category labels.
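A minimal sketch of a random split honoring the preset ratio is shown below. The function name and the size check are my own; note that the `2 * min_each` floor is only a necessary condition for the second set to be able to contain at least 100 correctly and 100 incorrectly labeled items — whether it actually does depends on the data.

```python
import random

def split_dataset(items, ratio=(8, 2), min_each=100, seed=0):
    # Illustrative helper (not from the patent): split items randomly so that
    # first:second matches the preset ratio, the first set being the larger.
    a, b = ratio
    n_second = len(items) * b // (a + b)
    if n_second < 2 * min_each:
        raise ValueError("second set cannot satisfy the >=100/>=100 condition")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_second:], shuffled[:n_second]

first_set, second_set = split_dataset(range(1000))  # 1000 pictures by index
print(len(first_set), len(second_set))  # 800 200
```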
Referring to fig. 5 to 7, fig. 5 to 7 are schematic views of a data processing method according to an embodiment of the present disclosure.
For example, as shown in fig. 5, the user currently needs to perform data cleaning processing on 1000 pictures in the picture set P, and the 1000 pictures are labeled with the same category label. Then, the electronic device may first acquire the 1000 pictures.
Thereafter, the electronic device may randomly divide the 1000 pictures into a first picture set and a second picture set, as shown in fig. 6. The first picture set comprises 800 pictures, and the second picture set comprises 200 pictures.
Then, the electronic device may perform feature extraction on each picture in the first picture set and the second picture set by using a preset feature extraction model. For example, feature f_i is the feature of picture P_i, where i is an integer greater than or equal to 1.
After the features of the pictures are extracted, whether the category label of each of the 200 pictures in the second picture set is correct can be determined by manual inspection. If the category label of a picture is correct, the inspector can mark the picture, through the electronic device, with corresponding information indicating that the category label is correct, such as the number "1". If the category label of a picture is wrong, the inspector can mark the picture with corresponding information indicating that the category label is wrong, such as the number "0". In this way, the electronic device can acquire the information about whether the category labels of the 200 pictures are correct. For example, 190 of the 200 pictures are marked with the number "1", that is, manual inspection confirms that the category labels of those 190 pictures in the second picture set are correct.
After the information about whether the category labels of the 200 pictures in the second picture set are correct is acquired, the electronic device may take this information, together with the features of the 200 pictures, as input data for a preset SVM model, and perform learning training on the SVM model to obtain a target model. For example, if picture P_i is a picture in the second picture set, then the feature f_i of picture P_i and the information b_i indicating whether the category label of picture P_i is correct can be expressed in the form <f_i, b_i>, and <f_i, b_i> can be used as one piece of training sample data for the SVM model.
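The <f_i, b_i> training samples can be sketched as follows with scikit-learn's SVC. The 2-dimensional features are toy placeholders of my own; real f_i would come from the preset feature extraction model.

```python
import numpy as np
from sklearn.svm import SVC

# Each manually checked picture contributes one sample <f_i, b_i>, where f_i is
# the extracted feature of picture P_i and b_i is 1 ("category label correct")
# or 0 ("category label wrong").
samples = [
    (np.array([0.9, 0.1]), 1),
    (np.array([0.8, 0.2]), 1),
    (np.array([0.1, 0.9]), 0),
    (np.array([0.2, 0.8]), 0),
]
X = np.stack([f for f, _ in samples])
y = np.array([b for _, b in samples])

target_model = SVC(kernel="rbf").fit(X, y)  # the trained "target model"
print(target_model.predict([[0.85, 0.15]])[0])  # near the b=1 samples -> 1
```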
After obtaining the target model, the electronic device may input the feature of each of 800 pictures in the first picture set into the target model, output, by the target model, information whether the category label of the picture is correct according to the feature of each picture, and determine the picture with the determined correct category label as the first target data. For example, the electronic device eventually determines that the category label of 790 pictures in the first picture set is correct.
After that, the electronic device may merge the 190 pictures with correct category labels in the second picture set with the 790 pictures judged to have correct category labels in the first picture set, so as to obtain 980 pictures. These 980 pictures can be considered the clean pictures obtained after data cleaning. For example, as shown in FIG. 7, the electronic device composes the 980 pictures into a picture set.
Referring to fig. 8, fig. 8 is a fourth flowchart illustrating a data processing method according to the present embodiment.
In this embodiment, the electronic device may perform data cleaning by using a trained binary classification model. Because the trained binary classification model can quickly determine which data carry correct category labels, this embodiment can quickly obtain clean data. Compared with the data cleaning approach in the related art, in which inspectors browse the data one by one to manually check whether the label information is wrong, this embodiment greatly reduces the manual workload, improves the efficiency of data cleaning, and lowers its cost.
In addition, performing the data cleaning work by means of binary classification can achieve an accuracy close to that of manual cleaning. Moreover, the data cleaning process provided by this embodiment is traceable, so other personnel can verify the cleaning quality by reviewing the process.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. The data processing apparatus 300 may include: a first obtaining module 301, a dividing module 302, an extracting module 303, a second obtaining module 304, a training module 305, a third obtaining module 306, and a processing module 307.
A first obtaining module 301, configured to obtain multiple data, where the multiple data carry the same category label;
a dividing module 302 for dividing the plurality of data into a first data set and a second data set;
an extracting module 303, configured to extract features of each of the first data set and the second data set;
a second obtaining module 304, configured to obtain information about correctness of the category label of each data in the second data set;
a training module 305, configured to train a preset two-class model according to the correctness information of the category label of each data in the second data set and the characteristics of each data, so as to obtain a target model;
a third obtaining module 306, configured to obtain, by using the target model and features of each data in the first data set, first target data in the first data set, where a category label is determined to be correct;
the processing module 307 is configured to obtain second target data according to the first target data and the data with correct category label in the second data set.
In one embodiment, the first obtaining module 301 may be configured to:
when the data are pictures, acquiring a first model, wherein the first model is a ResNet model obtained according to ImageNet training;
performing learning training on the ResNet model by using the data to obtain a second model;
removing a full-connection layer positioned at the last layer of the second model to obtain a third model, and determining the third model as a preset feature extraction model;
then, the extraction module 303 may be configured to: and extracting the characteristics of each data in the first data set and the second data set by using the preset characteristic extraction model.
In one embodiment, the extraction module 303 may be configured to:
and when the computing power of the electronic equipment is lower than a preset threshold value, extracting the feature of each data in the first data set and the second data set by using the preset feature extraction model.
In one embodiment, the extraction module 303 may be configured to:
and when the computing power of the electronic equipment is not lower than the preset threshold, acquiring a fourth model, and extracting the feature of each data in the first data set and the second data set by using the fourth model, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
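The threshold-based choice between the preset feature extraction model and the fourth model amounts to a simple dispatch; the function and argument names below are hypothetical, and the two lambdas merely stand in for models sharing a call interface.

```python
def choose_feature_extractor(compute_power, threshold, preset_model, fourth_model):
    # Hypothetical dispatch (names are mine): below the preset threshold use
    # the lighter preset feature extraction model; otherwise use the
    # higher-precision fourth model.
    return preset_model if compute_power < threshold else fourth_model

light = lambda picture: ("preset", picture)
heavy = lambda picture: ("fourth", picture)
extractor = choose_feature_extractor(compute_power=3.0, threshold=5.0,
                                     preset_model=light, fourth_model=heavy)
print(extractor("P_1")[0])  # below threshold, so the preset model is chosen
```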
In one embodiment, the ratio of the amount of data contained in the first data set to the amount of data contained in the second data set is a preset ratio, and the amount of data contained in the first data set is larger than the amount of data contained in the second data set.
In one embodiment, the processing module 307 is further configured to:
acquiring third target data of which the class labels are judged to be errors in the first data set by using the target model and the characteristics of each data in the first data set;
and deleting the third target data and the data with the wrong class label in the second data set.
In one embodiment, the preset two-classification model comprises at least one of a support vector machine, a multi-layer perceptron, a decision tree, or a random forest.
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed on a computer, the computer is caused to execute the flow in the data processing method provided in this embodiment.
The embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the processor is configured to execute the flow in the data processing method provided in the embodiment by calling the computer program stored in the memory.
For example, the electronic device may be a mobile terminal such as a tablet computer or a smart phone. Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
The electronic device 400 may include components such as a display 401, memory 402, processor 403, and the like. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 10 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The display 401 may be used to display information such as text.
The memory 402 may be used to store applications and data. The memory 402 stores applications containing executable code. The application programs may constitute various functional modules. The processor 403 executes various functional applications and data processing by running an application program stored in the memory 402.
The processor 403 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing an application program stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402, so as to execute:
acquiring a plurality of data, wherein the plurality of data carry the same class label;
dividing the plurality of data into a first data set and a second data set;
extracting features of each of the first data set and the second data set;
acquiring the correctness information of the category label of each data in the second data set;
training a preset two-classification model according to the correctness information of the class label of each data in the second data set and the characteristics of each data to obtain a target model;
acquiring first target data with a category label judged to be correct in the first data set by using the target model and the characteristics of each data in the first data set;
and obtaining second target data according to the first target data and the data with correct class labels in the second data set.
Referring to fig. 11, the electronic device 400 may include a display 401, a memory 402, a processor 403, an input unit 404, a power supply 405, and the like.
The display 401 may be used to display information such as text.
The memory 402 may be used to store applications and data. The memory 402 stores applications containing executable code. The application programs may constitute various functional modules. The processor 403 executes various functional applications and data processing by running the application program stored in the memory 402.
The processor 403 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing an application program stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device.
The input unit 404 may be used to receive input numbers, character information, or user characteristic information, such as a fingerprint, and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
A power supply 405 may be used to provide power guarantees for the various components.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402, so as to execute:
acquiring a plurality of data, wherein the plurality of data carry the same class label;
dividing the plurality of data into a first data set and a second data set;
extracting features of each of the first data set and the second data set;
acquiring the correctness information of the category label of each data in the second data set;
training a preset two-classification model according to the correctness information of the class label of each data in the second data set and the characteristics of each data to obtain a target model;
acquiring first target data with a category label judged to be correct in the first data set by using the target model and the characteristics of each data in the first data set;
and obtaining second target data according to the first target data and the data with correct class labels in the second data set.
In one embodiment, the processor 403 may be further configured to: when the data are pictures, acquiring a first model, wherein the first model is a ResNet model obtained according to ImageNet training; performing learning training on the ResNet model by using the data to obtain a second model; removing a full-connection layer positioned at the last layer of the second model to obtain a third model, and determining the third model as a preset feature extraction model;
then, the processor 403, when performing the feature extraction on each data in the first data set and the second data set, may perform: and extracting the characteristics of each data in the first data set and the second data set by using the preset characteristic extraction model.
In one embodiment, when the processor 403 performs extracting the feature of each of the first data set and the second data set by using the preset feature extraction model, the following steps may be performed: and when the computing power of the electronic equipment is lower than a preset threshold value, extracting the features of each data in the first data set and the second data set by using the preset feature extraction model.
In one embodiment, the processor 403 may further perform: and when the computing power of the electronic equipment is not lower than the preset threshold, acquiring a fourth model, and extracting the feature of each data in the first data set and the second data set by using the fourth model, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
In one embodiment, the ratio of the amount of data contained in the first data set to the amount of data contained in the second data set is a preset ratio, and the amount of data contained in the first data set is larger than the amount of data contained in the second data set.
In one embodiment, the processor 403 may further perform: acquiring third target data of which the class labels are judged to be errors in the first data set by using the target model and the characteristics of each data in the first data set; and deleting the third target data and the data with the wrong class label in the second data set.
In one embodiment, the preset two-classification model comprises at least one of a support vector machine, a multi-layer perceptron, a decision tree, or a random forest.
In the above embodiments, the descriptions of the embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description of the data processing method, and are not described herein again.
The data processing apparatus provided in the embodiment of the present application and the data processing method in the above embodiment belong to the same concept, and any method provided in the embodiment of the data processing method may be run on the data processing apparatus, and a specific implementation process thereof is described in the embodiment of the data processing method in detail, and is not described herein again.
It should be noted that, for the data processing method described in the embodiment of the present application, it can be understood by those skilled in the art that all or part of the process of implementing the data processing method described in the embodiment of the present application can be completed by controlling the relevant hardware through a computer program, where the computer program can be stored in a computer-readable storage medium, such as a memory, and executed by at least one processor, and during the execution, the process of the embodiment of the data processing method can be included. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
In the data processing apparatus according to the embodiment of the present application, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, or the like.
The foregoing describes in detail a data processing method, an apparatus, a storage medium, and an electronic device provided in an embodiment of the present application, and specific examples are applied herein to explain principles and implementations of the present application, and the above description of the embodiments is only used to help understand the method and its core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (9)

1. A data processing method, comprising:
acquiring a plurality of data, wherein the plurality of data carry the same class label;
when the data are determined to be pictures, acquiring a first model, wherein the first model is a ResNet model obtained according to ImageNet training; performing learning training on the ResNet model by using the data to obtain a second model; removing a full-connection layer positioned at the last layer of the second model to obtain a third model, and determining the third model as a preset feature extraction model;
dividing the plurality of data into a first data set and a second data set;
extracting the feature of each data in the first data set and the second data set by using the preset feature extraction model;
acquiring the correctness information of the category label of each data in the second data set;
training a preset two classification model according to the correctness information of the class label of each data in the second data set and the characteristics of each data to obtain a target model;
inputting the characteristics of each data in the first data set into the target model, outputting the correctness information of the class label of each data in the first data set by using the target model to obtain the data of which the class label is judged to be correct in the first data set, and determining the data of which the class label is judged to be correct in the first data set as first target data;
and obtaining second target data according to the first target data and the data with correct class labels in the second data set.
2. The data processing method of claim 1, wherein extracting features of each of the first data set and the second data set using the predetermined feature extraction model comprises:
and when the computing power of the electronic equipment is lower than a preset threshold value, extracting the features of each data in the first data set and the second data set by using the preset feature extraction model.
3. The data processing method of claim 2, wherein the method further comprises:
and when the computing power of the electronic equipment is not lower than the preset threshold, acquiring a fourth model, and extracting the feature of each data in the first data set and the second data set by using the fourth model, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
4. The data processing method according to claim 1, wherein a ratio of the amount of data included in the first data set to the amount of data included in the second data set is a preset ratio, and the amount of data included in the first data set is larger than that of the second data set.
5. The data processing method of claim 1, wherein the method further comprises:
acquiring third target data of which the class labels are judged to be errors in the first data set by using the target model and the characteristics of each data in the first data set;
and deleting the third target data and the data with the wrong category label in the second data set.
6. The data processing method of claim 1, wherein the preset two-classification model comprises at least one of a support vector machine, a multi-layer perceptron, a decision tree, or a random forest.
7. A data processing apparatus, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of data, and the plurality of data carry the same category label; when the data are determined to be pictures, acquiring a first model, wherein the first model is a ResNet model obtained according to ImageNet training; performing learning training on the ResNet model by using the data to obtain a second model; removing a full connection layer positioned at the last layer of the second model to obtain a third model, and determining the third model as a preset feature extraction model;
a dividing module for dividing the plurality of data into a first data set and a second data set;
the extraction module is used for extracting the characteristics of each data in the first data set and the second data set by using the preset characteristic extraction model;
the second acquisition module is used for acquiring the correctness information of the category label of each data in the second data set;
the training module is used for training a preset two-classification model according to the correctness information of the class label of each data in the second data set and the characteristics of each data to obtain a target model;
the third acquisition module is used for inputting the characteristics of each data in the first data set into the target model, outputting the correctness information of the class label of each data in the first data set by using the target model to obtain the data of which the class label is judged to be correct in the first data set, and determining the data of which the class label is judged to be correct in the first data set as the first target data;
and the processing module is used for obtaining second target data according to the first target data and the data with correct class labels in the second data set.
8. A storage medium having stored thereon a computer program, characterized in that the computer program, when executed on a computer, causes the computer to execute the method according to any of claims 1 to 6.
9. An electronic device comprising a memory, a processor, wherein the processor is configured to perform the method of any of claims 1 to 6 by invoking a computer program stored in the memory.
CN201910713784.5A 2019-08-02 2019-08-02 Data processing method and device, storage medium and electronic equipment Active CN110490237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910713784.5A CN110490237B (en) 2019-08-02 2019-08-02 Data processing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN110490237A CN110490237A (en) 2019-11-22
CN110490237B true CN110490237B (en) 2022-05-17

Family

ID=68549273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910713784.5A Active CN110490237B (en) 2019-08-02 2019-08-02 Data processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110490237B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460195B (en) * 2020-03-26 2023-08-01 Oppo广东移动通信有限公司 Picture processing method and device, storage medium and electronic equipment
CN112734035B (en) * 2020-12-31 2023-10-27 成都佳华物链云科技有限公司 Data processing method and device and readable storage medium
CN113204660B (en) * 2021-03-31 2024-05-17 北京达佳互联信息技术有限公司 Multimedia data processing method, tag identification device and electronic equipment
CN113128979A (en) * 2021-05-17 2021-07-16 中铁高新工业股份有限公司 Scientific research aid decision-making system based on big data

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6188400B2 (en) * 2013-04-26 2017-08-30 オリンパス株式会社 Image processing apparatus, program, and image processing method
CN106650780B (en) * 2016-10-18 2021-02-12 腾讯科技(深圳)有限公司 Data processing method and device, classifier training method and system
CN108764372B (en) * 2018-06-08 2019-07-16 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing
CN109213862B (en) * 2018-08-21 2020-11-24 北京京东尚科信息技术有限公司 Object recognition method and device, and computer-readable storage medium
CN109447717A (en) * 2018-11-12 2019-03-08 万惠投资管理有限公司 A kind of determination method and system of label
CN109753498A (en) * 2018-12-11 2019-05-14 中科恒运股份有限公司 data cleaning method and terminal device based on machine learning

Also Published As

Publication number Publication date
CN110490237A (en) 2019-11-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant