CN110490237A - Data processing method, device, storage medium and electronic equipment - Google Patents

Publication number
CN110490237A
Authority
CN
China
Prior art keywords
data
data set
model
feature
class label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910713784.5A
Other languages
Chinese (zh)
Other versions
CN110490237B (en
Inventor
罗彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinsheng Communication Technology Co Ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co Ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co Ltd and Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910713784.5A
Publication of CN110490237A
Application granted
Publication of CN110490237B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

This application discloses a data processing method, a device, a storage medium and electronic equipment. The method comprises: obtaining multiple data items that carry the same class label; dividing the multiple data items into a first data set and a second data set; extracting the feature of each data item in the first data set and the second data set; obtaining correctness information for the class label of each data item in the second data set; training a preset binary classification model on the correctness information and the features of the data items in the second data set to obtain a target model; using the target model and the feature of each data item in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct; and obtaining second target data from the first target data and the data in the second data set whose class labels are correct. The application can improve the efficiency of data cleansing.

Description

Data processing method, device, storage medium and electronic equipment
Technical field
The application belongs to the field of data technology, and in particular relates to a data processing method, a device, a storage medium and electronic equipment.
Background
Data cleansing is the process of re-examining and verifying data in order to delete erroneous information from a data set. Taking the cleansing of classified pictures as an example, the main task is to check whether the classification label of each picture is correct and to delete the pictures whose classification labels are wrong. In the related art, however, data cleansing is inefficient.
Summary of the invention
The embodiments of the application provide a data processing method, a device, a storage medium and electronic equipment that can improve the efficiency of data cleansing.
An embodiment of the application provides a data processing method, comprising:
obtaining multiple data items, the multiple data items carrying the same class label;
dividing the multiple data items into a first data set and a second data set;
extracting the feature of each data item in the first data set and the second data set;
obtaining correctness information for the class label of each data item in the second data set;
training a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model;
using the target model and the feature of each data item in the first data set to obtain first target data, the first target data being the data in the first data set whose class labels are judged to be correct;
obtaining second target data according to the first target data and the data in the second data set whose class labels are correct.
An embodiment of the application provides a data processing device, comprising:
a first obtaining module, configured to obtain multiple data items, the multiple data items carrying the same class label;
a division module, configured to divide the multiple data items into a first data set and a second data set;
an extraction module, configured to extract the feature of each data item in the first data set and the second data set;
a second obtaining module, configured to obtain correctness information for the class label of each data item in the second data set;
a training module, configured to train a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model;
a third obtaining module, configured to use the target model and the feature of each data item in the first data set to obtain first target data, the first target data being the data in the first data set whose class labels are judged to be correct;
a processing module, configured to obtain second target data according to the first target data and the data in the second data set whose class labels are correct.
An embodiment of the application provides a storage medium on which a computer program is stored. When the computer program is executed on a computer, it causes the computer to execute the process in the data processing method provided by the embodiments of the application.
An embodiment of the application also provides electronic equipment comprising a memory and a processor. By calling the computer program stored in the memory, the processor executes the process in the data processing method provided by the embodiments of the application.
In the embodiments of the application, the electronic equipment can perform data cleansing with a trained binary classification model. Because the trained binary classification model can quickly identify the data whose class labels are correct, the embodiments can quickly obtain clean data. Compared with the related-art approach of manually browsing the data one by one to check whether the label information is wrong, the embodiments can improve the efficiency of data cleansing.
Brief description of the drawings
The technical solution of the application and its advantages will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
Fig. 1 is a first flow diagram of the data processing method provided by the embodiments of the application.
Fig. 2 is a second flow diagram of the data processing method provided by the embodiments of the application.
Fig. 3 is a third flow diagram of the data processing method provided by the embodiments of the application.
Fig. 4 is a structural diagram of the fourth model provided by the embodiments of the application.
Fig. 5 to Fig. 7 are scenario diagrams of the data processing method provided by the embodiments of the application.
Fig. 8 is a fourth flow diagram of the data processing method provided by the embodiments of the application.
Fig. 9 is a structural diagram of the data processing device provided by the embodiments of the application.
Fig. 10 is a structural diagram of the electronic equipment provided by the embodiments of the application.
Fig. 11 is another structural diagram of the electronic equipment provided by the embodiments of the application.
Detailed description
Referring to the drawings, in which like reference numerals denote like components, the principles of the application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the application and should not be regarded as limiting other specific embodiments not detailed herein.
It can be understood that the embodiments of the application may be executed by electronic equipment such as a smartphone, tablet computer, desktop computer or server.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a first flow diagram of the data processing method provided by the embodiments of the application, and Fig. 2 is a second flow diagram of the same method. The process may include:
101. Obtain multiple data items, the multiple data items carrying the same class label.
Data cleansing is the process of re-examining and verifying data in order to delete erroneous information from a data set. Taking the cleansing of classified pictures as an example, the related art mainly relies on manual inspection: a person checks whether the classification label of each picture is correct and deletes the pictures whose classification labels are wrong. As a result, data cleansing in the related art is inefficient.
In the embodiments of the application, the electronic equipment may first obtain multiple data items that carry the same class label. It can be understood that these are the data that need to be cleansed. For example, the electronic equipment may obtain a data set that needs data cleansing.
For example, the data to be cleansed may be a picture set in which all pictures have the same class label, such as the flowers category.
102. Divide the multiple data items into a first data set and a second data set.
For example, after obtaining the data to be cleansed, the electronic equipment may divide the data into a first data set and a second data set.
For example, if the data to be cleansed are 1000 pictures, the electronic equipment may divide these 1000 pictures into a first data set (a first picture set) and a second data set (a second picture set).
103. Extract the feature of each data item in the first data set and the second data set.
For example, after dividing the data into the first data set and the second data set, the electronic equipment may extract the feature of each data item in the first data set and the feature of each data item in the second data set.
For example, the electronic equipment may extract the feature of each picture in the first data set and the feature of each picture in the second data set.
104. Obtain the correctness information of the class label of each data item in the second data set.
For example, after obtaining the second data set, the electronic equipment may obtain the correctness information of the class label of each data item in the second data set, that is, whether the class label of each picture in the second data set is correct.
For example, if the second data set contains 200 pictures, manual inspection may first be used to determine whether the class label of each of these 200 pictures is correct. If the class label of a picture is correct, the inspector may use the electronic equipment to mark the picture with information indicating a correct class label, such as the digit "1" or the letter "T". If the class label of a picture is wrong, the inspector may use the electronic equipment to mark the picture with information indicating a wrong class label, such as the digit "0" or the letter "F". In this way, the electronic equipment obtains the class label correctness information of these 200 pictures.
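As a minimal sketch of how the inspector's marks could be turned into correctness information, the following assumes the marks are stored as strings keyed by a picture identifier. The marks "1"/"T" and "0"/"F" come from the text above; the function names and the dictionary layout are hypothetical.

```python
# Marks named in the text; the storage format around them is an assumption.
CORRECT_MARKS = {"1", "T"}
WRONG_MARKS = {"0", "F"}


def parse_mark(mark: str) -> bool:
    """Return True if the mark says the class label is correct."""
    value = mark.strip().upper()
    if value in CORRECT_MARKS:
        return True
    if value in WRONG_MARKS:
        return False
    raise ValueError(f"unrecognized correctness mark: {mark!r}")


def collect_correctness(marks: dict[str, str]) -> dict[str, bool]:
    """Map each picture id (hypothetical key) to its label correctness."""
    return {picture_id: parse_mark(mark) for picture_id, mark in marks.items()}
```

Rejecting unknown marks instead of guessing keeps inspection mistakes out of the training data for the binary classification model.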
105. Train a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model.
For example, after obtaining the correctness information of the class label of each data item in the second data set, the electronic equipment may train the preset binary classification model on that correctness information together with the feature of each data item in the second data set. The trained model is the target model.
For example, after obtaining the class label correctness information of the 200 pictures in the second data set, the electronic equipment may feed this information and the features of the 200 pictures into the preset binary classification model as input data in order to train it, thereby obtaining the target model.
In one embodiment, for example, picture P_i is a picture in the second data set. The feature f_i of picture P_i and the class label correctness information b_i of picture P_i can be expressed in the form <f_i, b_i>, and <f_i, b_i> can serve as one training sample for the binary classification model.
It can be understood that the target model obtained through training is a model that, given the feature of a picture, outputs whether the class label of that picture is correct.
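The pairing of a feature with its correctness information can be sketched as follows. This is only an illustration: the `Sample` type, the dictionary inputs and the encoding of b_i as 1 or 0 are assumptions, not details given in the text.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    feature: list[float]   # f_i, the extracted feature of picture P_i
    label_correct: int     # b_i, 1 if the class label is correct, else 0


def build_samples(features, correctness):
    """Pair f_i with b_i for every picture id in the second data set.

    features:    dict mapping picture id -> feature vector (assumed layout)
    correctness: dict mapping picture id -> bool
    """
    return [Sample(features[pid], int(correctness[pid]))
            for pid in sorted(correctness)]
```

Each returned `Sample` corresponds to one pair of feature and correctness information that can be handed to the binary classification model during training.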
106. Use the target model and the feature of each data item in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct.
For example, after obtaining the target model, the electronic equipment may use the target model and the feature of each data item in the first data set to obtain the data in the first data set whose class labels are judged to be correct, that is, the first target data.
For example, suppose the first data set contains 800 pictures. After obtaining the target model, the electronic equipment may input the feature of each of these 800 pictures into the target model, which outputs whether the class label of each picture is correct based on the picture's feature. The pictures whose class labels are judged to be correct are determined to be the first target data. It can be understood that the first target data may include multiple pictures.
107. Obtain second target data according to the first target data and the data in the second data set whose class labels are correct.
For example, after obtaining the first target data, the electronic equipment may also obtain the data in the second data set whose class labels are correct, and merge them with the first target data to obtain the second target data. It can be understood that the second target data are the clean data obtained after data cleansing.
For example, in 104 the electronic equipment obtains the correctness information of the class labels of the 200 pictures in the second data set, and 190 of these 200 pictures have correct class labels. In 106, the electronic equipment uses the target model to determine that 790 of the 800 pictures in the first data set have correct class labels. The electronic equipment may then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures, which can be regarded as the clean data obtained after data cleansing.
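The merge in 107 can be sketched as below, using the counts from the example (790 pictures judged correct by the model plus 190 manually verified pictures). The function name and the identifier scheme are illustrative assumptions.

```python
def merge_clean_data(first_target, second_set_correctness):
    """Combine model-approved items with manually verified items.

    first_target:           ids judged correct by the target model
    second_set_correctness: dict mapping id -> bool from manual inspection
    """
    verified = [pid for pid, ok in second_set_correctness.items() if ok]
    return list(first_target) + verified


# Counts from the text: 790 of 800 and 190 of 200 labels are correct.
first_target = [f"A{i}" for i in range(790)]
second_set = {f"B{i}": i < 190 for i in range(200)}
second_target = merge_clean_data(first_target, second_set)
```

With these inputs `second_target` holds 790 + 190 = 980 identifiers, matching the 980 clean pictures in the example.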
It can be understood that, in this embodiment, the electronic equipment can perform data cleansing with a trained binary classification model. Because the trained binary classification model can quickly identify the data whose class labels are correct, this embodiment can quickly obtain clean data. Compared with the related-art approach of manually browsing the data one by one to check whether the label information is wrong, this embodiment can improve the efficiency of data cleansing.
The embodiments of the application introduce a binary classification model into the cleansing of classified data. The binary classification model is trained on the known class label correctness information and the features of the data in the second data set to obtain the target model. The target model is then used to output whether the class label of each data item in the first data set is correct. Finally, all data whose class labels are correct are obtained as the clean data.
Referring to Fig. 3, Fig. 3 is a third flow diagram of the data processing method provided by the embodiments of the application. The process may include:
201. The electronic equipment obtains multiple data items, the multiple data items carrying the same class label.
For example, the electronic equipment may obtain 1000 pictures that need data cleansing. These 1000 pictures carry the same class label, for example the same manually annotated flowers class label. The 1000 pictures are denoted P1, P2, P3, ..., P1000.
202. The electronic equipment divides the multiple data items into a first data set and a second data set.
For example, after obtaining the 1000 pictures, the electronic equipment may divide them into a first data set and a second data set. For example, the electronic equipment may randomly select 800 of the 1000 pictures as the first data set and assign the remaining 200 pictures to the second data set.
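A minimal sketch of the random 800/200 division described above follows. The helper name `split_data` and the optional `seed` parameter are assumptions added so that the split can be reproduced; the text only requires a random selection.

```python
import random


def split_data(items, first_size, seed=None):
    """Randomly assign first_size items to the first data set and the
    remaining items to the second data set."""
    if not 0 <= first_size <= len(items):
        raise ValueError("first_size must be between 0 and len(items)")
    shuffled = list(items)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:first_size], shuffled[first_size:]


pictures = [f"P{i}" for i in range(1, 1001)]
first_set, second_set = split_data(pictures, first_size=800, seed=0)
```

Shuffling once and slicing guarantees that the two sets are disjoint and together cover all 1000 pictures.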
After dividing the 1000 pictures into the first data set and the second data set, the electronic equipment may detect whether its current computing capability is lower than a preset threshold.
If the current computing capability is detected to be lower than the preset threshold, the current computing capability of the electronic equipment can be considered weak. In this case, the process enters 203.
If the current computing capability is detected to be not lower than the preset threshold, the current computing capability of the electronic equipment can be considered strong. In this case, the process enters 204.
203. When the computing capability of the electronic equipment is lower than the preset threshold, the electronic equipment uses a preset feature extraction model to extract the feature of each data item in the first data set and the second data set.
For example, if the electronic equipment detects that its current computing capability is lower than the preset threshold, it may obtain the preset feature extraction model and use it to extract the feature of each picture in the first data set and the second data set.
In one embodiment, the preset feature extraction model may be obtained as follows:
The electronic equipment obtains a first model, the first model being a ResNet model trained on ImageNet;
The electronic equipment trains the ResNet model on the multiple data items to obtain a second model;
The electronic equipment removes the fully connected layer that forms the last layer of the second model to obtain a third model, and determines the third model to be the preset feature extraction model.
For example, when the data to be cleansed (the multiple data items) are pictures, the electronic equipment may first obtain the first model, which is a ResNet model trained on ImageNet.
It should be noted that the ImageNet project is a large visual database for research on visual object recognition software. More than 14 million image URLs have been manually annotated by ImageNet to indicate the objects in the pictures. Since 2010, the ImageNet project has held an annual software competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which software programs compete to correctly classify and detect objects and scenes.
ResNet (Residual Neural Network) used residual units to successfully train a 152-layer neural network and won the ILSVRC 2015 competition. The ResNet structure greatly accelerates the training of neural networks and also considerably improves model accuracy.
That is, ImageNet is an open, free, large-scale picture database containing pictures in roughly 22,000 categories, and ResNet is a picture classification model trained on the data in ImageNet.
For example, after obtaining the ResNet model, the electronic equipment may first use the pictures to be cleansed (such as the 1000 pictures above) to train the ResNet model, thereby obtaining the second model. After obtaining the second model, the electronic equipment may remove the fully connected layer that forms the last layer of the second model to obtain the third model, and determine the third model to be the preset feature extraction model. It should be noted that the last layer of the ResNet model is a fully connected layer whose role is to classify pictures, while the role of the other neural network layers in the ResNet model is to extract features. The neural network layers that remain after the last fully connected layer of the second model is removed can therefore be used as a feature extraction model. The ResNet model is trained once more on the pictures to be cleansed because ResNet is a fairly general classification model. Training it again on those pictures makes the second model more targeted at classifying them, so that the third model extracts features from the pictures to be cleansed more accurately.
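The idea of turning a classifier into a feature extractor by dropping its final fully connected layer can be illustrated with a toy model in pure Python. This is not ResNet: the three tiny layers, their weights and the helper names are all made up for illustration, and no training is performed here.

```python
def dense(weights, relu=True):
    """Build a toy layer: matrix-vector product, optionally followed by ReLU."""
    def layer(x):
        out = [sum(w * v for w, v in zip(row, x)) for row in weights]
        return [max(0.0, o) for o in out] if relu else out
    return layer


def run(layers, x):
    """Apply the layers in order and return the final output."""
    for layer in layers:
        x = layer(x)
    return x


# Toy stand-in for the fine-tuned second model: two feature layers followed
# by a final fully connected classification layer.
second_model = [
    dense([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),   # 2 inputs -> 3 values
    dense([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]),     # 3 values -> 2 features
    dense([[1.0, -1.0]], relu=False),              # final layer: class score
]

# Third model: everything except the final fully connected layer.
third_model = second_model[:-1]

feature = run(third_model, [2.0, 4.0])   # feature vector instead of a score
```

With a real ResNet the same effect is usually achieved through the deep learning framework in use, for example by replacing the final classification layer with an identity mapping, but the exact mechanism depends on the framework and is not specified in the text.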
204. When the computing capability of the electronic equipment is not lower than the preset threshold, the electronic equipment obtains a fourth model and uses it to extract the feature of each data item in the first data set and the second data set, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
For example, if the electronic equipment detects that its current computing capability is not lower than the preset threshold, it may obtain the fourth model and use it to extract the feature of each picture in the first data set and the second data set, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
For example, the fourth model may be a single model with a more complex structure than the ResNet model used in this embodiment, such as Inception-ResNet-v2. Alternatively, the fourth model may be a fusion (stacking) of multiple models. For example, the structure of the fourth model may be as shown in Fig. 4: the picture data are input to multiple first-level models (Level 1) simultaneously, the features extracted by the first-level models are used as the input of a second-level model, and the output of the second-level model is used as the output feature. Model 1, Model 2 and Model 3 may be common deep learning models such as ResNet, Inception or MobileNet, while Model 4 may be a simpler conventional machine learning model such as linear regression. Multi-model fusion combines the advantages of several models and extracts features more effectively, so the subsequent cleansing works better. However, it also consumes more resources and is therefore suitable when the electronic equipment has sufficient computing capability.
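The two-level structure described for Fig. 4 can be sketched as follows. The three first-level functions are trivial stand-ins for deep networks such as ResNet, Inception or MobileNet, and the second-level linear map uses fixed, made-up weights in place of a trained linear regression; all names are hypothetical.

```python
def stack_features(level1_models, level2_model, image):
    """Feed the image to every first-level model, concatenate their outputs,
    and pass the result through the second-level model."""
    level1_out = []
    for model in level1_models:
        level1_out.extend(model(image))
    return level2_model(level1_out)


# Hypothetical stand-ins for Model 1 to Model 3.
def model_1(img):
    return [sum(img) / len(img)]      # mean value


def model_2(img):
    return [max(img), min(img)]       # largest and smallest value


def model_3(img):
    return [img[0] - img[-1]]         # first minus last value


# Stand-in for Model 4: a fixed linear map (weights would normally be learned).
def model_4(x, weights=((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, -1.0, 0.0))):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


feature = stack_features([model_1, model_2, model_3], model_4, [1.0, 3.0, 5.0])
```

Every first-level model runs on each input, which is why this arrangement costs more computation than a single extractor.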
In one embodiment, the computing capability of the electronic equipment may be measured by, for example, CPU usage, and/or remaining running memory capacity, and/or the ratio of remaining running memory capacity to total running memory capacity.
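One possible way to turn such measurements into the choice between 203 and 204 is sketched below. The text does not say how the measurements are combined or what the threshold values are, so the rule used here (both conditions must hold) and the default limits of 0.5 and 0.3 are purely illustrative assumptions.

```python
def select_step(cpu_usage: float, free_memory_ratio: float,
                max_cpu_usage: float = 0.5,
                min_free_memory_ratio: float = 0.3) -> int:
    """Return 204 when the capability is not below the threshold, else 203.

    cpu_usage:         fraction of CPU in use, between 0 and 1
    free_memory_ratio: remaining running memory / total running memory
    """
    capable = (cpu_usage <= max_cpu_usage
               and free_memory_ratio >= min_free_memory_ratio)
    return 204 if capable else 203
```

For example, `select_step(0.2, 0.6)` selects step 204 (the fourth model), while a heavily loaded device such as `select_step(0.9, 0.6)` falls back to step 203 (the preset feature extraction model).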
205. The electronic equipment obtains the correctness information of the class label of each data item in the second data set.
For example, after dividing the data into the first data set and the second data set, the electronic equipment may obtain the correctness information of the class label of each picture in the second data set.
For example, if the second data set contains 200 pictures, manual inspection may first be used to determine whether the class label of each of these 200 pictures is correct. If the class label of a picture is correct, the inspector may use the electronic equipment to mark the picture with information indicating a correct class label, such as the digit "1" or the letter "T". If the class label of a picture is wrong, the inspector may use the electronic equipment to mark the picture with information indicating a wrong class label, such as the digit "0" or the letter "F". In this way, the electronic equipment obtains the class label correctness information of these 200 pictures.
206. The electronic equipment trains a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model.
For example, after obtaining the correctness information of the class label of each picture in the second data set, the electronic equipment may train the preset binary classification model on that correctness information together with the feature of each picture in the second data set. The trained model is the target model.
For example, the preset binary classification model may be a support vector machine (SVM). After obtaining the class label correctness information of the 200 pictures in the second data set, the electronic equipment may feed this information and the features of the 200 pictures into the preset SVM model as input data in order to train it, thereby obtaining the target model.
In one embodiment, for example, picture P_i is a picture in the second data set. The feature f_i of picture P_i and the class label correctness information b_i of picture P_i can be expressed in the form <f_i, b_i>, and <f_i, b_i> can serve as one training sample for the SVM model.
It can be understood that the target model obtained through training is a model that, given the feature of a picture, outputs whether the class label of that picture is correct.
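To make the training step concrete, the following sketch trains a linear SVM by sub-gradient descent on the regularized hinge loss, using only the Python standard library. It is a simplified stand-in, not the implementation described in the text: the hyperparameters, the two-dimensional toy features and the function names are assumptions, and in practice an existing SVM library would normally be used.

```python
def train_linear_svm(samples, epochs=200, lr=0.1, reg=0.01):
    """Train weights w and bias b on samples of the form (f_i, b_i),
    where b_i is 1 for a correct class label and 0 for a wrong one."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, label in samples:
            y = 1.0 if label else -1.0            # map {0, 1} to {-1, +1}
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * reg) for wi in w]   # L2 regularization step
            if margin < 1.0:                      # hinge loss is active
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(model, x):
    """Return 1 if the class label is judged correct, else 0."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


# Toy second data set: made-up 2-D features with their b_i values.
training = [([2.0, 2.0], 1), ([1.5, 2.5], 1), ([2.5, 1.5], 1),
            ([-2.0, -2.0], 0), ([-1.5, -2.5], 0), ([-2.5, -1.5], 0)]
target_model = train_linear_svm(training)
```

The returned pair `(w, b)` plays the role of the target model: `predict(target_model, feature)` reports whether a picture's class label is judged to be correct.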
In some embodiments, the preset binary classification model may also be a model such as a multi-layer perceptron (MLP), a decision tree or a random forest.
207. The electronic equipment uses the target model and the feature of each data item in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct.
For example, after obtaining the target model, the electronic equipment may use the target model and the feature of each picture in the first data set to obtain the pictures in the first data set whose class labels are judged to be correct, that is, the first target data.
For example, suppose the first data set contains 800 pictures. After obtaining the target model, the electronic equipment may input the feature of each of these 800 pictures into the target model, which outputs whether the class label of each picture is correct based on the picture's feature. The pictures whose class labels are judged to be correct are determined to be the first target data. It can be understood that the first target data may include multiple pictures.
208. The electronic equipment obtains second target data according to the first target data and the data in the second data set whose class labels are correct.
For example, after obtaining the first target data, the electronic equipment may also obtain the pictures in the second data set whose class labels are correct, and obtain the second target data from the first target data and those pictures. It can be understood that the second target data are the clean pictures obtained after data cleansing.
For example, in 205 the electronic equipment obtains the correctness information of the class labels of the 200 pictures in the second data set, and 190 of these 200 pictures have correct class labels. In 207, the electronic equipment uses the target model to determine that 790 of the 800 pictures in the first data set have correct class labels. The electronic equipment may then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures, which can be regarded as the clean pictures obtained after data cleansing.
In some embodiments, the embodiments of the present application may further include:
The electronic device uses the target model and the feature of each piece of data in the first data set to obtain third target data, namely the data in the first data set whose class labels are judged to be wrong;
The electronic device deletes the third target data and the data with wrong class labels in the second data set.
For example, after obtaining the target model, the electronic device can input the feature of each of the 800 pictures in the first data set into the target model. The target model outputs, according to the feature of each picture, information on whether the class label of that picture is correct, and the pictures whose class labels are determined to be wrong are taken as the third target data. It can be understood that the third target data may include multiple pictures.
After obtaining the third target data, the electronic device can also obtain the pictures with wrong class labels in the second data set. The electronic device can then delete the third target data and the pictures with wrong class labels in the second data set. It can be understood that the third target data and the pictures with wrong class labels in the second data set are the "dirty data" removed by the data cleaning process. It can also be understood that dirty data are data whose carried class label differs from their actual class label. For example, if a picture of a tree is mistakenly labeled with the flower class, that picture is dirty data.
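The deletion of dirty data can be sketched as follows. The function name delete_dirty_data, the integer ids, and the id ranges chosen for the first and second data sets are hypothetical and serve only to reproduce the numbers used in the examples of this description.

```python
def delete_dirty_data(all_ids, third_target_ids, second_set_flags):
    """Remove dirty data and return (kept ids, dirty ids).

    third_target_ids: first-set ids whose labels the target model judged wrong.
    second_set_flags: dict of second-set id -> 1 (correct) or 0 (wrong).
    """
    dirty = set(third_target_ids)
    dirty.update(pid for pid, flag in second_set_flags.items() if flag == 0)
    kept = [pid for pid in all_ids if pid not in dirty]
    return kept, dirty


all_ids = list(range(1000))            # ids 0..799: first set, 800..999: second set
third_target = list(range(790, 800))   # 10 first-set pictures judged wrong
second_flags = {pid: (1 if pid < 990 else 0) for pid in range(800, 1000)}
kept, dirty = delete_dirty_data(all_ids, third_target, second_flags)
print(len(kept), len(dirty))  # 980 20
```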
In some embodiments, when the present application divides the data into the first data set and the second data set, the first data set and the second data set can satisfy the following condition:
The ratio of the number of data items in the first data set to the number of data items in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
For example, the ratio of the number of data items in the first data set to the number in the second data set can be a preset ratio such as 8:2, 9:1, or 7.5:2.5, and the number of data items in the first data set is greater than the number in the second data set.
In another embodiment, the number of data items with correct class labels and the number of data items with wrong class labels in the second data set can each satisfy the following condition: each number is greater than or equal to 100. For example, the second data set contains no fewer than 100 pictures with correct class labels and no fewer than 100 pictures with wrong class labels.
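Both conditions can be sketched in Python as follows. The helper names split_by_ratio and has_enough_of_each, the fixed random seed, and the rounding rule for the cut point are assumptions introduced for illustration; the application only states the ratio and the minimum counts.

```python
import random


def split_by_ratio(items, ratio=(8, 2), seed=0):
    """Randomly split items into a larger first set and a smaller second set."""
    first_share, second_share = ratio
    if first_share <= second_share:
        raise ValueError("the first data set must be larger than the second")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * first_share / (first_share + second_share))
    return shuffled[:cut], shuffled[cut:]


def has_enough_of_each(correctness_flags, minimum=100):
    """Check that correct (1) and wrong (0) labels each number at least `minimum`."""
    correct = sum(1 for flag in correctness_flags if flag == 1)
    wrong = sum(1 for flag in correctness_flags if flag == 0)
    return correct >= minimum and wrong >= minimum


first_set, second_set = split_by_ratio(range(1000), ratio=(8, 2))
print(len(first_set), len(second_set))  # 800 200
```

The count check can only be evaluated after the second data set has been manually inspected, since it depends on the correctness flags.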
Please refer to Fig. 5 to Fig. 7, which are schematic scene diagrams of the data processing method provided by the embodiments of the present application.
For example, as shown in Fig. 5, a user currently needs to perform data cleaning on the 1000 pictures in picture set P, and these 1000 pictures are labeled with the same class label. The electronic device can then first obtain these 1000 pictures.
Afterwards, the electronic device can randomly divide these 1000 pictures into a first picture set and a second picture set, as shown in Fig. 6. The first picture set includes 800 pictures, and the second picture set includes 200 pictures.
Afterwards, the electronic device can use the preset feature extraction model to extract the feature of each picture in the first picture set and the second picture set. For example, feature F_i is the feature of picture P_i, where i is an integer greater than or equal to 1.
After the feature of each picture is extracted, manual inspection can be used to determine whether the class label of each of the 200 pictures in the second picture set is correct. If the class label of a picture is correct, the inspector can use the electronic device to mark the picture with information indicating that the class label is correct, for example the digit "1". If the class label of a picture is wrong, the inspector can use the electronic device to mark the picture with information indicating that the class label is wrong, for example the digit "0". In this way, the electronic device obtains the class label correctness information of these 200 pictures. For example, 190 of these 200 pictures are marked with the digit "1", that is, manual inspection finds that the class labels of 190 pictures in the second picture set are correct.
After obtaining the class label correctness information of the 200 pictures in the second picture set, the electronic device can take the class label correctness information and the features of these 200 pictures as input data and feed them into a preset SVM model to train the SVM model, thereby obtaining the target model. For example, if picture P_i is a picture in the second data set, the feature f_i of picture P_i and the class label correctness information b_i of picture P_i can be expressed in the form <f_i, b_i>, and <f_i, b_i> can then serve as one learning sample for the SVM model.
After obtaining the target model, the electronic device can input the feature of each of the 800 pictures in the first picture set into the target model. The target model outputs, according to the feature of each picture, information on whether the class label of that picture is correct, and the pictures whose class labels are determined to be correct are taken as the first target data. For example, the electronic device finally determines that the class labels of 790 pictures in the first picture set are correct.
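The training and prediction steps can be sketched with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This is only an illustrative sketch: the application does not specify the kernel or the optimizer of the preset SVM model, so the linear form, the learning rate, the regularization strength, the two-dimensional features, and the picture ids below are all assumptions.

```python
def train_linear_svm(samples, epochs=200, lr=0.1, reg=0.01):
    """Train a linear SVM by sub-gradient descent on the regularized hinge loss.

    samples: list of (f_i, b_i) pairs, where f_i is a feature vector and
    b_i is 1 (class label correct) or 0 (class label wrong).
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, flag in samples:
            y = 1.0 if flag == 1 else -1.0
            score = sum(w * x for w, x in zip(weights, features)) + bias
            # L2 regularization shrinks the weights at every step.
            weights = [w * (1.0 - lr * reg) for w in weights]
            if y * score < 1.0:  # the sample violates the margin
                weights = [w + lr * y * x for w, x in zip(weights, features)]
                bias += lr * y
    return weights, bias


def label_is_correct(model, features):
    """Return 1 if the model judges the class label correct, otherwise 0."""
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0.0 else 0


# Hypothetical 2-D features <f_i, b_i> for the manually checked second picture set.
second_set = [([2.0, 1.5], 1), ([1.5, 2.5], 1), ([2.5, 2.0], 1),
              ([-2.0, -1.5], 0), ([-1.5, -2.5], 0), ([-2.5, -2.0], 0)]
target_model = train_linear_svm(second_set)

# Hypothetical features for the unchecked first picture set.
first_set = {"P1": [2.2, 1.8], "P2": [-1.8, -2.1], "P3": [1.9, 2.4]}
first_target = [pid for pid, f in first_set.items()
                if label_is_correct(target_model, f) == 1]
print(first_target)  # ['P1', 'P3']
```

In practice the features would be the high-dimensional vectors produced by the feature extraction model, and an existing SVM library would normally be used instead of a hand-written trainer.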
Afterwards, the electronic device can merge the 190 pictures with correct class labels in the second picture set with the 790 pictures in the first picture set whose class labels are judged to be correct, obtaining 980 pictures. These 980 pictures can be regarded as the clean pictures obtained after data cleaning. For example, as shown in Fig. 7, the electronic device combines these 980 pictures into one picture set.
Please also refer to Fig. 8, which is a fourth schematic flowchart of the data processing method provided by this embodiment.
In this embodiment, the electronic device can use the trained binary classification model to perform data cleaning. Because the trained binary classification model can quickly determine which data have correct class labels, this embodiment can quickly obtain clean data. Compared with the data cleaning approach in the related art, in which a person browses the data one by one to check whether the label information is wrong, this embodiment removes a large amount of manual work, improves the efficiency of data cleaning, and reduces its cost.
In addition, by performing data cleaning through binary classification, this embodiment can achieve accuracy close to that of manual cleaning. Moreover, the data cleaning process provided by this embodiment is traceable, so other personnel can check the quality of the data cleaning by reviewing the cleaning process.
Please refer to Fig. 9, which is a schematic structural diagram of the data processing device provided by the embodiments of the present application. The data processing device 300 may include: a first acquisition module 301, a division module 302, an extraction module 303, a second acquisition module 304, a training module 305, a third acquisition module 306, and a processing module 307.
The first acquisition module 301 is configured to obtain multiple pieces of data, the multiple pieces of data carrying the same class label;
The division module 302 is configured to divide the multiple pieces of data into a first data set and a second data set;
The extraction module 303 is configured to extract the feature of each piece of data in the first data set and the second data set;
The second acquisition module 304 is configured to obtain the correctness information of the class label of each piece of data in the second data set;
The training module 305 is configured to train a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model;
The third acquisition module 306 is configured to use the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct;
The processing module 307 is configured to obtain second target data according to the first target data and the data with correct class labels in the second data set.
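The cooperation of these modules can be sketched as a small pipeline in which each module is supplied as a function. The class name DataProcessingDevice, the callable signatures, and the toy stand-ins used below are assumptions made for illustration and are not the structure claimed by the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Feature = Sequence[float]
Model = Callable[[Feature], int]


@dataclass
class DataProcessingDevice:
    divide: Callable[[List[str]], Tuple[List[str], List[str]]]  # division module 302
    extract: Callable[[str], Feature]                            # extraction module 303
    get_correctness: Callable[[str], int]                        # second acquisition module 304
    train: Callable[[List[Tuple[Feature, int]]], Model]          # training module 305

    def run(self, data: List[str]) -> List[str]:
        first_set, second_set = self.divide(data)
        features = {item: self.extract(item) for item in data}
        flags = {item: self.get_correctness(item) for item in second_set}
        target_model = self.train([(features[i], flags[i]) for i in second_set])
        # Third acquisition module 306: first-set data judged correct by the model.
        first_target = [i for i in first_set if target_model(features[i]) == 1]
        # Processing module 307: merge with second-set data checked as correct.
        return first_target + [i for i in second_set if flags[i] == 1]


# Toy stand-ins: names starting with "ok" play the role of correctly labeled data.
data = ["ok_0", "ok_1", "bad_2", "ok_3", "ok_4",
        "ok_5", "ok_6", "ok_7", "bad_8", "ok_9"]
device = DataProcessingDevice(
    divide=lambda items: (items[:8], items[8:]),
    extract=lambda item: [1.0] if item.startswith("ok") else [-1.0],
    get_correctness=lambda item: 1 if item.startswith("ok") else 0,
    train=lambda samples: (lambda f: 1 if f[0] > 0 else 0),  # stub that ignores samples
)
second_target = device.run(data)
print(len(second_target))  # 8
```

Passing the modules in as functions keeps the sketch independent of any particular feature extractor or classifier.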
In one embodiment, the first acquisition module 301 can be configured to:
When the multiple pieces of data are pictures, obtain a first model, the first model being a ResNet model trained on ImageNet;
Train the ResNet model using the multiple pieces of data to obtain a second model;
Remove the fully connected layer located at the last layer of the second model to obtain a third model, and determine the third model as the preset feature extraction model;
In that case, the extraction module 303 can be configured to: use the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set.
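The idea of removing the last fully connected layer can be illustrated with a toy network written in plain Python. This is not a ResNet: the two-layer structure, the weights, and the helper names are hypothetical, and a real implementation would use a deep learning framework. The sketch only shows that dropping the final layer turns a classifier into a feature extractor whose output is the penultimate activation.

```python
def relu_layer(weights):
    """Return a layer computing max(0, W x) for the given weight rows."""
    def layer(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return layer


def linear_layer(weights):
    """Return a fully connected layer computing W x."""
    def layer(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return layer


def run(layers, x):
    """Apply the layers in order to input x."""
    for layer in layers:
        x = layer(x)
    return x


# "Second model": a hidden layer followed by a final fully connected layer.
second_model = [
    relu_layer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    linear_layer([[1.0, -1.0, 0.5]]),
]
# "Third model": the same network without its last fully connected layer.
third_model = second_model[:-1]

features = run(third_model, [2.0, -3.0])
print(features)  # [2.0, 0.0, 0.0]
```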
In one embodiment, the extraction module 303 can be configured to:
When the computing capability of the electronic device is lower than a preset threshold, use the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set.
In one embodiment, the extraction module 303 can be configured to:
When the computing capability of the electronic device is not lower than the preset threshold, obtain a fourth model and use the fourth model to extract the feature of each piece of data in the first data set and the second data set, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
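The selection between the two feature extractors can be sketched as a simple threshold test. How computing capability is measured is not stated in the application, so the numeric capability score, the threshold value, and the function name choose_feature_extractor below are assumptions.

```python
def choose_feature_extractor(capability, threshold, preset_extractor, fourth_model):
    """Pick the extractor according to the device's computing capability.

    Below the threshold the lighter preset extractor is used; otherwise the
    higher-precision fourth model is used.
    """
    if capability < threshold:
        return preset_extractor
    return fourth_model


# Hypothetical capability scores; the extractors are represented by names only.
print(choose_feature_extractor(3.0, 5.0, "preset", "fourth"))  # preset
print(choose_feature_extractor(5.0, 5.0, "preset", "fourth"))  # fourth
```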
In one embodiment, the ratio of the number of data items in the first data set to the number of data items in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
In one embodiment, the processing module 307 can be further configured to:
Use the target model and the feature of each piece of data in the first data set to obtain third target data, namely the data in the first data set whose class labels are judged to be wrong;
Delete the third target data and the data with wrong class labels in the second data set.
In one embodiment, the preset binary classification model includes at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.
The embodiments of the present application provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed on a computer, the computer is caused to execute the procedures of the data processing method provided by this embodiment.
The embodiments of the present application also provide an electronic device including a memory and a processor. The processor calls the computer program stored in the memory to execute the procedures of the data processing method provided by this embodiment.
For example, the above electronic device can be a mobile terminal such as a tablet computer or a smartphone. Please refer to Fig. 10, which is a schematic structural diagram of the electronic device provided by the embodiments of the present application.
The electronic device 400 may include components such as a display screen 401, a memory 402, and a processor 403. Those skilled in the art will understand that the electronic device structure shown in Fig. 10 does not limit the electronic device, which may include more or fewer components than illustrated, combine certain components, or use a different component arrangement.
The display screen 401 can be used to display information such as pictures and text.
The memory 402 can be used to store application programs and data. The application programs stored in the memory 402 contain executable code. The application programs can form various functional modules. The processor 403 executes various functional applications and data processing by running the application programs stored in the memory 402.
The processor 403 is the control center of the electronic device. It connects all parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby executing:
Obtaining multiple pieces of data, the multiple pieces of data carrying the same class label;
Dividing the multiple pieces of data into a first data set and a second data set;
Extracting the feature of each piece of data in the first data set and the second data set;
Obtaining the correctness information of the class label of each piece of data in the second data set;
Training a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model;
Using the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct;
Obtaining second target data according to the first target data and the data with correct class labels in the second data set.
Please refer to Fig. 11. The electronic device 400 may include components such as a display screen 401, a memory 402, a processor 403, an input unit 404, and a power supply 405.
The display screen 401 can be used to display information such as pictures and text.
The memory 402 can be used to store application programs and data. The application programs stored in the memory 402 contain executable code. The application programs can form various functional modules. The processor 403 executes various functional applications and data processing by running the application programs stored in the memory 402.
The processor 403 is the control center of the electronic device. It connects all parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
The input unit 404 can be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 405 can be used to supply power to each component.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby executing:
Obtaining multiple pieces of data, the multiple pieces of data carrying the same class label;
Dividing the multiple pieces of data into a first data set and a second data set;
Extracting the feature of each piece of data in the first data set and the second data set;
Obtaining the correctness information of the class label of each piece of data in the second data set;
Training a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model;
Using the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct;
Obtaining second target data according to the first target data and the data with correct class labels in the second data set.
In one embodiment, the processor 403 can be further configured to: when the multiple pieces of data are pictures, obtain a first model, the first model being a ResNet model trained on ImageNet; train the ResNet model using the multiple pieces of data to obtain a second model; and remove the fully connected layer located at the last layer of the second model to obtain a third model, and determine the third model as the preset feature extraction model;
In that case, when extracting the feature of each piece of data in the first data set and the second data set, the processor 403 can execute: using the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set.
In one embodiment, when using the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set, the processor 403 can execute: when the computing capability of the electronic device is lower than a preset threshold, using the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set.
In one embodiment, the processor 403 can further execute: when the computing capability of the electronic device is not lower than the preset threshold, obtaining a fourth model and using the fourth model to extract the feature of each piece of data in the first data set and the second data set, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
In one embodiment, the ratio of the number of data items in the first data set to the number of data items in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
In one embodiment, the processor 403 can further execute: using the target model and the feature of each piece of data in the first data set to obtain third target data, namely the data in the first data set whose class labels are judged to be wrong; and deleting the third target data and the data with wrong class labels in the second data set.
In one embodiment, the preset binary classification model includes at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a given embodiment, refer to the detailed description of the data processing method above; details are not repeated here.
The data processing device provided by the embodiments of the present application and the data processing method in the foregoing embodiments belong to the same concept. Any of the methods provided in the data processing method embodiments can run on the data processing device. For the specific implementation process, see the data processing method embodiments; details are not repeated here.
It should be noted that, for the data processing method described in the embodiments of the present application, those of ordinary skill in the art will understand that all or part of the procedures for implementing the method can be completed by a computer program controlling the relevant hardware. The computer program can be stored in a computer-readable storage medium, for example in a memory, and executed by at least one processor, and its execution may include the procedures of the embodiments of the data processing method. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), or the like.
For the data processing device of the embodiments of the present application, the functional modules can be integrated in one processing chip, each module can exist physically on its own, or two or more modules can be integrated in one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The data processing method, device, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A data processing method, characterized by comprising:
Obtaining multiple pieces of data, the multiple pieces of data carrying the same class label;
Dividing the multiple pieces of data into a first data set and a second data set;
Extracting the feature of each piece of data in the first data set and the second data set;
Obtaining the correctness information of the class label of each piece of data in the second data set;
Training a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model;
Using the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct;
Obtaining second target data according to the first target data and the data with correct class labels in the second data set.
2. The data processing method according to claim 1, characterized in that the method further comprises:
When the multiple pieces of data are pictures, obtaining a first model, the first model being a ResNet model trained on ImageNet;
Training the ResNet model using the multiple pieces of data to obtain a second model;
Removing the fully connected layer located at the last layer of the second model to obtain a third model, and determining the third model as a preset feature extraction model;
The extracting of the feature of each piece of data in the first data set and the second data set comprises: using the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set.
3. The data processing method according to claim 2, characterized in that using the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set comprises:
When the computing capability of the electronic device is lower than a preset threshold, using the preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set.
4. The data processing method according to claim 3, characterized in that the method further comprises:
When the computing capability of the electronic device is not lower than the preset threshold, obtaining a fourth model and using the fourth model to extract the feature of each piece of data in the first data set and the second data set, wherein the feature extraction precision of the fourth model is higher than that of the preset feature extraction model.
5. The data processing method according to claim 1, characterized in that the ratio of the number of data items in the first data set to the number of data items in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
6. The data processing method according to claim 1, characterized in that the method further comprises:
Using the target model and the feature of each piece of data in the first data set to obtain third target data, namely the data in the first data set whose class labels are judged to be wrong;
Deleting the third target data and the data with wrong class labels in the second data set.
7. The data processing method according to claim 1, characterized in that the preset binary classification model includes at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.
8. A data processing device, characterized by comprising:
A first acquisition module, configured to obtain multiple pieces of data, the multiple pieces of data carrying the same class label;
A division module, configured to divide the multiple pieces of data into a first data set and a second data set;
An extraction module, configured to extract the feature of each piece of data in the first data set and the second data set;
A second acquisition module, configured to obtain the correctness information of the class label of each piece of data in the second data set;
A training module, configured to train a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model;
A third acquisition module, configured to use the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class labels are judged to be correct;
A processing module, configured to obtain second target data according to the first target data and the data with correct class labels in the second data set.
9. A storage medium on which a computer program is stored, characterized in that, when the computer program is executed on a computer, the computer is caused to execute the method according to any one of claims 1 to 7.
10. An electronic device including a memory and a processor, characterized in that the processor calls the computer program stored in the memory to execute the method according to any one of claims 1 to 7.
CN201910713784.5A 2019-08-02 2019-08-02 Data processing method and device, storage medium and electronic equipment Active CN110490237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910713784.5A CN110490237B (en) 2019-08-02 2019-08-02 Data processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110490237A true CN110490237A (en) 2019-11-22
CN110490237B CN110490237B (en) 2022-05-17

Family

ID=68549273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910713784.5A Active CN110490237B (en) 2019-08-02 2019-08-02 Data processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110490237B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160026900A1 (en) * 2013-04-26 2016-01-28 Olympus Corporation Image processing device, information storage device, and image processing method
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing
CN109213862A (en) * 2018-08-21 2019-01-15 北京京东尚科信息技术有限公司 Object identification method and device, computer readable storage medium
CN109447717A (en) * 2018-11-12 2019-03-08 万惠投资管理有限公司 A kind of determination method and system of label
CN109753498A (en) * 2018-12-11 2019-05-14 中科恒运股份有限公司 data cleaning method and terminal device based on machine learning
US20190236412A1 (en) * 2016-10-18 2019-08-01 Tencent Technology (Shenzhen) Company Limited Data processing method and device, classifier training method and system, and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460195A (en) * 2020-03-26 2020-07-28 Oppo广东移动通信有限公司 Picture processing method and device, storage medium and electronic equipment
CN112734035A (en) * 2020-12-31 2021-04-30 成都佳华物链云科技有限公司 Data processing method and device and readable storage medium
CN112734035B (en) * 2020-12-31 2023-10-27 成都佳华物链云科技有限公司 Data processing method and device and readable storage medium
CN113204660A (en) * 2021-03-31 2021-08-03 北京达佳互联信息技术有限公司 Multimedia data processing method, label identification method, device and electronic equipment
CN113128979A (en) * 2021-05-17 2021-07-16 中铁高新工业股份有限公司 Scientific research aid decision-making system based on big data

Also Published As

Publication number Publication date
CN110490237B (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant