CN105426826A - Tag noise correction based crowd-sourced tagging data quality improvement method

Info

Publication number: CN105426826A
Application number: CN201510754782.2A
Authority: CN
Inventors: 张静
Original assignee: Individual
Current assignee: Individual
Priority date: 2015-11-09
Filing date: 2015-11-09
Publication date: 2016-03-23

Abstract

The invention relates to a tag noise correction based crowd-sourced tagging data quality improvement method. The method comprises the following steps: running a tag integration algorithm in an initial crowd-sourced tagging data set to form a data set after tag integration, and estimating tagger quality and integrated tag quality information of samples in the process; performing multi-round K-fold cross validation by utilizing the data set after tag integration, and constructing a high-quality data set; determining a tag noise set in combination with the tagger quality and the tag quality of the samples by utilizing a prediction probability of a class tag of each sample in the multi-round K-fold cross validation process; and training a classification model by utilizing the high-quality data set generated in the multi-round K-fold cross validation process, and performing prediction and replacement on the class tag of each sample in the tag noise data set by using the model. With the tag noise correction method, the quantity of potential noise tag samples in the data set after original tag integration is reduced, so that the data quality is improved.

Description

A method for improving the quality of crowdsourced labeled data based on label noise correction

Technical field

The invention belongs to the technical field of data labeling, and specifically relates to a method for improving the quality of crowdsourced labeled data based on label noise correction.

Background art

Obtaining high-quality labeled data is foundational work in fields such as information retrieval, machine learning, and data mining. In supervised learning, the entire learning process consists of training a model on a data set of moderate size whose samples carry class labels, so as to obtain a model that can accurately predict the labels of unlabeled samples. Traditionally, the class labels in training data are provided by experts in the application domain. Expert-provided labels are highly accurate, which helps in building high-quality models, but expert labeling is expensive. With the development of intelligent computing technology, labeling demand keeps growing, and expert labeling alone can no longer meet it. The emergence of crowdsourcing systems has greatly alleviated this problem. Many labeling tasks, such as text annotation and image classification, can be published on the Internet through a crowdsourcing platform and labeled by ordinary Internet users, who complete the labeling tasks in exchange for payment from the task publisher.

Crowdsourced labeling lowers the cost of obtaining labeled data and shortens turnaround time. However, it has an inherent defect: the annotators are ordinary Internet users, and compared with traditional expert labeling, the quality of their labels cannot be guaranteed. To address this, a widely adopted approach is to have several different annotators label each sample and then apply a label integration method to obtain the final label of each sample. Existing label integration algorithms include majority voting, the Dawid and Skene algorithm (DS), the algorithm proposed by Raykar et al. (RY), and ZenCrowd. These algorithms model the crowdsourced labeling system from several angles, such as the annotators' level of expertise, the effort annotators put into the task, and the difficulty of the task itself, and infer an integrated label for each sample. Studies have found that, despite the variety of integration methods, no single algorithm is acknowledged to perform best, and in most cases the improvement in data quality after label integration is limited. Here, data quality is defined as the degree to which the integrated labels of the samples match their true labels. Throughout the labeling process the true labels of all samples are unknown; the goal of label integration is to infer each sample's label correctly so that it matches the true label as closely as possible.

The main reason these label integration algorithms cannot improve data quality further is that they use only the label information from multiple unreliable annotators and ignore the feature information of the data itself. In the present invention, an integrated label that does not match the true label is called a "noise" label. If the feature information of the available data can be used to correct these noise labels, data quality can be improved further.

Summary of the invention

In view of the above technical problem in the prior art, the invention provides a method for improving the quality of crowdsourced labeled data based on label noise correction. The overall technical scheme of the method comprises the following steps:

(1) A label integration algorithm is run on the initial crowdsourced labeled data set D to obtain the integrated data set D^I, in which every sample has an integrated label. During this process, the quality of each annotator and the quality of each integrated sample label are estimated. Annotator quality is the probability that the label an annotator gives a sample equals the sample's true label. Integrated label quality is the probability that a sample's integrated label equals its true label.

(2) M rounds of K-fold cross-validation are performed on D^I. In each round, D^I is randomly shuffled and divided into K parts; each part in turn serves as the test set while the remaining K-1 parts serve as the training set on which a classifier is trained. The classifier is used to predict the label of each sample in the test set. One high-quality data set is built in each round, giving M high-quality data sets HQ^(1), HQ^(2), ..., HQ^(M) in total. Using the class prediction probabilities obtained for each sample in each round, combined with the annotator quality and integrated label quality from step (1), all samples are ranked by their likelihood of being label noise, and a certain number of label noise samples are determined; these form the label noise data set D^N. The samples belonging to D^N are deleted from D^I, and the remaining samples form the clean data set D^C, so that D^I = D^N + D^C. M and K are parameters of the method: M is a positive integer not less than 1, and K is a positive integer not less than 3.

(3) A classification model is trained on the M high-quality data sets HQ^(1), HQ^(2), ..., HQ^(M) from step (2) and used to re-predict the class labels of all samples in the noise data set D^N. The original class labels are replaced with the predicted ones, forming the corrected noise data set D^R.

(4) D^R from step (3) and D^C from step (2) are merged into a new enhanced data set D^E. D^E contains the same samples as D^I from step (1), but its label quality is higher than that of D^I.

The invention uses the feature attributes of the labeled samples themselves, combined with label noise handling techniques, to correct potential errors in the integrated labels. Compared with conventional methods that perform label integration only, the invention has the following beneficial effects:

(1) On top of a label integration method, the invention uses the feature attributes of the labeled samples themselves to further correct potentially wrong integrated labels, improving the label quality of the final data set.

(2) The invention works with a variety of label integration methods and is therefore general.

The method is applicable to all types of crowdsourced data, including but not limited to binary and multi-class labeling for tasks involving images, text, and video.

Brief description of the drawings

Fig. 1 is the overall framework diagram of the method of the invention.

Fig. 2 is a flow chart of one embodiment of the method of the invention.

Detailed description of the embodiments

To describe the invention more specifically, an embodiment of the invention is described in detail below with reference to the accompanying drawings.

Step (1): (crowdsourced label integration)

(1-1) A label integration algorithm is run on the initial crowdsourced data set D. The most commonly used algorithm is majority voting. For each sample i in the data set, this algorithm counts the labels given to the sample by its multiple annotators; if the label of class c_k has the largest count, the integrated label of the sample is c_k. If more than one class has the largest count, one of them is selected at random as the integrated label of the sample.
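The majority voting rule of step (1-1) can be sketched as follows. This is a minimal illustration, not the patented implementation; the function names `majority_vote` and `integrate` and the nested-dictionary layout of the crowd labels are assumptions made for the example.

```python
import random
from collections import Counter

def majority_vote(labels, rng=random):
    """Return the most frequent label; break ties uniformly at random."""
    counts = Counter(labels)
    top = max(counts.values())
    # All classes that share the largest count (more than one means a tie).
    winners = sorted(c for c, n in counts.items() if n == top)
    return winners[0] if len(winners) == 1 else rng.choice(winners)

def integrate(crowd_labels, rng=random):
    """Map {sample_id: {annotator_id: label}} to {sample_id: integrated label}."""
    return {i: majority_vote(list(by_annotator.values()), rng)
            for i, by_annotator in crowd_labels.items()}

# Hypothetical toy data: three samples labeled by up to three annotators.
crowd = {
    "s1": {"a1": 1, "a2": 1, "a3": 0},
    "s2": {"a1": 0, "a2": 0, "a3": 0},
    "s3": {"a1": 1, "a2": 0},  # tie, resolved at random
}
integrated = integrate(crowd, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the tie-breaking reproducible, which is convenient when the integrated labels feed later steps.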

(1-2) For any sample i in data set D^I, given its integrated label and the label assigned to sample i by annotator j, the labeling quality of annotator j is calculated as:

where I is the number of samples in D^I and the function is the indicator function, which returns 1 when its condition holds and 0 otherwise.

With J annotators in total, the average labeling quality of all annotators is calculated as:
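The formulas for annotator quality and average quality are not reproduced in this text. The sketch below is one reading consistent with the surrounding definitions: it assumes annotator quality is the number of samples on which the annotator's label equals the integrated label, divided by I, and that the average is the arithmetic mean over the J annotators. The function names and the toy data are hypothetical, and the example assumes every annotator labels every sample.

```python
def annotator_quality(crowd_labels, integrated):
    """Assumed estimate: share of the I samples where annotator j matches the integrated label."""
    num_samples = len(integrated)  # I in the text
    agree = {}
    for i, by_annotator in crowd_labels.items():
        for j, label in by_annotator.items():
            # Indicator function: contributes 1 when the labels match, else 0.
            agree[j] = agree.get(j, 0) + (1 if label == integrated[i] else 0)
    return {j: n / num_samples for j, n in agree.items()}

def average_quality(quality):
    """Assumed estimate: arithmetic mean of the J annotator qualities."""
    return sum(quality.values()) / len(quality)

# Hypothetical toy data: three annotators, four samples, majority-vote labels.
crowd = {
    "s1": {"a1": 1, "a2": 1, "a3": 0},
    "s2": {"a1": 0, "a2": 0, "a3": 0},
    "s3": {"a1": 1, "a2": 0, "a3": 1},
    "s4": {"a1": 0, "a2": 1, "a3": 0},
}
integrated = {"s1": 1, "s2": 0, "s3": 1, "s4": 0}
quality = annotator_quality(crowd, integrated)
mean_quality = average_quality(quality)
```

If annotators label only a subset of the samples, dividing by the number of samples each annotator actually labeled may be the more sensible normalization; the text does not say which is intended.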

For a sample i with n crowdsourced labels, the integrated label quality q is calculated as:

The bounds α and β on the estimated number of potential noise labels after label integration are calculated respectively as:

Step (2): (noise identification) This step takes two parameters, K and M, where K is the number of folds in the K-fold cross-validation below and M is the number of high-quality data sets to build. Typically K is set to 10 and M to 5.

(2-1) Step (2-1) is a loop of M rounds. Each round m builds one high-quality data set HQ^(m) and performs the related calculations, as follows:

(2-1-1) The order of the samples in D^I is randomly shuffled, and D^I is divided into K equal parts. Each part is used once as the test set, with the other K-1 parts as the training set. A classifier is trained on these K-1 parts and used to predict the samples in the test set.
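The shuffle-and-split of step (2-1-1) can be sketched as below. This is an illustrative helper only; the name `k_fold_splits` and the use of sample ids rather than full records are assumptions. Folds are near-equal rather than exactly equal when the number of samples is not a multiple of K.

```python
import random

def k_fold_splits(sample_ids, k, rng):
    """Shuffle the ids, cut them into k near-equal folds, and yield (train, test) pairs."""
    ids = list(sample_ids)
    rng.shuffle(ids)
    # Fold f takes every k-th id starting at position f.
    folds = [ids[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        # The training set is the union of the other k-1 folds.
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# One round over 20 hypothetical sample ids with K = 5.
splits = list(k_fold_splits(range(20), k=5, rng=random.Random(0)))
```

Running the generator M times with fresh shuffles gives the M rounds of step (2-1); each sample is held out exactly once per round, so each round yields one prediction per sample.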

(2-1-2) The trained classifier predicts each sample i, giving the probabilities that sample i belongs to class 1, class 2, ..., class H, where H is the total number of classes. An uncertainty value is calculated from these probabilities; it describes the degree of uncertainty of the sample's label. If the predicted label of sample i differs from its integrated label obtained in step (1), a counter for the sample is updated; this counter records the number of times each sample's predicted label differs from its integrated label from step (1). If the predicted label of sample i is the same as its integrated label obtained in step (1), sample i is added to HQ^(m).
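The bookkeeping of step (2-1-2) can be sketched as follows. The uncertainty formula is not reproduced in this text, so the sketch assumes Shannon entropy of the predicted class distribution as a stand-in; the names `entropy` and `filter_round` and the data layout are likewise hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a class distribution (assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_round(pred_probs, integrated, classes, disagreements):
    """Process the held-out predictions of one cross-validation round.

    pred_probs: {sample_id: [p_1, ..., p_H]}, one probability per class.
    Returns (high-quality set for this round, {sample_id: uncertainty}) and
    updates the running disagreement counter in place.
    """
    hq, uncertainty = set(), {}
    for i, probs in pred_probs.items():
        uncertainty[i] = entropy(probs)
        # Predicted label is the class with the highest probability.
        predicted = classes[max(range(len(probs)), key=probs.__getitem__)]
        if predicted == integrated[i]:
            hq.add(i)  # prediction agrees with the integrated label
        else:
            disagreements[i] = disagreements.get(i, 0) + 1
    return hq, uncertainty

# Hypothetical two-class example.
classes = [0, 1]
pred_probs = {"s1": [0.9, 0.1], "s2": [0.2, 0.8], "s3": [0.5, 0.5]}
integrated = {"s1": 0, "s2": 0, "s3": 0}
disagreements = {}
hq, uncertainty = filter_round(pred_probs, integrated, classes, disagreements)
```

The `disagreements` dictionary is passed in and mutated so that the same counter can accumulate across all M rounds.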

(2-2) After the M high-quality data sets have been built, the following is calculated for every sample in D^I:

All samples are then sorted in descending order of this value.

(2-3) The number of samples θ is calculated. Finally, the number of samples n_r in the finally selected noise set D^N is calculated by formula. The first n_r samples in the descending order from step (2-2) are deleted from data set D^I and form the noise data set D^N; the data remaining in D^I form the clean data set D^C.
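The final split into D^N and D^C can be sketched as below. The ranking score and the formula for n_r are not reproduced in this text, so both are taken as given inputs here; the function name `split_noise_clean` and the example scores are hypothetical.

```python
def split_noise_clean(scores, n_r):
    """Rank samples by noise score (descending) and cut off the top n_r as noise."""
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)
    noise = ranked[:n_r]   # D^N: the n_r most suspicious samples
    clean = ranked[n_r:]   # D^C: everything else
    return noise, clean

# Hypothetical noise scores for four samples; n_r is assumed to be 2.
scores = {"s1": 0.1, "s2": 0.9, "s3": 0.4, "s4": 0.7}
noise, clean = split_noise_clean(scores, n_r=2)
```

Because the two lists partition the ranked samples, the relation D^I = D^N + D^C holds by construction.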

Step (3): (noise correction) The noise correction process performs the following steps for each sample i in data set D^N:

(3-1) Sample i is removed from each of the M high-quality data sets HQ^(1), HQ^(2), ..., HQ^(M) that contain it, and M classifiers L^(1), L^(2), ..., L^(M) are built on them respectively. These classifiers are then used to predict the class label of sample i, giving M predicted values.

(3-2) Majority voting is applied to these M predicted values: the predictions are counted per class, and if the label of class c_k has the largest count, the corrected integrated label of the sample is c_k. If more than one class has the largest count, one of them is selected at random as the final integrated label of the sample.
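Steps (3-1) and (3-2) can be sketched as follows. The embodiment leaves the choice of classifier open, so this example swaps in a 1-nearest-neighbor rule purely as a stand-in; `nearest_neighbor_label`, `correct_label`, and the toy data are assumptions, not part of the source.

```python
import random
from collections import Counter

def nearest_neighbor_label(train, x):
    """Stand-in classifier: label of the closest training point (1-NN)."""
    def dist(item):
        features, _ = item
        return sum((a - b) ** 2 for a, b in zip(features, x))
    return min(train, key=dist)[1]

def correct_label(sample_id, features, hq_sets, rng):
    """Re-predict one noisy sample with one model per high-quality set, then vote."""
    votes = []
    for hq in hq_sets:  # hq: {sample_id: (features, integrated_label)}
        # Step (3-1): leave the sample itself out of the training data.
        train = [v for k, v in hq.items() if k != sample_id]
        votes.append(nearest_neighbor_label(train, features))
    # Step (3-2): majority vote over the M predictions, random tie-break.
    counts = Counter(votes)
    top = max(counts.values())
    winners = sorted(c for c, n in counts.items() if n == top)
    return rng.choice(winners)

# Hypothetical data: sample "x" carries a suspicious label 0 in the first set.
hq1 = {"a": ((0.0, 0.0), 0), "b": ((1.0, 1.0), 1), "x": ((0.9, 0.9), 0)}
hq2 = {"a": ((0.1, 0.0), 0), "b": ((1.0, 0.9), 1)}
hq3 = {"a": ((0.0, 0.1), 0), "b": ((0.8, 1.0), 1)}
corrected = correct_label("x", (0.9, 0.9), [hq1, hq2, hq3], random.Random(0))
```

Removing the sample from its own training data matters here: without that step, the 1-NN model built on the first set would simply return the sample's own noisy label.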

Step (4): (data merging) The data set D^N, with its labels corrected by the steps above, is merged with data set D^C to form data set D^E. D^E has the same samples as D^I, but the class labels of potential noise samples have been corrected by the above process, so the quality of the data set is improved.
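The merge of step (4) amounts to a disjoint union of the corrected noise set and the clean set. A minimal sketch, assuming each data set is held as a dictionary from sample id to label (the name `merge_enhanced` and the example labels are hypothetical):

```python
def merge_enhanced(corrected_noise, clean):
    """Combine corrected noise samples and clean samples into the enhanced set D^E."""
    overlap = corrected_noise.keys() & clean.keys()
    if overlap:
        # D^N and D^C partition D^I, so no sample should appear in both.
        raise ValueError(f"samples in both sets: {sorted(overlap)}")
    return {**clean, **corrected_noise}

# Hypothetical labels before and after correction.
integrated = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
clean = {"s1": 0, "s3": 1}
corrected_noise = {"s2": 1, "s4": 1}  # s2 was relabeled from 0 to 1
enhanced = merge_enhanced(corrected_noise, clean)
```

The result has exactly the sample ids of the integrated set, differing only in the labels that were corrected.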

In the above embodiment, a suitable classification algorithm can be chosen for building the classifiers according to the type of data to be processed: for example, a Bayes classifier or a decision tree for text data, and a support vector machine or a neural network for image data. The construction follows the standard procedure for building machine learning classifiers.

The above embodiment does not limit the invention, and the invention is not limited to it; any implementation that meets the requirements of the invention falls within its scope of protection.

Claims

1. A method for improving the quality of crowdsourced labeled data based on label noise correction, comprising the following steps:

(1) running a label integration algorithm on an initial crowdsourced labeled data set to form an integrated data set in which each sample has an integrated label, and estimating the quality of each annotator and the quality of each integrated sample label during or after label integration, wherein annotator quality is defined as the probability that the annotator provides a correct label, and the integrated label quality of a sample is defined as the probability that the integrated label equals the sample's true label;

(2) performing multiple rounds of K-fold cross-validation on the integrated data set and building a high-quality data set in each round, wherein the high-quality data set of each round is built as follows: before the round starts, the high-quality data set is set to empty; then, during cross-validation, each sample in the data set is checked for whether its integrated label is consistent with the label predicted for it in this round, and if so, the sample is added to the high-quality data set;

(3) using the class prediction probabilities obtained for each sample during the multiple rounds of K-fold cross-validation, combined with the annotator quality and label quality, determining label noise samples, and separating them from the integrated data set to form a label noise data set, the remainder forming a clean data set;

(4) training a classification model on the high-quality data sets produced during the multiple rounds of K-fold cross-validation, and using the model to re-predict and replace the labels of the samples in the label noise data set;

(5) merging the processed label noise data set and the clean data set to form a quality-enhanced data set.

2. The method for improving the quality of crowdsourced labeled data according to claim 1, comprising the label integration algorithm run on the initial crowdsourced labeled data set, characterized in that the algorithm uses at least the labels given by crowdsourced annotators to each sample in the data set to estimate the true label of each sample, the estimated label being called the integrated label.

3. The method for improving the quality of crowdsourced labeled data according to claim 1, comprising estimating the quality of each annotator, characterized in that the estimate of annotator quality is either provided directly by the label integration algorithm or calculated from its results.

4. The method for improving the quality of crowdsourced labeled data according to claim 1, comprising estimating the quality of each integrated sample label, characterized in that the estimate of integrated label quality is either provided directly by the label integration algorithm or calculated from its results.

5. The method for improving the quality of crowdsourced labeled data according to claim 1, comprising the step of identifying noise-labeled samples, characterized in that: the probability that each sample's label belongs to each class is determined in K-fold cross-validation; the probabilities predicted over multiple rounds of K-fold cross-validation are used to calculate the degree to which the sample's label is likely to be noise, and all samples are sorted by this degree; the number of noise samples is determined from the annotator quality and integrated label quality, combined with the degree to which sample labels are likely to be noise; and the label noise samples are determined from the number of noise samples and the ordering.

6. The method for improving the quality of crowdsourced labeled data according to claim 1, comprising dividing the integrated data set into a noise data set and a clean data set, characterized in that the clean data set is formed by removing the label noise samples from the integrated data set, and the labels of the samples in the clean data set do not change in subsequent steps.

7. The method for improving the quality of crowdsourced labeled data according to claim 1, comprising training a classification model on the high-quality data sets produced during the multiple rounds of K-fold cross-validation and using the model to re-predict and replace the labels of the samples in the label noise data set, characterized in that one or more of the high-quality data sets are used for supervised-learning-based classification model training to build one or more classifiers, and the labels of the label noise samples are re-predicted and replaced using one of the classifiers alone or several classifiers in combination.

8. The method for improving the quality of crowdsourced labeled data according to claim 1, comprising finally forming an enhanced data set, characterized in that the enhanced data set is formed by merging the corrected noise data set and the clean data set, and has the same samples as the integrated data set but improved integrated label quality.

9. The method for improving the quality of crowdsourced labeled data according to claim 7, characterized in that, for training the classification model, a suitable supervised-learning-based classification model training algorithm is selected according to the domain of the data set being processed.