CN112766337A - Method and system for predicting correct label of crowdsourced data - Google Patents

Method and system for predicting correct label of crowdsourced data

Info

Publication number
CN112766337A
Authority
CN
China
Prior art keywords
data
label
crowdsourcing
neural network
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110028695.4A
Other languages
Chinese (zh)
Other versions
CN112766337B (en)
Inventor
陈益强
卢旺
于汉超
杨晓东
张迎伟
谷洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Peng Cheng Laboratory
Original Assignee
Institute of Computing Technology of CAS
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS, Peng Cheng Laboratory filed Critical Institute of Computing Technology of CAS
Priority to CN202110028695.4A priority Critical patent/CN112766337B/en
Publication of CN112766337A publication Critical patent/CN112766337A/en
Application granted granted Critical
Publication of CN112766337B publication Critical patent/CN112766337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a system for predicting the correct label of crowdsourcing data. The method uses a neural network model that is first trained on reference labels, where the reference label of each crowdsourcing data is obtained from the mean value of all of its initial labels. The neural network model is then used to obtain a prediction label for each crowdsourcing data, and the current neural network model is iteratively calibrated based on the credibility of each initial label of each crowdsourcing data relative to the prediction label, until the neural network model converges or its accuracy declines continuously. The method and the system reduce the dependence of deep learning on the ability of crowdsourcing workers, thereby improving the accuracy and the robustness of the deep learning model.

Description

Method and system for predicting correct label of crowdsourced data
Technical Field
The invention relates to the technical field of data mining analysis, in particular to a method and a system for predicting correct labels of crowdsourced data.
Background
In recent years, deep learning has markedly advanced the state of the art in every branch of machine learning and has revolutionized the field. As supervised artificial neural networks continue to grow in scale, the demand of deep learning for accurately labeled data sets for learning feature representations also increases. Crowdsourcing distributes labeling tasks to different workers, so a large amount of labeled data can be acquired in a short time at greatly reduced cost. It is a fast, effective and cheap way of acquiring data labels and is widely used for large-scale data labeling. However, crowdsourcing introduces a large number of non-expert workers, and factors such as sample difficulty and worker ability introduce varying degrees of noise into the data labels.
In response to the above problems, many scholars and researchers have conducted related studies. For example, chinese patent application CN201711113706.9 discloses a method for obtaining a crowdsourcing cost complexity, in which a task allocation module allocates crowdsourcing tasks to a group of workers selected from a worker pool, the task allocation module performs task allocation to obtain probability distribution of the ability of the workers participating in task processing and variance and expectation of the worker distribution, a parameter learning model performs parameter learning to obtain worker parameters, a result aggregation module obtains task results, and a crowdsourcing process cost complexity is obtained according to the worker parameters. Chinese patent application CN201510958745.3 discloses a crowdsourcing annotation integration method, which utilizes a regularization super-parameter, a spacing distance super-parameter, an annotator voting weight, and the difference between the times of annotating a current prediction item with a corresponding estimation value by an annotator and the times of annotating the current prediction item with a secondary category by the annotator to define a generalized inverse Gaussian distribution, samples to obtain an auxiliary parameter, and utilizes the auxiliary parameter to update the weight of the annotator, thereby obviously enhancing the discrimination capability of a model. Then, a traditional labeling integration majority voting model and a confusion matrix model are integrated, and the purpose of more comprehensively describing the data generation process is further achieved. 
Chinese patent application CN201910770300.0 discloses a deep learning target detection method and system based on crowdsourcing repeated labels, firstly receiving an original training set picture in an application scene, and collecting data labels; then, preprocessing the original training set picture to obtain preprocessed data; and then training a crowdR-CNN target detection model by using the preprocessed data, and adding a label aggregation layer according to the data labels on the basis of the two-stage model, so that the real type of the target is inferred according to the individual sensitivity of the annotator, and a prediction result is obtained through a crowdR-CNN network according to the detection data. In addition, a crowd-sourcing method based on deep learning is provided, wherein a crowd-sourcing layer (crowd layer) is added behind an output layer, and the crowd-sourcing layer simulates the capability of crowd-sourcing data workers to achieve the purpose of converting a real label and a crowd-sourcing label, so that crowd-sourcing data can be processed end to end.
However, existing deep learning crowdsourcing methods typically rely on worker ability for inference. Because worker ability is difficult to estimate, and the estimate is often inaccurate for specific sample data, inferring sample labels from worker ability often yields unsatisfactory final results.
Therefore, a need exists for a method and system for predicting the correct label of crowd sourced data.
Disclosure of Invention
Therefore, an object of the embodiments of the present invention is to overcome the above-mentioned drawbacks of the prior art, and provide a method and a system for predicting a correct label of crowdsourcing data, so as to reduce the dependency on the ability of a worker in the crowdsourcing data, thereby improving the accuracy and robustness of a deep learning model.
The above purpose is realized by the following technical scheme:
According to a first aspect of the embodiments of the present invention, there is provided a model training method for predicting correct labels of crowdsourced data, including: obtaining a crowdsourcing data set, wherein each crowdsourcing data in the crowdsourcing data set has a plurality of initial labels; obtaining a reference label of the corresponding crowdsourcing data based on the average value of all initial labels of each crowdsourcing data, so as to train a neural network model; and obtaining a prediction label for each crowdsourcing data using the neural network model and calibrating the neural network model based on the credibility of each initial label of each crowdsourcing data relative to the prediction label, repeating until the neural network model converges or the accuracy continues to decline.
In one embodiment, said calibrating said neural network model based on the trustworthiness of each initial tag of said each crowdsourcing data relative to said predicted tag comprises: taking the credibility of each initial label of each crowdsourcing data relative to the prediction label as a sampling weight of each initial label of each crowdsourcing data; and carrying out weighted sampling on each crowdsourcing data and the initial label corresponding to the sampling weight according to the sampling weight, and retraining the neural network model.
In one embodiment, further comprising: and normalizing each crowdsourcing data and the reference label and each initial label thereof by using the mean value and the standard deviation of the reference label of each crowdsourcing data in the crowdsourcing data set so as to train a neural network model.
In one embodiment, performing weighted sampling of each crowdsourcing data and the initial label corresponding to the sampling weight, and retraining the neural network model, is equivalent to changing the loss function of the neural network model to:
$$L = \sum_{i}\sum_{j} \tilde{q}_{ij}\,\bigl(f(\tilde{x}_i) - \tilde{y}_{ij}\bigr)^2$$
where $\tilde{q}_{ij}$ is the normalized credibility of the jth initial label of the ith crowdsourcing data, $f(\tilde{x}_i)$ is the prediction label of the ith crowdsourcing data predicted by the neural network model, $\tilde{x}_i$ is the ith crowdsourcing data, and $\tilde{y}_{ij}$ is the jth initial label of the ith crowdsourcing data after normalization.
In one embodiment, the credibility of each initial label of each crowdsourcing data relative to the prediction label is obtained through a Gaussian kernel function, and the formula is as follows:
$$q_{ij} = e^{-\frac{(\tilde{y}_{ij} - f(\tilde{x}_i))^2}{2\sigma^2}}$$
where $f(\tilde{x}_i)$ is the prediction label of the ith crowdsourcing data predicted by the neural network model, $\tilde{y}_{ij}$ is the jth initial label of the ith crowdsourcing data after normalization, e is the natural constant, and $\sigma^2$ is a preset fixed parameter.
Another aspect of the present invention provides a method for predicting a correct label of crowdsourced data, comprising: obtaining a crowdsourcing data set, each crowdsourcing data in the crowdsourcing data set having a number of initial labels; and obtaining a prediction label of each crowdsourcing data in the crowdsourcing data set by using the neural network model obtained by any one of the above training methods, and taking the prediction label as the correct label of the corresponding crowdsourcing data.
Another aspect of the present invention provides a system for predicting correct labels of crowdsourced data, comprising: an interface module, configured to obtain a crowdsourcing data set, wherein the crowdsourcing data set comprises a training data set and a test data set, and each crowdsourcing data in the crowdsourcing data set has a plurality of initial labels; a training module, configured to obtain a reference label of the corresponding training data based on the average value of all initial labels of each training data in the training data set, so as to train a neural network model; a calibration module, configured to obtain a prediction label of each training data by using the neural network model, and to calibrate the neural network model based on the credibility of each initial label of each training data relative to the prediction label until the neural network model converges or the accuracy continues to decrease; and a prediction module, configured to obtain a prediction label of each test data in the test data set by using the calibrated neural network model, and to take the prediction label as the correct label of the corresponding crowdsourcing data.
Another aspect of the invention provides a storage medium in which a computer program is stored which, when being executed by a processor, is operable to carry out the method of any one of the preceding claims.
Another aspect of the invention provides an electronic device comprising a processor and a memory, the memory having stored therein a computer program operable to, when executed by the processor, implement the method of any one of the above.
The technical scheme of the invention can comprise the following beneficial effects:
The method avoids the low deep learning model accuracy that results from inaccurate worker-ability estimates, which arise when an individual worker labels only a small amount of sample data or when a worker's labeling accuracy is inconsistent across different sample data. It effectively reduces the dependence of deep learning methods that use crowdsourcing data on worker ability, yields a more robust deep learning model when there are many workers and few labels per worker, and maintains relatively high accuracy in real crowdsourcing environments.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 illustrates a flow diagram of a model training method for predicting correct labels for crowd-sourced data, in accordance with one embodiment of the invention;
FIG. 2 illustrates a block diagram of a system for predicting the correct label of crowd-sourced data, according to one embodiment of the invention;
FIG. 3 shows a diagram of a function data true tag constructed in an experiment and an initial tag given by a worker, according to one embodiment of the invention;
FIG. 4 illustrates an in-experiment simulation dataset deep learning model, according to one embodiment of the invention;
FIG. 5 illustrates a MovieReviews dataset deep learning model in an experiment according to one embodiment of the invention;
FIG. 6 shows a graph of $R^2$ as it varies with the training period in an experiment, according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In order to solve the above problems in the prior art, the present invention provides a method and a system for predicting a correct label of crowdsourcing data, wherein the method employs a neural network model for predicting the correct label of crowdsourcing data, so as to effectively solve the problem that the inference accuracy of deep learning crowdsourcing methods depends heavily on the ability of workers.
FIG. 1 shows a flow diagram of a model training method for predicting correct labels for crowd-sourced data, in accordance with an embodiment of the invention. As shown in fig. 1, the model training method comprises two stages: an initial model training phase and a model calibration phase. In the initial model training phase (i.e., steps S110-S120), a crowdsourcing platform is used to obtain a crowdsourcing data set, and a plurality of initial labels of each crowdsourcing data are aggregated, a reference label of the corresponding crowdsourcing data is obtained based on an average value of all the initial labels of each crowdsourcing data, and then the initial model is trained using the crowdsourcing data and the corresponding reference label. In the model calibration stage (i.e., steps S130-S150), a parameter estimation strategy of an Expectation-Maximization algorithm (EM) is adopted, and through step E, a prediction label of each crowdsourced data is obtained by using the current model; through the M step, the current model is calibrated based on the credibility of each initial label of each crowdsourcing data relative to the predicted label, and iteration is repeated until convergence or the situation that the model precision is continuously reduced occurs, so that a more robust model is obtained. The method comprises the following specific steps:
step S110, a crowdsourcing data set is obtained.
Crowdsourcing data sets applied to different learning tasks may be obtained from existing crowdsourcing platforms. Each crowdsourced data in the crowdsourced data set has a number of initial labels from different workers, and the number of initial labels of each crowdsourced data can be different or can be partially the same.
Step S120, obtaining a reference label of the corresponding crowdsourcing data based on a mean value of all initial labels of each crowdsourcing data, so as to train the neural network model.
In one embodiment, all initial labels of each crowdsourcing data in the crowdsourcing data set may be averaged and the average used as the reference label of the corresponding crowdsourcing data; the neural network model is then trained with each crowdsourcing data as input and its reference label as output. For example, the set of all initial labels of the ith crowdsourcing data $x_i$ in the crowdsourcing data set is
$$\{y_{ij}\}_{j=1}^{n_i}$$
The mean $m_i$ of all initial labels of the ith crowdsourcing data $x_i$ is taken as the reference label of $x_i$, and the formula is as follows:
$$m_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$$
where i denotes the ith crowdsourcing data, j denotes the jth initial label, $y_{ij}$ denotes the jth initial label of the ith crowdsourcing data, and $n_i$ denotes the number of all initial labels of the ith crowdsourcing data.
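The averaging step can be illustrated with a minimal NumPy sketch. The toy labels and the helper name `reference_labels` are hypothetical and chosen only for illustration; the only thing taken from the text is that each item may have a different number of initial labels and that its reference label is their mean.

```python
import numpy as np

# Hypothetical toy data: three items, each labelled by a different number of workers.
initial_labels = [
    [4.0, 5.0, 3.0],  # y_1j, n_1 = 3
    [2.0, 2.5],       # y_2j, n_2 = 2
    [9.0],            # y_3j, n_3 = 1
]

def reference_labels(labels_per_item):
    """Return m_i = (1 / n_i) * sum_j y_ij for every item i."""
    return np.array([np.mean(labels) for labels in labels_per_item])

m = reference_labels(initial_labels)
print(m)
```

A ragged list of lists is used rather than a 2-D array because the number of labels per item is allowed to differ.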
In one embodiment, the mean value of all initial labels of each crowdsourcing data in the crowdsourcing data set may be normalized, and the normalized mean value is used as the reference label of the corresponding crowdsourcing data to train the neural network model. In one embodiment, each crowdsourcing data, its reference label and each of its initial labels may be normalized using the mean and standard deviation of the reference labels of all crowdsourcing data in the crowdsourcing data set, and the formula is as follows:
$$\tilde{m}_i = \frac{m_i - \bar{y}}{y_{std}}$$
where $m_i$ denotes the reference label of the ith crowdsourcing data, $\bar{y}$ denotes the mean of the reference labels of all crowdsourcing data in the crowdsourcing data set, and $y_{std}$ denotes the standard deviation of the reference labels of all crowdsourcing data in the crowdsourcing data set.
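As a rough illustration of this normalization, the sketch below z-scores both the reference labels and every initial label with the statistics of the reference labels. The function name `normalize_labels` is hypothetical, the population standard deviation (NumPy's default) is an assumption since the text does not say which estimator is used, and only labels are normalized here.

```python
import numpy as np

def normalize_labels(m, initial_labels):
    """Z-score reference and initial labels with the reference-label statistics."""
    y_mean = m.mean()
    y_std = m.std()  # population standard deviation; an assumption
    m_norm = (m - y_mean) / y_std
    # Every initial label is shifted and scaled with the same two numbers.
    y_norm = [(np.asarray(y, dtype=float) - y_mean) / y_std for y in initial_labels]
    return m_norm, y_norm, y_mean, y_std

m = np.array([4.0, 2.0, 9.0])
initial_labels = [[4.0, 5.0, 3.0], [2.0, 2.0], [9.0]]
m_norm, y_norm, y_mean, y_std = normalize_labels(m, initial_labels)
```

Keeping `y_mean` and `y_std` allows predictions to be mapped back to the original label scale later.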
The initial model $f_0$ is trained using the normalized crowdsourcing data and their reference labels $\{(\tilde{x}_i, \tilde{m}_i)\}_{i=1}^{N}$ to obtain a model $f$. The loss function of the model $f$ can be expressed as:
$$L = \frac{1}{N}\sum_{i=1}^{N}\bigl(f(\tilde{x}_i) - \tilde{m}_i\bigr)^2$$
where N is the number of all crowdsourcing data in the crowdsourcing data set, $f(\tilde{x}_i)$ is the prediction label of the normalized ith crowdsourcing data obtained using the model $f$, $\tilde{x}_i$ is the normalized ith crowdsourcing data, and $\tilde{m}_i$ denotes the reference label of the ith crowdsourcing data after normalization.
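The initial training phase can be sketched as ordinary regression on the reference labels. The example below is a stand-in under stated assumptions: it fits a linear model with plain gradient descent on a mean squared error instead of a neural network, and the synthetic data, learning rate and function name `train_mse` are all hypothetical.

```python
import numpy as np

def train_mse(x, targets, epochs=500, lr=0.1):
    """Fit f(x) = x @ w + b by gradient descent on mean squared error."""
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        residual = x @ w + b - targets          # f(x_i) - m_i
        w -= lr * (2.0 / n) * (x.T @ residual)  # gradient of the loss w.r.t. w
        b -= lr * (2.0 / n) * residual.sum()    # gradient of the loss w.r.t. b
    return w, b

# Synthetic items whose (normalized) reference labels follow 1.5 * x - 0.5 plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
m_norm = 1.5 * x[:, 0] - 0.5 + 0.05 * rng.normal(size=200)

w, b = train_mse(x, m_norm)
loss = np.mean((x @ w + b - m_norm) ** 2)
```

Any regression model trained on the same targets would play the same role; the linear model only keeps the sketch short.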
Step S130, obtaining a prediction label of each crowdsourcing data by using the neural network model.
As described above, the present invention employs an EM-like framework parameter estimation strategy during the model calibration phase. The EM is a type of optimization algorithm for carrying out maximum likelihood estimation through iteration and is used for carrying out parameter estimation on a probability model containing hidden variables or missing data. The standard computational framework of the EM algorithm consists of alternating E and M steps, and the convergence of the algorithm ensures that the iteration approaches at least local maxima.
This step is similar to the E step in the EM algorithm, i.e., the predictive label of each crowdsourced data in the crowdsourced data set is obtained using the current neural network model.
Step S140, calibrating the current neural network model based on the credibility of each initial label of each crowdsourcing data relative to the predicted label.
In the initial model training process, each initial label of each crowd-sourced data has equal weight, but the credibility of the initial label given by each worker in the crowd-sourced data set is different due to different abilities of the worker. In this regard, in one embodiment, the reliability of each initial label of each crowdsourcing data relative to a predicted label predicted by the model is used as a sampling weight of each initial label of each crowdsourcing data, each crowdsourcing data and the initial label corresponding to the sampling weight are sampled in a weighted manner according to the sampling weight, and the neural network model is retrained.
According to one embodiment of the invention, the ith crowdsourcing data in the crowdsourcing data set may be copied $n_i$ times to obtain $\{x_i\}_{j=1}^{n_i}$, corresponding to the initial label set $\{y_{ij}\}_{j=1}^{n_i}$, where $n_i$ denotes the number of all initial labels of the ith crowdsourcing data. Expanding the ith crowdsourcing data with each of its initial labels gives $\{(x_i, y_{ij})\}_{j=1}^{n_i}$. The ith crowdsourcing data and its initial labels $\{(x_i, y_{ij})\}_{j=1}^{n_i}$ are then normalized using the mean $\bar{y}$ and the standard deviation $y_{std}$ to obtain the normalized ith crowdsourcing data and its initial labels $\{(\tilde{x}_i, \tilde{y}_{ij})\}_{j=1}^{n_i}$. The jth initial label $\tilde{y}_{ij}$ of the normalized ith crowdsourcing data $\tilde{x}_i$ is compared with the prediction label $f(\tilde{x}_i)$ of that crowdsourcing data predicted by the current model, to obtain the credibility of the jth initial label $\tilde{y}_{ij}$ of the ith crowdsourcing data $\tilde{x}_i$, denoted $q_{ij}$. The sampling weight of each initial label in $\{(\tilde{x}_i, \tilde{y}_{ij})\}_{j=1}^{n_i}$ is then constructed from this credibility.
In an embodiment, the credibility of each initial label of each crowdsourcing data may be normalized, and the normalized credibility is used as the sampling weight for weighted sampling of each crowdsourcing data and its corresponding initial label, so as to retrain the model. The formula for normalizing the credibility of each initial label of each crowdsourcing data is as follows:
Figure BDA0002891221210000081
where $q_{ij}$ is the credibility of the jth initial label of the ith crowdsourcing data.
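The expansion and weighted sampling can be sketched as follows. This is an illustration under assumptions: the helper names `expand` and `weighted_batch` are hypothetical, the credibilities are made-up numbers, and the weights are normalized over all (item, label) pairs so that they form one sampling distribution, which is one plausible reading of the normalization step.

```python
import numpy as np

def expand(x, initial_labels):
    """Repeat x_i once per initial label, giving flat (x_i, y_ij) pairs."""
    counts = [len(labels) for labels in initial_labels]
    x_rep = np.repeat(x, counts, axis=0)
    y_flat = np.concatenate([np.asarray(l, dtype=float) for l in initial_labels])
    item_idx = np.repeat(np.arange(len(x)), counts)  # which item each pair came from
    return x_rep, y_flat, item_idx

def weighted_batch(x_rep, y_flat, q, batch_size, rng):
    """Sample pairs with probability proportional to their credibility q."""
    p = q / q.sum()  # assumption: normalise over all (i, j) pairs
    idx = rng.choice(len(y_flat), size=batch_size, replace=True, p=p)
    return x_rep[idx], y_flat[idx]

x = np.array([[0.0], [1.0], [2.0]])
initial_labels = [[0.1, -0.2, 3.0], [1.1, 0.9], [2.0]]
x_rep, y_flat, item_idx = expand(x, initial_labels)

q = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 1.0])  # hypothetical credibilities
xb, yb = weighted_batch(x_rep, y_flat, q, batch_size=50, rng=np.random.default_rng(0))
```

The outlying label 3.0 has credibility 0 in this toy example, so it can never appear in a sampled batch.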
With the credibility of each initial label of each crowdsourcing data used as its sampling weight, weighted sampling of the crowdsourcing data and their corresponding initial labels yields a batch of training data for retraining the current model $f$. This weighted-sampling training process can be understood as changing the loss function of the model $f$ to:
$$L = \sum_{i}\sum_{j} \tilde{q}_{ij}\,\bigl(f(\tilde{x}_i) - \tilde{y}_{ij}\bigr)^2$$
where $\tilde{q}_{ij}$ is the normalized credibility of the jth initial label of the ith crowdsourcing data, $f(\tilde{x}_i)$ is the prediction label of the normalized ith crowdsourcing data obtained using the current neural network model, $\tilde{x}_i$ is the normalized ith crowdsourcing data, and $\tilde{y}_{ij}$ is the jth initial label of the ith crowdsourcing data after normalization.
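The effect of credibility weights on the loss can be shown with a small numeric sketch. It assumes a squared-error loss over the expanded (item, label) pairs; the function name `weighted_squared_loss` and the toy numbers are hypothetical.

```python
import numpy as np

def weighted_squared_loss(pred_per_item, y_flat, item_idx, q_norm):
    """Sum over all pairs of q_ij * (f(x_i) - y_ij)^2."""
    residual = pred_per_item[item_idx] - y_flat
    return float(np.sum(q_norm * residual ** 2))

pred = np.array([0.0, 1.0])         # model predictions for two items
y_flat = np.array([0.0, 2.0, 1.0])  # item 0 has labels 0 and 2, item 1 has label 1
item_idx = np.array([0, 0, 1])

uniform = np.full(3, 1.0 / 3.0)     # every label trusted equally
down = np.array([0.5, 0.0, 0.5])    # the outlying label 2.0 gets zero weight

loss_uniform = weighted_squared_loss(pred, y_flat, item_idx, uniform)
loss_down = weighted_squared_loss(pred, y_flat, item_idx, down)
```

With uniform weights the outlying label dominates the loss; once its weight is zero, the remaining labels agree with the predictions and the loss vanishes.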
In one embodiment, the credibility of each initial label relative to the prediction label of each crowdsourcing data may be obtained by a Gaussian kernel function. For example, when a relatively large number of workers participate in labeling the ith crowdsourcing data $\tilde{x}_i$, it is assumed that each worker participating in the labeling is intelligent and that the initial labels they give follow a normal distribution
$$\tilde{y}_{ij} \sim \mathcal{N}\bigl(f(\tilde{x}_i), \sigma_i^2\bigr)$$
where $f(\tilde{x}_i)$ denotes the prediction label of the ith crowdsourcing data $\tilde{x}_i$ obtained with the current model, and $\sigma_i^2$ denotes the variance of the labels on the ith crowdsourcing data $\tilde{x}_i$, which can be understood as label fluctuation caused by the difficulty of the crowdsourcing data. Assuming that the difficulty of each crowdsourcing data is the same, namely $\sigma_i^2 = \sigma^2$, the probability of generating the initial label $\tilde{y}_{ij}$ is:
$$p(\tilde{y}_{ij} \mid \tilde{x}_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\tilde{y}_{ij} - f(\tilde{x}_i))^2}{2\sigma^2}}$$
where $f(\tilde{x}_i)$ is the prediction label of the normalized ith crowdsourcing data obtained using the current model, $\tilde{y}_{ij}$ is the jth initial label of the ith crowdsourcing data after normalization, e is the natural constant, and $\sigma^2$ is the variance of the labels on the crowdsourcing data.
Based on the above probability of generating the initial label $\tilde{y}_{ij}$, the credibility of the initial label $\tilde{y}_{ij}$ can be obtained. That is, with $\sigma^2$ fixed, a Gaussian kernel function is defined as the method for calculating initial label credibility, and the formula is as follows:
$$q_{ij} = e^{-\frac{(\tilde{y}_{ij} - f(\tilde{x}_i))^2}{2\sigma^2}}$$
where $f(\tilde{x}_i)$ is the prediction label of the normalized ith crowdsourcing data obtained using the current model, $\tilde{y}_{ij}$ is the jth initial label of the ith crowdsourcing data after normalization, e is the natural constant, and $\sigma^2$ is a preset fixed parameter.
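A Gaussian-kernel credibility of this kind can be computed in a few lines. The sketch assumes the common form exp(-(y - f(x))^2 / (2 * sigma2)) with a fixed `sigma2`; the function name `gaussian_credibility`, the default value of `sigma2` and the toy numbers are hypothetical.

```python
import numpy as np

def gaussian_credibility(pred_per_item, y_flat, item_idx, sigma2=1.0):
    """Credibility of each initial label given the model's prediction for its item."""
    residual = y_flat - pred_per_item[item_idx]
    return np.exp(-residual ** 2 / (2.0 * sigma2))

pred = np.array([0.0, 1.0])         # predictions for two items
y_flat = np.array([0.0, 2.0, 1.0])  # item 0: labels 0 and 2; item 1: label 1
item_idx = np.array([0, 0, 1])

q = gaussian_credibility(pred, y_flat, item_idx, sigma2=1.0)
```

Labels that match the prediction get credibility 1, and credibility decays smoothly with the squared distance; a smaller `sigma2` makes the decay sharper.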
Step S150, repeating steps S130-S140 until the neural network model converges or the accuracy declines continuously.
The calibrated neural network model is repeatedly used to obtain the prediction label of each crowdsourcing data, and the current neural network model is retrained based on the credibility of each initial label of each crowdsourcing data relative to the current prediction label, until the neural network model converges or the accuracy declines continuously.
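The whole loop of steps S130-S150 can be sketched end to end on synthetic data. This is a simplified stand-in, not the patented implementation: a linear model fitted by weighted least squares replaces the neural network, minimizing a credibility-weighted squared loss replaces weighted sampling, accuracy is measured as R² on a hypothetical clean validation set, and `sigma2`, `patience` and `tol` are assumed values.

```python
import numpy as np

def fit_weighted(x, y, w):
    """Weighted least squares for f(x) = a * x + b (stand-in for the network)."""
    A = np.column_stack([x, np.ones_like(x)])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # (a, b)

def predict(coef, x):
    return coef[0] * x + coef[1]

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def calibrate(x, labels, x_val, y_val, sigma2=0.5, max_rounds=20, patience=2, tol=1e-6):
    """labels has shape (N, n): n initial labels per item (equal n for brevity)."""
    n_items, n_lab = labels.shape
    x_rep = np.repeat(x, n_lab)  # copy each item once per label
    y_flat = labels.ravel()
    # Initial model trained on the mean labels (steps S110-S120).
    coef = fit_weighted(x, labels.mean(axis=1), np.ones(n_items))
    best_coef, best_score, bad_rounds = coef, r2(y_val, predict(coef, x_val)), 0
    for _ in range(max_rounds):
        pred = predict(coef, x_rep)                           # S130: predict
        q = np.exp(-(y_flat - pred) ** 2 / (2.0 * sigma2))    # label credibility
        new_coef = fit_weighted(x_rep, y_flat, q / q.sum())   # S140: recalibrate
        score = r2(y_val, predict(new_coef, x_val))
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if score > best_score:
            best_coef, best_score, bad_rounds = coef, score, 0
        else:
            bad_rounds += 1
        if converged or bad_rounds >= patience:               # S150: stop
            break
    return best_coef, best_score

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
truth = 2.0 * x + 1.0
labels = truth[:, None] + 0.1 * rng.normal(size=(200, 5))
labels[:, 4] += 3.0                # one systematically biased worker
x_val = np.linspace(-1.0, 1.0, 50)
y_val = 2.0 * x_val + 1.0          # hypothetical clean validation labels
coef, score = calibrate(x, labels, x_val, y_val)
```

In this toy setting the mean-label model inherits part of the biased worker's offset, and the credibility weights then suppress that worker's labels without any explicit estimate of worker ability. The best-scoring model is returned, which corresponds to keeping the model from before the accuracy started to decline.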
Based on crowdsourced data and corresponding tags
Figure BDA0002891221210000094
Obtaining corresponding likelihood functions
Figure BDA0002891221210000095
Figure BDA0002891221210000096
The method of obtaining the parameter estimate is the maximum likelihood method. Thus, the above steps S130-S150 are similar to the M steps in the EM algorithm.
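The link between the normal-distribution assumption and the retraining objective can be made explicit with the standard maximum-likelihood argument. The following is a sketch and is not taken from the original figures: it assumes the Gaussian label model with a shared variance $\sigma^2$, writes the model parameters as $\theta$ (a symbol introduced here), and treats the credibilities $q_{ij}$ as fixed per-label weights on the log-likelihood.

$$
\sum_{i}\sum_{j} q_{ij}\,\log p(\tilde{y}_{ij} \mid \tilde{x}_i;\theta)
= -\frac{1}{2\sigma^2}\sum_{i}\sum_{j} q_{ij}\,\bigl(\tilde{y}_{ij} - f_\theta(\tilde{x}_i)\bigr)^2
\;-\; \log\bigl(\sqrt{2\pi}\,\sigma\bigr)\sum_{i}\sum_{j} q_{ij}
$$

The second term does not depend on $\theta$, so under these assumptions maximizing the weighted log-likelihood over $\theta$ is the same as minimizing the credibility-weighted squared error.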
In the embodiment, the calibration updating of the model is realized by utilizing the parameter estimation strategy of the EM-like framework, so that a more robust deep learning model is obtained on the basis of the existing model. However, since the EM framework itself depends on the initial value to some extent, and deep learning has a certain instability in the training process (for example, jitter due to an inappropriate learning rate or jitter of the learning process due to an inappropriate sample data weight at present, etc.), the above strategy similar to the EM framework parameter estimation may have a certain instability, which makes it different from the real EM framework. However, deep learning has a certain tolerance to errors, and it can be seen that, under a relatively well initialized deep model condition (the accuracy of the initial model trained by the label mean in the actual condition is within an acceptable range), the labels of crowd-sourced data can be relatively well predicted, so that a relatively good crowd-sourced data initial label reliability estimation can be obtained, and finally a more robust and accurate deep model is obtained, which is a benign loop process. Therefore, when the final model result converges or the accuracy of several continuous iteration processes decreases due to instability, the iteration process is ended, and the model before convergence or accuracy decrease is used as the model of the correct label of the final prediction crowd-sourced data.
This embodiment effectively addresses the low accuracy of worker ability estimation that arises in crowdsourced data when an individual worker has labeled only a few samples or when a worker's labeling accuracy varies across samples, and it improves the accuracy and robustness of the deep learning model.
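The calibration loop described in this embodiment can be illustrated with a minimal sketch. It is not the patented implementation and rests on several assumptions: a weighted least-squares line fit stands in for training the neural network, the credibility uses a Gaussian kernel of the assumed form exp(-d² / (2σ²)), a credibility-weighted loss is used in place of weighted sampling, label normalization is skipped, and only the convergence criterion is implemented (the stop on continuously decreasing accuracy is omitted). All function names are hypothetical.

```python
import numpy as np

def gaussian_credibility(pred, labels, sigma2=1.0):
    """Credibility of every initial label relative to the current prediction.
    Assumed kernel form: exp(-(y_ij - pred_i)^2 / (2 * sigma2))."""
    return np.exp(-(labels - pred[:, None]) ** 2 / (2.0 * sigma2))

def fit_weighted_line(x, y, w):
    """Stand-in for 'train the model': weighted least-squares fit of y = a*x + b."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # array [a, b]

def predict(coef, x):
    return coef[0] * x + coef[1]

def calibrate(x, labels, sigma2=1.0, max_iter=20, tol=1e-6):
    """x: (n,) inputs; labels: (n, m) initial labels from m workers."""
    n, m = labels.shape
    # Initial model: trained on the mean of the initial labels (reference labels).
    coef = fit_weighted_line(x, labels.mean(axis=1), np.ones(n))
    for _ in range(max_iter):
        pred = predict(coef, x)
        w = gaussian_credibility(pred, labels, sigma2)  # shape (n, m)
        # Retrain on every (x_i, y_ij) pair, weighted by the label's credibility.
        new_coef = fit_weighted_line(np.repeat(x, m), labels.ravel(), w.ravel())
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    return coef
```

With two accurate workers and one worker whose labels carry a large constant bias, the mean-label model inherits part of that bias, while the credibility-weighted refits give the biased labels almost no weight after the first pass.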
In one embodiment, a method of predicting correct labels for crowd sourced data is provided, comprising: obtaining a crowdsourcing data set, wherein each crowdsourcing data in the crowdsourcing data set is provided with a plurality of initial labels, then obtaining a prediction label of each crowdsourcing data in the crowdsourcing data set by using the neural network model trained by the training method, and taking the prediction label as a correct label of the corresponding crowdsourcing data.
In one embodiment, a system for predicting correct labels for crowd sourced data is also provided. Fig. 2 is a schematic structural diagram of a system for predicting correct labels of crowd-sourced data according to one embodiment of the invention. As shown in fig. 2, the system 200 includes an interface module 201, a training module 202, a calibration module 203, and a prediction module 204. Although the block diagrams depict components in a functionally separate manner, such depiction is for illustrative purposes only. The components shown in the figures may be arbitrarily combined or separated into separate software, firmware, and/or hardware components. Moreover, regardless of how such components are combined or divided, they may execute on the same computing device or multiple computing devices, which may be connected by one or more networks.
The interface module 201 is configured to obtain a crowdsourcing data set, where the crowdsourcing data set includes a training data set and a test data set, and each crowdsourcing data item in the crowdsourcing data set has a plurality of initial labels. The training module 202 is configured to obtain a reference label for each training data item from the average of all of its initial labels, so as to train the neural network model. The calibration module 203 is configured to obtain a predicted label for each training data item using the neural network model, and to calibrate the neural network model based on the credibility of each initial label of each training data item relative to the predicted label, until the neural network model converges or its accuracy decreases continuously. The prediction module 204 is configured to obtain a predicted label for each test data item in the test data set using the calibrated neural network model, and to use the predicted label as the correct label of the corresponding crowdsourcing data.
In another embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program or executable instructions are stored, and when the computer program or the executable instructions are executed, the technical solution as described in the foregoing embodiments is implemented, and the implementation principle thereof is similar, and is not described herein again. In embodiments of the present invention, the computer readable storage medium may be any tangible medium that can store data and that can be read by a computing device. Examples of computer readable storage media include hard disk drives, Network Attached Storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-R, CD-RWs, magnetic tapes, and other optical or non-optical data storage devices. The computer readable storage medium may also include computer readable media distributed over a network coupled computer system so that computer programs or instructions may be stored and executed in a distributed fashion.
Experimental part
To further verify the validity of the proposed method and system for predicting the correct labels of crowdsourced data, the inventors also performed experiments on a simulation dataset and a real dataset [ Rodrigues, fits, Pereira, francisco. The construction of the simulation dataset is described below. The real dataset is the public dataset MovieReviews [ downloaded from: http:// fprodrigues. com// deep _ movie reviews. tar. gz ]. It contains 5006 movie reviews, and the goal is to score each movie on a scale of 1-10 from its review. Using the AMT platform, the dataset publishers collected scores for 1500 of the reviews from 137 workers, with an average of 4.96 answers per review; the remaining 3506 movie reviews are used for testing.
1) Simulation dataset construction
For the simulation data, three functions are constructed: y = x^3, x ∈ [-2, 2]; y = 5 sin x, x ∈ [-8, 8]; and
Figure BDA0002891221210000111
Five workers are constructed for the first function and four workers for each of the other two. Each worker generates labels with different probability distribution parameters in different intervals. For example, for the first function, the first worker generates the sample label y1 according to N(y+5, 10) on [-2, -1] and according to N(y+1, 0.5) on [-1, 2]. The constructed datasets and the generated worker labels are shown in FIG. 3.
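The worker example above can be sketched as follows. Only the first worker of the first function is specified in the text, so only that worker is shown. Two points are assumptions: the second parameter of N(·, ·) is treated as a variance by default (the text does not say whether it is a variance or a standard deviation), and the shared boundary point x = -1 is assigned to the second interval. The function name is hypothetical.

```python
import numpy as np

def worker1_labels(x, rng, second_param_is_variance=True):
    """Noisy labels of the first worker for y = x**3 on [-2, 2]."""
    y = x ** 3
    left = x < -1.0                         # interval [-2, -1)
    mean = np.where(left, y + 5.0, y + 1.0)
    spread = np.where(left, 10.0, 0.5)
    # Assumption: N(mean, variance); pass False to treat the value as a std instead.
    std = np.sqrt(spread) if second_param_is_variance else spread
    return rng.normal(mean, std)

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 4001)
labels = worker1_labels(x, rng)
```

The same worker is therefore strongly biased and noisy on one interval and only mildly biased on the other, which is the interval-dependent ability that the experiments are designed to probe.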
2) Deep learning model construction
For the simulation dataset, the task is a simple function-fitting problem, so a relatively simple deep learning model is used: a four-layer network consisting of an input layer followed by three fully connected layers. The real-valued input is first mapped to a fully connected layer of 50 neurons; the 50-dimensional real vector output by that layer is fed to a second fully connected layer of 50 neurons; and the final layer is the output layer, which produces the real-valued output of the function. The model is shown in FIG. 4. The network uses MSE as the cost function.
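As a rough illustration of the layer sizes described above (1 input, two hidden layers of 50 neurons, 1 output, MSE cost), the forward pass can be sketched in NumPy. The activation function and the weight initialization are not stated in the text; ReLU and a scaled normal initialization are assumptions here, and the function names are hypothetical.

```python
import numpy as np

def init_mlp(rng, sizes=(1, 50, 50, 1)):
    """Random weights for input -> 50 -> 50 -> output (initialization is assumed)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """x: (n, 1) real inputs; returns (n, 1) real outputs."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU: assumed, the text does not name the activation
    return h

def mse(pred, target):
    """Mean squared error, the cost function named in the text."""
    return float(np.mean((pred - target) ** 2))

params = init_mlp(np.random.default_rng(0))
x = np.linspace(-2.0, 2.0, 8).reshape(-1, 1)
out = forward(params, x)
```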
For the real dataset MovieReviews, the model is somewhat more complex because the task involves natural language processing. Each word is first mapped to an integer, and each review is then brought to the same length, with 0 padded where a review is too short. The integer for each word is mapped to a word vector by an Embedding layer to form the initial features. To save time, this mapping uses fixed parameters: the parameters in glove300B are used as the Embedding matrix. A deep network then learns and extracts features automatically. This part comprises a 3 × 3 convolutional layer with 128 features, a 5 × 5 pooling layer, a 5 × 5 convolutional layer with 128 features, a 5 × 5 pooling layer, and a fully connected layer with 32 neural units, followed by the output. The model is shown in FIG. 5. The network uses MSE as the cost function.
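The preprocessing step (word to integer, then padding every review with 0 to a common length) can be sketched in plain Python. Whitespace tokenization, lowercasing, and truncation of reviews longer than the target length are assumptions that the text does not specify; the function names are hypothetical.

```python
def build_vocab(reviews):
    """Map each distinct word to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for review in reviews:
        for word in review.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(review, vocab, max_len):
    """Integer-encode a review and pad it with 0 up to max_len.
    Truncating longer reviews is an assumption, not stated in the text."""
    ids = [vocab[w] for w in review.lower().split() if w in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))

reviews = ["a good movie", "bad"]
vocab = build_vocab(reviews)
encoded = [encode(r, vocab, max_len=3) for r in reviews]
```

The resulting integer sequences are what an Embedding layer would look up in the fixed word-vector matrix.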
3) Comparison method
To compare the label-credibility-based deep learning crowdsourcing method (denoted method four) with existing methods, the invention uses one basic method and two recent deep learning crowdsourcing methods as comparison methods:
An algorithm that uses the mean as the sample label (denoted method one)
A deep learning crowdsourcing method corrected with Crowdlayer (+B) (denoted method two)
A deep learning crowdsourcing method with mean initialization and correction with Crowdlayer (+B) (denoted method three)
4) Evaluation index
Corr: the Pearson product-moment correlation coefficient, which measures the linear correlation between two variables X and Y and takes values between -1 and 1. In regression problems, a larger value is generally better.
MAE: the mean absolute error, the average of the absolute errors, which reflects the actual magnitude of the prediction error well. A smaller value is generally better.
RMSE: the root mean square error, which measures the deviation between the observed values and the true values and is often used to assess the predictions of a machine learning model. A smaller value is generally better.
R²: R-squared (the coefficient of determination), generally used to describe how well the model fits the data. Its maximum value is 1, and a larger value generally indicates a better fit.
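The four indices can be computed as in the sketch below. The text does not give formulas, so the standard definitions are assumed, in particular R² = 1 - SS_res / SS_tot; the function name is hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Corr, MAE, RMSE and R2 under their standard definitions (assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    corr = float(np.corrcoef(y_true, y_pred)[0, 1])
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    return {"Corr": corr, "MAE": mae, "RMSE": rmse, "R2": r2}

# Predictions that are exactly the truth shifted by +1.
m = regression_metrics([1, 2, 3, 4], [2, 3, 4, 5])
```

In this toy case Corr is 1 while MAE and RMSE are 1 and R² is 0.2: a constant offset leaves the correlation untouched but is penalized by the other three indices. This may help when reading result tables in which Corr is close to 1 for every method while R² differs widely.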
5) Analysis of Experimental results
Simulation data set experimental result analysis
Because the simulation dataset is relatively simple and the differences between the results are obvious, the experiment was carried out only once. The results are shown in Table 1 and Table 2.
Function 1    Corr   MAE    RMSE   R²
Method one    0.986  2.186  2.246  0.450
Method two    0.985  2.144  2.214  0.465
Method three  0.988  2.129  2.185  0.479
Method four   0.999  0.981  0.997  0.892
Function 2    Corr   MAE    RMSE   R²
Method one    0.960  2.478  2.690  0.432
Method two    0.950  2.694  2.938  0.322
Method three  0.929  2.952  3.298  0.146
Method four   0.989  0.797  0.960  0.928
Function 3    Corr   MAE    RMSE   R²
Method one    0.969  2.456  2.699  0.617
Method two    0.976  2.637  2.809  0.585
Method three  0.975  2.890  3.029  0.517
Method four   0.996  1.261  1.319  0.908
Table 1. Simulation dataset experimental results
Figure BDA0002891221210000131
Table 2. Simulation dataset experimental results (images)
As Tables 1 and 2 show, method four, the label-credibility-based deep learning crowdsourcing method of the present invention, performs best in all three experiments, far better than learning from the mean and than correction with Crowdlayer. In experiment 1, methods two and three perform slightly better than method one, whereas in experiments 2 and 3 they perform even worse than method one. From the construction process it is easy to see that, in the datasets built from the three functions, each worker's ability differs from interval to interval. Current crowdsourcing methods, when modeling worker ability, generally assume that a worker's judgment is the same for all samples, or consider only a simple combination of sample difficulty and overall worker ability. This leads to a wrong assessment of a worker's ability on individual samples, so the result of crowdsourced learning can be even worse than that of the model trained on mean labels. In the method of the invention, since worker ability is difficult to estimate, the credibility of each label is considered directly, and the initial model trained on mean labels is corrected directly to obtain a more robust and accurate prediction model.
Analysis of MovieReviews data set Experimental results
Because the real dataset MovieReviews is a natural language processing dataset labeled by real people, a variety of situations can arise in the labels: a single worker may label only a few samples, different workers may differ in their ability to label the same sample, and the same worker may differ in ability across samples. The resulting large number of wrong labels makes training unstable. Therefore 10 experiments were performed on this dataset with each of method one, method three and method four, and the overall results were observed and compared. To examine the shortcomings of the existing method in detail, the result of each epoch of methods three and four is also shown. The results are shown in Table 3 and FIG. 6.
Table 3 shows that all three methods fluctuate to some degree over the ten experiments; method three fluctuates the most and method four is the most stable. In this regression problem, method four also performs best: its R² is better than that of methods one and three. Although method three is initialized with mean labels, its pre-training period is short because the role of Crowdlayer needs to be strengthened, and a single bias cannot represent a worker's ability, so its ability estimates are inaccurate and its results are mediocre. In addition, because of the added Crowdlayer, method three cannot conveniently and quickly evaluate the model's accuracy on the validation set in each iteration, which makes the training process (for example, changes to the learning rate) hard to control. Method four is trained on the basis of method one, so the result of method one influences the result of method four to some extent: when method one performs well, method four is well initialized, the label credibility is judged relatively accurately, and the calibrated model is more robust and accurate. As FIG. 6 shows, the R² of method four changes in a regular way during training. When the previous estimate of the model parameters is relatively good, the label credibility estimate is relatively accurate, credible labels receive larger sampling weights, and the calibrated model becomes more accurate. Because some jitter may occur during training, the parameter estimate at some iteration may be poor; the credibility estimate then becomes relatively inaccurate, the weights of untrustworthy or bad labels increase, and the model accuracy keeps deteriorating, so the iteration must be ended early. The R² curve of method three, by contrast, shows large jitter and severe instability. The label-credibility-based deep learning crowdsourcing method therefore, in many cases, does not estimate worker ability but looks directly at the labels of the samples, and the resulting deep learning model is more robust and accurate.
Run           1      2      3      4      5      6      7      8      9      10     Average
Method one    0.396  0.348  0.365  0.423  0.359  0.331  0.430  0.359  0.346  0.417  0.377
Method three  0.406  0.416  0.388  0.386  0.321  0.357  0.239  0.425  0.277  0.404  0.362
Method four   0.420  0.372  0.391  0.437  0.345  0.364  0.444  0.416  0.395  0.434  0.402
Table 3. R² results of ten experiments on MovieReviews
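The averages in Table 3 can be rechecked from the per-run values with a few lines of Python. The values below are copied from Table 3. The range (maximum minus minimum) is included only as one simple, assumed way to look at run-to-run fluctuation; the text does not say which measure of fluctuation was used.

```python
import numpy as np

# Per-run R2 values copied from Table 3.
table3 = {
    "method one":   [0.396, 0.348, 0.365, 0.423, 0.359, 0.331, 0.430, 0.359, 0.346, 0.417],
    "method three": [0.406, 0.416, 0.388, 0.386, 0.321, 0.357, 0.239, 0.425, 0.277, 0.404],
    "method four":  [0.420, 0.372, 0.391, 0.437, 0.345, 0.364, 0.444, 0.416, 0.395, 0.434],
}

def summarize(runs):
    """Mean, extremes and range of the per-run R2 values of one method."""
    runs = np.asarray(runs, dtype=float)
    return {"mean": float(runs.mean()), "min": float(runs.min()),
            "max": float(runs.max()), "range": float(runs.max() - runs.min())}

summary = {name: summarize(runs) for name, runs in table3.items()}
```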
Reference in the specification to "various embodiments," "some embodiments," "one embodiment," or "an embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," or "in an embodiment," or the like, in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, a particular feature, structure, or characteristic illustrated or described in connection with one embodiment may be combined, in whole or in part, with a feature, structure, or characteristic of one or more other embodiments without limitation, as long as the combination is not illogical or inoperative.
The terms "comprises," "comprising," and "having," and similar referents in this specification, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The word "a" or "an" does not exclude a plurality. Additionally, the various elements of the drawings of the present application are merely schematic illustrations and are not drawn to scale.
Although the present invention has been described by the above embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (9)

1. A model training method for predicting correct labels for crowd sourced data, comprising:
1) obtaining a crowdsourcing data set, wherein each crowdsourcing data in the crowdsourcing data set has a plurality of initial labels;
2) acquiring a reference label of corresponding crowdsourcing data based on the average value of all initial labels of each crowdsourcing data so as to train a neural network model;
3) obtaining a predictive label for each of the crowd-sourced data using the neural network model, and calibrating the neural network model based on a confidence level of each initial label for each of the crowd-sourced data relative to the predictive label;
4) repeating step 3) until the neural network model converges or the accuracy decreases continuously.
2. The training method of claim 1, wherein said calibrating the neural network model based on the confidence of each initial label of the each crowdsourcing data relative to the predicted label comprises:
taking the credibility of each initial label of each crowdsourcing data relative to the prediction label as a sampling weight of each initial label of each crowdsourcing data;
and carrying out weighted sampling on each crowdsourcing data and the initial label corresponding to the sampling weight according to the sampling weight, and retraining the neural network model.
3. The training method of claim 2, wherein step 2) further comprises:
and normalizing each crowdsourcing data and the reference label and each initial label thereof by using the mean value and the standard deviation of the reference labels of all crowdsourcing data in the crowdsourcing data set so as to train a neural network model.
4. The training method of claim 3, wherein the weighted sampling of the each crowdsourced data and the initial label corresponding to the sampling weight according to the sampling weight, and the retraining of the neural network model comprises:
weighted sampling is equivalent to changing the loss function of the neural network model to:
Figure FDA0002891221200000011
wherein
Figure FDA0002891221200000012
is the credibility of the jth initial label of the ith crowdsourcing data after normalization,
Figure FDA0002891221200000021
is the prediction label of the normalized ith crowdsourcing data predicted by the neural network model,
Figure FDA0002891221200000022
is the ith crowdsourcing data after normalization, and
Figure FDA0002891221200000023
is the jth initial label of the ith crowdsourcing data after normalization.
5. The training method of claim 3, wherein the confidence of each initial label of each crowdsourced data relative to the predicted label is obtained by a Gaussian kernel function, and the formula is as follows:
Figure FDA0002891221200000024
wherein
Figure FDA0002891221200000025
is the prediction label of the ith crowdsourcing data predicted by the neural network model,
Figure FDA0002891221200000026
Figure FDA0002891221200000027
is the jth initial label of the ith crowdsourcing data after normalization, e is the natural constant, and σ² is a preset fixed parameter.
6. A method for predicting correct labeling of crowd sourced data, comprising:
obtaining a crowdsourcing data set, each crowdsourcing data in the crowdsourcing data set having a number of initial labels;
obtaining a predictive label for each crowdsourcing data in the crowdsourcing data set using a neural network model obtained by the training method of any one of claims 1-5, and using the predictive label as a correct label for each corresponding crowdsourcing data.
7. A system for predicting correct labeling of crowd sourced data, comprising:
an interface module, configured to acquire a crowdsourcing data set, wherein the crowdsourcing data set comprises a training data set and a test data set, and each crowdsourcing data in the crowdsourcing data set has a plurality of initial labels;
a training module, configured to acquire a reference label of corresponding training data based on the average value of all initial labels of each training data in the training data set, so as to train a neural network model;
a calibration module, configured to obtain a predicted label of each training data by using the neural network model, and calibrate the neural network model based on a credibility of each initial label of each training data with respect to the predicted label until the neural network model converges or the accuracy decreases continuously;
and a prediction module, configured to obtain a predicted label of each test data in the test data set by using the calibrated neural network model, and to take the predicted label as the correct label of each corresponding crowdsourcing data.
8. A storage medium in which a computer program is stored which, when being executed by a processor, is operative to carry out the method of any one of claims 1-6.
9. An electronic device comprising a processor and a memory, the memory having stored therein a computer program which, when executed by the processor, is operable to carry out the method of any of claims 1-6.
CN202110028695.4A 2021-01-11 2021-01-11 Method and system for predicting correct tags for crowd-sourced data Active CN112766337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110028695.4A CN112766337B (en) 2021-01-11 2021-01-11 Method and system for predicting correct tags for crowd-sourced data

Publications (2)

Publication Number Publication Date
CN112766337A true CN112766337A (en) 2021-05-07
CN112766337B CN112766337B (en) 2024-01-12

Family

ID=75701195

Country Status (1)

Country Link
CN (1) CN112766337B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898218A (en) * 2018-05-24 2018-11-27 阿里巴巴集团控股有限公司 A kind of training method of neural network model, device and computer equipment
CN109543756A (en) * 2018-11-26 2019-03-29 重庆邮电大学 A kind of tag queries based on Active Learning and change method
CN110070183A (en) * 2019-03-11 2019-07-30 中国科学院信息工程研究所 A kind of the neural network model training method and device of weak labeled data
CN110580499A (en) * 2019-08-20 2019-12-17 北京邮电大学 deep learning target detection method and system based on crowdsourcing repeated labels
CN110929807A (en) * 2019-12-06 2020-03-27 腾讯科技(深圳)有限公司 Training method of image classification model, and image classification method and device
CN111275079A (en) * 2020-01-13 2020-06-12 浙江大学 Crowdsourcing label speculation method and system based on graph neural network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
GUOWEI XU et al.: "Learning Effective Embeddings From Crowdsourced Labels: An Educational Case Study", 《ARXIV:1908.00086V1》 *
MING WU et al.: "Learning deep networks with crowdsourcing for relevance evaluation", 《EURASIP JOURNAL ON WIRELESS COMMUNICATIONS AND NETWORKING》, pages 1 *
RYAN DRAPEAU: "MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy", 《PROCEEDINGS, THE FOURTH AAAI CONFERENCE ON HUMAN COMPUTATION AND CROWDSOURCING》, 31 December 2016 (2016-12-31) *
WEI WANG et al.: "Obtaining High-Quality Label by Distinguishing between Easy and Hard Items in Crowdsourcing", 《PROCEEDINGS OF THE 26TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-17)》, 19 August 2017 (2017-08-19) *
李易南 et al.: "面向众包数据的特征扩维标签质量提高方法", 《智能系统学报》, no. 02 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113419868A (en) * 2021-08-23 2021-09-21 南方科技大学 Temperature prediction method, device, equipment and storage medium based on crowdsourcing
CN113419868B (en) * 2021-08-23 2021-11-16 南方科技大学 Temperature prediction method, device, equipment and storage medium based on crowdsourcing
WO2023024213A1 (en) * 2021-08-23 2023-03-02 南方科技大学 Crowdsourcing-based temperature prediction method and apparatus, and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant