CN117436449A - Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning - Google Patents


Info

Publication number
CN117436449A
CN117436449A (application CN202311442418.3A)
Authority
CN
China
Prior art keywords
representation
text
layer
training
crowdsourcing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311442418.3A
Other languages
Chinese (zh)
Inventor
田泽庶
张宏莉
王星
叶麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202311442418.3A priority Critical patent/CN117436449A/en
Publication of CN117436449A publication Critical patent/CN117436449A/en
Pending legal-status Critical Current

Classifications

    • G06F 40/295 Named entity recognition (G PHYSICS; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F 40/00 Handling natural language data; G06F 40/20 Natural language analysis; G06F 40/279 Recognition of textual entities; G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F 16/35 Clustering; Classification (G06F 16/00 Information retrieval; database and file system structures; G06F 16/30 Information retrieval of unstructured textual data)
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] (G06N Computing arrangements based on specific computational models; G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology; G06N 3/044 Recurrent networks, e.g. Hopfield networks)
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks (G06N 3/045 Combinations of networks)
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)
    • G06N 3/092 Reinforcement learning


Abstract

A crowdsourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning, belonging to the technical field of crowdsourced named entity recognition. The invention aims to solve two problems: existing domain-adaptation approaches to crowdsourcing often fail to account for annotator reliability, so data from low-quality annotators negatively affects model training; and existing crowdsourced named entity recognition methods struggle to handle the extremely low-quality data submitted by low-quality annotators. The invention deepens the treatment of annotator reliability in domain-adaptation methods for crowdsourced named entity recognition and proposes a reinforcement-learning-based instance selector for data preprocessing. By taking annotator reliability into account and using the instance selector to discard low-quality annotations, it improves the performance of the named entity recognition model on crowdsourced datasets and demonstrates its effectiveness on the named entity recognition challenges posed by crowdsourced annotation. The invention is used to efficiently extract named entity information from unsupervised crowdsourced data.

Description

Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning
Technical Field
The invention relates to a crowd-sourced named entity recognition method and system based on multi-source domain adaptation and reinforcement learning, and belongs to the technical field of crowd-sourced named entity recognition.
Background
Named entity recognition is a critical task in natural language processing: identifying named entities such as person names, place names, and organization names in text. It underpins many natural language processing tasks; in information extraction, question answering, text classification, and similar tasks, accurate recognition of named entities is a prerequisite for building high-quality models and systems. However, traditional named entity recognition suffers from a lack of large-scale, high-quality labeled data, and hiring experts to label data is time-consuming and laborious, whereas crowdsourcing, by distributing tasks to many annotators, can exploit the scale and diversity of the crowd. Compared with traditional expert annotation, crowdsourcing collects large amounts of labeled data more quickly, and annotation from different perspectives and backgrounds improves the comprehensiveness and diversity of the labels. Traditional crowdsourced named entity recognition methods focus on label aggregation, i.e., using some strategy to aggregate the labels provided by multiple crowd annotators into high-confidence labels, thereby reducing the noise introduced by crowdsourced annotation. The most recent crowdsourced named entity recognition methods borrow the idea of domain adaptation: the labels of different annotators are treated as different source domains and expert labels as the target domain, and effective domain-adaptation methods are applied to achieve better recognition results.
Although domain-adaptation-based crowdsourced named entity recognition methods have many advantages, some major drawbacks limit their effectiveness in practical applications:
1. Existing methods that solve the crowdsourcing problem with domain-adaptation models often do not adequately consider annotator reliability. In crowdsourcing tasks, annotators may differ widely in expertise and annotation quality. Conventional domain-adaptation models generally assume that annotators are reliable, treating all annotators' data as equally important during training. In practice this assumption often does not hold, so data from low-quality annotators can negatively affect model training.
2. Existing crowdsourced named entity recognition methods have difficulty handling the extremely low-quality data submitted by low-quality annotators. Because crowdsourcing tasks are open and anonymous, a proportion of annotators are of low quality: they may lack expertise, deliberately submit false annotations, or have other problems, producing extremely low-quality annotation data. Conventional crowdsourcing methods have no effective mechanism for handling such data, which can harm model performance.
The prior document CN115292296A discloses a method for improving the quality of crowdsourced annotation data based on federated learning: K crowdsourcing platforms are randomly selected from a set of platforms, a user's data is randomly divided into K parts and uploaded to the K platforms in one-to-one correspondence, a training dataset is constructed on each platform from the data uploaded to it, and a classifier is trained on each platform. After each round of training, the platforms exchange their classifiers' network parameters, the platforms with higher annotation quality are identified, and the parameters transmitted by those platforms are aggregated; the aggregated parameters serve as the final network parameters for that round. This prior art can reduce label noise, improve the quality of crowdsourced annotation data, and offers stronger privacy protection, but it does not address either of the two problems above.
Disclosure of Invention
The invention aims to solve the technical problems that:
the invention provides a multi-source domain adaptation and reinforcement learning-based crowd-sourced named entity identification method and system, which are used for solving two problems mentioned in the background: the existing method for solving the crowdsourcing problem by using the domain adaptation model often does not fully consider the problems that the reliability of the annotators causes negative influence on model training of the data of the low-quality annotators, and the existing crowdsourcing named entity identification method has difficulty in processing the extremely low-quality data submitted by the low-quality annotators.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A crowdsourced named entity recognition model based on multi-source domain adaptation and reinforcement learning. The model comprises a crowdsourced named entity recognition master model that treats the labels of multiple crowd annotators as multiple source domains and expert labels as the target domain, and includes the following components:
(1) Annotator representation layer: generates a representation of each annotator a_i and an expert representation, and uses them to create parameters through a parameter generation network (PGN). The parameters created from the annotator representation and those created from the expert representation are injected into the Adapter modules of the text representation layer, giving the text representation layer annotator-awareness;
(2) Text representation layer: an improved BERT model, called the Adapter-BERT model, in which an Adapter module is added to each Transformer layer of BERT;
During training the BERT parameters are frozen and do not participate in training; only the parameters in the Adapter modules are trained. This reduces the number of trainable parameters, lets the annotator representation take part in training the Adapter-BERT model while preserving the original knowledge, and learns new knowledge to improve crowdsourced named entity recognition accuracy;
the text representation layer is used for receiving a sentence x= { X i -converting it into a text representation with label information (tensor representation) and a text representation with expert information (tensor representation) using an Adapter-BERT model, respectively;
(3) Text representation distance layer: the text representation with annotator information and the text representation with expert information output by the text representation layer are taken as the multi-source domain and the target domain, respectively;
The text representation distance layer computes the distance between the multi-source-domain and target-domain text representations, and this distance is used as part of the training loss;
(4) Reconstruction layer: reclassifies the text representation with annotator information from the text representation distance layer back to its annotator, and computes the cross-entropy loss between the desired annotator a_i and the predicted annotator as part of the training loss. This prevents the annotator features of the text representation from being weakened while the text-representation distance is being optimized;
(5) Bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) layer: takes as input the text representation with annotator information from the text representation distance layer, uses BiLSTM to extract context features from the text representation, and uses the state-feature and transition-feature functions of the CRF layer to generate sequence tags, i.e., predicted labels, from the BiLSTM output. The cross-entropy loss between the crowdsourced labels in the dataset and the predicted labels is computed and used as part of the training loss.
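Decoding with the CRF's state-feature (emission) and transition-feature scores can be sketched with standard Viterbi decoding. This is an illustrative NumPy sketch of the decoding step only (tag names and scores are toy values, not from the patent):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions: (T, K) state-feature scores per token and tag;
    # transitions: (K, K) scores, transitions[i, j] for moving tag i -> tag j.
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)      # best previous tag for each tag
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(back[t][tags[-1]]))
    return tags[::-1]

# Tags: 0 = 'O', 1 = 'B-PER', 2 = 'I-PER'. Transition O -> I-PER is penalized.
trans = np.zeros((3, 3)); trans[0, 2] = -5.0
emis = np.array([[0.0, 2.0, 0.0],
                 [0.0, 0.0, 2.0],
                 [2.0, 0.0, 0.0]])
print(viterbi_decode(emis, trans))  # -> [1, 2, 0]
```

In the model, the emission scores would come from the concatenated forward and backward BiLSTM outputs.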
The specific process of model training and prediction is as follows:
In the training phase, for a piece of crowdsourced training data containing training text x and the ID a_i of its annotator, the annotator ID and all annotator IDs are first input to the annotator representation layer, which generates the vector representation r_{a_i} of the annotator ID corresponding to the text and the representation matrix W_annotator of all annotator IDs; from W_annotator, a fitting expert representation r_expert is generated through an attention mechanism. The vector representation of the annotator ID and the fitting expert representation are then fed into the parameter generation network, and the resulting parameters are injected into the Adapter module of each Transformer layer of the text representation layer's Adapter-BERT, yielding text representation layers with two different sets of parameters;
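The attention-based fitting expert representation and the parameter generation network can be sketched as follows. The attention query and the single-linear-map PGN are assumptions for illustration; the patent does not specify these internal forms, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n_params = 4, 6, 10   # annotators, representation dim, adapter params

W_annotator = rng.normal(size=(m, d))      # one row per annotator ID
query = rng.normal(size=d)                 # hypothetical attention query

# Fitting expert representation: attention-weighted mixture of all
# annotator representations (the attention form here is an assumption).
logits = W_annotator @ query
att = np.exp(logits - logits.max())
att /= att.sum()
r_expert = att @ W_annotator               # shape (d,)

# Parameter generation network, sketched as a single linear map from a
# representation to a flat vector of Adapter parameters.
W_pgn = rng.normal(0.0, 0.1, (n_params, d))
params_ann = W_pgn @ W_annotator[0]        # parameters for annotator a_1
params_exp = W_pgn @ r_expert              # parameters for the fitted expert
assert params_ann.shape == params_exp.shape == (n_params,)
```

Through training, the attention weights can shift toward more reliable annotators, so r_expert approximates an expert's representation.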
After the two text representation layers are obtained, the training text is input into both simultaneously, yielding a text representation with annotator information and a text representation with expert information: the annotator text representation R_ann and the fitting expert text representation R_exp. These two representations are input to the text representation distance layer, where the L2 distance (two-norm) is computed to obtain the distance loss L_dist = ||R_ann - R_exp||_2. The annotator text representation is also input to the reconstruction layer: a convolution layer (CNN) with kernel size 2 extracts the convolution feature h_ann, a linear layer then produces the final classification feature O_ann, and a softmax over O_ann gives the probability of each annotator label; the probability of the desired annotator ID a_i can be written p(a_i | x). The final reconstruction loss is the cross-entropy, L_rec = -log p(a_i | x). Finally, the annotator text representation is also input to the BiLSTM and CRF layers: the BiLSTM computes the sequence results in the forward and backward directions, the two results are concatenated, and the concatenation is input to the conditional random field to obtain the final predicted label sequence. The prediction is kept, compared with the crowdsourced annotation provided by the annotator, and the cross-entropy is computed to obtain the prediction loss L_pred. Here A = {a_1, a_2, ..., a_m} is the set of all annotators and m is their number;
Finally, the three losses are multiplied by their weight coefficients and summed to obtain the total loss for backpropagation: L_total = α · L_dist + β · L_rec + γ · L_pred.
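A minimal NumPy sketch of the weighted three-part training loss described above (array names, toy probabilities, and dimensions are illustrative, not from the patent):

```python
import numpy as np

def total_loss(R_ann, R_exp, p_annotator, a_idx, p_tags, y_tags,
               alpha=1.0, beta=1.0, gamma=1.0):
    # L_dist: L2 distance between annotator and fitted-expert representations.
    l_dist = np.linalg.norm(R_ann - R_exp)
    # L_rec: cross-entropy of the reconstruction layer's annotator classification.
    l_rec = -np.log(p_annotator[a_idx])
    # L_pred: token-level cross-entropy between the predicted tag
    # distribution and the crowdsourced tags.
    l_pred = -np.mean(np.log(p_tags[np.arange(len(y_tags)), y_tags]))
    return alpha * l_dist + beta * l_rec + gamma * l_pred

# Toy example with 2 tokens, 3 tags, and 2 annotators.
R_ann, R_exp = np.array([1.0, 0.0]), np.array([0.0, 0.0])
p_annotator = np.array([0.7, 0.3])          # softmax over annotators
p_tags = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1]])
loss = total_loss(R_ann, R_exp, p_annotator, 0, p_tags, [0, 1])
assert loss > 0.0
```

With α = β = γ = 1 the three losses contribute equally, matching the 1:1:1 setting mentioned for the hyperparameters.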
In the prediction stage, only the IDs of all annotators are needed: they pass through the annotator representation layer and the parameter generation network to produce the corresponding parameters, generating the fitting expert text representation layer. The text to be predicted is input directly into this layer to obtain the fitting expert text vector representation, which is then fed into the BiLSTM and CRF layers to obtain the predicted labels.
The model also includes an instance selector, which removes low-quality annotations from the training data used to train the crowdsourced named entity recognition master model.
The instance selector uses a Markov decision process model, fusing the previous state information into the current state, and is trained with a policy gradient algorithm;
the specific flow is as follows:
First, the trained multi-source domain adaptive crowdsourced named entity recognition model M_NER predicts each sample d_j in the dataset, producing a predicted label sequence Y_p; a similarity score score_j between Y_p and the crowdsourced label sequence Y_a is computed and saved into the set Φ;
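The similarity score between the predicted and crowdsourced tag sequences can be sketched as below. The exact metric is not specified in the patent; this illustrative choice counts agreement only at positions where at least one side carries a non-'O' tag, following the note that 'O' denotes the non-entity tag.

```python
def annotation_similarity(pred_tags, crowd_tags):
    # Agreement rate over positions where at least one side has a
    # non-'O' (entity) tag; the exact metric is an assumption.
    pairs = [(p, c) for p, c in zip(pred_tags, crowd_tags)
             if p != 'O' or c != 'O']
    if not pairs:
        return 1.0
    return sum(p == c for p, c in pairs) / len(pairs)

score = annotation_similarity(['B-PER', 'I-PER', 'O', 'O'],
                              ['B-PER', 'O', 'O', 'B-LOC'])
print(score)  # -> 0.3333333333333333
```

A low score suggests that the crowdsourced annotation disagrees with the trained model, which the selector can use as evidence of a low-quality sample.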
Then, using the trained multi-source domain adaptive crowdsourced named entity recognition model, for each sample d_j the annotator text representation R_j is computed and concatenated with R*, the average state representation of the sample set removed in the previous round, to obtain the enhanced text representation; from this, the reinforcement-learning state representation S_j is obtained. S_j is then mapped by the gradient policy network into the action score a_j = π(S_j; θ_selector), which is stored in a set. ('O' denotes the non-entity tag; "not equal to 'O'" means a tag other than 'O'.)
Next, the bottom p percent of samples with the smallest action scores are selected from the set to form the removal set Ψ_i; this portion of the data is removed from the training set, M_NER is retrained on the reduced training set, and it is evaluated on the validation set to obtain an F1 value. If this is not the first training round, the difference between this round's F1 value and the previous round's is taken as the reinforcement-learning reward r_i = F1_i - F1_{i-1}. The data removed this round but not last round, Ω_i = Ψ_i - (Ψ_i ∩ Ψ_{i-1}), and the data removed last round but not this round, Ω_{i-1} = Ψ_{i-1} - (Ψ_i ∩ Ψ_{i-1}), are obtained; then, according to the sign of the reward r_i, Ω_i and Ω_{i-1} are rewarded or penalized, and the parameters θ_selector of the policy network are updated:
θ_selector ← θ_selector + μ · r · ∇_θ log π(a | S; θ_selector)
In the formula above, μ is the learning-rate hyperparameter, which controls the gradient step size during learning; a is the action score a_j and S the state representation S_j (the subscript j is dropped because the formula is applied to all data samples, i.e., for each j); θ has the same meaning as θ_selector above; and ∇_θ denotes the derivative with respect to θ.
Finally, when the model reaches the maximum number of training rounds, or the patience setting stops training early, the data are filtered using the policy gradient model from the round with the largest F1 value to obtain the filtered training set, and the multi-source domain adaptive crowdsourcing model M_NER is trained again on this training set to achieve the best training effect.
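The per-round selection and set bookkeeping described above can be sketched as follows (a plain-Python illustration; sample IDs and scores are toy values):

```python
def select_removal(action_scores, p):
    # Bottom p percent of samples by action score form the removal set Ψ_i.
    k = max(1, int(len(action_scores) * p / 100))
    ranked = sorted(action_scores, key=action_scores.get)  # ascending score
    return set(ranked[:k])

def round_deltas(psi_i, psi_prev):
    # Ω_i: removed this round but not last; Ω_{i-1}: removed last but not this.
    omega_i = psi_i - (psi_i & psi_prev)
    omega_prev = psi_prev - (psi_i & psi_prev)
    return omega_i, omega_prev

scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.2}   # action scores per sample ID
psi_i = select_removal(scores, 50)          # the two lowest-scoring samples
omega_i, omega_prev = round_deltas(psi_i, {1, 2})
assert psi_i == {1, 3} and omega_i == {3} and omega_prev == {2}
```

Only the samples whose removal status changed between rounds (Ω_i and Ω_{i-1}) are rewarded or penalized, which keeps the credit assignment focused on the selector's latest decisions.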
In the reconstruction layer, the reclassification of the text representation with annotator information back to its annotator is implemented by a CNN.
In the total-loss formula, α, β and γ are hyperparameters that balance the three losses; they may be set to 1:1:1.
A crowdsourced named entity recognition system based on multi-source domain adaptation and reinforcement learning: the system is provided with program modules corresponding to the steps of the above technical scheme, and at runtime executes the steps of the above crowdsourced named entity recognition model based on multi-source domain adaptation and reinforcement learning.
A computer readable storage medium storing a computer program configured to implement the steps of the multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model described above when invoked by a processor.
The invention has the following beneficial technical effects:
the method and the system generate the synthesized expert representation by considering the reliability of the annotators, learn the reliability distribution of different annotators through training, and obtain the synthesized expert representation. The invention adopts the example selector based on reinforcement learning to discard low-quality labels, thereby improving the performance of a named entity recognition model on a crowdsourcing data set, deepening the understanding of the reliability of labels in a crowdsourcing named entity recognition field adaptation method, creatively providing the data preprocessing example selector based on reinforcement learning, and completely being applicable to the crowdsourcing named entity recognition model based on multi-source field adaptation, and being capable of solving two problems mentioned in the background: the existing method for solving the crowdsourcing problem by using the domain adaptation model often does not fully consider the problems that the reliability of the annotators causes negative influence on model training of the data of the low-quality annotators, the existing crowdsourcing naming entity identification method has difficulty in processing the extremely low-quality data submitted by the low-quality annotators, and the like, and shows the effectiveness of the method in solving the naming entity identification challenges in the crowdsourcing annotation.
The invention provides a multi-source domain adaptation and reinforcement learning-based crowdsourcing named entity identification method which is used for efficiently extracting named entity information from unsupervised crowdsourcing data.
Drawings
The legends in the drawings are respectively: FIG. 1 is a block diagram of a crowdsourcing named entity recognition model based on multi-source domain adaptation; FIG. 2 is a flow chart of an algorithm for a multi-source domain adaptation based crowdsourcing named entity recognition model.
Detailed Description
The implementation of the invention is described below with reference to the accompanying drawings. The invention treats the labels of multiple crowd annotators as multiple source domains and expert labels as the target domain, then applies the idea of multi-source domain adaptation to solve the crowdsourced named entity recognition problem. The proposed multi-source domain adaptive crowdsourced named entity recognition model comprises the following five components.
(1) Annotator representation layer (annotator embedding layer): generates the representations of the annotators and the expert, which are used to create parameters through a parameter generation network (PGN). These parameters are integrated into the Adapters of the text representation layer, giving it annotator-awareness.
(2) Text representation layer: receives a sentence and converts it into a tensor representation using the Adapter-BERT model. Each Transformer layer of BERT contains two supplementary modules called adapters. In the fine-tuning phase, only the adapter parameters are modified, so that the latent spaces of the multiple source domains are aligned.
(3) Text representation distance layer: computes the distance between the source-domain and target-domain text representations and uses it as part of the training loss. Through training, the reliability distribution of different annotators can be learned, yielding a synthesized expert representation.
(4) Reconstruction layer: classifying the textual representation as a annotator prevents weakening of annotator features of the textual representation during optimization of distance.
(5) Bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) layer: BiLSTM extracts context features from the text representation, and the CRF layer generates sequence tags from the BiLSTM output using state-feature and transition-feature functions.
The specific flow is as follows:
In the training phase, for a piece of crowdsourced training data containing training text x and the ID a_i of its annotator, the annotator ID and all annotator IDs are first input to the annotator representation layer, which generates the vector representation r_{a_i} of the annotator ID corresponding to the text and the representation matrix W_annotator of all annotator IDs; from W_annotator, a fitting expert representation r_expert is generated through an attention mechanism. The vector representation of the annotator ID and the fitting expert representation are then fed into the parameter generation network, and the resulting parameters are injected into the Adapter module of each Transformer layer of the text representation layer's Adapter-BERT, so that text representation layers with two different sets of parameters are obtained.
After the two text representation layers are obtained, the training text is input into both simultaneously to obtain the annotator text representation R_ann and the fitting expert text representation R_exp. These are input to the text representation distance layer, and the L2 distance is computed to obtain the distance loss L_dist. The annotator text representation is also input to the reconstruction layer and finally classified into an annotator ID by a convolution layer and a linear layer; the cross-entropy between the classified annotator ID and the original annotator ID gives the reconstruction loss L_rec. Finally, the annotator text representation is also input to the BiLSTM and CRF layers: the BiLSTM computes the sequence results in the forward and backward directions, the two results are concatenated, and the concatenation is input to the conditional random field to obtain the final predicted label sequence. The prediction is kept, compared with the crowdsourced annotation provided by the annotator, and the cross-entropy is computed to obtain the prediction loss L_pred.
Finally, the three losses are multiplied by their weight coefficients and summed to obtain the total loss for backpropagation: L_total = α · L_dist + β · L_rec + γ · L_pred.
In the prediction stage, only the IDs of all annotators are needed: they pass through the annotator representation layer and the parameter generation network to produce the corresponding parameters, generating the fitting expert text representation layer. The text to be predicted is input directly into this layer to obtain the fitting expert text vector representation, which is then fed into the BiLSTM and CRF layers to obtain the predicted labels.
The invention also provides a reinforcement-learning-based instance selector for removing lower-quality annotation data. The instance selector uses a Markov decision process model, fusing previous state information into the current state, and is trained with a policy gradient algorithm.
The specific flow is as follows:
First, the trained multi-source domain adaptive crowdsourced named entity recognition model M_NER predicts each sample d_j in the dataset, producing a predicted label sequence Y_p; a similarity score score_j between Y_p and the crowdsourced label sequence Y_a is computed and saved into the set Φ.
Then, using the trained multi-source domain adaptive crowdsourced named entity recognition model, for each sample d_j the annotator text representation R_j is computed and concatenated with R*, the average state representation of the sample set removed in the previous round, to obtain the enhanced text representation; from this, the reinforcement-learning state representation S_j is obtained. S_j is then mapped by the gradient policy network into the action score a_j = π(S_j; θ_selector), which is stored in a set.
Next, the p percent of samples with the smallest action scores in that set are selected to form the screening removal set Ψ_i; this portion of the data is removed from the training set, M_NER is retrained on the reduced training set, and it is evaluated on the validation set to obtain an F1 value. If this is not the first training round, the difference between this round's F1 and the previous round's F1 is taken as the reinforcement learning reward r_i = F1_i - F1_{i-1}; the data screened in this round but not in the previous round, Ω_i = Ψ_i - (Ψ_i ∩ Ψ_{i-1}), and the data screened in the previous round but not in this round, Ω_{i-1} = Ψ_{i-1} - (Ψ_i ∩ Ψ_{i-1}), are computed. According to the sign of the reward r_i, Ω_i and Ω_{i-1} are rewarded or punished, and the parameters θ_selector of the policy network are updated.
Finally, when the model reaches the maximum number of training rounds or the early-stop patience is exhausted, training stops. The policy gradient model from the round with the highest F1 value is used to screen the data, yielding a screened training set, and the multi-source domain adaptation crowdsourcing model M_NER is trained once more on this training set to achieve the best training effect.
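The reward-and-update step of the flow above can be sketched as a plain REINFORCE update, theta <- theta + mu * r * grad log pi; the one-dimensional gradients and F1 values below are illustrative, not taken from the patent:

```python
def reinforce_update(theta, grads, reward, mu=0.01):
    # One policy gradient step: move each parameter along its log-probability
    # gradient, scaled by the learning rate mu and the (signed) reward.
    return [t + mu * reward * g for t, g in zip(theta, grads)]

f1_prev, f1_curr = 0.80, 0.83
reward = f1_curr - f1_prev      # positive: this round's removals helped
theta = [0.5, -0.2]             # toy policy parameters theta_selector
grads = [1.0, 2.0]              # d/d theta of log pi for the actions taken
theta = reinforce_update(theta, grads, reward)
```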
The following table gives the algorithm of the reinforcement learning based instance selector:
The technical effect and feasibility of the crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning are verified as follows:
(1) Datasets: To evaluate our RL-RMDA method, we use two published benchmark crowdsourcing datasets: FMKKM11 and CoNLL03. FMKKM11 is a Twitter-based social network crowdsourced named entity recognition dataset first constructed by Finin et al., comprising 9821 texts and 269 crowd annotators. Fromreide et al. later provided additional expert labels for it; for our experiments, we split these expert-labeled texts uniformly at random into development and test sets. CoNLL03, on the other hand, is a news named entity recognition dataset proposed by Sang et al.; Rodrigues et al. later provided a crowdsourced version containing 5985 texts and 47 crowd annotators. Here we use the training, development and test sets provided by Rodrigues et al.
(2) Evaluation metrics: The standard CoNLL-2003 evaluation metrics, including entity-level precision (P), recall (R), and macro-F1 value (F1), are used to evaluate the performance of our crowdsourcing named entity recognition model. An entity scores only when it matches exactly.
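Exact-match entity-level scoring over BIO tags can be sketched as follows; this is a simplified stand-in for the official CoNLL evaluation script (stray I- tags without a preceding B- are simply ignored here, which the official script treats differently):

```python
def extract_entities(tags):
    # Collect (type, start, end) spans from a BIO tag sequence; an "O"
    # sentinel is appended so the final span gets flushed.
    spans, start, etype = [], None, None
    for i, t in enumerate(tags + ["O"]):
        starts_new = t.startswith("B-")
        breaks_old = t == "O" or (t.startswith("I-") and etype != t[2:])
        if starts_new or breaks_old:
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, t[2:]) if starts_new else (None, None)
    return spans

def entity_f1(gold, pred):
    # An entity counts as correct only on an exact span-and-type match.
    g, p = set(extract_entities(gold)), set(extract_entities(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
prec, rec, f1 = entity_f1(gold, pred)  # prec = 1.0, rec = 0.5
```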
(3) Experimental setup: We used an RTX 3090 graphics card with 24 GB of video memory for model training and performed all experiments on a single-GPU server equipped with a 12-core CPU and 128 GB of system memory.
For the crowdsourcing named entity recognition model based on the multi-source domain adaptation technique, BERT-base is selected as the pre-trained model. Using the transformers library, we load the BERT weights and inject the corresponding parameters into the Adapters. The intermediate hidden layer size of the Adapter is set to 128, and an Adapter is added to every layer of BERT. The annotator embedding dimension is set to 8, and the hidden state size of the BiLSTM to 400. For model optimization, we use the AdamW optimizer with a learning rate of 1×10⁻³; the parameters of the Transformer part are trained with a learning rate of 1×10⁻⁵ and a weight decay of 0.01. The maximum number of training rounds is 25, and the patience of the early-stop strategy is set to 5 rounds. Based on performance on the validation set, we select the best-performing training round. The batch size is 64, and on the word-level representation we apply a time-step dropout strategy with a dropout rate of 0.1 to reduce the risk of overfitting.
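The training schedule just described, at most 25 rounds with an early-stop patience of 5 and the best round chosen on the validation set, can be sketched as follows (per-round validation F1 values stand in for actual training):

```python
def best_round_with_early_stop(val_f1_by_round, patience=5, max_rounds=25):
    # Track the best validation F1 seen so far; stop after `patience`
    # consecutive rounds without improvement, and return the best round.
    best_f1, best_round, waited = float("-inf"), -1, 0
    for rnd, f1 in enumerate(val_f1_by_round[:max_rounds]):
        if f1 > best_f1:
            best_f1, best_round, waited = f1, rnd, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_round, best_f1

scores = [0.70, 0.74, 0.73, 0.75, 0.74, 0.74, 0.74, 0.74, 0.74]
best_round, best_f1 = best_round_with_early_stop(scores)
# training stops after 5 non-improving rounds; round 3 (0.75) is kept
```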
For the reinforcement learning based instance selection part, we use the Adam optimizer with a learning rate of 1×10⁻² for parameter updates. The maximum number of iteration rounds is 10, and the early-stop patience is again 5 rounds. We run experiments with 5 preset random seeds [22, 6622, 6699, 333555, 999111] and report the average performance of the model.
(4) Baseline models: To evaluate the performance of our proposed RL-RMDA, we compared it to the following baseline models:
ABBC (Adapter-BERT-BiLSTM-CRF): a traditional NER method that ignores crowdsourcing and treats all crowdsourced annotations as expert annotations.
HMM-Crowd [1]: an improved HMM model that accounts for annotation quality and represents each annotator as a vector.
LC and LC-cat [1]: these methods map annotator IDs to vectors but integrate them differently: LC uses an LSTM layer, while LC-cat uses a CRF layer.
DL-CL [2]: an innovative expectation-maximization approach incorporating crowdsourcing that allows training directly on noisy crowdsourced data.
BSC-seq [3]: a Bayesian approach that aggregates labels by modeling the dependencies between sequence tags to reduce errors.
CLasDA [4]: this method integrates multi-source domain adaptation with crowdsourcing named entity recognition and uses PGN-Adapter-BERT for text encoding. All annotators are treated as identical. In preprocessing, it deletes annotations in which all words share the same tag.
Crowd-OEI [5]: this method extends CLasDA with a mixup data augmentation strategy to improve Chinese opinion expression recognition under crowdsourced annotation.
Neural-Hidden-CRF [6]: a graphical model tailored to weakly supervised sequence labeling, using BERT for rich contextual understanding and a hidden CRF layer to capture internal label dependencies.
The provided figures show the effect of the algorithm of the invention on crowdsourcing named entity recognition datasets: it performs well not merely on a single dataset but across multiple crowdsourcing named entity recognition datasets in different fields.
Table 1 Comparison of performance with the reference models
Table 1 presents the experimental results on the two datasets. The results of the best reference model are marked in bold, the best overall results are highlighted, and a superscript indicates that the test corpus used differs from that of the invention. The advantages of the invention are apparent from the table.
Compared with existing crowdsourcing named entity recognition models, the method provided by the invention is superior in recognition accuracy.
The technical effect claimed by the invention has been verified through simulation experiments and practical application.
The model (algorithm or method) provided by the invention is the underlying technical kernel of the invention, and various products can be derived from this algorithm.
Using a programming language, the model provided by the invention is developed into a crowdsourcing named entity recognition system based on multi-source domain adaptation and reinforcement learning; the system has program modules corresponding to the steps of the model in the technical scheme, and when run it executes the steps of the crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning.
The computer program of the developed system (software) is stored on a computer-readable storage medium and is configured to implement the steps of the above crowdsourcing named entity recognition model when invoked by a processor; that is, the invention is also embodied as a computer program product on a carrier.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired result of the technical solution disclosed in the present application is achieved, which is within the scope of the present invention.
The references cited in the present invention are as follows:
[1] Nguyen A T, Wallace B C, Li J J, et al. Aggregating and predicting sequence labels from crowd annotations[C]//Proceedings of the Conference. Association for Computational Linguistics. Meeting. NIH Public Access, 2017: 299.
[2] Rodrigues F, Pereira F. Deep learning from crowds[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
[3] Simpson E, Gurevych I. A Bayesian Approach for Sequence Tagging with Crowds[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 1093-1104.
[4] Zhang X, Xu G, Sun Y, et al. Crowdsourcing Learning as Domain Adaptation: A Case Study on Named Entity Recognition[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021: 5558-5570.
[5] Zhang X, Xu G, Sun Y, et al. Identifying Chinese Opinion Expressions with Extremely-Noisy Crowdsourcing Annotations[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022: 2801-2813.
[6] Chen Z, Sun H, Zhang W, et al. Neural-Hidden-CRF: A Robust Weakly-Supervised Sequence Labeler[C]//Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023: 274-285.

Claims (8)

1. A crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning, characterized in that: the model comprises a crowdsourcing named entity recognition main model, which regards the labels of a plurality of crowd annotators as multiple source domains and the expert as the target domain, the crowdsourcing named entity recognition main model comprising:
(1) Annotator representation layer: used to generate representations of the annotators a_i and an expert representation, and to create parameters from the annotator and expert representations through a Parameter Generation Network (PGN); the parameters created from the annotator representations and from the expert representation are injected into the Adapter modules of the text representation layer respectively, so that the text representation layer is annotator-aware;
(2) Text representation layer: the text representation layer is an improved BERT model, called the Adapter-BERT model; the improvement is that an Adapter module is added to each Transformer layer of the BERT model;
during training, the parameters of the BERT model are frozen and do not participate in training; only the parameters in the Adapter modules are trained, which reduces the number of trainable parameters of the BERT model. The annotator representations thereby participate in the training of the Adapter-BERT model while the original knowledge is preserved, and new knowledge is learned to improve the accuracy of crowdsourcing named entity recognition;
the text representation layer receives a sentence X = {x_i} and uses the Adapter-BERT model to convert it into a text representation (tensor) with annotator information and a text representation (tensor) with expert information respectively;
(3) Text representation distance layer: the text representation with annotator information and the text representation with expert information output by the text representation layer are taken as the multiple source domains and the target domain respectively;
the text representation distance layer computes the distance between the source-domain and target-domain text representations, and this distance serves as part of the training loss;
(4) Reconstruction layer: the text representation with annotator information from the text representation distance layer is reclassified back to the expected annotator a_i, and the cross entropy loss between a_i and the reclassified annotator ID is computed as part of the training loss; this prevents the annotator features of the text representation from being weakened while the text representation distance is being optimized;
(5) Bidirectional long short-term memory network (BiLSTM) and conditional random field (CRF) layer: the text representation with annotator information from the text representation distance layer is input; BiLSTM extracts context features from it, and the state feature functions and transition feature functions of the CRF layer generate the sequence tags, i.e. the predicted labels, from the BiLSTM output; the cross entropy loss between the crowdsourced labels in the dataset and the predicted labels is computed as part of the training loss.
2. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 1, wherein: the specific process of model training and prediction is as follows:
in the training phase, for a piece of crowdsourced training data containing a training text x and the ID a_i of its annotator, the annotator ID and all annotator IDs are first input into the annotator representation layer, which generates the vector representation of the annotator ID corresponding to the text and the representation matrix W_annotator of all annotator IDs; the representation matrix of all annotator IDs generates a fitted expert representation through an attention mechanism. The vector representation of the annotator ID and the fitted expert representation are then input into the parameter generation network to obtain parameters, which are injected into the Adapter module of each Transformer layer of the Adapter-BERT text representation layer, yielding text representation layers with two different sets of parameters;
after the two text representation layers are obtained, the training text is input into both simultaneously to obtain a text representation with annotator information (the annotator text representation) and a text representation with expert information (the fitted expert text representation). The two representations are input into the text representation distance layer, and the L2 distance (two-norm distance) between them is computed as the distance loss. The annotator text representation is also input into the reconstruction layer, where a convolution layer (CNN) with kernel size 2 extracts convolution features h_ann and a linear layer produces the final classification features O_ann; applying softmax to these features gives the probability of each annotator ID, including the expected ID a_i, and the reconstruction loss is computed from the cross entropy loss. Finally, the annotator text representation is also input into the BiLSTM and CRF layers: the BiLSTM computes results over the sequence in the forward and backward directions, the two results are concatenated and fed into the conditional random field to produce the finally predicted label sequence; the prediction is retained, compared with the crowdsourced annotation provided by the annotator, and the cross entropy is computed as the prediction loss. Here A is the set of all annotators {a_1, a_2, ..., a_m}, where m is the number of annotators;
finally, the three losses are each multiplied by a weight coefficient and summed to obtain the total loss, which is then back-propagated;
in the prediction stage, only the IDs of all annotators are needed: they generate the corresponding parameters through the annotator representation layer and the parameter generation network, producing a fitted expert text representation layer; the text to be predicted is input directly into this layer to obtain the fitted expert text vector representation, which is then passed through the BiLSTM and CRF layers to obtain the predicted labels.
3. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 1 or 2, wherein: the model further comprises an instance selector, which removes low-quality annotation data from the training data used to train the crowdsourcing named entity recognition main model.
4. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 3, wherein: the instance selector uses a Markov decision process model, fuses the previous state information into the current state, and is trained with a policy gradient algorithm;
the specific flow is as follows:
first, the trained multi-source domain adaptation crowdsourcing named entity recognition model M_NER predicts, for each sample d_j in the dataset, a label sequence Y_p; a similarity score_j between Y_p and the crowdsourced labels Y_a is computed and saved into the set Φ;
then, for each sample d_j in the dataset, the trained multi-source domain adaptation crowdsourcing named entity recognition model computes its annotator text representation R_j, which is concatenated with the average state representation R* of the sample sets removed in previous rounds to obtain an enhanced text representation; from this and score_j the reinforcement learning state representation S_j is obtained; the policy gradient network then maps S_j to an action score a_j = π(S_j; θ_selector), which is stored in a set;
next, the p percent of samples with the smallest action scores in that set are selected to form the screening removal set Ψ_i; this portion of the data is removed from the training set, M_NER is retrained on the reduced training set, and it is evaluated on the validation set to obtain an F1 value; if this is not the first training round, the difference between this round's F1 and the previous round's F1 is taken as the reinforcement learning reward r_i = F1_i - F1_{i-1}; the data screened in this round but not in the previous round, Ω_i = Ψ_i - (Ψ_i ∩ Ψ_{i-1}), and the data screened in the previous round but not in this round, Ω_{i-1} = Ψ_{i-1} - (Ψ_i ∩ Ψ_{i-1}), are computed; according to the sign of the reward r_i, Ω_i and Ω_{i-1} are rewarded or punished, and the parameters θ_selector of the policy network are updated;
in the above formula, μ is the learning rate hyperparameter, which controls the gradient step size during learning; a is the action score a_j; S refers to the state representation S_j; the subscript j is omitted here because j indexes a single data sample and the formula applies to every j; θ has the same meaning as above;
finally, when the model reaches the maximum number of training rounds or the early-stop patience is exhausted, training stops; the policy gradient model from the round with the highest F1 value is used to screen the data, yielding a screened training set, and the multi-source domain adaptation crowdsourcing model M_NER is trained once more on this training set to achieve the best training effect.
5. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 1, wherein: in the reconstruction layer, reclassifying the text representation with annotator information from the text representation distance layer back to the annotator is achieved by a CNN.
6. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 2, wherein:alpha, beta and gamma in the formula (I) are super parameters for balancing three losses, and the value can be 1:1:1.
7. A crowdsourcing named entity recognition system based on multi-source domain adaptation and reinforcement learning, characterized in that: the system has program modules corresponding to the steps of the model of any one of claims 1-6, and when run it performs the steps of the above crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning.
8. A computer-readable storage medium, characterized by: the computer readable storage medium stores a computer program configured to implement the steps of the multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of any one of claims 1-6 when invoked by a processor.
CN202311442418.3A 2023-11-01 2023-11-01 Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning Pending CN117436449A (en)

Publications (1)

Publication Number Publication Date
CN117436449A true CN117436449A (en) 2024-01-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination