CN117436449A - Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning - Google Patents


Info

Publication number
CN117436449A
CN117436449A (application CN202311442418.3A)
Authority
CN
China
Prior art keywords
representation
text
layer
training
crowdsourcing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311442418.3A
Other languages
Chinese (zh)
Inventor
田泽庶
张宏莉
王星
叶麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202311442418.3A priority Critical patent/CN117436449A/en
Publication of CN117436449A publication Critical patent/CN117436449A/en
Pending legal-status Critical Current

Classifications

    • G06F 40/295 Named entity recognition (G PHYSICS; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F 40/00 Handling natural language data; G06F 40/20 Natural language analysis; G06F 40/279 Recognition of textual entities; G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F 16/35 Clustering; Classification (G06F 16/00 Information retrieval; database and file system structures; G06F 16/30 Information retrieval of unstructured textual data)
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] (G06N Computing arrangements based on specific computational models; G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology; G06N 3/044 Recurrent networks, e.g. Hopfield networks)
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks (G06N 3/045 Combinations of networks)
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)
    • G06N 3/092 Reinforcement learning


Abstract

A crowdsourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning, belonging to the technical field of crowdsourced named entity recognition. The invention aims to solve two problems: existing domain-adaptation approaches to crowdsourcing often fail to account for annotator reliability, so data from low-quality annotators negatively affects model training; and existing crowdsourced named entity recognition methods struggle to handle the extremely low-quality data submitted by low-quality annotators. The invention deepens the treatment of annotator reliability in domain-adaptation methods for crowdsourced named entity recognition and proposes a reinforcement-learning-based instance selector for data preprocessing. By taking annotator reliability into account and using the instance selector to discard low-quality annotations, it improves the performance of the named entity recognition model on crowdsourced datasets and demonstrates its effectiveness on the named entity recognition challenges posed by crowdsourced annotation. The invention is used to efficiently extract named entity information from unsupervised crowdsourced data.

Description

Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning
Technical Field
The invention relates to a crowd-sourced named entity recognition method and system based on multi-source domain adaptation and reinforcement learning, and belongs to the technical field of crowd-sourced named entity recognition.
Background
Named entity recognition is a critical task in natural language processing: identifying named entities such as person names, place names, and organization names in text. It underpins many natural language processing tasks; in information extraction, question answering, text classification, and similar tasks, accurate recognition of named entities is a prerequisite for building high-quality models and systems. However, traditional named entity recognition suffers from a lack of large-scale, high-quality labeled data, and hiring experts to label data is time-consuming and laborious, whereas crowdsourcing, by distributing tasks to many annotators, can exploit the scale and diversity of the crowd. Compared with traditional expert annotation, crowdsourcing collects large amounts of labeled data more quickly, and annotation from different perspectives and backgrounds improves the comprehensiveness and diversity of the labels. Traditional crowdsourced named entity recognition methods focus on label aggregation, i.e., using some strategy to aggregate the labels provided by multiple crowd annotators into high-confidence labels, thereby reducing the noise introduced by crowdsourced annotation. The most recent crowdsourced named entity recognition methods borrow the idea of domain adaptation: the labels of different annotators are treated as different source domains and expert labels as the target domain, and effective domain-adaptation methods are applied to achieve better recognition results.
Although domain-adaptation-based crowdsourced named entity recognition methods have many advantages, some major drawbacks limit their effectiveness in practical applications:
1. Existing methods that solve the crowdsourcing problem with domain-adaptation models often do not adequately consider annotator reliability. In crowdsourcing tasks, annotators may differ widely in expertise and annotation quality. Conventional domain-adaptation models generally assume that annotators are reliable, treating all annotators' data as equally important during training. In practice this assumption often does not hold, so data from low-quality annotators can negatively affect model training.
2. Existing crowdsourced named entity recognition methods have difficulty handling the extremely low-quality data submitted by low-quality annotators. Because crowdsourcing tasks are open and anonymous, a proportion of annotators are of low quality: they may lack expertise, deliberately submit false annotations, or have other problems, producing extremely low-quality annotation data. Conventional crowdsourcing methods have no effective mechanism for handling such data, which can harm model performance.
The prior document CN115292296A discloses a method for improving the quality of crowdsourced annotation data based on federated learning: K crowdsourcing platforms are randomly selected from a set of platforms, a user's data is randomly divided into K parts and uploaded to the K platforms in one-to-one correspondence, a training dataset is constructed on each platform from the data uploaded to it, and a classifier is trained on each platform. After each round of training, the platforms exchange their classifiers' network parameters, the platforms with higher annotation quality are identified, and the parameters transmitted by those platforms are aggregated; the aggregated parameters serve as the final network parameters for that round. This prior art can reduce label noise, improve the quality of crowdsourced annotation data, and offers stronger privacy protection, but it does not address either of the two problems above.
Disclosure of Invention
The invention aims to solve the technical problems that:
the invention provides a multi-source domain adaptation and reinforcement learning-based crowd-sourced named entity identification method and system, which are used for solving two problems mentioned in the background: the existing method for solving the crowdsourcing problem by using the domain adaptation model often does not fully consider the problems that the reliability of the annotators causes negative influence on model training of the data of the low-quality annotators, and the existing crowdsourcing named entity identification method has difficulty in processing the extremely low-quality data submitted by the low-quality annotators.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A crowdsourced named entity recognition model based on multi-source domain adaptation and reinforcement learning. The model comprises a crowdsourced named entity recognition master model that treats the labels of multiple crowd annotators as multiple source domains and expert labels as the target domain, and includes the following components:
(1) Annotator representation layer: generates a representation of each annotator a_i and an expert representation, and uses them to create parameters through a parameter generation network (PGN). The parameters created from the annotator representation and those created from the expert representation are injected into the Adapter modules of the text representation layer, giving the text representation layer annotator-awareness;
(2) Text representation layer: an improved BERT model, called the Adapter-BERT model, in which an Adapter module is added to each Transformer layer of BERT;
During training the BERT parameters are frozen and do not participate in training; only the parameters in the Adapter modules are trained. This reduces the number of trainable parameters, lets the annotator representation take part in training the Adapter-BERT model while preserving the original knowledge, and learns new knowledge to improve crowdsourced named entity recognition accuracy;
the text representation layer is used for receiving a sentence x= { X i -converting it into a text representation with label information (tensor representation) and a text representation with expert information (tensor representation) using an Adapter-BERT model, respectively;
(3) Text representation distance layer: the text representation with annotator information and the text representation with expert information output by the text representation layer are taken as the multi-source domain and the target domain, respectively;
The text representation distance layer computes the distance between the multi-source-domain and target-domain text representations, and this distance is used as part of the training loss;
(4) Reconstruction layer: reclassifies the text representation with annotator information from the text representation distance layer back to its annotator, and computes the cross-entropy loss between the desired annotator a_i and the predicted annotator as part of the training loss. This prevents the annotator features of the text representation from being weakened while the text-representation distance is being optimized;
(5) Bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) layer: takes as input the text representation with annotator information from the text representation distance layer, uses BiLSTM to extract context features from the text representation, and uses the state-feature and transition-feature functions of the CRF layer to generate sequence tags, i.e., predicted labels, from the BiLSTM output. The cross-entropy loss between the crowdsourced labels in the dataset and the predicted labels is computed and used as part of the training loss.
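Decoding with the CRF's state-feature (emission) and transition-feature scores can be sketched with standard Viterbi decoding. This is an illustrative NumPy sketch of the decoding step only (tag names and scores are toy values, not from the patent):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions: (T, K) state-feature scores per token and tag;
    # transitions: (K, K) scores, transitions[i, j] for moving tag i -> tag j.
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)      # best previous tag for each tag
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(back[t][tags[-1]]))
    return tags[::-1]

# Tags: 0 = 'O', 1 = 'B-PER', 2 = 'I-PER'. Transition O -> I-PER is penalized.
trans = np.zeros((3, 3)); trans[0, 2] = -5.0
emis = np.array([[0.0, 2.0, 0.0],
                 [0.0, 0.0, 2.0],
                 [2.0, 0.0, 0.0]])
print(viterbi_decode(emis, trans))  # -> [1, 2, 0]
```

In the model, the emission scores would come from the concatenated forward and backward BiLSTM outputs.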
The specific process of model training and prediction is as follows:
In the training phase, for a piece of crowdsourced training data containing training text x and the ID a_i of its annotator, the annotator ID and all annotator IDs are first input to the annotator representation layer, which generates the vector representation r_{a_i} of the annotator ID corresponding to the text and the representation matrix W_annotator of all annotator IDs; from W_annotator, a fitting expert representation r_expert is generated through an attention mechanism. The vector representation of the annotator ID and the fitting expert representation are then fed into the parameter generation network, and the resulting parameters are injected into the Adapter module of each Transformer layer of the text representation layer's Adapter-BERT, yielding text representation layers with two different sets of parameters;
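The attention-based fitting expert representation and the parameter generation network can be sketched as follows. The attention query and the single-linear-map PGN are assumptions for illustration; the patent does not specify these internal forms, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n_params = 4, 6, 10   # annotators, representation dim, adapter params

W_annotator = rng.normal(size=(m, d))      # one row per annotator ID
query = rng.normal(size=d)                 # hypothetical attention query

# Fitting expert representation: attention-weighted mixture of all
# annotator representations (the attention form here is an assumption).
logits = W_annotator @ query
att = np.exp(logits - logits.max())
att /= att.sum()
r_expert = att @ W_annotator               # shape (d,)

# Parameter generation network, sketched as a single linear map from a
# representation to a flat vector of Adapter parameters.
W_pgn = rng.normal(0.0, 0.1, (n_params, d))
params_ann = W_pgn @ W_annotator[0]        # parameters for annotator a_1
params_exp = W_pgn @ r_expert              # parameters for the fitted expert
assert params_ann.shape == params_exp.shape == (n_params,)
```

Through training, the attention weights can shift toward more reliable annotators, so r_expert approximates an expert's representation.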
After the two text representation layers are obtained, the training text is input into both simultaneously, yielding a text representation with annotator information and a text representation with expert information: the annotator text representation R_ann and the fitting expert text representation R_exp. These two representations are input to the text representation distance layer, where the L2 distance (two-norm) is computed to obtain the distance loss L_dist = ||R_ann - R_exp||_2. The annotator text representation is also input to the reconstruction layer: a convolution layer (CNN) with kernel size 2 extracts the convolution feature h_ann, a linear layer then produces the final classification feature O_ann, and a softmax over O_ann gives the probability of each annotator label; the probability of the desired annotator ID a_i can be written p(a_i | x). The final reconstruction loss is the cross-entropy, L_rec = -log p(a_i | x). Finally, the annotator text representation is also input to the BiLSTM and CRF layers: the BiLSTM computes the sequence results in the forward and backward directions, the two results are concatenated, and the concatenation is input to the conditional random field to obtain the final predicted label sequence. The prediction is kept, compared with the crowdsourced annotation provided by the annotator, and the cross-entropy is computed to obtain the prediction loss L_pred. Here A = {a_1, a_2, ..., a_m} is the set of all annotators and m is their number;
Finally, the three losses are multiplied by their weight coefficients and summed to obtain the total loss for backpropagation: L_total = α · L_dist + β · L_rec + γ · L_pred.
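A minimal NumPy sketch of the weighted three-part training loss described above (array names, toy probabilities, and dimensions are illustrative, not from the patent):

```python
import numpy as np

def total_loss(R_ann, R_exp, p_annotator, a_idx, p_tags, y_tags,
               alpha=1.0, beta=1.0, gamma=1.0):
    # L_dist: L2 distance between annotator and fitted-expert representations.
    l_dist = np.linalg.norm(R_ann - R_exp)
    # L_rec: cross-entropy of the reconstruction layer's annotator classification.
    l_rec = -np.log(p_annotator[a_idx])
    # L_pred: token-level cross-entropy between the predicted tag
    # distribution and the crowdsourced tags.
    l_pred = -np.mean(np.log(p_tags[np.arange(len(y_tags)), y_tags]))
    return alpha * l_dist + beta * l_rec + gamma * l_pred

# Toy example with 2 tokens, 3 tags, and 2 annotators.
R_ann, R_exp = np.array([1.0, 0.0]), np.array([0.0, 0.0])
p_annotator = np.array([0.7, 0.3])          # softmax over annotators
p_tags = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1]])
loss = total_loss(R_ann, R_exp, p_annotator, 0, p_tags, [0, 1])
assert loss > 0.0
```

With α = β = γ = 1 the three losses contribute equally, matching the 1:1:1 setting mentioned for the hyperparameters.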
In the prediction stage, only the IDs of all annotators are needed: they pass through the annotator representation layer and the parameter generation network to produce the corresponding parameters, generating the fitting expert text representation layer. The text to be predicted is input directly into this layer to obtain the fitting expert text vector representation, which is then fed into the BiLSTM and CRF layers to obtain the predicted labels.
The model also includes an instance selector, which removes low-quality annotations from the training data used to train the crowdsourced named entity recognition master model.
The instance selector uses a Markov decision process model, fusing the previous state information into the current state, and is trained with a policy gradient algorithm;
the specific flow is as follows:
First, the trained multi-source domain adaptive crowdsourced named entity recognition model M_NER predicts each sample d_j in the dataset, producing a predicted label sequence Y_p; a similarity score score_j between Y_p and the crowdsourced label sequence Y_a is computed and saved into the set Φ;
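The similarity score between the predicted and crowdsourced tag sequences can be sketched as below. The exact metric is not specified in the patent; this illustrative choice counts agreement only at positions where at least one side carries a non-'O' tag, following the note that 'O' denotes the non-entity tag.

```python
def annotation_similarity(pred_tags, crowd_tags):
    # Agreement rate over positions where at least one side has a
    # non-'O' (entity) tag; the exact metric is an assumption.
    pairs = [(p, c) for p, c in zip(pred_tags, crowd_tags)
             if p != 'O' or c != 'O']
    if not pairs:
        return 1.0
    return sum(p == c for p, c in pairs) / len(pairs)

score = annotation_similarity(['B-PER', 'I-PER', 'O', 'O'],
                              ['B-PER', 'O', 'O', 'B-LOC'])
print(score)  # -> 0.3333333333333333
```

A low score suggests that the crowdsourced annotation disagrees with the trained model, which the selector can use as evidence of a low-quality sample.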
Then, using the trained multi-source domain adaptive crowdsourced named entity recognition model, for each sample d_j the annotator text representation R_j is computed and concatenated with R*, the average state representation of the sample set removed in the previous round, to obtain the enhanced text representation; from this, the reinforcement-learning state representation S_j is obtained. S_j is then mapped by the gradient policy network into the action score a_j = π(S_j; θ_selector), which is stored in a set. ('O' denotes the non-entity tag; "not equal to 'O'" means a tag other than 'O'.)
Next, the bottom p percent of samples with the smallest action scores are selected from the set to form the removal set Ψ_i; this portion of the data is removed from the training set, M_NER is retrained on the reduced training set, and it is evaluated on the validation set to obtain an F1 value. If this is not the first training round, the difference between this round's F1 value and the previous round's is taken as the reinforcement-learning reward r_i = F1_i - F1_{i-1}. The data removed this round but not last round, Ω_i = Ψ_i - (Ψ_i ∩ Ψ_{i-1}), and the data removed last round but not this round, Ω_{i-1} = Ψ_{i-1} - (Ψ_i ∩ Ψ_{i-1}), are obtained; then, according to the sign of the reward r_i, Ω_i and Ω_{i-1} are rewarded or penalized, and the parameters θ_selector of the policy network are updated:
θ_selector ← θ_selector + μ · r · ∇_θ log π(a | S; θ_selector)
In the formula above, μ is the learning-rate hyperparameter, which controls the gradient step size during learning; a is the action score a_j and S the state representation S_j (the subscript j is dropped because the formula is applied to all data samples, i.e., for each j); θ has the same meaning as θ_selector above; and ∇_θ denotes the derivative with respect to θ.
Finally, when the model reaches the maximum number of training rounds, or the patience setting stops training early, the data are filtered using the policy gradient model from the round with the largest F1 value to obtain the filtered training set, and the multi-source domain adaptive crowdsourcing model M_NER is trained again on this training set to achieve the best training effect.
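The per-round selection and set bookkeeping described above can be sketched as follows (a plain-Python illustration; sample IDs and scores are toy values):

```python
def select_removal(action_scores, p):
    # Bottom p percent of samples by action score form the removal set Ψ_i.
    k = max(1, int(len(action_scores) * p / 100))
    ranked = sorted(action_scores, key=action_scores.get)  # ascending score
    return set(ranked[:k])

def round_deltas(psi_i, psi_prev):
    # Ω_i: removed this round but not last; Ω_{i-1}: removed last but not this.
    omega_i = psi_i - (psi_i & psi_prev)
    omega_prev = psi_prev - (psi_i & psi_prev)
    return omega_i, omega_prev

scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.2}   # action scores per sample ID
psi_i = select_removal(scores, 50)          # the two lowest-scoring samples
omega_i, omega_prev = round_deltas(psi_i, {1, 2})
assert psi_i == {1, 3} and omega_i == {3} and omega_prev == {2}
```

Only the samples whose removal status changed between rounds (Ω_i and Ω_{i-1}) are rewarded or penalized, which keeps the credit assignment focused on the selector's latest decisions.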
In the reconstruction layer, the reclassification of the text representation with annotator information back to its annotator is implemented by a CNN.
In the total-loss formula, α, β and γ are hyperparameters that balance the three losses; they may be set to 1:1:1.
A crowdsourced named entity recognition system based on multi-source domain adaptation and reinforcement learning: the system is provided with program modules corresponding to the steps of the above technical scheme, and at runtime executes the steps of the above crowdsourced named entity recognition model based on multi-source domain adaptation and reinforcement learning.
A computer readable storage medium storing a computer program configured to implement the steps of the multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model described above when invoked by a processor.
The invention has the following beneficial technical effects:
the method and the system generate the synthesized expert representation by considering the reliability of the annotators, learn the reliability distribution of different annotators through training, and obtain the synthesized expert representation. The invention adopts the example selector based on reinforcement learning to discard low-quality labels, thereby improving the performance of a named entity recognition model on a crowdsourcing data set, deepening the understanding of the reliability of labels in a crowdsourcing named entity recognition field adaptation method, creatively providing the data preprocessing example selector based on reinforcement learning, and completely being applicable to the crowdsourcing named entity recognition model based on multi-source field adaptation, and being capable of solving two problems mentioned in the background: the existing method for solving the crowdsourcing problem by using the domain adaptation model often does not fully consider the problems that the reliability of the annotators causes negative influence on model training of the data of the low-quality annotators, the existing crowdsourcing naming entity identification method has difficulty in processing the extremely low-quality data submitted by the low-quality annotators, and the like, and shows the effectiveness of the method in solving the naming entity identification challenges in the crowdsourcing annotation.
The invention provides a multi-source domain adaptation and reinforcement learning-based crowdsourcing named entity identification method which is used for efficiently extracting named entity information from unsupervised crowdsourcing data.
Drawings
The legends in the drawings are respectively: FIG. 1 is a block diagram of a crowdsourcing named entity recognition model based on multi-source domain adaptation; FIG. 2 is a flow chart of an algorithm for a multi-source domain adaptation based crowdsourcing named entity recognition model.
Detailed Description
The implementation of the invention is described below with reference to the accompanying drawings. The invention treats the labels of multiple crowd annotators as multiple source domains and expert labels as the target domain, then applies the idea of multi-source domain adaptation to solve the crowdsourced named entity recognition problem. The proposed multi-source domain adaptive crowdsourced named entity recognition model comprises the following five components.
(1) Annotator representation layer (annotator embedding layer): generates the representations of the annotators and the expert, which are used to create parameters through a parameter generation network (PGN). These parameters are integrated into the Adapters of the text representation layer, giving it annotator-awareness.
(2) Text representation layer: receives a sentence and converts it into a tensor representation using the Adapter-BERT model. Each Transformer layer of BERT contains two supplementary modules called adapters. In the fine-tuning phase, only the adapter parameters are modified, so that the latent spaces of the multiple source domains are aligned.
(3) Text representation distance layer: computes the distance between the source-domain and target-domain text representations and uses it as part of the training loss. Through training, the reliability distribution of different annotators can be learned, yielding a synthesized expert representation.
(4) Reconstruction layer: classifying the textual representation as a annotator prevents weakening of annotator features of the textual representation during optimization of distance.
(5) Bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) layer: BiLSTM extracts context features from the text representation, and the CRF layer generates sequence tags from the BiLSTM output using state-feature and transition-feature functions.
The specific flow is as follows:
In the training phase, for a piece of crowdsourced training data containing training text x and the ID a_i of its annotator, the annotator ID and all annotator IDs are first input to the annotator representation layer, which generates the vector representation r_{a_i} of the annotator ID corresponding to the text and the representation matrix W_annotator of all annotator IDs; from W_annotator, a fitting expert representation r_expert is generated through an attention mechanism. The vector representation of the annotator ID and the fitting expert representation are then fed into the parameter generation network, and the resulting parameters are injected into the Adapter module of each Transformer layer of the text representation layer's Adapter-BERT, so that text representation layers with two different sets of parameters are obtained.
After the two text representation layers are obtained, the training text is input into both simultaneously to obtain the annotator text representation R_ann and the fitting expert text representation R_exp. These are input to the text representation distance layer, and the L2 distance is computed to obtain the distance loss L_dist. The annotator text representation is also input to the reconstruction layer and finally classified into an annotator ID by a convolution layer and a linear layer; the cross-entropy between the classified annotator ID and the original annotator ID gives the reconstruction loss L_rec. Finally, the annotator text representation is also input to the BiLSTM and CRF layers: the BiLSTM computes the sequence results in the forward and backward directions, the two results are concatenated, and the concatenation is input to the conditional random field to obtain the final predicted label sequence. The prediction is kept, compared with the crowdsourced annotation provided by the annotator, and the cross-entropy is computed to obtain the prediction loss L_pred.
Finally, the three losses are multiplied by their weight coefficients and summed to obtain the total loss for backpropagation: L_total = α · L_dist + β · L_rec + γ · L_pred.
In the prediction stage, only the IDs of all annotators are needed: they pass through the annotator representation layer and the parameter generation network to produce the corresponding parameters, generating the fitting expert text representation layer. The text to be predicted is input directly into this layer to obtain the fitting expert text vector representation, which is then fed into the BiLSTM and CRF layers to obtain the predicted labels.
The invention also provides a reinforcement-learning-based instance selector for removing lower-quality annotation data. The instance selector uses a Markov decision process model, fusing previous state information into the current state, and is trained with a policy gradient algorithm.
The specific flow is as follows:
First, the trained multi-source domain adaptive crowdsourced named entity recognition model M_NER predicts each sample d_j in the dataset, producing a predicted label sequence Y_p; a similarity score score_j between Y_p and the crowdsourced label sequence Y_a is computed and saved into the set Φ.
Then, using the trained multi-source domain adaptive crowdsourced named entity recognition model, for each sample d_j the annotator text representation R_j is computed and concatenated with R*, the average state representation of the sample set removed in the previous round, to obtain the enhanced text representation; from this, the reinforcement-learning state representation S_j is obtained. S_j is then mapped by the gradient policy network into the action score a_j = π(S_j; θ_selector), which is stored in a set.
Next, the p percent of samples with the smallest action scores in that set are selected to form the screening removal set Ψ_i; this portion of the data is removed from the training set, M_NER is retrained on the reduced training set, and it is evaluated on the validation set to obtain an F1 value. If this is not the first training round, the difference between this round's F1 and the previous round's F1 is taken as the reinforcement learning reward r_i = F1_i - F1_{i-1}; the data screened in this round but not in the previous round, Ω_i = Ψ_i - (Ψ_i ∩ Ψ_{i-1}), and the data screened in the previous round but not in this round, Ω_{i-1} = Ψ_{i-1} - (Ψ_i ∩ Ψ_{i-1}), are computed. According to the sign of the reward r_i, Ω_i and Ω_{i-1} are rewarded or punished, and the parameters θ_selector of the policy network are updated.
Finally, when the model reaches the maximum number of training rounds or the early-stop patience is exhausted, training stops. The policy gradient model from the round with the highest F1 value is used to screen the data, yielding a screened training set, and the multi-source domain adaptation crowdsourcing model M_NER is trained once more on this training set to achieve the best training effect.
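The reward-and-update step of the flow above can be sketched as a plain REINFORCE update, theta <- theta + mu * r * grad log pi; the one-dimensional gradients and F1 values below are illustrative, not taken from the patent:

```python
def reinforce_update(theta, grads, reward, mu=0.01):
    # One policy gradient step: move each parameter along its log-probability
    # gradient, scaled by the learning rate mu and the (signed) reward.
    return [t + mu * reward * g for t, g in zip(theta, grads)]

f1_prev, f1_curr = 0.80, 0.83
reward = f1_curr - f1_prev      # positive: this round's removals helped
theta = [0.5, -0.2]             # toy policy parameters theta_selector
grads = [1.0, 2.0]              # d/d theta of log pi for the actions taken
theta = reinforce_update(theta, grads, reward)
```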
The following table gives the algorithm of the reinforcement learning based instance selector:
The technical effect and feasibility of the crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning are verified as follows:
(1) Datasets: To evaluate our RL-RMDA method, we use two published benchmark crowdsourcing datasets: FMKKM11 and CoNLL03. FMKKM11 is a Twitter-based social network crowdsourced named entity recognition dataset first constructed by Finin et al., comprising 9821 texts and 269 crowd annotators. Fromreide et al. later provided additional expert labels for it; for our experiments, we split these expert-labeled texts uniformly at random into development and test sets. CoNLL03, on the other hand, is a news named entity recognition dataset proposed by Sang et al.; Rodrigues et al. later provided a crowdsourced version containing 5985 texts and 47 crowd annotators. Here we use the training, development and test sets provided by Rodrigues et al.
(2) Evaluation metrics: The standard CoNLL-2003 evaluation metrics, including entity-level precision (P), recall (R), and macro-F1 value (F1), are used to evaluate the performance of our crowdsourcing named entity recognition model. An entity scores only when it matches exactly.
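Exact-match entity-level scoring over BIO tags can be sketched as follows; this is a simplified stand-in for the official CoNLL evaluation script (stray I- tags without a preceding B- are simply ignored here, which the official script treats differently):

```python
def extract_entities(tags):
    # Collect (type, start, end) spans from a BIO tag sequence; an "O"
    # sentinel is appended so the final span gets flushed.
    spans, start, etype = [], None, None
    for i, t in enumerate(tags + ["O"]):
        starts_new = t.startswith("B-")
        breaks_old = t == "O" or (t.startswith("I-") and etype != t[2:])
        if starts_new or breaks_old:
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, t[2:]) if starts_new else (None, None)
    return spans

def entity_f1(gold, pred):
    # An entity counts as correct only on an exact span-and-type match.
    g, p = set(extract_entities(gold)), set(extract_entities(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
prec, rec, f1 = entity_f1(gold, pred)  # prec = 1.0, rec = 0.5
```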
(3) Experimental setup: We used an RTX 3090 graphics card with 24 GB of video memory for model training and performed all experiments on a single-GPU server equipped with a 12-core CPU and 128 GB of system memory.
For the crowdsourcing named entity recognition model based on the multi-source domain adaptation technique, BERT-base is selected as the pre-trained model. Using the transformers library, we load the BERT weights and inject the corresponding parameters into the Adapters. The intermediate hidden layer size of the Adapter is set to 128, and an Adapter is added to every layer of BERT. The annotator embedding dimension is set to 8, and the hidden state size of the BiLSTM to 400. For model optimization, we use the AdamW optimizer with a learning rate of 1×10⁻³; the parameters of the Transformer part are trained with a learning rate of 1×10⁻⁵ and a weight decay of 0.01. The maximum number of training rounds is 25, and the patience of the early-stop strategy is set to 5 rounds. Based on performance on the validation set, we select the best-performing training round. The batch size is 64, and on the word-level representation we apply a time-step dropout strategy with a dropout rate of 0.1 to reduce the risk of overfitting.
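The training schedule just described, at most 25 rounds with an early-stop patience of 5 and the best round chosen on the validation set, can be sketched as follows (per-round validation F1 values stand in for actual training):

```python
def best_round_with_early_stop(val_f1_by_round, patience=5, max_rounds=25):
    # Track the best validation F1 seen so far; stop after `patience`
    # consecutive rounds without improvement, and return the best round.
    best_f1, best_round, waited = float("-inf"), -1, 0
    for rnd, f1 in enumerate(val_f1_by_round[:max_rounds]):
        if f1 > best_f1:
            best_f1, best_round, waited = f1, rnd, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_round, best_f1

scores = [0.70, 0.74, 0.73, 0.75, 0.74, 0.74, 0.74, 0.74, 0.74]
best_round, best_f1 = best_round_with_early_stop(scores)
# training stops after 5 non-improving rounds; round 3 (0.75) is kept
```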
For the reinforcement learning based instance selection part, we use the Adam optimizer with a learning rate of 1×10⁻² for parameter updates. The maximum number of iteration rounds is 10, and the early-stop patience is again 5 rounds. We run experiments with 5 preset random seeds [22, 6622, 6699, 333555, 999111] and report the average performance of the model.
(4) Baseline models: To evaluate the performance of our proposed RL-RMDA, we compared it to the following baseline models:
ABBC (Adapter-BERT-BiLSTM-CRF): a traditional NER method that ignores crowdsourcing and treats all crowdsourced annotations as expert annotations.
HMM-Crowd [1]: an improved HMM model that accounts for annotation quality and represents each annotator as a vector.
LC and LC-cat [1]: these methods map annotator IDs to vectors but integrate them differently: LC uses an LSTM layer, while LC-cat uses a CRF layer.
DL-CL [2]: an innovative expectation-maximization approach incorporating crowdsourcing that allows training directly on noisy crowdsourced data.
BSC-seq [3]: a Bayesian approach that aggregates labels by modeling the dependencies between sequence tags to reduce errors.
CLasDA [4]: this method integrates multi-source domain adaptation with crowdsourcing named entity recognition and uses PGN-Adapter-BERT for text encoding. All annotators are treated as identical. In preprocessing, it deletes annotations in which all words share the same tag.
Crowd-OEI [5]: this method extends CLasDA with a mixup data augmentation strategy to improve Chinese opinion expression recognition under crowdsourced annotation.
Neural-Hidden-CRF [6]: a graphical model tailored to weakly supervised sequence labeling, using BERT for rich contextual understanding and a hidden CRF layer to capture internal label dependencies.
The provided figures show the effect of the algorithm of the invention on crowdsourcing named entity recognition datasets: it performs well not merely on a single dataset but across multiple crowdsourcing named entity recognition datasets in different fields.
Table 1 Comparison of performance with the reference models
Table 1 presents the experimental results on the two datasets. The results of the best reference model are marked in bold, the best overall results are highlighted, and a superscript indicates that the test corpus used differs from that of the invention. The advantages of the invention are apparent from the table.
Compared with existing crowdsourcing named entity recognition models, the method provided by the invention is superior in recognition accuracy.
The technical effect claimed by the invention has been verified through simulation experiments and practical application.
The model (algorithm or method) provided by the invention is the underlying technical kernel of the invention, and various products can be derived from this algorithm.
Using a programming language, the model provided by the invention is developed into a crowdsourcing named entity recognition system based on multi-source domain adaptation and reinforcement learning; the system has program modules corresponding to the steps of the model in the technical scheme, and when run it executes the steps of the crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning.
The computer program of the developed system (software) is stored on a computer-readable storage medium and is configured to implement the steps of the above crowdsourcing named entity recognition model when invoked by a processor; that is, the invention is also embodied as a computer program product on a carrier.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired result of the technical solution disclosed in the present application is achieved, which is within the scope of the present invention.
The references cited in the present invention are as follows:
[1] Nguyen A T, Wallace B C, Li J J, et al. Aggregating and predicting sequence labels from crowd annotations[C]//Proceedings of the Conference. Association for Computational Linguistics. Meeting. NIH Public Access, 2017: 299.
[2] Rodrigues F, Pereira F. Deep learning from crowds[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
[3] Simpson E, Gurevych I. A Bayesian Approach for Sequence Tagging with Crowds[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 1093-1104.
[4] Zhang X, Xu G, Sun Y, et al. Crowdsourcing Learning as Domain Adaptation: A Case Study on Named Entity Recognition[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021: 5558-5570.
[5] Zhang X, Xu G, Sun Y, et al. Identifying Chinese Opinion Expressions with Extremely-Noisy Crowdsourcing Annotations[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022: 2801-2813.
[6] Chen Z, Sun H, Zhang W, et al. Neural-Hidden-CRF: A Robust Weakly-Supervised Sequence Labeler[C]//Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023: 274-285.

Claims (8)

1. A crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning, characterized in that: the model comprises a crowdsourcing named entity recognition main model, which regards the labels of a plurality of crowd annotators as multiple source domains and the expert as the target domain, the crowdsourcing named entity recognition main model comprising:
(1) Annotator representation layer: used to generate representations of the annotators a_i and an expert representation, and to create parameters from the annotator and expert representations through a Parameter Generation Network (PGN); the parameters created from the annotator representations and from the expert representation are injected into the Adapter modules of the text representation layer respectively, so that the text representation layer is annotator-aware;
(2) Text representation layer: the text representation layer is an improved BERT model, called the Adapter-BERT model; the improvement is that an Adapter module is added to each Transformer layer of the BERT model;
during training, the parameters of the BERT model are frozen and do not participate in training; only the parameters in the Adapter modules are trained, which reduces the number of trainable parameters of the BERT model. The annotator representations thereby participate in the training of the Adapter-BERT model while the original knowledge is preserved, and new knowledge is learned to improve the accuracy of crowdsourcing named entity recognition;
the text representation layer receives a sentence X = {x_i} and uses the Adapter-BERT model to convert it into a text representation (tensor) with annotator information and a text representation (tensor) with expert information respectively;
(3) Text representation distance layer: the text representation with annotator information and the text representation with expert information output by the text representation layer are taken as the multiple source domains and the target domain respectively;
the text representation distance layer computes the distance between the source-domain and target-domain text representations, and this distance serves as part of the training loss;
(4) Reconstruction layer: the text representation with annotator information from the text representation distance layer is reclassified back to the expected annotator a_i, and the cross entropy loss between a_i and the reclassified annotator ID is computed as part of the training loss; this prevents the annotator features of the text representation from being weakened while the text representation distance is being optimized;
(5) Bidirectional long short-term memory network (BiLSTM) and conditional random field (CRF) layer: the text representation with annotator information from the text representation distance layer is input; BiLSTM extracts context features from it, and the state feature functions and transition feature functions of the CRF layer generate the sequence tags, i.e. the predicted labels, from the BiLSTM output; the cross entropy loss between the crowdsourced labels in the dataset and the predicted labels is computed as part of the training loss.
2. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 1, wherein: the specific process of model training and prediction is as follows:
in the training phase, for a piece of crowdsourced training data containing a training text x and the ID a_i of its annotator, the annotator ID and all annotator IDs are first input into the annotator representation layer, which generates the vector representation of the annotator ID corresponding to the text and the representation matrix W_annotator of all annotator IDs; the representation matrix of all annotator IDs generates a fitted expert representation through an attention mechanism. The vector representation of the annotator ID and the fitted expert representation are then input into the parameter generation network to obtain parameters, which are injected into the Adapter module of each Transformer layer of the Adapter-BERT text representation layer, yielding text representation layers with two different sets of parameters;
after the two text representation layers are obtained, the training text is input into both simultaneously to obtain a text representation with annotator information (the annotator text representation) and a text representation with expert information (the fitted expert text representation). The two representations are input into the text representation distance layer, and the L2 distance (two-norm distance) between them is computed as the distance loss. The annotator text representation is also input into the reconstruction layer, where a convolution layer (CNN) with kernel size 2 extracts convolution features h_ann and a linear layer produces the final classification features O_ann; applying softmax to these features gives the probability of each annotator ID, including the expected ID a_i, and the reconstruction loss is computed from the cross entropy loss. Finally, the annotator text representation is also input into the BiLSTM and CRF layers: the BiLSTM computes results over the sequence in the forward and backward directions, the two results are concatenated and fed into the conditional random field to produce the finally predicted label sequence; the prediction is retained, compared with the crowdsourced annotation provided by the annotator, and the cross entropy is computed as the prediction loss. Here A is the set of all annotators {a_1, a_2, ..., a_m}, where m is the number of annotators;
finally, the three losses are each multiplied by a weight coefficient and summed to obtain the total loss, which is then back-propagated;
in the prediction stage, only the IDs of all annotators are needed: they generate the corresponding parameters through the annotator representation layer and the parameter generation network, producing a fitted expert text representation layer; the text to be predicted is input directly into this layer to obtain the fitted expert text vector representation, which is then passed through the BiLSTM and CRF layers to obtain the predicted labels.
3. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 1 or 2, wherein: the model further comprises an instance selector, which removes low-quality annotation data from the training data used to train the crowdsourcing named entity recognition main model.
4. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 3, wherein: the instance selector uses a Markov decision process model, fuses the previous state information into the current state, and is trained with a policy gradient algorithm;
the specific flow is as follows:
first, the trained multi-source domain adaptation crowdsourcing named entity recognition model M_NER predicts, for each sample d_j in the dataset, a label sequence Y_p; a similarity score_j between Y_p and the crowdsourced labels Y_a is computed and saved into the set Φ;
then, for each sample d_j in the dataset, the trained multi-source domain adaptation crowdsourcing named entity recognition model computes its annotator text representation R_j, which is concatenated with the average state representation R* of the sample sets removed in previous rounds to obtain an enhanced text representation; from this and score_j the reinforcement learning state representation S_j is obtained; the policy gradient network then maps S_j to an action score a_j = π(S_j; θ_selector), which is stored in a set;
next, the p percent of samples with the smallest action scores in that set are selected to form the screening removal set Ψ_i; this portion of the data is removed from the training set, M_NER is retrained on the reduced training set, and it is evaluated on the validation set to obtain an F1 value; if this is not the first training round, the difference between this round's F1 and the previous round's F1 is taken as the reinforcement learning reward r_i = F1_i - F1_{i-1}; the data screened in this round but not in the previous round, Ω_i = Ψ_i - (Ψ_i ∩ Ψ_{i-1}), and the data screened in the previous round but not in this round, Ω_{i-1} = Ψ_{i-1} - (Ψ_i ∩ Ψ_{i-1}), are computed; according to the sign of the reward r_i, Ω_i and Ω_{i-1} are rewarded or punished, and the parameters θ_selector of the policy network are updated;
in the above formula, μ is the learning rate hyperparameter, which controls the gradient step size during learning; a is the action score a_j; S refers to the state representation S_j; the subscript j is omitted here because j indexes a single data sample and the formula applies to every j; θ has the same meaning as above;
finally, when the model reaches the maximum number of training rounds or the early-stop patience is exhausted, training stops; the policy gradient model from the round with the highest F1 value is used to screen the data, yielding a screened training set, and the multi-source domain adaptation crowdsourcing model M_NER is trained once more on this training set to achieve the best training effect.
5. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 1, wherein: in the reconstruction layer, reclassifying the text representation with annotator information from the text representation distance layer back to the annotator is achieved by a CNN.
6. The multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of claim 2, wherein:alpha, beta and gamma in the formula (I) are super parameters for balancing three losses, and the value can be 1:1:1.
7. A crowdsourcing named entity recognition system based on multi-source domain adaptation and reinforcement learning, characterized in that: the system has program modules corresponding to the steps of the model of any one of claims 1-6, and when run it performs the steps of the above crowdsourcing named entity recognition model based on multi-source domain adaptation and reinforcement learning.
8. A computer-readable storage medium, characterized by: the computer readable storage medium stores a computer program configured to implement the steps of the multi-source domain adaptation and reinforcement learning based crowdsourcing named entity recognition model of any one of claims 1-6 when invoked by a processor.
CN202311442418.3A 2023-11-01 2023-11-01 Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning Pending CN117436449A (en)

Publications (1)

Publication Number Publication Date
CN117436449A true CN117436449A (en) 2024-01-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination