CN114239584A - Named entity identification method based on self-supervision learning - Google Patents

Named entity identification method based on self-supervision learning

Info

Publication number
CN114239584A
Authority
CN
China
Prior art keywords
entity
output
embedding
vector
main network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111539122.4A
Other languages
Chinese (zh)
Inventor
周仁杰
胡强
万健
张纪林
殷昱煜
蒋从锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111539122.4A priority Critical patent/CN114239584A/en
Publication of CN114239584A publication Critical patent/CN114239584A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a named entity recognition method based on self-supervised learning, which comprises the following steps: preprocess a data set; construct positive and negative example sentence pairs from the processed data; and encode the sentences in the positive and negative example pairs with an embedding encoder. In a named entity recognition model based on self-supervised learning, the different senses of an entity in different contexts are learned from distinct entity features and a similarity matrix, and the feature vectors of the model are trained on the similarity of positive and negative example sentence pairs, accommodating the linguistic differences among corpora. The invention improves the accuracy of named entity recognition and uses a public knowledge graph to correct entity-type errors caused by word abbreviations in the output, so that entities and entity types are predicted more accurately and the embedding vector of an ambiguous word better represents its sense in the current context.

Description

Named entity identification method based on self-supervision learning
Technical Field
The invention relates to named entity recognition methods, and in particular to a named entity recognition method based on self-supervised learning.
Background
With the arrival of the big data era, research on named entity recognition has gradually become one of the emerging international frontiers at the intersection of cognitive science, information science and intelligence science. In recent years, developed Western countries have attached increasing importance to named entity recognition, and open-source information extraction has become an important basis for national defense policy, strategic decision-making and command operations. Named entity recognition is also rapidly becoming one of the leading international research hotspots in informatics.
Existing named entity recognition methods mostly extract entities and entity types from text. The main task of named entity recognition is to identify and classify proper nouns such as person names and place names, as well as meaningful numerical phrases such as times and dates, in text. There are three main approaches to named entity recognition: rule-based methods, statistics-based methods, and supervised-learning-based methods.
Rule-based methods extract entities from text by means of manually constructed rules. They achieve high accuracy in certain specific domains, but this also brings strong limitations, such as poor cross-domain portability, because the rules only work well in the domains for which they were written. Statistics-based methods perform statistics over text and mine word features from a corpus; they place high demands on the corpus, yet general-purpose corpora suitable for evaluating large named entity recognition tasks are still scarce, which limits their development to some extent. Supervised-learning-based methods train a classifier on labeled data and apply it to new entity recognition; they partly overcome the domain limitation of rule-based methods and the corpus requirements of statistics-based methods, but they do not learn the sense of an ambiguous word in its current context well during the word embedding stage.
The invention uses self-supervised learning to further model ambiguous words, proposes a named entity recognition method based on self-supervised learning, and builds a complete named entity recognition model.
Disclosure of Invention
The invention aims to solve the problem that existing named entity recognition techniques do not learn the sense of an ambiguous word in its current context well during the word embedding stage, and provides a named entity recognition method based on self-supervised learning.
The technical scheme adopted by the invention is as follows:
step 1: preprocessing the data set;
1-1, composing sentences from the entity-type-annotated words and connecting words in the data set;
1-2, translating the sentence s_i of step 1-1 into a sentence a_i in an arbitrary intermediate language, and then translating a_i back into the original language to obtain the positive example sentence s_i^+;
Step 2: constructing the positive and negative example sentence pair sets from the sentences processed in step 1, wherein the positive example sentence pair set consists of the pairs (s_i, s_i^+), s_i^+ being the back-translation of s_i obtained in step 1-2, and the negative example sentence pair set consists of the pairs (s_i, s_j^+) with j ≠ i; that is, a negative example sentence pair consists of an original sentence and the back-translation of a different sentence in the corpus;
Step 3: encoding the sentences in the positive and negative example sentence pairs respectively with an embedding encoder;
Step 4: inputting the embedding-encoded word vectors into a deep neural network (DNN) layer;
Step 5: performing similarity calculation on the output vectors of the positive and negative example sentence pairs obtained in step 4 and concatenating the results row-wise into a new similarity matrix M_sim; then optimizing the parameters of the embedding encoder f_k of step 3 with a contrastive loss function l through back propagation and gradient descent;
Step 6: obtaining sentences composed of entity-type-annotated words, constructing a data set, and dividing the data set into a training set and a test set;
Step 7: building a named entity recognition model based on self-supervised learning, the model comprising a main network and a correction module cascaded in sequence; then training the main network with the training set, testing the trained main network with the test set, and finally correcting the output result of the tested main network with the correction module;
the main network comprises the embedding encoder f_k optimized in step 5, a bidirectional LSTM layer and a CRF layer;
the correction module comprises a phrase retrieval module and an entity type modification module; the phrase retrieval module obtains the candidate entities of the main network input, screens out those that exist in the public knowledge graph, and combines the retained entities with their entity types into a potential entity set PE; the potential entity set comprises single words, phrases formed of several words, and the entity types corresponding to these words and phrases; the entity type modification module receives the potential entity set PE output by the phrase retrieval module and the entity type labels output by the main network, and compares the entity type output by the main network with the entity type recorded in PE for each potential entity appearing in the main network input; if they are consistent, no modification is needed, and if they are inconsistent, the entity type modification module modifies the output result of the main network;
Step 8: performing named entity recognition on text with the tested named entity recognition model based on self-supervised learning.
It is a further object of the present invention to provide a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the above-mentioned method.
It is a further object of the present invention to provide a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method described above.
The technical scheme provided by the invention has the following beneficial effects:
according to the invention, positive and negative example sentence pairs are constructed by translating sentences through multiple intermediate languages. In practice the negative examples are stored in a queue: the data of the current mini-batch enters the queue and the oldest mini-batch data is removed from it. The queue decouples the number of negative examples from the batch size, i.e., the queue size is not constrained by the batch size, which effectively addresses the need for a large number of negative examples in self-supervised learning;
a similarity function is used to measure the similarity of the word embedding vectors of sentences in the representation space, and the parameters of the embedding encoder are updated slowly by a momentum moving average. This avoids the loss of feature consistency that would be caused by drastic changes in the encoder parameters while still keeping the encoder up to date, so that in the word embedding stage ambiguous words are encoded in a way that better matches their sense in the current context;
the invention uses a public knowledge graph to correct entity-type recognition errors caused by word abbreviations in the output result, further improving the accuracy of entity recognition.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of an embedding encoder optimization for self-supervised learning;
FIG. 3 is a diagram of a named entity recognition model architecture based on self-supervised learning according to the present invention;
FIG. 4 is a diagram of a correction module in the named entity recognition model based on self-supervised learning according to the present invention;
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings. The specific flow description is shown in fig. 1, wherein:
step 1: preprocessing the data set;
1-1, composing sentences from the entity-type-annotated words and connecting words in the data set;
an entity is a proper noun in the text, such as a person name, a place name or an organization name;
1-2, translating the sentence s_i of step 1-1 into a sentence a_i in an arbitrary intermediate language, and then translating a_i back into the original language to obtain the positive example sentence s_i^+;
Step 2: constructing the positive and negative example sentence pair sets from the sentences processed in step 1, wherein the positive example sentence pair set consists of the pairs (s_i, s_i^+), s_i^+ being the back-translation of s_i obtained in step 1-2, and the negative example sentence pair set consists of the pairs (s_i, s_j^+) with j ≠ i; that is, a negative example sentence pair consists of an original sentence and the back-translation of a different sentence in the corpus;
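As an illustration of steps 1 and 2, the sketch below builds positive example pairs by back-translation and negative example pairs by pairing a sentence with the back-translation of a different sentence. The translate helper is a placeholder for any machine translation service (the patent does not prescribe one), and the pivot language is an arbitrary assumption.

```python
import random

def back_translate(sentence, translate, pivot_lang="de"):
    """Translate a sentence into a pivot language and back into the original language.
    `translate(text, src, tgt)` is a hypothetical MT helper supplied by the caller."""
    pivot = translate(sentence, src="en", tgt=pivot_lang)
    return translate(pivot, src=pivot_lang, tgt="en")

def build_sentence_pairs(sentences, translate):
    """Return positive pairs (s_i, s_i+) and negative pairs (s_i, s_j+) with j != i."""
    back = [back_translate(s, translate) for s in sentences]
    positives = list(zip(sentences, back))
    negatives = []
    for i, s in enumerate(sentences):
        j = random.choice([k for k in range(len(sentences)) if k != i])
        negatives.append((s, back[j]))  # back-translation of a *different* corpus sentence
    return positives, negatives
```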
Step 3: encoding the sentences in the positive and negative example sentence pairs with the embedding encoders, specifically as follows:
the sentence s_i is input to the embedding encoder f_q (query encoder) for word embedding encoding, yielding the encoded result q_i; at the same time, the positive and negative example sentences s_i^+ and s_j^+ corresponding to s_i are input to the embedding encoder f_k (key encoder) for word embedding encoding, yielding the encoded results k_i^+ and k_i^-;
the embedding encoders f_q and f_k have the same initialization parameters θ_q and θ_k;
Step 4: inputting the embedding-encoded word vectors into a deep neural network (DNN) layer;
the deep neural network layer includes a first fully-connected layer, a Relu layer, and a second fully-connected layer.
(1) First fully-connected layer: converts the embedding vector output by the not-yet-optimized embedding encoder into an output vector of the same dimensionality through a linear transformation:
o_dense1 = W·x_input + b
where o_dense1 denotes the output vector, x_input denotes the embedding vector output by the unoptimized embedding encoder, W denotes a weight matrix, and b denotes a bias vector;
(2) ReLU layer: the output vector of the first fully-connected layer is passed through a ReLU activation function, which helps keep the convergence of the model stable:
o_dense2 = max(o_dense1, 0)
where o_dense2 denotes the output vector of the ReLU layer;
(3) Second fully-connected layer: converts the output vector of the ReLU layer into an output vector whose dimensionality equals the number of predicted entity types;
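For concreteness, a minimal PyTorch sketch of this DNN layer (linear layer, ReLU, linear layer projecting to the number of entity types) is given below; the embedding dimension and the number of entity types are illustrative assumptions, not values fixed by the patent.

```python
import torch.nn as nn

class ProjectionDNN(nn.Module):
    """First fully-connected layer -> ReLU -> second fully-connected layer."""
    def __init__(self, embed_dim=768, num_entity_types=9):
        super().__init__()
        self.dense1 = nn.Linear(embed_dim, embed_dim)          # o_dense1 = W x_input + b (same dimensionality)
        self.relu = nn.ReLU()                                   # o_dense2 = max(o_dense1, 0)
        self.dense2 = nn.Linear(embed_dim, num_entity_types)    # project to the number of entity types

    def forward(self, x):
        return self.dense2(self.relu(self.dense1(x)))
```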
Step 5: performing similarity calculation on the output vectors of the positive and negative example sentence pairs obtained in step 4 and concatenating the results row-wise into a new similarity matrix M_sim; then optimizing the parameters of the embedding encoder f_k of step 3 with a contrastive loss function l through back propagation and gradient descent. The specific operations are as follows:
5-1, the output vectors of the DNN corresponding to q_i, k_i^+ and k_i^- are compared with a similarity function sim(·,·), giving the positive example similarity r^+ = sim(q_i, k_i^+) of similar sentences and the negative example similarity r^- = sim(q_i, k_i^-) of dissimilar sentences; r^+ and r^- are then aggregated row-wise to obtain the similarity matrix M_sim = [r^+; r^-];
5-2, the similarity of the positive and negative example sentence pairs in the vector representation space is measured with the following contrastive loss function l:

l = -log( exp(r^+ / τ) / sum(exp(M_sim / τ)) )

where τ is a hyper-parameter whose role is to scale the similarities to a suitable input magnitude, exp(·) denotes the exponential function with natural base e, and sum(·) denotes summing the matrix elements along rows;
5-3, the parameters of the embedding encoder f_k are optimized with the contrastive loss l through back propagation and gradient descent;
here f_k updates θ_k by a momentum moving average, and a queue is used to store the mini-batch data (keys) encoded by f_k, which serve as the negative examples k_i^- for the current sentence s_i; the data of the current mini-batch enters the queue while the oldest mini-batch data is removed from it. The momentum moving average is:

θ_k ← m·θ_k + (1 - m)·θ_q

where m is the momentum coefficient;
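The following PyTorch-style sketch puts step 5 together under stated assumptions: cosine similarity is used for sim(·,·), the loss is implemented in the InfoNCE-style form given above, the negative keys come from a fixed-size queue, and f_k is updated with the momentum moving average θ_k ← m·θ_k + (1 - m)·θ_q. The queue size, temperature and momentum values are placeholders rather than values specified by the patent.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k_pos, queue, tau=0.07):
    """q, k_pos: (B, D) mini-batch outputs of f_q / f_k; queue: (K, D) stored negative keys.
    Builds the row-wise similarity matrix M_sim = [r+, r-] and returns the contrastive loss l."""
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    r_pos = (q * k_pos).sum(dim=1, keepdim=True)          # positive similarities r+
    r_neg = q @ F.normalize(queue, dim=1).t()              # negative similarities r- against the queue
    m_sim = torch.cat([r_pos, r_neg], dim=1)                # similarity matrix M_sim, one row per sentence
    # l = -log( exp(r+/tau) / sum(exp(M_sim/tau)) ), averaged over the mini-batch
    return -(r_pos.squeeze(1) / tau - torch.logsumexp(m_sim / tau, dim=1)).mean()

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def update_queue(queue, new_keys, max_size=4096):
    """Enqueue the keys of the current mini-batch and drop the oldest ones."""
    return torch.cat([new_keys.detach(), queue], dim=0)[:max_size]
```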
FIG. 2 is a flow chart of an embedding encoder optimization for self-supervised learning;
Step 6: obtaining sentences composed of entity-type-annotated words, constructing a data set, and dividing the data set into a training set and a test set;
Step 7: building a named entity recognition model based on self-supervised learning which, as shown in FIG. 3, comprises a main network and a correction module cascaded in sequence; then training the main network with the training set, testing the trained main network with the test set, and finally correcting the output result of the tested main network with the correction module;
a) the main network comprises the embedding encoder f_k optimized in step 5, a bidirectional LSTM layer and a CRF layer;
1) the optimized embedding encoder f_k encodes each word of a sentence into a word embedding vector; its input is a complete sentence and its output is the word embedding vector of every word in the sentence;
2) the bidirectional LSTM layer is used for learning dependency information among words; the input is a word embedding vector, and the output is a word embedding vector containing dependency information among words;
Compared with an RNN, which has only a hidden state h_t, the LSTM additionally maintains a cell state c_t; together the hidden state and the cell state can store all valid information at and before time t. The LSTM protects and controls information through three gating units: an input gate, a forget gate, and an output gate. The first step of the LSTM is to discard part of the information retained during long-sequence training. This step is performed by the forget gate, which reads h_{t-1} and x_t and outputs a value between 0 and 1 through a sigmoid activation function, where 0 means completely discard and 1 means completely keep. The forget gate is computed as:

f_t = σ(W_f·h_{t-1} + U_f·x_t + b_f)

where x_t is the word embedding vector output by the embedding encoder f_k, h_{t-1} is the hidden state of the LSTM at time t-1, W_f and U_f are the weight matrices of the forget gate for h_{t-1} and x_t respectively, b_f is the bias vector of the forget gate, σ(·) denotes the sigmoid activation function, and f_t is the output of the forget gate;
the second step is to update the cell state ct. In updating ctPreviously it was necessary to determine which information needs to be updated by means of the input gate and to determine the candidate update content (candidate value vector z) by means of a tanh layer. The calculation mode through the input gate and the tanh layer is similar to that of the forgetting gate, and the calculation mode is as follows:
it=σ(Wiht-1+Uixt+bi)
z=tanh(Wzht-1+Uzxt+bz)
wherein WiAnd UiAre respectively an input gate ht-1And xtWeight matrix of biIs the offset vector of the input gate,itis the output of the input gate; wzAnd UzRespectively h in the candidate value vectort-1And xtWeight matrix of bzAn offset vector that is a vector of candidate values;
The cell state c_t is then updated by element-wise multiplication:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ z

where ⊙ denotes the element-wise (Hadamard) product;
the last step through LSTM is to update the hidden state ht. Update htThe cell state c needs to be determinedtThe state h is updated by processing the tanh layer to obtain a value between-1 and 1, and multiplying the value by the output point of the output gatetThe output gate is similar to the forgetting gate and the input gate in calculation mode.
ot=σ(Woht-1+Uoxt+bo)
ht=ot⊙tanh(ct)
Wherein WoAnd UoRespectively in the output gatet-1And xtWeight matrix of boIs an offset vector of the output gate, otIs the output of the output gate;
for many sequence tagging tasks, it makes sense to access both past and future information, whereas the hidden state of the one-way LSTM can only obtain information from the past. In order to obtain both past and future information, a bi-directional LSTM is used. The output of the bi-directional LSTM is the score of each token belonging to each class of tags, which needs to be normalized by softmax after calculation:
Figure BDA0003413389280000061
wherein gamma isiNormalized result of the label score representing the ith token, xiA label score vector representing the ith token, n being the size of the label category;
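To make the gate equations above concrete, here is a single time step of the LSTM cell written out in NumPy, following the formulas for f_t, i_t, z, c_t, o_t and h_t, together with the softmax normalization of the label scores. The weight layout (dictionaries keyed by gate name) is an illustrative choice; in practice a library bidirectional LSTM would be used, running one such recurrence forward and one backward over the sentence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by 'f', 'i', 'z', 'o'."""
    f_t = sigmoid(W['f'] @ h_prev + U['f'] @ x_t + b['f'])   # forget gate f_t
    i_t = sigmoid(W['i'] @ h_prev + U['i'] @ x_t + b['i'])   # input gate i_t
    z   = np.tanh(W['z'] @ h_prev + U['z'] @ x_t + b['z'])   # candidate value vector z
    c_t = f_t * c_prev + i_t * z                              # cell state update (element-wise products)
    o_t = sigmoid(W['o'] @ h_prev + U['o'] @ x_t + b['o'])   # output gate o_t
    h_t = o_t * np.tanh(c_t)                                  # hidden state update
    return h_t, c_t

def softmax(label_scores):
    """Normalize the per-token label scores output by the bidirectional LSTM."""
    e = np.exp(label_scores - label_scores.max())
    return e / e.sum()
```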
3) a CRF layer for further correcting the recognition result; its input is the output vector of the bidirectional LSTM layer and its output is the entity label of each word;
A CRF is a typical discriminative model whose role here is to further correct the recognition result. In a named entity recognition task, the output may contain meaningless label sequences because the model above it does not consider the dependencies between labels; the CRF combines context information to capture these label dependencies, so that the recognized entities obey the labeling rules;
in CRF, there are two very important scores, namely an emulsion score and a Transition score. Wherein the emision score is derived from the output of the bi-directional LSTM model, specifically predicting for each token a score for each class of labels; while Transition score is the probability of Transition from a certain class of tags to another, the Transition matrix is the Transition probability that can be trained to change the internal tags. With the occurrence score and the Transition score, the Path score Path of the current output sequence can be calculated, as shown in the formula:
Figure BDA0003413389280000071
Ti,j=emi+transi,j
em thereiniAnd transi,jAn emulation score of the ith token and a Transition score, T, of the label transferred from the ith token to the jth token in a sentence, respectivelyi,jThe sum of the number of the emulation score and the Transition score in a sentence. CRF training is performed by:
Figure BDA0003413389280000072
wherein, PathrealPath score for the correct Path in the training process, PathiFor the path score of the ith possible path, loss represents the loss function of the CRF layer;
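A small sketch of the CRF scoring just described: the score of a labeled path sums the emission and transition scores along the sentence, and the loss normalizes the correct path's score against all possible paths. All paths are enumerated brute-force here for clarity (real CRF layers compute the normalizer with the forward algorithm); array shapes and names are illustrative assumptions.

```python
import itertools
import numpy as np

def path_score(emissions, transitions, tags):
    """emissions: (T, L) per-token label scores; transitions: (L, L); tags: length-T label sequence."""
    score = emissions[np.arange(len(tags)), tags].sum()                              # emission scores em_i
    score += sum(transitions[tags[t], tags[t + 1]] for t in range(len(tags) - 1))    # transition scores trans_{i,j}
    return score

def crf_loss(emissions, transitions, gold_tags):
    """loss = -log( exp(Path_real) / sum_i exp(Path_i) ), Path_i ranging over all label paths."""
    seq_len, n_labels = emissions.shape
    scores = np.array([path_score(emissions, transitions, list(p))
                       for p in itertools.product(range(n_labels), repeat=seq_len)])
    log_z = scores.max() + np.log(np.exp(scores - scores.max()).sum())               # stable log-sum-exp
    return log_z - path_score(emissions, transitions, list(gold_tags))
```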
b) the correction module comprises a phrase retrieval module and an entity type modification module, as shown in FIG. 4;
1) the phrase retrieval module obtains the candidate entities of the main network input and screens out those that exist in the public knowledge graph; the candidate entities include single words and phrases composed of several words.
The input is a sentence and the output is the potential entity set PE of that sentence; the retrieval proceeds as follows:
i. enumerate all contiguous word combinations (phrases) in the sentence; for example, the sentence "The European Commission" yields the candidate set Pe = {The, European, Commission, The European, European Commission, The European Commission};
ii. query each candidate entity of the set Pe obtained in step i against the public knowledge graph; if the candidate entity and its corresponding entity type can be retrieved from the knowledge graph, add the candidate entity and its entity type to the potential entity set PE;
the resulting potential entity set PE is, for example, {The European Commission: Organization, …};
2) the entity type modification module receives the potential entity set PE output by the phrase retrieval module and the entity type labels output by the main network, and then compares the entity type output by the main network with the entity type recorded in PE for each potential entity appearing in the main network input; if they are consistent, no modification is needed, and if they are inconsistent, the output result of the main network is modified;
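A sketch of the correction module under the assumption that the public knowledge graph can be queried like a dictionary mapping surface forms to entity types (a real system would query DBpedia); BIO prefixes are ignored here for simplicity, and all names are illustrative.

```python
def enumerate_spans(tokens):
    """All contiguous word combinations of a sentence, e.g. ['The', 'European', 'Commission'] ->
    {'The', 'European', 'Commission', 'The European', 'European Commission', 'The European Commission'}."""
    return {" ".join(tokens[i:j]) for i in range(len(tokens)) for j in range(i + 1, len(tokens) + 1)}

def retrieve_potential_entities(tokens, knowledge_graph):
    """Keep only spans found in the knowledge graph, together with their entity type (the set PE)."""
    return {span: knowledge_graph[span] for span in enumerate_spans(tokens) if span in knowledge_graph}

def correct_labels(tokens, predicted_types, knowledge_graph):
    """Override the main network's entity type wherever it disagrees with the knowledge graph."""
    corrected = list(predicted_types)
    for span, kg_type in retrieve_potential_entities(tokens, knowledge_graph).items():
        span_tokens = span.split()
        for start in range(len(tokens) - len(span_tokens) + 1):
            if tokens[start:start + len(span_tokens)] == span_tokens:
                for k in range(start, start + len(span_tokens)):
                    if corrected[k] != kg_type:
                        corrected[k] = kg_type
    return corrected
```

For example, with knowledge_graph = {"The European Commission": "Organization"}, a token of that phrase mislabeled by the main network would be overwritten with "Organization".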
Step 8: performing named entity recognition on text with the tested named entity recognition model based on self-supervised learning (MBBCD).
The performance evaluation of the invention uses the CoNLL-2003 English public data set; the following table shows the size of the data set:
                    Number of articles    Number of sentences    Number of words
Training set               946                  14987                203621
Development set            216                   3466                 51362
Test set                   231                   3684                 46435
The data set comprises four entity types, namely place name, person name, organization name and other entities, and the entity labeling method adopts a BIO labeling method: the BIO notation specifies that all named entities begin with the B label, I denotes inside the named entity, O denotes outside the named entity, e.g., if a word in the corpus is labeled B/I-XXX, B/I denotes that the word belongs to the beginning or inside of the named entity, i.e., the word is part of the named entity, and XXX denotes the type of the named entity. The following table shows the specific distribution of the number of entities in the training set, development set, and test set in the data set:
                    Place name    Person name    Organization name    Other entities
Training set           7140           6600              6321               3438
Development set        1837           1842              1341                922
Test set               1668           1617              1661                702
In step 7, the DBpedia English knowledge graph is used to correct entity-type recognition errors caused by word abbreviations in the output result; the following table shows the entity recognition results of the invention on the test set:
(Entity recognition results on the test set: table provided as a figure in the original publication.)
In the entity recognition result table, CNN denotes character-level encoding, GloVe provides pre-trained word vectors, and the named entity recognition model based on self-supervised learning (MBBCD) is the method proposed by the invention. Precision, Recall and Micro-F1 are used as the performance evaluation metrics for entity recognition in the experiments. The labeling scheme for named entities determines both the entity boundary and the entity type, and the recognition of an entity counts as correct only when both its boundary and its type are labeled correctly. From the counts of True Positives (TP), False Positives (FP) and False Negatives (FN), the Precision, Recall and F1-score of the named entity recognition task can be computed. TP are entities whose boundary and type are both identified correctly, FP are outputs recognized as entities whose boundary or type is wrong, and FN are entities that should have been recognized but were not.
According to the definition of Precision: for a given data set, precision is the proportion of predicted positives that are correct; in the named entity recognition task it is computed as:

Precision = TP / (TP + FP)
according to the definition of Recall: recall is the proportion of actual positive cases that the classifier judges to be positive; in the named entity recognition task it is computed as:

Recall = TP / (TP + FN)
according to the definition of the micro-averaged F1 value (Micro-F1): Micro-F1 is the harmonic mean of precision and recall and is a composite metric that balances their influence; it is computed as:

Micro-F1 = 2 × Precision × Recall / (Precision + Recall)
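The three metrics follow directly from the TP, FP and FN counts; a small helper, for illustration:

```python
def ner_metrics(tp, fp, fn):
    """Entity-level precision, recall and micro-F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```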

Claims (9)

1. A named entity recognition method based on self-supervised learning, characterized by comprising the following steps:
step 1: preprocessing the data set;
1-1, composing sentences from the entity-type-annotated words and connecting words in the data set;
1-2, translating the sentence s_i of step 1-1 into a sentence a_i in an arbitrary intermediate language, and then translating a_i back into the original language to obtain the positive example sentence s_i^+;
Step 2: constructing the positive and negative example sentence pair sets from the sentences processed in step 1, wherein the positive example sentence pair set consists of the pairs (s_i, s_i^+), s_i^+ being the back-translation of s_i obtained in step 1-2, and the negative example sentence pair set consists of the pairs (s_i, s_j^+) with j ≠ i; that is, a negative example sentence pair consists of an original sentence and the back-translation of a different sentence in the corpus;
Step 3: encoding the sentences in the positive and negative example sentence pairs respectively with an embedding encoder;
Step 4: inputting the embedding-encoded word vectors into a deep neural network (DNN) layer;
Step 5: performing similarity calculation on the output vectors of the positive and negative example sentence pairs obtained in step 4 and concatenating the results row-wise into a new similarity matrix M_sim; then optimizing the parameters of the embedding encoder f_k of step 3 with a contrastive loss function l through back propagation and gradient descent;
Step 6: obtaining sentences composed of entity-type-annotated words, constructing a data set, and dividing the data set into a training set and a test set;
Step 7: building a named entity recognition model based on self-supervised learning, the model comprising a main network and a correction module cascaded in sequence; then training the main network with the training set, testing the trained main network with the test set, and finally correcting the output result of the tested main network with the correction module;
the main network comprises the embedding encoder f_k optimized in step 5, a bidirectional LSTM layer and a CRF layer;
the correction module comprises a phrase retrieval module and an entity type modification module; the phrase retrieval module obtains the candidate entities of the main network input, screens out those that exist in the public knowledge graph, and combines the retained entities with their entity types into a potential entity set PE; the potential entities comprise single words and phrases formed of several words; the entity type modification module receives the potential entity set PE output by the phrase retrieval module and the entity type labels output by the main network, and compares the entity type output by the main network with the entity type recorded in PE for each potential entity appearing in the main network input; if they are consistent, no modification is needed, and if they are inconsistent, the entity type modification module modifies the output result of the main network;
Step 8: performing named entity recognition on text with the tested named entity recognition model based on self-supervised learning.
2. The named entity recognition method based on self-supervised learning as claimed in claim 1, wherein the embedding encoders f_q and f_k have the same initialization parameters θ_q and θ_k.
3. The named entity recognition method based on self-supervised learning as recited in claim 1, wherein the step 3 specifically comprises:
inputting the sentence s_i into the embedding encoder f_q for word embedding encoding to obtain the encoded result q_i; at the same time, inputting the positive and negative example sentences s_i^+ and s_j^+ corresponding to s_i into the embedding encoder f_k for word embedding encoding to obtain the encoded results k_i^+ and k_i^-.
4. The named entity recognition method based on self-supervised learning as recited in claim 1, wherein the deep neural network layer comprises a first fully-connected layer, a Relu layer and a second fully-connected layer;
(1) First fully-connected layer: converts the embedding vectors output by the embedding encoders f_q and f_k into output vectors of the same dimensionality through a linear transformation:
o_dense1 = W·x_input + b
where o_dense1 denotes the output vector, x_input denotes the embedding vector output by the unoptimized embedding encoder, W denotes a weight matrix, and b denotes a bias vector;
(2) ReLU layer: the output vector of the first fully-connected layer is passed through a ReLU activation function, which helps keep the convergence of the model stable:
o_dense2 = max(o_dense1, 0)
where o_dense2 denotes the output vector of the ReLU layer;
(3) Second fully-connected layer: converts the output vector of the ReLU layer into an output vector whose dimensionality equals the number of predicted entity types.
5. The named entity recognition method based on self-supervised learning as recited in claim 1, wherein the specific operations in step 5 are as follows:
5-1, the output vectors of the DNN corresponding to q_i, k_i^+ and k_i^- are compared with a similarity function sim(·,·), giving the positive example similarity r^+ = sim(q_i, k_i^+) of similar sentences and the negative example similarity r^- = sim(q_i, k_i^-) of dissimilar sentences; r^+ and r^- are then aggregated row-wise to obtain the similarity matrix M_sim = [r^+; r^-];
5-2, the similarity of the positive and negative example sentence pairs in the vector representation space is measured with the following contrastive loss function l:

l = -log( exp(r^+ / τ) / sum(exp(M_sim / τ)) )

where τ is a hyper-parameter, exp(·) denotes the exponential function with natural base e, and sum(·) denotes summing the matrix elements along rows;
5-3, the parameters of the embedding encoder f_k are optimized with the contrastive loss l through back propagation and gradient descent.
6. The named entity recognition method based on self-supervised learning as claimed in claim 1, wherein the embedding encoder f_k updates θ_k by a momentum moving average, as follows:

θ_k ← m·θ_k + (1 - m)·θ_q

where m is the momentum.
7. The named entity recognition method based on self-supervised learning as claimed in claim 1, wherein the optimized embedding encoder f_k is used for encoding each word of a sentence into a word embedding vector; its input is a complete sentence and its output is the word embedding vector of each word in the sentence;
the bidirectional LSTM layer is used for learning dependency information among words; the input is a word embedding vector, and the output is a word embedding vector containing dependency information among words;
the CRF layer is used for further correcting the recognition result; the input is the output vector of the bi-directional LSTM layer and the output is the entity type label for each word.
8. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.
9. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-7.
CN202111539122.4A 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning Pending CN114239584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111539122.4A CN114239584A (en) 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111539122.4A CN114239584A (en) 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning

Publications (1)

Publication Number Publication Date
CN114239584A true CN114239584A (en) 2022-03-25

Family

ID=80756701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111539122.4A Pending CN114239584A (en) 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN114239584A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688777A (en) * 2022-09-28 2023-02-03 北京邮电大学 Named entity recognition system for nested and discontinuous entities of Chinese financial text
CN115688777B (en) * 2022-09-28 2023-05-05 北京邮电大学 Named entity recognition system for nested and discontinuous entities of Chinese financial text

Similar Documents

Publication Publication Date Title
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN108733792B (en) Entity relation extraction method
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN109726389B (en) Chinese missing pronoun completion method based on common sense and reasoning
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN109800437A (en) A kind of name entity recognition method based on Fusion Features
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN111104509A (en) Entity relation classification method based on probability distribution self-adaption
CN113948217A (en) Medical nested named entity recognition method based on local feature integration
Deng et al. Self-attention-based BiGRU and capsule network for named entity recognition
CN111753088A (en) Method for processing natural language information
CN113821635A (en) Text abstract generation method and system for financial field
Ye et al. Chinese named entity recognition based on character-word vector fusion
CN113160917B (en) Electronic medical record entity relation extraction method
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN112699685B (en) Named entity recognition method based on label-guided word fusion
CN114239584A (en) Named entity identification method based on self-supervision learning
CN115906846A (en) Document-level named entity identification method based on double-graph hierarchical feature fusion
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination