CN113190656B - Chinese named entity extraction method based on multi-annotation frame and fusion features - Google Patents
- Publication number: CN113190656B (application CN202110511025.8A)
- Authority: CN (China)
- Prior art keywords: chinese, entity, labeling, sequence, word
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G06F16/3344 — Query execution using natural language analysis
- G06F16/3346 — Query execution using probabilistic model
- G06F16/35 — Clustering; Classification
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415 — Classification techniques based on parametric or probabilistic models
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- G06F40/295 — Named entity recognition
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a Chinese named entity extraction method based on a multi-annotation framework and fused features. First, word information and segmentation marker information are introduced for each Chinese character through dictionary matching, constructing dictionary features. On this basis, the pinyin of each character is annotated with Chinese pinyin software according to the character's meaning within its matched words, constructing pinyin features. Then, based on a dot-product attention mechanism, the dictionary features and pinyin features are fused into the Chinese character encodings, yielding character semantic encodings that combine dictionary and pinyin features and improve the recognition of Chinese named entity boundaries. Finally, combining the advantages of sequence labeling and pointer labeling, a multi-task learning model jointly learns the two labeling tasks, improving the accuracy of Chinese named entity extraction.
Description
Technical Field
The invention belongs to the fields of artificial intelligence and natural language processing, and particularly relates to a Chinese named entity extraction method based on a multi-annotation framework and fused features.
Background
With the rapid development of Internet technology, data in every industry is growing explosively, promoting industrial big-data intelligent analysis and mining services and innovative applications, and further advancing the development of China's digital economy. This data contains a large amount of unstructured text, and extracting structured, effective information from unstructured text is an industry priority and a basic task in the field of natural language processing: named entity extraction.
Early research on named entity recognition mainly used dictionary- and rule-based methods, which relied on linguists and domain experts to manually construct domain dictionaries and rule templates from dataset features. An advantage of the rule-based approach is that rules can be iterated and updated continuously as needed to extract the target entities. Its drawbacks are that, for complicated domains and application scenarios, manually establishing rules is costly, and as the rule base expands, rule conflicts arise easily, making the existing rule base hard to maintain and extend and unable to adapt to changes in data and domains.
Subsequently, named entity recognition research based on statistical machine learning attracted attention. In statistical machine learning methods, named entity recognition is framed as a sequence labeling problem. The statistical machine learning methods applied to NER mainly include the maximum entropy model, hidden Markov model, maximum entropy Markov model, and conditional random field. These methods rely on manually constructed features, which is a cumbersome process.
With the development of deep learning in recent years, more and more work based on deep neural networks (Deep Neural Network, DNN) has emerged in the field of named entity recognition. DNN-based named entity recognition needs no complicated feature engineering, and its model performance far exceeds that of traditional rule-based and statistical machine learning methods.
Chinese named entity recognition is more difficult than English: Chinese lacks separators such as the space characters in English text and has no obvious morphological features, so boundary ambiguity arises easily. In addition, Chinese exhibits polysemy: the same word expresses different meanings in different domains or contexts, so context information must be fully exploited to understand word meaning. Chinese also has linguistic phenomena such as omission and abbreviation, which pose greater challenges for Chinese named entity recognition. Existing Chinese named entity extraction methods underuse word information and rely on a single annotation framework with large limitations, which affects Chinese named entity extraction precision.
Disclosure of Invention
The invention aims to: in view of the problems and shortcomings of the prior art, the invention aims to provide a Chinese named entity extraction method based on multiple annotation frameworks and fused features, so as to solve the problems that existing Chinese named entity extraction methods are limited by a single annotation framework and have difficulty identifying entity boundaries due to the lack of word information.
The technical scheme is as follows: to achieve the purpose of the invention, the technical scheme adopted by the invention is a Chinese named entity extraction method based on a multi-annotation framework and fused features, comprising the following steps:
(1) Each Chinese character in the input character sequence is matched against words in an external dictionary; the matched words are mapped to word vectors via a word vector lookup table, and the character's segmentation markers within the matched words are mapped to segmentation marker vectors via a segmentation marker vector lookup table; the segmentation marker vectors and word vectors are concatenated to form the dictionary features;
(2) The pinyin of each character is annotated according to its meaning within the matched words, and mapped via a pinyin vector lookup table to obtain the pinyin features;
(3) Based on a dot-product attention mechanism, the dictionary features and pinyin features are fused into the character encodings produced by the Chinese pre-trained language model BERT, providing subsequent steps with character semantic encodings that combine dictionary and pinyin features;
(4) The character semantic encodings are fed into two independent bidirectional long short-term memory network models for feature sequence modeling, whose respective outputs are the first feature sequence code H^a and the second feature sequence code H^b;
(5) Sequence labeling serves as an auxiliary task and pointer labeling as the main task; the first feature sequence code H^a is the input of the sequence labeling auxiliary task and the second feature sequence code H^b the input of the pointer labeling main task, and a multi-task learning model jointly learns the sequence labeling auxiliary task and the pointer labeling main task;
(6) A conditional random field computes the log-likelihood loss L_span of the sequence labeling auxiliary task; the pointer labeling main task yields the entity type classification cross-entropy loss L_start for entity-segment head characters and L_end for entity-segment tail characters; the weighted sum of L_span, L_start and L_end gives the training objective the model must minimize, and end-to-end joint training is performed; in the test stage, entity segments and their types are extracted from sentences by the pointer labeling main task.
Further, in step (1), the external dictionary and the word vector lookup table are derived from publicly released pre-trained word vectors, and the segmentation marker vector lookup table consists of one-hot vectors.
Further, in step (2), the pinyin vector lookup table is obtained by training Word2Vec on an external Chinese corpus whose text has been converted to pinyin with Chinese pinyin software.
Further, in step (5), the sequence labeling auxiliary task marks the entities in the input sentence with BMOES tags carrying no entity type and is responsible for extracting Chinese named entity fragments, which are untyped; the pointer labeling main task marks only the entity types of the head and tail characters of entity segments in the sentence and is responsible for extracting Chinese named entities, which are typed.
Further, in step (6), the test stage takes the label with the maximum value in each character's predicted entity type probability distribution as that character's predicted label, then matches each entity-segment head character with the nearest entity-segment tail character of the same entity type, and extracts the text segment between the head and tail characters as an entity.
The beneficial effects are that: the invention effectively alleviates the difficulty of identifying Chinese named entity boundaries, exploits the advantages of different annotation frameworks, and improves the accuracy of Chinese named entity extraction. First, the invention strengthens the model's recognition of entity boundaries by constructing dictionary and pinyin features, and encodes Chinese characters with the Chinese pre-trained language model BERT to provide contextual semantic support for the upper model. Second, the recursive structure of the bidirectional long short-term memory network model performs feature sequence modeling and learns sequence position information, addressing the tendency of the pre-trained language model BERT to lose sequence position information due to its lack of sequential dependency modeling. Third, sequence labeling and pointer labeling are jointly learned through a multi-task learning model; combining the advantages of the different annotation frameworks breaks the limitation of a single framework and further improves Chinese named entity extraction accuracy.
Drawings
FIG. 1 is an overall frame diagram of the method of the present invention;
FIG. 2 is an exemplary diagram of dictionary and pinyin feature construction in the method of the present invention;
FIG. 3 is an example diagram of sequence labeling in the method of the present invention;
FIG. 4 is a diagram illustrating an example of pointer labels in the method of the present invention;
FIG. 5 (a)(b) are graphs of experimental results on the effect of dictionary matching window size on accuracy, on the OntoNotes 4 dataset and the MSRA dataset, respectively, in the method of the present invention;
FIG. 6 (a)(b) are graphs of experimental results on the effect of dictionary matching window size on accuracy, on the Resume dataset and the Weibo dataset, respectively, in the method of the present invention.
Detailed Description
The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. These embodiments are to be understood as merely illustrating the invention and not limiting its scope; after reading the invention, modifications of equivalent forms by those skilled in the art fall within the scope defined by the appended claims.
The invention provides a Chinese named entity extraction method based on multiple annotation frameworks and fused features, which solves the problems that existing Chinese named entity extraction methods have difficulty identifying entity boundaries and are limited to a single annotation framework. As shown in FIG. 1, the complete flow of the invention comprises six stages: dictionary feature construction, pinyin feature construction, dictionary and pinyin feature fusion, feature sequence modeling, multi-annotation framework joint learning, and output layer modeling. The specific embodiments are described below:
The dictionary feature construction stage corresponds to step (1) of the technical scheme. The specific implementation is as follows: given any input Chinese character sequence X = {c_1, c_2, …, c_n}, c_i ∈ V, where V denotes the Chinese character vocabulary, n the sequence length, and c_i (1 ≤ i ≤ n) a Chinese character of length 1. For any character c_i in sequence X, introducing the words that contain c_i in context requires an external dictionary L_x. By setting a word matching window of size l_w, every text segment of the sentence that contains c_i and has length at most l_w is matched against the words in dictionary L_x. If a segment exists in L_x, it is regarded as a candidate word relevant to c_i's context. Since several segments of the sentence containing c_i may appear in the dictionary, this finally yields c_i's candidate matching word set ws(c_i) = {w_1, w_2, …, w_m}, where w_j (1 ≤ j ≤ m) denotes a matched word.
After the candidate matching word set ws(c_i) is obtained, further screening is needed: any word in the candidate matching word set that is a substring of another word in the set is filtered out of the set. The reasons are: 1) a complete word generally agrees better with the character's contextual information; for example, in "南京市长江大桥" (Nanjing Yangtze River Bridge), "长江大桥" (Yangtze River Bridge) is more suitable than "长江" (Yangtze River) as a candidate word for the character "长"; 2) it reduces interference when fusing the dictionary and pinyin features with the attention mechanism, so that attention more readily selects the word best fitting the character's contextual information from the candidate word list.
The word vector lookup table e_w maps the words in the filtered matching word set ws(c_i) to word vectors, giving the matched word feature encoding WE(c_i):

WE(c_i) = e_w(ws(c_i))

where e_w is derived from already-trained pre-trained word vectors and remains frozen during training. Next, the position of the Chinese character inside each matched word is marked with a segmentation marker. Let B denote that character c_i is at the beginning of the word, M that c_i is in the middle of the word, and E that c_i is at the end of the word. Different matched words imply different segmentations of the sequence, so c_i's segmentation markers within its matched words are also merged into the dictionary features, further highlighting the differences between matched words. For any word w_j in c_i's candidate matching word set ws(c_i), let seg(w_j) ∈ {B, M, E} denote c_i's segmentation marker in w_j. With START(w_j) denoting the start position index of w_j in sequence X and END(w_j) its end position index, seg(w_j) is defined as:

seg(w_j) = B if i = START(w_j); M if START(w_j) < i < END(w_j); E if i = END(w_j)
for Chinese character c i Set of candidate matching words ws (c) i ) The segs (c) can be obtained by applying the above-mentioned terms i ):
Wherein segs (c) i ) Representation c i The set of word-segmentation markers in all its matched words is looked up by word-segmentation marker vector lookup table e seg Segs (c) i ) Mapping the Chinese word segmentation markers into one-hot vector word segmentation marker codes SEGE (c) i ):
SEGE(c i )=e seg (segs(c i ))
Each dimension of the one-hot vector corresponds to each bit element in the set { B, }. Wherein [1, 0] corresponds to B, [0,1,0] corresponds to M, [0, 1] corresponds to E.
The segmentation marker encoding SEGE(c_i) of character c_i within its matched words is concatenated with the matched word feature encoding WE(c_i) along the encoding dimension to obtain c_i's final dictionary feature encoding LE(c_i):

LE(c_i) = [SEGE(c_i); WE(c_i)]
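The dictionary matching, substring screening, and segmentation marking described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the toy lexicon, the window size, and the rule of skipping single-character segments are assumptions made for the example.

```python
# Illustrative sketch of the dictionary feature construction stage (step 1).
# The lexicon, window size, and single-character skip rule are assumptions.

LEXICON = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}

def candidate_words(sentence, i, window=4, lexicon=LEXICON):
    """Collect segments containing position i with length <= window that
    appear in the lexicon, then drop words that are substrings of another
    candidate (the screening rule above)."""
    n = len(sentence)
    cands = set()
    for start in range(max(0, i - window + 1), i + 1):
        for end in range(i + 1, min(n, start + window) + 1):
            seg = sentence[start:end]
            if len(seg) >= 2 and seg in lexicon:
                cands.add((start, end, seg))
    # substring filtering: keep only maximal words
    kept = [c for c in cands
            if not any(c != o and c[2] in o[2] for o in cands)]
    return sorted(kept)

def seg_marker(i, start, end):
    """B/M/E segmentation marker of character i inside word [start, end)."""
    if i == start:
        return "B"
    if i == end - 1:
        return "E"
    return "M"

sent = "南京市长江大桥"
for start, end, w in candidate_words(sent, 3):   # character "长" at index 3
    print(w, seg_marker(3, start, end))          # → 市长 E / 长江大桥 B
```

Note that "长江" is matched but filtered out, since it is a substring of "长江大桥", consistent with the example of FIG. 2.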
The pinyin feature construction stage corresponds to step (2) of the technical scheme. The specific implementation is as follows: counting the neutral tone, pinyin has 5 tones in total, e.g. "chāng", "cháng", "chǎng", "chàng", "chang". When extracting entities from the sentence "南京市长江大桥": if "长" in the sentence is pronounced "cháng", the sentence segments as "南京市 | 长江大桥" and "长江大桥" (Yangtze River Bridge) is extracted as a place name entity; if "长" is pronounced "zhǎng", the sentence segments as "南京市长 | 江大桥" and "江大桥" (Jiang Daqiao) is extracted as a person name entity. This illustrates how the pinyin of the Chinese characters in a sentence affects entity extraction accuracy.
For any Chinese character c_i in the input sequence X, after obtaining the candidate word set ws(c_i), Chinese pinyin software (e.g. pypinyin) annotates the pinyin of c_i according to c_i's meaning within each matched word, giving the pinyin set pys(c_i) corresponding to the word set ws(c_i). The pinyin vector lookup table e_py then maps the pinyin in pys(c_i) to pinyin vectors, producing the pinyin feature encoding PYE(c_i):

PYE(c_i) = e_py(pys(c_i))

The pinyin vector lookup table e_py is obtained by converting an external Chinese corpus (e.g. the Chinese Wikipedia corpus) to pinyin with Chinese pinyin software and then training with the Skip-gram method of Word2Vec. Because the external Chinese corpus may contain digits, English, or other symbols without pinyin, in the data preprocessing stage before vector training the invention converts English to "[ENG]", digits to "[DIGIT]", and other characters without pinyin uniformly to "[UNK]".
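The corpus preprocessing just described can be sketched as below. Only the three placeholder symbols come from the text; the token-splitting regex is an assumption made for illustration, and the actual pinyin conversion step (e.g. with pypinyin) is omitted.

```python
import re

# Sketch of the preprocessing before pinyin-vector training: tokens without
# pinyin are replaced by placeholders. The tokenization regex is an assumed
# detail; the patent only specifies the three placeholder symbols.

def normalize_token(tok):
    if re.fullmatch(r"[A-Za-z]+", tok):
        return "[ENG]"
    if re.fullmatch(r"[0-9]+", tok):
        return "[DIGIT]"
    if re.fullmatch(r"[\u4e00-\u9fff]", tok):
        return tok            # a Chinese character: left for pinyin conversion
    return "[UNK]"

def preprocess(text):
    # split into letter runs, digit runs, single Chinese characters, or symbols
    tokens = re.findall(r"[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]|\S", text)
    return [normalize_token(t) for t in tokens]

print(preprocess("长江大桥建于1968年, 全长6772m!"))
```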
An example diagram of dictionary and pinyin feature construction is shown in FIG. 2. The figure shows the matching results of "市" and "长", where w_{i,j} denotes the word formed by the sequence segment {c_i, c_{i+1}, …, c_j}. Note that "长江" (Yangtze River) does not appear in the matching result of "长", because "长江" is a substring of "长江大桥" (Yangtze River Bridge) and is filtered out.
The dictionary and pinyin feature fusion stage corresponds to step (3) of the technical scheme. The specific implementation is as follows: to avoid model overfitting caused by the small scale of entity extraction annotation datasets in some vertical domains, the invention uses the Chinese pre-trained language model BERT to provide semantic support and improve model generalization. The input sequence X = {c_1, c_2, …, c_n} is fed into the Chinese pre-trained language model BERT, and the last layer's output is taken as the sequence encoding X_h = [x_1, x_2, …, x_n], where x_i ∈ R^{d_x} is a real column vector of dimension d_x, X_h ∈ R^{d_x×n}, d_x denotes the BERT encoding dimension, and R the real numbers. The dictionary and pinyin features of character c_i constructed above are concatenated along the encoding dimension to obtain the fused feature LPE(c_i):
LPE(c i )=[LE(c i );PYE(c i )]
Suppose the word vector lookup table e_w has encoding dimension d_w, the pinyin vector lookup table e_py has encoding dimension d_py, and the candidate matching word set ws(c_i) has size m, so that LPE(c_i) ∈ R^{(3+d_w+d_py)×m}. Based on a dot-product attention mechanism, LPE(c_i) is fused into the character encoding x_i, where x_i plays the role of the query in the attention mechanism and LPE(c_i) that of the key and value. First, LPE(c_i) is linearly mapped to LPE_ikv, whose encoding dimension is consistent with x_i:

LPE_ikv = W_l · LPE(c_i) + b_l

with training parameters W_l ∈ R^{d_x×(3+d_w+d_py)}, b_l ∈ R^{d_x}, and mapped fused feature LPE_ikv ∈ R^{d_x×m}. Let unsqueeze(M, y) denote expanding the y-th dimension of matrix M and squeeze(M, y) compressing its y-th dimension; unsqueeze(x_i, 0) converts x_i from R^{d_x} to R^{1×d_x}. The attention weights LPE_iw are then computed:
LPE_iw = softmax(unsqueeze(x_i, 0) · LPE_ikv)
where the attention weights LPE_iw ∈ R^{1×m} sum to 1 after the softmax. Next, the attention output LPE_io is computed as the weighted sum of LPE_ikv under the attention weights LPE_iw:

LPE_io = squeeze(LPE_ikv · LPE_iw^T, 1)
where the attention output LPE_io ∈ R^{d_x}. Finally, LPE_io is added to the character encoding x_i as c_i's final semantic encoding, expressed as:
x i =LPE io +x i
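The dot-product attention fusion above can be sketched numerically as follows. Shapes follow the text; the random inputs and parameter initialization are placeholders standing in for BERT outputs and trained weights.

```python
import numpy as np

# Minimal numpy sketch of the dot-product attention fusion (step 3).
# d_x is the BERT dim, m the candidate word count; values are random stand-ins.

rng = np.random.default_rng(0)
d_x, d_w, d_py, m = 8, 6, 4, 3
d_lpe = 3 + d_w + d_py                  # seg one-hot + word + pinyin dims

x_i = rng.standard_normal(d_x)          # BERT encoding of character c_i (query)
LPE = rng.standard_normal((d_lpe, m))   # fused dictionary/pinyin features

W_l = rng.standard_normal((d_x, d_lpe))
b_l = rng.standard_normal(d_x)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

LPE_ikv = W_l @ LPE + b_l[:, None]            # (d_x, m): keys and values
LPE_iw = softmax(x_i[None, :] @ LPE_ikv)      # (1, m) attention weights
LPE_io = (LPE_ikv @ LPE_iw.T).squeeze(1)      # (d_x,) attention output
x_i_fused = x_i + LPE_io                      # final semantic encoding

print(LPE_iw.sum(), x_i_fused.shape)
```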
The feature sequence modeling stage corresponds to step (4) of the technical scheme. The specific implementation is as follows: the self-attention mechanism of the Transformer cannot itself capture sequence position information; the pre-trained language model BERT mitigates this by adding trainable absolute position encodings to the input, but still lacks modeling of sequential dependency. The Long Short-Term Memory (LSTM) network model needs no position encoding: the LSTM structure, encoding recursively in sequence order, has the ability to learn sequence position information. The Chinese character semantic sequence encoding obtained in the previous stage by fusing dictionary and pinyin features is fed into two bidirectional Long Short-Term Memory network models (Bidirectional Long Short-Term Memory, BiLSTM) for feature sequence modeling: one BiLSTM's output serves the sequence-labeling-based Chinese named entity fragment extraction auxiliary task of step (5), and the other serves the pointer-labeling-based Chinese named entity extraction main task of step (5). A BiLSTM consists of a forward and a backward LSTM; the BiLSTMs of the two tasks are independent and share no training parameters.
Assume that at time step t, the forward LSTM hidden state output of the sequence-labeling-based Chinese named entity fragment extraction auxiliary task is h_t^{a→} and the backward LSTM hidden state output is h_t^{a←}; adding them gives the auxiliary task's BiLSTM hidden state output at time step t:

h_t^a = h_t^{a→} + h_t^{a←}

Likewise, the forward LSTM hidden state output of the pointer-labeling-based Chinese named entity extraction main task is h_t^{b→} and the backward LSTM hidden state output is h_t^{b←}; adding them gives the main task's BiLSTM hidden state output at time step t:

h_t^b = h_t^{b→} + h_t^{b←}

Finally, the feature sequence modeling output of the sequence labeling auxiliary task is H^a = [h_1^a, h_2^a, …, h_n^a] ∈ R^{d_h×n}, and that of the pointer labeling main task is H^b = [h_1^b, h_2^b, …, h_n^b] ∈ R^{d_h×n}, where d_h denotes the LSTM encoding dimension.
The multi-annotation framework joint learning stage corresponds to step (5) of the technical scheme. The specific implementation is as follows: sequence labeling and pointer labeling are two common annotation frameworks applied to named entity extraction. Sequence labeling marks each Chinese character's position within an entity in the text sequence; FIG. 3 shows an example of labeling a text sequence with BMOES, where B marks a character at the start of a named entity segment, M a character in the middle of a named entity segment, O a character outside any named entity segment, E a character at the end of a named entity segment, and S a character that is itself a named entity segment. The example sentence contains two entities, "南京市" (Nanjing) and "长江大桥" (Yangtze River Bridge). Pointer labeling marks the entity types of the head and tail characters of each entity segment in the text sequence, as shown in FIG. 4, where both "南京市" and "长江大桥" are location (Loc) entities.
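The two annotation frameworks can be illustrated side by side on the example sentence. This is a hand-written toy demonstration; the spans and the "Loc" type follow the figures, while the function names are assumptions.

```python
# Toy illustration of the two annotation frameworks on the example sentence.

def bmoes_tags(n, spans):
    """BMOES sequence labels; spans: list of (start, end), end exclusive."""
    tags = ["O"] * n
    for s, e in spans:
        if e - s == 1:
            tags[s] = "S"
        else:
            tags[s] = "B"
            tags[e - 1] = "E"
            for k in range(s + 1, e - 1):
                tags[k] = "M"
    return tags

def pointer_tags(n, typed_spans):
    """Pointer labels: entity type on head and tail characters only."""
    start = ["O"] * n
    end = ["O"] * n
    for s, e, t in typed_spans:
        start[s] = t
        end[e - 1] = t
    return start, end

sent = "南京市长江大桥"
print(bmoes_tags(len(sent), [(0, 3), (3, 7)]))
start, end = pointer_tags(len(sent), [(0, 3, "Loc"), (3, 7, "Loc")])
print(start, end)
```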
Sequence labeling models the dependency of the whole sequence, so the extracted entities are more complete and precision is generally higher; pointer labeling classifies the entity types of the head and tail characters of entity segments, so it resists noise interference better, is more robust, and generally has higher recall. To combine the advantages of the different annotation frameworks, H^a is used as the input of the sequence labeling auxiliary task and H^b as the input of the pointer labeling main task, and a multi-task learning model, such as the Multi-gate Mixture-of-Experts (MMoE) model or the Progressive Layered Extraction (PLE) model, jointly learns the sequence-labeling-based Chinese named entity fragment extraction auxiliary task and the pointer-labeling-based Chinese named entity extraction main task, producing the sequence labeling auxiliary task output X^a and the pointer labeling main task output X^b.
The output layer modeling stage corresponds to step (6) of the technical scheme. The specific implementation is as follows: a Dropout layer is applied to the X^a and X^b obtained in the previous step to prevent model overfitting. X^a after Dropout is then fed into a conditional random field (Conditional Random Field, CRF), and the likelihood p(y|X) of a BMOES tag index sequence y ∈ Z^n for the sequence-labeling-based Chinese named entity fragment extraction auxiliary task is computed:

p(y|X) = exp( Σ_{t=1}^n ( W_CRF^{y_t} · x_t^a + b_CRF^{(y_{t-1}, y_t)} ) ) / Σ_{y'∈Y_X} exp( Σ_{t=1}^n ( W_CRF^{y'_t} · x_t^a + b_CRF^{(y'_{t-1}, y'_t)} ) )

where Y_X denotes the set of all possible BMOES tag index sequences of X under this task and y' ∈ Z^n is any such sequence. The training parameters are W_CRF ∈ R^{5×d_h} and b_CRF ∈ R^{5×5} (the BMOES sequence labeling method has 5 tags); W_CRF^{y_t} denotes the parameters of W_CRF corresponding to tag y_t, and b_CRF^{(y_{t-1}, y_t)} the parameters of b_CRF for the transition from tag y_{t-1} to tag y_t; the primed quantities are defined likewise. Let the true BMOES tag index sequence of the sequence labeling auxiliary task be y^span ∈ Z^n, Z denoting the integers; substituting it into the above yields the auxiliary task's log-likelihood loss:

L_span = −log p(y^span | X)
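The CRF likelihood above can be checked by brute force on a tiny example: enumerate all 5^n BMOES tag sequences and verify that the normalized scores sum to 1. The random scores below are stand-ins for the trained W_CRF · x_t^a emissions and b_CRF transitions.

```python
import numpy as np
from itertools import product

# Brute-force check of the CRF likelihood: enumerate all 5^n tag sequences
# and verify the probabilities sum to 1. Scores are random stand-ins for
# W_CRF^{y_t} · x_t^a (emissions) and b_CRF^{(y_{t-1}, y_t)} (transitions).

rng = np.random.default_rng(0)
n, n_tags = 3, 5
emit = rng.standard_normal((n, n_tags))        # emission score per position/tag
trans = rng.standard_normal((n_tags, n_tags))  # transition score per tag pair

def score(y):
    s = sum(emit[t, y[t]] for t in range(n))
    s += sum(trans[y[t - 1], y[t]] for t in range(1, n))
    return np.exp(s)

Z = sum(score(y) for y in product(range(n_tags), repeat=n))
p = score((0, 1, 3)) / Z                       # likelihood of one tag sequence
total = sum(score(y) / Z for y in product(range(n_tags), repeat=n))
print(round(float(total), 6))                  # → 1.0
```

In practice the partition function Z is computed with the forward algorithm rather than enumeration; the enumeration here only serves to make the normalization in the formula concrete.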
Next, X after Dropout is calculated b Linear mapping to Chinese naming entity based on pointer mark to extract the label space of main task, adding one layer of softmax to calculate the probability distribution p of each Chinese character on each label start And p is as follows end :
where the training parameters are $W_{start}, W_{end} \in \mathbb{R}^{d_h \times (c_e+1)}$ and $b_{start}, b_{end} \in \mathbb{R}^{c_e+1}$; $c_e + 1$ is the number of entity types $c_e$ plus the one non-entity type. $p^{start}$ is the predicted probability distribution of the entity type of entity fragment head Chinese characters, and $p^{end}$ is the predicted probability distribution of the entity type of entity fragment tail Chinese characters. Let the true entity type label index sequence of entity fragment head Chinese characters be $y^{start} \in \mathbb{Z}^n$ and that of entity fragment tail Chinese characters be $y^{end} \in \mathbb{Z}^n$; the Cross Entropy (CE) losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ of the pointer labeling main task are computed as:

$$\mathcal{L}_{start} = -\frac{1}{n}\sum_{i=1}^{n} \log p^{start}_{i,\, y^{start}_i}, \qquad \mathcal{L}_{end} = -\frac{1}{n}\sum_{i=1}^{n} \log p^{end}_{i,\, y^{end}_i}$$
where $y^{start}_i$ denotes the true entity type label index of the $i$-th Chinese character, $p^{start}_{i,\, y^{start}_i}$ denotes the probability in $p^{start}$ that the $i$-th Chinese character is predicted as its true entity type $y^{start}_i$, and $y^{end}_i$ and $p^{end}_{i,\, y^{end}_i}$ are defined likewise.
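A minimal numpy sketch of the pointer labeling head described above (toy shapes, bias terms omitted; the function and all names are our illustration, not the patent's code):

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pointer_losses(x_b, w_start, w_end, y_start, y_end):
    """Pointer labeling head: per-character entity-type distributions + CE losses.

    x_b:             (n, d) main-task features after Dropout
    w_start, w_end:  (d, c_e + 1) linear maps (c_e entity types + 1 non-entity)
    y_start, y_end:  length-n gold head / tail label index sequences
    """
    p_start = softmax(x_b @ w_start)  # (n, c_e + 1)
    p_end = softmax(x_b @ w_end)
    n = x_b.shape[0]
    # Cross entropy: negative mean log-probability of the gold label.
    loss_start = -np.log(p_start[np.arange(n), y_start]).mean()
    loss_end = -np.log(p_end[np.arange(n), y_end]).mean()
    return p_start, p_end, loss_start, loss_end

# Toy demo: 6 characters, hidden dim 8, 3 entity types.
rng = np.random.default_rng(2)
n, d, c_e = 6, 8, 3
x_b = rng.normal(size=(n, d))
w_s = rng.normal(size=(d, c_e + 1))
w_e = rng.normal(size=(d, c_e + 1))
y_s = rng.integers(0, c_e + 1, size=n)
y_e = rng.integers(0, c_e + 1, size=n)
p_s, p_e, l_s, l_e = pointer_losses(x_b, w_s, w_e, y_s, y_e)
```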
Finally, with the sequence labeling auxiliary task loss $\mathcal{L}_{span}$ and the pointer labeling main task losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ obtained, the 3 losses are fused into the overall training objective $\mathcal{L}$ that the model must minimize, and end-to-end joint training is performed:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{span} + \lambda_2 \mathcal{L}_{start} + \lambda_3 \mathcal{L}_{end}$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyperparameters controlling the influence of each task on the overall training objective. In the test phase, the indices $\hat{y}^{start}_i$ and $\hat{y}^{end}_i$ corresponding to the maximum of each Chinese character's label prediction probability distribution in $p^{start}$ and $p^{end}$ are taken as the predicted label indices:

$$\hat{y}^{start}_i = \arg\max_{c}\, p^{start}_{i,c}, \qquad \hat{y}^{end}_i = \arg\max_{c}\, p^{end}_{i,c}$$
Then head and tail Chinese characters of entity fragments that have the same entity type and are nearest to each other in position are matched, and the entities in the sequence are extracted.
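The test-phase decoding just described (argmax labels, then nearest-position matching of heads and tails with the same entity type) can be sketched in a few lines of plain Python (our own illustrative helper, not the patent's code):

```python
def decode_entities(start_labels, end_labels, non_entity=0):
    """Match each head (start) character to the nearest following tail (end)
    character carrying the same entity type, and emit (start, end, type) spans.

    start_labels / end_labels: per-character argmax label indices, where
    `non_entity` marks characters that are not an entity head / tail.
    """
    entities = []
    used_ends = set()
    for i, t in enumerate(start_labels):
        if t == non_entity:
            continue
        # Nearest tail at or after the head, with the same entity type,
        # not already consumed by an earlier head.
        for j in range(i, len(end_labels)):
            if j not in used_ends and end_labels[j] == t:
                entities.append((i, j, t))
                used_ends.add(j)
                break
    return entities

# Toy example: type-1 entity over characters 0..2, type-2 entity over 3..4.
spans = decode_entities([1, 0, 0, 2, 0], [0, 0, 1, 0, 2])
```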
The invention provides a Chinese named entity extraction method based on a multi-labeling framework and fusion features. To test the effectiveness of the method, it is evaluated in terms of precision (P), recall (R) and F1 on the Ontonotes4, MSRA, Resume and Weibo datasets respectively, and compared with other Chinese named entity extraction methods.
The model optimizer uses adaptive moment estimation (Adaptive momentum estimation, Adam); the learning rate of the BERT training parameters is set to 3e-5 and that of the other model parameters to 1e-3; the BERT encoding dimension is $d_x = 768$. The multi-task learning model uses the progressive layered extraction model PLE, with the number of task-specific and shared experts uniformly set to 2, each expert being a single-layer fully connected network, and the number of PLE layers set to 2. The number of BiLSTM layers is set to 1 with encoding dimension $d_h = 768$, the word vector encoding dimension is $d_w = 50$, the pinyin vector encoding dimension is $d_{py} = 50$, and the loss weights are $\lambda_1$, $\lambda_2$, $\lambda_3$.
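For readability, the hyperparameters listed above can be collected into a single configuration sketch (the key names are ours, not the patent's; the values follow the paragraph above):

```python
# Hyperparameters from the experiment description, as a plain dict.
# Key names are illustrative only.
config = {
    "optimizer": "Adam",       # adaptive moment estimation
    "lr_bert": 3e-5,           # learning rate for BERT parameters
    "lr_other": 1e-3,          # learning rate for all other parameters
    "d_x": 768,                # BERT encoding dimension
    "multitask_model": "PLE",  # progressive layered extraction
    "experts_per_task": 2,     # task-specific experts (shared experts: also 2)
    "ple_layers": 2,
    "bilstm_layers": 1,
    "d_h": 768,                # BiLSTM encoding dimension
    "d_w": 50,                 # word vector encoding dimension
    "d_py": 50,                # pinyin vector encoding dimension
}
```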
Table 1 shows the accuracy comparison results of different Chinese named entity extraction methods on the Ontonotes4 dataset; Table 2 shows the accuracy comparison results on the MSRA dataset; Table 3 shows the accuracy comparison results on the Resume dataset; Table 4 shows the accuracy comparison results on the Weibo dataset. As the experimental results in the tables show, compared with other Chinese named entity extraction methods, the method provided by the invention achieves the best Chinese named entity extraction accuracy on most datasets and index items. Fig. 5 (a) and (b) show the experimental results of the influence of the dictionary matching window size on the Ontonotes4 and MSRA datasets, and Fig. 6 (a) and (b) show the corresponding results on the Resume and Weibo datasets. By evaluating and analyzing the influence of the dictionary matching window size on Chinese named entity extraction accuracy, guiding suggestions are provided for choosing the window size in different subsequent application scenarios.
Table 1 Accuracy comparison of different entity extraction methods on the Ontonotes4 dataset
Table 2 Accuracy comparison of different entity extraction methods on the MSRA dataset
Table 3 Accuracy comparison of different entity extraction methods on the Resume dataset
Table 4 Accuracy comparison of different entity extraction methods on the Weibo dataset
Claims (5)
1. A Chinese named entity extraction method based on a multi-annotation frame and fusion features comprises the following steps:
(1) Carrying out word matching on each Chinese character in the input Chinese character sequence in an external dictionary, mapping the words into word vectors by utilizing a word vector lookup table, and mapping word segmentation marks of the Chinese characters in the words into word mark vectors by utilizing a word segmentation mark vector lookup table, wherein the word segmentation mark vectors and the word vectors are spliced to form dictionary features;
(2) Pinyin is annotated for each Chinese character according to the character's meaning within the matched words, and pinyin features are obtained by mapping the pinyin through a pinyin vector lookup table;
(3) Based on a dot-product attention mechanism, the dictionary features and the pinyin features are fused into the Chinese character encodings obtained from the Chinese pre-training language model BERT, providing Chinese character semantic encodings combining the dictionary features and the pinyin features for the subsequent steps;
(4) The Chinese character semantic encodings are respectively input into two independent bidirectional long short-term memory network models for feature sequence modeling, yielding the first feature sequence encoding and the second feature sequence encoding respectively;
(5) Sequence labeling is used as an auxiliary task and pointer labeling as a main task; the first feature sequence encoding is used as the input of the sequence labeling auxiliary task and the second feature sequence encoding as the input of the pointer labeling main task, and a multi-task learning model is used to jointly learn the sequence labeling auxiliary task and the pointer labeling main task;
(6) The log-likelihood loss of the sequence labeling auxiliary task is computed in the conditional random field, together with the entity type classification cross-entropy loss of entity fragment head Chinese characters in the pointer labeling main task and the entity type classification cross-entropy loss of entity fragment tail Chinese characters in the pointer labeling main task; the weighted sum of the three losses gives the training objective the model must minimize, end-to-end joint training is performed, and in the test phase the entity fragments and their types in the sentence are extracted by the pointer labeling main task.
2. The method for extracting Chinese named entities based on multiple labeling frames and fusion features according to claim 1, wherein in the step (1), the external dictionary and the word vector lookup table are derived from pre-trained word vectors disclosed on the internet, and the word segmentation marker vector lookup table is composed of one-hot vectors.
3. The method for extracting Chinese named entities based on multiple labeling frames and fusion features according to claim 1, wherein in the step (2), a pinyin vector lookup table is obtained by word2vec training based on an external Chinese corpus, and the text in the external Chinese corpus is converted into pinyin by using Chinese pinyin software.
4. The method for extracting Chinese named entity based on multi-label framework and fusion features according to claim 1, wherein in the step (5), the sequence labeling auxiliary task marks the entity in the input sentence by using BMOES without entity type, and is responsible for extracting Chinese named entity fragments, wherein the extracted entity fragments are without type; the pointer marking main task only marks the entity types of the head Chinese characters and the tail Chinese characters of the entity fragments in the sentences, and is responsible for extracting the Chinese named entities, and the extracted entities have types.
5. The method for extracting Chinese named entities based on a multi-labeling framework and fusion features according to claim 1, wherein in the step (6), in the test phase the label corresponding to the maximum of each Chinese character's entity type prediction probability distribution is taken as the character's predicted label; then each entity fragment head Chinese character is matched with the nearest entity fragment tail Chinese character of the same entity type, and the text fragment between the head and tail Chinese characters is extracted as an entity.
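The dot-product attention fusion of step (3) in claim 1 can be illustrated with a minimal numpy sketch (toy shapes; the function and all names are ours, and a real implementation would add learned projections for the keys and values):

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_features(char_enc, feats):
    """Dot-product attention fusion of extra features into character encodings.

    char_enc: (n, d) BERT character encodings, used as queries
    feats:    (n, m, d) candidate feature vectors per character, e.g. projected
              dictionary and pinyin features, used as keys and values
    Returns (n, d): char_enc plus an attention-weighted sum of its features.
    """
    d = char_enc.shape[-1]
    # One dot-product attention score per (character, feature) pair.
    scores = np.einsum("nd,nmd->nm", char_enc, feats) / np.sqrt(d)
    weights = softmax(scores)                         # (n, m), rows sum to 1
    fused = np.einsum("nm,nmd->nd", weights, feats)   # weighted feature sum
    return char_enc + fused

# Toy demo: 5 characters, 2 features each (dictionary + pinyin), dim 16.
rng = np.random.default_rng(3)
n, m, d = 5, 2, 16
chars = rng.normal(size=(n, d))
feats = rng.normal(size=(n, m, d))
out = fuse_features(chars, feats)
```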
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110511025.8A CN113190656B (en) | 2021-05-11 | 2021-05-11 | Chinese named entity extraction method based on multi-annotation frame and fusion features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113190656A CN113190656A (en) | 2021-07-30 |
CN113190656B true CN113190656B (en) | 2023-07-14 |
Family
ID=76981067
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114139541B (en) * | 2021-11-22 | 2022-08-02 | 北京中科闻歌科技股份有限公司 | Named entity identification method, device, equipment and medium |
CN114036933B (en) * | 2022-01-10 | 2022-04-22 | 湖南工商大学 | Information extraction method based on legal documents |
CN115146644B (en) * | 2022-09-01 | 2022-11-22 | 北京航空航天大学 | Alarm situation text-oriented multi-feature fusion named entity identification method |
CN115470871B (en) * | 2022-11-02 | 2023-02-17 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10032451B1 (en) * | 2016-12-20 | 2018-07-24 | Amazon Technologies, Inc. | User recognition for speech processing systems |
US10140973B1 (en) * | 2016-09-15 | 2018-11-27 | Amazon Technologies, Inc. | Text-to-speech processing using previously speech processed data |
CN109446521A (en) * | 2018-10-18 | 2019-03-08 | 京东方科技集团股份有限公司 | Name entity recognition method, device, electronic equipment, machine readable storage medium |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111462752A (en) * | 2020-04-01 | 2020-07-28 | 北京思特奇信息技术股份有限公司 | Client intention identification method based on attention mechanism, feature embedding and BI-LSTM
CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on Lattice-LSTM
Non-Patent Citations (3)
Title |
---|
Normalizing Text using Language Modelling based on Phonetics and String Similarity; Fenil Doshi et al.; ArXiv; pp. 1-9 *
Phonetic-enriched Text Representation for Chinese Sentiment Analysis with Reinforcement Learning; H. Peng et al.; IEEE Transactions on Affective Computing; pp. 1-16 *
Research and Application of Key Technologies for Named Entity Recognition in Electronic Medical Records Based on Deep Neural Networks; Jiang Tao; China Master's Theses Full-text Database (Medicine and Health Sciences), No. 7; E053-210 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |