CN113190656B - Chinese named entity extraction method based on multi-annotation frame and fusion features - Google Patents

Chinese named entity extraction method based on multi-annotation framework and fusion features

Info

Publication number
CN113190656B
CN113190656B (application CN202110511025.8A)
Authority
CN
China
Prior art keywords
chinese
entity
labeling
sequence
word
Prior art date
Legal status
Active
Application number
CN202110511025.8A
Other languages
Chinese (zh)
Other versions
CN113190656A (en)
Inventor
麦丞程
刘健
黄宜华
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110511025.8A priority Critical patent/CN113190656B/en
Publication of CN113190656A publication Critical patent/CN113190656A/en
Application granted granted Critical
Publication of CN113190656B publication Critical patent/CN113190656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3344: Query execution using natural language analysis
    • G06F16/3346: Query execution using probabilistic model
    • G06F16/35: Clustering; Classification (information retrieval of unstructured textual data)
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06F40/216: Parsing using statistical methods
    • G06F40/242: Dictionaries (lexical tools)
    • G06F40/295: Named entity recognition
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Learning methods (neural networks)
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese named entity extraction method based on a multi-annotation framework and fused features. First, word information and word segmentation marker information are introduced for each Chinese character through dictionary matching, and dictionary features are constructed. On this basis, pinyin annotations are produced with Chinese pinyin software according to the meaning of each Chinese character in its matched words, and pinyin features are constructed. Then, based on a dot-product attention mechanism, the dictionary features and pinyin features are fused into the Chinese character encodings, yielding Chinese character semantic encodings that combine dictionary and pinyin features and improving the recognition of Chinese named entity boundaries. Finally, combining the advantages of sequence labeling and pointer labeling, a multi-task learning model jointly learns the two labeling tasks, improving the accuracy of Chinese named entity extraction.

Description

Chinese named entity extraction method based on a multi-annotation framework and fusion features
Technical Field
The invention belongs to the field of artificial intelligence and natural language processing, and particularly relates to a Chinese named entity extraction method based on a multi-annotation framework and fused features.
Background
With the rapid development of Internet technology, data in every industry is growing explosively, which has promoted industrial big data analysis and mining services and innovative applications, and further advanced the development of the digital economy in China. This data contains a large amount of unstructured text, and extracting structured, useful information from unstructured text is both an important concern in industry and a basic task in the field of natural language processing: named entity extraction.
Early research on named entity recognition mainly used dictionary- and rule-based methods, which relied on linguists and domain experts to manually construct domain dictionaries and rule templates from dataset characteristics. An advantage of the rule-based approach is that the rules can be updated iteratively as needed to extract the target entities. Its drawback is that, for complex domains and application scenarios, manually building rules is expensive, and as the rule base grows, rule conflicts arise easily, making the existing rule base hard to maintain and extend and unable to adapt to changes in data and domain.
Subsequently, named entity recognition research based on statistical machine learning attracted attention. In statistical machine learning methods, named entity recognition is framed as a sequence labeling problem. The statistical machine learning methods applied to NER mainly include the maximum entropy model, the hidden Markov model, the maximum entropy Markov model, and conditional random fields. These methods rely on manually constructed features, and the process is cumbersome.
With the development of deep learning in recent years, more and more work based on deep neural networks (Deep Neural Network, DNN) has emerged in the field of named entity recognition. DNN-based named entity recognition methods require no elaborate feature engineering, and their performance far exceeds that of traditional rule-based and statistical machine learning methods.
Chinese named entity recognition is harder than English: Chinese text lacks delimiters such as the space characters of English text, has no obvious morphological cues, and therefore easily suffers from boundary ambiguity. In addition, Chinese is polysemous; the same word expresses different meanings in different domains or contexts, so context information must be fully exploited to understand word meaning. Chinese also exhibits linguistic phenomena such as ellipsis and abbreviation, which pose further challenges for Chinese named entity recognition. Existing Chinese named entity extraction methods make little use of word information and rely on a single labeling framework with large limitations, which hurts the extraction accuracy of Chinese named entities.
Disclosure of Invention
Purpose of the invention: in view of the problems and shortcomings of the prior art, the invention aims to provide a Chinese named entity extraction method based on multiple labeling frameworks and fused features, so as to solve the problems that existing Chinese named entity extraction methods are restricted by a single labeling framework and, lacking the use of word information, have difficulty identifying entity boundaries.
Technical scheme: to achieve the above purpose, the technical scheme adopted by the invention is a Chinese named entity extraction method based on a multi-annotation framework and fused features, comprising the following steps:
(1) Each Chinese character in the input Chinese character sequence is matched against words in an external dictionary; the matched words are mapped into word vectors by a word vector lookup table, and the word segmentation markers of the Chinese character within the matched words are mapped into word segmentation marker vectors by a word segmentation marker vector lookup table; the word segmentation marker vectors and word vectors are concatenated to form the dictionary features;
(2) Pinyin annotations are produced for each Chinese character according to its meaning in the matched words, and the pinyin is mapped into pinyin features by a pinyin vector lookup table;
(3) Based on a dot-product attention mechanism, the dictionary features and pinyin features are fused into the Chinese character encodings produced by the Chinese pre-trained language model BERT, providing the subsequent steps with Chinese character semantic encodings that combine dictionary and pinyin features;
(4) The Chinese character semantic encodings are input into two independent bidirectional long short-term memory network models for feature sequence modeling, whose outputs are the first feature sequence encoding $X^a$ and the second feature sequence encoding $X^b$, respectively;
(5) Sequence labeling serves as the auxiliary task and pointer labeling as the main task; the first feature sequence encoding $X^a$ is taken as the input of the sequence labeling auxiliary task and the second feature sequence encoding $X^b$ as the input of the pointer labeling main task, and a multi-task learning model jointly learns the sequence labeling auxiliary task and the pointer labeling main task;
(6) A conditional random field computes the log-likelihood loss $\mathcal{L}_{span}$ of the sequence labeling auxiliary task; the pointer labeling main task computes the entity type classification cross-entropy loss $\mathcal{L}_{start}$ of entity segment head characters and the entity type classification cross-entropy loss $\mathcal{L}_{end}$ of entity segment tail characters; a weighted sum of $\mathcal{L}_{span}$, $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ gives the training objective the model must minimize, end-to-end joint training is performed, and in the test stage the entity segments and their types are extracted from sentences by the pointer labeling main task.
Further, in step (1), the external dictionary and the word vector lookup table are derived from pre-trained word vectors publicly available on the Internet, and the word segmentation marker vector lookup table consists of one-hot vectors.
Further, in step (2), the pinyin vector lookup table is obtained by word2vec training on an external Chinese corpus whose text has been converted into pinyin with Chinese pinyin software.
Further, in step (5), the sequence labeling auxiliary task marks the entities in the input sentence with the BMOES scheme without entity types and is responsible for extracting Chinese named entity segments; the extracted entity segments carry no type. The pointer labeling main task marks only the entity types of the head and tail Chinese characters of the entity segments in the sentence and is responsible for extracting Chinese named entities; the extracted entities carry types.
Further, in step (6), the test stage takes the label with the maximum value in each Chinese character's entity type prediction probability distribution as that character's predicted label, then matches each entity segment head character with the nearest entity segment tail character of the same entity type, and extracts the text segment between the head character and the tail character as an entity.
Beneficial effects: the invention effectively addresses the difficulty of identifying Chinese named entity boundaries, fully exploits the advantages of different labeling frameworks, and improves the accuracy of Chinese named entity extraction. First, the invention strengthens the model's recognition of entity boundaries by constructing dictionary and pinyin features, and encodes Chinese characters with the Chinese pre-trained language model BERT to provide contextual semantic support for the upper model. Second, the recurrent structure of the bidirectional long short-term memory network model performs feature sequence modeling and learns sequence position information, compensating for the pre-trained language model BERT's lack of sequential dependency modeling, which easily loses sequence position information. Third, sequence labeling and pointer labeling are jointly learned through a multi-task learning model; by combining the advantages of different labeling frameworks, the limitation of a single labeling framework is overcome, further improving the accuracy of Chinese named entity extraction.
Drawings
FIG. 1 is an overall frame diagram of the method of the present invention;
FIG. 2 is an exemplary diagram of dictionary and pinyin feature construction in the method of the present invention;
FIG. 3 is a diagram showing an exemplary sequence of labels in the method of the present invention;
FIG. 4 is a diagram illustrating an example of pointer labels in the method of the present invention;
FIG. 5 (a)(b) are graphs of experimental results of the effect of the dictionary matching window size on accuracy on the Ontonotes4 dataset and the MSRA dataset, respectively, in the method of the present invention;
FIG. 6 (a)(b) are graphs of experimental results of the effect of the dictionary matching window size on accuracy on the Resume dataset and the Weibo dataset, respectively, in the method of the present invention.
Detailed Description
The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are merely illustrative of the invention and do not limit its scope; after reading the invention, various modifications of equivalent forms by those skilled in the art fall within the scope defined by the appended claims.
The invention provides a Chinese named entity extraction method based on multiple labeling frameworks and fused features, which solves the problems that existing Chinese named entity extraction methods have difficulty identifying entity boundaries and are restricted to a single labeling framework. As shown in FIG. 1, the complete flow comprises six parts: a dictionary feature construction stage, a pinyin feature construction stage, a dictionary and pinyin feature fusion stage, a feature sequence modeling stage, a multi-labeling-framework joint learning stage, and an output layer modeling stage. The specific embodiments are described below:
The dictionary feature construction stage corresponds to step (1) of the technical scheme. The specific embodiment is as follows: given any input Chinese character sequence $X=\{c_1,c_2,\ldots,c_n\}$ with $c_i\in\mathcal{V}$, where $\mathcal{V}$ denotes the Chinese character vocabulary, $n$ the sequence length, and $c_i$ ($1\le i\le n$) a Chinese character of length 1. For any character $c_i$ in sequence $X$, to introduce the words in the context of $c_i$, an external dictionary $L_x$ is required; with a word matching window of size $l_w$, every text segment in the sentence that contains $c_i$ and has length at most $l_w$ is matched against the words in dictionary $L_x$. If a text segment exists in $L_x$, it is treated as a candidate word relevant to the context of $c_i$. Since several text segments containing $c_i$ may appear in the dictionary, a candidate matching word set $ws(c_i)=\{w_1,w_2,\ldots,w_m\}$ is finally obtained, where $w_j$ ($1\le j\le m$) denotes a matched word.

After obtaining the candidate matching word set $ws(c_i)$, further filtering is needed: any word in the set that is a substring of another word in the set is filtered out and removed. The reasons are: 1) a complete word usually better matches the information in a Chinese character's context; for example, in '南京市长江大桥' (Nanjing Yangtze River Bridge), '长江大桥' (Yangtze River Bridge) is a more suitable candidate word for the character '长' than '长江' (Yangtze River); 2) it reduces interference when fusing dictionary and pinyin features with the attention mechanism, so that the attention more easily selects the words that best fit the character's context information from the candidate word list.
The words in the filtered matching word set $ws(c_i)$ are mapped into word vectors through the word vector lookup table $e_w$, giving the matched word feature encoding $WE(c_i)$:

$$WE(c_i)=e_w(ws(c_i))$$

where $e_w$ is derived from already-trained pre-trained word vectors and remains unchanged during training. Next, the position of the Chinese character within each matched word is marked with a word segmentation marker: B denotes that $c_i$ is at the beginning of the word, M that $c_i$ is in the middle of the word, and E that $c_i$ is at the end of the word. Different matched words of $c_i$ correspond to different segmentations of the sequence, so the word segmentation markers of $c_i$ within its matched words are also incorporated into the dictionary features, further highlighting the differences between matched words. For any word $w_j$ in the candidate matching word set $ws(c_i)$, let $seg(w_j)\in\{B,M,E\}$ denote the word segmentation marker of $c_i$ in $w_j$. With $START(w_j)$ denoting the start position index of $w_j$ in sequence $X$ and $END(w_j)$ its end position index, $seg(w_j)$ is defined as:

$$seg(w_j)=\begin{cases}B, & i=START(w_j)\\ M, & START(w_j)<i<END(w_j)\\ E, & i=END(w_j)\end{cases}$$

Applying the above to the candidate matching word set $ws(c_i)$ yields $segs(c_i)$:

$$segs(c_i)=\{seg(w_j)\mid w_j\in ws(c_i)\}$$

where $segs(c_i)$ denotes the set of $c_i$'s word segmentation markers over all its matched words. Through the word segmentation marker vector lookup table $e_{seg}$, the markers in $segs(c_i)$ are mapped into one-hot word segmentation marker encodings $SEGE(c_i)$:

$$SEGE(c_i)=e_{seg}(segs(c_i))$$

Each dimension of the one-hot vector corresponds to an element of the set $\{B,M,E\}$: $[1,0,0]$ corresponds to B, $[0,1,0]$ to M, and $[0,0,1]$ to E.

The word segmentation marker encoding $SEGE(c_i)$ and the matched word feature encoding $WE(c_i)$ are concatenated along the encoding dimension to obtain the final dictionary feature encoding $LE(c_i)$ of $c_i$:

$$LE(c_i)=[SEGE(c_i);WE(c_i)]$$
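As a concrete illustration of this stage, the following is a minimal Python sketch of window-based dictionary matching with substring filtering; the toy dictionary contents and window size are illustrative assumptions, not part of the patent.

```python
def match_words(sentence, i, lexicon, window=4):
    """Collect dictionary words that contain character i (window-based matching)."""
    candidates = set()
    for start in range(max(0, i - window + 1), i + 1):
        for end in range(i + 1, min(len(sentence), start + window) + 1):
            segment = sentence[start:end]
            if len(segment) > 1 and segment in lexicon:
                candidates.add(segment)
    # filter out any candidate that is a substring of another candidate
    return {w for w in candidates
            if not any(w != v and w in v for v in candidates)}

# illustrative usage with an assumed toy dictionary
lexicon = {"南京", "南京市", "长江", "大桥", "长江大桥", "市长"}
sentence = "南京市长江大桥"
print(match_words(sentence, 3, lexicon))  # character '长' at index 3
# {'长江大桥', '市长'} -- '长江' is dropped as a substring of '长江大桥'
```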
The pinyin feature construction stage corresponds to step (2) of the technical scheme. The specific embodiment is as follows: counting the neutral tone, pinyin has five tones in total, e.g. "chāng", "cháng", "chǎng", "chàng", "chang". When extracting entities from the sentence '南京市长江大桥': if '长' in the sentence is pronounced "cháng", the sentence is segmented as '南京市 | 长江大桥' and '长江大桥' (Yangtze River Bridge) is extracted as a location entity; if '长' is pronounced "zhǎng", the sentence is segmented as '南京市长 | 江大桥' and '江大桥' (Jiang Daqiao) is extracted as a person name entity. This illustrates how the pinyin features of the Chinese characters in a sentence affect entity extraction accuracy.
For any Chinese character $c_i$ in the input sequence $X$, after the candidate word set $ws(c_i)$ is obtained, Chinese pinyin software (e.g. pypinyin) is used to annotate $c_i$ with pinyin according to the meaning of $c_i$ in each matched word, yielding the pinyin set $pys(c_i)$ corresponding to the word set $ws(c_i)$. Then, through the pinyin vector lookup table $e_{py}$, the pinyin in $pys(c_i)$ is mapped into pinyin vectors to obtain the pinyin feature encoding $PYE(c_i)$:

$$PYE(c_i)=e_{py}(pys(c_i))$$

The pinyin vector lookup table $e_{py}$ is obtained by converting an external Chinese corpus (e.g. the Chinese Wikipedia corpus) into pinyin with Chinese pinyin software and then training with the Skip-gram method of Word2Vec. Because the external Chinese corpus may contain digits, English, or other symbols without pinyin, in the data preprocessing stage before word vector training the invention converts English into "[ENG]", digits into "[DIGIT]", and other characters without pinyin uniformly into "[UNK]".
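A minimal sketch of the pinyin annotation and corpus preprocessing steps using the pypinyin package (named in the description as one possible choice of Chinese pinyin software); the normalization rules mirror the token conventions above.

```python
import re
from pypinyin import pinyin, Style

def annotate_pinyin(word):
    """Tone-marked pinyin for a matched word, one syllable per character."""
    return [s[0] for s in pinyin(word, style=Style.TONE)]

def normalize_token(token):
    """Corpus preprocessing applied before pinyin word-vector training (sketch)."""
    if re.fullmatch(r"[A-Za-z]+", token):
        return "[ENG]"
    if re.fullmatch(r"[0-9]+", token):
        return "[DIGIT]"
    if re.fullmatch(r"[\u4e00-\u9fff]+", token):
        return " ".join(annotate_pinyin(token))
    return "[UNK]"

print(annotate_pinyin("长江大桥"))  # ['cháng', 'jiāng', 'dà', 'qiáo']
```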
An example of dictionary and pinyin feature construction is shown in FIG. 2, which gives the matching results of '市' and '长', where $w_{i,j}$ denotes the word formed by the sequence segment $\{c_i,c_{i+1},\ldots,c_j\}$. Note that '长江' (Yangtze River) is not included in the matching result of '长', because '长江' is a substring of '长江大桥' (Yangtze River Bridge) and is filtered out.
The dictionary and pinyin feature fusion stage corresponds to step (3) of the technical scheme. The specific embodiment is as follows: to avoid overfitting caused by the small scale of entity extraction annotation datasets in some vertical domains, the invention uses the Chinese pre-trained language model BERT to provide semantic support and improve model generalization. The input sequence $X=\{c_1,c_2,\ldots,c_n\}$ is fed into the Chinese pre-trained language model BERT, and the last-layer output of BERT is taken as the sequence encoding $X^h=[x_1,x_2,\ldots,x_n]$, where $x_i\in\mathbb{R}^{d_x}$ is a real vector of dimension $d_x$, $X^h\in\mathbb{R}^{d_x\times n}$, $d_x$ denotes the BERT encoding dimension, and $\mathbb{R}$ denotes the real numbers. The dictionary features and pinyin features of $c_i$ constructed above are concatenated along the encoding dimension to obtain the fused feature $LPE(c_i)$:

$$LPE(c_i)=[LE(c_i);PYE(c_i)]$$

Suppose the encoding dimension of the word vector lookup table $e_w$ is $d_w$, that of the pinyin vector lookup table $e_{py}$ is $d_{py}$, and the size of the candidate matching word set $ws(c_i)$ is $m$; then $LPE(c_i)\in\mathbb{R}^{m\times(3+d_w+d_{py})}$, the 3 coming from the one-hot word segmentation marker encoding.

$LPE(c_i)$ is fused into the Chinese character encoding $x_i$ based on a dot-product attention mechanism, where $x_i$ plays the role of the query in the attention mechanism and $LPE(c_i)$ that of the keys and values. First, $LPE(c_i)$ is linearly mapped to $LPE_{ikv}$, whose encoding dimension matches that of $x_i$:

$$LPE_{ikv}=LPE(c_i)\cdot W_{kv}+b_{kv}$$

where the training parameters are $W_{kv}\in\mathbb{R}^{(3+d_w+d_{py})\times d_x}$ and $b_{kv}\in\mathbb{R}^{d_x}$, and the mapped fused feature is $LPE_{ikv}\in\mathbb{R}^{m\times d_x}$. With $\mathrm{unsqueeze}(M,y)$ denoting expansion of the $y$-th dimension of matrix $M$ and $\mathrm{squeeze}(M,y)$ denoting compression of the $y$-th dimension of matrix $M$, $\mathrm{unsqueeze}(x_i,0)$ converts $x_i$ from $\mathbb{R}^{d_x}$ to $\mathbb{R}^{1\times d_x}$. The attention weights $LPE_{iw}$ are then computed:

$$LPE_{iw}=\mathrm{softmax}(\mathrm{unsqueeze}(x_i,0)\cdot LPE_{ikv}^{\top})$$

where $LPE_{iw}\in\mathbb{R}^{1\times m}$ and the weights sum to 1 after the softmax. Next, the attention output $LPE_{io}$ is computed as the $LPE_{iw}$-weighted sum of $LPE_{ikv}$:

$$LPE_{io}=\mathrm{squeeze}(LPE_{iw}\cdot LPE_{ikv},\,0)$$

where the attention output is $LPE_{io}\in\mathbb{R}^{d_x}$. Finally, $LPE_{io}$ and the Chinese character encoding $x_i$ are added to give the final semantic encoding of $c_i$, expressed as:

$$x_i=LPE_{io}+x_i$$
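The formulas above translate directly into a few tensor operations. Here is a minimal PyTorch sketch (the framework choice is an assumption; the patent names no implementation), with shapes annotated to mirror the notation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictPinyinFusion(nn.Module):
    """Dot-product attention fusing dictionary+pinyin features into a BERT encoding."""
    def __init__(self, d_feat: int, d_x: int = 768):
        super().__init__()
        self.kv = nn.Linear(d_feat, d_x)   # maps LPE(c_i) to LPE_ikv

    def forward(self, x_i: torch.Tensor, lpe: torch.Tensor) -> torch.Tensor:
        # x_i: (d_x,) BERT encoding of c_i (query); lpe: (m, d_feat) keys/values
        lpe_kv = self.kv(lpe)                                     # (m, d_x)
        weights = F.softmax(x_i.unsqueeze(0) @ lpe_kv.T, dim=-1)  # (1, m) attention
        lpe_io = (weights @ lpe_kv).squeeze(0)                    # (d_x,) weighted sum
        return x_i + lpe_io                                       # final semantic encoding
```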
The feature sequence modeling stage corresponds to step (4) of the technical scheme. The specific embodiment is as follows: the Transformer self-attention mechanism cannot capture sequence position information; the pre-trained language model BERT alleviates this by adding trainable absolute position encodings to its input, but modeling of sequential dependency is still lacking. The Long Short-Term Memory (LSTM) network model needs no position encoding: its recurrent structure, which encodes the sequence in order, can learn sequence position information. The Chinese character semantic sequence encoding $X^h=[x_1,x_2,\ldots,x_n]$ obtained in the previous step by fusing dictionary and pinyin features is input into two Bidirectional Long Short-Term Memory network models (BiLSTM) for feature sequence modeling: the output of one BiLSTM serves the sequence-labeling-based Chinese named entity segment extraction auxiliary task of step (5), and the output of the other serves the pointer-labeling-based Chinese named entity extraction main task of step (5). A BiLSTM consists of a forward and a backward LSTM; the BiLSTMs of the two tasks are independent and share no training parameters.

Assume that at time step $t$ the forward LSTM hidden state output of the sequence-labeling-based auxiliary task is $\overrightarrow{h}^a_t$ and the backward LSTM hidden state output is $\overleftarrow{h}^a_t$; adding them gives the auxiliary task's BiLSTM hidden state output at time step $t$:

$$h^a_t=\overrightarrow{h}^a_t+\overleftarrow{h}^a_t$$

Likewise, the pointer-labeling-based Chinese named entity extraction main task has forward LSTM hidden state output $\overrightarrow{h}^b_t$ and backward LSTM hidden state output $\overleftarrow{h}^b_t$, which are added to give the main task's BiLSTM hidden state output at time step $t$:

$$h^b_t=\overrightarrow{h}^b_t+\overleftarrow{h}^b_t$$

Finally, the feature sequence modeling output of the sequence labeling auxiliary task is $X^a=[h^a_1,h^a_2,\ldots,h^a_n]\in\mathbb{R}^{d_h\times n}$ and that of the pointer labeling main task is $X^b=[h^b_1,h^b_2,\ldots,h^b_n]\in\mathbb{R}^{d_h\times n}$, where $d_h$ denotes the LSTM encoding dimension.
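A minimal PyTorch sketch of this stage, under the same framework assumption as above; PyTorch's bidirectional LSTM concatenates the two directions, so the sketch splits and sums them as in the formulas:

```python
import torch.nn as nn

class DualBiLSTM(nn.Module):
    """Two independent BiLSTMs: one per labeling task, no shared parameters."""
    def __init__(self, d_x: int = 768, d_h: int = 768):
        super().__init__()
        self.bilstm_a = nn.LSTM(d_x, d_h, batch_first=True, bidirectional=True)
        self.bilstm_b = nn.LSTM(d_x, d_h, batch_first=True, bidirectional=True)

    def forward(self, x):                 # x: (batch, n, d_x) semantic encodings
        out_a, _ = self.bilstm_a(x)       # (batch, n, 2 * d_h)
        out_b, _ = self.bilstm_b(x)
        d_h = out_a.size(-1) // 2
        # add forward and backward hidden states (the patent sums, not concatenates)
        xa = out_a[..., :d_h] + out_a[..., d_h:]   # X^a for the sequence labeling task
        xb = out_b[..., :d_h] + out_b[..., d_h:]   # X^b for the pointer labeling task
        return xa, xb
```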
The multi-labeling-framework joint learning stage corresponds to step (5) of the technical scheme. The specific embodiment is as follows: sequence labeling and pointer labeling are two labeling frameworks commonly applied to named entity extraction. Sequence labeling marks the position of each Chinese character of the text sequence within an entity; FIG. 3 shows an example of labeling a text sequence with the BMOES scheme, where B indicates a Chinese character at the beginning of a named entity segment, M a character in the middle of a named entity segment, O a character outside any named entity segment, E a character at the end of a named entity segment, and S a character that is itself a named entity segment. The example sentence contains two entities, '南京市' (Nanjing) and '长江大桥' (Yangtze River Bridge). Pointer labeling marks the entity types of the head and tail characters of each entity segment in the text sequence, as shown in FIG. 4, where both '南京市' and '长江大桥' are location (Loc) entities.
Sequence labeling models the dependencies of the whole sequence, so the extracted entities are more complete and the precision is generally higher; pointer labeling classifies the entity types of the head and tail characters of entity segments, so it resists noise interference better, is more robust, and generally attains higher recall. To combine the advantages of the different labeling frameworks, $X^a$ is taken as the input of the sequence labeling auxiliary task and $X^b$ as the input of the pointer labeling main task, and a multi-task learning model, e.g. the Multi-gate Mixture-of-Experts (MMoE) model or the Progressive Layered Extraction (PLE) model, jointly learns the sequence-labeling-based Chinese named entity segment extraction auxiliary task and the pointer-labeling-based Chinese named entity extraction main task, producing the sequence labeling auxiliary task output $\tilde{X}^a$ and the pointer labeling main task output $\tilde{X}^b$.
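The patent names MMoE and PLE as candidate multi-task models but gives no internals. Below is a minimal MMoE sketch for orientation only: it is an illustration of the gating mechanism, not the patent's exact wiring (here both task gates read one shared input, whereas the patent feeds task-specific encodings $X^a$ and $X^b$); expert and gate sizes are assumptions.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Minimal Multi-gate Mixture-of-Experts sketch for two labeling tasks."""
    def __init__(self, d_in: int, d_out: int, n_experts: int = 2, n_tasks: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gates = nn.ModuleList(nn.Linear(d_in, n_experts) for _ in range(n_tasks))

    def forward(self, x):
        # x: (batch, n, d_in) shared input encoding
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (batch, n, E, d_out)
        outputs = []
        for gate in self.gates:
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)            # (batch, n, E, 1)
            outputs.append((w * expert_out).sum(dim=-2))                # (batch, n, d_out)
        return outputs  # one refined encoding per task
```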
The output layer modeling stage corresponds to step (6) of the technical scheme. The specific embodiment is as follows: a Dropout layer is applied to the $\tilde{X}^a$ and $\tilde{X}^b$ obtained in the previous step to prevent overfitting. Then $\tilde{X}^a$ after Dropout is input into a Conditional Random Field (CRF), and the likelihood probability $p(y|X)$ of a BMOES tag index sequence $y\in\mathbb{Z}^n$ for the sequence-labeling-based Chinese named entity segment extraction auxiliary task is computed:

$$p(y|X)=\frac{\exp\left(\sum_{t}\left(W_{CRF}^{y_t}\cdot \tilde{h}^a_t+b_{CRF}^{(y_{t-1},y_t)}\right)\right)}{\sum_{y'\in\mathcal{Y}(X)}\exp\left(\sum_{t}\left(W_{CRF}^{y'_t}\cdot \tilde{h}^a_t+b_{CRF}^{(y'_{t-1},y'_t)}\right)\right)}$$

where $\tilde{h}^a_t$ denotes the $t$-th column of $\tilde{X}^a$ after Dropout, $\mathcal{Y}(X)$ denotes the set of all possible BMOES tag index sequences of $X$ under this task, and $y'\in\mathbb{Z}^n$ is any BMOES tag index sequence in $\mathcal{Y}(X)$. The training parameters are $W_{CRF}\in\mathbb{R}^{d_h\times 5}$ and $b_{CRF}\in\mathbb{R}^{5\times 5}$ (the BMOES sequence labeling scheme has 5 tags); $W_{CRF}^{y_t}$ denotes the parameters of $W_{CRF}$ corresponding to tag $y_t$, $b_{CRF}^{(y_{t-1},y_t)}$ denotes the parameters of $b_{CRF}$ for the transition from tag $y_{t-1}$ to tag $y_t$, and $W_{CRF}^{y'_t}$, $b_{CRF}^{(y'_{t-1},y'_t)}$ are analogous. Assume the true BMOES tag index sequence of the sequence labeling auxiliary task is $y^{span}\in\mathbb{Z}^n$, with $\mathbb{Z}$ denoting the integers; substituting it into the above gives the log-likelihood loss of the sequence labeling auxiliary task:

$$\mathcal{L}_{span}=-\log p(y^{span}|X)$$
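A minimal sketch of computing $\mathcal{L}_{span}$ with the third-party pytorch-crf package; the choice of package is an assumption, as the patent only specifies a CRF layer:

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf (assumed implementation choice)

num_tags = 5                                  # BMOES scheme has 5 tags
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(2, 10, num_tags)      # emission scores from X~a (batch=2, n=10)
tags = torch.randint(0, num_tags, (2, 10))    # gold BMOES tag index sequences

l_span = -crf(emissions, tags, reduction='mean')  # negative log-likelihood = L_span
decoded = crf.decode(emissions)                   # Viterbi decoding for the auxiliary task
```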
Next, $\tilde{X}^b$ after Dropout is linearly mapped into the label space of the pointer-labeling-based Chinese named entity extraction main task, and a softmax layer computes each Chinese character's probability distributions $p^{start}$ and $p^{end}$ over the labels:

$$p^{start}=\mathrm{softmax}(\tilde{X}^{b\top}W_{start}+b_{start})$$

$$p^{end}=\mathrm{softmax}(\tilde{X}^{b\top}W_{end}+b_{end})$$

where the training parameters are $W_{start},W_{end}\in\mathbb{R}^{d_h\times(c_e+1)}$ and $b_{start},b_{end}\in\mathbb{R}^{c_e+1}$, with $c_e+1$ being the number of entity types $c_e$ plus the non-entity type; $p^{start}\in\mathbb{R}^{n\times(c_e+1)}$ is the predicted probability distribution of the entity types of entity segment head characters, and $p^{end}\in\mathbb{R}^{n\times(c_e+1)}$ that of entity segment tail characters. Assume the true entity type label index sequence of the entity segment head characters is $y^{start}\in\mathbb{Z}^n$ and that of the entity segment tail characters is $y^{end}\in\mathbb{Z}^n$; the cross entropy (CE) losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ of the pointer labeling main task are computed as:

$$\mathcal{L}_{start}=-\frac{1}{n}\sum_{i=1}^{n}\log p^{start}_{i,\,y^{start}_i}$$

$$\mathcal{L}_{end}=-\frac{1}{n}\sum_{i=1}^{n}\log p^{end}_{i,\,y^{end}_i}$$

where $y^{start}_i$ denotes the true entity type label index of the $i$-th Chinese character, $p^{start}_{i,\,y^{start}_i}$ denotes the probability in $p^{start}$ that the $i$-th character is predicted as the $y^{start}_i$-th entity type, and $y^{end}_i$, $p^{end}_{i,\,y^{end}_i}$ are analogous.
Finally, after obtaining the sequence labeling auxiliary task loss $\mathcal{L}_{span}$ and the pointer labeling main task losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$, the three losses are fused into the overall training objective $\mathcal{L}$ that the model must minimize, and end-to-end joint training is performed:

$$\mathcal{L}=\lambda_1\mathcal{L}_{span}+\lambda_2\mathcal{L}_{start}+\lambda_3\mathcal{L}_{end}$$
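Assembled in code, the training objective is simply a weighted sum of the three losses; a minimal sketch follows, where the lambda values shown are placeholders, not values from the patent:

```python
import torch
import torch.nn.functional as F

def total_loss(l_span: torch.Tensor,
               start_logits: torch.Tensor, end_logits: torch.Tensor,
               y_start: torch.Tensor, y_end: torch.Tensor,
               lambdas=(1.0, 1.0, 1.0)) -> torch.Tensor:
    # start_logits/end_logits: (n, c_e + 1) pre-softmax scores per character
    # y_start/y_end: (n,) gold entity type indices for head/tail characters
    l_start = F.cross_entropy(start_logits, y_start)   # L_start
    l_end = F.cross_entropy(end_logits, y_end)         # L_end
    l1, l2, l3 = lambdas                               # placeholder weights
    return l1 * l_span + l2 * l_start + l3 * l_end     # overall objective L
```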
where $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyperparameters controlling each task's influence on the overall training objective. In the test stage, the indices $\hat{y}^{start}_i$ and $\hat{y}^{end}_i$ corresponding to the maxima of each Chinese character's label prediction probability distributions in $p^{start}$ and $p^{end}$ are taken as the label prediction indices:

$$\hat{y}^{start}_i=\arg\max_{j}\,p^{start}_{i,j}$$

$$\hat{y}^{end}_i=\arg\max_{j}\,p^{end}_{i,j}$$

Then entity segment head characters and tail characters with the same entity type and the nearest positions are matched, and the entities in the sequence are extracted.
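A minimal sketch of the test-stage decoding: take the argmax type for every character, then match each head character to the nearest tail character of the same type. Index 0 standing for "non-entity" is an assumed convention.

```python
def decode_entities(p_start, p_end, text):
    # p_start/p_end: (n, c_e + 1) predicted probability rows per character
    y_start = [max(range(len(r)), key=r.__getitem__) for r in p_start]
    y_end = [max(range(len(r)), key=r.__getitem__) for r in p_end]
    entities = []
    for i, t in enumerate(y_start):
        if t == 0:                      # 0 = non-entity (assumed convention)
            continue
        for j in range(i, len(y_end)):  # nearest tail of the same entity type
            if y_end[j] == t:
                entities.append((text[i:j + 1], t))
                break
    return entities

# illustrative usage: type 1 = Loc; heads at '南' and '长', tails at '市' and '桥'
print(decode_entities(
    [[0.1, 0.9], [1, 0], [1, 0], [0.2, 0.8], [1, 0], [1, 0], [1, 0]],
    [[1, 0], [1, 0], [0.1, 0.9], [1, 0], [1, 0], [1, 0], [0, 1]],
    "南京市长江大桥"))  # [('南京市', 1), ('长江大桥', 1)]
```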
The invention provides a Chinese named entity extraction method based on a multi-annotation framework and fused features. To test its effectiveness, the method is evaluated in terms of precision (P), recall (R), and F1 on the Ontonotes4, MSRA, Resume, and Weibo datasets, and compared with other Chinese named entity extraction methods.
The model optimizer is adaptive moment estimation (Adam); the learning rate of the BERT training parameters is set to 3e-5 and that of the other model parameters to 1e-3; the BERT encoding dimension is $d_x=768$. The multi-task learning model is the Progressive Layered Extraction model (PLE), with the numbers of task-specific and shared experts uniformly set to 2, each expert a single-layer fully connected network, and the number of PLE layers set to 2. The number of BiLSTM layers is set to 1 and the LSTM encoding dimension to $d_h=768$; the word vector encoding dimension is $d_w=50$ and the pinyin vector encoding dimension $d_{py}=50$; the loss weights $\lambda_1$, $\lambda_2$, $\lambda_3$ are fixed hyperparameters.
Table 1 shows the accuracy comparison of different Chinese named entity extraction methods on the Ontonotes4 dataset; Table 2 on the MSRA dataset; Table 3 on the Resume dataset; Table 4 on the Weibo dataset. As the experimental results show, compared with other Chinese named entity extraction methods, the proposed method attains the best Chinese named entity extraction accuracy on most datasets and metrics. FIG. 5 (a)(b) shows the experimental results of the effect of the dictionary matching window size on accuracy on the Ontonotes4 and MSRA datasets, and FIG. 6 (a)(b) shows the corresponding results on the Resume and Weibo datasets; evaluating the influence of the dictionary matching window size on Chinese named entity extraction accuracy provides guidance for choosing the window size in different subsequent application scenarios.
Table 1. Accuracy comparison of different entity extraction methods on the Ontonotes4 dataset (reproduced as an image in the original document).
Table 2. Accuracy comparison of different entity extraction methods on the MSRA dataset (reproduced as an image in the original document).
Table 3. Accuracy comparison of different entity extraction methods on the Resume dataset (reproduced as an image in the original document).
Table 4. Accuracy comparison of different entity extraction methods on the Weibo dataset (reproduced as an image in the original document).

Claims (5)

1. A Chinese named entity extraction method based on a multi-annotation framework and fused features, comprising the following steps:
(1) Each Chinese character in the input Chinese character sequence is matched against words in an external dictionary; the matched words are mapped into word vectors by a word vector lookup table, and the word segmentation markers of the Chinese character within the matched words are mapped into word segmentation marker vectors by a word segmentation marker vector lookup table; the word segmentation marker vectors and word vectors are concatenated to form the dictionary features;
(2) Pinyin annotations are produced for each Chinese character according to its meaning in the matched words, and the pinyin is mapped into pinyin features by a pinyin vector lookup table;
(3) Based on a dot-product attention mechanism, the dictionary features and pinyin features are fused into the Chinese character encodings produced by the Chinese pre-trained language model BERT, providing the subsequent steps with Chinese character semantic encodings that combine dictionary and pinyin features;
(4) The Chinese character semantic encodings are input into two independent bidirectional long short-term memory network models for feature sequence modeling, whose outputs are the first feature sequence encoding $X^a$ and the second feature sequence encoding $X^b$, respectively;
(5) Sequence labeling serves as the auxiliary task and pointer labeling as the main task; the first feature sequence encoding $X^a$ is taken as the input of the sequence labeling auxiliary task and the second feature sequence encoding $X^b$ as the input of the pointer labeling main task, and a multi-task learning model jointly learns the sequence labeling auxiliary task and the pointer labeling main task;
(6) A conditional random field computes the log-likelihood loss $\mathcal{L}_{span}$ of the sequence labeling auxiliary task; the pointer labeling main task computes the entity type classification cross-entropy loss $\mathcal{L}_{start}$ of entity segment head characters and the entity type classification cross-entropy loss $\mathcal{L}_{end}$ of entity segment tail characters; a weighted sum of $\mathcal{L}_{span}$, $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ gives the training objective the model must minimize, end-to-end joint training is performed, and in the test stage the entity segments and their types are extracted from sentences by the pointer labeling main task.
2. The Chinese named entity extraction method based on a multi-annotation framework and fused features according to claim 1, wherein in step (1) the external dictionary and the word vector lookup table are derived from pre-trained word vectors publicly available on the Internet, and the word segmentation marker vector lookup table consists of one-hot vectors.
3. The Chinese named entity extraction method based on a multi-annotation framework and fused features according to claim 1, wherein in step (2) the pinyin vector lookup table is obtained by word2vec training on an external Chinese corpus whose text has been converted into pinyin with Chinese pinyin software.
4. The Chinese named entity extraction method based on a multi-annotation framework and fused features according to claim 1, wherein in step (5) the sequence labeling auxiliary task marks the entities in the input sentence with the BMOES scheme without entity types and is responsible for extracting Chinese named entity segments, the extracted entity segments carrying no type; the pointer labeling main task marks only the entity types of the head and tail Chinese characters of the entity segments in the sentence and is responsible for extracting Chinese named entities, the extracted entities carrying types.
5. The Chinese named entity extraction method based on a multi-annotation framework and fused features according to claim 1, wherein in step (6) the test stage takes the label with the maximum value in each Chinese character's entity type prediction probability distribution as that character's predicted label, then matches each entity segment head character with the nearest entity segment tail character of the same entity type, and extracts the text segment between the head character and the tail character as an entity.
CN202110511025.8A 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features Active CN113190656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110511025.8A CN113190656B (en) 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110511025.8A CN113190656B (en) 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features

Publications (2)

Publication Number Publication Date
CN113190656A CN113190656A (en) 2021-07-30
CN113190656B true CN113190656B (en) 2023-07-14

Family

ID=76981067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110511025.8A Active CN113190656B (en) 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features

Country Status (1)

Country Link
CN (1) CN113190656B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139541B (en) * 2021-11-22 2022-08-02 北京中科闻歌科技股份有限公司 Named entity identification method, device, equipment and medium
CN114036933B (en) * 2022-01-10 2022-04-22 湖南工商大学 Information extraction method based on legal documents
CN115146644B (en) * 2022-09-01 2022-11-22 北京航空航天大学 Alarm situation text-oriented multi-feature fusion named entity identification method
CN115470871B (en) * 2022-11-02 2023-02-17 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10140973B1 (en) * 2016-09-15 2018-11-27 Amazon Technologies, Inc. Text-to-speech processing using previously speech processed data
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN109446521A (en) * 2018-10-18 2019-03-08 BOE Technology Group Co., Ltd. Named entity recognition method, device, electronic equipment, machine readable storage medium
CN111476031A (en) * 2020-03-11 2020-07-31 Chongqing University of Posts and Telecommunications Improved Chinese named entity recognition method based on Lattice-LSTM
CN111462752A (en) * 2020-04-01 2020-07-28 Beijing Si-Tech Information Technology Co., Ltd. Client intention identification method based on attention mechanism, feature embedding and BiLSTM
CN111444721A (en) * 2020-05-27 2020-07-24 Nanjing University Chinese text key information extraction method based on pre-training language model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Normalizing Text using Language Modelling based on Phonetics and String Similarity; Fenil Doshi et al.; arXiv; 1-9 *
Phonetic-enriched Text Representation for Chinese Sentiment Analysis with Reinforcement Learning; H. Peng et al.; IEEE Transactions on Affective Computing; 1-16 *
Research and Application of Key Technologies for Named Entity Recognition in Electronic Medical Records Based on Deep Neural Networks; Jiang Tao; China Masters' Theses Full-text Database (Medicine and Health Sciences), No. 7; E053-210 *

Also Published As

Publication number Publication date
CN113190656A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
WO2022141878A1 (en) End-to-end language model pretraining method and system, and device and storage medium
CN111666758B (en) Chinese word segmentation method, training device and computer readable storage medium
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN109800437A (en) A kind of name entity recognition method based on Fusion Features
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN115081437B (en) Machine-generated text detection method and system based on linguistic feature contrast learning
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN111767718A (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN116151256A (en) Small sample named entity recognition method based on multitasking and prompt learning
CN113673254A (en) Knowledge distillation position detection method based on similarity maintenance
CN112163089A (en) Military high-technology text classification method and system fusing named entity recognition
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN111125380A (en) Entity linking method based on RoBERTA and heuristic algorithm
CN115238691A (en) Knowledge fusion based embedded multi-intention recognition and slot filling model
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
CN109117471A (en) A kind of calculation method and terminal of the word degree of correlation
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN113449517B (en) Entity relationship extraction method based on BERT gated multi-window attention network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant