CN113190656A - Chinese named entity extraction method based on multi-label framework and fusion features - Google Patents
- Publication number: CN113190656A (application number CN202110511025.8A)
- Authority: CN (China)
- Prior art keywords: chinese, entity, sequence, chinese character, pinyin
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3344 — Query execution using natural language analysis
- G06F16/3346 — Query execution using probabilistic model
- G06F16/35 — Clustering; Classification
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415 — Classification techniques based on parametric or probabilistic models
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- G06F40/295 — Named entity recognition
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a Chinese named entity extraction method based on a multi-labeling framework and fused features. First, word information and word-segmentation marker information are introduced for each Chinese character through dictionary matching to construct dictionary features. On this basis, Chinese pinyin software annotates each character's pinyin according to its meaning in the matched words, constructing pinyin features. Then, based on a point-wise attention mechanism, the dictionary and pinyin features are fused into the Chinese character encodings to obtain character semantic encodings that combine both, improving the recognition of Chinese named entity boundaries. Finally, combining the advantages of sequence labeling and pointer labeling, the two labeling tasks are jointly learned with a multi-task learning model, improving the accuracy of Chinese named entity extraction.
Description
Technical Field
The invention belongs to the fields of artificial intelligence and natural language processing, and specifically relates to a Chinese named entity extraction method based on a multi-labeling framework and fused features.
Background
With the rapid development of internet technology, data in all industries has grown explosively, driving intelligent analysis and mining services and innovative applications of industrial big data, and further promoting the development of China's digital economy. This data contains a large amount of unstructured text, and extracting structured, useful information from it has become a key industry concern. It involves a basic task in natural language processing: named entity extraction.
Early research on named entity recognition mainly used dictionary- and rule-based methods, which relied on linguists and domain experts to manually construct domain dictionaries and rule templates from dataset characteristics. The advantage of the rule-based approach is that rules can be iteratively updated to extract the target entities as needed. Its disadvantages are that manually building rules is costly in complex domains and application scenarios, and that rule conflicts easily arise as the rule base grows, making the rule base hard to maintain and extend and unable to adapt to changes in data and domains.
Subsequently, research attention turned to named entity recognition based on statistical machine learning, in which the task is defined as a sequence labeling problem. The statistical machine learning methods applied to NER mainly include the maximum entropy model, the hidden Markov model, the maximum entropy Markov model, and conditional random fields. These methods depend on manually constructed features, and the feature engineering process is relatively laborious.
In recent years, with the continued development of deep learning, more and more work in named entity recognition has been based on Deep Neural Networks (DNNs). DNN-based named entity recognition requires no complicated feature engineering, and model performance far exceeds that of traditional rule-based and statistical machine learning methods.
Recognizing named entities in Chinese is harder than in English: Chinese text lacks separators such as the spaces in English and has no obvious morphological cues, which easily causes boundary ambiguity. Chinese also exhibits polysemy: the same word takes different meanings in different domains or contexts, so context information must be fully exploited to interpret it. Linguistic phenomena such as omission and abbreviation pose further challenges for Chinese named entity recognition. Existing Chinese named entity extraction methods make little use of word information and are limited by a single labeling framework, which hurts extraction precision.
Disclosure of Invention
Purpose of the invention: in view of the problems and shortcomings of the prior art, the invention aims to provide a Chinese named entity extraction method based on a multi-labeling framework and fused features, solving two problems of traditional Chinese named entity extraction: the limitations of a single labeling framework, and the difficulty of identifying entity boundaries caused by insufficient use of word information.
Technical scheme: to achieve the above object, the invention adopts a Chinese named entity extraction method based on a multi-labeling framework and fused features, comprising the following steps:
(1) performing word matching on each Chinese character in an input Chinese character sequence in an external dictionary, mapping words into word vectors by using a word vector query table, mapping word segmentation marks of the Chinese characters in the words into word marking vectors by using a word segmentation mark vector query table, and splicing the word segmentation mark vectors and the word vectors to form dictionary features;
(2) according to the meaning of the Chinese characters in the matching words, the Chinese characters are marked with pinyin, and pinyin characteristics are obtained by mapping the pinyin through a pinyin vector lookup table;
(3) based on a point-wise attention mechanism, fusing the dictionary features and the pinyin features into the Chinese character encodings produced by the Chinese pre-trained language model BERT, providing subsequent layers with character semantic encodings that combine the dictionary and pinyin features;
(4) inputting the character semantic encodings into two independent bidirectional long short-term memory network models for feature sequence modeling, which output the first feature sequence encoding and the second feature sequence encoding, respectively;
(5) taking sequence labeling as the auxiliary task and pointer labeling as the main task, using the first feature sequence encoding as input to the sequence labeling auxiliary task and the second feature sequence encoding as input to the pointer labeling main task, and jointly learning the two tasks with a multi-task learning model;
(6) computing the log-likelihood loss of the sequence labeling auxiliary task under a conditional random field, the entity-type classification cross-entropy loss of entity-fragment head characters in the pointer labeling main task, and the entity-type classification cross-entropy loss of entity-fragment tail characters in the pointer labeling main task; weighting and summing these three losses to obtain the training objective that the model minimizes in end-to-end joint training; in the testing stage, the pointer labeling main task labels the entity fragments and their types in the sentence.
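The steps above end with a weighted sum of three losses; a minimal sketch, with illustrative weights (the patent does not specify the weight values here):

```python
# Hypothetical sketch of the step (6) training objective: a weighted sum of the
# CRF log-likelihood loss of the sequence labeling auxiliary task and the two
# cross-entropy losses (head / tail characters) of the pointer labeling main
# task. The default weights alpha/beta/gamma are assumptions for illustration.
def total_loss(loss_crf, loss_head, loss_tail, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three task losses, minimized end to end."""
    return alpha * loss_crf + beta * loss_head + gamma * loss_tail
```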
Further, in step (1), the external dictionary and the word vector lookup table are derived from pre-trained word vectors published on the internet, and the word-segmentation marker vector lookup table consists of one-hot vectors.
Further, in step (2), the pinyin vector lookup table is obtained by word2vec training on an external Chinese corpus whose text has been converted into pinyin with Chinese pinyin software.
Further, in step (5), the sequence labeling auxiliary task labels entities in input sentences with type-free BMOES tags and is responsible for extracting Chinese named entity fragments, which carry no types; the pointer labeling main task performs entity-type labeling only on the head and tail characters of entity fragments and is responsible for extracting typed Chinese named entities.
Further, in step (6), in the testing stage, for each character the label with the maximum predicted entity-type probability is taken as its predicted label; each entity-fragment head character is then matched with the closest following tail character of the same entity type, and the text fragment between the head and tail characters is extracted as the entity.
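The test-stage decoding described above can be sketched as follows (an illustrative sketch, not the patent's code): each character receives its argmax head/tail type label, and each head is paired with the nearest following tail of the same entity type.

```python
from typing import List, Tuple

def decode_pointer(head_labels: List[str], tail_labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans; 'O' marks a non-entity character."""
    spans = []
    for i, h in enumerate(head_labels):
        if h == "O":
            continue
        # match the closest tail character with the same entity type
        for j in range(i, len(tail_labels)):
            if tail_labels[j] == h:
                spans.append((i, j, h))
                break
    return spans

# Toy labels: "Nanjing City" (Loc, chars 0-2), "Yangtze River Bridge" (Loc, chars 3-6)
heads = ["Loc", "O", "O", "Loc", "O", "O", "O"]
tails = ["O", "O", "Loc", "O", "O", "O", "Loc"]
spans = decode_pointer(heads, tails)
```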
Beneficial effects: the method effectively addresses the difficulty of identifying Chinese named entity boundaries, exploits the advantages of different labeling frameworks, and improves the accuracy of Chinese named entity extraction. First, dictionary and pinyin features strengthen the model's recognition of entity boundaries, and the Chinese pre-trained language model BERT encodes the characters to give upper layers contextual semantic support. Second, the recursive structure of the bidirectional long short-term memory network models the feature sequence and learns sequence position information, mitigating BERT's tendency to lose such information due to its lack of sequence-order-dependent modeling. Third, sequence labeling and pointer labeling are jointly learned through a multi-task learning model, combining the advantages of different labeling frameworks, breaking the limitation of a single framework, and further improving extraction accuracy.
Drawings
FIG. 1 is an overall block diagram of the method of the present invention;
FIG. 2 is an exemplary diagram of dictionary and Pinyin feature construction in the method of the present invention;
FIG. 3 is a diagram illustrating sequence notation in the method of the present invention;
FIG. 4 is a diagram illustrating an example of a pointer marking in the method of the present invention;
FIG. 5(a)(b) are graphs of experimental results on the effect of the dictionary matching window size on accuracy, on the OntoNotes 4 dataset and the MSRA dataset respectively, in the method of the present invention;
FIG. 6(a) (b) are graphs of experimental results showing the effect of the size of the dictionary matching window on the accuracy of the Resume dataset and the Weibo dataset, respectively, in the method of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention; after reading this specification, equivalent modifications by those skilled in the art fall within the scope defined by the appended claims.
The invention provides a Chinese named entity extraction method based on a multi-labeling framework and fused features, solving the problems that traditional Chinese named entity extraction has difficulty identifying entity boundaries and is limited to a single labeling framework. As shown in FIG. 1, the complete process comprises 6 parts: the dictionary feature construction stage, the pinyin feature construction stage, the dictionary and pinyin feature fusion stage, the feature sequence modeling stage, the multi-labeling-framework joint learning stage, and the output layer modeling stage. The specific embodiments are described below:
The dictionary feature construction stage corresponds to step (1) of the technical scheme. The specific implementation is as follows: let any given input Chinese character sequence be X = {c_1, c_2, …, c_n}, where each c_i (1 ≤ i ≤ n) is a single character drawn from the Chinese character vocabulary and n is the sequence length. For any character c_i in X, to introduce words related to c_i's context, an external dictionary L_x is required. By setting a word matching window l_w, every text fragment of the sentence that contains c_i and has length at most l_w is matched against the words in L_x. If such a fragment appears in L_x, it is treated as a context-related candidate word of c_i. Since several fragments containing c_i may appear in the dictionary, this finally yields the candidate matching word set of c_i, ws(c_i) = {w_1, w_2, …, w_m}, where w_j (1 ≤ j ≤ m) denotes a matching word.
After obtaining the candidate matching word set ws(c_i), further screening is needed: any word that is a substring of another word in the set is filtered out of the set. The reasons are: 1) a complete word generally better matches the character's context information — for example, in "南京市长江大桥" ("Nanjing Yangtze River Bridge"), "长江大桥" (Yangtze River Bridge) is a better candidate word for the character "长" than its substring "长江" (Yangtze River); 2) it reduces interference during the attention-based fusion of dictionary and pinyin features, making attention more likely to select the word that best fits the character's context from the candidate list.
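The dictionary matching and substring screening described above can be sketched as follows; the toy lexicon and window size are assumptions for illustration, not the patent's dictionary.

```python
from typing import Dict, List, Set

def match_words(sentence: str, lexicon: Set[str], window: int) -> Dict[int, List[str]]:
    """For each character index, collect dictionary words of length <= window
    that cover it, then drop candidates that are substrings of other candidates."""
    cands: Dict[int, List[str]] = {i: [] for i in range(len(sentence))}
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + window, len(sentence)) + 1):
            frag = sentence[start:end]
            if frag in lexicon:
                for i in range(start, end):   # frag covers characters start..end-1
                    cands[i].append(frag)
    for i, words in cands.items():
        # screening step: remove any word that is a substring of another match
        cands[i] = [w for w in words if not any(w != o and w in o for o in words)]
    return cands

# Toy lexicon (an assumption for this example)
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
cands = match_words("南京市长江大桥", lexicon, window=4)
```

For the character "长" (index 3), "长江" is screened out as a substring of "长江大桥", while "市长" survives.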
The screened matching word set ws(c_i) is mapped to word vectors through the word vector lookup table e_w, giving the matching word feature encoding WE(c_i):

WE(c_i) = e_w(ws(c_i))
where e_w is taken from pre-trained word vectors and remains fixed during training. Next, the position of the character within each matched word is marked. Let B denote that character c_i is at the beginning of the word, M that it is in the middle, and E that it is at the end. Since c_i receives different segmentation markers when matching different words, the markers of c_i in its matched words are also merged into the dictionary features, further highlighting the differences between matched words. For any word w_j in the candidate matching word set ws(c_i), let seg(w_j) ∈ {B, M, E} denote the segmentation marker of c_i in w_j, and let START(w_j) and END(w_j) denote the indices of w_j's start and end positions in sequence X. Then seg(w_j) is defined as:

seg(w_j) = B if i = START(w_j); M if START(w_j) < i < END(w_j); E if i = END(w_j)

Applying the above formula to all words in ws(c_i) yields segs(c_i):
where segs(c_i) denotes the set of segmentation markers of c_i over all its matched words. Through the segmentation marker vector lookup table e_seg, the markers in segs(c_i) are mapped to one-hot segmentation marker encodings SEGE(c_i):

SEGE(c_i) = e_seg(segs(c_i))
Each dimension of the one-hot vector corresponds to one element of the set {B, M, E}: [1,0,0] corresponds to B, [0,1,0] to M, and [0,0,1] to E.
The segmentation marker encoding SEGE(c_i) and the matching word feature encoding WE(c_i) are concatenated along the encoding dimension to obtain the final dictionary feature encoding of c_i, LE(c_i):

LE(c_i) = [SEGE(c_i); WE(c_i)]
The pinyin feature construction stage corresponds to step (2) of the technical scheme. The specific implementation is as follows: counting the neutral tone, pinyin has five tones, e.g. "chang", "chāng", "cháng", "chǎng" and "chàng". Consider extracting entities from "南京市长江大桥": when "长" is pronounced "cháng", the sentence segments as "南京市 / 长江大桥" and "长江大桥" (Yangtze River Bridge) is extracted as a location entity; when "长" is pronounced "zhǎng", the sentence segments as "南京市长 / 江大桥" ("Nanjing mayor / Jiang Daqiao") and "江大桥" is extracted as a person-name entity. This illustrates how the pinyin of the characters in a sentence affects entity extraction accuracy.
For any character c_i in the input sequence X, after obtaining the candidate word set ws(c_i), Chinese pinyin software (e.g. pypinyin) annotates c_i according to its meaning in each matched word, yielding the pinyin set pys(c_i) corresponding to ws(c_i). Then the pinyin vector lookup table e_py maps each pinyin in pys(c_i) to a pinyin vector, giving the pinyin feature encoding PYE(c_i):

PYE(c_i) = e_py(pys(c_i))
Here the pinyin vector lookup table e_py is obtained by converting an external Chinese corpus (e.g. the Chinese Wikipedia corpus) into pinyin with Chinese pinyin software and then training with word2vec's Skip-gram method. In the data preprocessing stage before vector training, since the external corpus may contain numbers, English, or other fragments without pinyin, English is converted to "[ENG]", digits to "[DIGIT]", and other characters without pinyin to "[UNK]".
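The pinyin lookup of step (2) can be sketched with a toy table standing in for the word2vec-trained e_py (in the patent, readings come from pinyin software such as pypinyin, and non-pinyin tokens are normalized to "[ENG]", "[DIGIT]", "[UNK]"); the vectors below are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the pinyin vector lookup table e_py (trained with
# word2vec/Skip-gram on a pinyin-converted corpus in the patent).
PINYIN_TABLE = {
    "chang2": np.array([0.1, 0.3]),   # 长 as in 长江 (cháng)
    "zhang3": np.array([0.7, -0.2]),  # 长 as in 市长 (zhǎng)
    "[UNK]": np.array([0.0, 0.0]),    # fallback for characters without pinyin
}

def pinyin_feature(pinyins):
    """Stack the pinyin vectors of a character's candidate readings: PYE(c_i)."""
    return np.stack([PINYIN_TABLE.get(p, PINYIN_TABLE["[UNK]"]) for p in pinyins])

# Readings of "长" taken from its matched words 市长 and 长江大桥
pye = pinyin_feature(["zhang3", "chang2"])
```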
An example of dictionary and pinyin feature construction is shown in FIG. 2, which gives the matching results for "市" and "长", where w_{i,j} denotes the word formed by the fragment c_i, c_{i+1}, …, c_j. Note that "长江" (Yangtze River) does not appear in the matching results for "长", because it is a substring of "长江大桥" (Yangtze River Bridge) and is therefore filtered out.
The dictionary and pinyin feature fusion stage corresponds to step (3) of the technical scheme. The specific implementation is as follows: to avoid overfitting when the entity extraction dataset of a vertical domain is small, the invention uses the Chinese pre-trained language model BERT to provide semantic support and improve generalization. The input sequence X = {c_1, c_2, …, c_n} is fed into BERT, and the output of its last layer is taken as the sequence encoding X_h = [x_1, x_2, …, x_n], where d_x is the BERT encoding dimension, R denotes the real numbers, x_i ∈ R^{d_x} is a real column vector, and X_h ∈ R^{d_x×n}. The previously constructed dictionary and pinyin features of c_i are concatenated along the encoding dimension to obtain the fused feature LPE(c_i):

LPE(c_i) = [LE(c_i); PYE(c_i)]
Suppose the word vector lookup table e_w has encoding dimension d_w, the pinyin vector lookup table e_py has encoding dimension d_py, and the candidate matching word set ws(c_i) has size m, so that LPE(c_i) ∈ R^{(d_w+3+d_py)×m}. LPE(c_i) is fused into the character encoding x_i by a point-wise attention mechanism, in which x_i plays the role of the query and LPE(c_i) supplies the keys and values. First, LPE(c_i) is linearly mapped to LPE_ikv, whose encoding dimension matches that of x_i:

LPE_ikv = W_l · LPE(c_i) + b_l

where the training parameters are W_l ∈ R^{d_x×(d_w+3+d_py)} and b_l ∈ R^{d_x}, and the mapped fused feature LPE_ikv ∈ R^{d_x×m}. Let unsqueeze(M, y) denote expanding the y-th dimension of matrix M and squeeze(M, y) denote compressing the y-th dimension of M; unsqueeze(x_i, 0) thus converts x_i from R^{d_x} to R^{1×d_x}. The attention weight LPE_iw is then computed as:
LPE_iw = softmax(unsqueeze(x_i, 0) · LPE_ikv)

where the attention weight LPE_iw ∈ R^{1×m}, and the softmax normalization makes the weights sum to 1. The attention output LPE_io is then computed as a weighted sum over LPE_ikv using LPE_iw:
LPE_io = squeeze(LPE_ikv · transpose(LPE_iw), 1)

where the attention output LPE_io ∈ R^{d_x}. Finally, LPE_io is added to the character encoding x_i as the final semantic encoding of character c_i, expressed as:

x_i = LPE_io + x_i
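The point-wise attention fusion above can be sketched in numpy; the dimensions and random values are illustrative assumptions, while the linear map, softmax weighting, weighted sum, and residual addition follow the description.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(x_i, lpe, W, b):
    """Point-wise attention fusion sketch.
    x_i: (d_x,) BERT character code (query); lpe: (d_lpe, m) fused dict+pinyin
    features (keys/values); W: (d_x, d_lpe), b: (d_x, 1) linear-map parameters."""
    lpe_kv = W @ lpe + b                # (d_x, m): mapped keys/values, LPE_ikv
    weights = softmax(x_i @ lpe_kv)     # (m,): attention over the m candidates
    lpe_out = lpe_kv @ weights          # (d_x,): weighted sum, LPE_io
    return x_i + lpe_out                # residual add: final semantic encoding

rng = np.random.default_rng(0)
d_x, d_lpe, m = 4, 5, 3                 # illustrative dimensions
x_i = rng.normal(size=d_x)
out = fuse(x_i, rng.normal(size=(d_lpe, m)), rng.normal(size=(d_x, d_lpe)), np.zeros((d_x, 1)))
```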
The feature sequence modeling stage corresponds to step (4) of the technical scheme. The specific implementation is as follows: the Transformer's self-attention mechanism cannot itself capture sequence position information; the pre-trained language model BERT mitigates this by adding trainable absolute position encodings to its input, but still lacks sequence-order-dependent modeling. The Long Short-Term Memory network (LSTM) needs no position encoding: its structure, which encodes recursively in sequence order, can learn sequence position information. The character semantic encodings fused with dictionary and pinyin features in the previous step are therefore fed into two bidirectional LSTM models (BiLSTMs) for feature sequence modeling: the output of one BiLSTM serves the sequence-labeling-based Chinese named entity fragment extraction auxiliary task of step (5), and the output of the other serves the pointer-labeling-based Chinese named entity extraction main task of step (5). Each BiLSTM consists of a forward and a backward LSTM, and the two tasks' BiLSTMs are independent and share no training parameters.
Suppose that at time step t, the forward LSTM hidden state output of the sequence-labeling auxiliary task's BiLSTM is fh_t^a and its backward LSTM hidden state output is bh_t^a; adding them gives the auxiliary task's BiLSTM hidden state at time step t, h_t^a = fh_t^a + bh_t^a. Likewise, for the pointer-labeling main task, the forward hidden state fh_t^b and backward hidden state bh_t^b are added to give h_t^b = fh_t^b + bh_t^b. Finally, the feature sequence modeling output of the sequence labeling auxiliary task is H^a = [h_1^a, h_2^a, …, h_n^a] ∈ R^{d_h×n}, and that of the pointer labeling main task is H^b = [h_1^b, h_2^b, …, h_n^b] ∈ R^{d_h×n}, where d_h denotes the LSTM encoding dimension.
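The sequence-order recursion that gives the (Bi)LSTM its position awareness can be sketched in numpy; the cell equations are the standard LSTM gates, and the forward and backward hidden states are summed at each step as described above. The random weights and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(xs, W, U, b, reverse=False):
    """Single LSTM pass. xs: (n, d_in); W: (4h, d_in); U: (4h, h); b: (4h,).
    Returns the hidden states (n, h), produced recursively in sequence order."""
    h_dim = U.shape[1]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    out = np.zeros((len(xs), h_dim))
    for t in order:
        z = W @ xs[t] + U @ h + b
        i, f, g, o = np.split(z, 4)                    # input/forget/cell/output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
        out[t] = h
    return out

def bilstm(xs, fw, bw):
    """Sum of forward and backward hidden states, h_t = fh_t + bh_t."""
    return lstm(xs, *fw) + lstm(xs, *bw, reverse=True)

rng = np.random.default_rng(1)
n, d_in, h = 6, 4, 3
params = lambda: (rng.normal(size=(4 * h, d_in)), rng.normal(size=(4 * h, h)), np.zeros(4 * h))
H = bilstm(rng.normal(size=(n, d_in)), params(), params())
```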
The multi-labeling-framework joint learning stage corresponds to step (5) of the technical scheme. The specific implementation is as follows: sequence labeling and pointer labeling are two common labeling frameworks for named entity extraction. Sequence labeling marks the position of every character of the text sequence within entities; FIG. 3 shows BMOES labeling of an example sentence, where B marks a character at the beginning of a named entity fragment, M a character in the middle, O a character outside any fragment, E a character at the end, and S a character that is itself a complete named entity fragment. The example sentence contains two entities, "南京市" (Nanjing City) and "长江大桥" (Yangtze River Bridge). Pointer labeling marks the entity types of only the head and tail characters of each entity fragment in the text sequence, as shown in FIG. 4, where both "南京市" and "长江大桥" are location (Loc) entities.
Sequence labeling models full-sequence dependencies, so the extracted entities tend to be more complete and precision is generally higher; pointer labeling only classifies the entity types of the head and tail Chinese characters of each entity fragment, so it resists noise interference better, is more robust, and generally achieves higher recall. To combine the advantages of the two labeling frameworks, $H^{a}$ is taken as the input of the sequence-labeling auxiliary task and $H^{b}$ as the input of the pointer-labeling main task, and a multi-task learning model, such as the Multi-gate Mixture-of-Experts (MMoE) model or the Progressive Layered Extraction (PLE) model, performs joint learning on the sequence-labeling-based Chinese named entity fragment extraction auxiliary task and the pointer-labeling-based Chinese named entity extraction main task, yielding the sequence-labeling auxiliary task output $X_a$ and the pointer-labeling main task output $X_b$.
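A hedged, minimal sketch of MMoE-style routing as mentioned above: all tasks share a pool of experts, and each task mixes the expert outputs through its own softmax gate. In a real MMoE the gate logits are computed from the input by trained parameters; here they are fixed toy values, and the expert outputs are made-up vectors rather than the patent's trained model.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mmoe_task_output(expert_outputs, gate_logits):
    """Weighted sum of expert output vectors using task-specific gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * expert[d] for w, expert in zip(weights, expert_outputs))
        for d in range(dim)
    ]

experts = [[1.0, 0.0], [0.0, 1.0]]                 # two toy expert outputs
seq_task = mmoe_task_output(experts, [2.0, 0.0])   # gate favoring expert 0
ptr_task = mmoe_task_output(experts, [0.0, 2.0])   # gate favoring expert 1
```

Each task thus receives its own mixture of the shared experts, which is the mechanism that lets the auxiliary and main tasks share knowledge without sharing one representation.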
The output layer sequence modeling stage corresponds to step (6) of the technical scheme. The specific implementation is as follows: a Dropout layer is applied to the $X_a$ and $X_b$ obtained in the previous step to prevent the model from overfitting. Then $X_a$ (after Dropout) is input into a Conditional Random Field (CRF), and the likelihood probability $p(y \mid X_a)$ of a BMOES label index sequence $y \in \mathbb{Z}^n$ of the sequence-labeling-based Chinese named entity fragment extraction auxiliary task is computed:

$$p(y \mid X_a) = \frac{\exp\Big(\sum_{t=1}^{n} \big( X_{a,t} \cdot W_{CRF}^{y_t} + b_{CRF}^{y_{t-1}, y_t} \big)\Big)}{\sum_{y' \in \mathcal{Y}_X} \exp\Big(\sum_{t=1}^{n} \big( X_{a,t} \cdot W_{CRF}^{y'_t} + b_{CRF}^{y'_{t-1}, y'_t} \big)\Big)}$$
where $\mathcal{Y}_X$ represents the set of all possible BMOES label index sequences of $X$ under this task and $y' \in \mathbb{Z}^n$ is any such sequence. The training parameters are $W_{CRF} \in \mathbb{R}^{d_h \times 5}$ and $b_{CRF} \in \mathbb{R}^{5 \times 5}$ (the BMOES sequence labeling scheme has 5 tags); $W_{CRF}^{y_t}$ denotes the training parameters in $W_{CRF}$ corresponding to label $y_t$, $b_{CRF}^{y_{t-1}, y_t}$ denotes the training parameters in $b_{CRF}$ for the transition from label $y_{t-1}$ to label $y_t$, and $W_{CRF}^{y'_t}$, $b_{CRF}^{y'_{t-1}, y'_t}$ are defined analogously. Let the true BMOES label index sequence of the sequence-labeling auxiliary task be $y_{span} \in \mathbb{Z}^n$, where $\mathbb{Z}$ denotes the integers; substituting it into the formula above gives the log-likelihood loss of the sequence-labeling auxiliary task:

$$\mathcal{L}_{span} = -\log p(y_{span} \mid X_a)$$
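The CRF likelihood above can be illustrated by brute force on a tiny input: the score of a label sequence is the sum of its emission scores and label-transition scores, normalized over every possible sequence. All numbers are toy values, and a real implementation would use the forward algorithm rather than enumeration.

```python
import math
from itertools import product

LABELS = ["B", "M", "O", "E", "S"]  # the 5 BMOES tags

def sequence_score(emissions, transitions, labels):
    """Emission scores per position plus transition scores between labels."""
    score = sum(emissions[t][lab] for t, lab in enumerate(labels))
    score += sum(transitions[(a, b)] for a, b in zip(labels, labels[1:]))
    return score

def crf_probability(emissions, transitions, labels):
    """Exponentiated score of one sequence over the sum across all sequences."""
    n = len(emissions)
    z = sum(
        math.exp(sequence_score(emissions, transitions, seq))
        for seq in product(LABELS, repeat=n)
    )
    return math.exp(sequence_score(emissions, transitions, labels)) / z

# Toy 2-character input: position 0 prefers "B", position 1 prefers "E";
# all transition scores are zero for simplicity.
emissions = [{l: (0.5 if l == "B" else 0.0) for l in LABELS},
             {l: (0.5 if l == "E" else 0.0) for l in LABELS}]
transitions = {(a, b): 0.0 for a in LABELS for b in LABELS}
p = crf_probability(emissions, transitions, ["B", "E"])
```

The probabilities over all $5^2$ sequences sum to 1, and the sequence matching both emission preferences receives the highest probability.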
Then $X_b$ (after Dropout) is linearly mapped into the label space of the pointer-labeling-based Chinese named entity extraction main task, and a softmax layer computes the probability distributions $p_{start}$ and $p_{end}$ of each Chinese character over the labels:

$$p_{start} = \mathrm{softmax}(X_b W_{start} + b_{start}), \qquad p_{end} = \mathrm{softmax}(X_b W_{end} + b_{end})$$
where $W_{start}, W_{end} \in \mathbb{R}^{d_h \times (c_e+1)}$ and $b_{start}, b_{end} \in \mathbb{R}^{c_e+1}$ are training parameters, and $c_e + 1$ is the number of entity types $c_e$ plus the non-entity type. $p_{start}$ is the predicted probability distribution of the entity type of the first Chinese character of an entity fragment, and $p_{end}$ is the predicted probability distribution of the entity type of the last Chinese character of an entity fragment. Let the true entity type label index sequence of the entity fragment head Chinese characters be $y_{start} \in \mathbb{Z}^n$ and that of the entity fragment tail Chinese characters be $y_{end} \in \mathbb{Z}^n$; the Cross Entropy (CE) losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ of the pointer-labeling main task are computed as

$$\mathcal{L}_{start} = -\frac{1}{n} \sum_{i=1}^{n} \log p_{start}^{i,\, y_{start}^{i}}, \qquad \mathcal{L}_{end} = -\frac{1}{n} \sum_{i=1}^{n} \log p_{end}^{i,\, y_{end}^{i}}$$
where $y_{start}^{i}$ denotes the true entity type label index of the $i$-th Chinese character, $p_{start}^{i,\, y_{start}^{i}}$ denotes the probability predicted by $p_{start}$ that the $i$-th Chinese character is of entity type $y_{start}^{i}$, and $y_{end}^{i}$, $p_{end}^{i,\, y_{end}^{i}}$ are defined analogously.
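A minimal sketch of the pointer-labeling loss above: each character's logits over ($c_e$ entity types + 1 non-entity type) pass through softmax, and the cross-entropy loss averages the negative log-probability of the true label. The logits, label indices, and function name below are toy assumptions, not the patent's trained values.

```python
import math

def softmax(logits):
    m = max(logits)                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pointer_cross_entropy(logits_per_char, true_label_indices):
    """Mean negative log-likelihood of the true labels across characters."""
    total = 0.0
    for logits, y in zip(logits_per_char, true_label_indices):
        probs = softmax(logits)
        total += -math.log(probs[y])
    return total / len(logits_per_char)

# 3 characters; label indices: 0 = non-entity, 1 = Loc, 2 = Per (toy setup).
logits = [[0.1, 3.0, 0.2], [2.5, 0.0, 0.0], [0.3, 0.1, 2.0]]
y_start = [1, 0, 2]                  # true start-label index per character
loss = pointer_cross_entropy(logits, y_start)
```

Since every true label already has the largest logit in this toy example, the averaged loss is small but strictly positive.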
Finally, after obtaining the loss $\mathcal{L}_{span}$ of the sequence-labeling auxiliary task and the losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ of the pointer-labeling main task, the three losses are fused into the overall training objective $\mathcal{L}$ that the model minimizes for end-to-end joint training:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{span} + \lambda_2 \mathcal{L}_{start} + \lambda_3 \mathcal{L}_{end}$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyper-parameters controlling the influence of each task on the overall training objective. In the test phase, the indices corresponding to the maximum of each Chinese character's predicted label probability distribution in $p_{start}$ and $p_{end}$ are taken as the predicted label indices $\hat{y}_{start}^{i}$ and $\hat{y}_{end}^{i}$:

$$\hat{y}_{start}^{i} = \arg\max_{c}\, p_{start}^{i,c}, \qquad \hat{y}_{end}^{i} = \arg\max_{c}\, p_{end}^{i,c}$$
Then each entity fragment head Chinese character is matched with the entity fragment tail Chinese character that has the same entity type and the nearest position, and the entities in the sequence are extracted.
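The test-time decoding just described can be sketched as follows: take the argmax label per character from $p_{start}$ and $p_{end}$, then pair each entity-head character with the nearest following tail character of the same entity type. The greedy pairing rule and label strings are illustrative assumptions; the example mirrors the patent's Fig. 4 sentence with two Loc entities.

```python
def decode_entities(start_labels, end_labels):
    """start_labels / end_labels: per-character predicted labels ('O' = none)."""
    entities = []
    for i, s_lab in enumerate(start_labels):
        if s_lab == "O":
            continue
        # match the nearest tail at or after the head with the same entity type
        for j in range(i, len(end_labels)):
            if end_labels[j] == s_lab:
                entities.append((i, j, s_lab))
                break
    return entities

# "Nanjing City" (chars 0-2) and "Changjiang River Bridge" (chars 3-6),
# both predicted as Loc entities.
starts = ["Loc", "O", "O", "Loc", "O", "O", "O"]
ends   = ["O", "O", "Loc", "O", "O", "O", "Loc"]
print(decode_entities(starts, ends))  # → [(0, 2, 'Loc'), (3, 6, 'Loc')]
```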
The invention provides a Chinese named entity extraction method based on a multi-label framework and fusion features. To verify its effectiveness, the method is evaluated on the OntoNotes 4, MSRA, Resume and Weibo datasets in terms of precision (P), recall (R) and F1, and compared with other Chinese named entity extraction methods.
The model optimizer is Adaptive moment estimation (Adam); the learning rate of the BERT training parameters is set to 3e-5 and that of the other model parameters to 1e-3. The BERT encoding dimension is $d_x = 768$. The multi-task learning model is the Progressive Layered Extraction (PLE) model; the numbers of task-specific experts and shared experts in PLE are both set to 2, each expert is a single-layer fully connected network, and the number of PLE layers is 2. The number of LSTM layers is 1, the LSTM encoding dimension is $d_h = 768$, the word vector encoding dimension is $d_w = 50$, and the pinyin vector encoding dimension is $d_{py} = 50$.
Table 1 compares the accuracy of different Chinese named entity extraction methods on the OntoNotes 4 dataset; Table 2 compares their accuracy on the MSRA dataset; Table 3 on the Resume dataset; and Table 4 on the Weibo dataset. The experimental results in the tables show that, compared with other Chinese named entity extraction methods, the proposed method achieves the best Chinese named entity extraction accuracy on most datasets and index items. Fig. 5(a)(b) shows the effect of the dictionary matching window size on accuracy on the OntoNotes 4 and MSRA datasets, and fig. 6(a)(b) shows the corresponding effect on the Resume and Weibo datasets; evaluating and analyzing the influence of the dictionary matching window size on Chinese named entity extraction accuracy provides guidance for choosing the window size in different subsequent application scenarios.
TABLE 1 comparison of accuracy of different entity extraction methods on the OntoNotes 4 dataset
TABLE 2 comparison of accuracy rates of different entity extraction methods on MSRA datasets
TABLE 3 comparison of the accuracy of the extraction methods for different entities on the Resume dataset
TABLE 4 comparison of accuracy of different entity extraction methods on Weibo data set
Claims (5)
1. A Chinese named entity extraction method based on a multi-label frame and fusion features comprises the following steps:
(1) performing word matching on each Chinese character in an input Chinese character sequence in an external dictionary, mapping words into word vectors by using a word vector query table, mapping word segmentation marks of the Chinese characters in the words into word marking vectors by using a word segmentation mark vector query table, and splicing the word segmentation mark vectors and the word vectors to form dictionary features;
(2) according to the meaning of the Chinese characters in the matching words, the Chinese characters are marked with pinyin, and pinyin characteristics are obtained by mapping the pinyin through a pinyin vector lookup table;
(3) fusing the dictionary features and the pinyin features into Chinese character codes obtained by a Chinese pre-training language model BERT based on a point-by-point attention mechanism, and providing Chinese character semantic codes combining the dictionary features and the pinyin features for follow-up;
(4) inputting the Chinese character semantic encodings into two independent bidirectional long short-term memory network models respectively for feature sequence modeling, and outputting a first feature sequence encoding $H^{a}$ and a second feature sequence encoding $H^{b}$ respectively;
(5) taking sequence labeling as an auxiliary task and pointer labeling as a main task, taking the first feature sequence encoding $H^{a}$ as the input of the sequence-labeling auxiliary task and the second feature sequence encoding $H^{b}$ as the input of the pointer-labeling main task, and performing joint learning on the sequence-labeling auxiliary task and the pointer-labeling main task with a multi-task learning model;
(6) calculating, with a conditional random field, the log-likelihood loss $\mathcal{L}_{span}$ of the sequence-labeling auxiliary task, together with the entity type classification cross entropy loss $\mathcal{L}_{start}$ of entity fragment head Chinese characters and the entity type classification cross entropy loss $\mathcal{L}_{end}$ of entity fragment tail Chinese characters in the pointer-labeling main task; taking the weighted sum of the above losses as the overall training objective $\mathcal{L}$ that the model minimizes; performing end-to-end joint training; and, in the test phase, extracting the entity fragments and their types in a sentence through the pointer-labeling main task.
2. The method for extracting Chinese named entities based on multi-label frame and fusion features as claimed in claim 1, wherein in step (1), the external dictionary and word vector lookup table is derived from pre-training word vectors published on the Internet, and the word segmentation label vector lookup table is composed of one-hot vectors.
3. The method as claimed in claim 1, wherein in step (2), the pinyin vector lookup table is obtained by word2vec training based on the external chinese corpus, and the text in the external chinese corpus is converted into pinyin by using chinese pinyin software.
4. The method for extracting named entity in Chinese based on multi-label frame and fusion features as claimed in claim 1, wherein in step (5), the sequence label assisting task uses BMOES without entity type to label the entities in the input sentence, which is responsible for the extraction of named entity fragment in Chinese, and the extracted entity fragment has no type; the pointer labeling main task only carries out entity type labeling on the head and tail Chinese characters of the entity fragment in the sentence and is responsible for extracting the named Chinese entity, and the extracted entity has a type.
5. The method for extracting named entities in Chinese based on multi-label frame and fusion features as claimed in claim 1, wherein in step (6), the test stage takes the label corresponding to the maximum value of the predicted probability distribution of each Chinese character entity type as the predicted label of the Chinese character, then matches the tail Chinese character of the entity segment with the same type as the Chinese character at the head of the entity segment and the closest position distance, and extracts the text segment between the head Chinese character of the entity segment and the tail Chinese character of the entity segment as the entity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110511025.8A CN113190656B (en) | 2021-05-11 | 2021-05-11 | Chinese named entity extraction method based on multi-annotation frame and fusion features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110511025.8A CN113190656B (en) | 2021-05-11 | 2021-05-11 | Chinese named entity extraction method based on multi-annotation frame and fusion features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113190656A true CN113190656A (en) | 2021-07-30 |
CN113190656B CN113190656B (en) | 2023-07-14 |
Family
ID=76981067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110511025.8A Active CN113190656B (en) | 2021-05-11 | 2021-05-11 | Chinese named entity extraction method based on multi-annotation frame and fusion features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113190656B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114036933A (en) * | 2022-01-10 | 2022-02-11 | 湖南工商大学 | Information extraction method based on legal documents |
CN114139541A (en) * | 2021-11-22 | 2022-03-04 | 北京中科闻歌科技股份有限公司 | Named entity identification method, device, equipment and medium |
CN115146644A (en) * | 2022-09-01 | 2022-10-04 | 北京航空航天大学 | Multi-feature fusion named entity identification method for warning situation text |
CN115470871A (en) * | 2022-11-02 | 2022-12-13 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10032451B1 (en) * | 2016-12-20 | 2018-07-24 | Amazon Technologies, Inc. | User recognition for speech processing systems |
US10140973B1 (en) * | 2016-09-15 | 2018-11-27 | Amazon Technologies, Inc. | Text-to-speech processing using previously speech processed data |
CN109446521A (en) * | 2018-10-18 | 2019-03-08 | 京东方科技集团股份有限公司 | Name entity recognition method, device, electronic equipment, machine readable storage medium |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111462752A (en) * | 2020-04-01 | 2020-07-28 | 北京思特奇信息技术股份有限公司 | Client intention identification method based on attention mechanism, feature embedding and BI-L STM |
CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on L attice-L STM |
- 2021-05-11: application CN202110511025.8A granted as patent CN113190656B (en), status Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10140973B1 (en) * | 2016-09-15 | 2018-11-27 | Amazon Technologies, Inc. | Text-to-speech processing using previously speech processed data |
US10032451B1 (en) * | 2016-12-20 | 2018-07-24 | Amazon Technologies, Inc. | User recognition for speech processing systems |
CN109446521A (en) * | 2018-10-18 | 2019-03-08 | 京东方科技集团股份有限公司 | Name entity recognition method, device, electronic equipment, machine readable storage medium |
CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on L attice-L STM |
CN111462752A (en) * | 2020-04-01 | 2020-07-28 | 北京思特奇信息技术股份有限公司 | Client intention identification method based on attention mechanism, feature embedding and BI-L STM |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
Non-Patent Citations (3)
Title |
---|
FENIL DOSHI等: "Normalizing Text using Language Modelling based on Phonetics and String Similarity", ARXIV, pages 1 - 9 * |
H PENG等: "Phonetic-enriched Text Representation for Chinese Sentiment Analysis with Reinforcement Learning", IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, pages 1 - 16 * |
JIANG Tao: "Research and Application of Key Technologies for Named Entity Recognition in Electronic Medical Records Based on Deep Neural Networks", China Masters' Theses Full-text Database (Medicine and Health Sciences), no. 7, pages 053 - 210 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114139541A (en) * | 2021-11-22 | 2022-03-04 | 北京中科闻歌科技股份有限公司 | Named entity identification method, device, equipment and medium |
CN114036933A (en) * | 2022-01-10 | 2022-02-11 | 湖南工商大学 | Information extraction method based on legal documents |
CN114036933B (en) * | 2022-01-10 | 2022-04-22 | 湖南工商大学 | Information extraction method based on legal documents |
CN115146644A (en) * | 2022-09-01 | 2022-10-04 | 北京航空航天大学 | Multi-feature fusion named entity identification method for warning situation text |
CN115470871A (en) * | 2022-11-02 | 2022-12-13 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
CN115470871B (en) * | 2022-11-02 | 2023-02-17 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
Also Published As
Publication number | Publication date |
---|---|
CN113190656B (en) | 2023-07-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN108416058B (en) | Bi-LSTM input information enhancement-based relation extraction method | |
CN110377903B (en) | Sentence-level entity and relation combined extraction method | |
CN110008469B (en) | Multilevel named entity recognition method | |
CN113190656B (en) | Chinese named entity extraction method based on multi-annotation frame and fusion features | |
CN111666758B (en) | Chinese word segmentation method, training device and computer readable storage medium | |
CN111767718B (en) | Chinese grammar error correction method based on weakened grammar error feature representation | |
CN110688862A (en) | Mongolian-Chinese inter-translation method based on transfer learning | |
CN112926324B (en) | Vietnamese event entity recognition method integrating dictionary and anti-migration | |
CN114757182A (en) | BERT short text sentiment analysis method for improving training mode | |
CN115081437B (en) | Machine-generated text detection method and system based on linguistic feature contrast learning | |
CN112183064B (en) | Text emotion reason recognition system based on multi-task joint learning | |
CN110852089B (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN116151256A (en) | Small sample named entity recognition method based on multitasking and prompt learning | |
CN111368542A (en) | Text language association extraction method and system based on recurrent neural network | |
CN116432655B (en) | Method and device for identifying named entities with few samples based on language knowledge learning | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN115831102A (en) | Speech recognition method and device based on pre-training feature representation and electronic equipment | |
CN114912453A (en) | Chinese legal document named entity identification method based on enhanced sequence features | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN117196032A (en) | Knowledge graph construction method and device for intelligent decision, electronic equipment and storage medium | |
CN113191150A (en) | Multi-feature fusion Chinese medical text named entity identification method | |
CN112818698A (en) | Fine-grained user comment sentiment analysis method based on dual-channel model | |
CN115186670B (en) | Method and system for identifying domain named entities based on active learning | |
CN113139050B (en) | Text abstract generation method based on named entity identification additional label and priori knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||