CN113190656A - Chinese named entity extraction method based on multi-label framework and fusion features - Google Patents


Info

Publication number
CN113190656A
Authority
CN (China)
Prior art keywords
Chinese, entity, sequence, Chinese character, pinyin
Legal status
Granted
Application number
CN202110511025.8A
Other languages
Chinese (zh)
Other versions
CN113190656B
Inventors
麦丞程 (Mai Chengcheng), 刘健 (Liu Jian), 黄宜华 (Huang Yihua)
Current assignee
Nanjing University
Original assignee
Nanjing University
Application filed by Nanjing University
Priority to CN202110511025.8A
Publication of CN113190656A
Application granted; publication of CN113190656B
Current legal status: Active


Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/216 Parsing using statistical methods
    • G06F40/242 Dictionaries
    • G06F40/295 Named entity recognition
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese named entity extraction method based on a multi-label framework and fusion features. First, word information and segmentation-marker information are introduced for each Chinese character through dictionary matching, and dictionary features are constructed from them. On this basis, Chinese pinyin software annotates each Chinese character with pinyin according to its meaning in the matched words, and pinyin features are constructed. Then, based on a point-wise attention mechanism, the dictionary features and pinyin features are fused into the Chinese character encodings, yielding Chinese character semantic encodings that combine the two feature types and improving the recognition of Chinese named entity boundaries. Finally, combining the advantages of sequence labeling and pointer labeling, a multi-task learning model jointly learns the two labeling tasks, improving the accuracy of Chinese named entity extraction.

Description

Chinese named entity extraction method based on multi-label framework and fusion features
Technical Field
The invention belongs to the fields of artificial intelligence and natural language processing, and particularly relates to a Chinese named entity extraction method based on a multi-label framework and fusion features.
Background
With the rapid development of Internet technology, data across industries has grown explosively, driving intelligent analysis and mining services and innovative applications for industrial big data, and further promoting the development of the digital economy in China. This data contains large amounts of unstructured text, and extracting structured, useful information from unstructured text has become a key industry concern. It involves a fundamental task in natural language processing: named entity extraction.
Early research on named entity recognition relied mainly on dictionary- and rule-based methods, in which linguists and domain experts manually construct domain dictionaries and rule templates from dataset characteristics. The advantage of the rule-based approach is that the rules can be iteratively updated to extract the target entities as needed. Its disadvantages are that, for complex domains and application scenarios, manually building rules is costly, and as the rule base grows, rule conflicts easily arise, making the rule base hard to maintain and extend and unable to adapt to changes in data and domains.
Subsequently, research focused on named entity recognition based on statistical machine learning, in which named entity recognition is framed as a sequence labeling problem. The statistical machine learning methods applied to NER mainly include the maximum entropy model, the hidden Markov model, the maximum entropy Markov model, and conditional random fields. These methods depend on manually constructed features, a relatively laborious process.
In recent years, with the continued development of deep learning, more and more work in named entity recognition has been based on Deep Neural Networks (DNNs). DNN-based named entity recognition requires no elaborate feature engineering, and its performance far exceeds that of traditional rule-based and statistical machine learning methods.
Recognizing Chinese named entities is harder than recognizing English ones: Chinese text lacks delimiters such as the spaces in English text and has no obvious morphological cues, which easily causes boundary ambiguity. In addition, Chinese exhibits polysemy: in different domains or contexts, the same word takes on different meanings, which must be disambiguated by fully exploiting contextual information. Chinese also has linguistic phenomena such as ellipsis and abbreviation, which pose further challenges for Chinese named entity recognition. Existing Chinese named entity extraction methods make little use of word information and rely on a single labeling framework with significant limitations, which hurts the precision of Chinese named entity extraction.
Disclosure of Invention
Purpose of the invention: in view of the problems and shortcomings of the prior art, the invention provides a Chinese named entity extraction method based on a multi-label framework and fusion features, to address the limitation of traditional Chinese named entity extraction methods to a single labeling framework and the difficulty of identifying entity boundaries caused by the lack of word information.
Technical scheme: to achieve the above objective, the invention adopts a Chinese named entity extraction method based on a multi-label framework and fusion features, comprising the following steps:
(1) matching each Chinese character of an input Chinese character sequence against words in an external dictionary, mapping the matched words to word vectors with a word vector lookup table, mapping the segmentation markers of the Chinese character within the words to segmentation-marker vectors with a segmentation-marker vector lookup table, and concatenating the segmentation-marker vectors and the word vectors to form the dictionary features;
(2) annotating each Chinese character with pinyin according to its meaning in the matched words, and mapping the pinyin with a pinyin vector lookup table to obtain the pinyin features;
(3) fusing, by a point-wise attention mechanism, the dictionary features and the pinyin features into the Chinese character encodings produced by the Chinese pre-trained language model BERT, providing Chinese character semantic encodings that combine the dictionary and pinyin features for subsequent use;
(4) feeding the Chinese character semantic encodings into two independent bidirectional long short-term memory network models for feature sequence modeling, whose respective outputs are the first feature sequence encoding H_a and the second feature sequence encoding H_b;
(5) with sequence labeling as the auxiliary task and pointer labeling as the main task, using the first feature sequence encoding H_a as input to the sequence labeling auxiliary task and the second feature sequence encoding H_b as input to the pointer labeling main task, and jointly learning the two tasks with a multi-task learning model;
(6) computing, with a conditional random field, the log-likelihood loss L_crf of the sequence labeling auxiliary task, together with the entity-type classification cross-entropy loss L_start of entity-fragment head Chinese characters and the entity-type classification cross-entropy loss L_end of entity-fragment tail Chinese characters in the pointer labeling main task; taking the weighted sum of L_crf, L_start and L_end as the training objective the model must minimize, performing end-to-end joint training, and, in the test stage, labeling the entity fragments and their types in a sentence through the pointer labeling main task.
Further, in step (1), the external dictionary and the word vector lookup table are derived from pre-trained word vectors published on the Internet, and the segmentation-marker vector lookup table consists of one-hot vectors.
Further, in step (2), the pinyin vector lookup table is obtained by word2vec training on an external Chinese corpus whose text has been converted to pinyin with Chinese pinyin software.
Further, in step (5), the sequence labeling auxiliary task labels the entities in input sentences with a BMOES scheme without entity types and is responsible for extracting Chinese named entity fragments, the extracted fragments carrying no type; the pointer labeling main task labels entity types only on the head and tail Chinese characters of entity fragments in the sentence and is responsible for extracting the Chinese named entities, the extracted entities carrying types.
Further, in step (6), the test stage takes, for each Chinese character, the label corresponding to the maximum of its predicted entity-type probability distribution as the character's predicted label, then matches each entity-fragment head character to the nearest entity-fragment tail character of the same entity type, and extracts the text fragment between the head character and the tail character as an entity.
Beneficial effects: the method effectively addresses the difficulty of identifying Chinese named entity boundaries, exploits the strengths of different labeling frameworks, and improves the accuracy of Chinese named entity extraction. First, constructing dictionary and pinyin features strengthens the model's recognition of entity boundaries, and encoding the Chinese characters with the Chinese pre-trained language model BERT provides contextual semantic support for the upper layers of the model. Second, the recurrent structure of the bidirectional long short-term memory network is used for feature sequence modeling, learning sequence position information and alleviating the loss of such information caused by BERT's lack of sequence-dependency modeling. Third, sequence labeling and pointer labeling are jointly learned through a multi-task learning model, combining the advantages of different labeling frameworks, breaking the limitation of a single labeling framework, and further improving the accuracy of Chinese named entity extraction.
Drawings
FIG. 1 is an overall block diagram of the method of the present invention;
FIG. 2 is an exemplary diagram of dictionary and Pinyin feature construction in the method of the present invention;
FIG. 3 is an example diagram of sequence labeling in the method of the present invention;
FIG. 4 is an example diagram of pointer labeling in the method of the present invention;
FIG. 5(a)(b) are graphs of experimental results on the OntoNotes 4 and MSRA datasets, respectively, of the effect of the dictionary matching window size on accuracy in the method of the present invention;
FIG. 6(a)(b) are graphs of experimental results on the Resume and Weibo datasets, respectively, of the effect of the dictionary matching window size on accuracy in the method of the present invention.
Detailed Description
The present invention is further illustrated below with reference to the drawings and specific embodiments. These embodiments are illustrative only and do not limit the scope of the invention; after reading this specification, equivalent modifications by those skilled in the art fall within the scope defined by the appended claims.
The invention provides a Chinese named entity extraction method based on a multi-label framework and fusion features, solving the problems that traditional Chinese named entity extraction methods struggle to identify entity boundaries and are limited to a single labeling framework. As shown in FIG. 1, the complete method comprises six parts: a dictionary feature construction stage, a pinyin feature construction stage, a dictionary and pinyin feature fusion stage, a feature sequence modeling stage, a multi-label-framework joint learning stage, and an output layer modeling stage. Specific embodiments are described below:
The dictionary feature construction stage corresponds to step (1) of the technical scheme. The specific embodiment is as follows: consider any given input Chinese character sequence X = {c_1, c_2, …, c_n}, where c_i (1 ≤ i ≤ n) denotes a single Chinese character drawn from the Chinese character vocabulary and n denotes the sequence length. For any Chinese character c_i in the sequence X, to introduce words related to the context of c_i, an external dictionary L_x is introduced; by setting a word matching window l_w, every text fragment of the sentence that contains c_i and has length at most l_w is matched against the words in the dictionary L_x. If such a fragment appears in L_x, it is taken as a context-dependent candidate word of c_i. Since several fragments containing c_i may appear in the dictionary, this finally yields the candidate matching word set of c_i, ws(c_i) = {w_1, w_2, …, w_m}, where w_j (1 ≤ j ≤ m) denotes a matching word.
After the candidate matching word set ws(c_i) is obtained, further filtering is needed: any word in the candidate set that is a substring of another word in the set is removed. The reasons are: 1) a complete word generally better matches the information in the Chinese character's context; for example, in "南京市长江大桥" (Nanjing Yangtze River Bridge), "长江大桥" (Yangtze River Bridge) is a more suitable candidate word for "长" than "长江" (Yangtze River); 2) it reduces interference in the attention-based fusion of dictionary and pinyin features, making it more likely that attention selects, from the candidate word list, the word that best matches the character's context.
Through the word vector lookup table e_w, the words in the filtered matching word set ws(c_i) are mapped to word vectors, giving the matched-word feature encoding WE(c_i):

WE(c_i) = e_w(ws(c_i))

where e_w comes from already-trained word vectors and is kept unchanged during training. Next, the position of the Chinese character within each matched word is given a segmentation marker. Let B denote that the Chinese character c_i is at the beginning of the word, M that c_i is in the middle of the word, and E that c_i is at the end of the word. Matching c_i to different words results in different segmentation markers, so the segmentation markers of c_i in its matched words are also incorporated into the dictionary features, further highlighting the differences between the matched words. For any word w_j in the candidate matching word set ws(c_i), let seg(w_j) ∈ {B, M, E} denote the segmentation marker of c_i in w_j. With START(w_j) denoting the index of the start position of w_j in the sequence X and END(w_j) denoting the index of its end position, seg(w_j) is defined as:

seg(w_j) = B if i = START(w_j); M if START(w_j) < i < END(w_j); E if i = END(w_j)

Applying this formula to all the words in the candidate matching word set ws(c_i) gives segs(c_i):

segs(c_i) = {seg(w_j) | w_j ∈ ws(c_i)}

where segs(c_i) denotes the set of segmentation markers of c_i in all its matched words. Through the segmentation-marker vector lookup table e_seg, the markers in segs(c_i) are mapped to one-hot segmentation-marker encodings SEGE(c_i):

SEGE(c_i) = e_seg(segs(c_i))

Each dimension of the one-hot vector corresponds to one element of the set {B, M, E}: [1,0,0] corresponds to B, [0,1,0] to M, and [0,0,1] to E.

The segmentation-marker encoding SEGE(c_i) of c_i in its matched words and the matched-word feature encoding WE(c_i) are concatenated along the encoding dimension to obtain the final dictionary feature encoding LE(c_i):

LE(c_i) = [SEGE(c_i); WE(c_i)]
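
As an illustration of the matching-window search and substring filtering described above, the following is a minimal sketch; the toy dictionary, the window size lw, and all function names are assumptions made for this example, not part of the patent.

def match_words(sentence, lexicon, lw=4):
    """For each character position i, collect the words in `lexicon` that
    contain position i and are at most `lw` characters long."""
    n = len(sentence)
    matches = [[] for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, min(start + lw, n) + 1):
            frag = sentence[start:end]
            if frag in lexicon:
                for i in range(start, end):
                    matches[i].append((frag, start, end - 1))
    # Filter: drop any candidate word that is a substring of another candidate
    for i, cands in enumerate(matches):
        words = [w for w, _, _ in cands]
        matches[i] = [(w, s, e) for (w, s, e) in cands
                      if not any(w != o and w in o for o in words)]
    return matches

def seg_marker(i, start, end):
    """seg(w_j): B if c_i starts the matched word, E if it ends it, else M."""
    if i == start:
        return "B"
    return "E" if i == end else "M"

lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
matches = match_words("南京市长江大桥", lexicon)
# For the character "长" (index 3), "长江" is filtered out because it is a
# substring of "长江大桥", leaving "市长" and "长江大桥".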
The pinyin feature construction stage corresponds to step (2) of the technical scheme. The specific embodiment is as follows: counting the neutral tone, pinyin has five tones; for example, "chang" can be read "chāng", "cháng", "chǎng" or "chàng". Consider extracting entities from the sentence "南京市长江大桥": when "长" in the sentence is pronounced "cháng", the sentence is segmented as "南京市 / 长江大桥" and "长江大桥" (Yangtze River Bridge) is extracted as a location entity; when "长" is pronounced "zhǎng", the sentence is segmented as "南京 / 市长 / 江大桥" and "江大桥" (Jiang Daqiao) is extracted as a person-name entity. This illustrates how the pinyin of the Chinese characters in a sentence affects entity extraction accuracy.

For any Chinese character c_i in the input Chinese character sequence X, after the candidate matching word set ws(c_i) is obtained, Chinese pinyin software (e.g. pypinyin) annotates c_i with pinyin according to the meaning of c_i in each matched word, yielding the pinyin set pys(c_i) corresponding to ws(c_i). Then the pinyin vector lookup table e_py maps the pinyin in pys(c_i) to pinyin vectors, giving the pinyin feature encoding PYE(c_i):

PYE(c_i) = e_py(pys(c_i))

The pinyin vector lookup table e_py is obtained by first converting an external Chinese corpus (e.g. the Chinese Wikipedia corpus) to pinyin with the Chinese pinyin software and then training with the Skip-gram method of Word2Vec. In the data preprocessing stage before word vector training, since the external Chinese corpus may contain digits, English, or other fragments without pinyin, English is converted to "[ENG]", digits are converted to "[DIGIT]", and other characters without pinyin are converted to "[UNK]".
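
The pinyin annotation and corpus preprocessing above can be sketched as follows, assuming the pypinyin package (named in the text as an example) and, for the Skip-gram training, the gensim package (an assumption; the patent names no implementation):

from pypinyin import Style, pinyin

def word_pinyin(word):
    # One tone-marked syllable per character; pypinyin chooses readings by
    # word context, e.g. "长江" -> ["cháng", "jiāng"]
    return [syl[0] for syl in pinyin(word, style=Style.TONE)]

def normalize_token(tok):
    # Placeholder conversion for corpus fragments that have no pinyin
    if tok.isdigit():
        return "[DIGIT]"
    if tok.isascii() and tok.isalpha():
        return "[ENG]"
    return "[UNK]"

# The pinyin vector lookup table e_py can then be trained over the
# pinyin-converted corpus with a Skip-gram model, for example:
# from gensim.models import Word2Vec
# e_py = Word2Vec(pinyin_sentences, vector_size=50, sg=1)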
An example of dictionary and pinyin feature construction is shown in FIG. 2. The figure gives the matching results for "市" and "长", where w_{i,j} denotes the word formed by the sequence fragment c_i, c_{i+1}, …, c_j. Note that "长江" (Yangtze River) does not appear in the matching results of "长", because "长江" is a substring of "长江大桥" (Yangtze River Bridge) and is therefore filtered out.
The dictionary and pinyin feature fusion stage corresponds to step (3) of the technical scheme. The specific embodiment is as follows: to avoid overfitting when the labeled entity-extraction datasets of some vertical domains are small, the invention uses the Chinese pre-trained language model BERT to provide semantic support and improve the generalization of the model. The input sequence X = {c_1, c_2, …, c_n} is fed into the Chinese pre-trained language model BERT, and the output of the last BERT layer is taken as the sequence encoding X_h = [x_1, x_2, …, x_n], where x_i ∈ R^{d_x} is a real column vector of dimension d_x, X_h ∈ R^{d_x×n} is a real d_x × n matrix, d_x denotes the BERT encoding dimension, and R denotes the real numbers. The dictionary features and pinyin features constructed above for the Chinese character c_i are concatenated along the encoding dimension to obtain the fused feature LPE(c_i):

LPE(c_i) = [LE(c_i); PYE(c_i)]

Suppose the word vector lookup table e_w has encoding dimension d_w, the pinyin vector lookup table e_py has encoding dimension d_py, and the candidate matching word set ws(c_i) has size m; then LPE(c_i) ∈ R^{m×(3+d_w+d_py)}.

LPE(c_i) is fused into the Chinese character encoding x_i by a point-wise attention mechanism: x_i plays the role of the query in the attention mechanism, while LPE(c_i) plays the roles of the keys and values. First, LPE(c_i) is linearly mapped to LPE_kv, whose encoding dimension matches that of x_i:

LPE_kv = LPE(c_i) · W_kv + b_kv

where W_kv and b_kv are training parameters and the mapped fused feature LPE_kv ∈ R^{m×d_x}. Let unsqueeze(M, y) denote expanding the y-th dimension of matrix M and squeeze(M, y) denote compressing its y-th dimension, so that unsqueeze(x_i, 0) converts x_i from R^{d_x} to R^{1×d_x}. The attention weights LPE_w are then computed as:

LPE_w = softmax(unsqueeze(x_i, 0) · LPE_kv^T)

where LPE_w ∈ R^{1×m} and the weights sum to 1 after the softmax. Next, the attention output LPE_o is computed as the weighted sum of LPE_kv under the attention weights LPE_w:

LPE_o = LPE_w · LPE_kv

where LPE_o ∈ R^{1×d_x}. Finally, LPE_o is added to the Chinese character encoding x_i as the final semantic encoding of c_i:

x_i = squeeze(LPE_o, 0) + x_i
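
A minimal PyTorch rendering of the point-wise attention fusion above, processing one character at a time; the class and variable names are illustrative assumptions:

import torch
import torch.nn as nn

class PointwiseAttentionFusion(nn.Module):
    """x_i (the BERT encoding) is the query; the m fused dictionary/pinyin
    features LPE(c_i) serve as both keys and values, as in the text."""
    def __init__(self, d_feat, d_x):
        super().__init__()
        self.proj = nn.Linear(d_feat, d_x)        # W_kv, b_kv

    def forward(self, x_i, lpe):
        # x_i: (d_x,)   lpe: (m, d_feat)
        lpe_kv = self.proj(lpe)                   # (m, d_x)
        attn = torch.softmax(lpe_kv @ x_i, dim=0) # (m,) weights summing to 1
        lpe_o = attn @ lpe_kv                     # (d_x,) weighted sum
        return x_i + lpe_o                        # residual add: x_i = LPE_o + x_i

# Example with d_x = 768 and m = 3 candidates of dimension 3 + 50 + 50 = 103
fusion = PointwiseAttentionFusion(d_feat=103, d_x=768)
x_fused = fusion(torch.randn(768), torch.randn(3, 103))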
The feature sequence modeling stage corresponds to step (4) of the technical scheme. The specific embodiment is as follows: the self-attention mechanism of the Transformer cannot capture sequence position information; the pre-trained language model BERT alleviates this by adding trainable absolute position encodings to its input, but it still lacks sequence-dependency modeling. The Long Short-Term Memory network (LSTM) needs no position encoding: its structure, which encodes recursively in sequence order, is able to learn sequence position information. The Chinese character semantic encodings X_h = [x_1, x_2, …, x_n] obtained in the previous step by fusing the dictionary and pinyin features are fed into two Bidirectional Long Short-Term Memory network models (BiLSTMs) for feature sequence modeling: the output of one BiLSTM is used by the sequence-labeling-based Chinese named entity fragment extraction auxiliary task of step (5), and the output of the other by the pointer-labeling-based Chinese named entity extraction main task of step (5). A BiLSTM consists of a forward LSTM and a backward LSTM, and the BiLSTMs of the two tasks are independent and share no training parameters.

Suppose that at time step t, the forward LSTM hidden state output of the sequence-labeling-based auxiliary task is h_a^fwd(t) and its backward LSTM hidden state output is h_a^bwd(t); adding them gives the BiLSTM hidden state output of the auxiliary task at time step t:

h_a(t) = h_a^fwd(t) + h_a^bwd(t)

Similarly, with the forward LSTM hidden state output of the pointer-labeling-based main task denoted h_b^fwd(t) and its backward output h_b^bwd(t), their sum gives the BiLSTM hidden state output of the main task at time step t:

h_b(t) = h_b^fwd(t) + h_b^bwd(t)

Finally, the feature sequence modeling output of the sequence labeling auxiliary task is H_a = [h_a(1), h_a(2), …, h_a(n)] ∈ R^{d_h×n}, and that of the pointer labeling main task is H_b = [h_b(1), h_b(2), …, h_b(n)] ∈ R^{d_h×n}, where d_h denotes the LSTM encoding dimension.
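
A sketch of the two independent BiLSTMs follows. PyTorch's bidirectional LSTM concatenates the forward and backward hidden states, so the sketch splits and sums them as the text prescribes; the dimensions follow the experiment section:

import torch
import torch.nn as nn

d_x, d_h = 768, 768   # BERT and LSTM encoding dimensions
bilstm_a = nn.LSTM(d_x, d_h, num_layers=1, bidirectional=True, batch_first=True)
bilstm_b = nn.LSTM(d_x, d_h, num_layers=1, bidirectional=True, batch_first=True)

def run_bilstm(lstm, X):
    out, _ = lstm(X)                        # (batch, n, 2*d_h): [forward; backward]
    return out[..., :d_h] + out[..., d_h:]  # sum the two directions

X = torch.randn(2, 10, d_x)     # fused semantic encodings of 2 sentences
H_a = run_bilstm(bilstm_a, X)   # input to the sequence labeling auxiliary task
H_b = run_bilstm(bilstm_b, X)   # input to the pointer labeling main task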
The multi-label-framework joint learning stage corresponds to step (5) of the technical scheme. The specific embodiment is as follows: sequence labeling and pointer labeling are two labeling frameworks commonly applied to named entity extraction. Sequence labeling marks the position of each Chinese character of the text sequence within an entity; FIG. 3 shows an example of BMOES labeling, where B denotes a Chinese character at the beginning of a named entity fragment, M a character in the middle of a fragment, O a character outside any fragment, E a character at the end of a fragment, and S a character that forms a named entity fragment by itself. The example sentence contains the two entities "南京市" (Nanjing City) and "长江大桥" (Yangtze River Bridge). Pointer labeling marks the entity types of the first and last Chinese characters of each entity fragment in the text sequence; in the example of FIG. 4, "南京市" and "长江大桥" are both location (Loc) entities.

Sequence labeling, by modeling full-sequence dependencies, extracts entities with better integrity and generally higher precision; pointer labeling, which classifies entity types only for the head and tail characters of entity fragments, resists noise better, is more robust, and generally has higher recall. To combine the advantages of the different labeling frameworks, H_a is used as the input of the sequence labeling auxiliary task and H_b as the input of the pointer labeling main task, and a multi-task learning model, such as the Multi-gate Mixture-of-Experts (MMoE) model or the Progressive Layered Extraction (PLE) model, jointly learns the sequence-labeling-based Chinese named entity fragment extraction auxiliary task and the pointer-labeling-based Chinese named entity extraction main task, yielding the sequence labeling auxiliary task output X_a and the pointer labeling main task output X_b.
The output layer modeling stage corresponds to step (6) of the technical scheme. The specific embodiment is as follows: a Dropout layer is applied to the X_a and X_b obtained in the previous step to prevent the model from overfitting. Then the post-Dropout X_a is fed into a Conditional Random Field (CRF), and the likelihood probability p(y|X) of a BMOES label index sequence y ∈ Z^n under the sequence-labeling-based Chinese named entity fragment extraction auxiliary task is computed:

p(y|X) = exp( Σ_{t=1..n} (W_CRF^{y_t} · X_a^t + b_CRF^{(y_{t-1}, y_t)}) ) / Σ_{y'∈Y(X)} exp( Σ_{t=1..n} (W_CRF^{y'_t} · X_a^t + b_CRF^{(y'_{t-1}, y'_t)}) )

where Y(X) denotes the set of all possible BMOES label index sequences of X under this task, and y' ∈ Z^n is any BMOES label index sequence in Y(X). The training parameters are W_CRF ∈ R^{d_h×5} and b_CRF ∈ R^{5×5} (the BMOES sequence labeling scheme has 5 labels); W_CRF^{y_t} denotes the parameters of W_CRF corresponding to label y_t, b_CRF^{(y_{t-1}, y_t)} denotes the parameter of b_CRF for the transition from label y_{t-1} to label y_t, and likewise for y'. Suppose the true BMOES label index sequence of the sequence labeling auxiliary task is y_span ∈ Z^n, where Z denotes the integers; substituting it into the formula above gives the log-likelihood loss of the sequence labeling auxiliary task:

L_crf = -log p(y_span | X)
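
The CRF log-likelihood can be computed with a standard linear-chain CRF layer; the sketch below uses the third-party pytorch-crf package, which is an assumption, as the patent names no implementation:

import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf (assumed implementation)

num_tags = 5                           # the 5 BMOES labels
emit = nn.Linear(768, num_tags)        # emission projection W_CRF of X_a
crf = CRF(num_tags, batch_first=True)  # holds the transition parameters b_CRF

X_a = torch.randn(2, 10, 768)                    # post-Dropout auxiliary output
y_span = torch.randint(0, num_tags, (2, 10))     # gold BMOES index sequences
loss_crf = -crf(emit(X_a), y_span, reduction='mean')  # -log p(y_span | X)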
Then the post-Dropout X_b is linearly mapped into the label space of the pointer-labeling-based Chinese named entity extraction main task, and a softmax layer computes each Chinese character's probability distributions over the labels, p_start and p_end:

p_start = softmax(X_b^T · W_start + b_start)
p_end = softmax(X_b^T · W_end + b_end)

where W_start, W_end ∈ R^{d_h×(c_e+1)} and b_start, b_end are training parameters, and c_e + 1 is the number of entity types c_e plus the non-entity type. p_start ∈ R^{n×(c_e+1)} is the predicted probability distribution of the entity types of entity-fragment head Chinese characters, and p_end ∈ R^{n×(c_e+1)} is the predicted probability distribution of the entity types of entity-fragment tail Chinese characters. Suppose the index sequence of the true entity-type labels of the entity-fragment head characters is y_start ∈ Z^n and that of the entity-fragment tail characters is y_end ∈ Z^n. The cross-entropy (CE) losses L_start and L_end of the pointer labeling main task are computed as:

L_start = -Σ_{i=1..n} log p_start[i, y_start^i]
L_end = -Σ_{i=1..n} log p_end[i, y_end^i]

where y_start^i denotes the true entity-type label index of the i-th Chinese character, p_start[i, y_start^i] denotes the probability, under p_start, that the i-th character is predicted as entity type y_start^i, and y_end^i and p_end[i, y_end^i] are defined likewise.
Finally, after obtaining the loss L_crf of the sequence labeling auxiliary task and the losses L_start and L_end of the pointer labeling main task, the three losses are fused into the overall training objective L that the model must minimize, and end-to-end joint training is performed:

L = λ1 · L_crf + λ2 · L_start + λ3 · L_end

where λ1, λ2 and λ3 are hyper-parameters controlling the influence of each task on the overall training objective.
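
Putting the three losses together, a sketch of the weighted objective follows; the default λ values below are placeholders, not values from the patent:

import torch.nn.functional as F

def total_loss(loss_crf, logits_start, logits_end, y_start, y_end,
               lambdas=(1.0, 1.0, 1.0)):
    # logits_*: (batch, n, c_e + 1) pre-softmax scores; y_*: (batch, n)
    l1, l2, l3 = lambdas
    loss_start = F.cross_entropy(logits_start.transpose(1, 2), y_start)
    loss_end = F.cross_entropy(logits_end.transpose(1, 2), y_end)
    return l1 * loss_crf + l2 * loss_start + l3 * loss_end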
In the test stage, the indices corresponding to the maxima of the per-character label prediction distributions p_start and p_end are taken as the predicted label indices:

ŷ_start^i = argmax(p_start[i]),  ŷ_end^i = argmax(p_end[i])

Entity-fragment head and tail Chinese characters with the same entity type and the closest positions are then matched, and the entities in the sequence are extracted.
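
The test-stage decoding reduces to an argmax per character followed by nearest same-type head/tail matching; a minimal sketch, with the non-entity label assumed to be index 0:

def decode_entities(start_pred, end_pred, non_entity=0):
    """start_pred / end_pred: per-character argmax entity-type indices.
    Pairs each head character with the nearest tail of the same type."""
    entities = []
    for i, t in enumerate(start_pred):
        if t == non_entity:
            continue
        for j in range(i, len(end_pred)):
            if end_pred[j] == t:            # same type, closest position
                entities.append((i, j, t))  # span [i, j] with type index t
                break
    return entities

# decode_entities([1,0,0,2,0,0,0], [0,0,1,0,0,0,2]) -> [(0, 2, 1), (3, 6, 2)]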
The invention provides a Chinese named entity extraction method based on a multi-label framework and fusion features. To test its effectiveness, the method is evaluated on the OntoNotes 4, MSRA, Resume and Weibo datasets in terms of precision (P), recall (R) and F1, and compared with other Chinese named entity extraction methods.
The model optimizer is adaptive moment estimation (Adam); the learning rate of the BERT training parameters is set to 3e-5 and that of the other model parameters to 1e-3; the BERT encoding dimension d_x is 768. The multi-task learning model is the progressive layered extraction model PLE, in which the number of task-specific experts and the number of shared experts are both set to 2, each expert is a single-layer fully connected network, and the number of PLE layers is set to 2. The number of LSTM layers is set to 1 and the LSTM encoding dimension d_h to 768; the word vector encoding dimension d_w is 50 and the pinyin vector encoding dimension d_py is 50.
Table 1 compares the accuracy of different Chinese named entity extraction methods on the OntoNotes 4 dataset; Table 2 on the MSRA dataset; Table 3 on the Resume dataset; and Table 4 on the Weibo dataset. The experimental results show that, compared with other Chinese named entity extraction methods, the method of the invention achieves the best accuracy on most datasets and metrics. FIG. 5(a)(b) show the effect of the dictionary matching window size on accuracy on the OntoNotes 4 and MSRA datasets, and FIG. 6(a)(b) show the same on the Resume and Weibo datasets; evaluating and analyzing the effect of the window size on Chinese named entity extraction accuracy provides guidance for choosing the dictionary matching window size in different application scenarios.
TABLE 1. Accuracy comparison of different entity extraction methods on the OntoNotes 4 dataset (table image not reproduced)
TABLE 2. Accuracy comparison of different entity extraction methods on the MSRA dataset (table image not reproduced)
TABLE 3. Accuracy comparison of different entity extraction methods on the Resume dataset (table image not reproduced)
TABLE 4. Accuracy comparison of different entity extraction methods on the Weibo dataset (table image not reproduced)

Claims (5)

1. A Chinese named entity extraction method based on a multi-label framework and fusion features, comprising the following steps:
(1) matching each Chinese character of an input Chinese character sequence against words in an external dictionary, mapping the matched words to word vectors with a word vector lookup table, mapping the segmentation markers of the Chinese character within the words to segmentation-marker vectors with a segmentation-marker vector lookup table, and concatenating the segmentation-marker vectors and the word vectors to form the dictionary features;
(2) annotating each Chinese character with pinyin according to its meaning in the matched words, and mapping the pinyin with a pinyin vector lookup table to obtain the pinyin features;
(3) fusing, by a point-wise attention mechanism, the dictionary features and the pinyin features into the Chinese character encodings produced by the Chinese pre-trained language model BERT, providing Chinese character semantic encodings that combine the dictionary and pinyin features for subsequent use;
(4) feeding the Chinese character semantic encodings into two independent bidirectional long short-term memory network models for feature sequence modeling, whose respective outputs are the first feature sequence encoding H_a and the second feature sequence encoding H_b;
(5) with sequence labeling as the auxiliary task and pointer labeling as the main task, using the first feature sequence encoding H_a as input to the sequence labeling auxiliary task and the second feature sequence encoding H_b as input to the pointer labeling main task, and jointly learning the two tasks with a multi-task learning model;
(6) computing, with a conditional random field, the log-likelihood loss L_crf of the sequence labeling auxiliary task, together with the entity-type classification cross-entropy loss L_start of entity-fragment head Chinese characters and the entity-type classification cross-entropy loss L_end of entity-fragment tail Chinese characters in the pointer labeling main task; taking the weighted sum of L_crf, L_start and L_end as the training objective the model must minimize, performing end-to-end joint training, and, in the test stage, extracting the entity fragments and their types in a sentence through the pointer labeling main task.
2. The Chinese named entity extraction method based on a multi-label framework and fusion features of claim 1, wherein in step (1), the external dictionary and the word vector lookup table are derived from pre-trained word vectors published on the Internet, and the segmentation-marker vector lookup table consists of one-hot vectors.
3. The Chinese named entity extraction method based on a multi-label framework and fusion features of claim 1, wherein in step (2), the pinyin vector lookup table is obtained by word2vec training on an external Chinese corpus whose text has been converted to pinyin with Chinese pinyin software.
4. The Chinese named entity extraction method based on a multi-label framework and fusion features of claim 1, wherein in step (5), the sequence labeling auxiliary task labels the entities in input sentences with a BMOES scheme without entity types and is responsible for extracting Chinese named entity fragments, the extracted entity fragments carrying no type; and the pointer labeling main task labels entity types only on the head and tail Chinese characters of the entity fragments in a sentence and is responsible for extracting the Chinese named entities, the extracted entities carrying types.
5. The Chinese named entity extraction method based on a multi-label framework and fusion features of claim 1, wherein in step (6), the test stage takes, for each Chinese character, the label corresponding to the maximum of its predicted entity-type probability distribution as the character's predicted label, then matches each entity-fragment head character to the entity-fragment tail character of the same entity type at the closest position, and extracts the text fragment between the head character and the tail character as an entity.
CN202110511025.8A 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features Active CN113190656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110511025.8A CN113190656B (en) 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110511025.8A CN113190656B (en) 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features

Publications (2)

Publication Number, Publication Date
CN113190656A, 2021-07-30
CN113190656B, 2023-07-14

Family

ID=76981067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110511025.8A Active CN113190656B (en) 2021-05-11 2021-05-11 Chinese named entity extraction method based on multi-annotation frame and fusion features

Country Status (1)

Country Link
CN (1) CN113190656B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114036933A (en) * 2022-01-10 2022-02-11 湖南工商大学 Information extraction method based on legal documents
CN114139541A (en) * 2021-11-22 2022-03-04 北京中科闻歌科技股份有限公司 Named entity identification method, device, equipment and medium
CN115146644A (en) * 2022-09-01 2022-10-04 北京航空航天大学 Multi-feature fusion named entity identification method for warning situation text
CN115470871A (en) * 2022-11-02 2022-12-13 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
US10140973B1 (en) * 2016-09-15 2018-11-27 Amazon Technologies, Inc. Text-to-speech processing using previously speech processed data
CN109446521A (en) * 2018-10-18 2019-03-08 京东方科技集团股份有限公司 Name entity recognition method, device, electronic equipment, machine readable storage medium
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111462752A (en) * 2020-04-01 2020-07-28 北京思特奇信息技术股份有限公司 Client intention identification method based on attention mechanism, feature embedding and BI-L STM
CN111476031A (en) * 2020-03-11 2020-07-31 重庆邮电大学 Improved Chinese named entity recognition method based on L attice-L STM

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10140973B1 (en) * 2016-09-15 2018-11-27 Amazon Technologies, Inc. Text-to-speech processing using previously speech processed data
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN109446521A (en) * 2018-10-18 2019-03-08 京东方科技集团股份有限公司 Name entity recognition method, device, electronic equipment, machine readable storage medium
CN111476031A (en) * 2020-03-11 2020-07-31 重庆邮电大学 Improved Chinese named entity recognition method based on L attice-L STM
CN111462752A (en) * 2020-04-01 2020-07-28 北京思特奇信息技术股份有限公司 Client intention identification method based on attention mechanism, feature embedding and BI-L STM
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENIL DOSHI et al.: "Normalizing Text using Language Modelling based on Phonetics and String Similarity", arXiv, pp. 1-9 *
H. PENG et al.: "Phonetic-enriched Text Representation for Chinese Sentiment Analysis with Reinforcement Learning", IEEE Transactions on Affective Computing, pp. 1-16 *
江涛 (Jiang Tao): "Research and application of key technologies for named entity recognition in electronic medical records based on deep neural networks", China Master's Theses Full-text Database (Medicine & Health Sciences), no. 7, pp. 053-210 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139541A (en) * 2021-11-22 2022-03-04 北京中科闻歌科技股份有限公司 Named entity identification method, device, equipment and medium
CN114036933A (en) * 2022-01-10 2022-02-11 湖南工商大学 Information extraction method based on legal documents
CN114036933B (en) * 2022-01-10 2022-04-22 湖南工商大学 Information extraction method based on legal documents
CN115146644A (en) * 2022-09-01 2022-10-04 北京航空航天大学 Multi-feature fusion named entity identification method for warning situation text
CN115470871A (en) * 2022-11-02 2022-12-13 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model
CN115470871B (en) * 2022-11-02 2023-02-17 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model

Also Published As

Publication number Publication date
CN113190656B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN110008469B (en) Multilevel named entity recognition method
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
CN111666758B (en) Chinese word segmentation method, training device and computer readable storage medium
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN112926324B (en) Vietnamese event entity recognition method integrating dictionary and anti-migration
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN115081437B (en) Machine-generated text detection method and system based on linguistic feature contrast learning
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN116151256A (en) Small sample named entity recognition method based on multitasking and prompt learning
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN117196032A (en) Knowledge graph construction method and device for intelligent decision, electronic equipment and storage medium
CN113191150A (en) Multi-feature fusion Chinese medical text named entity identification method
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN115186670B (en) Method and system for identifying domain named entities based on active learning
CN113139050B (en) Text abstract generation method based on named entity identification additional label and priori knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant