CN113190656A - Chinese named entity extraction method based on multi-label framework and fusion features - Google Patents
- Publication number: CN113190656A (application number CN202110511025.8A)
- Authority: CN (China)
- Prior art keywords: chinese, entity, sequence, chinese character, pinyin
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3344 — Query execution using natural language analysis
- G06F16/3346 — Query execution using probabilistic model
- G06F16/35 — Clustering; Classification
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415 — Classification techniques based on parametric or probabilistic models
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- G06F40/295 — Named entity recognition
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a Chinese named entity extraction method based on a multi-labeling framework and fused features. First, word information and word-segmentation marker information are introduced for each Chinese character through dictionary matching to construct dictionary features. On this basis, Chinese pinyin software annotates each character's pinyin according to its meaning in the matched words, constructing pinyin features. Then, based on a point-wise attention mechanism, the dictionary and pinyin features are fused into the Chinese character encodings to obtain character semantic encodings that combine both, improving the recognition of Chinese named entity boundaries. Finally, combining the advantages of sequence labeling and pointer labeling, the two labeling tasks are jointly learned with a multi-task learning model, improving the accuracy of Chinese named entity extraction.
Description
Technical Field
The invention belongs to the fields of artificial intelligence and natural language processing, and specifically relates to a Chinese named entity extraction method based on a multi-labeling framework and fused features.
Background
With the rapid development of internet technology, data in all industries has grown explosively, driving intelligent analysis and mining services and innovative applications of industrial big data, and further promoting the development of China's digital economy. This data contains a large amount of unstructured text, and extracting structured, useful information from it has become a key industry concern. It involves a basic task in natural language processing: named entity extraction.
Early research on named entity recognition mainly used dictionary- and rule-based methods, which relied on linguists and domain experts to manually construct domain dictionaries and rule templates from dataset characteristics. The advantage of the rule-based approach is that rules can be iteratively updated to extract the target entities as needed. Its disadvantages are that manually building rules is costly in complex domains and application scenarios, and that rule conflicts easily arise as the rule base grows, making the rule base hard to maintain and extend and unable to adapt to changes in data and domains.
Subsequently, research attention turned to named entity recognition based on statistical machine learning, in which the task is defined as a sequence labeling problem. The statistical machine learning methods applied to NER mainly include the maximum entropy model, the hidden Markov model, the maximum entropy Markov model, and conditional random fields. These methods depend on manually constructed features, and the feature engineering process is relatively laborious.
In recent years, with the continued development of deep learning, more and more work in named entity recognition has been based on Deep Neural Networks (DNNs). DNN-based named entity recognition requires no complicated feature engineering, and model performance far exceeds that of traditional rule-based and statistical machine learning methods.
Recognizing named entities in Chinese is harder than in English: Chinese text lacks separators such as the spaces in English and has no obvious morphological cues, which easily causes boundary ambiguity. Chinese also exhibits polysemy: the same word takes different meanings in different domains or contexts, so context information must be fully exploited to interpret it. Linguistic phenomena such as omission and abbreviation pose further challenges for Chinese named entity recognition. Existing Chinese named entity extraction methods make little use of word information and are limited by a single labeling framework, which hurts extraction precision.
Disclosure of Invention
Purpose of the invention: in view of the problems and shortcomings of the prior art, the invention aims to provide a Chinese named entity extraction method based on a multi-labeling framework and fused features, solving two problems of traditional Chinese named entity extraction: the limitations of a single labeling framework, and the difficulty of identifying entity boundaries caused by insufficient use of word information.
Technical scheme: to achieve the above object, the invention adopts a Chinese named entity extraction method based on a multi-labeling framework and fused features, comprising the following steps:
(1) performing word matching on each Chinese character in an input Chinese character sequence in an external dictionary, mapping words into word vectors by using a word vector query table, mapping word segmentation marks of the Chinese characters in the words into word marking vectors by using a word segmentation mark vector query table, and splicing the word segmentation mark vectors and the word vectors to form dictionary features;
(2) according to the meaning of the Chinese characters in the matching words, the Chinese characters are marked with pinyin, and pinyin characteristics are obtained by mapping the pinyin through a pinyin vector lookup table;
(3) based on a point-wise attention mechanism, fusing the dictionary features and the pinyin features into the Chinese character encodings produced by the Chinese pre-trained language model BERT, providing subsequent layers with character semantic encodings that combine the dictionary and pinyin features;
(4) inputting the character semantic encodings into two independent bidirectional long short-term memory network models for feature sequence modeling, which output the first feature sequence encoding and the second feature sequence encoding, respectively;
(5) taking sequence labeling as the auxiliary task and pointer labeling as the main task, using the first feature sequence encoding as input to the sequence labeling auxiliary task and the second feature sequence encoding as input to the pointer labeling main task, and jointly learning the two tasks with a multi-task learning model;
(6) computing the log-likelihood loss of the sequence labeling auxiliary task under a conditional random field, the entity-type classification cross-entropy loss of entity-fragment head characters in the pointer labeling main task, and the entity-type classification cross-entropy loss of entity-fragment tail characters in the pointer labeling main task; weighting and summing these three losses to obtain the training objective that the model minimizes in end-to-end joint training; in the testing stage, the pointer labeling main task labels the entity fragments and their types in the sentence.
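The steps above end with a weighted sum of three losses; a minimal sketch, with illustrative weights (the patent does not specify the weight values here):

```python
# Hypothetical sketch of the step (6) training objective: a weighted sum of the
# CRF log-likelihood loss of the sequence labeling auxiliary task and the two
# cross-entropy losses (head / tail characters) of the pointer labeling main
# task. The default weights alpha/beta/gamma are assumptions for illustration.
def total_loss(loss_crf, loss_head, loss_tail, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three task losses, minimized end to end."""
    return alpha * loss_crf + beta * loss_head + gamma * loss_tail
```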
Further, in step (1), the external dictionary and the word vector lookup table are derived from pre-trained word vectors published on the internet, and the word-segmentation marker vector lookup table consists of one-hot vectors.
Further, in step (2), the pinyin vector lookup table is obtained by word2vec training on an external Chinese corpus whose text has been converted into pinyin with Chinese pinyin software.
Further, in step (5), the sequence labeling auxiliary task labels entities in input sentences with type-free BMOES tags and is responsible for extracting Chinese named entity fragments, which carry no types; the pointer labeling main task performs entity-type labeling only on the head and tail characters of entity fragments and is responsible for extracting typed Chinese named entities.
Further, in step (6), in the testing stage, for each character the label with the maximum predicted entity-type probability is taken as its predicted label; each entity-fragment head character is then matched with the closest following tail character of the same entity type, and the text fragment between the head and tail characters is extracted as the entity.
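The test-stage decoding described above can be sketched as follows (an illustrative sketch, not the patent's code): each character receives its argmax head/tail type label, and each head is paired with the nearest following tail of the same entity type.

```python
from typing import List, Tuple

def decode_pointer(head_labels: List[str], tail_labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans; 'O' marks a non-entity character."""
    spans = []
    for i, h in enumerate(head_labels):
        if h == "O":
            continue
        # match the closest tail character with the same entity type
        for j in range(i, len(tail_labels)):
            if tail_labels[j] == h:
                spans.append((i, j, h))
                break
    return spans

# Toy labels: "Nanjing City" (Loc, chars 0-2), "Yangtze River Bridge" (Loc, chars 3-6)
heads = ["Loc", "O", "O", "Loc", "O", "O", "O"]
tails = ["O", "O", "Loc", "O", "O", "O", "Loc"]
spans = decode_pointer(heads, tails)
```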
Beneficial effects: the method effectively addresses the difficulty of identifying Chinese named entity boundaries, exploits the advantages of different labeling frameworks, and improves the accuracy of Chinese named entity extraction. First, dictionary and pinyin features strengthen the model's recognition of entity boundaries, and the Chinese pre-trained language model BERT encodes the characters to give upper layers contextual semantic support. Second, the recursive structure of the bidirectional long short-term memory network models the feature sequence and learns sequence position information, mitigating BERT's tendency to lose such information due to its lack of sequence-order-dependent modeling. Third, sequence labeling and pointer labeling are jointly learned through a multi-task learning model, combining the advantages of different labeling frameworks, breaking the limitation of a single framework, and further improving extraction accuracy.
Drawings
FIG. 1 is an overall block diagram of the method of the present invention;
FIG. 2 is an exemplary diagram of dictionary and Pinyin feature construction in the method of the present invention;
FIG. 3 is a diagram illustrating sequence notation in the method of the present invention;
FIG. 4 is a diagram illustrating an example of a pointer marking in the method of the present invention;
FIG. 5(a)(b) are graphs of experimental results on the effect of the dictionary matching window size on accuracy, on the OntoNotes 4 dataset and the MSRA dataset respectively, in the method of the present invention;
FIG. 6(a) (b) are graphs of experimental results showing the effect of the size of the dictionary matching window on the accuracy of the Resume dataset and the Weibo dataset, respectively, in the method of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention; after reading this specification, equivalent modifications by those skilled in the art fall within the scope defined by the appended claims.
The invention provides a Chinese named entity extraction method based on a multi-labeling framework and fused features, solving the problems that traditional Chinese named entity extraction has difficulty identifying entity boundaries and is limited to a single labeling framework. As shown in FIG. 1, the complete process comprises 6 parts: the dictionary feature construction stage, the pinyin feature construction stage, the dictionary and pinyin feature fusion stage, the feature sequence modeling stage, the multi-labeling-framework joint learning stage, and the output layer modeling stage. The specific embodiments are described below:
The dictionary feature construction stage corresponds to step (1) of the technical scheme. The specific implementation is as follows: let any given input Chinese character sequence be X = {c_1, c_2, …, c_n}, where each c_i (1 ≤ i ≤ n) is a single character drawn from the Chinese character vocabulary and n is the sequence length. For any character c_i in X, to introduce words related to c_i's context, an external dictionary L_x is required. By setting a word matching window l_w, every text fragment of the sentence that contains c_i and has length at most l_w is matched against the words in L_x. If such a fragment appears in L_x, it is treated as a context-related candidate word of c_i. Since several fragments containing c_i may appear in the dictionary, this finally yields the candidate matching word set of c_i, ws(c_i) = {w_1, w_2, …, w_m}, where w_j (1 ≤ j ≤ m) denotes a matching word.
After obtaining the candidate matching word set ws(c_i), further screening is needed: any word that is a substring of another word in the set is filtered out of the set. The reasons are: 1) a complete word generally better matches the character's context information — for example, in "南京市长江大桥" ("Nanjing Yangtze River Bridge"), "长江大桥" (Yangtze River Bridge) is a better candidate word for the character "长" than its substring "长江" (Yangtze River); 2) it reduces interference during the attention-based fusion of dictionary and pinyin features, making attention more likely to select the word that best fits the character's context from the candidate list.
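The dictionary matching and substring screening described above can be sketched as follows; the toy lexicon and window size are assumptions for illustration, not the patent's dictionary.

```python
from typing import Dict, List, Set

def match_words(sentence: str, lexicon: Set[str], window: int) -> Dict[int, List[str]]:
    """For each character index, collect dictionary words of length <= window
    that cover it, then drop candidates that are substrings of other candidates."""
    cands: Dict[int, List[str]] = {i: [] for i in range(len(sentence))}
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + window, len(sentence)) + 1):
            frag = sentence[start:end]
            if frag in lexicon:
                for i in range(start, end):   # frag covers characters start..end-1
                    cands[i].append(frag)
    for i, words in cands.items():
        # screening step: remove any word that is a substring of another match
        cands[i] = [w for w in words if not any(w != o and w in o for o in words)]
    return cands

# Toy lexicon (an assumption for this example)
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
cands = match_words("南京市长江大桥", lexicon, window=4)
```

For the character "长" (index 3), "长江" is screened out as a substring of "长江大桥", while "市长" survives.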
The screened matching word set ws(c_i) is mapped to word vectors through the word vector lookup table e_w, giving the matching word feature encoding WE(c_i):

WE(c_i) = e_w(ws(c_i))
where e_w is taken from pre-trained word vectors and remains fixed during training. Next, the position of the character within each matched word is marked. Let B denote that character c_i is at the beginning of the word, M that it is in the middle, and E that it is at the end. Since c_i receives different segmentation markers when matching different words, the markers of c_i in its matched words are also merged into the dictionary features, further highlighting the differences between matched words. For any word w_j in the candidate matching word set ws(c_i), let seg(w_j) ∈ {B, M, E} denote the segmentation marker of c_i in w_j, and let START(w_j) and END(w_j) denote the indices of w_j's start and end positions in sequence X. Then seg(w_j) is defined as:

seg(w_j) = B if i = START(w_j); M if START(w_j) < i < END(w_j); E if i = END(w_j)

Applying the above formula to all words in ws(c_i) yields segs(c_i):
where segs(c_i) denotes the set of segmentation markers of c_i over all its matched words. Through the segmentation marker vector lookup table e_seg, the markers in segs(c_i) are mapped to one-hot segmentation marker encodings SEGE(c_i):

SEGE(c_i) = e_seg(segs(c_i))
Each dimension of the one-hot vector corresponds to one element of the set {B, M, E}: [1,0,0] corresponds to B, [0,1,0] to M, and [0,0,1] to E.
The segmentation marker encoding SEGE(c_i) and the matching word feature encoding WE(c_i) are concatenated along the encoding dimension to obtain the final dictionary feature encoding of c_i, LE(c_i):

LE(c_i) = [SEGE(c_i); WE(c_i)]
The pinyin feature construction stage corresponds to step (2) of the technical scheme. The specific implementation is as follows: counting the neutral tone, pinyin has five tones, e.g. "chang", "chāng", "cháng", "chǎng" and "chàng". Consider extracting entities from "南京市长江大桥": when "长" is pronounced "cháng", the sentence segments as "南京市 / 长江大桥" and "长江大桥" (Yangtze River Bridge) is extracted as a location entity; when "长" is pronounced "zhǎng", the sentence segments as "南京市长 / 江大桥" ("Nanjing mayor / Jiang Daqiao") and "江大桥" is extracted as a person-name entity. This illustrates how the pinyin of the characters in a sentence affects entity extraction accuracy.
For any character c_i in the input sequence X, after obtaining the candidate word set ws(c_i), Chinese pinyin software (e.g. pypinyin) annotates c_i according to its meaning in each matched word, yielding the pinyin set pys(c_i) corresponding to ws(c_i). Then the pinyin vector lookup table e_py maps each pinyin in pys(c_i) to a pinyin vector, giving the pinyin feature encoding PYE(c_i):

PYE(c_i) = e_py(pys(c_i))
Here the pinyin vector lookup table e_py is obtained by converting an external Chinese corpus (e.g. the Chinese Wikipedia corpus) into pinyin with Chinese pinyin software and then training with word2vec's Skip-gram method. In the data preprocessing stage before vector training, since the external corpus may contain numbers, English, or other fragments without pinyin, English is converted to "[ENG]", digits to "[DIGIT]", and other characters without pinyin to "[UNK]".
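The pinyin lookup of step (2) can be sketched with a toy table standing in for the word2vec-trained e_py (in the patent, readings come from pinyin software such as pypinyin, and non-pinyin tokens are normalized to "[ENG]", "[DIGIT]", "[UNK]"); the vectors below are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the pinyin vector lookup table e_py (trained with
# word2vec/Skip-gram on a pinyin-converted corpus in the patent).
PINYIN_TABLE = {
    "chang2": np.array([0.1, 0.3]),   # 长 as in 长江 (cháng)
    "zhang3": np.array([0.7, -0.2]),  # 长 as in 市长 (zhǎng)
    "[UNK]": np.array([0.0, 0.0]),    # fallback for characters without pinyin
}

def pinyin_feature(pinyins):
    """Stack the pinyin vectors of a character's candidate readings: PYE(c_i)."""
    return np.stack([PINYIN_TABLE.get(p, PINYIN_TABLE["[UNK]"]) for p in pinyins])

# Readings of "长" taken from its matched words 市长 and 长江大桥
pye = pinyin_feature(["zhang3", "chang2"])
```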
An example of dictionary and pinyin feature construction is shown in FIG. 2, which gives the matching results for "市" and "长", where w_{i,j} denotes the word formed by the fragment c_i, c_{i+1}, …, c_j. Note that "长江" (Yangtze River) does not appear in the matching results for "长", because it is a substring of "长江大桥" (Yangtze River Bridge) and is therefore filtered out.
The dictionary and pinyin feature fusion stage corresponds to step (3) of the technical scheme. The specific implementation is as follows: to avoid overfitting when the entity extraction dataset of a vertical domain is small, the invention uses the Chinese pre-trained language model BERT to provide semantic support and improve generalization. The input sequence X = {c_1, c_2, …, c_n} is fed into BERT, and the output of its last layer is taken as the sequence encoding X_h = [x_1, x_2, …, x_n], where d_x is the BERT encoding dimension, R denotes the real numbers, x_i ∈ R^{d_x} is a real column vector, and X_h ∈ R^{d_x×n}. The previously constructed dictionary and pinyin features of c_i are concatenated along the encoding dimension to obtain the fused feature LPE(c_i):

LPE(c_i) = [LE(c_i); PYE(c_i)]
Suppose the word vector lookup table e_w has encoding dimension d_w, the pinyin vector lookup table e_py has encoding dimension d_py, and the candidate matching word set ws(c_i) has size m, so that LPE(c_i) ∈ R^{(d_w+3+d_py)×m}. LPE(c_i) is fused into the character encoding x_i by a point-wise attention mechanism, in which x_i plays the role of the query and LPE(c_i) supplies the keys and values. First, LPE(c_i) is linearly mapped to LPE_ikv, whose encoding dimension matches that of x_i:

LPE_ikv = W_l · LPE(c_i) + b_l

where the training parameters are W_l ∈ R^{d_x×(d_w+3+d_py)} and b_l ∈ R^{d_x}, and the mapped fused feature LPE_ikv ∈ R^{d_x×m}. Let unsqueeze(M, y) denote expanding the y-th dimension of matrix M and squeeze(M, y) denote compressing the y-th dimension of M; unsqueeze(x_i, 0) thus converts x_i from R^{d_x} to R^{1×d_x}. The attention weight LPE_iw is then computed as:
LPE_iw = softmax(unsqueeze(x_i, 0) · LPE_ikv)

where the attention weight LPE_iw ∈ R^{1×m}, and the softmax normalization makes the weights sum to 1. The attention output LPE_io is then computed as a weighted sum over LPE_ikv using LPE_iw:
LPE_io = squeeze(LPE_ikv · transpose(LPE_iw), 1)

where the attention output LPE_io ∈ R^{d_x}. Finally, LPE_io is added to the character encoding x_i as the final semantic encoding of character c_i, expressed as:

x_i = LPE_io + x_i
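The point-wise attention fusion above can be sketched in numpy; the dimensions and random values are illustrative assumptions, while the linear map, softmax weighting, weighted sum, and residual addition follow the description.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(x_i, lpe, W, b):
    """Point-wise attention fusion sketch.
    x_i: (d_x,) BERT character code (query); lpe: (d_lpe, m) fused dict+pinyin
    features (keys/values); W: (d_x, d_lpe), b: (d_x, 1) linear-map parameters."""
    lpe_kv = W @ lpe + b                # (d_x, m): mapped keys/values, LPE_ikv
    weights = softmax(x_i @ lpe_kv)     # (m,): attention over the m candidates
    lpe_out = lpe_kv @ weights          # (d_x,): weighted sum, LPE_io
    return x_i + lpe_out                # residual add: final semantic encoding

rng = np.random.default_rng(0)
d_x, d_lpe, m = 4, 5, 3                 # illustrative dimensions
x_i = rng.normal(size=d_x)
out = fuse(x_i, rng.normal(size=(d_lpe, m)), rng.normal(size=(d_x, d_lpe)), np.zeros((d_x, 1)))
```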
The feature sequence modeling stage corresponds to step (4) of the technical scheme. The specific implementation is as follows: the Transformer's self-attention mechanism cannot itself capture sequence position information; the pre-trained language model BERT mitigates this by adding trainable absolute position encodings to its input, but still lacks sequence-order-dependent modeling. The Long Short-Term Memory network (LSTM) needs no position encoding: its structure, which encodes recursively in sequence order, can learn sequence position information. The character semantic encodings fused with dictionary and pinyin features in the previous step are therefore fed into two bidirectional LSTM models (BiLSTMs) for feature sequence modeling: the output of one BiLSTM serves the sequence-labeling-based Chinese named entity fragment extraction auxiliary task of step (5), and the output of the other serves the pointer-labeling-based Chinese named entity extraction main task of step (5). Each BiLSTM consists of a forward and a backward LSTM, and the two tasks' BiLSTMs are independent and share no training parameters.
Suppose that at time step t, the forward LSTM hidden state output of the sequence-labeling auxiliary task's BiLSTM is fh_t^a and its backward LSTM hidden state output is bh_t^a; adding them gives the auxiliary task's BiLSTM hidden state at time step t, h_t^a = fh_t^a + bh_t^a. Likewise, for the pointer-labeling main task, the forward hidden state fh_t^b and backward hidden state bh_t^b are added to give h_t^b = fh_t^b + bh_t^b. Finally, the feature sequence modeling output of the sequence labeling auxiliary task is H^a = [h_1^a, h_2^a, …, h_n^a] ∈ R^{d_h×n}, and that of the pointer labeling main task is H^b = [h_1^b, h_2^b, …, h_n^b] ∈ R^{d_h×n}, where d_h denotes the LSTM encoding dimension.
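The sequence-order recursion that gives the (Bi)LSTM its position awareness can be sketched in numpy; the cell equations are the standard LSTM gates, and the forward and backward hidden states are summed at each step as described above. The random weights and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(xs, W, U, b, reverse=False):
    """Single LSTM pass. xs: (n, d_in); W: (4h, d_in); U: (4h, h); b: (4h,).
    Returns the hidden states (n, h), produced recursively in sequence order."""
    h_dim = U.shape[1]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    out = np.zeros((len(xs), h_dim))
    for t in order:
        z = W @ xs[t] + U @ h + b
        i, f, g, o = np.split(z, 4)                    # input/forget/cell/output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
        out[t] = h
    return out

def bilstm(xs, fw, bw):
    """Sum of forward and backward hidden states, h_t = fh_t + bh_t."""
    return lstm(xs, *fw) + lstm(xs, *bw, reverse=True)

rng = np.random.default_rng(1)
n, d_in, h = 6, 4, 3
params = lambda: (rng.normal(size=(4 * h, d_in)), rng.normal(size=(4 * h, h)), np.zeros(4 * h))
H = bilstm(rng.normal(size=(n, d_in)), params(), params())
```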
The multi-labeling-framework joint learning stage corresponds to step (5) of the technical scheme. The specific implementation is as follows: sequence labeling and pointer labeling are two common labeling frameworks for named entity extraction. Sequence labeling marks the position of every character of the text sequence within entities; FIG. 3 shows BMOES labeling of an example sentence, where B marks a character at the beginning of a named entity fragment, M a character in the middle, O a character outside any fragment, E a character at the end, and S a character that is itself a complete named entity fragment. The example sentence contains two entities, "南京市" (Nanjing City) and "长江大桥" (Yangtze River Bridge). Pointer labeling marks the entity types of only the head and tail characters of each entity fragment in the text sequence, as shown in FIG. 4, where both "南京市" and "长江大桥" are location (Loc) entities.
Sequence labeling models full-sequence dependencies, so the extracted entities tend to be more complete and precision is generally higher; pointer labeling only classifies the entity types of the head and tail Chinese characters of each entity fragment, so it resists noise interference better, is more robust, and generally achieves higher recall. To combine the advantages of the two labeling frameworks, $H^{a}$ is taken as the input of the sequence-labeling auxiliary task and $H^{b}$ as the input of the pointer-labeling main task, and a multi-task learning model, such as the Multi-gate Mixture-of-Experts (MMoE) model or the Progressive Layered Extraction (PLE) model, performs joint learning on the sequence-labeling-based Chinese named entity fragment extraction auxiliary task and the pointer-labeling-based Chinese named entity extraction main task, yielding the sequence-labeling auxiliary task output $X_a$ and the pointer-labeling main task output $X_b$.
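A hedged, minimal sketch of MMoE-style routing as mentioned above: all tasks share a pool of experts, and each task mixes the expert outputs through its own softmax gate. In a real MMoE the gate logits are computed from the input by trained parameters; here they are fixed toy values, and the expert outputs are made-up vectors rather than the patent's trained model.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mmoe_task_output(expert_outputs, gate_logits):
    """Weighted sum of expert output vectors using task-specific gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * expert[d] for w, expert in zip(weights, expert_outputs))
        for d in range(dim)
    ]

experts = [[1.0, 0.0], [0.0, 1.0]]                 # two toy expert outputs
seq_task = mmoe_task_output(experts, [2.0, 0.0])   # gate favoring expert 0
ptr_task = mmoe_task_output(experts, [0.0, 2.0])   # gate favoring expert 1
```

Each task thus receives its own mixture of the shared experts, which is the mechanism that lets the auxiliary and main tasks share knowledge without sharing one representation.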
The output layer sequence modeling stage corresponds to step (6) of the technical scheme. The specific implementation is as follows: a Dropout layer is applied to the $X_a$ and $X_b$ obtained in the previous step to prevent the model from overfitting. Then $X_a$ (after Dropout) is input into a Conditional Random Field (CRF), and the likelihood probability $p(y \mid X_a)$ of a BMOES label index sequence $y \in \mathbb{Z}^n$ of the sequence-labeling-based Chinese named entity fragment extraction auxiliary task is computed:

$$p(y \mid X_a) = \frac{\exp\Big(\sum_{t=1}^{n} \big( X_{a,t} \cdot W_{CRF}^{y_t} + b_{CRF}^{y_{t-1}, y_t} \big)\Big)}{\sum_{y' \in \mathcal{Y}_X} \exp\Big(\sum_{t=1}^{n} \big( X_{a,t} \cdot W_{CRF}^{y'_t} + b_{CRF}^{y'_{t-1}, y'_t} \big)\Big)}$$
where $\mathcal{Y}_X$ represents the set of all possible BMOES label index sequences of $X$ under this task and $y' \in \mathbb{Z}^n$ is any such sequence. The training parameters are $W_{CRF} \in \mathbb{R}^{d_h \times 5}$ and $b_{CRF} \in \mathbb{R}^{5 \times 5}$ (the BMOES sequence labeling scheme has 5 tags); $W_{CRF}^{y_t}$ denotes the training parameters in $W_{CRF}$ corresponding to label $y_t$, $b_{CRF}^{y_{t-1}, y_t}$ denotes the training parameters in $b_{CRF}$ for the transition from label $y_{t-1}$ to label $y_t$, and $W_{CRF}^{y'_t}$, $b_{CRF}^{y'_{t-1}, y'_t}$ are defined analogously. Let the true BMOES label index sequence of the sequence-labeling auxiliary task be $y_{span} \in \mathbb{Z}^n$, where $\mathbb{Z}$ denotes the integers; substituting it into the formula above gives the log-likelihood loss of the sequence-labeling auxiliary task:

$$\mathcal{L}_{span} = -\log p(y_{span} \mid X_a)$$
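The CRF likelihood above can be illustrated by brute force on a tiny input: the score of a label sequence is the sum of its emission scores and label-transition scores, normalized over every possible sequence. All numbers are toy values, and a real implementation would use the forward algorithm rather than enumeration.

```python
import math
from itertools import product

LABELS = ["B", "M", "O", "E", "S"]  # the 5 BMOES tags

def sequence_score(emissions, transitions, labels):
    """Emission scores per position plus transition scores between labels."""
    score = sum(emissions[t][lab] for t, lab in enumerate(labels))
    score += sum(transitions[(a, b)] for a, b in zip(labels, labels[1:]))
    return score

def crf_probability(emissions, transitions, labels):
    """Exponentiated score of one sequence over the sum across all sequences."""
    n = len(emissions)
    z = sum(
        math.exp(sequence_score(emissions, transitions, seq))
        for seq in product(LABELS, repeat=n)
    )
    return math.exp(sequence_score(emissions, transitions, labels)) / z

# Toy 2-character input: position 0 prefers "B", position 1 prefers "E";
# all transition scores are zero for simplicity.
emissions = [{l: (0.5 if l == "B" else 0.0) for l in LABELS},
             {l: (0.5 if l == "E" else 0.0) for l in LABELS}]
transitions = {(a, b): 0.0 for a in LABELS for b in LABELS}
p = crf_probability(emissions, transitions, ["B", "E"])
```

The probabilities over all $5^2$ sequences sum to 1, and the sequence matching both emission preferences receives the highest probability.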
Then $X_b$ (after Dropout) is linearly mapped into the label space of the pointer-labeling-based Chinese named entity extraction main task, and a softmax layer computes the probability distributions $p_{start}$ and $p_{end}$ of each Chinese character over the labels:

$$p_{start} = \mathrm{softmax}(X_b W_{start} + b_{start}), \qquad p_{end} = \mathrm{softmax}(X_b W_{end} + b_{end})$$
where $W_{start}, W_{end} \in \mathbb{R}^{d_h \times (c_e+1)}$ and $b_{start}, b_{end} \in \mathbb{R}^{c_e+1}$ are training parameters, and $c_e + 1$ is the number of entity types $c_e$ plus the non-entity type. $p_{start}$ is the predicted probability distribution of the entity type of the first Chinese character of an entity fragment, and $p_{end}$ is the predicted probability distribution of the entity type of the last Chinese character of an entity fragment. Let the true entity type label index sequence of the entity fragment head Chinese characters be $y_{start} \in \mathbb{Z}^n$ and that of the entity fragment tail Chinese characters be $y_{end} \in \mathbb{Z}^n$; the Cross Entropy (CE) losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ of the pointer-labeling main task are computed as

$$\mathcal{L}_{start} = -\frac{1}{n} \sum_{i=1}^{n} \log p_{start}^{i,\, y_{start}^{i}}, \qquad \mathcal{L}_{end} = -\frac{1}{n} \sum_{i=1}^{n} \log p_{end}^{i,\, y_{end}^{i}}$$
where $y_{start}^{i}$ denotes the true entity type label index of the $i$-th Chinese character, $p_{start}^{i,\, y_{start}^{i}}$ denotes the probability predicted by $p_{start}$ that the $i$-th Chinese character is of entity type $y_{start}^{i}$, and $y_{end}^{i}$, $p_{end}^{i,\, y_{end}^{i}}$ are defined analogously.
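A minimal sketch of the pointer-labeling loss above: each character's logits over ($c_e$ entity types + 1 non-entity type) pass through softmax, and the cross-entropy loss averages the negative log-probability of the true label. The logits, label indices, and function name below are toy assumptions, not the patent's trained values.

```python
import math

def softmax(logits):
    m = max(logits)                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pointer_cross_entropy(logits_per_char, true_label_indices):
    """Mean negative log-likelihood of the true labels across characters."""
    total = 0.0
    for logits, y in zip(logits_per_char, true_label_indices):
        probs = softmax(logits)
        total += -math.log(probs[y])
    return total / len(logits_per_char)

# 3 characters; label indices: 0 = non-entity, 1 = Loc, 2 = Per (toy setup).
logits = [[0.1, 3.0, 0.2], [2.5, 0.0, 0.0], [0.3, 0.1, 2.0]]
y_start = [1, 0, 2]                  # true start-label index per character
loss = pointer_cross_entropy(logits, y_start)
```

Since every true label already has the largest logit in this toy example, the averaged loss is small but strictly positive.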
Finally, after obtaining the loss $\mathcal{L}_{span}$ of the sequence-labeling auxiliary task and the losses $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$ of the pointer-labeling main task, the three losses are fused into the overall training objective $\mathcal{L}$ that the model minimizes for end-to-end joint training:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{span} + \lambda_2 \mathcal{L}_{start} + \lambda_3 \mathcal{L}_{end}$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyper-parameters controlling the influence of each task on the overall training objective. In the test phase, the indices corresponding to the maximum of each Chinese character's predicted label probability distribution in $p_{start}$ and $p_{end}$ are taken as the predicted label indices $\hat{y}_{start}^{i}$ and $\hat{y}_{end}^{i}$:

$$\hat{y}_{start}^{i} = \arg\max_{c}\, p_{start}^{i,c}, \qquad \hat{y}_{end}^{i} = \arg\max_{c}\, p_{end}^{i,c}$$
Then each entity fragment head Chinese character is matched with the entity fragment tail Chinese character that has the same entity type and the nearest position, and the entities in the sequence are extracted.
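The test-time decoding just described can be sketched as follows: take the argmax label per character from $p_{start}$ and $p_{end}$, then pair each entity-head character with the nearest following tail character of the same entity type. The greedy pairing rule and label strings are illustrative assumptions; the example mirrors the patent's Fig. 4 sentence with two Loc entities.

```python
def decode_entities(start_labels, end_labels):
    """start_labels / end_labels: per-character predicted labels ('O' = none)."""
    entities = []
    for i, s_lab in enumerate(start_labels):
        if s_lab == "O":
            continue
        # match the nearest tail at or after the head with the same entity type
        for j in range(i, len(end_labels)):
            if end_labels[j] == s_lab:
                entities.append((i, j, s_lab))
                break
    return entities

# "Nanjing City" (chars 0-2) and "Changjiang River Bridge" (chars 3-6),
# both predicted as Loc entities.
starts = ["Loc", "O", "O", "Loc", "O", "O", "O"]
ends   = ["O", "O", "Loc", "O", "O", "O", "Loc"]
print(decode_entities(starts, ends))  # → [(0, 2, 'Loc'), (3, 6, 'Loc')]
```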
The invention provides a Chinese named entity extraction method based on a multi-label framework and fusion features. To verify its effectiveness, the method is evaluated on the OntoNotes 4, MSRA, Resume and Weibo datasets in terms of precision (P), recall (R) and F1, and compared with other Chinese named entity extraction methods.
The model optimizer is Adaptive moment estimation (Adam); the learning rate of the BERT training parameters is set to 3e-5 and that of the other model parameters to 1e-3. The BERT encoding dimension is $d_x = 768$. The multi-task learning model is the Progressive Layered Extraction (PLE) model; the numbers of task-specific experts and shared experts in PLE are both set to 2, each expert is a single-layer fully connected network, and the number of PLE layers is 2. The number of LSTM layers is 1, the LSTM encoding dimension is $d_h = 768$, the word vector encoding dimension is $d_w = 50$, and the pinyin vector encoding dimension is $d_{py} = 50$.
Table 1 compares the accuracy of different Chinese named entity extraction methods on the OntoNotes 4 dataset; Table 2 compares their accuracy on the MSRA dataset; Table 3 on the Resume dataset; and Table 4 on the Weibo dataset. The experimental results in the tables show that, compared with other Chinese named entity extraction methods, the proposed method achieves the best Chinese named entity extraction accuracy on most datasets and index items. Fig. 5(a)(b) shows the effect of the dictionary matching window size on accuracy on the OntoNotes 4 and MSRA datasets, and fig. 6(a)(b) shows the corresponding effect on the Resume and Weibo datasets; evaluating and analyzing the influence of the dictionary matching window size on Chinese named entity extraction accuracy provides guidance for choosing the window size in different subsequent application scenarios.
TABLE 1 comparison of accuracy of different entity extraction methods on the OntoNotes 4 dataset
TABLE 2 comparison of accuracy rates of different entity extraction methods on MSRA datasets
TABLE 3 comparison of the accuracy of the extraction methods for different entities on the Resume dataset
TABLE 4 comparison of accuracy of different entity extraction methods on Weibo data set
Claims (5)
1. A Chinese named entity extraction method based on a multi-label frame and fusion features comprises the following steps:
(1) performing word matching on each Chinese character in an input Chinese character sequence in an external dictionary, mapping words into word vectors by using a word vector query table, mapping word segmentation marks of the Chinese characters in the words into word marking vectors by using a word segmentation mark vector query table, and splicing the word segmentation mark vectors and the word vectors to form dictionary features;
(2) according to the meaning of the Chinese characters in the matching words, the Chinese characters are marked with pinyin, and pinyin characteristics are obtained by mapping the pinyin through a pinyin vector lookup table;
(3) fusing the dictionary features and the pinyin features into Chinese character codes obtained by a Chinese pre-training language model BERT based on a point-by-point attention mechanism, and providing Chinese character semantic codes combining the dictionary features and the pinyin features for follow-up;
(4) inputting the Chinese character semantic encodings into two independent bidirectional long short-term memory network models respectively for feature sequence modeling, and outputting a first feature sequence encoding $H^{a}$ and a second feature sequence encoding $H^{b}$ respectively;
(5) taking sequence labeling as an auxiliary task and pointer labeling as a main task, taking the first feature sequence encoding $H^{a}$ as the input of the sequence-labeling auxiliary task and the second feature sequence encoding $H^{b}$ as the input of the pointer-labeling main task, and performing joint learning on the sequence-labeling auxiliary task and the pointer-labeling main task with a multi-task learning model;
(6) calculating, with a conditional random field, the log-likelihood loss $\mathcal{L}_{span}$ of the sequence-labeling auxiliary task, together with the entity type classification cross entropy loss $\mathcal{L}_{start}$ of entity fragment head Chinese characters and the entity type classification cross entropy loss $\mathcal{L}_{end}$ of entity fragment tail Chinese characters in the pointer-labeling main task; taking the weighted sum of the above losses as the overall training objective $\mathcal{L}$ that the model minimizes; performing end-to-end joint training; and, in the test phase, extracting the entity fragments and their types in a sentence through the pointer-labeling main task.
2. The method for extracting Chinese named entities based on multi-label frame and fusion features as claimed in claim 1, wherein in step (1), the external dictionary and word vector lookup table is derived from pre-training word vectors published on the Internet, and the word segmentation label vector lookup table is composed of one-hot vectors.
3. The method as claimed in claim 1, wherein in step (2), the pinyin vector lookup table is obtained by word2vec training based on the external chinese corpus, and the text in the external chinese corpus is converted into pinyin by using chinese pinyin software.
4. The method for extracting named entity in Chinese based on multi-label frame and fusion features as claimed in claim 1, wherein in step (5), the sequence label assisting task uses BMOES without entity type to label the entities in the input sentence, which is responsible for the extraction of named entity fragment in Chinese, and the extracted entity fragment has no type; the pointer labeling main task only carries out entity type labeling on the head and tail Chinese characters of the entity fragment in the sentence and is responsible for extracting the named Chinese entity, and the extracted entity has a type.
5. The method for extracting named entities in Chinese based on multi-label frame and fusion features as claimed in claim 1, wherein in step (6), the test stage takes the label corresponding to the maximum value of the predicted probability distribution of each Chinese character entity type as the predicted label of the Chinese character, then matches the tail Chinese character of the entity segment with the same type as the Chinese character at the head of the entity segment and the closest position distance, and extracts the text segment between the head Chinese character of the entity segment and the tail Chinese character of the entity segment as the entity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110511025.8A CN113190656B (en) | 2021-05-11 | 2021-05-11 | Chinese named entity extraction method based on multi-annotation frame and fusion features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110511025.8A CN113190656B (en) | 2021-05-11 | 2021-05-11 | Chinese named entity extraction method based on multi-annotation frame and fusion features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113190656A true CN113190656A (en) | 2021-07-30 |
CN113190656B CN113190656B (en) | 2023-07-14 |
Family
ID=76981067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110511025.8A Active CN113190656B (en) | 2021-05-11 | 2021-05-11 | Chinese named entity extraction method based on multi-annotation frame and fusion features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113190656B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114036933A (en) * | 2022-01-10 | 2022-02-11 | 湖南工商大学 | Information extraction method based on legal documents |
CN114139541A (en) * | 2021-11-22 | 2022-03-04 | 北京中科闻歌科技股份有限公司 | Named entity identification method, device, equipment and medium |
CN115146644A (en) * | 2022-09-01 | 2022-10-04 | 北京航空航天大学 | Multi-feature fusion named entity identification method for warning situation text |
CN115470871A (en) * | 2022-11-02 | 2022-12-13 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10032451B1 (en) * | 2016-12-20 | 2018-07-24 | Amazon Technologies, Inc. | User recognition for speech processing systems |
US10140973B1 (en) * | 2016-09-15 | 2018-11-27 | Amazon Technologies, Inc. | Text-to-speech processing using previously speech processed data |
CN109446521A (en) * | 2018-10-18 | 2019-03-08 | 京东方科技集团股份有限公司 | Name entity recognition method, device, electronic equipment, machine readable storage medium |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111462752A (en) * | 2020-04-01 | 2020-07-28 | 北京思特奇信息技术股份有限公司 | Client intention identification method based on attention mechanism, feature embedding and BI-L STM |
CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on L attice-L STM |
- 2021-05-11: application CN202110511025.8A granted as patent CN113190656B (en), status Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10140973B1 (en) * | 2016-09-15 | 2018-11-27 | Amazon Technologies, Inc. | Text-to-speech processing using previously speech processed data |
US10032451B1 (en) * | 2016-12-20 | 2018-07-24 | Amazon Technologies, Inc. | User recognition for speech processing systems |
CN109446521A (en) * | 2018-10-18 | 2019-03-08 | 京东方科技集团股份有限公司 | Name entity recognition method, device, electronic equipment, machine readable storage medium |
CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on L attice-L STM |
CN111462752A (en) * | 2020-04-01 | 2020-07-28 | 北京思特奇信息技术股份有限公司 | Client intention identification method based on attention mechanism, feature embedding and BI-L STM |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
Non-Patent Citations (3)
Title |
---|
FENIL DOSHI等: "Normalizing Text using Language Modelling based on Phonetics and String Similarity", ARXIV, pages 1 - 9 * |
H PENG等: "Phonetic-enriched Text Representation for Chinese Sentiment Analysis with Reinforcement Learning", IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, pages 1 - 16 * |
JIANG Tao: "Research and Application of Key Technologies for Named Entity Recognition in Electronic Medical Records Based on Deep Neural Networks", China Masters' Theses Full-text Database (Medicine and Health Sciences), no. 7, pages 053 - 210 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114139541A (en) * | 2021-11-22 | 2022-03-04 | 北京中科闻歌科技股份有限公司 | Named entity identification method, device, equipment and medium |
CN114036933A (en) * | 2022-01-10 | 2022-02-11 | 湖南工商大学 | Information extraction method based on legal documents |
CN114036933B (en) * | 2022-01-10 | 2022-04-22 | 湖南工商大学 | Information extraction method based on legal documents |
CN115146644A (en) * | 2022-09-01 | 2022-10-04 | 北京航空航天大学 | Multi-feature fusion named entity identification method for warning situation text |
CN115470871A (en) * | 2022-11-02 | 2022-12-13 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
CN115470871B (en) * | 2022-11-02 | 2023-02-17 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
Also Published As
Publication number | Publication date |
---|---|
CN113190656B (en) | 2023-07-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN108416058B (en) | Bi-LSTM input information enhancement-based relation extraction method | |
CN110377903B (en) | Sentence-level entity and relation combined extraction method | |
CN110008469B (en) | Multilevel named entity recognition method | |
CN113190656B (en) | Chinese named entity extraction method based on multi-annotation frame and fusion features | |
CN111666758B (en) | Chinese word segmentation method, training device and computer readable storage medium | |
CN111767718B (en) | Chinese grammar error correction method based on weakened grammar error feature representation | |
CN110688862A (en) | Mongolian-Chinese inter-translation method based on transfer learning | |
CN112926324B (en) | Vietnamese event entity recognition method integrating dictionary and anti-migration | |
CN114757182A (en) | BERT short text sentiment analysis method for improving training mode | |
CN115081437B (en) | Machine-generated text detection method and system based on linguistic feature contrast learning | |
CN112183064B (en) | Text emotion reason recognition system based on multi-task joint learning | |
CN110852089B (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN116151256A (en) | Small sample named entity recognition method based on multitasking and prompt learning | |
CN111368542A (en) | Text language association extraction method and system based on recurrent neural network | |
CN116432655B (en) | Method and device for identifying named entities with few samples based on language knowledge learning | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN115831102A (en) | Speech recognition method and device based on pre-training feature representation and electronic equipment | |
CN114912453A (en) | Chinese legal document named entity identification method based on enhanced sequence features | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN117196032A (en) | Knowledge graph construction method and device for intelligent decision, electronic equipment and storage medium | |
CN113191150A (en) | Multi-feature fusion Chinese medical text named entity identification method | |
CN112818698A (en) | Fine-grained user comment sentiment analysis method based on dual-channel model | |
CN115186670B (en) | Method and system for identifying domain named entities based on active learning | |
CN113139050B (en) | Text abstract generation method based on named entity identification additional label and priori knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||