CN117217223A - Chinese named entity recognition method and system based on multi-feature embedding - Google Patents

Chinese named entity recognition method and system based on multi-feature embedding

Info

Publication number
CN117217223A
CN117217223A (application CN202310911373.3A)
Authority
CN
China
Prior art keywords
root
vectors
model
vector
information
Prior art date
Legal status
Pending
Application number
CN202310911373.3A
Other languages
Chinese (zh)
Inventor
胡为
刘伟
蔡思涵
李小智
陶家俊
Current Assignee
Hunan University of Chinese Medicine
Original Assignee
Hunan University of Chinese Medicine
Priority date
Filing date
Publication date
Application filed by Hunan University of Chinese Medicine filed Critical Hunan University of Chinese Medicine
Priority to CN202310911373.3A
Publication of CN117217223A

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese named entity recognition method based on multi-feature embedding, which comprises the following steps: extracting character vectors containing rich context information using a BERT model, and processing the text to obtain root embedding vectors and glyph embedding vectors; extracting features from the character vectors and the root embedding vectors using a bidirectional long short-term memory network (BiLSTM), extracting features from the glyph embedding vectors using an iterated dilated convolutional neural network (IDCNN), and concatenating the three feature vectors; inputting the concatenated feature vectors into a multi-head self-attention layer, dynamically fusing them, and extracting key features; and performing label decoding on the sequence using a conditional random field (CRF). The invention also discloses a Chinese named entity recognition system based on multi-feature embedding. The F1 value is significantly improved.

Description

Chinese named entity recognition method and system based on multi-feature embedding
Technical Field
The invention relates to the technical field of named entity recognition, in particular to a Chinese named entity recognition method and system based on multi-feature embedding.
Background
Named entity recognition is one of the important subtasks in the field of natural language processing; it extracts useful information from unstructured or structured text. Named entities include person names, place names, organization names, and so on, which are of significant value to other natural language processing tasks such as relation extraction, entity linking, knowledge graphs and intelligent question answering. However, research on Chinese named entity recognition started later, and the differences between Chinese and English make it difficult to transfer English named entity recognition methods to Chinese research, mainly in the following respects. First, Chinese text does not use spaces as separators the way English text does, so word segmentation boundaries are difficult to determine. For example, "Nanjing Yangtze River Bridge" (南京市长江大桥) can be understood either as "the Mayor of Nanjing, Jiang Daqiao" (南京市长/江大桥) or as "the Yangtze River Bridge in Nanjing" (南京市/长江大桥). Second, the semantics of Chinese words can change with time and occasion, which increases the complexity of Chinese named entity recognition. For example, the entity "China Construction Bank" in the sentence "Zhang San is now at China Construction Bank" should be labeled as a location (LOC) when describing where Zhang San is, but should be labeled as an organization (ORG) when describing Zhang San's situation. Third, with the rapid development of the Internet, a large number of web texts have emerged and Chinese characters are used in increasingly personalized and casual ways, which also increases the difficulty of Chinese named entity recognition. To label entities correctly in the face of these difficulties, the context must be taken into account so that each character carries more contextual information. General-domain methods introduce a pre-trained model on top of a character-based CNER model so that richer semantic information can be captured, but they do not consider the semantic information contained in the inherent features of Chinese characters.
In the early stages of named entity recognition, researchers mainly used rule-based or statistical methods to identify named entities. However, these methods usually require rules or features to be created manually and demand significant time and effort. In subsequent studies, researchers began to use traditional machine learning algorithms, such as support vector machines (SVM) and conditional random fields (CRF), for named entity recognition. These algorithms typically rely on manually designed features to help identify entities. While such approaches achieved good results on some tasks, they remain limited by feature design.
With the rapid development of deep learning, deep-learning-based NER methods have gradually become mainstream and have achieved continuous performance improvements. Compared with traditional machine learning, deep neural networks can automatically extract character-level, word-level and sentence-level features, which reduces the subjectivity of feature selection, makes full use of the original information in the data, and helps further improve recognition. Currently, convolutional neural networks (CNN), recurrent neural networks (RNN), gated recurrent units (GRU), long short-term memory networks (LSTM) and other deep neural networks are widely used in named entity recognition. An entity recognition neural network used on its own only considers the sample input and performs nonlinear transformations inside its structural units, without further considering the output process and the meaning of the results. Based on the idea of model fusion, researchers therefore usually take LSTM-CRF as the main structure to remedy the shortcomings of a single entity recognition neural network model. On this basis, Lample et al. [1] proposed a bidirectional LSTM combined with a CRF to further improve model performance, and this has since gradually become the mainstream model, applicable to NER in various fields [2-4]. Thereafter, a number of new models have been proposed, such as IDCNN-CRF [5], the Transformer [6] and GCN [7].
In named entity recognition tasks, particularly English named entity recognition, word-level embedding representations are often used to build models. However, in Chinese named entity recognition, Chinese text has no obvious word boundaries and word segmentation errors are easy to make, and these errors have a large impact on entity recognition performance. Accordingly, many researchers have adopted character-based entity recognition methods to reduce recognition errors caused by word segmentation. Liu et al. [8] proved that character-level embedding is more suitable for the Chinese named entity recognition task than word-level embedding. Dong et al. [9] were the first to apply a character-based BiLSTM-CRF model to Chinese NER and integrated radical-level features, achieving better performance on Chinese NER tasks.
However, character vectors alone contain relatively limited information, so researchers began to introduce pre-trained models into NER to further improve performance. Trained on large-scale unlabeled data, a pre-trained model can capture richer semantic information when handling a named entity recognition task. Pre-trained models have developed through Word2vec [10], GloVe [11], ELMo [12], BERT [13], ALBERT [14] and others. Earlier, the most widely used pre-trained model was Word2vec, but the word vectors it trains are static and do not change with the context. Later, the Google AI team proposed the BERT (Bidirectional Encoder Representations from Transformers) model, which adopts a bidirectional Transformer encoder structure; the trained vectors are recomputed according to the context each time the model runs, a property that allows BERT to achieve state-of-the-art results on a variety of NLP tasks.
Although character-sequence-based models have achieved good performance, they have the disadvantage of not exploiting word-level information. In practical applications, a large number of entity words are known, and these entity words tend to have fixed features and contextual semantic information. For this purpose, some researchers began to study lexicon enhancement methods, adding external dictionaries to character-embedding-based models to provide additional information. For example, Zhang et al. [15] first proposed the lexicon-enhanced named entity recognition model Lattice LSTM, which uses a lattice-structured LSTM to represent dictionary words in the sentence, thereby integrating latent word information into the character-based LSTM-CRF and achieving the best results compared with character-based and word-based baselines. Liu et al. [16] proposed the WC-LSTM model, which integrates word information into the character-based model with better results than Lattice LSTM and higher efficiency. To address the problem that LSTM cannot exploit GPU parallelism, Li et al. [17] proposed the FLAT (Flat-Lattice Transformer) model, which converts the lattice structure into a flat structure consisting of spans. Each span corresponds to a character or latent word and its position in the original lattice. With the powerful capability and elaborate position encoding of the Transformer, FLAT can make full use of lattice information and has excellent parallelization capability.
In addition to adding external dictionary information, researchers have explored adding external information such as radicals and strokes to improve named entity recognition performance. Chinese characters are pictographic and themselves contain a large amount of information, such as radicals, root (component) composition and strokes. Zhang et al. [18] proved that Chinese characters with similar strokes, structures and pinyin have similar semantics. General-domain NER methods do not consider this inherent information of Chinese characters, so domestic research has focused on exploring the characteristics of Chinese characters to enhance the semantic information of the vectors. Some researchers have proposed radical-based methods to improve named entity recognition performance. For example, Li et al. [19] added radical codes to the character vectors and considered radical information when computing the labeling score in the CRF layer; experiments on several datasets proved that the proposed model outperforms general models and further improves named entity recognition in the medical domain. In addition, Cao et al. [20] were the first to propose enhancing semantics with Chinese stroke features, defining five different types of strokes and assigning a unique ID to each, so that the stroke sequence of each character is encoded as a feature vector and combined with the character vector to improve model performance. Some researchers began to apply image processing techniques to Chinese named entity recognition. For example, Su et al. [21] used a convolutional autoencoder (convAE) to learn glyph features directly from character bitmaps, improving the performance of character-embedding-based Chinese word representation models. Some researchers converted Chinese characters into images and extracted glyph features with convolutional neural networks (CNN) [22, 23], but obtained only marginal improvements. Meng et al. [24] avoided the shortcomings of earlier work and proposed the glyph-based BERT model Glyce, which encodes each Chinese character with a Tianzige (field-character-grid) CNN. The Tianzige grid is a traditional form of Chinese handwriting that matches the radical layout within Chinese characters; Glyce then uses a Transformer as the sequence encoder and achieves optimal performance. Xuan et al. [25] proposed the FGN model, which uses a novel CNN structure called CGS-CNN to capture interaction information between glyph information and neighboring graphs, and adopts a fusion method based on an asynchronous sliding window and slice attention to fuse the output representations of BERT and CGS-CNN, providing additional interaction information for the NER task and further improving named entity recognition performance. Inspired by these methods, the invention adds external root and glyph information to improve model performance.
In addition, an attention mechanism is added to the model, which can assign different weights to the characters in the text so that the model focuses on important features, ignores irrelevant ones, and extracts entities more accurately. Zhang et al. [26] used an attention mechanism to further extract, from the features output by the BiLSTM and IDCNN fusion layer, the features that play a key role in entity recognition while ignoring irrelevant features. Inspired by this, the invention uses a multi-head self-attention mechanism to fuse the character feature vectors, root feature vectors and glyph feature vectors, ignore irrelevant or interfering features, and extract key features and inter-character association features.
[1] LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural Architectures for Named Entity Recognition [M/OL]. arXiv, 2016 [2023-04-04]. http://arxiv.org/abs/1603.01360.
[2] Liu Ru, Song Yang, Gu Rui, et al. Recognition of protected health information in Chinese clinical text based on BiLSTM-CRF [J]. Data Analysis and Knowledge Discovery, 2020, 4(10): 124-133.
[3] Hu Fengju, Pei Wei. Named entity recognition of Chinese text based on BiLSTM-CRF [J]. World Science and Technology - Modernization of Traditional Chinese Medicine, 2020, 22(7): 2504-2510.
[4] TANG P, YANG P L, SHI Y. Recognizing Chinese judicial named entity using BiLSTM-CRF [EB/OL]. IOPscience [2023-04-04]. https://iopscience.iop.org/article/10.1088/1742-6596/1592/1/012040/meta.
[5] Li Ni, Guan Huanmei, Yang Piao, et al. Chinese named entity recognition method based on BERT-IDCNN-CRF [J]. Journal of Shandong University (Natural Science), 2020, 55(1): 102-109.
[6] VASWANI A, SHAZEER N, PARMAR N, et al. Attention Is All You Need [C/OL]//Advances in Neural Information Processing Systems, Volume 30. Curran Associates, Inc., 2017 [2023-04-04]. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[7] KIPF T N, WELLING M. Semi-Supervised Classification with Graph Convolutional Networks [M/OL]. arXiv, 2017 [2023-04-04]. http://arxiv.org/abs/1609.02907.
[8] LIU Z, ZHU C, ZHAO T. Chinese Named Entity Recognition with a Sequence Labeling Approach: Based on Characters, or Based on Words? [C/OL]//International Conference on Advanced Intelligent Computing Theories & Applications. 2010 [2023-04-04]. http://link.springer.com/chapter/10.1007/978-3-642-14932-0_78.
[9] DONG C, ZHANG J, ZONG C, et al. Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition [C/OL]//International Conference on Computer Processing of Oriental Languages / National CCF Conference on Natural Language Processing and Chinese Computing. 2016 [2023-04-04]. http://link.springer.com/chapter/10.1007/978-3-319-50496-4_20.
[10] MIKOLOV T, SUTSKEVER I, KAI C, et al. Distributed Representations of Words and Phrases and their Compositionality [P/OL]. 2013 [2023-04-04]. http://doc.paperpass.com/patent/arXiv13104546.html.
[11] PENNINGTON J, SOCHER R, MANNING C. GloVe: Global Vectors for Word Representation [C/OL]//Conference on Empirical Methods in Natural Language Processing. 2014 [2023-04-04]. http://www.xueshufan.com/publication/2250539671.
[12] PETERS M, NEUMANN M, IYYER M, et al. Deep Contextualized Word Representations [J/OL]. 2018 [2023-04-04]. http://arxiv.org/abs/1802.05365v1.
[13] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [J/OL]. 2018 [2023-04-04]. http://arxiv.org/abs/1810.04805v2.
[14] LAN Z, CHEN M, GOODMAN S, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations [J/OL]. 2019 [2023-04-04]. http://arxiv.org/abs/1909.11942.
[15] ZHANG Y, YANG J. Chinese NER Using Lattice LSTM [M/OL]. arXiv, 2018 [2023-04-04]. http://arxiv.org/abs/1805.02023.
[16] LIU W, XU T, XU Q, et al. An Encoding Strategy Based Word-Character LSTM for Chinese NER [C/OL]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019: 2379-2389 [2023-04-04]. https://aclanthology.org/N19-1247.
[17] LI X, YAN H, QIU X, et al. FLAT: Chinese NER Using Flat-Lattice Transformer [M/OL]. arXiv, 2020 [2023-04-04]. http://arxiv.org/abs/2004.11795.
[18] ZHANG Y, LIU Y, ZHU J, et al. Learning Chinese Word Embeddings from Stroke, Structure and Pinyin of Characters [C/OL]//The 28th ACM International Conference. 2019 [2023-04-04]. http://dl.acm.org/doi/epdf/10.1145/3357384.3358005.
[19] Li Dan, Xu Tong, Zheng Yi, et al. Radical-aware Chinese medical named entity recognition [J]. Journal of Chinese Information Processing, 2020, 34(12): 54-64.
[20] CAO S, LU W, ZHOU J, et al. cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information [J/OL]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1) [2023-04-04]. https://ojs.aaai.org/index.php/AAAI/article/view/12029.
[21] SU T R, LEE H Y. Learning Chinese Word Representations From Glyphs Of Characters [J/OL]. 2017 [2023-04-04]. http://arxiv.org/abs/1708.04755.
[22] DAI F Z, CAI Z. Glyph-aware Embedding of Chinese Characters [J/OL]. 2017 [2023-04-04]. http://arxiv.org/abs/1709.00028v1.
[23] SHAO Y, HARDMEIER C, TIEDEMANN J, et al. Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF [J/OL]. 2017 [2023-04-04]. http://arxiv.org/abs/1704.01314v3.
[24] MENG Y, WU W, WANG F, et al. Glyce: Glyph-vectors for Chinese Character Representations [M/OL]. arXiv, 2020 [2023-04-04]. http://arxiv.org/abs/1901.10125.
[25] XUAN Z, BAO R, JIANG S. FGN: Fusion Glyph Network for Chinese Named Entity Recognition [EB/OL]. SpringerLink [2023-04-04]. https://link.springer.com/chapter/10.1007/978-981-16-1964-9_3.
[26] Zhang Yi, Wang Shuangsheng. Named entity recognition method for elementary mathematics text based on BERT [J/OL]. Journal of Computer Applications, 2022, 42(2): 433.
[27] HINTON G, VINYALS O, DEAN J. Distilling the Knowledge in a Neural Network [J/OL]. Computer Science, 2015, 14(7): 38-39.
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to provide a Chinese named entity recognition method and system based on multi-feature embedding, with which the F1 value is significantly improved.
In order to achieve the above purpose, the invention adopts the following technical scheme: a Chinese named entity recognition method based on multi-feature embedding, comprising the following steps:
Step 1, extracting character vectors containing rich context information using a BERT model, and processing the text to obtain root embedding vectors and glyph embedding vectors;
Step 2, extracting features from the character vectors and the root embedding vectors using a bidirectional long short-term memory network (BiLSTM), extracting features from the glyph embedding vectors using an iterated dilated convolutional neural network (IDCNN), and concatenating the three feature vectors;
Step 3, inputting the concatenated feature vectors into a multi-head self-attention layer, and dynamically fusing the concatenated vectors to extract key features;
Step 4, performing label decoding on the sequence using a conditional random field (CRF).
As a further improvement of the invention, the method further comprises the following steps:
Step 5, following the knowledge distillation method, setting the teacher model and the student model to be the same model, and using the output probability distribution of the teacher model to guide the training of the student model.
As a further improvement of the present invention, in step 1, the root embedding vector is processed as follows:
obtaining mappings from a number of Chinese characters to their roots by web crawling, and saving the picture corresponding to each character; for the encoding of root information, traversing the root set corresponding to each Chinese character in the character-root table, assigning a unique index ID to each root, and constructing a root vocabulary; if a character in the text is among the crawled Chinese characters, finding the corresponding root set according to the character-root table, traversing each root in the set, finding the corresponding index ID in the root vocabulary, and constructing a root vector; setting the root vector length of each character to 9 and defining a padding label; if a character has fewer than 9 roots or the character is not found, padding the root vector with the index ID corresponding to the padding label.
As a further improvement of the invention, in step 1, the BERT model adopts a bidirectional Transformer encoder formed by stacking multiple Transformer Encoder layers; by considering all the token information in the context simultaneously when processing text, it can capture finer-grained semantic information. The BERT model adopts a masked language model task and a next sentence prediction task: in the masked language model task, the BERT model randomly masks some tokens in the text and then predicts the masked tokens from the context; in the next sentence prediction task, the BERT model predicts whether two sentences are adjacent.
As a further improvement of the invention, the bidirectional long short-term memory network BiLSTM comprises two long short-term memory networks (LSTM) in opposite directions, which model the forward and backward directions of the sequence respectively, obtain the context information of each position in the sequence, and combine the outputs of the two directions to obtain a more comprehensive feature representation;
the long short-term memory network LSTM uses an input gate, a forget gate and an output gate to control the flow and storage of information: the input gate controls the degree to which the current input affects the memory cell, the forget gate controls the degree to which historical information affects the memory cell, and the output gate controls the degree to which the memory cell affects the current output; the input gate, forget gate, output gate and memory cell are updated as follows:
$$i_t = \sigma(W_i[x_t, h_{t-1}] + b_i)$$
$$f_t = \sigma(W_f[x_t, h_{t-1}] + b_f)$$
$$o_t = \sigma(W_o[x_t, h_{t-1}] + b_o)$$
$$c_t = f_t * c_{t-1} + i_t * \tanh(W_c[x_t, h_{t-1}] + b_c)$$
$$h_t = o_t * \tanh(c_t)$$
where $i_t$, $f_t$, $o_t$ and $c_t$ denote the states of the input gate, the forget gate, the output gate and the memory cell at time t respectively, $h_t$ denotes the hidden state at time t, $\sigma$ denotes the sigmoid function, $W$ denotes a weight matrix, $b$ is a bias term, and $x_t$ is the input at time t.
As a further improvement of the present invention, in step 2, the iterated dilated convolutional neural network IDCNN is formed by concatenating 4 modules with the same structure, and each module is a three-layer dilated convolution (DCNN) with dilation rates of 1, 1 and 2.
As a further improvement of the invention, in step 3, scaled dot-product attention is used; attention is a method of mapping a query Q and key-value pairs (K, V) to an output, where Q, K and V are vectors representing information that differs from task to task; the output is obtained by a weighted sum over V, the weights being the similarity between Q and K; the similarity is computed as the dot product of the two vectors divided by a scaling factor $\sqrt{d_k}$, which makes the attention weight distribution smoother and avoids values that are too large or too small after normalization. The calculation formula is as follows, where $d_k$ is the key vector dimension and T denotes the transpose:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
when Q = K = V, the three vectors represent the feature information of the text sequence, i.e., a self-attention mechanism, which is applied at the encoder side of the named entity recognition task to extract features, so that the model assigns different weights according to the importance of the information and thus pays more attention to the characters related to entities;
the multi-head self-attention layer applies a linear transformation to the feature vector of each token in the input sequence to obtain query, key and value vectors, which are then split into multiple heads; in each head, the attention weights are computed and used to weight the value vectors to obtain the output vector of that head; finally, the output vectors of all heads are concatenated and passed through a further linear transformation and a nonlinear transformation to obtain the final output vector.
As a further improvement of the present invention, the step 4 is specifically as follows:
given an input sequence x=x 1 ,x 2 ,x 3 ,…,x n Corresponding predicted tag sequence y=y 1 ,y 2 ,y 3 ,…,y n The score of Y was calculated as follows:
wherein the method comprises the steps ofRepresenting that the ith word is mapped to tag y i Probability of->Representing label y i Transfer to tag y i+1 Probability of (2);
then, the probability of the predicted tag sequence Y is obtained, and the calculation formula is as follows:
wherein the method comprises the steps ofRepresenting possible annotation sequences, Y X Representing all possible annotation sequences;
and finally outputting the optimal labeling sequence Y corresponding to the maximum likelihood function of P (Y|X), wherein the optimal labeling sequence Y is as follows:
as a further improvement of the present invention, the step 5 is specifically as follows:
two loss functions are introduced into the knowledge distillation method, namely the distillation loss $L_{dis}$ and the student loss $L_{stu}$; the KL divergence is used as the distillation loss function, and the student loss is a loss function measuring the difference between the student model output and the true labels; the two loss functions are weighted and summed to obtain the final loss function L, calculated as follows, where $\alpha$ adjusts the relative weight of the two loss terms and is a hyperparameter:
$$L = (1 - \alpha) L_{dis} + \alpha L_{stu}$$
The invention also provides a Chinese named entity recognition system based on multi-feature embedding, which consists of four layers. The first layer is the input layer, which obtains character vectors containing rich semantic features through the BERT model and obtains root embedding vectors and glyph embedding vectors through processing. The second layer is the encoding layer of the bidirectional long short-term memory network BiLSTM and the iterated dilated convolutional neural network IDCNN; this layer is divided into two branches: one branch uses BiLSTM to extract features from the character vectors and the root embedding vectors respectively, and the other branch uses IDCNN to extract features from the glyph embedding vectors; the three output vectors of the second layer are then concatenated. The third layer is the multi-head self-attention layer, which dynamically fuses the concatenated vectors to extract key features. The fourth layer is the conditional random field (CRF) decoding layer, which decodes the feature vectors to obtain the final tag sequence.
The beneficial effects of the invention are as follows:
1. The invention introduces the root features and glyph features of Chinese characters and enhances semantic expression at character granularity. Root feature information and contextual root information are extracted with a bidirectional long short-term memory (BiLSTM) network. Chinese characters are treated as images, and glyph information is extracted at the image level with an iterated dilated convolutional neural network (IDCNN). Experimental results show that on the Weibo and OntoNotes 4.0 datasets, adding each of the two features separately significantly improves the F1 value.
2. The invention explores different fusion modes for multiple features and finds that a multi-head self-attention mechanism can effectively extract key features and inter-character association dependencies, dynamically fusing multiple feature vectors and further improving model performance.
3. The invention introduces a knowledge distillation method, sets the teacher model and the student model to be the same model, and uses the class prediction information of the teacher model to guide the learning of the student model. Experimental results show that on the Weibo and OntoNotes 4.0 datasets, the F1 value is significantly improved after adding knowledge distillation.
Drawings
FIG. 1 is a schematic diagram of a model structure established by a Chinese named entity recognition method based on multi-feature embedding in an embodiment of the invention;
FIG. 2 is a schematic diagram of the Transformer Encoder structure in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an LSTM cell structure in an embodiment of the invention;
FIG. 4 is a schematic diagram of the BiLSTM structure in an embodiment of the invention;
FIG. 5 is a schematic diagram of the dilation of the convolution kernel in the dilated convolution operation in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of the dilation rates of the dilated convolution in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating the operation of the self-attention mechanism according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating the case of a query, key, and value vector with a head number of 2 according to an embodiment of the present invention;
fig. 9 is a schematic diagram of experimental results of dropout size setting in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Examples
A Chinese named entity recognition method based on multi-feature embedding establishes a model whose structure is shown in FIG. 1. The model consists of four layers. The first layer is the input layer, which obtains character vectors containing rich semantic features through the BERT model and obtains root embedding vectors and glyph embedding vectors through processing. The second layer is the BiLSTM and IDCNN encoding layer; it is divided into two branches: one branch uses BiLSTM to extract features from the character vectors and the root embedding vectors respectively, and the other branch uses IDCNN to extract features from the glyph embedding vectors; the three output vectors of the second layer are then concatenated. The third layer is the multi-head self-attention layer, which dynamically fuses the concatenated vectors to extract key features. The fourth layer is the CRF decoding layer, which decodes the feature vectors to obtain the final tag sequence. It should be noted that the root embedding vectors here are processed by a separate BiLSTM: the character vectors obtained from BERT pre-training and the root embedding vectors obtained from the embedding layer are each passed through BiLSTM, rather than first concatenating the character vectors and root embedding vectors and then feeding them into BiLSTM. The root feature vectors obtained in this way contain not only rich root feature information but also contextual root information.
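A minimal sketch of how these four layers connect is given below, assuming PyTorch (the framework named in the experimental setup); the module names, dimensions and the single convolution standing in for the IDCNN branch are illustrative assumptions, not the exact implementation, and the emission scores it returns would be consumed by the CRF decoding shown later.

```python
import torch
import torch.nn as nn

class MultiFeatureNER(nn.Module):
    def __init__(self, num_tags=17, char_dim=768, root_dim=64, glyph_dim=64,
                 hidden=128, heads=16):
        super().__init__()
        # Layer 2, branch 1: BiLSTM over BERT character vectors and root embeddings
        self.char_bilstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
        self.root_bilstm = nn.LSTM(root_dim, hidden, batch_first=True, bidirectional=True)
        # Layer 2, branch 2: a single Conv1d stands in for the IDCNN glyph encoder here
        self.glyph_conv = nn.Conv1d(glyph_dim, 2 * hidden, kernel_size=3, padding=1)
        fused_dim = 3 * 2 * hidden
        # Layer 3: multi-head self-attention fusion of the concatenated features
        self.attn = nn.MultiheadAttention(fused_dim, heads, batch_first=True)
        # Layer 4 input: per-tag emission scores that a CRF decoder would consume
        self.emission = nn.Linear(fused_dim, num_tags)

    def forward(self, char_vec, root_vec, glyph_vec):
        c, _ = self.char_bilstm(char_vec)
        r, _ = self.root_bilstm(root_vec)
        g = self.glyph_conv(glyph_vec.transpose(1, 2)).transpose(1, 2)
        spliced = torch.cat([c, r, g], dim=-1)           # concatenate the three feature vectors
        fused, _ = self.attn(spliced, spliced, spliced)  # Q = K = V self-attention fusion
        return self.emission(fused)

model = MultiFeatureNER()
emissions = model(torch.randn(1, 150, 768), torch.randn(1, 150, 64), torch.randn(1, 150, 64))
```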
The Chinese named entity recognition method of this embodiment is further described below:
1. root embedding:
In this embodiment, mappings from more than 7,000 Chinese characters to their roots are obtained by web crawling, and the picture corresponding to each character is saved. For the encoding of root information, the root set corresponding to each Chinese character in the character-root table is first traversed, a unique index ID is assigned to each root, and a root vocabulary is constructed. If a character in the text is among the more than 7,000 Chinese characters, the corresponding root set is found according to the character-root table, each root in the set is traversed, the corresponding index ID is found in the root vocabulary, and a root vector is constructed. Observation of the character-root table shows that no Chinese character has more than 9 roots, so this embodiment sets the root vector length of each character to 9 and sets the padding label to "pad". If a character has fewer than 9 roots or the character is not found, the root vector is padded with the index ID corresponding to the padding label.
The magnitude of the index would otherwise affect the model, so this embodiment adds an embedding layer. The embedding layer has only one weight parameter w, with shape (root vocabulary size, root vector dimension). When the layer is created, the weights are randomly initialized to follow a standard normal distribution. This parameter serves as a lookup matrix: the root vector corresponding to each index ID is looked up in the matrix w, realizing the mapping from roots to root vectors so that no ordinal relationship exists between indices. It should be noted that the input to the embedding layer should be a long integer tensor.
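A minimal sketch of the root vocabulary construction and embedding lookup described above follows, assuming a hypothetical char_to_roots mapping obtained by crawling; the sample characters and the 64-dimensional embedding (the dimension found best in the experiments below) are illustrative.

```python
import torch
import torch.nn as nn

char_to_roots = {"明": ["日", "月"], "好": ["女", "子"]}  # hypothetical crawled character-root mapping

# Build the root vocabulary: every distinct root gets a unique index ID; "pad" is reserved.
root2id = {"pad": 0}
for roots in char_to_roots.values():
    for r in roots:
        root2id.setdefault(r, len(root2id))

MAX_ROOTS = 9  # no character in the table has more than 9 roots

def encode_roots(sentence):
    """Map each character to a length-9 list of root index IDs, padded with the 'pad' ID."""
    ids = []
    for ch in sentence:
        roots = char_to_roots.get(ch, [])
        row = [root2id[r] for r in roots][:MAX_ROOTS]
        row += [root2id["pad"]] * (MAX_ROOTS - len(row))
        ids.append(row)
    return torch.tensor(ids, dtype=torch.long)  # embedding input must be a long integer tensor

root_embedding = nn.Embedding(len(root2id), 64)        # weight w: (root vocabulary size, root dim)
root_vectors = root_embedding(encode_roots("明天好"))   # (seq_len, 9, 64)
```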
2. BERT pre-training model:
BERT (Bidirectional Encoder Representations from Transformers) is a language model based on the Transformer architecture that is pre-trained without supervision on a large corpus, so that the model can learn rich language knowledge and context-related information. As shown in FIG. 2, the model adopts a bidirectional Transformer encoder formed by stacking multiple Transformer Encoder layers; by considering all the token information in the context simultaneously when processing text, it can capture finer-grained semantic information. In addition, the BERT model adopts the masked language model (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP) tasks: in the MLM task, BERT randomly masks some tokens in the text and then predicts the masked tokens from the context; in the NSP task, BERT predicts whether two sentences are adjacent. Through these two tasks, the model can better understand the semantic relations between contexts and sentences, and the character vectors obtained from training represent the contextual semantics of each character well, so they perform excellently in natural language processing tasks.
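A minimal sketch of obtaining character vectors from the Chinese BERT model with the Hugging Face transformers library; the example sentence is taken from the background discussion, and this only illustrates the input layer, not the patented implementation.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "张三现在在中国建设银行"
inputs = tokenizer(sentence, return_tensors="pt")   # Chinese is tokenized at character level
with torch.no_grad():
    outputs = bert(**inputs)

char_vectors = outputs.last_hidden_state            # (1, seq_len + 2, 768), incl. [CLS]/[SEP]
```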
3. BiLSTM feature extraction model:
BiLSTM (Bidirectional Long Short-Term Memory), a variant of the recurrent neural network (RNN) for sequence data, solves the problem that the RNN model has difficulty capturing long-term dependencies. The BiLSTM neural network model comprises long short-term memory networks (LSTM) in two directions, which model the forward and backward directions of the sequence respectively, obtain the context information of each position in the sequence, and combine the outputs of the two directions to obtain a more comprehensive feature representation; its structure is shown in FIG. 4. Its core is the LSTM unit, a recurrent neural network unit that can handle long-term dependencies. It introduces three gates, an input gate, a forget gate and an output gate, to control the flow and storage of information, so that effective learning and prediction can be performed over long sequences; the structure is shown in FIG. 3. The input gate controls the degree to which the current input affects the memory cell, the forget gate controls the degree to which historical information affects the memory cell, and the output gate controls the degree to which the memory cell affects the current output. The three gates and the memory cell are updated as follows.
$$i_t = \sigma(W_i[x_t, h_{t-1}] + b_i) \quad (1)$$
$$f_t = \sigma(W_f[x_t, h_{t-1}] + b_f) \quad (2)$$
$$o_t = \sigma(W_o[x_t, h_{t-1}] + b_o) \quad (3)$$
$$c_t = f_t * c_{t-1} + i_t * \tanh(W_c[x_t, h_{t-1}] + b_c) \quad (4)$$
$$h_t = o_t * \tanh(c_t) \quad (5)$$
where $i_t$, $f_t$, $o_t$ and $c_t$ denote the states of the input gate, the forget gate, the output gate and the memory cell at time t respectively, $h_t$ denotes the hidden state at time t, $\sigma$ denotes the sigmoid function, $W$ denotes a weight matrix, $b$ is a bias term, and $x_t$ is the input at time t.
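A minimal sketch of the BiLSTM encoder described by formulas (1) to (5), using PyTorch's built-in nn.LSTM; the hidden size of 128 and the sequence length of 150 are illustrative assumptions.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=768, hidden_size=128,
                 num_layers=1, batch_first=True, bidirectional=True)

char_vectors = torch.randn(1, 150, 768)   # e.g. BERT character vectors: (batch, seq_len, dim)
features, _ = bilstm(char_vectors)        # (1, 150, 256): forward and backward outputs concatenated
```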
4. IDCNN feature extraction model:
For sequence labeling, ordinary CNNs have the disadvantage that after convolution, a neuron at the edge obtains only a small part of the information in the input text, whereas for NER every character in the whole input sentence may affect the labeling of the current position, the so-called long-range dependency problem. To obtain context information, adding more convolution layers leads to deeper networks, more parameters, complex models and easy overfitting, while enlarging the convolution kernel to capture more context increases the amount of computation. The dilated convolutional neural network (DCNN) is a convolution operation that inserts one or more zeros into the convolution kernel, thereby adding holes (dilation) to the kernel; it has a larger receptive field, so features over a wider area can be extracted without increasing the number of parameters, as shown in FIG. 5. However, DCNN has the problem that the receptive field size of each convolution layer is fixed, and the context information extraction ability is insufficient when multiple DCNN layers are used for feature extraction. To solve this problem, the IDCNN (Iterated Dilated Convolutional Neural Network) model adopts the idea of iterated dilated convolution: the dilation rate is increased during convolution, so that the receptive field of the model keeps expanding and local and global features can be extracted more effectively. The IDCNN model is formed by concatenating 4 modules with the same structure; each module contains three DCNN layers with dilation rates of 1, 1 and 2, and each convolution layer further processes the output of the previous layer, so that the model can learn longer text sequence features and the accuracy of the model is improved. Note that a dilation rate of 1 is the same as an ordinary CNN, and a dilation rate of 2 is true dilated convolution, as shown in FIG. 6. Meanwhile, concatenating the modules improves the robustness of the model, so that it can better adapt to the variation of different input sequences. This embodiment uses the IDCNN model to extract glyph information, interaction knowledge between the glyphs of adjacent characters, and inter-character context-dependent information from the pictures of the characters.
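A minimal sketch of the IDCNN block structure described above (4 identical modules, each a three-layer dilated convolution with dilation rates 1, 1 and 2), written with 1-D convolutions in PyTorch; the filter count, kernel width and the way the module outputs are concatenated are assumptions for illustration.

```python
import torch
import torch.nn as nn

class IDCNN(nn.Module):
    def __init__(self, in_dim, filters=128, num_blocks=4, dilations=(1, 1, 2)):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, filters, kernel_size=3, padding=1)
        blocks = []
        for _ in range(num_blocks):
            layers = []
            for d in dilations:
                # padding = dilation keeps the sequence length unchanged for kernel size 3
                layers += [nn.Conv1d(filters, filters, kernel_size=3,
                                     padding=d, dilation=d), nn.ReLU()]
            blocks.append(nn.Sequential(*layers))
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):                  # x: (batch, seq_len, in_dim)
        x = self.proj(x.transpose(1, 2))   # Conv1d expects (batch, channels, seq_len)
        outputs = []
        for block in self.blocks:
            x = block(x)
            outputs.append(x)
        # concatenate the outputs of the 4 modules along the channel dimension
        return torch.cat(outputs, dim=1).transpose(1, 2)

glyph_vectors = torch.randn(1, 150, 64)     # e.g. glyph embeddings
glyph_features = IDCNN(64)(glyph_vectors)   # (1, 150, 512)
```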
5. Multi-head self-attention mechanism:
In cognitive neuroscience, attention is an indispensable and complex human cognitive function, referring to a person's ability to focus selectively on some information while ignoring other information. The attention mechanism imitates this cognitive function: it selects a small amount of useful information from a large amount of input information to focus on, ignores the rest, and allocates the limited information processing resources to the important parts. The attention mechanism can be abstracted as a model that, at a high level, maps a query to a series of key-value pairs <key, value>. The constituent elements of the source are regarded as a series of key-value pairs; given a query in the target, the similarity or correlation between the query and each key is computed to obtain the weight coefficient of the value corresponding to each key, the weights are normalized by softmax, and the values are then weighted and summed to obtain the final attention value. There are various rules for computing the similarity between a query and a key. The scaled dot-product attention used in this embodiment computes the similarity as the dot product of the two vectors divided by a scaling factor, which makes the attention weight distribution smoother and avoids values that are too large or too small after normalization, where $d_k$ is the key vector dimension. The calculation formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (6)$$
When Q = K = V, i.e., the self-attention mechanism, it is applied at the encoder side of the named entity recognition task to extract features so that the model assigns different weights according to the importance of the information, thereby paying more attention to the characters related to entities. FIG. 7 illustrates the working principle of the self-attention mechanism applied to the NER task. For an input sequence of length 150, each character has a corresponding feature vector $a_i$. Taking $a_1$ as an example, a query vector $q_1$, a key vector $k_1$ and a value vector $v_1$ are obtained by linear transformation; $q_1$ is dot-multiplied with the key vector corresponding to each character feature vector, divided by the scaling factor, and passed through a softmax operation to obtain the importance weight of each character's value vector for the current character; finally, a weighted sum gives the output $b_1$. The new feature vector contains important information about the entity, association information between characters, and so on.
The multi-head attention mechanism (Multi-Head Attention) lets the model learn different important features in different information subspaces based on the same attention mechanism, which improves the robustness of the model. Specifically, the input sequence is linearly transformed to obtain a set of query vectors, key vectors and value vectors, which are then split into multiple heads of equal dimension. As shown in FIG. 8, taking 2 heads as an example, the feature vector of a token in the input sequence is linearly transformed to obtain the query, key and value vectors, which are then split into multiple heads. In each head, the attention weights are computed and used to weight the value vectors to obtain the output vector of that head. Finally, the output vectors of all heads are concatenated and passed through a further linear transformation and a nonlinear transformation to obtain the final output vector. In addition, the multi-head attention mechanism is highly parallel and can effectively speed up the training and inference of the model.
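A minimal sketch of the multi-head self-attention fusion step with Q = K = V, using PyTorch's nn.MultiheadAttention; the fused dimension of 768, 16 heads and the 0.5 dropout on the attention weights are illustrative values drawn from the parameter experiments below.

```python
import torch
import torch.nn as nn

fused_dim = 768                                       # concatenated char/root/glyph feature dimension (assumed)
attn = nn.MultiheadAttention(embed_dim=fused_dim, num_heads=16,
                             dropout=0.5, batch_first=True)

spliced = torch.randn(1, 150, fused_dim)              # concatenated feature vectors from the encoding layer
fused, attn_weights = attn(spliced, spliced, spliced) # self-attention: Q = K = V
```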
The character feature vectors, root feature vectors and glyph feature vectors obtained from the second layer of the model are fused and input into the CRF layer to obtain the predicted sequence. Three fusion modes are considered herein.
Concatenation: this method is straightforward and computationally efficient, but may ignore the importance and relevance of the different features.
Concatenation + LSTM: the vectors are concatenated and input into an LSTM model to extract association information and context information.
Concatenation + multi-head self-attention: the vectors are concatenated and input into a multi-head self-attention layer for dynamic fusion.
This embodiment adopts the multi-head self-attention mechanism to fuse the feature vectors concatenated in the upper layer and extract key information; meanwhile, multiple heads learn and exploit different features, which enhances the ability of the feature vectors to express important information and improves recognition accuracy.
6. Conditional random field CRF:
Although the label class of each character can be obtained by adding a softmax layer after the upper neural network model, using this label class as the final result is not ideal, because the outputs of the softmax layer are independent of each other and do not consider the dependencies between tags or the constraints of the sequence tags, which may result in the beginning of some entities being wrongly labeled as "I", "E", and so on. To solve this problem, the relationships between adjacent tags must be considered. Therefore, the CRF (Conditional Random Field) model is taken as the last layer of the model of this embodiment; it considers the dependencies between tags, better captures the context information in the sequence, and improves the accuracy of sequence labeling. The main implementation is as follows: given an input sequence $X = x_1, x_2, x_3, \ldots, x_n$ and the corresponding predicted tag sequence $Y = y_1, y_2, y_3, \ldots, y_n$, the score of Y is calculated as in formula (7), where $P_{i, y_i}$ denotes the probability that the i-th character is mapped to tag $y_i$ and $A_{y_i, y_{i+1}}$ denotes the probability of transferring from tag $y_i$ to tag $y_{i+1}$:
$$\mathrm{score}(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}} \quad (7)$$
Then, the probability of the predicted tag sequence Y is obtained as in formula (8), where $\widetilde{Y}$ denotes a possible tag sequence and $Y_X$ denotes the set of all possible tag sequences:
$$P(Y \mid X) = \frac{e^{\mathrm{score}(X, Y)}}{\sum_{\widetilde{Y} \in Y_X} e^{\mathrm{score}(X, \widetilde{Y})}} \quad (8)$$
Finally, the optimal tag sequence Y corresponding to the maximum of the likelihood function $P(Y \mid X)$ is output, as in formula (9):
$$Y^{*} = \arg\max_{\widetilde{Y} \in Y_X} \mathrm{score}(X, \widetilde{Y}) \quad (9)$$
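A minimal sketch of CRF training and Viterbi decoding on top of the fused features, assuming the third-party pytorch-crf package rather than the exact implementation of this embodiment; the tag count and feature dimension are illustrative.

```python
import torch
import torch.nn as nn
from torchcrf import CRF

num_tags = 17                                  # e.g. BIOES tags for LOC/PER/ORG/GPE plus O
emission_layer = nn.Linear(768, num_tags)      # maps fused features to per-tag scores (formula (7)'s P)
crf = CRF(num_tags, batch_first=True)          # learns the transition scores A

fused = torch.randn(1, 150, 768)               # output of the multi-head self-attention layer
emissions = emission_layer(fused)              # (1, 150, num_tags)

tags = torch.zeros(1, 150, dtype=torch.long)   # gold tag sequence (dummy here)
loss = -crf(emissions, tags)                   # negative log-likelihood of formula (8) for training
best_path = crf.decode(emissions)              # Viterbi decoding of the optimal sequence, formula (9)
```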
7. Knowledge distillation:
Knowledge distillation is a model compression technique that aims to transfer the knowledge of a larger, more complex model (the teacher model) into a smaller, simpler model (the student model), so as to reduce model size and computational complexity while maintaining model performance. Specifically, knowledge distillation assists the training of the student model by using the output of the teacher model as "soft labels" during training. A "soft label" is a continuous probability distribution obtained by smoothing the original probability distribution. The goal of the student model is to fit, as accurately as possible, both the labels of the training data and the soft labels of the teacher model.
During training, the output of the original model (the teacher model) is typically passed through a softmax function to convert it into a probability distribution. However, the result of the softmax function easily becomes extreme: the confidence of one class is very high while the confidence of the other classes is very low. The positive-class information that the student model focuses on may still indicate the correct class, but the negative-class information about the other classes is also important. To solve this problem, a temperature parameter T can be introduced to scale the output of the teacher model and obtain a softer distribution, so that the student model learns more information.
The concrete implementation of knowledge distillation introduces two loss functions, namely the distillation loss $L_{dis}$ and the student loss $L_{stu}$. The distillation loss measures the difference between the teacher model output and the student model output; a cross-entropy loss or the KL divergence is commonly used, and the KL divergence is used as the distillation loss function in this embodiment. The student loss measures the difference between the student model output and the true labels. The two loss functions are weighted and summed to obtain the final loss function L, as shown in formula (10):
$$L = (1 - \alpha) L_{dis} + \alpha L_{stu} \quad (10)$$
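A minimal sketch of the combined loss in formula (10), computed here over per-token class logits as an illustration (the embodiment's student and teacher outputs come from the full model above); T = 1 and alpha = 0.22 follow the parameter study below, and the T-squared scaling of the KL term is a common convention rather than something stated in the text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_tags, T=1.0, alpha=0.22):
    # L_dis: KL divergence between the softened teacher and student distributions
    l_dis = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                     F.softmax(teacher_logits / T, dim=-1),
                     reduction="batchmean") * (T * T)
    # L_stu: cross entropy between the student predictions and the true labels
    l_stu = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                            gold_tags.view(-1))
    return (1 - alpha) * l_dis + alpha * l_stu           # formula (10)

student_logits = torch.randn(1, 150, 17)
teacher_logits = torch.randn(1, 150, 17)
gold_tags = torch.zeros(1, 150, dtype=torch.long)
loss = distillation_loss(student_logits, teacher_logits, gold_tags)
```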
This example is further analyzed by experiments as follows:
1. Data set and labeling scheme:
to verify the robustness of the method of this example, the experiment was performed on two NER datasets, weibo and ontnot 4.0. Weibo is generated by screening and filtering historical data between 2013, 11 and 2014, 12 of new wave microblogs, and contains 1890 microblog messages. The OntNotes4.0 dataset was extracted from the news text, containing 24371 pieces of text. The labeling entities of the two data sets are four categories of place name LOC, person name PER, organization name ORG and geopolitical name GPE, the entity labeling schemes of the two data sets are different, the two data sets are preprocessed by the embodiment, the BIOES labeling method is adopted, B-represents the beginning of the entity, I-represents the middle of the entity, O-represents the non-entity part, E-represents the end of the entity, and S-represents the single-word entity.
Table 1 dataset information
2. Evaluation metrics:
In this embodiment, the precision P, recall R and F1 score are used to evaluate model performance, where TP is the number of entities correctly recognized by the model, FP is the number of irrelevant entities recognized by the model, and FN is the number of entities labeled in the dataset but not recognized by the model.
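The standard definitions of these metrics, consistent with the description above, are:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$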
3. Setting an experimental environment:
In this embodiment, the neural network model is built on the PyTorch framework, and the specific experimental environment is shown in Table 2.
Table 2 experimental environment settings
4. Determination of experimental parameters:
the model parameters of the present embodiment include parameters of the BERT model, the BiLSTM model, the IDCNN model, and the CRF model. The BERT-base-Chinese model used by the BERT pre-training model is initialized, the number of bidirectional transformers is 12, the dimension of a Transformer hidden layer is 768, the number of attention mechanism heads is 12, and the random inactivation rate is 0.1. Other parameter settings are shown in table 3.
Table 3 experimental parameter settings
(1) Setting of the root vector dimension of the embedding layer:
and converting the root code vector obtained by data preprocessing into vectorized expression of root information through an embedding layer, and inputting a feature extraction model to capture the context features. In this embodiment, experiments are performed on the setting of the root vector dimension, and the experimental results are shown in tables 4 and 5. The dimension of the root vector is the representing dimension of the root corresponding to each word in the vector space. The choice of root vector dimensions affects model performance. From the experimental results, when the dimension is 32, the expression capability of the root vector obtained by the embedding layer is insufficient, the root information contained in the vector is insufficient, the F1 value is not high, when the dimension is increased to 64, the F1 value is improved, but when the dimension is further increased to 128, the parameter of the model is greatly increased by the high-dimension word vector, and the model training is fitted.
Table 4 results on Weibo dataset
Table 5 results on OntoNotes 4.0 dataset
(2) Multi-head self-attention mechanism dropout setting:
In the multi-head self-attention mechanism, since different heads may focus on different aspects of the input data, the final concatenation of the outputs of the individual heads can be regarded as an "aggregation" of these different aspects, giving a more comprehensive understanding of the input data. To increase the robustness and generalization ability of the model, this embodiment applies a random dropout operation to the attention weight distribution, i.e., randomly resets some attention weights to 0. This makes the model pay attention to the entire input data and avoids over-reliance on the output of a single head. In this embodiment, experiments were conducted on the dropout setting, and the results are shown in FIG. 9. The experimental results show that on both experimental datasets the model performs well with a dropout rate of 0.5, which effectively prevents overfitting.
(3) Multi-head self-attention mechanism multi-head number setting:
To improve the generalization of the model, a multi-head attention mechanism is adopted so that the model learns features of different aspects in each head. A smaller number of heads reduces the number of parameters and the computational complexity of the model but provides insufficient expressive power; a larger number of heads can enhance the expressive power of the model but increases its computational complexity. To avoid a significant increase in computational cost and parameter count, assume the input vector dimension of the multi-head self-attention layer is p and the number of heads is h; then in each head $p_q = p_k = p_v = p / h$, i.e., the output dimension of the linear transformations of the query, key and value is set to p / h, and the heads can be computed in parallel, so it should be noted that the input vector dimension p must be divisible by the number of heads chosen for the experiment. In this embodiment, head-number selection experiments were conducted on the Weibo and OntoNotes 4.0 datasets, and the results are shown in Tables 6 and 7: the best effect is obtained with 16 heads on the Weibo dataset and with 12 heads on the OntoNotes 4.0 dataset.
Table 6 results on Weibo dataset
Table 7 results on OntoNotes 4.0 dataset
(4) Knowledge distillation parameter settings:
In knowledge distillation, the temperature parameter controls the smoothness of the soft labels: the higher the temperature, the smoother the soft label distribution and the more information the student model learns, but a temperature that is too high interferes with the student model and makes it difficult to fit the true class distribution accurately. The weight coefficient in the knowledge distillation loss function controls the proportion of the teacher model output in the loss function and is generally no greater than 0.5; the larger the weight coefficient, the more the student model values the information distilled from the teacher model, and the smaller it is, the more the student model values its own prediction. In this embodiment, experiments were conducted on the setting of these two parameters. The results show that without high-temperature distillation, i.e., when T = 1, the class prediction distributions of the teacher and student models are smooth and the information passed to the student model achieves the best effect, and the F1 value of the model is highest when the weight coefficients on the Weibo and OntoNotes 4.0 datasets are 0.22 and 0.18, respectively.
Table 8 results on Weibo dataset
Table 9 results on OntoNotes 4.0 dataset
5. Analysis of experimental results:
(1) Performance of different fusion strategies:
Three fusion modes are provided for fusing the word feature vectors, root feature vectors and font feature vectors output by the second layer of the model, and a comparison experiment is carried out; the results are shown in Tables 10 and 11. On both datasets, the best effect is obtained by the fusion mode that combines splicing with a multi-head self-attention mechanism (a sketch of this concatenate-then-attend fusion is given after Table 11). Introducing too much external information adds noise and harms the recognition effect; simple splicing does not consider the correlation and relative importance of the features; and a unidirectional LSTM neglects the importance of a large amount of feature information and has insufficient ability to extract context-related information. A multi-head self-attention mechanism, by contrast, can learn and extract key features and inter-character dependency information from a large number of features, and its multiple heads extract features of different aspects, which further strengthens the semantic expressiveness of the vectors and improves the recognition effect of the model.
Table 10 results on Weibo dataset
Table 11 results on OntoNotes 4.0 dataset
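A minimal PyTorch sketch of the concatenate-then-attend fusion strategy described above: three feature sequences are concatenated along the feature dimension and passed through a multi-head self-attention layer. The use of torch.nn.MultiheadAttention and the tensor sizes are illustrative assumptions, not the exact implementation of the embodiment.

import torch
import torch.nn as nn

class ConcatAttentionFusion(nn.Module):
    """Concatenate word, root and font features, then fuse them with multi-head self-attention."""
    def __init__(self, word_dim, root_dim, font_dim, num_heads):
        super().__init__()
        fused_dim = word_dim + root_dim + font_dim        # must be divisible by num_heads
        self.attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)

    def forward(self, word_feat, root_feat, font_feat):
        x = torch.cat([word_feat, root_feat, font_feat], dim=-1)   # (batch, seq, fused_dim)
        fused, _ = self.attn(x, x, x)                              # self-attention: Q = K = V = x
        return fused

# Illustrative dimensions: 768-d word, 128-d root and 128-d font features, 16 heads (1024 / 16 = 64).
fusion = ConcatAttentionFusion(768, 128, 128, 16)
out = fusion(torch.randn(2, 10, 768), torch.randn(2, 10, 128), torch.randn(2, 10, 128))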
(2) Performance of the different methods:
to verify the effectiveness of the method proposed in this example, the effect of adding each method on the accuracy, recall and F1 values was tested in the same environment, and the test results are shown in tables 12 and 13.
Adding the root and the font external information separately to the baseline model each greatly improves the recognition effect, because these intrinsic features of Chinese characters contain a large amount of semantic information that helps entity recognition. To combine the improvements contributed by the two kinds of external information, a multi-head self-attention mechanism is used: on the Weibo dataset the recall and F1 value improve obviously, but the excess of feature information may cause overfitting and the accuracy decreases; on the OntoNotes 4.0 dataset both the accuracy and the F1 value improve, which demonstrates the effectiveness of the method. To further improve the recognition effect, a knowledge-distillation method is introduced in which the output of the teacher model guides the training of the student model: on the Weibo dataset the recall and F1 value continue to improve to a certain extent while the accuracy again decreases, indicating that overfitting may cause some incorrect entity predictions; on the OntoNotes 4.0 dataset an obvious improvement in recall and F1 value is observed, with a slight decrease in accuracy. Overall, the best experimental model exceeds the F1 value of the baseline model by 9.92% and 1.31% on the Weibo and OntoNotes 4.0 datasets, respectively.
Table 12 results on Weibo dataset
Table 13 results on OntoNotes 4.0 dataset
This embodiment provides a Chinese named entity recognition method based on multi-feature embedding. It combines the root and font features of Chinese characters with the general BERT-BiLSTM-CRF model, uses BiLSTM and IDCNN to extract the root and font features respectively, and applies a multi-head self-attention mechanism to fuse the multiple feature embedding vectors, extracting key features and their correlations and strengthening the semantic expressiveness of the feature vectors. Experiments show that the method achieves obvious improvements in accuracy, recall and F1 value on the two experimental datasets. To further improve model performance, a knowledge-distillation method is introduced in which the teacher model and the student model are set to the same model and the class-prediction information of the teacher model guides the training of the student model; the experimental results show that this method effectively improves the entity-recognition performance of the model, raising the recall and F1 value.
The foregoing examples merely illustrate specific embodiments of the invention; they are described in detail but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention.

Claims (10)

1. A Chinese named entity recognition method based on multi-feature embedding is characterized by comprising the following steps:
step 1, extracting word vectors containing rich context information by using a BERT model, and processing to obtain word root embedded vectors and font embedded vectors;
step 2, extracting features of the word vectors and the root embedded vectors by utilizing a bi-directional long-short-term memory network BiLSTM, extracting features of the font embedded vectors by using an iterative expansion convolutional neural network IDCNN, and splicing the three feature vectors;
step 3, inputting the three spliced feature vectors into a multi-head self-attention mechanism layer, and dynamically fusing the spliced vectors to extract key features;
and step 4, performing labeling decoding on the sequence by using a conditional random field CRF.
2. The method for identifying Chinese named entities based on multi-feature embedding according to claim 1, further comprising the following step:
step 5, setting a teacher model and a student model as the same model according to a knowledge distillation method, and guiding the training of the student model by using the output probability distribution of the teacher model.
3. The method for identifying Chinese named entities based on multi-feature embedding according to claim 1 or 2, wherein in step 1, the root embedding vector is obtained by processing as follows:
A plurality of Chinese character-to-root mappings are obtained by crawling, and pictures corresponding to the characters are stored; for the encoding of root information, the root set corresponding to each Chinese character in the Chinese character-root table is traversed, a unique index ID is assigned to each root, and a root table is constructed; if a character in the text is among these Chinese characters, the corresponding root set is found according to the Chinese character-root table, each root in the set is traversed, its index ID is looked up in the root table, and the root vector is constructed; the root vector length of each character is set to 9, a padding label is set, and if a character has fewer than 9 roots or is not found among the Chinese characters, the root vector is padded with the index ID corresponding to the padding label.
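A minimal Python sketch of the root-vector construction described in this claim, assuming a hypothetical char_to_roots mapping and a fixed length of 9; the identifiers and the sample decompositions are illustrative assumptions.

# Hypothetical Chinese character-to-root mapping obtained by crawling (illustrative sample).
char_to_roots = {"湖": ["氵", "古", "月"], "南": ["十", "冂", "丷", "干"]}

# Build the root table: assign a unique index ID to every root, reserving 0 for the padding label.
root_table = {"<pad>": 0}
for roots in char_to_roots.values():
    for root in roots:
        root_table.setdefault(root, len(root_table))

MAX_ROOTS = 9  # fixed root-vector length per character

def root_vector(char: str) -> list[int]:
    """Return the length-9 root index vector of a character, padded with the padding ID."""
    ids = [root_table[r] for r in char_to_roots.get(char, [])]
    return (ids + [root_table["<pad>"]] * MAX_ROOTS)[:MAX_ROOTS]

print(root_vector("湖"))  # e.g. [1, 2, 3, 0, 0, 0, 0, 0, 0]
print(root_vector("A"))   # character not in the table -> all padding IDs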
4. The method for identifying Chinese named entities based on multi-feature embedding according to claim 3, wherein in step 1, the BERT model adopts a bidirectional Transformer encoder formed by stacking a plurality of Transformer encoder layers and takes all the word information in the context into account when processing text, so that finer semantic information can be captured; the BERT model adopts a masked language model task and a next-sentence prediction task: in the masked language model task, the BERT model randomly masks some words in the text and then predicts the masked words from the context; in the next-sentence prediction task, the BERT model predicts whether two sentences are adjacent.
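A minimal sketch of extracting contextual character vectors with a pre-trained BERT model via the Hugging Face transformers library; the checkpoint name bert-base-chinese and the use of this particular library are assumptions for illustration, not necessarily the setup of the embodiment.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

sentence = "湖南中医药大学位于长沙"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual vector for every token (including [CLS] and [SEP]), shape (1, seq_len, 768).
char_vectors = outputs.last_hidden_state
print(char_vectors.shape)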
5. The method for identifying Chinese named entities based on multi-feature embedding according to claim 4, wherein the bi-directional long-short-term memory network BiLSTM comprises two long short-term memory networks LSTM in opposite directions, which respectively model the forward and reverse directions of the sequence, obtain the context information of each position in the sequence, and combine the outputs of the two directions to obtain a more comprehensive feature representation;
the long short-term memory network LSTM uses an input gate, a forget gate and an output gate to control the flow and storage of information: the input gate controls the degree to which the current input information influences the memory unit, the forget gate controls the degree to which the historical information influences the memory unit, and the output gate controls the degree to which the memory unit influences the current output; the input gate, forget gate, output gate and memory unit are updated as follows:
$$i_t = \sigma(W_i[x_t, h_{t-1}] + b_i)$$
$$f_t = \sigma(W_f[x_t, h_{t-1}] + b_f)$$
$$o_t = \sigma(W_o[x_t, h_{t-1}] + b_o)$$
$$c_t = f_t * c_{t-1} + i_t * \tanh(W_c[x_t, h_{t-1}] + b_c)$$
$$h_t = o_t * \tanh(c_t)$$
wherein $i_t$, $f_t$, $o_t$ and $c_t$ respectively represent the states of the input gate, the forget gate, the output gate and the memory unit at time $t$, $h_t$ represents the hidden-layer state at time $t$, $\sigma$ denotes the sigmoid function, $W$ denotes a weight matrix, $b$ is a bias term, and $x_t$ is the input information at time $t$.
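A minimal PyTorch sketch of a single LSTM cell step implementing the gate equations above; the dimensions and random parameters are illustrative assumptions.

import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.
    W maps the concatenation [x_t, h_prev] to the stacked pre-activations of i, f, o and the candidate cell."""
    z = torch.cat([x_t, h_prev], dim=-1) @ W + b              # (batch, 4 * hidden)
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    c_t = f * c_prev + i * torch.tanh(g)                      # memory-unit update
    h_t = o * torch.tanh(c_t)                                 # hidden state
    return h_t, c_t

# Illustrative sizes: input dim 16, hidden dim 32.
x_t, h0, c0 = torch.randn(1, 16), torch.zeros(1, 32), torch.zeros(1, 32)
W, b = torch.randn(48, 128), torch.zeros(128)                 # (16 + 32) -> 4 * 32
h1, c1 = lstm_step(x_t, h0, c0, W, b)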
6. The method for identifying Chinese named entities based on multi-feature embedding according to claim 5, wherein in step 2, the iterative expansion convolutional neural network IDCNN is formed by splicing together 4 modules with the same structure, each module being a three-layer dilated convolution DCNN with dilation (hole) widths of 1 and 2.
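A minimal PyTorch sketch of an IDCNN of the kind described in this claim: four identical modules, each a three-layer 1-D dilated convolution. The channel sizes, kernel size and the exact dilation pattern (1, 1, 2) are illustrative assumptions.

import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Three stacked 1-D dilated convolutions; the dilation pattern (1, 1, 2) is assumed."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 1, 2)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        )

    def forward(self, x):                      # x: (batch, channels, seq_len)
        for conv in self.layers:
            x = torch.relu(conv(x))
        return x

class IDCNN(nn.Module):
    """Four structurally identical dilated blocks applied in sequence."""
    def __init__(self, channels, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(DilatedBlock(channels) for _ in range(num_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

out = IDCNN(channels=128)(torch.randn(2, 128, 10))   # illustrative shapes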
7. The method according to claim 6, wherein in step 3, scaled dot-product attention is used; attention is a method of mapping a query Q and key-value pairs K-V to an output, wherein Q, K and V are vectors whose represented information differs according to the task; the output is obtained by a weighted sum of V, the weights being the similarity between Q and K, which is calculated as the dot product of the two vectors divided by a scaling factor $\sqrt{d_k}$, making the attention weight distribution more gentle and avoiding values that are too large or too small after normalization. The calculation formula is as follows, wherein $d_k$ is the key vector dimension and $T$ denotes the transpose:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
when Q = K = V, the three vectors all represent the feature information of the text sequence, which is the self-attention mechanism; it is applied at the encoder end of the named entity recognition task to extract features, so that the model assigns different weights according to the importance of the information and thus pays more attention to the characters related to entities;
the multi-head self-attention mechanism layer applies linear transformations to the feature vector of each word in the input sequence to obtain query, key and value vectors, which are then divided into a plurality of heads; in each head, the attention weights are calculated separately and used to weight the value vectors, giving the output vector of that head; finally, the output vectors of all heads are concatenated and passed through a further linear transformation and a nonlinear transformation to obtain the final output vector.
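A minimal from-scratch sketch of the multi-head self-attention computation described in this claim (linear projections, per-head scaled dot-product attention, concatenation, final linear and nonlinear transformation); the module name and dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d_k = num_heads, dim // num_heads
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (batch, seq, dim)
        b, n, _ = x.shape
        def split(t):                                        # -> (batch, heads, seq, d_k)
            return t.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # scaled dot-product per head
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v                                  # per-head output vectors
        concat = heads.transpose(1, 2).reshape(b, n, -1)     # concatenate the heads
        return torch.relu(self.out_proj(concat))             # linear + nonlinear transformation

out = MultiHeadSelfAttention(dim=1024, num_heads=16)(torch.randn(2, 10, 1024))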
8. The method for identifying Chinese named entities based on multi-feature embedding of claim 7, wherein the step 4 is specifically as follows:
given an input sequence $X = x_1, x_2, x_3, \cdots, x_n$ and a corresponding predicted tag sequence $Y = y_1, y_2, y_3, \cdots, y_n$, the score of $Y$ is calculated as follows:

$$\mathrm{score}(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}}$$

wherein $P_{i, y_i}$ represents the probability that the $i$-th word is mapped to tag $y_i$, and $A_{y_i, y_{i+1}}$ represents the probability of transferring from tag $y_i$ to tag $y_{i+1}$;
then the probability of the predicted tag sequence $Y$ is obtained; the calculation formula is as follows:

$$P(Y \mid X) = \frac{\exp\big(\mathrm{score}(X, Y)\big)}{\sum_{\widetilde{Y} \in Y_X} \exp\big(\mathrm{score}(X, \widetilde{Y})\big)}$$

wherein $\widetilde{Y}$ represents a possible annotation sequence and $Y_X$ represents the set of all possible annotation sequences;
and finally the optimal labeling sequence $Y^{*}$, which maximizes the likelihood $P(Y \mid X)$, is output as follows:

$$Y^{*} = \arg\max_{\widetilde{Y} \in Y_X} \mathrm{score}(X, \widetilde{Y})$$
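A minimal sketch of the CRF sequence score defined above, given emission scores P and transition scores A; a full Viterbi decoder is omitted, and the tensor names and sizes are illustrative assumptions.

import torch

def crf_score(emissions, transitions, tags):
    """score(X, Y) = sum of emission scores P[i, y_i] plus transition scores A[y_i, y_{i+1}]."""
    n = emissions.size(0)                                   # sequence length
    emit = emissions[torch.arange(n), tags].sum()
    trans = transitions[tags[:-1], tags[1:]].sum()
    return emit + trans

# Illustrative example: a 5-token sentence with 4 possible tags.
emissions = torch.randn(5, 4)      # P: per-token tag scores from the encoder
transitions = torch.randn(4, 4)    # A: tag-to-tag transition scores
tags = torch.tensor([0, 1, 1, 2, 3])
print(crf_score(emissions, transitions, tags))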
9. The method for identifying Chinese named entities based on multi-feature embedding according to claim 2, wherein the step 5 is specifically as follows:
two loss functions are introduced into the knowledge distillation method, namely the distillation loss $L_{dis}$ and the student loss $L_{stu}$; the KL divergence is used as the distillation loss function, while the student loss measures the difference between the output of the student model and the real labels; the two loss functions are weighted and summed to obtain the final loss function $L$, calculated by the following formula, wherein $\alpha$ is a hyperparameter used to adjust the relative weight of the two loss functions:
$$L = (1 - \alpha)L_{dis} + \alpha L_{stu}$$
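A minimal PyTorch sketch of this weighted knowledge-distillation loss, combining a KL-divergence distillation term on temperature-softened distributions with a cross-entropy student term; the temperature handling and the function name are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.22, temperature=1.0):
    """L = (1 - alpha) * L_dis + alpha * L_stu, with KL divergence as the distillation loss."""
    t = temperature
    l_dis = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                     F.softmax(teacher_logits / t, dim=-1),
                     reduction="batchmean") * (t * t)
    l_stu = F.cross_entropy(student_logits, labels)          # difference from the real labels
    return (1 - alpha) * l_dis + alpha * l_stu

# Illustrative example: 8 samples, 5 classes.
s, te, y = torch.randn(8, 5), torch.randn(8, 5), torch.randint(0, 5, (8,))
print(distillation_loss(s, te, y))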
10. A Chinese named entity recognition system based on multi-feature embedding, characterized by comprising four layers: the first layer is an input layer, in which word vectors containing rich semantic features are obtained through a BERT model and root embedded vectors and font embedded vectors are obtained through processing; the second layer is a coding layer of a bi-directional long-short-term memory network BiLSTM and an iterative expansion convolutional neural network IDCNN, divided into two branches: one branch uses the bi-directional long-short-term memory network BiLSTM to extract features from the word vectors and the root embedded vectors respectively, the other branch uses the iterative expansion convolutional neural network IDCNN to extract features from the font embedded vectors, and the three output vectors of the second layer are then spliced; the third layer is a multi-head self-attention mechanism layer, which dynamically fuses the spliced vectors to extract key features; and the fourth layer is a conditional random field CRF decoding layer, which decodes the feature vectors to obtain the final labeling sequence.
CN202310911373.3A 2023-07-24 2023-07-24 Chinese named entity recognition method and system based on multi-feature embedding Pending CN117217223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310911373.3A CN117217223A (en) 2023-07-24 2023-07-24 Chinese named entity recognition method and system based on multi-feature embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310911373.3A CN117217223A (en) 2023-07-24 2023-07-24 Chinese named entity recognition method and system based on multi-feature embedding

Publications (1)

Publication Number Publication Date
CN117217223A true CN117217223A (en) 2023-12-12

Family

ID=89041388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310911373.3A Pending CN117217223A (en) 2023-07-24 2023-07-24 Chinese named entity recognition method and system based on multi-feature embedding

Country Status (1)

Country Link
CN (1) CN117217223A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117807235A (en) * 2024-01-17 2024-04-02 长春大学 Text classification method based on model internal feature distillation
CN117807235B (en) * 2024-01-17 2024-05-10 长春大学 Text classification method based on model internal feature distillation
CN117708336A (en) * 2024-02-05 2024-03-15 南京邮电大学 Multi-strategy emotion analysis method based on theme enhancement and knowledge distillation
CN117708336B (en) * 2024-02-05 2024-04-19 南京邮电大学 Multi-strategy emotion analysis method based on theme enhancement and knowledge distillation

Similar Documents

Publication Publication Date Title
CN110210037B (en) Syndrome-oriented medical field category detection method
CN117217223A (en) Chinese named entity recognition method and system based on multi-feature embedding
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN111985205A (en) Aspect level emotion classification model
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN111984791B (en) Attention mechanism-based long text classification method
CN111476024A (en) Text word segmentation method and device and model training method
CN111666752A (en) Circuit teaching material entity relation extraction method based on keyword attention mechanism
CN114254645A (en) Artificial intelligence auxiliary writing system
CN115879546A (en) Method and system for constructing composite neural network psychology medicine knowledge map
CN113220865B (en) Text similar vocabulary retrieval method, system, medium and electronic equipment
CN114742069A (en) Code similarity detection method and device
Okur et al. Pretrained neural models for turkish text classification
CN116680407A (en) Knowledge graph construction method and device
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN115169429A (en) Lightweight aspect-level text emotion analysis method
CN115238696A (en) Chinese named entity recognition method, electronic equipment and storage medium
Zhang et al. Information block multi-head subspace based long short-term memory networks for sentiment analysis
CN114444467A (en) Traditional Chinese medicine literature content analysis method and device
Lun et al. Research on agricultural named entity recognition based on pre train BERT
Sun et al. Text sentiment polarity classification method based on word embedding
Yang et al. Applications research of machine learning algorithm in translation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination