CN113204970A - BERT-BiLSTM-CRF named entity detection model and device - Google Patents

BERT-BiLSTM-CRF named entity detection model and device

Info

Publication number
CN113204970A
Authority
CN
China
Prior art keywords
crf
bert
layer
model
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110631994.7A
Other languages
Chinese (zh)
Inventor
彭涛
王上
姚田龙
包铁
张雪松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202110631994.7A priority Critical patent/CN113204970A/en
Publication of CN113204970A publication Critical patent/CN113204970A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a BERT-BiLSTM-CRF named entity detection model, belonging to the technical field of named entity recognition, comprising an IDCNN-CRF named entity recognition model and a BERT-BiLSTM-CRF named entity recognition model, structured as follows: the Embedding layer is a word vector layer used for processing the input data into word vectors and feeding them into the model, with Word2Vec adopted as the distributed vector representation; and the IDCNN layer receives the character or word vectors produced by the embedding layer and recomputes the input vectors through the dilated convolution operation of the dilated convolutional neural network to obtain a new vector representation. With the BiLSTM-CRF model as the baseline, the BERT-BiLSTM-CRF named entity detection model and device construct an IDCNN-CRF model and a BERT-BiLSTM-CRF model on the People's Daily dataset annotated by Peking University and the MSRA named entity recognition dataset from Microsoft Research Asia, improving the accuracy and running efficiency of named entity recognition and shortening the model training time.

Description

BERT-BiLSTM-CRF named entity detection model and device
Technical Field
The invention relates to the technical field of named entity recognition, in particular to a BERT-BiLSTM-CRF named entity detection model and device.
Background
The current named entity recognition task mainly captures words or phrases in the input text and classifies them. Many methods for the named entity recognition task have been proposed, and they can generally be grouped into three categories: the first recognizes entities in text with rule-based methods; the second performs named entity recognition with feature engineering based on traditional statistical machine learning; and the third performs named entity recognition by automatically extracting features from the text with deep learning.
For Chinese named entity recognition, rule-based methods depend heavily on the structure of the rules, transfer poorly, and are costly to maintain; statistical learning methods rely on feature engineering, which consumes a great deal of labor; deep learning methods extract features automatically, removing the manual feature-design step, and deep models keep improving as computer hardware improves, so named entity recognition based on deep learning has important research value.
With the explosive growth of data, how to process massive data and extract useful information has become a pressing problem, and named entity recognition technology can automatically extract key entity information from massive text data. To address this problem, the invention implements a BERT-BiLSTM-CRF named entity detection model and device.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
To solve the above technical problem, according to an aspect of the present invention, the present invention provides the following technical solutions:
A BERT-BiLSTM-CRF named entity detection model comprises an IDCNN-CRF named entity recognition model and a BERT-BiLSTM-CRF named entity recognition model:
the IDCNN-CRF named entity recognition model architecture is as follows:
the Embedding layer is a word vector layer used for processing the input data into word vectors and feeding them into the model, with Word2Vec adopted as the distributed vector representation;
the IDCNN layer receives the character or word vectors produced by the embedding layer and recomputes the input vectors through the dilated convolution operation of the dilated convolutional neural network to obtain a new vector representation;
the projection layer performs a linear transformation on the vector representations computed by the IDCNN layer, mapping them to a dimension equal to the number of tags, and then applies Softmax normalization to obtain a probability p; if the mapped representation has m dimensions, each dimension can be regarded as the probability of one tag class, and selecting the class with the maximum probability yields the classification result, completing the named entity recognition task;
the CRF layer screens out the optimal result through the transition matrix and feeds it back to the user;
the BERT-BiLSTM-CRF named entity recognition model is structured as follows:
the BERT layer takes as input sentences composed of single characters; after BERT processes the text sequence and obtains a vector representation for each character, this output serves as the input of the following BiLSTM layer;
and in the BiLSTM-CRF layer, the BERT-processed text sequence yields the corresponding BERT pre-trained vector representations, which enter the BiLSTM units; the BiLSTM output is computed and sent to the CRF, and the optimal sequence of labels is calculated.
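For illustration, a minimal PyTorch-style sketch of the BERT to BiLSTM to emission-score portion of this pipeline is given below; the module names, sizes, and the use of the HuggingFace transformers BertModel are assumptions, not the patent's own implementation, and the CRF layer is omitted here:

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumed: a pre-trained Chinese BERT such as "bert-base-chinese"

class BertBiLstmTagger(nn.Module):
    """BERT encoder -> BiLSTM -> per-label emission scores (CRF decoding omitted)."""

    def __init__(self, num_labels: int, lstm_hidden: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,  # 768 for BERT-Base
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,                        # forward + backward LSTM
        )
        # project the concatenated forward/backward states to per-label emission scores
        self.emission = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # one context vector per input character from BERT
        chars = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(chars)               # (batch, seq_len, 2 * lstm_hidden)
        return self.emission(lstm_out)                 # emission scores, to be decoded by a CRF layer
```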
As a preferred scheme of the BERT-BiLSTM-CRF named entity detection model of the invention: the Embedding layer captures the dependency relationships between preceding and following characters by training on a large-scale corpus, and feeds pre-trained 100-dimensional Wikipedia word vectors together with 20-dimensional word-segmentation features as input into the next layer.
As a preferred scheme of the BERT-BiLSTM-CRF named entity detection model of the invention: the CRF layer combines the results obtained by deep learning with a statistical learning model; the CRF maintains a matrix of label-to-label transition probabilities, the m-dimensional label space is expanded to (m+2) × (m+2) with the two extra dimensions representing the start and end states, and invalid labels are corrected by learning the rules of label transition from the changes of these two parameters.
As a preferred scheme of the BERT-BiLSTM-CRF named entity detection model of the invention: in the BERT layer the beginning of a sentence is marked with [CLS] and the end of a sentence (and the separator between sentences) with [SEP]; the input of BERT is formed by combining a token (word) vector, a segment vector and a position vector.
As a preferred scheme of the BERT-BiLSTM-CRF named entity detection model of the invention: in the BiLSTM-CRF layer, the forward LSTM of the BiLSTM computes the semantic representation of the current word and the words to its left, the backward LSTM computes the semantic representation of the current word and the words to its right, and the two hidden-layer state representations are concatenated to obtain the output of the BiLSTM.
As a preferred scheme of the BERT-BiLSTM-CRF named entity detection model of the invention: the main formulas implemented by the algorithm are as follows:
[Four formula images in the original publication; not reproduced here.]
a BERT-BilSTM-CRF named entity detection device comprises:
the information extraction module is used for extracting entity information and semantic relations between entities;
the information extraction module is connected with the information retrieval module and is used for screening out information related to the keywords through the query of the keywords, identifying the entity type of the keywords by using a named entity, classifying the text information and reducing the retrieval range;
the information retrieval module is connected with the machine translation module and used for identifying entity information of a translation target and analyzing the lexical method by using a translation rule;
and the machine translation module is connected with a question-answering system, and the question-answering system searches answers of the questions by matching the relation between the keywords and the entities and feeds back the result output to the user.
Compared with the prior art: with the BERT-BiLSTM-CRF named entity detection model and device, the F1 value of the constructed IDCNN-CRF named entity recognition model is 10.4% and 11.41% higher than that of the baseline CRF model on the People's Daily dataset and the MSRA dataset respectively, and 0.38% and 2.07% higher than that of the BiLSTM-CRF model, while training time is shortened by nearly 30%, markedly improving running efficiency; the F1 value of the constructed BERT-BiLSTM-CRF model for Chinese named entity recognition is 5.29% and 6.7% higher than the BiLSTM-CRF model on the two datasets, and 4.91% and 4.63% higher than the IDCNN-CRF model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the present invention will be described in detail with reference to the accompanying drawings and detailed embodiments, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise. Wherein:
FIG. 1 is a diagram of IDCNN-CRF model architecture according to the present invention;
FIG. 2 is a view of the internal structure of the LSTM of the present invention;
FIG. 3 is a BiLSTM model architecture diagram of the present invention;
FIG. 4 is a schematic diagram of a conditional random field of a linear chain according to the present invention;
FIG. 5 is a representation of the BERT input of the present invention;
FIG. 6 is a diagram of IDCNN convolution operations according to the present invention;
FIG. 7 is a graph comparing accuracy for different models of the present invention;
FIG. 8 is a graph comparing recall in different models of the present invention;
FIG. 9 is a graph comparing F1 values for different models of the present invention;
FIG. 10 is a comparison graph of the experimental results on the People's Daily dataset according to the present invention;
FIG. 11 is a graph comparing the results of the MSRA experiments of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described herein, and it will be apparent to those of ordinary skill in the art that the present invention may be practiced without departing from the spirit and scope of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The basic algorithm adopted by the invention is as follows:
one, long and short term memory network (LSTM)
A long short-term memory network is a type of neural network with memory, and it therefore performs very well on sequence problems and natural language processing. The emergence of the LSTM solved, to a certain extent, the vanishing-gradient and exploding-gradient problems of recurrent neural networks. The LSTM network is itself a chain-like recurrent structure over sequences, so it can handle long-range dependencies.
The LSTM solves the short-term-memory problem because it introduces a core structure, here called the memory cell c_t. In addition, the LSTM has three control gates: the input gate i_t, the output gate o_t and the forget gate f_t. The input gate controls which information can enter the current network state, and the output gate controls which information is emitted as the current output. In the LSTM, the most important gate is the forget gate: it decides which earlier memory is removed and which current information is retained.
An internal state c_t is added to the LSTM structure: on the one hand it transmits information linearly, and on the other hand it outputs information non-linearly to the external state h_t of the hidden layer, as shown in equations (1) and (2):

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (1)

h_t = o_t ⊙ tanh(c_t)    (2)
where f_t, i_t and o_t are the three gates, which mainly regulate the transmission of information; c_{t-1} is the memory cell at the previous moment; and through the internal state c_t the LSTM network can, at each time t, access the history information accumulated from the start of the sequence up to the current time.
In the LSTM neural network, a gating mechanism is introduced into the memory cell c to control the transmission of information, as shown in fig. 1. The three gates in formulas (3), (4) and (5) are the input gate i_t, the forget gate f_t and the output gate o_t.
The gates in the LSTM take values in the range (0, 1), which means that information is passed through in a certain proportion. The roles of the three gates in the LSTM network are: the forget gate f_t controls how much information of the internal state at the previous moment needs to be forgotten; the input gate i_t controls how much information of the candidate state at the current moment needs to be stored; and the output gate o_t controls how much information of the current internal state c_t is output to the external state h_t. The three gates are computed as follows:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (3)

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (4)

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)    (5)

where σ(·) is the logistic function with range (0, 1), the W are weight matrices, x_t is the input at the current time, and h_{t-1} is the external state at the previous time. When f_t = 0 and i_t = 1, the history information in the memory cell is cleared and only the candidate state vector is written in. When f_t = 1 and i_t = 0, the memory cell keeps the history information and does not write in new information.
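A minimal NumPy sketch of one LSTM time step implementing equations (1) to (5) above; the dimensions and variable names are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following equations (1)-(5); W and b hold the per-gate parameters."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] @ c_prev + b["i"])    # input gate (3)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] @ c_prev + b["f"])    # forget gate (4)
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # internal state (1)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] @ c_t + b["o"])       # output gate (5)
    h_t = o_t * np.tanh(c_t)                                                       # external state (2)
    return h_t, c_t

# toy dimensions and randomly initialised parameters
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_in if k.startswith("x") else d_h))
     for k in ["xi", "hi", "ci", "xf", "hf", "cf", "xc", "hc", "xo", "ho", "co"]}
b = {k: np.zeros(d_h) for k in ["i", "f", "c", "o"]}
h_t, c_t = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```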
2. Bidirectional long short-term memory network (BiLSTM)
The BiLSTM includes a forward LSTM layer, which reads the sequence in its natural order, and a backward LSTM layer, which reads it in reverse order, as shown in fig. 1. For the input x_t at time t, the hidden-layer output of the forward LSTM layer is denoted h_t^{(1)} and the hidden-layer output of the backward LSTM layer is denoted h_t^{(2)}. As shown in fig. 2, the forward output h_t^{(1)} fuses all of the preceding information, as given in formula (6):

h_t^{(1)} = tanh(W_h^{(1)} h_{t-1}^{(1)} + W_x^{(1)} x_t + b^{(1)})    (6)

where W_h^{(1)} and W_x^{(1)} are the weight matrices of the first-layer network and b^{(1)} is a bias vector. Similarly, the backward output h_t^{(2)} fuses the future information, as given in formula (7):

h_t^{(2)} = tanh(W_h^{(2)} h_{t+1}^{(2)} + W_x^{(2)} x_t + b^{(2)})    (7)

The output h_t of the current hidden layer under the whole network framework combines the two, as given in formula (8):

h_t = h_t^{(1)} || h_t^{(2)}    (8)

where the symbol || denotes end-to-end concatenation of the vectors.
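A short PyTorch sketch of the concatenation in formula (8): a bidirectional LSTM whose per-step output is the forward and backward hidden states joined end to end (sizes are illustrative):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=100, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(2, 20, 100)   # (batch, sequence length, word-vector dimension)
h, _ = bilstm(x)
print(h.shape)                # torch.Size([2, 20, 256]): forward 128 dims || backward 128 dims
```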
3. Conditional random field (CRF)
A conditional random field is a probabilistic undirected graph model based on conditional probabilities. In named entity recognition, for a given text input sequence X = (x_1, x_2, ..., x_n), where n is the number of words and x_i is the i-th word of the input sequence, Y = (y_1, y_2, ..., y_n) is the output sequence, whose label set is T = {B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O}; each output label y_i corresponds to an input word x_i. Given that the input sequence X takes the value x, the conditional probability that the output sequence Y takes the value y is P(y | x), and this conditional probability P(Y | X) is the conditional random field, as shown in formulas (9) and (10):

P(y | x) = (1 / Z(x)) exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )    (9)

Z(x) = Σ_y exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )    (10)

where t_k(·) and s_l(·) are the transition and state feature functions, λ_k and μ_l are their weight parameters, and Z(x) is the normalization factor.
When X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n) have the same structure and correspond one to one, the CRF forms a linear-chain conditional random field, as shown in fig. 4.
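A minimal NumPy sketch of the linear-chain CRF probability in formulas (9) and (10), written in the emission/transition form used by neural CRF layers; the variable names and label count are illustrative:

```python
import numpy as np

def crf_log_prob(emissions, transitions, tags):
    """log P(y|x) for one sequence.
    emissions:   (seq_len, num_labels) per-position label scores (state features)
    transitions: (num_labels, num_labels) label-to-label scores (transition features)
    tags:        (seq_len,) gold label indices
    """
    seq_len, _ = emissions.shape
    # numerator: score of the given label path
    score = emissions[0, tags[0]]
    for i in range(1, seq_len):
        score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    # denominator: log of the normalization factor Z(x), computed with the forward algorithm
    alpha = emissions[0]
    for i in range(1, seq_len):
        alpha = emissions[i] + np.logaddexp.reduce(alpha[:, None] + transitions, axis=0)
    return score - np.logaddexp.reduce(alpha)

rng = np.random.default_rng(0)
e = rng.normal(size=(5, 7))     # 7 labels: B/I-PER, B/I-LOC, B/I-ORG, O
T = rng.normal(size=(7, 7))
print(crf_log_prob(e, T, np.array([6, 0, 1, 6, 2])))
```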
The model theory of the IDCNN-CRF named entity recognition model and the BERT-BiLSTM-CRF named entity recognition model is described as follows:
IDCNN-CRF model theory
The dilated convolutional neural network has no pooling, which avoids the information loss of down-sampling and up-sampling while keeping the dimensions unchanged, so the receptive field is not reduced. Compared with an ordinary convolutional neural network, the dilated convolutional neural network adds a dilation-width parameter. The dilation width is the dilation size of the convolution operation: the input matrix skips the entries in between according to the dilation step, so a wider receptive field can be obtained.
The IDCNN framework is shown in fig. 6, where the black dots represent the feature information taken by the convolution kernel and the outer squares represent the receptive field of the convolution operation, denoted f.
The first plot in fig. 6 has a dilation width of 1, corresponding to a conventional convolutional neural network, with receptive field f = 3 × 3 = 9; the second plot is a dilated convolution with dilation width 2: the convolution kernel size is unchanged, still 3 × 3, but with the larger dilation width the data in between is skipped, expanding the receptive field to f = 7 × 7 = 49.
In named entity recognition, each word x_i of the text sequence is concatenated over a sliding window with the mapping matrix W_c, and the output Z_t for the t-th word after the convolution operation is obtained, as shown in formula (11):

Z_t = W_c ⊕_{i=0}^{k} x_{t±i}    (11)

where x_i denotes the i-th word in the text sequence, k denotes the number of iterations of the dilated convolutional neural network, and the symbol ⊕ denotes the concatenation operation. When the dilation width ω is greater than 1, the dilated convolutional neural network no longer acts on a contiguous text span: the range of the text sequence is widened according to the dilation width, feature information between distant words is captured, and this is combined with the affine matrix W_c to obtain the output of the dilated convolution operation, as shown in formula (12):

Z_t = W_c ⊕_{i=0}^{k} x_{t±i·ω}    (12)
as can be seen from fig. 6, IDCNN designs two layers, where the expansion width of the first layer is 1, and all feature information between adjacent characters is taken; the expansion width of the second layer is 2, and the character between the sliding windows f 3 and f 9 can be taken, so that the dependency relationship of the context characteristic information can be captured. The dilated convolution output sequence h of the first layernAs shown in equation 13:
Figure BDA0003104060550000093
wherein the content of the first and second substances,
Figure BDA0003104060550000094
first layer of a convolutional layer having an expansion width of 1
Figure BDA0003104060550000095
The convolution layer output of the j-th layer is
Figure BDA0003104060550000096
r () is a ReLU activation function, the last layer outputs feature information of all words, the dilation width is set to 1, and the output formula is shown as 14:
Figure BDA0003104060550000097
Finally, the two convolutional layers are treated as a whole block, and the final output result for the text sequence is obtained after the IDCNN iterates this block k times.
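A hedged PyTorch sketch of a dilated-convolution block in the spirit of the IDCNN described above; the layer count, channel size and dilation widths are illustrative and not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """A small stack of 1-D convolutions with dilation widths 1, 2, 1 over a token sequence."""

    def __init__(self, dim: int = 120, kernel: int = 3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel, dilation=d, padding=d * (kernel - 1) // 2)
            for d in (1, 2, 1)   # layer 1: adjacent context; layer 2: widened receptive field
        ])
        self.act = nn.ReLU()

    def forward(self, x):        # x: (batch, seq_len, dim) character/word vectors
        h = x.transpose(1, 2)    # Conv1d expects (batch, dim, seq_len)
        for conv in self.convs:
            h = self.act(conv(h))
        return h.transpose(1, 2)

block = DilatedConvBlock()
out = block(torch.randn(2, 30, 120))
print(out.shape)                 # torch.Size([2, 30, 120]); the block can be iterated k times
```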
BERT-BiLSTM-CRF model theory
The internal structure of BERT is a multi-layer bidirectional Transformer that integrates the feature information to the left and right of each character to build a complete context, and it provides two unsupervised pre-training tasks: the masked language model (Masked LM) and next sentence prediction (NSP). The two tasks capture word-level and sentence-level features respectively, and their results are then combined.
Masked language model
A bidirectional LSTM predicts context information by concatenating forward and backward information. For example, in "Changchun is the provincial capital of Jilin", when "Jilin" is predicted, reading from left to right only "Changchun" can be seen, reading from right to left only "provincial capital" can be seen, and the two are never seen together. BERT solves this problem by masking and truly realizes a bidirectional language model. The principle of Masked LM is to mask "Jilin" and predict it with both "Changchun" and "provincial capital"; the context information is linked together to obtain "Jilin". BERT's masking method randomly selects 15% of the words in a training sample; of those selected words, 80% are replaced with the mask token, 10% are replaced with a random word, and 10% remain unchanged.
Next sentence prediction
In the masked language model, BERT establishes word-level dependencies, while NSP aims to learn sentence-level dependencies; tasks such as question answering require associations between sentences. BERT performs NSP as follows: the training samples are divided into two classes. For 50% of the corpus, a normal sentence pair A-B is constructed in which B is the actual next sentence after A, labeled IsNext; for the other 50%, an out-of-order pair is constructed by choosing a random sentence from the corpus as B, labeled NotNext. Predicting whether B is the next sentence of A by classifying the relation between the two sentences represents sentence-level relationships.
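A small sketch of constructing NSP training pairs as described, with 50% IsNext and 50% NotNext pairs; the toy corpus is illustrative:

```python
import random

def make_nsp_pairs(documents):
    """documents: list of sentence lists; returns (sentence_a, sentence_b, label) triples."""
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for a, b in zip(doc, doc[1:]):
            if random.random() < 0.5:
                pairs.append((a, b, "IsNext"))                               # the real next sentence
            else:
                pairs.append((a, random.choice(all_sentences), "NotNext"))   # a random sentence
    return pairs

docs = [["长春是吉林的省会。", "它位于中国东北。"],
        ["BERT提出了两个预训练任务。", "其中之一是下一句预测。"]]
print(make_nsp_pairs(docs))
```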
Experimental results and analysis:
IDCNN-CRF model experiment
The horizontal comparison experiments for the IDCNN-CRF model under different experimental parameters mainly cover three aspects: the effect of loading (or not loading) pre-trained word vectors, the effect of the word vector dimensionality, and the effect of the learning rate on the experimental results. Experiments are run on different datasets with different parameter settings to find the optimal parameter combination, laying the groundwork for the vertical comparison experiments between different models.
Comparison of loading pre-trained word vectors
To compare the influence of pre-trained word vectors on the experimental results, a large-scale Wikipedia corpus is trained with Word2Vec from the Gensim toolkit to obtain pre-trained word vectors. The other parameters are fixed: the objective function is optimized with the Adam algorithm, the word vector dimension is 100, the hidden layer dimension is 128, the gradient clip is set to 5, the learning rate is 0.001, and the dropout rate is 0.5. On this basis, the effect of loading or not loading the pre-trained word vectors is compared; the experimental results are shown in Table 1.
TABLE 1 Loading of Pre-training word vector test results
As can be seen from Table 1, random word vectors carry uncertainty: on the People's Daily dataset their accuracy is higher than that of the loaded pre-trained word vectors, but in recall and F1 the Word2Vec word vectors beat the randomly initialized vectors by 0.78% and 0.73% respectively; on the MSRA dataset the accuracy, recall and F1 of the Word2Vec word vectors are all higher than those of the random word vectors, by 0.95%, 1.62% and 2.18% respectively. Pre-trained word vectors are therefore loaded in the subsequent experiments.
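A hedged Gensim sketch of training the 100-dimensional pre-trained word vectors described above; the toy corpus, tokenization and output path are placeholders, and the patent does not state which Word2Vec variant or Gensim version was used:

```python
from gensim.models import Word2Vec  # Gensim 4.x API assumed

# hypothetical pre-tokenized corpus: one list of tokens per sentence (real runs would use Wikipedia)
sentences = [["长春", "是", "吉林", "的", "省会"],
             ["命名", "实体", "识别", "技术"]]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # 100-dimensional word vectors, matching the setting above
    window=5,
    min_count=1,
    sg=1,              # skip-gram; the variant actually used is not stated in the patent
)
model.wv.save_word2vec_format("wiki_vectors_100d.txt")  # load these as the Embedding-layer input
```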
Word vector dimension comparison
In order to select the proper word vector dimension, the model is trained with 50-dimensional, 100-dimensional and 200-dimensional word vectors, and the experimental results obtained are shown in Table 2.
TABLE 2 different word vector dimension experimental results
As can be seen from Table 2, with the other parameters fixed, 100-dimensional word vectors perform better than the other dimensions. Below 100 dimensions, the model has too few trainable parameters and weak fitting capability; as the word vector dimension increases beyond that, the effect drops and over-fitting tends to occur. Selecting 100-dimensional word vectors is therefore appropriate.
Comparison of learning rate
The learning rate determines how the model weights are updated and directly influences the convergence of the model. The experimental results of training the model with different learning rates are compared, and the optimal learning rate is selected.
TABLE 3 learning rates of different sizes
As can be seen from table 3, the learning rate is preferably 0.001, and the subsequent model learning rate can be set to 0.001.
Comparison with the baseline model
After comparing the three parameters above (pre-trained word vectors, word vector dimensionality and learning rate), a group of optimal parameters is selected to train the different models for comparison. The experimental results compared with the baseline CRF model are as follows:
TABLE 4 comparison of experimental results with baseline model
As can be seen from Table 4, the combination of the deep learning method and the statistical method performs better than the traditional statistical method, with the F1 value improved by 10.4% on the People's Daily dataset and by 11.41% on the MSRA dataset. The IDCNN-CRF model thus performs well in the named entity recognition task.
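The precision, recall and F1 comparisons above are entity-level scores; a minimal sketch of how such scores can be computed with the seqeval library (an assumed tool; the patent does not name its evaluation code):

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# gold and predicted BIO tag sequences for two toy sentences
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["B-ORG", "I-ORG", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))  # an entity counts only if its full span and type match
```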
BERT-BiLSTM-CRF model experiment
1. Configuration of experimental parameters
Based on the previous experiments, the experimental parameters of the BERT-BiLSTM-CRF model built on BERT-Base can be determined as shown in Table 5:
TABLE 5 Experimental parameters
2. Comparative analysis with other models
For a comprehensive evaluation of the model, with the same parameters guaranteed, the model is compared vertically against different models, and the accuracy, recall and F1 value for three entity types (person names, place names and organization names) are compared and analysed. The experimental results are shown in Table 6:
table 6 results of different model experiments
Table 6 shows the performance of different models in three types of entities, and according to the above experimental results, histograms are plotted, which respectively show the accuracy, recall rate, and F1 value of different models in the three types of entities. The accuracy of the different models is shown in fig. 7.
As can be seen from the figure, the accuracy of the BERT-BiLSTM-CRF model on the three entity types (person names, place names and organization names) is higher than that of the other models, exceeding the IDCNN-CRF model by 5.94%, 8.99% and 4.64% respectively. FIG. 8 shows the recall of the different models.
As can be seen from the figure, the recall of the BERT-BiLSTM-CRF model on the three entity types is higher than that of the other models, exceeding the IDCNN-CRF model by 3.79%, 6.25% and 4.08% respectively.
FIG. 9 shows the F1 values of the different models.
As can be seen from the figure, the F1 value of the BERT-BiLSTM-CRF model on the three entity types is higher than that of the other models, exceeding the IDCNN-CRF model by 4.67%, 7.63% and 4.36% respectively.
In conclusion, the BERT-BiLSTM-CRF model designed here performs better than the other models, and introducing the BERT pre-trained word vectors brings a marked improvement, illustrating the effectiveness of the BERT-BiLSTM-CRF model in the named entity recognition task. From the entity perspective, the effect on LOC is weaker than on ORG and PER, which is caused by many influencing factors such as place names nested inside organization names, abbreviations and semantic divergence.
The above analysis compares the effects of the different models at the entity level; the performance of the different models on the People's Daily dataset and the MSRA dataset is now compared in terms of overall model effect. The experimental results are shown in Table 7.
TABLE 7 Overall comparison test results of different models
In order to clearly see the level of the model performance, the results are shown in bar chart form as shown in fig. 10 and fig. 11.
As can be seen from the figures, the BERT-BiLSTM-CRF model scores higher than the other models on the accuracy, recall and F1 evaluation indexes. On the People's Daily dataset it is 5.2%, 5.93% and 4.91% higher than the IDCNN-CRF model respectively, and on the MSRA dataset it is 4.01%, 6.98% and 4.63% higher respectively. The effectiveness of the BERT-BiLSTM-CRF model is thus verified at the whole-model level.
While the invention has been described above with reference to an embodiment, various modifications may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In particular, the various features of the disclosed embodiments of the invention may be used in any combination, provided that no structural conflict exists, and the combinations are not exhaustively described in this specification merely for the sake of brevity and resource conservation. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (7)

1. A BERT-BiLSTM-CRF named entity detection model, comprising an IDCNN-CRF named entity recognition model and a BERT-BiLSTM-CRF named entity recognition model, characterized in that:
the IDCNN-CRF named entity recognition model architecture is as follows:
the Embedding layer is a word vector layer used for processing the input data into word vectors and feeding them into the model, with Word2Vec adopted as the distributed vector representation;
the IDCNN layer receives the character or word vectors produced by the embedding layer and recomputes the input vectors through the dilated convolution operation of the dilated convolutional neural network to obtain a new vector representation;
the projection layer performs a linear transformation on the vector representations computed by the IDCNN layer, mapping them to a dimension equal to the number of tags, and then applies Softmax normalization to obtain a probability p; if the mapped representation has m dimensions, each dimension can be regarded as the probability of one tag class, and selecting the class with the maximum probability yields the classification result, completing the named entity recognition task;
the CRF layer screens out the optimal result through the transition matrix and feeds it back to the user;
the BERT-BiLSTM-CRF named entity recognition model is structured as follows:
the BERT layer takes as input sentences composed of single characters; after BERT processes the text sequence and obtains a vector representation for each character, this output serves as the input of the following BiLSTM layer;
and in the BiLSTM-CRF layer, the BERT-processed text sequence yields the corresponding BERT pre-trained vector representations, which enter the BiLSTM units; the BiLSTM output is computed and sent to the CRF, and the optimal sequence of labels is calculated.
2. The BERT-BiLSTM-CRF named entity detection model as claimed in claim 1, wherein the Embedding layer captures the dependency relationships between preceding and following characters by training on a large-scale corpus, and feeds pre-trained 100-dimensional Wikipedia word vectors and 20-dimensional word-segmentation features as input into the next layer.
3. The model of claim 1, wherein the CRF layer combines the results of deep learning with a statistical learning model: the CRF maintains a matrix of label-to-label transition probabilities, the m-dimensional label space is expanded to (m+2) × (m+2) with the two extra dimensions representing the start and end states, and invalid labels are corrected by learning the rules of label transition from the changes of these two parameters.
4. The model of claim 1, wherein the beginning of a sentence in the BERT layer is marked with [CLS], the end of a sentence and the separation between sentences are represented by [SEP], and the input of BERT is composed of three parts: a token (word) vector, a segment vector and a position vector.
5. The model of claim 1, wherein the forward LSTM of the BiLSTM in the BiLSTM-CRF layer computes the semantic representation of the current word and its left context, the backward LSTM computes the semantic representation of the current word and its right context, and the two hidden-layer state representations are concatenated to obtain the output of the BiLSTM.
6. The BERT-BiLSTM-CRF named entity detection model of claim 1, wherein the main formulas implemented by the algorithm are:
[Four formula images in the original publication; not reproduced here.]
7. A BERT-BiLSTM-CRF named entity detection device, characterized by comprising:
an information extraction module for extracting entity information and the semantic relations between entities;
the information extraction module is connected with an information retrieval module, which screens out information related to the keywords from a keyword query, uses named entity recognition to identify the entity type of the keywords, classifies the text information and narrows the retrieval range;
the information retrieval module is connected with a machine translation module, which identifies the entity information of the translation target and performs lexical analysis using translation rules;
and the machine translation module is connected with a question-answering system, which searches for answers to questions by matching the relations between keywords and entities and feeds the output back to the user.
CN202110631994.7A 2021-06-07 2021-06-07 BERT-BilSTM-CRF named entity detection model and device Pending CN113204970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110631994.7A CN113204970A (en) 2021-06-07 2021-06-07 BERT-BilSTM-CRF named entity detection model and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110631994.7A CN113204970A (en) 2021-06-07 2021-06-07 BERT-BilSTM-CRF named entity detection model and device

Publications (1)

Publication Number Publication Date
CN113204970A true CN113204970A (en) 2021-08-03

Family

ID=77024155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110631994.7A Pending CN113204970A (en) 2021-06-07 2021-06-07 BERT-BilSTM-CRF named entity detection model and device

Country Status (1)

Country Link
CN (1) CN113204970A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468344A (en) * 2021-09-01 2021-10-01 北京德风新征程科技有限公司 Entity relationship extraction method and device, electronic equipment and computer readable medium
CN113569016A (en) * 2021-09-27 2021-10-29 北京语言大学 Bert model-based professional term extraction method and device
CN113723104A (en) * 2021-09-15 2021-11-30 云知声智能科技股份有限公司 Method and device for entity extraction under noisy data
CN113851190A (en) * 2021-11-01 2021-12-28 四川大学华西医院 Heterogeneous mRNA sequence optimization method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670179A (en) * 2018-12-20 2019-04-23 中山大学 Case history text based on iteration expansion convolutional neural networks names entity recognition method
CN111199152A (en) * 2019-12-20 2020-05-26 西安交通大学 Named entity identification method based on label attention mechanism
CN111209738A (en) * 2019-12-31 2020-05-29 浙江大学 Multi-task named entity recognition method combining text classification
CN111339318A (en) * 2020-02-29 2020-06-26 西安理工大学 University computer basic knowledge graph construction method based on deep learning
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112182243A (en) * 2020-09-27 2021-01-05 中国平安财产保险股份有限公司 Method, terminal and storage medium for constructing knowledge graph based on entity recognition model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670179A (en) * 2018-12-20 2019-04-23 中山大学 Case history text based on iteration expansion convolutional neural networks names entity recognition method
CN111199152A (en) * 2019-12-20 2020-05-26 西安交通大学 Named entity identification method based on label attention mechanism
CN111209738A (en) * 2019-12-31 2020-05-29 浙江大学 Multi-task named entity recognition method combining text classification
CN111339318A (en) * 2020-02-29 2020-06-26 西安理工大学 University computer basic knowledge graph construction method based on deep learning
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112182243A (en) * 2020-09-27 2021-01-05 中国平安财产保险股份有限公司 Method, terminal and storage medium for constructing knowledge graph based on entity recognition model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Xiang et al., "Named entity recognition in the field of ecological governance technology based on a BiLSTM-IDCNN-CRF model" (基于BiLSTM-IDCNN-CRF模型的生态治理技术领域命名实体识别), Computer Applications and Software (《计算机应用与软件》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468344A (en) * 2021-09-01 2021-10-01 北京德风新征程科技有限公司 Entity relationship extraction method and device, electronic equipment and computer readable medium
CN113468344B (en) * 2021-09-01 2021-11-30 北京德风新征程科技有限公司 Entity relationship extraction method and device, electronic equipment and computer readable medium
CN113723104A (en) * 2021-09-15 2021-11-30 云知声智能科技股份有限公司 Method and device for entity extraction under noisy data
CN113569016A (en) * 2021-09-27 2021-10-29 北京语言大学 Bert model-based professional term extraction method and device
CN113569016B (en) * 2021-09-27 2022-01-25 北京语言大学 Bert model-based professional term extraction method and device
CN113851190A (en) * 2021-11-01 2021-12-28 四川大学华西医院 Heterogeneous mRNA sequence optimization method

Similar Documents

Publication Publication Date Title
CN108733792B (en) Entity relation extraction method
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN108984526B (en) Document theme vector extraction method based on deep learning
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN113204970A (en) BERT-BilSTM-CRF named entity detection model and device
CN109871541B (en) Named entity identification method suitable for multiple languages and fields
CN106569998A (en) Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110287323B (en) Target-oriented emotion classification method
CN112541356B (en) Method and system for recognizing biomedical named entities
CN111753058B (en) Text viewpoint mining method and system
CN111914556B (en) Emotion guiding method and system based on emotion semantic transfer pattern
CN111274794B (en) Synonym expansion method based on transmission
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN114492441A (en) BilSTM-BiDAF named entity identification method based on machine reading understanding
CN114510946B (en) Deep neural network-based Chinese named entity recognition method and system
CN114153973A (en) Mongolian multi-mode emotion analysis method based on T-M BERT pre-training model
CN116258137A (en) Text error correction method, device, equipment and storage medium
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN116796744A (en) Entity relation extraction method and system based on deep learning
CN113076718B (en) Commodity attribute extraction method and system
CN117094325B (en) Named entity identification method in rice pest field
CN116522165A (en) Public opinion text matching system and method based on twin structure
CN115906846A (en) Document-level named entity identification method based on double-graph hierarchical feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210803