CN108920445B - Named entity identification method and device based on Bi-LSTM-CRF model - Google Patents

Named entity identification method and device based on Bi-LSTM-CRF model

Info

Publication number
CN108920445B
CN108920445B
Authority
CN
China
Prior art keywords
lstm
matrix
character
vector
module
Prior art date
Legal status
Active
Application number
CN201810369183.2A
Other languages
Chinese (zh)
Other versions
CN108920445A (en)
Inventor
莫益军
姚澜
杨帆
Current Assignee
Huazhong University of Science and Technology
Ezhou Institute of Industrial Technology Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Ezhou Institute of Industrial Technology Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology and Ezhou Institute of Industrial Technology, Huazhong University of Science and Technology
Priority to CN201810369183.2A
Publication of CN108920445A
Application granted
Publication of CN108920445B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Abstract

The invention provides a named entity recognition method and device based on a Bi-LSTM-CRF model. Data preprocessing is performed on natural language: under the training condition, an input first natural language is separated to obtain a first character sequence; each character in the first character sequence is mapped to obtain a vector matrix, and the vector matrix is input into a Bi-LSTM module; the emission matrix of the Bi-LSTM module is input into a CRF layer to form the Bi-LSTM-CRF model, and the Bi-LSTM-CRF model performs whole-sentence entity recognition on the natural language; and the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation. This solves the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy. Named entities can be extracted from a sentence by feeding the raw sentence directly into the model, so the method is highly adaptable, widely applicable, and improves the accuracy of entity recognition.

Description

Named entity identification method and device based on Bi-LSTM-CRF model
Technical Field
The invention relates to the technical field of information processing, in particular to a named entity identification method and device based on a Bi-LSTM-CRF model.
Background
Named entity recognition is the most basic and most widely used task among natural language processing applications. It identifies entities with specific meanings in text, mainly including person names, place names, organization names, proper nouns, and the like. Named entity recognition is an important basic tool for downstream applications such as information extraction, question answering systems, syntactic analysis, machine translation, and metadata annotation for the semantic web. With this tool, natural-language models can be built that understand, analyze, and answer natural language much as a human would. Existing named entity recognition schemes fall into two basic categories: statistics-based methods and neural-network-based methods. Statistics-based methods mainly comprise the HMM model and the CRF model; neural-network-based methods mainly comprise convolutional neural networks and LSTM neural networks.
In the prior art, statistics-based methods cannot take long-distance context information into account, while neural-network-based methods produce outputs that are mutually independent; when the output labels have strong dependency relationships, a neural-network-based scheme cannot model those dependencies, so recognition accuracy is limited.
Disclosure of Invention
The embodiment of the invention provides a named entity identification method and device based on a Bi-LSTM-CRF model, and solves the technical problem that models in the prior art cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy.
In view of the above problems, the present application provides a method and apparatus for identifying a named entity based on a Bi-LSTM-CRF model.
In a first aspect, the invention provides a named entity identification method based on a Bi-LSTM-CRF model, which comprises the following steps:
performing data preprocessing on natural language, and, under the training condition, separating an input first natural language to obtain a first character sequence; under the prediction condition, separating an input second natural language to obtain a second character sequence, and, by comparing the second character sequence with the first character sequence, classifying characters of the second character sequence that do not exist in the first character sequence as unregistered characters; mapping each character in the first character sequence to obtain a vector matrix, wherein the vector matrix comprises a fixed-dimension vector corresponding to each character; inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix; inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language; wherein the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation.
Preferably, the performing data preprocessing on the natural language and, under the training condition, separating the input first natural language to obtain the first character sequence further comprises: performing manual labeling according to the first character sequence to obtain labeled data; and inputting the labeled data into a neural network.
Preferably, the mapping according to each character in the character sequence to obtain a vector matrix further comprises: constructing a character co-occurrence matrix; performing matrix decomposition on the co-occurrence matrix by a gradient descent method to obtain character vectors; and mapping the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set.
Preferably, the inputting the vector matrix into the Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the combining of the output results of the forward LSTM module and the backward LSTM module into an emission matrix, further comprises: determining the hyper-parameters of the neural network; during training, accelerating the training with a batch normalization method; and combining the forward LSTM module and the backward LSTM module into a Bi-LSTM module, wherein at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Preferably, the inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language, further comprises: performing whole-sentence entity recognition on the natural language to obtain a first matrix; and obtaining a tag sequence according to the first matrix.
Preferably, the cross-validation comprises: dividing the labeled data into a training data set and a test data set; training different parameter sets on the training data set to obtain a series of models with different hyper-parameters; and evaluating the series of models with different hyper-parameters on the test data set to determine the parameters of the optimal model.
In a second aspect, the present invention provides a named entity recognition apparatus based on a Bi-LSTM-CRF model, the apparatus comprising:
the first obtaining unit is used for carrying out data preprocessing on the natural language and separating the input first natural language under the training condition to obtain a first character sequence;
the first comparison unit is used for separating an input second natural language under the prediction condition to obtain a second character sequence, and classifying characters of the second character sequence, which do not exist in the first character sequence, into unregistered characters according to the comparison of the second character sequence and the first character sequence;
a second obtaining unit, configured to map each character in the first character sequence to obtain a vector matrix, where the vector matrix includes a vector with a fixed dimension corresponding to each character;
a third obtaining unit, configured to input the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix;
the first identification unit is used for inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
a first determination unit for determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation.
Preferably, the apparatus further comprises:
a fourth obtaining unit, configured to perform manual labeling according to the first character sequence to obtain labeled data;
a first input unit for inputting the labeled data into a neural network.
Preferably, the apparatus further comprises:
the first construction unit is used for constructing a character co-occurrence matrix;
a fifth obtaining unit, configured to perform matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
a sixth obtaining unit, configured to map the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set.
Preferably, the apparatus further comprises:
a second determination unit for determining a hyper-parameter of the neural network;
a first acceleration unit for accelerating training using a batch normalization method during training;
a first output unit, configured to combine the forward LSTM module and the backward LSTM module into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Preferably, the apparatus further comprises:
a seventh obtaining unit, configured to obtain a first matrix according to entity identification of a whole sentence of the natural language;
an eighth obtaining unit, configured to obtain a tag sequence according to the first matrix.
Preferably, the cross-validation comprises:
a first sorting unit to divide the labeled data into a training data set and a test data set;
a ninth obtaining unit, configured to train different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
a third determination unit for evaluating the series of models with different hyper-parameters on the test data set and determining the parameters of the optimal model.
In a third aspect, the present invention provides a named entity recognition apparatus based on a Bi-LSTM-CRF model, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the program to implement the following steps: performing data preprocessing on natural language, and, under the training condition, separating an input first natural language to obtain a first character sequence; under the prediction condition, separating an input second natural language to obtain a second character sequence, and, by comparing the second character sequence with the first character sequence, classifying characters of the second character sequence that do not exist in the first character sequence as unregistered characters; mapping each character in the first character sequence to obtain a vector matrix, wherein the vector matrix comprises a fixed-dimension vector corresponding to each character; inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix; inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language; wherein the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
1. The named entity recognition method and device based on the Bi-LSTM-CRF model perform data preprocessing on natural language: under the training condition, the input first natural language is separated to obtain a first character sequence; under the prediction condition, an input second natural language is separated to obtain a second character sequence, and characters of the second character sequence that do not exist in the first character sequence are classified as unregistered characters by comparing the two sequences; each character in the first character sequence is mapped to obtain a vector matrix comprising a fixed-dimension vector per character; the vector matrix is input into a Bi-LSTM module, whose forward and backward LSTM modules apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively and whose combined output is an emission matrix; the emission matrix is input into a CRF layer to form the Bi-LSTM-CRF model, which performs whole-sentence entity recognition on the natural language; and the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation. This solves the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy; named entities in sentences can be extracted by feeding the raw sentences directly into the model, so the method is highly adaptable, widely applicable, and improves the accuracy of entity recognition.
2. The embodiment of the application determines the hyper-parameters of the neural network; during training, a batch normalization method is used to accelerate training; and the forward LSTM module and the backward LSTM module are combined into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point. Further, because the model is end-to-end, named entities in sentences can be extracted by feeding the raw sentences directly into the model, without processing the natural-language data into any special format or performing feature conversion on the raw data, so the adaptability is higher.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly, and that the above and other objects, features, and advantages of the present invention may become more readily apparent, embodiments of the invention are described below.
Drawings
FIG. 1 is a schematic flow chart of a named entity recognition method based on a Bi-LSTM-CRF model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a named entity recognition apparatus based on a Bi-LSTM-CRF model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the Bi-LSTM network model provided in the embodiment of the present invention;
FIG. 4 is a structural diagram of a Bi-LSTM-CRF model provided in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a named entity identification method and device based on a Bi-LSTM-CRF model. The general idea of the technical solution is as follows: performing data preprocessing on natural language, and, under the training condition, separating an input first natural language to obtain a first character sequence; under the prediction condition, separating an input second natural language to obtain a second character sequence, and classifying characters of the second character sequence that do not exist in the first character sequence as unregistered characters by comparing the two sequences; mapping each character in the first character sequence to obtain a vector matrix comprising a fixed-dimension vector per character; inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining their output results into an emission matrix; inputting the emission matrix into a CRF layer to form a Bi-LSTM-CRF model that performs whole-sentence entity recognition on the natural language; and determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation. The solution addresses the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy, and achieves the technical effects that named entities in sentences can be extracted by feeding raw sentences directly into the model, with strong adaptability, a wide range of application, and improved accuracy of entity recognition.
The technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present application serve to explain, not to limit, the technical solutions of the present application, and that the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
Example 1
FIG. 1 is a schematic flow chart of a named entity recognition method based on a Bi-LSTM-CRF model according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
step 110: and performing data preprocessing on the natural language, and separating the input first natural language under the training condition to obtain a first character sequence.
Step 120: under the prediction condition, separating the input second natural language to obtain a second character sequence, and classifying characters of the second character sequence, which do not exist in the first character sequence, as unregistered characters according to the comparison of the second character sequence and the first character sequence.
Further, manual labeling is performed according to the first character sequence to obtain labeled data, and the labeled data is input into a neural network.
Specifically, under the training condition, an input natural language is split to obtain a first character sequence C, whose elements are the single characters and punctuation marks of the original input. Manual labeling is then performed on the first character sequence to obtain labeled data, which is input into a neural network. For example, the character sequence is labeled with a letter sequence serving as the tag sequence, where each element of the tag sequence is the entity category of the corresponding character: B denotes the start character of an entity, E denotes the end character, N denotes any other category, L denotes a location, O denotes an organization, T denotes a time, and P denotes a person name. The entity types to be recognized can be increased or decreased as needed; only the corresponding tags need to be changed when labeling the data. Under the prediction condition, an input second natural language is separated to obtain a second character sequence, and, by comparing the second character sequence with the first character sequence, characters of the second character sequence that do not exist in the first character sequence are classified as unregistered characters <UNK>.
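As a concrete illustration, this preprocessing can be sketched in a few lines of Python. The sketch below is illustrative only: the toy training sentence, the whitespace-stripping split rule, and the handling of the <UNK> token are assumptions consistent with the description above, not the patented implementation.

```python
def to_char_sequence(sentence):
    """Split a raw sentence into single characters and punctuation marks."""
    return [ch for ch in sentence if not ch.isspace()]

# Character vocabulary collected from the (toy) training sentences.
train_sentences = ["华中科技大学位于武汉。"]
vocab = {ch for s in train_sentences for ch in to_char_sequence(s)}

def replace_unregistered(chars, vocab):
    """At prediction time, characters unseen during training become <UNK>."""
    return [ch if ch in vocab else "<UNK>" for ch in chars]

# "北", "京", and "国" were never seen in training, so they map to <UNK>.
print(replace_unregistered(to_char_sequence("北京位于中国。"), vocab))
```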
Step 130: mapping according to each character in the character sequence to obtain a vector matrix, wherein the vector matrix comprises a vector with fixed dimensionality corresponding to each character;
Further, a character co-occurrence matrix is constructed; matrix decomposition is performed on the co-occurrence matrix by a gradient descent method to obtain character vectors; and the characters are mapped to obtain a feature matrix, with unregistered characters uniformly mapped to the character vector of the unregistered-character set.
Specifically, each character in the first character sequence C is mapped so that a character is expressed as a vector of fixed dimension; in this embodiment the dimension is fixed at 50. For a natural-language sentence, a feature matrix can thus be obtained in which each row is the character vector of one character and the number of rows equals the number of characters. The vector corresponding to a character is obtained from a pre-trained model. To this end, this step can be divided into three sub-steps:
step 131: the corpus data is counted to obtain a co-occurrence matrix at a character level, each element in the co-occurrence matrix is a logarithm value of the frequency of any two characters appearing in the same sentence, and the co-occurrence matrix obtained at the moment has the characteristics of discrete word vectors, high dimension and sparseness and cannot be used as the word vectors.
Step 132: the co-occurrence matrix is reduced in dimension by matrix decomposition using a gradient descent method, and the distributed representation of each character is determined by minimizing a loss function defined as follows:
J = \sum_{i,j} \left( v_i^{\top} u_j - X_{ij} \right)^2
In the above formula, X_ij is the logarithm of the number of times the characters w_i and w_j appear together in the same sentence, v_i is the distributed representation of w_i as a center character, and u_j is the distributed representation of w_j as a context character. The loss function is minimized by stochastic gradient descent, and the converged values are the character vectors Mc.
Step 133: each element of the character sequence is mapped by table lookup. The final result is the feature matrix M_s corresponding to the original natural language. Unregistered characters are uniformly mapped to the vector represented by <UNK>; this removes the ambiguity of such characters and reduces the effect of noise.
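The three sub-steps can be sketched in numpy as follows. This is a hedged sketch, not the patented code: the toy corpus, the add-one smoothing before the logarithm, the learning rate, the iteration count, and the choice of the center-character vectors as the final embedding are all assumptions, and full-batch gradient descent stands in for the stochastic variant described above.

```python
import numpy as np

# Step 131: character-level co-occurrence counts over a toy corpus (assumed).
corpus = ["我爱北京", "我在北京", "北京在中国"]
chars = sorted({c for s in corpus for c in s})
idx = {c: i for i, c in enumerate(chars)}
V, d = len(chars), 50                  # vocabulary size, vector dimension
counts = np.ones((V, V))               # add-one smoothing before the log
for s in corpus:
    for a in s:
        for b in s:
            counts[idx[a], idx[b]] += 1
X = np.log(counts)                     # log co-occurrence matrix

# Step 132: factorize X ~ v @ u.T by gradient descent on J = sum(err ** 2).
rng = np.random.default_rng(0)
v = rng.normal(scale=0.1, size=(V, d))  # center-character vectors v_i
u = rng.normal(scale=0.1, size=(V, d))  # context-character vectors u_j
lr = 1e-3
for _ in range(500):
    err = v @ u.T - X                  # reconstruction residual
    # dJ/dv and dJ/du up to the constant factor 2, absorbed into lr.
    v, u = v - lr * (err @ u), u - lr * (err.T @ v)
Mc = v                                 # final character vectors (assumed: v)

# Step 133: table lookup maps a sentence to its feature matrix M_s.
M_s = np.stack([Mc[idx[c]] for c in "我爱北京"])
print(M_s.shape)                       # (4, 50): one 50-d row per character
```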
Step 140: inputting the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix.
Further, the hyper-parameters of the neural network are determined; during training, a batch normalization method is used to accelerate training; and the forward LSTM module and the backward LSTM module are combined into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Specifically, after the vector matrix of the original natural language is input into the Bi-LSTM network, the forward LSTM module performs a nonlinear transformation on the vector sequence from front to back and the backward LSTM module from back to front; the results output by the two modules are combined, and after a softmax operation on the combined result, an abstract representation of the language that takes both forward and backward information into account is obtained. The step mainly comprises the following sub-steps:
step 141: and determining hyper-parameters of the neural network, namely the specific parameters of the neural network mainly comprise 50-dimensional input vector dimension, 256-dimensional hidden layer vector dimension, 50-dimensional output dimension and random initialization of hidden layer vector initialization mode.
Step 142: during training, a batch normalization technique is used to accelerate training; that is, each input to the Bi-LSTM network is a batch of training samples rather than a single sample. Concretely, the feature matrix M_s representing the batch of input sentences is normalized to zero mean and unit variance, as follows:
\hat{M}_s(i,j) = \frac{M_s(i,j) - \mathrm{E}[M_s(:,j)]}{\sqrt{\mathrm{Var}[M_s(:,j)]}}
in the above formula, M _ s (i, j) represents the element in the ith row and jth column of the feature matrix, E [ ] operation represents expectation, and Var operation represents variance. The data obtained by the normalization operation all satisfy 0 mean, unit variance and weak correlation.
Step 143: to take both past and future information in a sentence into account, the forward LSTM module and the backward LSTM module are combined into the Bi-LSTM module. For the forward LSTM module, at each time point one row of the feature matrix M_s of the natural language is input; the output at each time point is a vector and a hidden vector, where the output vector holds the emission probabilities of the single character over the entity classes, and the hidden vector is input into the LSTM module at the next time point, where it jointly generates the output of the next time point together with the row of the feature matrix input at that time point. The backward LSTM module performs the same operation in the reverse direction. After the forward and backward LSTM modules produce their output vectors, the two output vectors are input into a connection layer and subjected to a nonlinear transformation, yielding the emission probabilities. The resulting character-to-entity-class emission matrix Me will serve as an evaluation score and is one of the considerations for evaluating a candidate tag path.
FIG. 3 is a schematic diagram of the operation of the Bi-LSTM module: each circle in the input feature matrix represents a character vector, and the forward LSTM module and the backward LSTM module read the vector sequence front to back and back to front, respectively.
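For orientation, steps 141-143 correspond to a few lines of PyTorch. The sketch below is an assumed implementation, not the patented code: the 50-dimensional inputs and 256-dimensional hidden layer follow the embodiment, while the tag-set size of 9 and the tanh nonlinearity on the connection layer are illustrative choices.

```python
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    def __init__(self, emb_dim=50, hidden=256, num_tags=9):
        super().__init__()
        # bidirectional=True runs a front-to-back and a back-to-front LSTM
        # and concatenates their hidden states at every time step.
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # connection layer

    def forward(self, feats):              # feats: (batch, seq_len, emb_dim)
        out, _ = self.bilstm(feats)        # (batch, seq_len, 2 * hidden)
        return self.proj(torch.tanh(out))  # emission matrix Me per sentence

emitter = BiLSTMEmitter()
Me = emitter(torch.randn(4, 20, 50))       # 4 sentences of 20 characters
print(Me.shape)                            # torch.Size([4, 20, 9])
```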
To counter overfitting and similar problems that may occur during training, the following methods can be adopted (a schematic sketch of the first follows the list):
1) Early stopping: a maximum patience (number of consecutive rounds) is set; at the end of each round the latest model parameters are tested on the test set to obtain the corresponding accuracy, and the best accuracy is recorded. When the accuracy of the latest parameters has remained below the best accuracy for more consecutive rounds than the maximum, training stops and the best parameters are saved.
2) Dropout: during training, the output values of some neurons are discarded with a certain probability p.
3) L2 regularization: in the embodiment of the invention, the regularization term is the L2 norm of the model parameter vector. Minimizing the objective function with this regularization term yields a simpler model with stronger generalization ability.
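Remedy 1) amounts to a generic training loop; in the sketch below the train_round, evaluate, and snapshot callbacks are hypothetical stand-ins for one training round, test-set evaluation, and parameter saving.

```python
import random

def fit_with_early_stopping(train_round, evaluate, snapshot, max_patience=5):
    """Stop once test accuracy fails to beat the best for max_patience
    consecutive rounds; return the best parameters seen."""
    best_acc, patience, best_state = -1.0, 0, None
    while patience < max_patience:
        train_round()
        acc = evaluate()
        if acc > best_acc:
            best_acc, patience, best_state = acc, 0, snapshot()
        else:
            patience += 1
    return best_state, best_acc

# Toy usage with stand-in callbacks (a real run would train the Bi-LSTM-CRF).
state, acc = fit_with_early_stopping(lambda: None,
                                     lambda: random.random(),
                                     lambda: "model-parameters")
```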
Step 150: inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
Further, a first matrix is obtained from the whole-sentence entity recognition of the natural language, and a tag sequence is obtained from the first matrix.
Specifically, referring to FIG. 4, which is a schematic structural diagram of the entire model, the emission matrix Me output by the Bi-LSTM module is input into the CRF layer, and the Bi-LSTM-CRF model performs sentence-level entity recognition on the natural language. Its output is the first matrix, whose elements are the scores for transitions from one entity class to another. From the first matrix a tag sequence is obtained, i.e. the entity category sequence y = (y1, y2, …), and the Bi-LSTM-CRF model scores a candidate tag sequence y for a sentence as:
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
in the above equation, Pi, yi are the elements of the emission matrix output by the Bi-LSTM layer, while Ayi-byi are the elements of the transfer matrix output by the CRF layer. It can be seen that the score of the whole sequence is the sum of the scores of the modules, and the score of each position is obtained by two modules, one part is the output emission probability matrix of the Bi-LSTM model, and the other part is the output transition probability matrix of the CRF layer. Obtaining the probability that y corresponds to the natural language after normalization processing is carried out through the formula:
p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}
When the Bi-LSTM-CRF model is trained, the parameters are adjusted by maximizing the log-likelihood:
\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}
the highest scoring sequence is solved during the prediction (decoding) by using the dynamically planned algorithm viterbi.
y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})
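The scoring and decoding formulas translate directly into numpy. In the sketch below, P is the emission matrix from the Bi-LSTM, A is the transition matrix learned by the CRF layer, and the start/stop transitions of the full model are omitted for brevity (an assumed simplification).

```python
import numpy as np

def sequence_score(P, A, y):
    """s(X, y): transition scores A[y_i, y_{i+1}] plus emissions P[i, y_i]."""
    y = np.asarray(y)
    return P[np.arange(len(y)), y].sum() + A[y[:-1], y[1:]].sum()

def viterbi(P, A):
    """Dynamic programming for y* = argmax over all tag sequences."""
    n, k = P.shape
    score = P[0].copy()                   # best score ending in each tag
    back = np.zeros((n, k), dtype=int)    # backpointers
    for i in range(1, n):
        cand = score[:, None] + A + P[i]  # every (previous, current) pair
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):         # follow backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(score.max())

rng = np.random.default_rng(1)
P, A = rng.normal(size=(6, 5)), rng.normal(size=(5, 5))
path, best = viterbi(P, A)
assert np.isclose(best, sequence_score(P, A, path))  # decode is consistent
```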
Step 160: determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation.
Further, the cross-validation comprises: dividing the labeled data into a training data set and a test data set; training different parameter sets on the training data set to obtain a series of models with different hyper-parameters; and evaluating the series of models with different hyper-parameters on the test data set to determine the parameters of the optimal model.
Specifically, to make the Bi-LSTM-CRF model fit the data better, the embodiment of the present application uses a cross-validation method to determine the optimal hyper-parameters of the Bi-LSTM-CRF model, with the following specific steps:
Step 161: randomly dividing the labeled data into two parts, one used as a training data set and the other as a test data set;
Step 162: training different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
Step 163: evaluating the series of models with different hyper-parameters on the test data set; the parameters of the best-performing model are the optimal hyper-parameters.
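Steps 161-163 amount to a hold-out hyper-parameter search; the sketch below is schematic, with the split ratio, the candidate grid, and the stand-in train/evaluate callbacks all assumptions.

```python
import random

def holdout_search(samples, train_fn, eval_fn, grid, train_ratio=0.8):
    random.shuffle(samples)                       # step 161: random split
    cut = int(len(samples) * train_ratio)
    train_set, test_set = samples[:cut], samples[cut:]
    best = None
    for params in grid:                           # step 162: one model per set
        model = train_fn(train_set, **params)
        score = eval_fn(model, test_set)          # step 163: test-set score
        if best is None or score > best[0]:
            best = (score, params)
    return best

grid = [{"hidden": 128, "lr": 1e-3}, {"hidden": 256, "lr": 1e-3},
        {"hidden": 256, "lr": 1e-4}]
# Stand-in callbacks; a real search would train and score the Bi-LSTM-CRF.
print(holdout_search(list(range(100)),
                     lambda data, **p: p,
                     lambda model, test: random.random(),
                     grid))
```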
Example 2
Based on the same inventive concept as the named entity recognition method based on the Bi-LSTM-CRF model in the foregoing embodiment, the present invention further provides a named entity recognition apparatus based on the Bi-LSTM-CRF model, as shown in fig. 2, the apparatus includes:
the first obtaining unit is used for carrying out data preprocessing on the natural language and separating the input first natural language under the training condition to obtain a first character sequence;
the first comparison unit is used for separating an input second natural language under the prediction condition to obtain a second character sequence, and classifying characters of the second character sequence, which do not exist in the first character sequence, into unregistered characters according to the comparison of the second character sequence and the first character sequence;
a second obtaining unit, configured to perform mapping according to each character in the first character sequence to obtain a vector matrix, where the vector matrix includes a vector with a fixed dimension corresponding to each character;
a third obtaining unit, configured to input the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix;
the first identification unit is used for inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
a first determining unit for determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation.
Further, the apparatus further comprises:
a fourth obtaining unit, configured to perform manual labeling according to the first character sequence to obtain labeled data;
a first input unit for inputting the labeled data into a neural network.
Further, the apparatus further comprises:
the first construction unit is used for constructing a character co-occurrence matrix;
a fifth obtaining unit, configured to perform matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
a sixth obtaining unit, configured to map the character to obtain a feature matrix.
Further, the apparatus further comprises:
a second determination unit for determining a hyper-parameter of the neural network;
a first acceleration unit for accelerating training using a batch normalization method during training;
a first output unit, configured to combine the forward LSTM module and the backward LSTM module into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Further, the apparatus further comprises:
a seventh obtaining unit, configured to obtain a first matrix from the whole-sentence entity recognition of the natural language;
an eighth obtaining unit, configured to obtain a tag sequence according to the first matrix.
Further, the cross-validation comprises:
a first sorting unit to divide the labeled data into a training data set and a test data set;
a ninth obtaining unit, configured to train different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
a third determination unit for evaluating the series of models with different hyper-parameters on the test data set and determining the parameters of the optimal model.
The variations and specific examples of the named entity recognition method based on the Bi-LSTM-CRF model in Example 1 and FIG. 1 also apply to the named entity recognition apparatus of this embodiment. From the foregoing detailed description of the method, those skilled in the art can clearly understand how the apparatus of this embodiment is implemented, so for brevity of description it is not detailed here.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
1. The named entity recognition method and device based on the Bi-LSTM-CRF model perform data preprocessing on natural language: under the training condition, the input first natural language is separated to obtain a first character sequence; under the prediction condition, an input second natural language is separated to obtain a second character sequence, and characters of the second character sequence that do not exist in the first character sequence are classified as unregistered characters by comparing the two sequences; each character in the first character sequence is mapped to obtain a vector matrix comprising a fixed-dimension vector per character; the vector matrix is input into a Bi-LSTM module, whose forward and backward LSTM modules apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively and whose combined output is an emission matrix; the emission matrix is input into a CRF layer to form the Bi-LSTM-CRF model, which performs whole-sentence entity recognition on the natural language; and the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation. This solves the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy; named entities in sentences can be extracted by feeding the raw sentences directly into the model, so the method is highly adaptable, widely applicable, and improves the accuracy of entity recognition.
2. The embodiment of the application determines the hyper-parameters of the neural network; during training, a batch normalization method is used to accelerate training; and the forward LSTM module and the backward LSTM module are combined into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point. Further, because the model is end-to-end, named entities in sentences can be extracted by feeding the raw sentences directly into the model, without processing the natural-language data into any special format or performing feature conversion on the raw data, so the adaptability is stronger.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A named entity identification method based on a Bi-LSTM-CRF model is characterized by comprising the following steps:
performing data preprocessing on a natural language, and separating an input first natural language under the training condition to obtain a first character sequence;
under the condition of prediction, separating an input second natural language to obtain a second character sequence, and classifying characters which do not exist in the first character sequence in the second character sequence as unregistered characters according to comparison of the second character sequence and the first character sequence;
mapping according to each character in the first character sequence to obtain a vector matrix, wherein the vector matrix comprises a vector with fixed dimensionality corresponding to each character;
inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix;
inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation;
the mapping according to each character in the first character sequence to obtain a vector matrix further comprises:
constructing a character co-occurrence matrix;
performing matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
mapping the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set;
wherein the inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language, further comprises:
performing whole-sentence entity recognition on the natural language to obtain a first matrix;
and obtaining a tag sequence according to the first matrix.
2. The method of claim 1, wherein pre-processing data in a natural language, and separating a first natural language input under training conditions to obtain a first character sequence, further comprises:
performing manual labeling according to the first character sequence to obtain labeled data;
inputting the labeled data into a neural network.
3. The method of claim 2, wherein the inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the combining of the output results of the forward LSTM module and the backward LSTM module into an emission matrix, further comprises:
determining a hyper-parameter of the neural network;
in the training process, the batch normalization method is used for accelerating the training;
and combining the forward LSTM module and the backward LSTM module into a Bi-LSTM module, wherein at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
4. The method of claim 2, wherein the cross-validation comprises:
dividing the labeled data into a training data set and a test data set;
training different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
and evaluating the series of models with different hyper-parameters on the test data set to determine the parameters of the optimal model.
5. A named entity recognition apparatus based on a Bi-LSTM-CRF model, the apparatus comprising:
the first obtaining unit is used for carrying out data preprocessing on natural language and separating the input natural language under the training condition to obtain a character sequence;
a second obtaining unit, configured to perform mapping according to each character in the character sequence to obtain a vector matrix, where the vector matrix includes a vector with a fixed dimension corresponding to each character;
a third obtaining unit, configured to input the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix;
the first identification unit is used for inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
a first determining unit, which is used for determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation;
the first construction unit is used for constructing a character co-occurrence matrix;
a fifth obtaining unit, configured to perform matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
a sixth obtaining unit, configured to map the character to obtain a feature matrix;
a seventh obtaining unit, configured to obtain a first matrix from the whole-sentence entity recognition of the natural language;
an eighth obtaining unit, configured to obtain a tag sequence according to the first matrix.
6. A named entity recognition apparatus based on a Bi-LSTM-CRF model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to perform the following steps:
carrying out data preprocessing on a natural language, and separating the input natural language under the training condition to obtain a character sequence;
mapping according to each character in the character sequence to obtain a vector matrix, wherein the vector matrix comprises a vector with fixed dimensionality corresponding to each character;
inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix;
inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation;
the mapping according to each character in the character sequence to obtain a vector matrix further comprises:
constructing a character co-occurrence matrix;
performing matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
mapping the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set;
wherein the inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language, further comprises:
performing whole-sentence entity recognition on the natural language to obtain a first matrix;
and obtaining a tag sequence according to the first matrix.
CN201810369183.2A (priority date 2018-04-23, filing date 2018-04-23): Named entity identification method and device based on Bi-LSTM-CRF model; status Active; granted as CN108920445B (en)

Priority Applications (1)

Application Number: CN201810369183.2A; Priority Date: 2018-04-23; Filing Date: 2018-04-23; Title: Named entity identification method and device based on Bi-LSTM-CRF model (CN108920445B (en))


Publications (2)

Publication Number Publication Date
CN108920445A CN108920445A (en) 2018-11-30
CN108920445B (en) 2022-06-17

Family

ID=64403296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810369183.2A Active CN108920445B (en) 2018-04-23 2018-04-23 Named entity identification method and device based on Bi-LSTM-CRF model

Country Status (1)

Country Link
CN (1) CN108920445B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM
CN107797992A (en) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Name entity recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Bidirectional LSTM-CRF Models for Sequence Tagging";Zhiheng Huang等;《Proceedings of the 21st International Conference on Asian Language Processing》;20151231;第2、3章,图7 *
"基于双向LSTM的维吾尔语事件因果关系抽取";田生伟等;《电子与信息学报》;20170913;第3.4-3.6节,图2 *

Also Published As

Publication number Publication date
CN108920445A (en) 2018-11-30


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant