CN108920445B - Named entity identification method and device based on Bi-LSTM-CRF model - Google Patents

Named entity identification method and device based on Bi-LSTM-CRF model

Info

Publication number
CN108920445B
CN108920445B
Authority
CN
China
Prior art keywords
lstm
matrix
character
vector
module
Prior art date
Legal status
Active
Application number
CN201810369183.2A
Other languages
Chinese (zh)
Other versions
CN108920445A (en)
Inventor
莫益军
姚澜
杨帆
Current Assignee
Huazhong University of Science and Technology
Ezhou Institute of Industrial Technology Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Ezhou Institute of Industrial Technology Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology and Ezhou Institute of Industrial Technology, Huazhong University of Science and Technology
Priority to CN201810369183.2A
Publication of CN108920445A
Application granted
Publication of CN108920445B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Abstract

The invention provides a named entity recognition method and device based on a Bi-LSTM-CRF model. Data preprocessing is performed on natural language: under the training condition, an input first natural language is separated to obtain a first character sequence; each character in the first character sequence is mapped to obtain a vector matrix, and the vector matrix is input into a Bi-LSTM module; the emission matrix of the Bi-LSTM module is input into a CRF layer to form the Bi-LSTM-CRF model, and the Bi-LSTM-CRF model performs whole-sentence entity recognition on the natural language; and the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation. This solves the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy. Named entities can be extracted from a sentence by feeding the raw sentence directly into the model, so the method is highly adaptable, widely applicable, and improves the accuracy of entity recognition.

Description

Named entity identification method and device based on Bi-LSTM-CRF model
Technical Field
The invention relates to the technical field of information processing, in particular to a named entity identification method and device based on a Bi-LSTM-CRF model.
Background
Named entity recognition is the most basic and most widely used task among natural language processing applications. It identifies entities with specific meanings in text, mainly including person names, place names, organization names, proper nouns, and the like. Named entity recognition is an important basic tool for downstream applications such as information extraction, question answering systems, syntactic analysis, machine translation, and metadata annotation for the semantic web. With this tool, natural-language models can be built that understand, analyze, and answer natural language much as a human would. Existing named entity recognition schemes fall into two basic categories: statistics-based methods and neural-network-based methods. Statistics-based methods mainly comprise the HMM model and the CRF model; neural-network-based methods mainly comprise convolutional neural networks and LSTM neural networks.
In the prior art, statistics-based methods cannot take long-distance context information into account, while neural-network-based methods produce outputs that are mutually independent; when the output labels have strong dependency relationships, a neural-network-based scheme cannot model those dependencies, so recognition accuracy is limited.
Disclosure of Invention
The embodiment of the invention provides a named entity identification method and device based on a Bi-LSTM-CRF model, and solves the technical problem that models in the prior art cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy.
In view of the above problems, the present application provides a method and apparatus for identifying a named entity based on a Bi-LSTM-CRF model.
In a first aspect, the invention provides a named entity identification method based on a Bi-LSTM-CRF model, which comprises the following steps:
performing data preprocessing on natural language, and, under the training condition, separating an input first natural language to obtain a first character sequence; under the prediction condition, separating an input second natural language to obtain a second character sequence, and, by comparing the second character sequence with the first character sequence, classifying characters of the second character sequence that do not exist in the first character sequence as unregistered characters; mapping each character in the first character sequence to obtain a vector matrix, wherein the vector matrix comprises a fixed-dimension vector corresponding to each character; inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix; inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language; wherein the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation.
Preferably, the performing data preprocessing on the natural language and, under the training condition, separating the input first natural language to obtain the first character sequence further comprises: performing manual labeling according to the first character sequence to obtain labeled data; and inputting the labeled data into a neural network.
Preferably, the mapping according to each character in the character sequence to obtain a vector matrix further comprises: constructing a character co-occurrence matrix; performing matrix decomposition on the co-occurrence matrix by a gradient descent method to obtain character vectors; and mapping the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set.
Preferably, the inputting the vector matrix into the Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the combining of the output results of the forward LSTM module and the backward LSTM module into an emission matrix, further comprises: determining the hyper-parameters of the neural network; during training, accelerating the training with a batch normalization method; and combining the forward LSTM module and the backward LSTM module into a Bi-LSTM module, wherein at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Preferably, the inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language, further comprises: performing whole-sentence entity recognition on the natural language to obtain a first matrix; and obtaining a tag sequence according to the first matrix.
Preferably, the cross-validation comprises: dividing the labeled data into a training data set and a test data set; training different parameter sets on the training data set to obtain a series of models with different hyper-parameters; and evaluating the series of models with different hyper-parameters on the test data set to determine the parameters of the optimal model.
In a second aspect, the present invention provides a named entity recognition apparatus based on a Bi-LSTM-CRF model, the apparatus comprising:
the first obtaining unit is used for carrying out data preprocessing on the natural language and separating the input first natural language under the training condition to obtain a first character sequence;
the first comparison unit is used for separating an input second natural language under the prediction condition to obtain a second character sequence, and classifying characters of the second character sequence, which do not exist in the first character sequence, into unregistered characters according to the comparison of the second character sequence and the first character sequence;
a second obtaining unit, configured to map each character in the first character sequence to obtain a vector matrix, where the vector matrix includes a vector with a fixed dimension corresponding to each character;
a third obtaining unit, configured to input the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix;
the first identification unit is used for inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
a first determination unit for determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation.
Preferably, the apparatus further comprises:
a fourth obtaining unit, configured to perform manual labeling according to the first character sequence to obtain labeled data;
a first input unit for inputting the labeled data into a neural network.
Preferably, the apparatus further comprises:
the first construction unit is used for constructing a character co-occurrence matrix;
a fifth obtaining unit, configured to perform matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
a sixth obtaining unit, configured to map the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set.
Preferably, the apparatus further comprises:
a second determination unit for determining a hyper-parameter of the neural network;
a first acceleration unit for accelerating training using a batch normalization method during training;
a first output unit, configured to combine the forward LSTM module and the backward LSTM module into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Preferably, the apparatus further comprises:
a seventh obtaining unit, configured to obtain a first matrix according to entity identification of a whole sentence of the natural language;
an eighth obtaining unit, configured to obtain a tag sequence according to the first matrix.
Preferably, the cross-validation comprises:
a first sorting unit to divide the labeled data into a training data set and a test data set;
a ninth obtaining unit, configured to train different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
a third determination unit for evaluating the series of models with different hyper-parameters on the test data set and determining the parameters of the optimal model.
In a third aspect, the present invention provides a named entity recognition apparatus based on a Bi-LSTM-CRF model, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the program to implement the following steps: performing data preprocessing on natural language, and, under the training condition, separating an input first natural language to obtain a first character sequence; under the prediction condition, separating an input second natural language to obtain a second character sequence, and, by comparing the second character sequence with the first character sequence, classifying characters of the second character sequence that do not exist in the first character sequence as unregistered characters; mapping each character in the first character sequence to obtain a vector matrix, wherein the vector matrix comprises a fixed-dimension vector corresponding to each character; inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix; inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language; wherein the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
1. The named entity recognition method and device based on the Bi-LSTM-CRF model perform data preprocessing on natural language: under the training condition, the input first natural language is separated to obtain a first character sequence; under the prediction condition, an input second natural language is separated to obtain a second character sequence, and characters of the second character sequence that do not exist in the first character sequence are classified as unregistered characters by comparing the two sequences; each character in the first character sequence is mapped to obtain a vector matrix comprising a fixed-dimension vector per character; the vector matrix is input into a Bi-LSTM module, whose forward and backward LSTM modules apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively and whose combined output is an emission matrix; the emission matrix is input into a CRF layer to form the Bi-LSTM-CRF model, which performs whole-sentence entity recognition on the natural language; and the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation. This solves the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy; named entities in sentences can be extracted by feeding the raw sentences directly into the model, so the method is highly adaptable, widely applicable, and improves the accuracy of entity recognition.
2. The embodiment of the application determines the hyper-parameters of the neural network; during training, a batch normalization method is used to accelerate training; and the forward LSTM module and the backward LSTM module are combined into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point. Further, because the model is end-to-end, named entities in sentences can be extracted by feeding the raw sentences directly into the model, without processing the natural-language data into any special format or performing feature conversion on the raw data, so the adaptability is higher.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly, and that the above and other objects, features, and advantages of the present invention may become more readily apparent, embodiments of the invention are described below.
Drawings
FIG. 1 is a schematic flow chart of a named entity recognition method based on a Bi-LSTM-CRF model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a named entity recognition apparatus based on a Bi-LSTM-CRF model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the Bi-LSTM network model provided in the embodiment of the present invention;
FIG. 4 is a structural diagram of a Bi-LSTM-CRF model provided in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a named entity identification method and device based on a Bi-LSTM-CRF model. The general idea of the technical solution is as follows: performing data preprocessing on natural language, and, under the training condition, separating an input first natural language to obtain a first character sequence; under the prediction condition, separating an input second natural language to obtain a second character sequence, and classifying characters of the second character sequence that do not exist in the first character sequence as unregistered characters by comparing the two sequences; mapping each character in the first character sequence to obtain a vector matrix comprising a fixed-dimension vector per character; inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining their output results into an emission matrix; inputting the emission matrix into a CRF layer to form a Bi-LSTM-CRF model that performs whole-sentence entity recognition on the natural language; and determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation. The solution addresses the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy, and achieves the technical effects that named entities in sentences can be extracted by feeding raw sentences directly into the model, with strong adaptability, a wide range of application, and improved accuracy of entity recognition.
The technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present application serve to explain, not to limit, the technical solutions of the present application, and that the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
Example 1
FIG. 1 is a schematic flow chart of a named entity recognition method based on a Bi-LSTM-CRF model according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
step 110: and performing data preprocessing on the natural language, and separating the input first natural language under the training condition to obtain a first character sequence.
Step 120: under the prediction condition, separating the input second natural language to obtain a second character sequence, and classifying characters of the second character sequence, which do not exist in the first character sequence, as unregistered characters according to the comparison of the second character sequence and the first character sequence.
Further, manual labeling is performed according to the first character sequence to obtain labeled data, and the labeled data is input into a neural network.
Specifically, under the training condition, an input natural language is split to obtain a first character sequence C, whose elements are the single characters and punctuation marks of the original input. Manual labeling is then performed on the first character sequence to obtain labeled data, which is input into a neural network. For example, the character sequence is labeled with a letter sequence serving as the tag sequence, where each element of the tag sequence is the entity category of the corresponding character: B denotes the start character of an entity, E denotes the end character, N denotes any other category, L denotes a location, O denotes an organization, T denotes a time, and P denotes a person name. The entity types to be recognized can be increased or decreased as needed; only the corresponding tags need to be changed when labeling the data. Under the prediction condition, an input second natural language is separated to obtain a second character sequence, and, by comparing the second character sequence with the first character sequence, characters of the second character sequence that do not exist in the first character sequence are classified as unregistered characters <UNK>.
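As a concrete illustration, this preprocessing can be sketched in a few lines of Python. The sketch below is illustrative only: the toy training sentence, the whitespace-stripping split rule, and the handling of the <UNK> token are assumptions consistent with the description above, not the patented implementation.

```python
def to_char_sequence(sentence):
    """Split a raw sentence into single characters and punctuation marks."""
    return [ch for ch in sentence if not ch.isspace()]

# Character vocabulary collected from the (toy) training sentences.
train_sentences = ["华中科技大学位于武汉。"]
vocab = {ch for s in train_sentences for ch in to_char_sequence(s)}

def replace_unregistered(chars, vocab):
    """At prediction time, characters unseen during training become <UNK>."""
    return [ch if ch in vocab else "<UNK>" for ch in chars]

# "北", "京", and "国" were never seen in training, so they map to <UNK>.
print(replace_unregistered(to_char_sequence("北京位于中国。"), vocab))
```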
Step 130: mapping according to each character in the character sequence to obtain a vector matrix, wherein the vector matrix comprises a vector with fixed dimensionality corresponding to each character;
Further, a character co-occurrence matrix is constructed; matrix decomposition is performed on the co-occurrence matrix by a gradient descent method to obtain character vectors; and the characters are mapped to obtain a feature matrix, with unregistered characters uniformly mapped to the character vector of the unregistered-character set.
Specifically, each character in the first character sequence C is mapped so that a character is expressed as a vector of fixed dimension; in this embodiment the dimension is fixed at 50. For a natural-language sentence, a feature matrix can thus be obtained in which each row is the character vector of one character and the number of rows equals the number of characters. The vector corresponding to a character is obtained from a pre-trained model. To this end, this step can be divided into three sub-steps:
step 131: the corpus data is counted to obtain a co-occurrence matrix at a character level, each element in the co-occurrence matrix is a logarithm value of the frequency of any two characters appearing in the same sentence, and the co-occurrence matrix obtained at the moment has the characteristics of discrete word vectors, high dimension and sparseness and cannot be used as the word vectors.
Step 132: the co-occurrence matrix is reduced in dimension by matrix decomposition using a gradient descent method, and the distributed representation of each character is determined by minimizing a loss function defined as follows:
J = \sum_{i,j} \left( v_i^{\top} u_j - X_{ij} \right)^2
In the above formula, X_ij is the logarithm of the number of times the characters w_i and w_j appear together in the same sentence, v_i is the distributed representation of w_i as a center character, and u_j is the distributed representation of w_j as a context character. The loss function is minimized by stochastic gradient descent, and the converged values are the character vectors Mc.
Step 133: each element of the character sequence is mapped by table lookup. The final result is the feature matrix M_s corresponding to the original natural language. Unregistered characters are uniformly mapped to the vector represented by <UNK>; this removes the ambiguity of such characters and reduces the effect of noise.
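The three sub-steps can be sketched in numpy as follows. This is a hedged sketch, not the patented code: the toy corpus, the add-one smoothing before the logarithm, the learning rate, the iteration count, and the choice of the center-character vectors as the final embedding are all assumptions, and full-batch gradient descent stands in for the stochastic variant described above.

```python
import numpy as np

# Step 131: character-level co-occurrence counts over a toy corpus (assumed).
corpus = ["我爱北京", "我在北京", "北京在中国"]
chars = sorted({c for s in corpus for c in s})
idx = {c: i for i, c in enumerate(chars)}
V, d = len(chars), 50                  # vocabulary size, vector dimension
counts = np.ones((V, V))               # add-one smoothing before the log
for s in corpus:
    for a in s:
        for b in s:
            counts[idx[a], idx[b]] += 1
X = np.log(counts)                     # log co-occurrence matrix

# Step 132: factorize X ~ v @ u.T by gradient descent on J = sum(err ** 2).
rng = np.random.default_rng(0)
v = rng.normal(scale=0.1, size=(V, d))  # center-character vectors v_i
u = rng.normal(scale=0.1, size=(V, d))  # context-character vectors u_j
lr = 1e-3
for _ in range(500):
    err = v @ u.T - X                  # reconstruction residual
    # dJ/dv and dJ/du up to the constant factor 2, absorbed into lr.
    v, u = v - lr * (err @ u), u - lr * (err.T @ v)
Mc = v                                 # final character vectors (assumed: v)

# Step 133: table lookup maps a sentence to its feature matrix M_s.
M_s = np.stack([Mc[idx[c]] for c in "我爱北京"])
print(M_s.shape)                       # (4, 50): one 50-d row per character
```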
Step 140: inputting the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix.
Further, the hyper-parameters of the neural network are determined; during training, a batch normalization method is used to accelerate training; and the forward LSTM module and the backward LSTM module are combined into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Specifically, after the vector matrix of the original natural language is input into the Bi-LSTM network, the forward LSTM module performs a nonlinear transformation on the vector sequence from front to back and the backward LSTM module from back to front; the results output by the two modules are combined, and after a softmax operation on the combined result, an abstract representation of the language that takes both forward and backward information into account is obtained. The step mainly comprises the following sub-steps:
step 141: and determining hyper-parameters of the neural network, namely the specific parameters of the neural network mainly comprise 50-dimensional input vector dimension, 256-dimensional hidden layer vector dimension, 50-dimensional output dimension and random initialization of hidden layer vector initialization mode.
Step 142: during training, a batch normalization technique is used to accelerate training; that is, each input to the Bi-LSTM network is a batch of training samples rather than a single sample. Concretely, the feature matrix M_s representing the batch of input sentences is normalized to zero mean and unit variance, as follows:
\hat{M}_s(i,j) = \frac{M_s(i,j) - \mathrm{E}[M_s(:,j)]}{\sqrt{\mathrm{Var}[M_s(:,j)]}}
in the above formula, M _ s (i, j) represents the element in the ith row and jth column of the feature matrix, E [ ] operation represents expectation, and Var operation represents variance. The data obtained by the normalization operation all satisfy 0 mean, unit variance and weak correlation.
Step 143: to take both past and future information in a sentence into account, the forward LSTM module and the backward LSTM module are combined into the Bi-LSTM module. For the forward LSTM module, at each time point one row of the feature matrix M_s of the natural language is input; the output at each time point is a vector and a hidden vector, where the output vector holds the emission probabilities of the single character over the entity classes, and the hidden vector is input into the LSTM module at the next time point, where it jointly generates the output of the next time point together with the row of the feature matrix input at that time point. The backward LSTM module performs the same operation in the reverse direction. After the forward and backward LSTM modules produce their output vectors, the two output vectors are input into a connection layer and subjected to a nonlinear transformation, yielding the emission probabilities. The resulting character-to-entity-class emission matrix Me will serve as an evaluation score and is one of the considerations for evaluating a candidate tag path.
FIG. 3 is a schematic diagram of the operation of the Bi-LSTM module: each circle in the input feature matrix represents a character vector, and the forward LSTM module and the backward LSTM module read the vector sequence front to back and back to front, respectively.
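For orientation, steps 141-143 correspond to a few lines of PyTorch. The sketch below is an assumed implementation, not the patented code: the 50-dimensional inputs and 256-dimensional hidden layer follow the embodiment, while the tag-set size of 9 and the tanh nonlinearity on the connection layer are illustrative choices.

```python
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    def __init__(self, emb_dim=50, hidden=256, num_tags=9):
        super().__init__()
        # bidirectional=True runs a front-to-back and a back-to-front LSTM
        # and concatenates their hidden states at every time step.
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # connection layer

    def forward(self, feats):              # feats: (batch, seq_len, emb_dim)
        out, _ = self.bilstm(feats)        # (batch, seq_len, 2 * hidden)
        return self.proj(torch.tanh(out))  # emission matrix Me per sentence

emitter = BiLSTMEmitter()
Me = emitter(torch.randn(4, 20, 50))       # 4 sentences of 20 characters
print(Me.shape)                            # torch.Size([4, 20, 9])
```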
To counter overfitting and similar problems that may occur during training, the following methods can be adopted (a schematic sketch of the first follows the list):
1) Early stopping: a maximum patience (number of consecutive rounds) is set; at the end of each round the latest model parameters are tested on the test set to obtain the corresponding accuracy, and the best accuracy is recorded. When the accuracy of the latest parameters has remained below the best accuracy for more consecutive rounds than the maximum, training stops and the best parameters are saved.
2) Dropout: during training, the output values of some neurons are discarded with a certain probability p.
3) L2 regularization: in the embodiment of the invention, the regularization term is the L2 norm of the model parameter vector. Minimizing the objective function with this regularization term yields a simpler model with stronger generalization ability.
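Remedy 1) amounts to a generic training loop; in the sketch below the train_round, evaluate, and snapshot callbacks are hypothetical stand-ins for one training round, test-set evaluation, and parameter saving.

```python
import random

def fit_with_early_stopping(train_round, evaluate, snapshot, max_patience=5):
    """Stop once test accuracy fails to beat the best for max_patience
    consecutive rounds; return the best parameters seen."""
    best_acc, patience, best_state = -1.0, 0, None
    while patience < max_patience:
        train_round()
        acc = evaluate()
        if acc > best_acc:
            best_acc, patience, best_state = acc, 0, snapshot()
        else:
            patience += 1
    return best_state, best_acc

# Toy usage with stand-in callbacks (a real run would train the Bi-LSTM-CRF).
state, acc = fit_with_early_stopping(lambda: None,
                                     lambda: random.random(),
                                     lambda: "model-parameters")
```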
Step 150: inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
Further, a first matrix is obtained from the whole-sentence entity recognition of the natural language, and a tag sequence is obtained from the first matrix.
Specifically, referring to FIG. 4, which is a schematic structural diagram of the entire model, the emission matrix Me output by the Bi-LSTM module is input into the CRF layer, and the Bi-LSTM-CRF model performs sentence-level entity recognition on the natural language. Its output is the first matrix, whose elements are the scores for transitions from one entity class to another. From the first matrix a tag sequence is obtained, i.e. the entity category sequence y = (y1, y2, …), and the Bi-LSTM-CRF model scores a candidate tag sequence y for a sentence as:
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
in the above equation, Pi, yi are the elements of the emission matrix output by the Bi-LSTM layer, while Ayi-byi are the elements of the transfer matrix output by the CRF layer. It can be seen that the score of the whole sequence is the sum of the scores of the modules, and the score of each position is obtained by two modules, one part is the output emission probability matrix of the Bi-LSTM model, and the other part is the output transition probability matrix of the CRF layer. Obtaining the probability that y corresponds to the natural language after normalization processing is carried out through the formula:
p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}
When the Bi-LSTM-CRF model is trained, the parameters are adjusted by maximizing the log-likelihood:
\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}
the highest scoring sequence is solved during the prediction (decoding) by using the dynamically planned algorithm viterbi.
y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})
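The scoring and decoding formulas translate directly into numpy. In the sketch below, P is the emission matrix from the Bi-LSTM, A is the transition matrix learned by the CRF layer, and the start/stop transitions of the full model are omitted for brevity (an assumed simplification).

```python
import numpy as np

def sequence_score(P, A, y):
    """s(X, y): transition scores A[y_i, y_{i+1}] plus emissions P[i, y_i]."""
    y = np.asarray(y)
    return P[np.arange(len(y)), y].sum() + A[y[:-1], y[1:]].sum()

def viterbi(P, A):
    """Dynamic programming for y* = argmax over all tag sequences."""
    n, k = P.shape
    score = P[0].copy()                   # best score ending in each tag
    back = np.zeros((n, k), dtype=int)    # backpointers
    for i in range(1, n):
        cand = score[:, None] + A + P[i]  # every (previous, current) pair
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):         # follow backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(score.max())

rng = np.random.default_rng(1)
P, A = rng.normal(size=(6, 5)), rng.normal(size=(5, 5))
path, best = viterbi(P, A)
assert np.isclose(best, sequence_score(P, A, path))  # decode is consistent
```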
Step 160: determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation.
Further, the cross-validation comprises: dividing the labeled data into a training data set and a test data set; training different parameter sets on the training data set to obtain a series of models with different hyper-parameters; and evaluating the series of models with different hyper-parameters on the test data set to determine the parameters of the optimal model.
Specifically, to make the Bi-LSTM-CRF model fit the data better, the embodiment of the present application uses a cross-validation method to determine the optimal hyper-parameters of the Bi-LSTM-CRF model, with the following specific steps:
Step 161: randomly dividing the labeled data into two parts, one used as a training data set and the other as a test data set;
Step 162: training different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
Step 163: evaluating the series of models with different hyper-parameters on the test data set; the parameters of the best-performing model are the optimal hyper-parameters.
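Steps 161-163 amount to a hold-out hyper-parameter search; the sketch below is schematic, with the split ratio, the candidate grid, and the stand-in train/evaluate callbacks all assumptions.

```python
import random

def holdout_search(samples, train_fn, eval_fn, grid, train_ratio=0.8):
    random.shuffle(samples)                       # step 161: random split
    cut = int(len(samples) * train_ratio)
    train_set, test_set = samples[:cut], samples[cut:]
    best = None
    for params in grid:                           # step 162: one model per set
        model = train_fn(train_set, **params)
        score = eval_fn(model, test_set)          # step 163: test-set score
        if best is None or score > best[0]:
            best = (score, params)
    return best

grid = [{"hidden": 128, "lr": 1e-3}, {"hidden": 256, "lr": 1e-3},
        {"hidden": 256, "lr": 1e-4}]
# Stand-in callbacks; a real search would train and score the Bi-LSTM-CRF.
print(holdout_search(list(range(100)),
                     lambda data, **p: p,
                     lambda model, test: random.random(),
                     grid))
```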
Example 2
Based on the same inventive concept as the named entity recognition method based on the Bi-LSTM-CRF model in the foregoing embodiment, the present invention further provides a named entity recognition apparatus based on the Bi-LSTM-CRF model, as shown in fig. 2, the apparatus includes:
the first obtaining unit is used for carrying out data preprocessing on the natural language and separating the input first natural language under the training condition to obtain a first character sequence;
the first comparison unit is used for separating an input second natural language under the prediction condition to obtain a second character sequence, and classifying characters of the second character sequence, which do not exist in the first character sequence, into unregistered characters according to the comparison of the second character sequence and the first character sequence;
a second obtaining unit, configured to perform mapping according to each character in the first character sequence to obtain a vector matrix, where the vector matrix includes a vector with a fixed dimension corresponding to each character;
a third obtaining unit, configured to input the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix;
the first identification unit is used for inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
a first determining unit for determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation.
Further, the apparatus further comprises:
a fourth obtaining unit, configured to perform manual labeling according to the first character sequence to obtain labeled data;
a first input unit for inputting the labeled data into a neural network.
Further, the apparatus further comprises:
the first construction unit is used for constructing a character co-occurrence matrix;
a fifth obtaining unit, configured to perform matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
a sixth obtaining unit, configured to map the character to obtain a feature matrix.
Further, the apparatus further comprises:
a second determination unit for determining a hyper-parameter of the neural network;
a first acceleration unit for accelerating training using a batch normalization method during training;
a first output unit, configured to combine the forward LSTM module and the backward LSTM module into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
Further, the apparatus further comprises:
a seventh obtaining unit, configured to obtain a first matrix from the whole-sentence entity recognition of the natural language;
an eighth obtaining unit, configured to obtain a tag sequence according to the first matrix.
Further, the cross-validation comprises:
a first sorting unit to divide the labeled data into a training data set and a test data set;
a ninth obtaining unit, configured to train different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
a third determination unit for evaluating the series of models with different hyper-parameters on the test data set and determining the parameters of the optimal model.
The variations and specific examples of the named entity recognition method based on the Bi-LSTM-CRF model in Example 1 and FIG. 1 also apply to the named entity recognition apparatus of this embodiment. From the foregoing detailed description of the method, those skilled in the art can clearly understand how the apparatus of this embodiment is implemented, so for brevity of description it is not detailed here.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
1. The named entity recognition method and device based on the Bi-LSTM-CRF model perform data preprocessing on natural language: under the training condition, the input first natural language is separated to obtain a first character sequence; under the prediction condition, an input second natural language is separated to obtain a second character sequence, and characters of the second character sequence that do not exist in the first character sequence are classified as unregistered characters by comparing the two sequences; each character in the first character sequence is mapped to obtain a vector matrix comprising a fixed-dimension vector per character; the vector matrix is input into a Bi-LSTM module, whose forward and backward LSTM modules apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively and whose combined output is an emission matrix; the emission matrix is input into a CRF layer to form the Bi-LSTM-CRF model, which performs whole-sentence entity recognition on the natural language; and the hyper-parameters of the Bi-LSTM-CRF model are determined by cross-validation. This solves the technical problem that prior-art models cannot take long-distance context information into account and cannot model the dependency relationships between output labels, which limits recognition accuracy; named entities in sentences can be extracted by feeding the raw sentences directly into the model, so the method is highly adaptable, widely applicable, and improves the accuracy of entity recognition.
2. The embodiment of the application determines the hyper-parameters of the neural network; during training, a batch normalization method is used to accelerate training; and the forward LSTM module and the backward LSTM module are combined into a Bi-LSTM module, where at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point. Further, because the model is end-to-end, named entities in sentences can be extracted by feeding the raw sentences directly into the model, without processing the natural-language data into any special format or performing feature conversion on the raw data, so the adaptability is stronger.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A named entity identification method based on a Bi-LSTM-CRF model is characterized by comprising the following steps:
performing data preprocessing on a natural language, and separating an input first natural language under the training condition to obtain a first character sequence;
under the condition of prediction, separating an input second natural language to obtain a second character sequence, and classifying characters which do not exist in the first character sequence in the second character sequence as unregistered characters according to comparison of the second character sequence and the first character sequence;
mapping according to each character in the first character sequence to obtain a vector matrix, wherein the vector matrix comprises a vector with fixed dimensionality corresponding to each character;
inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix;
inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation;
the mapping according to each character in the first character sequence to obtain a vector matrix further comprises:
constructing a character co-occurrence matrix;
performing matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
mapping the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set;
wherein the inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language, further comprises:
performing whole-sentence entity recognition on the natural language to obtain a first matrix;
and obtaining a tag sequence according to the first matrix.
2. The method of claim 1, wherein pre-processing data in a natural language, and separating a first natural language input under training conditions to obtain a first character sequence, further comprises:
performing manual labeling according to the first character sequence to obtain labeled data;
inputting the labeled data into a neural network.
3. The method of claim 2, wherein the inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the combining of the output results of the forward LSTM module and the backward LSTM module into an emission matrix, further comprises:
determining a hyper-parameter of the neural network;
in the training process, the batch normalization method is used for accelerating the training;
and combining the forward LSTM module and the backward LSTM module into a Bi-LSTM module, wherein at each time point of the forward LSTM module one row of the feature matrix of the natural language is input, the output at each time point is a vector and a hidden vector, the hidden vector is input into the LSTM module at the next time point, and, together with the row of the feature matrix input at the next time point, it jointly generates the output of the next time point.
4. The method of claim 2, wherein the cross-validation comprises:
dividing the labeled data into a training data set and a test data set;
training different parameter sets on the training data set to obtain a series of models with different hyper-parameters;
and evaluating the series of models with different hyper-parameters on the test data set to determine the parameters of the optimal model.
5. A named entity recognition apparatus based on a Bi-LSTM-CRF model, the apparatus comprising:
the first obtaining unit is used for carrying out data preprocessing on natural language and separating the input natural language under the training condition to obtain a character sequence;
a second obtaining unit, configured to perform mapping according to each character in the character sequence to obtain a vector matrix, where the vector matrix includes a vector with a fixed dimension corresponding to each character;
a third obtaining unit, configured to input the vector matrix into a Bi-LSTM module, where the forward LSTM module and the backward LSTM module in the Bi-LSTM module apply front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and the output results of the forward LSTM module and the backward LSTM module are combined into an emission matrix;
the first identification unit is used for inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
a first determining unit, which is used for determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation;
the first construction unit is used for constructing a character co-occurrence matrix;
a fifth obtaining unit, configured to perform matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
a sixth obtaining unit, configured to map the character to obtain a feature matrix;
a seventh obtaining unit, configured to obtain a first matrix from the whole-sentence entity recognition of the natural language;
an eighth obtaining unit, configured to obtain a tag sequence according to the first matrix.
6. A named entity recognition apparatus based on a Bi-LSTM-CRF model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to perform the following steps:
carrying out data preprocessing on a natural language, and separating the input natural language under the training condition to obtain a character sequence;
mapping according to each character in the character sequence to obtain a vector matrix, wherein the vector matrix comprises a vector with fixed dimensionality corresponding to each character;
inputting the vector matrix into a Bi-LSTM module, the forward LSTM module and the backward LSTM module in the Bi-LSTM module applying front-to-back and back-to-front nonlinear transformations to the vector sequence respectively, and combining the output results of the forward LSTM module and the backward LSTM module, wherein the combined output result is an emission matrix;
inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language;
determining the hyper-parameters of the Bi-LSTM-CRF model by cross-validation;
the mapping according to each character in the character sequence to obtain a vector matrix further comprises:
constructing a character co-occurrence matrix;
performing matrix decomposition on the co-occurrence matrix according to a gradient descent method to obtain a character vector;
mapping the characters to obtain a feature matrix, wherein unregistered characters are uniformly mapped to the character vector of the unregistered-character set;
wherein the inputting the emission matrix of the Bi-LSTM module into a CRF layer to form a Bi-LSTM-CRF model, the Bi-LSTM-CRF model performing whole-sentence entity recognition on the natural language, further comprises:
performing whole-sentence entity recognition on the natural language to obtain a first matrix;
and obtaining a tag sequence according to the first matrix.
CN201810369183.2A (priority date 2018-04-23, filing date 2018-04-23): Named entity identification method and device based on Bi-LSTM-CRF model; status Active; granted as CN108920445B (en)

Priority Applications (1)

Application Number: CN201810369183.2A; Priority Date: 2018-04-23; Filing Date: 2018-04-23; Title: Named entity identification method and device based on Bi-LSTM-CRF model (CN108920445B (en))


Publications (2)

Publication Number Publication Date
CN108920445A CN108920445A (en) 2018-11-30
CN108920445B (en) 2022-06-17

Family

ID=64403296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810369183.2A Active CN108920445B (en) 2018-04-23 2018-04-23 Named entity identification method and device based on Bi-LSTM-CRF model

Country Status (1)

Country Link
CN (1) CN108920445B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM
CN107797992A (en) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Name entity recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Bidirectional LSTM-CRF Models for Sequence Tagging";Zhiheng Huang等;《Proceedings of the 21st International Conference on Asian Language Processing》;20151231;第2、3章,图7 *
"基于双向LSTM的维吾尔语事件因果关系抽取";田生伟等;《电子与信息学报》;20170913;第3.4-3.6节,图2 *

Also Published As

Publication number Publication date
CN108920445A (en) 2018-11-30


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant