CN110851597A - Method and device for sentence annotation based on similar entity replacement - Google Patents

Method and device for sentence annotation based on similar entity replacement

Info

Publication number
CN110851597A
Authority
CN
China
Prior art keywords
entity
user
sentence
label sequence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911032391.4A
Other languages
Chinese (zh)
Inventor
胡伟凤
高雪松
陈维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Juhaolian Technology Co Ltd
Original Assignee
Qingdao Juhaolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Juhaolian Technology Co Ltd filed Critical Qingdao Juhaolian Technology Co Ltd
Priority to CN201911032391.4A priority Critical patent/CN110851597A/en
Publication of CN110851597A publication Critical patent/CN110851597A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a sentence annotation method and device based on similar entity replacement. The method comprises: obtaining a sentence input by a user; determining, according to the sentence input by the user and a named entity recognition model, an entity tag sequence corresponding to the sentence; determining, according to the entity tags in the entity tag sequence, whether a similar entity exists for an entity in the entity tag sequence; and if so, generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence. Compared with the existing serial solution of a long short-term memory model and a probability distribution prediction model, the entity tag sequence recognized through the character embedding layer, the first feature learning layer, the second feature learning layer and the probability prediction layer is significantly more accurate; moreover, the replacement of similar entities effectively improves the discovery of new words and expands the data in the model's training set.

Description

Method and device for sentence annotation based on similar entity replacement
Technical Field
Embodiments of the invention relate to the technical field of natural language processing, and in particular to a method and a device for sentence annotation based on similar entity replacement.
Background
Named entity recognition is a basic task in natural language processing and lays the foundation for a series of downstream tasks such as entity linking, relation extraction, semantic search and automatic question answering. The industry widely applies a serial solution that combines a long short-term memory (LSTM) model with a probability distribution prediction model, but training such a model relies on a large amount of manually labeled data; especially in Chinese vertical domains, the industrial effect of the model depends entirely on training with a large amount of domain knowledge. In practical applications, the performance of the system's named entity recognition must consider not only accuracy but also recall, and in vertical domains the ability to find new words that do not appear, or appear only infrequently, in the training set urgently needs to be improved.
Disclosure of Invention
The embodiments of the invention provide a sentence annotation method and device based on similar entity replacement, which are used to improve the discovery of new words and to expand the training-set data.
In a first aspect, an embodiment of the present invention provides a method for sentence annotation based on similar entity replacement, including:
acquiring a sentence input by a user;
determining an entity tag sequence corresponding to the sentence input by the user according to the sentence input by the user and a named entity recognition model; the named entity recognition model comprises a character embedding layer, a first feature learning layer, a second feature learning layer and a probability prediction layer, and is obtained by training on an entity tag sequence training set;
and determining, according to the entity tags in the entity tag sequence, whether a similar entity exists for an entity in the entity tag sequence, and if so, generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence.
In this technical solution, compared with the existing LSTM + CRF model, the entity tag sequence recognized through the character embedding layer, the first feature learning layer, the second feature learning layer and the probability prediction layer is significantly more accurate; in addition, the replacement of similar entities can effectively improve the discovery of new words and expand the data in the model's training set.
Optionally, the determining, according to the sentence input by the user and the named entity recognition model, an entity tag sequence corresponding to the sentence input by the user includes:
converting the sentence input by the user into a first embedded space vector through the character embedding layer;
inputting the first embedded space vector to the first feature learning layer, and extracting a first feature of the sentence input by the user;
inputting the first features of the sentences input by the user to the second feature learning layer, and extracting the second features of the sentences input by the user;
and inputting the second characteristic of the sentence input by the user to the probability prediction layer to obtain an entity tag sequence corresponding to the sentence input by the user.
Optionally, the generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence includes:
replacing an entity in the entity tag sequence with a similar entity that carries the same entity tag, and using the result as the entity tag sequence of the new sentence.
Optionally, after generating the entity tag sequence of the new sentence, the method further includes:
and putting the entity label sequence of the new sentence into the entity label sequence training set, and retraining the named entity recognition model.
In a second aspect, an embodiment of the present invention provides a device for sentence annotation based on similar entity replacement, including:
the acquiring unit is used for acquiring the sentence input by the user;
the processing unit is used for determining an entity tag sequence corresponding to the sentence input by the user according to the sentence input by the user and a named entity recognition model; the named entity recognition model comprises a character embedding layer, a first feature learning layer, a second feature learning layer and a probability prediction layer, and is obtained by training on an entity tag sequence training set; and for determining, according to the entity tags in the entity tag sequence, whether a similar entity exists for an entity in the entity tag sequence, and if so, generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence.
Optionally, the processing unit is specifically configured to:
converting the sentence input by the user into a first embedded space vector through the character embedding layer;
inputting the first embedded space vector to the first feature learning layer, and extracting a first feature of the sentence input by the user;
inputting the first features of the sentences input by the user to the second feature learning layer, and extracting the second features of the sentences input by the user;
and inputting the second characteristic of the sentence input by the user to the probability prediction layer to obtain an entity tag sequence corresponding to the sentence input by the user.
Optionally, the processing unit is specifically configured to:
replacing an entity in the entity tag sequence with a similar entity that carries the same entity tag, and using the result as the entity tag sequence of the new sentence.
Optionally, the processing unit is further configured to:
after generating the entity label sequence of the new sentence, putting the entity label sequence of the new sentence into the entity label sequence training set, and retraining the named entity recognition model.
In a third aspect, an embodiment of the present invention further provides a computing device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the above sentence annotation method based on similar entity replacement according to the obtained program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to perform the above method for sentence annotation based on similar entity replacement.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for sentence annotation based on similar entity replacement according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a word segmentation and named entity recognition annotation provided in an embodiment of the present invention;
FIG. 4 is a diagram illustrating a named entity recognition model according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a new word discovery according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of experimental results provided by an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a device for sentence annotation based on similar entity replacement according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates an exemplary system architecture to which embodiments of the present invention may be applied, which may be a server 100, where the server 100 may include a processor 110, a communication interface 120, and a memory 130.
The communication interface 120 is used to communicate with the smart device, receiving information transmitted by the smart device and transmitting information to it.
The processor 110 is the control center of the server 100: it connects the various parts of the entire server 100 using various interfaces and lines, and performs the various functions of the server 100 and processes data by running or executing the software programs and/or modules stored in the memory 130 and calling the data stored in the memory 130. Optionally, the processor 110 may include one or more processing units.
The memory 130 may be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by running the software programs and modules stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created according to business processing, and the like. Further, the memory 130 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
It should be noted that the structure shown in fig. 1 is only an example, and the embodiment of the present invention is not limited thereto.
Based on the above description, fig. 2 shows in detail a flow of a method for sentence annotation based on similar entity replacement according to an embodiment of the present invention, where the flow may be performed by a device for sentence annotation based on similar entity replacement; the device may be located in the server 100 shown in fig. 1, or may be the server 100 itself.
As shown in fig. 2, the process specifically includes:
step 201, obtaining a sentence input by a user.
In the embodiment of the invention, word segmentation and named entity recognition are basic tasks in natural language processing and lay the foundation for a series of downstream tasks such as entity linking, relation extraction, semantic search and automatic question answering. Both word segmentation and named entity recognition can be solved as sequence labeling problems (for example, with the BIO tag set: B-begin, I-inside, O-outside). The input and output of word segmentation and of named entity recognition can be as shown in fig. 3; as can be seen from fig. 3, for the input sentence "my air conditioner is suddenly not cooling now", the tag sequence output by the word segmentation model differs from the entity tag sequence output by the named entity recognition model.
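As a purely hypothetical illustration of such BIO labeling (the Chinese wording of the example sentence, the entity type name "DEV" and the tags below are assumptions for illustration, not the actual annotation shown in fig. 3), a character-level named entity annotation could be represented as follows:

```python
# Hypothetical BIO annotation of a sentence similar to the example in fig. 3
# ("my air conditioner is suddenly not cooling now"); tags and entity type are illustrative.
chars = list("我家的空调现在突然不制冷了")
ner_tags = ["O", "O", "O", "B-DEV", "I-DEV", "O", "O", "O", "O", "O", "O", "O", "O"]
assert len(chars) == len(ner_tags)  # one BIO tag per character
for ch, tag in zip(chars, ner_tags):
    print(ch, tag)  # e.g. 空 B-DEV / 调 I-DEV mark the "air conditioner" entity
```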
Step 202, determining an entity tag sequence corresponding to the sentence input by the user according to the sentence input by the user and the named entity recognition model.
In the embodiment of the present invention, the named entity recognition model may include a character embedding layer, a first feature learning layer, a second feature learning layer and a probability prediction layer, and is obtained by training on an entity tag sequence training set. In a specific implementation, the first feature learning layer may be a CNN (Convolutional Neural Network) layer, which may also be called a short-distance feature learning layer; the second feature learning layer may be a bidirectional long short-term memory (Bi-LSTM) layer, which may also be called a long-distance feature learning layer; and the probability prediction layer may be a CRF (Conditional Random Field) layer. The structure of the model may be as shown in fig. 4.
Specifically, when the entity tag sequence corresponding to the sentence input by the user is obtained, the sentence input by the user is first converted into a first embedded space vector through the character embedding layer; the first embedded space vector is then input to the first feature learning layer to extract a first feature of the sentence; the first feature is input to the second feature learning layer to extract a second feature of the sentence; and finally the second feature is input to the probability prediction layer to obtain the entity tag sequence corresponding to the sentence input by the user, as sketched in the code below.
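The following is a minimal PyTorch sketch of how such a character-embedding + CNN + Bi-LSTM + CRF tagger could be assembled; the class name, layer sizes and the use of the third-party pytorch-crf package are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package (assumed available)

class CnnBiLstmCrfTagger(nn.Module):
    """Sketch of a character-level NER tagger: embedding -> CNN -> Bi-LSTM -> CRF."""

    def __init__(self, vocab_size, num_tags, emb_dim=128, cnn_filters=128,
                 kernel_size=3, lstm_hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)                 # character embedding layer
        self.cnn = nn.Conv1d(emb_dim, cnn_filters, kernel_size,
                             padding=kernel_size // 2)                     # short-distance (local) features
        self.bilstm = nn.LSTM(cnn_filters, lstm_hidden, batch_first=True,
                              bidirectional=True)                          # long-distance features
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)              # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)                         # probability prediction layer

    def _features(self, char_ids):
        x = self.embedding(char_ids)            # (batch, seq_len, emb_dim)
        x = self.cnn(x.transpose(1, 2)).relu()  # (batch, cnn_filters, seq_len)
        x, _ = self.bilstm(x.transpose(1, 2))   # (batch, seq_len, 2 * lstm_hidden)
        return self.emissions(x)                # (batch, seq_len, num_tags)

    def loss(self, char_ids, tags, mask):
        # Negative log-likelihood of the tag sequences under the CRF.
        return -self.crf(self._features(char_ids), tags, mask=mask, reduction='sum')

    def decode(self, char_ids, mask):
        return self.crf.decode(self._features(char_ids), mask=mask)
```

In this sketch the CNN plays the role of the short-distance feature learning layer, the Bi-LSTM the long-distance feature learning layer, and the CRF the probability prediction layer described above.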
Based on the named entity recognition model shown in fig. 4, when named entity recognition is performed, the input Chinese character sequence (sentence) can be converted into embedded space vectors at the character embedding layer:

$x = [x_1, x_2, \dots, x_N], \qquad x_i = E\, w_i \in \mathbb{R}^{D}$

where $w_i \in \{0,1\}^{V}$ is the one-hot vector representation of each character, $E \in \mathbb{R}^{D \times V}$ is the character embedding matrix, $V$ is the size of the dictionary space, $N$ is the input sequence length, and $D$ is the embedding dimension.
Then, the CNN feature extraction layer extracts local features of the text as the input to the Bi-LSTM layer.

Specifically, the CNN output is $c = [c_1, c_2, \dots, c_N]$ with $c_i \in \mathbb{R}^{M}$, where

$c_i = [c_{i,1}, \dots, c_{i,M}], \qquad c_{i,m} = f\left(w_m \cdot x_{i-\lfloor K/2 \rfloor \,:\, i+\lfloor K/2 \rfloor}\right)$

where $x_{i-\lfloor K/2 \rfloor \,:\, i+\lfloor K/2 \rfloor}$ denotes the window of character embeddings from $x_{i-\lfloor K/2 \rfloor}$ to $x_{i+\lfloor K/2 \rfloor}$, $f$ is the ReLU activation function, $M$ is the number of filters, $w_m \in \mathbb{R}^{KD}$ is a filter of the CNN, and $K$ is the window size; that is, the context information $c_i$ of each character is the concatenation of the values of all window filters at the current position.
The Bi-LSTM layer can be used to extract medium- and long-distance context information on both sides of the text; finally the CRF layer performs decoding, taking the features extracted by the Bi-LSTM layer as input and computing the tag of each element in the sequence. That is, for a given input $h = [h_1, h_2, \dots, h_N]$, it computes the output tag sequence $y = [y_1, y_2, \dots, y_N]$, where $y_i \in \mathbb{R}^{L}$ is the one-hot tag of the $i$-th character and $L$ is the size of the tag space. In the probabilistic model (CRF), for a given input $h$, the conditional probability of the output sequence $y$ is

$p(y \mid h; \theta) = \dfrac{\prod_{i=1}^{N} \psi_i(y_{i-1}, y_i, h)}{\sum_{y' \in Y(s)} \prod_{i=1}^{N} \psi_i(y'_{i-1}, y'_i, h)}$

where $Y(s)$ is the set of all possible tag sequences for the input sequence $s$, the potential function is $\psi_i(y', y, h) = \exp\left((h_i^{\top} W)_{y} + T_{y', y}\right)$, and $W \in \mathbb{R}^{2S \times L}$ and $T \in \mathbb{R}^{L \times L}$ are the parameters denoted by $\theta = \{W, T\}$ ($2S$ being the dimension of the Bi-LSTM hidden state).
In the CRF layer, the loss function may be:

$L_{NER} = -\sum_{s \in S} \log p(y_s \mid h_s; \theta)$

where $S$ is the set of training sentences, and $h_s$ and $y_s$ are respectively the hidden representation and the tag sequence of sentence $s$.
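Continuing the hypothetical PyTorch sketch above, a single training step that minimizes $L_{NER}$ could look as follows (the vocabulary size, tag count and optimizer settings are placeholders):

```python
model = CnnBiLstmCrfTagger(vocab_size=5000, num_tags=7)    # placeholder sizes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(char_ids, tags, mask):
    # char_ids, tags: (batch, seq_len) LongTensors; mask: (batch, seq_len) BoolTensor
    optimizer.zero_grad()
    loss = model.loss(char_ids, tags, mask)   # -sum_s log p(y_s | h_s; theta)
    loss.backward()
    optimizer.step()
    return loss.item()
```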
Under the condition that only a small amount of labeled data exist in a training set, the named entity recognition model provided by the embodiment of the invention can improve the accuracy and the recall rate of the named entity recognition method.
Step 203, determining, according to the entity tags in the entity tag sequence, whether a similar entity exists for an entity in the entity tag sequence, and if so, generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence.
Specifically, an entity in the entity tag sequence may be replaced with a similar entity that carries the same entity tag, and the result is used as the entity tag sequence of a new sentence.
With this automatic labeled-data construction method based on similar entity replacement, more pseudo-labeled samples are constructed from a small amount of existing labeled data, which remarkably improves the model's generalization to new words that do not appear, or appear only infrequently, in the training set: if an entity name in a sentence is replaced by another entity of the same type, the new sentence is still syntactically and semantically correct.
Therefore, given the entity recognition tag sequence of a known sentence, the entity tag sequence of a new sentence can be generated, as shown in detail in fig. 5.
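A minimal sketch of this replacement step is given below; the helper name, the entity dictionary and the BIO tag scheme are assumptions for illustration, not the exact procedure of fig. 5:

```python
import random

def replace_same_type_entity(chars, tags, entity_dict, rng=random):
    """Replace one entity span with another entity of the same type (hypothetical sketch).

    chars: list of characters; tags: BIO tags aligned with chars;
    entity_dict: mapping from entity type (e.g. "DEV") to candidate entity strings.
    Returns a new (chars, tags) pair, or the original pair if no replaceable entity is found.
    """
    # Collect entity spans as (start, end_exclusive, type).
    spans = []
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, start = tags[i][2:], i
            i += 1
            while i < len(tags) and tags[i] == "I-" + etype:
                i += 1
            spans.append((start, i, etype))
        else:
            i += 1
    candidates = [s for s in spans if entity_dict.get(s[2])]
    if not candidates:
        return chars, tags
    start, end, etype = rng.choice(candidates)
    new_entity = rng.choice(entity_dict[etype])
    new_chars = chars[:start] + list(new_entity) + chars[end:]
    new_tags = (tags[:start]
                + ["B-" + etype] + ["I-" + etype] * (len(new_entity) - 1)
                + tags[end:])
    return new_chars, new_tags
```

Called with the hypothetical (chars, ner_tags) pair from the earlier example and an entity dictionary such as {"DEV": ["冰箱", "洗衣机"]}, it returns a new sentence and tag sequence in which the air conditioner entity has been swapped for another device of the same type.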
In addition, after the entity tag sequence of the new sentence is generated, it can be put into the entity tag sequence training set and the named entity recognition model can be retrained, thereby improving the accuracy and recall of named entity recognition.
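A sketch of this augmentation-and-retraining loop, reusing the hypothetical replace_same_type_entity helper above (the dataset format and the model_trainer callable are assumptions, not part of the embodiment):

```python
def augment_and_retrain(train_set, entity_dict, model_trainer, copies_per_sentence=1):
    """train_set: list of (chars, tags) pairs; model_trainer: callable that retrains the NER model."""
    pseudo_labeled = []
    for chars, tags in train_set:
        for _ in range(copies_per_sentence):
            new_chars, new_tags = replace_same_type_entity(chars, tags, entity_dict)
            if new_chars != chars:                 # keep only genuinely new sentences
                pseudo_labeled.append((new_chars, new_tags))
    # Put the entity tag sequences of the new sentences into the training set and retrain.
    return model_trainer(train_set + pseudo_labeled)
```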
In experiments, automatically labeled data and manually labeled data were combined in a 50% + 50% ratio to generate the model's training set. The experimental results shown in fig. 6 indicate that the model trained with the added automatically labeled data clearly improves in accuracy, recall and new-word discovery capability; the advantage in model expressiveness is especially pronounced when the training data is scarce.
The above embodiment shows that: a sentence input by a user is obtained; an entity tag sequence corresponding to the sentence is determined according to the sentence input by the user and a named entity recognition model, where the named entity recognition model comprises a character embedding layer, a first feature learning layer, a second feature learning layer and a probability prediction layer and is obtained by training on an entity tag sequence training set; whether a similar entity exists for an entity in the entity tag sequence is determined according to the entity tags in the entity tag sequence; and if so, an entity tag sequence of a new sentence is generated according to the similar entity and the entity tag sequence. Compared with the existing serial solution of a long short-term memory model and a probability distribution prediction model, the entity tag sequence recognized through the character embedding layer, the first feature learning layer, the second feature learning layer and the probability prediction layer is significantly more accurate; moreover, the replacement of similar entities effectively improves the discovery of new words and expands the data in the model's training set.
Based on the same technical concept, fig. 7 exemplarily shows a structure of an apparatus for sentence annotation based on similar entity replacement, which can perform the flow of sentence annotation based on similar entity replacement; the apparatus may be located in the server 100 shown in fig. 1, or may be the server 100 itself.
As shown in fig. 7, the apparatus specifically includes:
an obtaining unit 701, configured to obtain a sentence input by a user;
a processing unit 702, configured to determine an entity tag sequence corresponding to the sentence input by the user according to the sentence input by the user and a named entity recognition model, where the named entity recognition model comprises a character embedding layer, a first feature learning layer, a second feature learning layer and a probability prediction layer, and is obtained by training on an entity tag sequence training set; and to determine, according to the entity tags in the entity tag sequence, whether a similar entity exists for an entity in the entity tag sequence, and if so, to generate an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence.
Optionally, the processing unit 702 is specifically configured to:
converting the sentence input by the user into a first embedded space vector through the character embedding layer;
inputting the first embedded space vector to the first feature learning layer, and extracting a first feature of the sentence input by the user;
inputting the first features of the sentences input by the user to the second feature learning layer, and extracting the second features of the sentences input by the user;
and inputting the second characteristic of the sentence input by the user to the probability prediction layer to obtain an entity tag sequence corresponding to the sentence input by the user.
Optionally, the processing unit 702 is specifically configured to:
replacing an entity in the entity tag sequence with a similar entity that carries the same entity tag, and using the result as the entity tag sequence of the new sentence.
Optionally, the processing unit 702 is further configured to:
after generating the entity label sequence of the new sentence, putting the entity label sequence of the new sentence into the entity label sequence training set, and retraining the named entity recognition model.
Based on the same technical concept, an embodiment of the present invention further provides a computing device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the above sentence annotation method based on similar entity replacement according to the obtained program.
Based on the same technical concept, an embodiment of the present invention further provides a computer-readable non-volatile storage medium, which includes computer-readable instructions, and when the computer reads and executes the computer-readable instructions, the computer is caused to execute the above sentence annotation method based on similar entity replacement.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method for sentence annotation based on similar entity replacement, characterized by comprising the following steps:
acquiring a sentence input by a user;
determining an entity tag sequence corresponding to the sentence input by the user according to the sentence input by the user and a named entity recognition model; the named entity recognition model comprises a character embedding layer, a first feature learning layer, a second feature learning layer and a probability prediction layer, and is obtained by training on an entity tag sequence training set;
and determining, according to the entity tags in the entity tag sequence, whether a similar entity exists for an entity in the entity tag sequence, and if so, generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence.
2. The method of claim 1, wherein determining the entity tag sequence corresponding to the sentence input by the user according to the sentence input by the user and the named entity recognition model comprises:
converting the sentence input by the user into a first embedded space vector through the character embedding layer;
inputting the first embedded space vector to the first feature learning layer, and extracting a first feature of the sentence input by the user;
inputting the first features of the sentences input by the user to the second feature learning layer, and extracting the second features of the sentences input by the user;
and inputting the second characteristic of the sentence input by the user to the probability prediction layer to obtain an entity tag sequence corresponding to the sentence input by the user.
3. The method of claim 1, wherein said generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence comprises:
replacing an entity in the entity tag sequence with a similar entity that carries the same entity tag, and using the result as the entity tag sequence of the new sentence.
4. The method of any of claims 1 to 3, further comprising, after generating the entity tag sequence of the new sentence:
and putting the entity label sequence of the new sentence into the entity label sequence training set, and retraining the named entity recognition model.
5. An apparatus for sentence annotation based on similar entity replacement, comprising:
the acquiring unit is used for acquiring the sentence input by the user;
the processing unit is used for determining an entity tag sequence corresponding to the sentence input by the user according to the sentence input by the user and a named entity recognition model; the named entity recognition model comprises a character embedding layer, a first feature learning layer, a second feature learning layer and a probability prediction layer, and is obtained by training on an entity tag sequence training set; and for determining, according to the entity tags in the entity tag sequence, whether a similar entity exists for an entity in the entity tag sequence, and if so, generating an entity tag sequence of a new sentence according to the similar entity and the entity tag sequence.
6. The apparatus as claimed in claim 5, wherein said processing unit is specifically configured to:
converting the sentence input by the user into a first embedded space vector through the character embedding layer;
inputting the first embedded space vector to the first feature learning layer, and extracting a first feature of the sentence input by the user;
inputting the first features of the sentences input by the user to the second feature learning layer, and extracting the second features of the sentences input by the user;
and inputting the second characteristic of the sentence input by the user to the probability prediction layer to obtain an entity tag sequence corresponding to the sentence input by the user.
7. The apparatus as claimed in claim 5, wherein said processing unit is specifically configured to:
replacing an entity in the entity tag sequence with a similar entity that carries the same entity tag, and using the result as the entity tag sequence of the new sentence.
8. The apparatus of any of claims 5 to 7, wherein the processing unit is further to:
after generating the entity label sequence of the new sentence, putting the entity label sequence of the new sentence into the entity label sequence training set, and retraining the named entity recognition model.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 4 in accordance with the obtained program.
10. A computer-readable non-transitory storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 4.
CN201911032391.4A 2019-10-28 2019-10-28 Method and device for sentence annotation based on similar entity replacement Pending CN110851597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911032391.4A CN110851597A (en) 2019-10-28 2019-10-28 Method and device for sentence annotation based on similar entity replacement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911032391.4A CN110851597A (en) 2019-10-28 2019-10-28 Method and device for sentence annotation based on similar entity replacement

Publications (1)

Publication Number Publication Date
CN110851597A true CN110851597A (en) 2020-02-28

Family

ID=69598506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911032391.4A Pending CN110851597A (en) 2019-10-28 2019-10-28 Method and device for sentence annotation based on similar entity replacement

Country Status (1)

Country Link
CN (1) CN110851597A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550227A (en) * 2015-12-07 2016-05-04 中国建设银行股份有限公司 Named entity identification method and device
US20180365211A1 (en) * 2015-12-11 2018-12-20 Beijing Gridsum Technology Co., Ltd. Method and Device for Recognizing Domain Named Entity
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
KR20190103951A (en) * 2019-02-14 2019-09-05 주식회사 머니브레인 Method, computer device and computer readable recording medium for building or updating knowledgebase models for interactive ai agent systen, by labeling identifiable but not-learnable data in training data set
CN109992773A (en) * 2019-03-20 2019-07-09 华南理工大学 Term vector training method, system, equipment and medium based on multi-task learning
CN110263338A (en) * 2019-06-18 2019-09-20 北京明略软件系统有限公司 Replace entity name method, apparatus, storage medium and electronic device
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766485A (en) * 2020-12-31 2021-05-07 平安科技(深圳)有限公司 Training method, device, equipment and medium for named entity model
CN112766485B (en) * 2020-12-31 2023-10-24 平安科技(深圳)有限公司 Named entity model training method, device, equipment and medium
CN114610852A (en) * 2022-05-10 2022-06-10 天津大学 Course learning-based fine-grained Chinese syntax analysis method and device

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN109034203B (en) Method, device, equipment and medium for training expression recommendation model and recommending expression
CN111858843B (en) Text classification method and device
CN111985229A (en) Sequence labeling method and device and computer equipment
CN111046656A (en) Text processing method and device, electronic equipment and readable storage medium
CN114596566B (en) Text recognition method and related device
CN111291566A (en) Event subject identification method and device and storage medium
CN110597966A (en) Automatic question answering method and device
CN108205524B (en) Text data processing method and device
CN114757176A (en) Method for obtaining target intention recognition model and intention recognition method
CN112613306A (en) Method, device, electronic equipment and storage medium for extracting entity relationship
CN109783801B (en) Electronic device, multi-label classification method and storage medium
CN113221555A (en) Keyword identification method, device and equipment based on multitask model
CN111858898A (en) Text processing method and device based on artificial intelligence and electronic equipment
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN111742322A (en) System and method for domain and language independent definition extraction using deep neural networks
CN111563380A (en) Named entity identification method and device
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN112667803A (en) Text emotion classification method and device
CN115129862A (en) Statement entity processing method and device, computer equipment and storage medium
CN110852103A (en) Named entity identification method and device
CN112188311B (en) Method and apparatus for determining video material of news
CN112307749A (en) Text error detection method and device, computer equipment and storage medium
CN110888983A (en) Positive and negative emotion analysis method, terminal device and storage medium
CN109558580B (en) Text analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200228

RJ01 Rejection of invention patent application after publication