CN116956924A - Named entity recognition method and system based on contrast learning - Google Patents

Named entity recognition method and system based on contrast learning

Info

Publication number
CN116956924A
CN116956924A (application CN202310929718.8A)
Authority
CN
China
Prior art keywords
entity
text
model
entity type
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310929718.8A
Other languages
Chinese (zh)
Inventor
冯落落
李志芸
李晓瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202310929718.8A
Publication of CN116956924A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a named entity recognition method and system based on contrastive learning, belongs to the technical field of data processing, and aims to solve the technical problem of how to overcome the influence of a large number of negative sample labels on named entity recognition accuracy. The method comprises the following steps: constructing an entity type encoding model and a text encoding model based on a pre-trained Bert model; defining a description of each entity type and inputting it into the trained entity type encoding model to obtain the entity type embedding; defining a description of the text and inputting it into the trained text encoding model to obtain the span embeddings of the text; and mapping the entity type embeddings and the span embeddings through a contrastive learning algorithm, so that an entity type and its matching entity blocks are mapped into similar regions of the space and dissimilar entity blocks are mapped into different regions.

Description

Named entity recognition method and system based on contrast learning
Technical Field
The invention relates to the technical field of data processing, in particular to a named entity recognition method and system based on contrastive learning.
Background
Named entity recognition (NER) is one of the common NLP technologies at present. It solves the problem of classifying a text block into one of a predefined set of classes, such as persons, places and organizations, and is commonly used in information extraction tasks, recommendation systems, knowledge graphs and other fields. Named entity recognition also contributes greatly to downstream tasks such as relation extraction and coreference resolution.
At present, NER algorithms are basically sequence labeling algorithms, i.e. classification algorithms; the differences between them are differences of labeling schemes, such as BIO, BIOES, multi-head pointer networks and span-based algorithms. The BIO scheme, for example, is shown in fig. 1.
Here B denotes the beginning of an entity, I denotes the inside, and O denotes the outside (i.e. not an entity). For example, with the three entity types person name, place name and organization, B-PER denotes the beginning of a person name, I-PER the inside of a person name, B-LOC the beginning of a place name, I-LOC the inside of a place name, B-ORG the beginning of an organization name, I-ORG the inside of an organization name, and O everything else. As shown in the figure, for the sentence 'Zhou Jielun is a singer' (周杰伦是歌手), the character 周 is labeled B-PER, 杰 and 伦 are labeled I-PER, and the remaining characters 是歌手 ('is a singer') are labeled O. The problem with this approach is that much of the data is marked as O, which introduces a lot of negative sample noise. A concrete labeling of this example is sketched below.
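As a concrete illustration (a minimal sketch in Python, not part of the patent), the BIO labeling of the example sentence can be written out as follows:

```python
# BIO labeling of the example sentence "周杰伦是歌手" ("Zhou Jielun is a
# singer"). Half of the characters fall outside any entity and receive
# the negative O label, illustrating the noise problem described above.
sentence = ["周", "杰", "伦", "是", "歌", "手"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "O"]

for char, tag in zip(sentence, tags):
    print(f"{char}\t{tag}")
```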
Current named entity recognition methods are thus basically classification methods based on sequence labeling, differing only in the labeling scheme (BIO, BIOES and the like), and they share the problem that a large number of negative O-label samples degrades the final accuracy. How to overcome the influence of a large number of negative sample labels on named entity recognition accuracy is the technical problem to be solved.
Disclosure of Invention
The technical task of the invention is to provide a named entity recognition method and system based on contrastive learning, so as to solve the technical problem of how to overcome the influence of a large number of negative sample labels on named entity recognition accuracy.
In a first aspect, the invention provides a named entity recognition method based on contrastive learning, which comprises the following steps:
constructing an entity type encoding model and a text encoding model based on a pre-trained Bert model, wherein the Bert model is a natural language understanding model composed of Transformer encoders;
for each entity type, defining a description of the entity, and inputting the description of the entity into the trained entity type encoding model to obtain the entity type embedding;
for a text, defining a description of the text, and inputting the description of the text into the trained text encoding model to obtain the span embeddings of the text;
and mapping the entity type embeddings and the span embeddings through a contrastive learning algorithm, so that an entity type and its matching entity blocks are mapped into similar regions of the space and dissimilar entity blocks are mapped into different regions.
Preferably, the entity type encoding model is based on the Bert model and is used for executing the following:
encoding the description of the entity with the Bert model, obtaining the output of the [CLS] token, and feeding this output into a linear layer to obtain the entity type embedding.
Preferably, the text encoding model is based on the Bert model and is used for executing the following:
encoding the description of the text with the Bert model;
obtaining the output of each token, and summing the token outputs according to the labels to obtain a span representation;
inputting the span representation into a linear layer to obtain the embedding of each span token of the text.
Preferably, when the entity type embeddings and the text span embeddings are mapped into the same space through the contrastive learning algorithm, a contrastive loss is constructed based on the InfoNCE loss function; in its standard form, the contrastive loss is calculated as:

$$\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right)}{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right) + \sum_{s' \in \hat{N}} \exp\left(\operatorname{sim}(t, s')/\tau\right)}$$

where $s_{i,j}$ denotes the embedding of the span token, $t$ denotes the entity type embedding, $\tau$ is a temperature coefficient, $\operatorname{sim}$ denotes the cosine similarity of two vectors, and $\hat{N}$ denotes the negative span set, i.e. the text blocks that are not of the target entity type.
Preferably, the constructed entity type encoding model and text encoding model are trained based on the AdamW optimizer to obtain a trained entity type encoding model and a trained text encoding model.
In a second aspect, the invention provides a named entity recognition system based on contrastive learning, for recognizing named entities by a named entity recognition method based on contrastive learning as described in any one of the first aspect, the system comprising:
the model construction module is used for constructing an entity type encoding model and a text encoding model based on a pre-trained Bert model and performing model training on them, wherein the Bert model is a natural language understanding model composed of Transformer encoders;
the entity type encoding module is used for defining, for each entity type, a description of the entity and inputting the description into the trained entity type encoding model to obtain the entity type embedding;
the text encoding module is used for defining a description of the text and inputting the description into the trained text encoding model to obtain the span embeddings of the text;
and the contrastive learning module is used for mapping, through the contrastive learning algorithm, an entity type and its matching entity blocks into similar regions of the space and dissimilar entity blocks into different regions.
Preferably, the entity type encoding model is based on the Bert model and is used for executing the following:
encoding the description of the entity with the Bert model, obtaining the output of the [CLS] token, and feeding this output into a linear layer to obtain the entity type embedding.
Preferably, the text encoding model is based on the Bert model and is used for executing the following:
encoding the description of the text with the Bert model;
obtaining the output of each token, and summing the token outputs according to the labels to obtain a span representation;
inputting the span representation into a linear layer to obtain the embedding of each span token of the text.
Preferably, when the entity type embeddings and the text span embeddings are mapped into the same space through the contrastive learning algorithm, the contrastive learning module is used for constructing a contrastive loss based on the InfoNCE loss function; in its standard form, the contrastive loss is calculated as:

$$\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right)}{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right) + \sum_{s' \in \hat{N}} \exp\left(\operatorname{sim}(t, s')/\tau\right)}$$

where $s_{i,j}$ denotes the embedding of the span token, $t$ denotes the entity type embedding, $\tau$ is a temperature coefficient, $\operatorname{sim}$ denotes the cosine similarity of two vectors, and $\hat{N}$ denotes the negative span set, i.e. the text blocks that are not of the target entity type.
Preferably, for the constructed entity type encoding model and text encoding model, the model construction module is used for performing training based on the AdamW optimizer to obtain a trained entity type encoding model and a trained text encoding model.
The named entity recognition method and system based on contrastive learning of the invention have the following advantages: unlike traditional methods based on sequence labeling, the method weakens the influence of the large amount of negative sample noise in sequence labeling, improves the accuracy of named entity recognition, and thereby improves the accuracy of downstream tasks such as information extraction on a multi-modal cognitive interaction platform.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from these drawings without inventive effort.
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a block diagram of a prior art BIO algorithm;
FIG. 2 is a flow chart of the named entity recognition method based on contrastive learning of embodiment 1;
FIG. 3 is a functional block diagram of the encoding of entity types and text in the named entity recognition method based on contrastive learning of embodiment 1.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and implement it; the embodiments are not meant to limit the invention, and the technical features of the embodiments can be combined with each other as long as there is no conflict.
The embodiments of the invention provide a named entity recognition method and system based on contrastive learning, which are used to solve the technical problem of how to overcome the influence of a large number of negative sample labels on named entity recognition accuracy.
Embodiment 1:
the invention discloses a named entity identification method based on contrast learning, which comprises the following steps:
s100, constructing an entity type coding model and a text coding model based on a pretrained Bert model, wherein the Bert model is a natural language understanding model formed by an Encoder based on a transducer;
s200, defining the description of the entity for each type of entity, and inputting the description of the entity into a trained entity type coding model to obtain the entity type embeding;
for a text, defining a description of the text, and inputting the description of the text into the trained text encoding model to obtain the span embeddings of the text;
s300, mapping the entity type and the similar entity block to the similar space, and mapping the dissimilar entity block to different similar spaces by contrast learning algorithm.
Bert is a natural language understanding model composed of Transformer encoders and pre-trained on a large Chinese corpus. In this embodiment, the entity type encoding model and the text encoding model both use Bert as the basic encoding network, with a fully connected layer appended after the encoder for fine-tuning adaptation to the downstream task.
Wherein the entity type encoding model is based on the Bert model and is used for executing the following: encoding the description of the entity with the Bert model, obtaining the output of the [CLS] token, and feeding this output into a linear layer to obtain the entity type embedding.
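A minimal sketch of such an entity type encoder, assuming PyTorch and the HuggingFace transformers library (the class name, embedding dimension and example description are illustrative assumptions, not taken from the patent):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class EntityTypeEncoder(nn.Module):
    """Encodes an entity type description into a single embedding vector."""

    def __init__(self, bert_name="bert-base-chinese", embed_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Fully connected layer appended after Bert, as described above.
        self.linear = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # output of the [CLS] token
        return self.linear(cls)            # entity type embedding

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
type_encoder = EntityTypeEncoder()
# Description of the person-name entity type ("extract the person names").
desc = tokenizer("提取句子中的人名", return_tensors="pt")
type_emb = type_encoder(desc["input_ids"], desc["attention_mask"])  # (1, 256)
```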
The text encoding model is based on the Bert model and is used for executing the following: encoding the description of the text with the Bert model; obtaining the output of each token, and summing the token outputs according to the labels to obtain a span representation; inputting the span representation into a linear layer to obtain the embedding of each span token of the text.
Step S200 uses the two trained models to encode the entity types and the text respectively, obtaining the entity type embeddings and the text span embeddings.
For the entity type embedding, a detailed description is defined for each entity type; the description is encoded by Bert, the output of the [CLS] token is obtained and fed into a linear layer, and the final embedding representation of the entity type is obtained.
For the text span embeddings, a description of the text is defined and encoded with Bert; the Bert outputs are then summed according to the labels to obtain a span representation, and the span representation is input into a linear layer to obtain the embedding representation of each span token of the text.
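A matching sketch of the text encoder under the same assumptions (the explicit span indices and the sum pooling follow the description above; the interface is an illustrative assumption):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TextSpanEncoder(nn.Module):
    """Encodes a text and pools the per-token Bert outputs of each
    candidate span into one span embedding."""

    def __init__(self, bert_name="bert-base-chinese", embed_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.linear = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask, spans):
        # spans: list of (start, end) token indices, end exclusive.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Sum the Bert outputs of the tokens inside each span.
        span_vecs = [hidden[0, s:e].sum(dim=0) for s, e in spans]
        return self.linear(torch.stack(span_vecs))  # one embedding per span
```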
The workflow is shown in fig. 3. For example, to extract person names, a description such as "extract the person names inside" is input; the description is passed through the Bert model, and the embedding output at the [CLS] token is taken as the embedded representation of the person-name entity type. The embeddings of the tokens of the input text are represented by circles of other colors in the figure.
Then, step S300 applies the contrastive learning algorithm, that is, it maximizes the similarity between an entity type embedding and the embeddings of the spans of that type in the sentence, and minimizes the similarity with the spans that are not of that type.
Step S300 maps the entity type embeddings and the text span embeddings into the same space through the contrastive learning algorithm and constructs a contrastive loss based on the InfoNCE loss function; in its standard form, the contrastive loss is calculated as:

$$\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right)}{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right) + \sum_{s' \in \hat{N}} \exp\left(\operatorname{sim}(t, s')/\tau\right)}$$

where $s_{i,j}$ denotes the embedding of the span token, $t$ denotes the entity type embedding, $\tau$ is a temperature coefficient, $\operatorname{sim}$ denotes the cosine similarity of two vectors, and $\hat{N}$ denotes the negative span set, i.e. the text blocks that are not of the target entity type.
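A hedged PyTorch sketch of this loss, directly transcribing the formula above (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(type_emb, pos_span, neg_spans, temperature=0.07):
    """InfoNCE contrastive loss: pull the positive span toward the entity
    type embedding, push the negative span set away.

    type_emb:  (d,)   entity type embedding t
    pos_span:  (d,)   embedding s_{i,j} of a span of that entity type
    neg_spans: (n, d) embeddings of the negative span set N^
    """
    pos = F.cosine_similarity(type_emb, pos_span, dim=0) / temperature
    neg = F.cosine_similarity(type_emb.unsqueeze(0), neg_spans, dim=1) / temperature
    logits = torch.cat([pos.unsqueeze(0), neg])
    # -log of the softmax probability assigned to the positive pair.
    return -F.log_softmax(logits, dim=0)[0]
```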
In this embodiment, the AdamW optimizer is used to perform model training on the constructed entity type encoding model and text encoding model, so as to obtain the trained entity type encoding model and text encoding model.
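A minimal sketch of one AdamW training step, reusing the encoder and loss sketches above (the batch construction, span indices and learning rate are illustrative assumptions):

```python
import torch

type_encoder = EntityTypeEncoder()
span_encoder = TextSpanEncoder()
optimizer = torch.optim.AdamW(
    list(type_encoder.parameters()) + list(span_encoder.parameters()),
    lr=2e-5)

text = tokenizer("周杰伦是歌手", return_tensors="pt")
desc = tokenizer("提取句子中的人名", return_tensors="pt")

optimizer.zero_grad()
t = type_encoder(desc["input_ids"], desc["attention_mask"])[0]
# Token indices 1-3 cover 周杰伦, 4-7 cover 是歌手 (index 0 is [CLS]).
spans = span_encoder(text["input_ids"], text["attention_mask"],
                     spans=[(1, 4), (4, 7)])
loss = info_nce_loss(t, spans[0], spans[1:])  # 周杰伦 positive, 是歌手 negative
loss.backward()
optimizer.step()
```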
The method of this embodiment uses Transformer encoders to encode the text and the entity types separately and maps the resulting embeddings into the same vector space, so that the same text representation can be applied to different entity types; the contrastive learning method maps the vectors into a unified space. Contrastive learning forces an entity type and its matching entity blocks to map into similar regions of the space, while dissimilar entity blocks map into different regions. Classification-based methods, on the other hand, introduce noise: for example, O is commonly used to represent the other types, which introduces negative noise during training. The method of this embodiment does not explicitly introduce the non-entity O label, and only introduces a learnable dynamic threshold during contrastive learning.
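At inference time, the description above implies that a candidate span is assigned an entity type when its similarity to that type's embedding clears the threshold; a hedged sketch of this decision rule (the fixed threshold stands in for the learnable one and is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def predict_types(span_embs, type_embs, type_names, threshold=0.5):
    """Assign each span the most similar entity type, or None when no
    similarity clears the (here fixed, in the patent learnable) threshold."""
    # Cosine similarity between every span and every type: (n_spans, n_types).
    sims = F.cosine_similarity(span_embs.unsqueeze(1),
                               type_embs.unsqueeze(0), dim=2)
    best_sim, best_idx = sims.max(dim=1)
    return [type_names[i] if s >= threshold else None
            for s, i in zip(best_sim.tolist(), best_idx.tolist())]
```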
Embodiment 2:
the invention discloses a named entity recognition system based on contrast learning, which comprises a model construction module, an entity type coding module, a text coding module and a contrast learning module, wherein the system can execute the method disclosed in the embodiment 1 to recognize the named entity.
The model construction module is used for constructing an entity type encoding model and a text encoding model based on a pre-trained Bert model and performing model training on them, wherein the Bert model is a natural language understanding model composed of Transformer encoders.
Bert is a natural language understanding model composed of Transformer encoders and pre-trained on a large Chinese corpus. In this embodiment, the entity type encoding model and the text encoding model both use Bert as the basic encoding network, with a fully connected layer appended after the encoder for fine-tuning adaptation to the downstream task.
Wherein the entity type encoding model is based on the Bert model and is used for executing the following: encoding the description of the entity with the Bert model, obtaining the output of the [CLS] token, and feeding this output into a linear layer to obtain the entity type embedding.
The text encoding model is based on the Bert model and is used for executing the following: encoding the description of the text with the Bert model; obtaining the output of each token, and summing the token outputs according to the labels to obtain a span representation; inputting the span representation into a linear layer to obtain the embedding of each span token of the text.
For each entity type, the entity type encoding module is used for defining a description of the entity and inputting the description into the trained entity type encoding model to obtain the entity type embedding.
For the text, the text encoding module is used for defining a description of the text and inputting the description into the trained text encoding model to obtain the text span embeddings.
For the encoding, the following example is given: to extract person names, a description such as "extract the person names inside" is input; the description is passed through the Bert model, and the embedding output at the [CLS] token is taken as the embedded representation of the person-name entity type. The embeddings of the tokens of the input text are represented by circles of other colors in the figure.
The contrastive learning module is used for mapping, through the contrastive learning algorithm, the entity type embeddings and the text span embeddings so that an entity type and its matching entity blocks are mapped into similar regions of the space and dissimilar entity blocks are mapped into different regions.
When the entity type embeddings and the text span embeddings are mapped into the same space through the contrastive learning algorithm, the contrastive learning module is used for constructing a contrastive loss based on the InfoNCE loss function; in its standard form, the contrastive loss is calculated as:

$$\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right)}{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right) + \sum_{s' \in \hat{N}} \exp\left(\operatorname{sim}(t, s')/\tau\right)}$$

where $s_{i,j}$ denotes the embedding of the span token, $t$ denotes the entity type embedding, $\tau$ is a temperature coefficient, $\operatorname{sim}$ denotes the cosine similarity of two vectors, and $\hat{N}$ denotes the negative span set, i.e. the text blocks that are not of the target entity type.
In this embodiment, the model construction module is configured to perform model training on the constructed entity type encoding model and text encoding model with the AdamW optimizer, so as to obtain the trained entity type encoding model and text encoding model.
While the invention has been illustrated and described in detail in the drawings and the foregoing preferred embodiments, the invention is not limited to the disclosed embodiments; those skilled in the art can derive many further embodiments by combining the means of the various embodiments described above, and such embodiments still fall within the scope of the invention.

Claims (10)

1. A named entity recognition method based on contrastive learning, characterized by comprising the following steps:
constructing an entity type encoding model and a text encoding model based on a pre-trained Bert model, wherein the Bert model is a natural language understanding model composed of Transformer encoders;
for each entity type, defining a description of the entity, and inputting the description of the entity into the trained entity type encoding model to obtain the entity type embedding;
for a text, defining a description of the text, and inputting the description of the text into the trained text encoding model to obtain the span embeddings of the text;
and mapping the entity type embeddings and the span embeddings through a contrastive learning algorithm, so that an entity type and its matching entity blocks are mapped into similar regions of the space and dissimilar entity blocks are mapped into different regions.
2. The named entity recognition method based on contrastive learning according to claim 1, wherein the entity type encoding model is based on the Bert model and is used for executing the following:
encoding the description of the entity with the Bert model, obtaining the output of the [CLS] token, and feeding this output into a linear layer to obtain the entity type embedding.
3. The named entity recognition method based on contrastive learning according to claim 1, wherein the text encoding model is based on the Bert model and is used for executing the following:
encoding the description of the text with the Bert model;
obtaining the output of each token, and summing the token outputs according to the labels to obtain a span representation;
inputting the span representation into a linear layer to obtain the embedding of each span token of the text.
4. The named entity recognition method based on contrastive learning according to claim 1, wherein when the entity type embeddings and the text span embeddings are mapped into the same space by the contrastive learning algorithm, a contrastive loss is constructed based on the InfoNCE loss function; in its standard form, the contrastive loss is calculated as:

$$\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right)}{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right) + \sum_{s' \in \hat{N}} \exp\left(\operatorname{sim}(t, s')/\tau\right)}$$

where $s_{i,j}$ denotes the embedding of the span token, $t$ denotes the entity type embedding, $\tau$ is a temperature coefficient, $\operatorname{sim}$ denotes the cosine similarity of two vectors, and $\hat{N}$ denotes the negative span set, i.e. the text blocks that are not of the target entity type.
5. The named entity recognition method based on contrastive learning according to claim 1, wherein the constructed entity type encoding model and text encoding model are trained based on the AdamW optimizer to obtain a trained entity type encoding model and a trained text encoding model.
6. A named entity recognition system based on contrastive learning, for recognizing named entities by a named entity recognition method based on contrastive learning as claimed in any one of claims 1-5, the system comprising:
a model construction module, used for constructing an entity type encoding model and a text encoding model based on a pre-trained Bert model and performing model training on them, wherein the Bert model is a natural language understanding model composed of Transformer encoders;
an entity type encoding module, used for defining, for each entity type, a description of the entity and inputting the description into the trained entity type encoding model to obtain the entity type embedding;
a text encoding module, used for defining a description of the text and inputting the description into the trained text encoding model to obtain the span embeddings of the text;
and a contrastive learning module, used for mapping, through the contrastive learning algorithm, an entity type and its matching entity blocks into similar regions of the space and dissimilar entity blocks into different regions.
7. The named entity recognition system based on contrastive learning according to claim 6, wherein the entity type encoding model is based on the Bert model and is used for executing the following:
encoding the description of the entity with the Bert model, obtaining the output of the [CLS] token, and feeding this output into a linear layer to obtain the entity type embedding.
8. The named entity recognition system based on contrastive learning according to claim 6, wherein the text encoding model is based on the Bert model and is used for executing the following:
encoding the description of the text with the Bert model;
obtaining the output of each token, and summing the token outputs according to the labels to obtain a span representation;
inputting the span representation into a linear layer to obtain the embedding of each span token of the text.
9. The named entity recognition system based on contrastive learning according to claim 6, wherein when the entity type embeddings and the text span embeddings are mapped into the same space by the contrastive learning algorithm, the contrastive learning module is configured to construct a contrastive loss based on the InfoNCE loss function; in its standard form, the contrastive loss is calculated as:

$$\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right)}{\exp\left(\operatorname{sim}(t, s_{i,j})/\tau\right) + \sum_{s' \in \hat{N}} \exp\left(\operatorname{sim}(t, s')/\tau\right)}$$

where $s_{i,j}$ denotes the embedding of the span token, $t$ denotes the entity type embedding, $\tau$ is a temperature coefficient, $\operatorname{sim}$ denotes the cosine similarity of two vectors, and $\hat{N}$ denotes the negative span set, i.e. the text blocks that are not of the target entity type.
10. The named entity recognition system based on contrastive learning according to claim 6, wherein for the constructed entity type encoding model and text encoding model, the model construction module is configured to perform training based on the AdamW optimizer to obtain a trained entity type encoding model and a trained text encoding model.
CN202310929718.8A 2023-07-27 2023-07-27 Named entity recognition method and system based on contrast learning Pending CN116956924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310929718.8A CN116956924A (en) 2023-07-27 2023-07-27 Named entity recognition method and system based on contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310929718.8A CN116956924A (en) 2023-07-27 2023-07-27 Named entity recognition method and system based on contrast learning

Publications (1)

Publication Number Publication Date
CN116956924A true CN116956924A (en) 2023-10-27

Family

ID=88460045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310929718.8A Pending CN116956924A (en) 2023-07-27 2023-07-27 Named entity recognition method and system based on contrast learning

Country Status (1)

Country Link
CN (1) CN116956924A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117435748A (en) * 2023-12-20 2024-01-23 深圳前海环融联易信息科技服务有限公司 Named entity processing method, device, equipment and medium based on contrast learning
CN117435748B (en) * 2023-12-20 2024-03-12 深圳前海环融联易信息科技服务有限公司 Named entity processing method, device, equipment and medium based on contrast learning
CN117807999A (en) * 2024-02-29 2024-04-02 武汉科技大学 Domain self-adaptive named entity recognition method based on countermeasure learning
CN117807999B (en) * 2024-02-29 2024-05-10 武汉科技大学 Domain self-adaptive named entity recognition method based on countermeasure learning

Similar Documents

Publication Publication Date Title
CN108984683B (en) Method, system, equipment and storage medium for extracting structured data
CN107293296B (en) Voice recognition result correction method, device, equipment and storage medium
CN116956924A (en) Named entity recognition method and system based on contrast learning
CN112396613A (en) Image segmentation method and device, computer equipment and storage medium
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
CN114757176B (en) Method for acquiring target intention recognition model and intention recognition method
CN112632225A (en) Semantic searching method and device based on case and event knowledge graph and electronic equipment
CN111291187B (en) Emotion analysis method and device, electronic equipment and storage medium
CN112182167B (en) Text matching method and device, terminal equipment and storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN113158671B (en) Open domain information extraction method combined with named entity identification
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN110991185A (en) Method and device for extracting attributes of entities in article
CN115238029A (en) Construction method and device of power failure knowledge graph
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN116467417A (en) Method, device, equipment and storage medium for generating answers to questions
CN110334340B (en) Semantic analysis method and device based on rule fusion and readable storage medium
CN114117041B (en) Attribute-level emotion analysis method based on specific attribute word context modeling
CN116844573A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN114707518B (en) Semantic fragment-oriented target emotion analysis method, device, equipment and medium
CN112434133B (en) Intention classification method and device, intelligent terminal and storage medium
CN112818688B (en) Text processing method, device, equipment and storage medium
CN115270792A (en) Medical entity identification method and device
CN113392929A (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination