CN114462391B - Nested entity identification method and system based on contrastive learning

Nested entity identification method and system based on contrastive learning

Info

Publication number
CN114462391B
Authority
CN
China
Prior art keywords
entity, statement, sentence, data table, representation
Prior art date
Legal status
Active
Application number
CN202210247571.XA
Other languages
Chinese (zh)
Other versions
CN114462391A
Inventor
胡碧峰
王艳飞
胡茂海
尹光荣
Current Assignee
Workway Shenzhen Information Technology Co., Ltd.
Original Assignee
Workway Shenzhen Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Workway Shenzhen Information Technology Co., Ltd.
Priority to CN202210247571.XA
Publication of CN114462391A
Application granted
Publication of CN114462391B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches


Abstract

The invention provides a nested entity classification method and system based on contrastive learning. A target nested entity classification model for nested entity classification is obtained in two stages: the first stage learns entity representations by contrastive learning, and the second stage adopts a fragment (span) method. Because the characteristics of the samples are learned in the first stage, the proportion of negative samples can be reduced, model convergence is accelerated, the model results are more stable, and entity boundaries are more clearly distinguished.

Description

Nested entity identification method and system based on contrastive learning
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a nested entity identification method and system based on contrastive learning.
Background
In current nested entity recognition technology, there are two main methods: first, a sequence labeling method, which decodes multiple times during decoding so as to identify the nested entities in a sentence; second, a fragment (span) method, which converts entity recognition into the classification of fragments, enumerating all fragments in a sentence and classifying them so as to identify nested entities.
Compared with the sequence labeling method, the fragment method misses fewer entities, so it is widely adopted. However, the number of negative samples to be considered during training is very large: assuming a sentence has n characters, n(n+1)/2 fragments are generated. This causes sample imbalance, slows model convergence, and affects training efficiency; especially when sequences are long, it fails to meet the latency requirements of a model serving online.
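To make the scale concrete, the following sketch (illustrative only; the helper name and example sentence are not taken from the patent) enumerates every contiguous fragment of a short sentence:

```python
# Illustrative only: enumerate all candidate fragments of an n-character sentence.
def enumerate_spans(sentence):
    """Return every contiguous fragment as an inclusive (start, end) pair."""
    n = len(sentence)
    return [(start, end) for start in range(n) for end in range(start, n)]

spans = enumerate_spans("南京市长江大桥")  # n = 7 characters
print(len(spans))  # 28 = 7 * 8 / 2, i.e. n(n+1)/2 candidate fragments
```

For a 100-character sentence this is already 5050 fragments, nearly all of them non-entities, which is the sample imbalance described above.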
Disclosure of Invention
In view of the above technical problems, embodiments of the present invention provide a nested entity classification method and system based on contrastive learning, so as to solve at least one of the above technical problems.
The invention adopts the following technical scheme:
the embodiment of the invention provides a nested entity classification method based on contrast learning, which comprises the following steps:
S1, acquiring input sentence data tables; wherein the j-th row of sentence data table i comprises $(X_{ij}, L_{ij})$, with $X_{ij}=(X_{ij}^{1},X_{ij}^{2},\dots,X_{ij}^{n_{ij}})$, where $X_{ij}^{k}$ is the k-th character in the j-th sentence of sentence data table i, k takes values 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L_{ij}=\{(E_{ij}^{1},T_{ij}^{1}),(E_{ij}^{2},T_{ij}^{2}),\dots,(E_{ij}^{m_{ij}},T_{ij}^{m_{ij}})\}$, where $E_{ij}^{r}$ is the r-th entity in the j-th sentence of sentence data table i, $T_{ij}^{r}$ is the actual entity type corresponding to $E_{ij}^{r}$, r takes values 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i takes values 1 to N, j takes values 1 to $P_{i}$, $P_{i}$ is the number of sentences in sentence data table i, and N is the number of sentence data tables;
S2, for sentence j in sentence data table i, executing the following operations:
S201, encoding sentence j twice with a pre-trained language model to obtain a first characterization vector $h1_{ij}=(h1_{ij}^{1},h1_{ij}^{2},\dots,h1_{ij}^{n_{ij}})$ and a second characterization vector $h2_{ij}=(h2_{ij}^{1},h2_{ij}^{2},\dots,h2_{ij}^{n_{ij}})$, respectively, wherein $h1_{ij}^{k}$ and $h2_{ij}^{k}$ are the characterizations obtained by the first and second encodings of $X_{ij}^{k}$;
S202, obtaining
$$\mathrm{Loss1}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)+\sum_{s\neq j}\sum_{t}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{is}^{t}\right)/\tau\right)}$$
wherein $B1_{ij}^{r}$ and $B2_{ij}^{r}$ are the first and second entity representations of the r-th entity in the entity representation vectors corresponding to $h1_{ij}$ and $h2_{ij}$, respectively; $B2_{is}^{t}$ is the second entity representation of the t-th entity in the s-th sentence, other than sentence j, of sentence data table i; $\tau$ is a temperature hyperparameter; and $\mathrm{sim}(u,v)$ denotes the cosine similarity between $u$ and $v$;
S203, obtaining
$$\mathrm{Loss2}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)+\sum_{p\neq j}\sum_{q}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1_{ip}^{q}\right)/\tau\right)}$$
wherein $B1'_{i}$ is the entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $B1_{ip}^{q}$ is the first entity representation of an entity q, in the p-th sentence other than sentence j of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S204, optimizing $\tau$ and the dropout of the pre-trained language model so that $\mathrm{Loss1}_{ij}^{r}$ and $\mathrm{Loss2}_{ij}^{r}$ are minimized;
S205, setting j = j + 1; if $j \le P_{i}$, executing S2; otherwise, executing S3;
S3, enumerating the fragments of each sentence, randomly extracting a set number of non-entity fragments as negative samples, and obtaining a training set comprising N training samples;
S4, inputting the training set into the optimized pre-trained language model and classifying the types of the entities in each sentence to obtain classification prediction results;
S5, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model;
S6, classifying input sentences by using the target nested entity classification model.
The invention also provides a nested entity classification system based on contrastive learning, comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and N sentence data tables are stored in the database; the j-th row of sentence data table i comprises $(X_{ij}, L_{ij})$, with $X_{ij}=(X_{ij}^{1},X_{ij}^{2},\dots,X_{ij}^{n_{ij}})$, where $X_{ij}^{k}$ is the k-th character in the j-th sentence of sentence data table i, k takes values 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L_{ij}=\{(E_{ij}^{1},T_{ij}^{1}),(E_{ij}^{2},T_{ij}^{2}),\dots,(E_{ij}^{m_{ij}},T_{ij}^{m_{ij}})\}$, where $E_{ij}^{r}$ is the r-th entity in the j-th sentence of sentence data table i, $T_{ij}^{r}$ is the actual entity type corresponding to $E_{ij}^{r}$, r takes values 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i takes values 1 to N, j takes values 1 to $P_{i}$, and $P_{i}$ is the number of sentences in sentence data table i;
the processor is configured to execute the computer program to implement the following steps:
S10, for sentence j in sentence data table i, executing the following operations:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first characterization vector $h1_{ij}=(h1_{ij}^{1},h1_{ij}^{2},\dots,h1_{ij}^{n_{ij}})$ and a second characterization vector $h2_{ij}=(h2_{ij}^{1},h2_{ij}^{2},\dots,h2_{ij}^{n_{ij}})$, respectively, wherein $h1_{ij}^{k}$ and $h2_{ij}^{k}$ are the characterizations obtained by the first and second encodings of $X_{ij}^{k}$;
S102, obtaining
$$\mathrm{Loss1}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)+\sum_{s\neq j}\sum_{t}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{is}^{t}\right)/\tau\right)}$$
wherein $B1_{ij}^{r}$ and $B2_{ij}^{r}$ are the first and second entity representations of the r-th entity in the entity representation vectors corresponding to $h1_{ij}$ and $h2_{ij}$, respectively; $B2_{is}^{t}$ is the second entity representation of the t-th entity in the s-th sentence, other than sentence j, of sentence data table i; $\tau$ is a temperature hyperparameter; and $\mathrm{sim}(u,v)$ denotes the cosine similarity between $u$ and $v$;
S103, obtaining
$$\mathrm{Loss2}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)+\sum_{p\neq j}\sum_{q}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1_{ip}^{q}\right)/\tau\right)}$$
wherein $B1'_{i}$ is the entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $B1_{ip}^{q}$ is the first entity representation of an entity q, in the p-th sentence other than sentence j of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout of the pre-trained language model so that $\mathrm{Loss1}_{ij}^{r}$ and $\mathrm{Loss2}_{ij}^{r}$ are minimized;
S105, setting j = j + 1; if $j \le P_{i}$, executing S10; otherwise, executing S20;
S20, enumerating the fragments of each sentence, randomly extracting a set number of non-entity fragments as negative samples, and obtaining a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model and classifying the types of the entities in each sentence to obtain classification prediction results;
S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain the target nested entity classification model.
The embodiment of the invention has at least the following technical effects: the target nested entity classification model for nested entity classification is obtained in two stages, where the first stage learns entity representations by contrastive learning and the second stage adopts the fragment method. Because the characteristics of the samples are learned in the first stage, the proportion of negative samples can be reduced, model convergence is accelerated, the model results are more stable, and entity boundaries are more clearly distinguished.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below.
An embodiment of the invention provides a nested entity classification method based on contrastive learning, which may include the following steps:
S1, acquiring input sentence data tables; wherein the j-th row of sentence data table i comprises $(X_{ij}, L_{ij})$, with $X_{ij}=(X_{ij}^{1},X_{ij}^{2},\dots,X_{ij}^{n_{ij}})$, where $X_{ij}^{k}$ is the k-th character in the j-th sentence of sentence data table i, k takes values 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L_{ij}=\{(E_{ij}^{1},T_{ij}^{1}),(E_{ij}^{2},T_{ij}^{2}),\dots,(E_{ij}^{m_{ij}},T_{ij}^{m_{ij}})\}$, where $E_{ij}^{r}$ is the r-th entity in the j-th sentence of sentence data table i, $T_{ij}^{r}$ is the actual entity type corresponding to $E_{ij}^{r}$, r takes values 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i takes values 1 to N, j takes values 1 to $P_{i}$, $P_{i}$ is the number of sentences in sentence data table i, and N is the number of sentence data tables.
In an embodiment of the present invention, the number of sentences in each sentence data table may be the same, i.e., $P_1 = P_2 = \dots = P_N$.
In another embodiment of the present invention, the number of sentences in the first N-1 sentence data tables may be the same, i.e., $P_1 = P_2 = \dots = P_{N-1} = P$, and the number of sentences in the last sentence data table equals $M - (N-1) \times P$, where M is the total number of sentences; a partitioning of this kind is sketched below.
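As an illustration of this partitioning (a minimal sketch; the helper name is hypothetical and not taken from the patent):

```python
# Hypothetical sketch: split M annotated sentences into data tables of P
# sentences each; the last table keeps the remaining M - (N-1) * P sentences.
def build_sentence_tables(sentences, p):
    return [sentences[i:i + p] for i in range(0, len(sentences), p)]

tables = build_sentence_tables(list(range(10)), p=4)  # M = 10, P = 4
print([len(t) for t in tables])  # [4, 4, 2] -> N = 3 data tables
```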
S2, for sentence j in sentence data table i, executing the following operations:
S201, encoding sentence j twice with a pre-trained language model to obtain a first characterization vector $h1_{ij}=(h1_{ij}^{1},h1_{ij}^{2},\dots,h1_{ij}^{n_{ij}})$ and a second characterization vector $h2_{ij}=(h2_{ij}^{1},h2_{ij}^{2},\dots,h2_{ij}^{n_{ij}})$, respectively, wherein $h1_{ij}^{k}$ and $h2_{ij}^{k}$ are the characterizations obtained by the first and second encodings of $X_{ij}^{k}$.
In an exemplary embodiment of the invention, the pre-trained language model may be a BERT model. Because the random mask mechanism in BERT causes even repeated encodings of the same sentence to differ, this characteristic is exploited to encode each sample twice and thereby generate the positive samples required for contrastive learning.
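A minimal sketch of this twice-encoding trick, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (neither is prescribed by the patent):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.train()  # keep dropout active so two passes over one sentence differ

inputs = tokenizer("张三在北京大学读书", return_tensors="pt")
h1 = model(**inputs).last_hidden_state  # first characterization vector
h2 = model(**inputs).last_hidden_state  # second pass yields a different view
print(torch.allclose(h1, h2))  # False: the two encodings form a positive pair
```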
In another exemplary embodiment of the present invention, the pre-trained language model is a RoBERTa model.
Those skilled in the art will appreciate that methods of encoding sentences using pre-trained language models may belong to the prior art.
S202, obtaining
$$\mathrm{Loss1}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)+\sum_{s\neq j}\sum_{t}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{is}^{t}\right)/\tau\right)}$$
wherein $B1_{ij}^{r}$ and $B2_{ij}^{r}$ are the first and second entity representations of the r-th entity in the entity representation vectors corresponding to $h1_{ij}$ and $h2_{ij}$, respectively; $B2_{is}^{t}$ is the second entity representation of the t-th entity in the s-th sentence, other than sentence j, of sentence data table i; $\tau$ is a temperature hyperparameter; and $\mathrm{sim}(u,v)$ denotes the cosine similarity between $u$ and $v$.
In the embodiment of the invention, the loss function Loss1 makes the representations of the same entity word in the two encoding results similar.
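One way to realize Loss1 is the standard InfoNCE form sketched below (an assumption-laden sketch: entity representations are taken as fixed-size vectors, and the function name and temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def loss1(b1_r, b2_r, b2_neg, tau=0.05):
    """Pull the two encodings of one entity together and push the entity away
    from second encodings of entities in other sentences of the data table."""
    pos = F.cosine_similarity(b1_r, b2_r, dim=-1) / tau                 # sim(B1, B2)/tau
    neg = F.cosine_similarity(b1_r.unsqueeze(0), b2_neg, dim=-1) / tau  # negatives
    logits = torch.cat([pos.view(1), neg])
    return torch.logsumexp(logits, dim=0) - logits[0]  # -log softmax of the positive

print(loss1(torch.randn(768), torch.randn(768), torch.randn(12, 768)))
```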
S203, obtaining
$$\mathrm{Loss2}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)+\sum_{p\neq j}\sum_{q}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1_{ip}^{q}\right)/\tau\right)}$$
wherein $B1'_{i}$ is the entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $B1_{ip}^{q}$ is the first entity representation of an entity q, in the p-th sentence other than sentence j of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation.
In the embodiment of the invention, the loss function Loss2 makes entity words of the same type close and entity words of different types far apart within the current sentence data table.
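Loss2 admits the same form with supervised positives and negatives (again a sketch, reusing torch and F from the previous block; the pairing of one same-type positive against different-type negatives follows the description, everything else is illustrative):

```python
def loss2(b1_r, b1_same_type, b1_diff_type, tau=0.05):
    """A same-type entity from another sentence is the positive; entities of
    other types in the same data table are the negatives."""
    pos = F.cosine_similarity(b1_r, b1_same_type, dim=-1) / tau
    neg = F.cosine_similarity(b1_r.unsqueeze(0), b1_diff_type, dim=-1) / tau
    logits = torch.cat([pos.view(1), neg])
    return torch.logsumexp(logits, dim=0) - logits[0]
```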
S204, optimizing $\tau$ and the dropout of the pre-trained language model so that $\mathrm{Loss1}_{ij}^{r}$ and $\mathrm{Loss2}_{ij}^{r}$ are minimized.
Those skilled in the art will appreciate that optimizing $\tau$ and dropout in the pre-trained language model so as to minimize $\mathrm{Loss1}_{ij}^{r}$ and $\mathrm{Loss2}_{ij}^{r}$ may be implemented with prior-art techniques.
Through S204, the values of $\tau$ and dropout optimized in the first stage can be obtained.
S205, setting j = j + 1; if $j \le P_{i}$, executing S2; otherwise, executing S3.
S3, enumerating the fragments of each sentence, randomly extracting a set number of non-entity fragments as negative samples, and obtaining a training set comprising N training samples.
In the embodiment of the invention, the set number can be chosen based on actual needs. Specifically, each training sample is a sentence containing both positive and negative samples, where the positive samples are the entities in the sentence.
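A hedged sketch of this second-stage sampling (the helper name and the "O" label for non-entity fragments are assumptions, not patent terminology):

```python
import random

def sample_fragments(sentence, entities, num_negatives=5):
    """Keep all labeled entities as positives and draw only a set number of
    non-entity fragments as negatives, instead of all n(n+1)/2 of them."""
    n = len(sentence)
    entity_spans = {(s, e) for s, e, _ in entities}
    candidates = [(s, e) for s in range(n) for e in range(s, n)
                  if (s, e) not in entity_spans]
    negatives = random.sample(candidates, min(num_negatives, len(candidates)))
    return entities, [(s, e, "O") for s, e in negatives]

pos, neg = sample_fragments("张三在北京大学读书", [(0, 1, "PER"), (3, 6, "ORG")])
print(pos, neg)  # 2 positives, 5 sampled negatives
```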
S4, inputting the training set into the optimized pre-trained language model and classifying the types of the entities in each sentence to obtain classification prediction results.
The classification prediction results may include a classification result, i.e., a predicted type, for each fragment in the training set.
S5, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain the target nested entity classification model.
In embodiments of the present invention, the optimized pre-trained language model may be optimized based on the F1 score. Because the positive samples in the training set are labeled, i.e., the entity type of each entity is known, the fragments of a sentence that belong to no entity type are also known, so classification accuracy can be obtained by comparing the predicted types with the actual types. Those skilled in the art will appreciate that determining classification accuracy based on the F1 score may belong to the prior art.
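A minimal way to score the predictions, assuming micro F1 over (start, end, type) triples (the patent names only the F1 score, not this exact formulation):

```python
def span_f1(predicted, actual):
    """Micro F1 over (start, end, type) triples for one batch of sentences."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # spans with both boundaries and type correct
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(span_f1([(0, 1, "PER"), (3, 5, "ORG")], [(0, 1, "PER"), (3, 6, "ORG")]))  # 0.5
```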
If the classification accuracy is greater than or equal to a set threshold, the current classification model is considered accurate and can be used as the target nested entity classification model; if the classification accuracy is below the threshold, $\tau$ and dropout continue to be adjusted until the accuracy reaches the threshold.
Because S1 and S2 learn the characteristics of the samples, the proportion of negative samples can be reduced, model convergence is accelerated, the model results are more stable, and entity boundaries are more clearly distinguished.
S6, classifying the input sentences by using the target nested entity classification model.
In practical applications, input sentences can be classified directly with the obtained target nested entity classification model.
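In that setting, inference can reduce to typing every fragment of the input (a hypothetical usage sketch: model.predict is an assumed interface, and enumerate_spans is the helper sketched in the Background discussion):

```python
def classify_sentence(model, sentence):
    """Type every candidate fragment with the trained model; keep entity spans."""
    results = []
    for start, end in enumerate_spans(sentence):      # all candidate fragments
        label = model.predict(sentence, start, end)   # assumed model interface
        if label != "O":                              # drop non-entity fragments
            results.append((sentence[start:end + 1], label))
    return results
```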
The invention also provides a nested entity classification system based on contrastive learning, comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and N sentence data tables are stored in the database; the j-th row of sentence data table i comprises $(X_{ij}, L_{ij})$, with $X_{ij}=(X_{ij}^{1},X_{ij}^{2},\dots,X_{ij}^{n_{ij}})$, where $X_{ij}^{k}$ is the k-th character in the j-th sentence of sentence data table i, k takes values 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L_{ij}=\{(E_{ij}^{1},T_{ij}^{1}),(E_{ij}^{2},T_{ij}^{2}),\dots,(E_{ij}^{m_{ij}},T_{ij}^{m_{ij}})\}$, where $E_{ij}^{r}$ is the r-th entity in the j-th sentence of sentence data table i, $T_{ij}^{r}$ is the actual entity type corresponding to $E_{ij}^{r}$, r takes values 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i takes values 1 to N, j takes values 1 to $P_{i}$, and $P_{i}$ is the number of sentences in sentence data table i;
the processor is configured to execute the computer program to implement the following steps:
S10, for sentence j in sentence data table i, executing the following operations:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first characterization vector $h1_{ij}=(h1_{ij}^{1},h1_{ij}^{2},\dots,h1_{ij}^{n_{ij}})$ and a second characterization vector $h2_{ij}=(h2_{ij}^{1},h2_{ij}^{2},\dots,h2_{ij}^{n_{ij}})$, respectively, wherein $h1_{ij}^{k}$ and $h2_{ij}^{k}$ are the characterizations obtained by the first and second encodings of $X_{ij}^{k}$;
S102, obtaining
$$\mathrm{Loss1}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)+\sum_{s\neq j}\sum_{t}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{is}^{t}\right)/\tau\right)}$$
wherein $B1_{ij}^{r}$ and $B2_{ij}^{r}$ are the first and second entity representations of the r-th entity in the entity representation vectors corresponding to $h1_{ij}$ and $h2_{ij}$, respectively; $B2_{is}^{t}$ is the second entity representation of the t-th entity in the s-th sentence, other than sentence j, of sentence data table i; $\tau$ is a temperature hyperparameter; and $\mathrm{sim}(u,v)$ denotes the cosine similarity between $u$ and $v$;
S103, obtaining
$$\mathrm{Loss2}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)+\sum_{p\neq j}\sum_{q}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1_{ip}^{q}\right)/\tau\right)}$$
wherein $B1'_{i}$ is the entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $B1_{ip}^{q}$ is the first entity representation of an entity q, in the p-th sentence other than sentence j of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout of the pre-trained language model so that $\mathrm{Loss1}_{ij}^{r}$ and $\mathrm{Loss2}_{ij}^{r}$ are minimized;
S105, setting j = j + 1; if $j \le P_{i}$, executing S10; otherwise, executing S20;
S20, enumerating the fragments of each sentence, randomly extracting a set number of non-entity fragments as negative samples, and obtaining a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model and classifying the types of the entities in each sentence to obtain classification prediction results;
S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain the target nested entity classification model.
Further, in S40, the optimized pre-trained language model is optimized based on the F1 score.
Further, the pre-trained language model is a BERT model.
Further, the pre-trained language model is a RoBERTa model.
Further, $P_1 = P_2 = \dots = P_N$.
For the implementation of this embodiment, reference may be made to the foregoing embodiments.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A nested entity classification method based on contrastive learning, applied to the technical field of natural language processing, characterized by comprising the following steps:
S1, acquiring input sentence data tables; wherein the j-th row of sentence data table i comprises $(X_{ij}, L_{ij})$, with $X_{ij}=(X_{ij}^{1},X_{ij}^{2},\dots,X_{ij}^{n_{ij}})$, where $X_{ij}^{k}$ is the k-th character in the j-th sentence of sentence data table i, k takes values 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L_{ij}=\{(E_{ij}^{1},T_{ij}^{1}),(E_{ij}^{2},T_{ij}^{2}),\dots,(E_{ij}^{m_{ij}},T_{ij}^{m_{ij}})\}$, where $E_{ij}^{r}$ is the r-th entity in the j-th sentence of sentence data table i, $T_{ij}^{r}$ is the actual entity type corresponding to $E_{ij}^{r}$, r takes values 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i takes values 1 to N, j takes values 1 to $P_{i}$, $P_{i}$ is the number of sentences in sentence data table i, and N is the number of sentence data tables;
S2, for sentence j in sentence data table i, executing the following operations:
S201, encoding sentence j twice with a pre-trained language model to obtain a first characterization vector $h1_{ij}=(h1_{ij}^{1},h1_{ij}^{2},\dots,h1_{ij}^{n_{ij}})$ and a second characterization vector $h2_{ij}=(h2_{ij}^{1},h2_{ij}^{2},\dots,h2_{ij}^{n_{ij}})$, respectively, wherein $h1_{ij}^{k}$ and $h2_{ij}^{k}$ are the characterizations obtained by the first and second encodings of $X_{ij}^{k}$;
S202, obtaining
$$\mathrm{Loss1}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)+\sum_{s\neq j}\sum_{t}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{is}^{t}\right)/\tau\right)}$$
wherein $B1_{ij}^{r}$ and $B2_{ij}^{r}$ are the first and second entity representations of the r-th entity in the entity representation vectors corresponding to $h1_{ij}$ and $h2_{ij}$, respectively; $B2_{is}^{t}$ is the second entity representation of the t-th entity in the s-th sentence, other than sentence j, of sentence data table i; $\tau$ is a temperature hyperparameter; and $\mathrm{sim}(u,v)$ denotes the cosine similarity between $u$ and $v$;
S203, obtaining
$$\mathrm{Loss2}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)+\sum_{p\neq j}\sum_{q}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1_{ip}^{q}\right)/\tau\right)}$$
wherein $B1'_{i}$ is the entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $B1_{ip}^{q}$ is the first entity representation of an entity q, in the p-th sentence other than sentence j of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S204, optimizing $\tau$ and the dropout of the pre-trained language model so that $\mathrm{Loss1}_{ij}^{r}$ and $\mathrm{Loss2}_{ij}^{r}$ are minimized;
S205, setting j = j + 1; if $j \le P_{i}$, executing S2; otherwise, executing S3;
S3, enumerating the fragments of each sentence, randomly extracting a set number of non-entity fragments as negative samples, and obtaining a training set comprising N training samples;
S4, inputting the training set into the optimized pre-trained language model and classifying the types of the entities in each sentence to obtain classification prediction results;
S5, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model;
S6, classifying input sentences by using the target nested entity classification model.
2. The method of claim 1, wherein in S5 the optimized pre-trained language model is optimized based on the F1 score.
3. The method of claim 1, wherein the pre-trained language model is a BERT model.
4. The method of claim 1, wherein the pre-trained language model is a RoBERTa model.
5. The method of claim 1, wherein $P_1 = P_2 = \dots = P_N$.
6. A nested entity classification system based on contrastive learning, applied to the technical field of natural language processing, characterized by comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and N sentence data tables are stored in the database; the j-th row of sentence data table i comprises $(X_{ij}, L_{ij})$, with $X_{ij}=(X_{ij}^{1},X_{ij}^{2},\dots,X_{ij}^{n_{ij}})$, where $X_{ij}^{k}$ is the k-th character in the j-th sentence of sentence data table i, k takes values 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L_{ij}=\{(E_{ij}^{1},T_{ij}^{1}),(E_{ij}^{2},T_{ij}^{2}),\dots,(E_{ij}^{m_{ij}},T_{ij}^{m_{ij}})\}$, where $E_{ij}^{r}$ is the r-th entity in the j-th sentence of sentence data table i, $T_{ij}^{r}$ is the actual entity type corresponding to $E_{ij}^{r}$, r takes values 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i takes values 1 to N, j takes values 1 to $P_{i}$, and $P_{i}$ is the number of sentences in sentence data table i;
the processor is configured to execute the computer program to implement the following steps:
S10, for sentence j in sentence data table i, executing the following operations:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first characterization vector $h1_{ij}=(h1_{ij}^{1},h1_{ij}^{2},\dots,h1_{ij}^{n_{ij}})$ and a second characterization vector $h2_{ij}=(h2_{ij}^{1},h2_{ij}^{2},\dots,h2_{ij}^{n_{ij}})$, respectively, wherein $h1_{ij}^{k}$ and $h2_{ij}^{k}$ are the characterizations obtained by the first and second encodings of $X_{ij}^{k}$;
S102, obtaining
$$\mathrm{Loss1}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{ij}^{r}\right)/\tau\right)+\sum_{s\neq j}\sum_{t}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B2_{is}^{t}\right)/\tau\right)}$$
wherein $B1_{ij}^{r}$ and $B2_{ij}^{r}$ are the first and second entity representations of the r-th entity in the entity representation vectors corresponding to $h1_{ij}$ and $h2_{ij}$, respectively; $B2_{is}^{t}$ is the second entity representation of the t-th entity in the s-th sentence, other than sentence j, of sentence data table i; $\tau$ is a temperature hyperparameter; and $\mathrm{sim}(u,v)$ denotes the cosine similarity between $u$ and $v$;
S103, obtaining
$$\mathrm{Loss2}_{ij}^{r}=-\log\frac{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)}{\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1'_{i}\right)/\tau\right)+\sum_{p\neq j}\sum_{q}\exp\left(\mathrm{sim}\left(B1_{ij}^{r},B1_{ip}^{q}\right)/\tau\right)}$$
wherein $B1'_{i}$ is the entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $B1_{ip}^{q}$ is the first entity representation of an entity q, in the p-th sentence other than sentence j of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout of the pre-trained language model so that $\mathrm{Loss1}_{ij}^{r}$ and $\mathrm{Loss2}_{ij}^{r}$ are minimized;
S105, setting j = j + 1; if $j \le P_{i}$, executing S10; otherwise, executing S20;
S20, enumerating the fragments of each sentence, randomly extracting a set number of non-entity fragments as negative samples, and obtaining a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model and classifying the types of the entities in each sentence to obtain classification prediction results;
S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain the target nested entity classification model.
7. The system of claim 6, wherein in S40 the optimized pre-trained language model is optimized based on the F1 score.
8. The system of claim 6, wherein the pre-trained language model is a BERT model.
9. The system of claim 6, wherein the pre-trained language model is a RoBERTa model.
10. The system of claim 6, wherein $P_1 = P_2 = \dots = P_N$.
CN202210247571.XA (filed 2022-03-14): Nested entity identification method and system based on contrastive learning. Status: Active. Granted as CN114462391B.

Priority Applications (1)

Application Number: CN202210247571.XA (granted as CN114462391B) · Priority Date: 2022-03-14 · Filing Date: 2022-03-14 · Title: Nested entity identification method and system based on contrastive learning


Publications (2)

Publication Number · Publication Date
CN114462391A: 2022-05-10
CN114462391B: 2024-05-14

Family

ID=81417788

Family Applications (1)

Application Number: CN202210247571.XA (Active, granted as CN114462391B) · Priority Date: 2022-03-14 · Filing Date: 2022-03-14 · Title: Nested entity identification method and system based on contrastive learning

Country Status (1)

CN: CN114462391B


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
CN111753545A * 2020-06-19 2020-10-09 科大讯飞(苏州)科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN113886571A * 2020-07-01 2022-01-04 北京三星通信技术研究有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium
WO2022005188A1 * 2020-07-01 2022-01-06 Samsung Electronics Co., Ltd. Entity recognition method, apparatus, electronic device and computer readable storage medium
CN112487812A * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Nested entity identification method and system based on boundary identification
CN112347785A * 2020-11-18 2021-02-09 湖南国发控股有限公司 Nested entity recognition system based on multitask learning
CN113869051A * 2021-09-22 2021-12-31 西安理工大学 Named entity identification method based on deep learning

Also Published As

Publication number · Publication date
CN114462391A: 2022-05-10


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant