CN118114671A

CN118114671A - Gaussian function-based text data set small sample named entity recognition method and system

Info

Publication number: CN118114671A
Application number: CN202410237440.2A
Authority: CN
Inventors: 陈奕阳; 黄佳佳; 李鹏伟
Original assignee: NANJING AUDIT UNIVERSITY
Current assignee: NANJING AUDIT UNIVERSITY
Priority date: 2024-03-01
Filing date: 2024-03-01
Publication date: 2024-05-31
Anticipated expiration: 2044-03-01
Also published as: CN118114671B

Abstract

The invention discloses a Gaussian function-based method and system for small-sample named entity recognition on text data sets. The text data set is first divided into an Etrain set, an Edev set and an Etest set; the Etrain and Edev sets are then divided a second time, each into a support set and a query set; the model is learned on the Etrain set, and the mean of the token embeddings sharing the same type is calculated in the support set to obtain a prototype for each entity type; the Gaussian function value between each token of the support set in the Etrain set and each prototype is then calculated. By using a Gaussian function instead of a distance function to compute the similarity between a variable and a prototype, the invention improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, reduces the influence of distribution differences between training data and test data on the prediction result, and is suitable for wide adoption.

Description

Gaussian function-based text data set small sample named entity recognition method and system

Technical Field

The invention relates to the technical field of text data set recognition, in particular to a method and a system for recognizing named entities of small samples of text data sets based on a Gaussian function.

Background

NER is usually formulated as a sequence labeling problem, and many approaches built on deep neural networks have been very successful, but that success depends largely on a large amount of training data. In practice, enough sample data often cannot be obtained, or there is not enough manpower and time to manually label unlabeled data. Small-sample (few-shot) learning aims to let a machine learn, as humans do, to solve problems from a small number of samples: categories that have already been learned help predict new categories for which only one or a few labeled samples are available.

Currently, commonly used text data sets for named entity recognition include OntoNotes, CoNLL'03 and WNUT'17, which face two challenges: 1. the data set samples are insufficient; 2. because there is no unified benchmark data set, results cannot be compared. The data set of "Few-NERD: A Few-shot Named Entity Recognition Dataset" balances its data by selecting paragraphs through a distant dictionary, and uses a nearest-neighbor approach that calculates the distance between a variable x and each prototype and predicts, from that distance, the likelihood that x falls into each category. However, this model has poor noise resistance, its classification result is strongly influenced by the choice of prototype points, its prediction accuracy is low for unevenly distributed samples, and it can only perform coarse classification. Therefore, a Gaussian function-based method and system for small-sample named entity recognition on text data sets needs to be designed.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a Gaussian function-based method and system for small-sample named entity recognition on text data sets. It addresses the problems of existing methods, namely poor noise resistance, classification results that are strongly influenced by the choice of prototype points, low prediction accuracy on unevenly distributed samples, and the ability to perform only coarse classification. By using a Gaussian function instead of a distance function to compute the similarity between a variable and a prototype, the invention improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, and reduces the influence of distribution differences between training data and test data on the prediction result.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A Gaussian function-based method for small-sample named entity recognition on text data sets comprises the following steps:

Step A, dividing the text data set into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set;

Step B, dividing the Etrain set and the Edev set a second time, each into a support set and a query set;

Step C, learning the model on the Etrain set, and calculating, in the support set, the mean of the token embeddings sharing the same type to obtain a prototype z for each entity type;

Step D, calculating the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z;

Step E, for the query set in the Etrain set, predicting the labels of unknown data with the model learned on the support set to obtain predicted labels;

Step F, judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, proceeding to the next step;

Step G, running prediction with the model on the Edev set, adjusting the parameters A and μ, and determining the trained model;

Step H, predicting data labels on the Etest set with the determined parameters and the learned model, and comparing the recall and precision of the model, thereby completing small-sample named entity recognition on the text data set.

In the foregoing method, in step A the text data set is divided into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set. The Etrain, Edev and Etest sets are mutually disjoint subsets of the text data set: the training set Etrain is used to learn the classification method, the validation set Edev is used to adjust the model parameters, and the test set Etest is used to test how well the model generalizes to unknown data.
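The two divisions described in steps A and B can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy data, the fixed seed and the per-label support size `k_support` are assumptions made for the example, and the 70/10/20 proportions are borrowed from the Few-NERD (SUP) setting described in the embodiment.

```python
import random
from collections import defaultdict

def split_disjoint(samples, train_pct=70, dev_pct=10, seed=0):
    """First division: shuffle and cut into disjoint Etrain / Edev / Etest lists.
    The percentages are illustrative assumptions (70/10/20)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    e_train = shuffled[:n_train]
    e_dev = shuffled[n_train:n_train + n_dev]
    e_test = shuffled[n_train + n_dev:]
    return e_train, e_dev, e_test

def split_support_query(samples, k_support, seed=0):
    """Second division: per label, k_support items form the support set,
    the remaining items form the query set (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    support, query = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        support.extend(items[:k_support])
        query.extend(items[k_support:])
    return support, query

# Toy data: 20 (token, label) pairs with two entity types.
data = [(f"token{i}", "PER" if i % 2 == 0 else "LOC") for i in range(20)]
e_train, e_dev, e_test = split_disjoint(data)
support, query = split_support_query(e_train, k_support=2)
print(len(e_train), len(e_dev), len(e_test), len(support), len(query))
```

Because each sample lands in exactly one slice of the shuffled list, the three sets are disjoint by construction, which is the property step A requires.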

In the foregoing method, in step C the model is learned on the Etrain set, and the mean of the token embeddings sharing the same type is calculated in the support set to obtain a prototype z for each entity type. For the i-th type, the prototype is z_i and the support set is s_i, and the relation between the prototype z_i and the support set s_i is shown in formula (1),

where f_θ is the encoder and the sequence is x = {x₁, x₂, x₃, …, x_n}. For each token x_i, the encoder produces a representation vector for every token of every sentence, as shown in formula (2),

h = [h₁, ..., h_n] = f_θ([x₁, ..., x_n]) (2)

where a token is the smallest semantic unit in the text and h is the representation vector that the encoder generates for the given sequence.
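Formula (1) is not reproduced in this text. Based on the description above (the prototype is the mean of the encoder embeddings of the support tokens of one type), a form consistent with the surrounding definitions would be the following. It is a reconstruction under that assumption, and the notation |s_i| for the number of tokens in s_i is introduced here.

```latex
z_i = \frac{1}{|s_i|} \sum_{x \in s_i} f_\theta(x) \tag{1}
```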

In the foregoing method, in step D the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z is calculated, specifically as follows:

Step D1, for each x_i in the support set, its Gaussian function value with each prototype is computed, as shown in formula (3),

where x is the independent variable; A is the amplitude, which adjusts the peak of the Gaussian curve; μ(z_i) is the mean, which is the center of the Gaussian curve; and σ is the standard deviation, which determines the width of the curve. A and σ are adjustable parameters, and z_i is the prototype of type i. The distance measure is used to predict the degree of similarity between the variable and each prototype, so that the distance measurement becomes a probability comparison;

Step D2, because the Euclidean distance and the Gaussian function are applied together, the Gaussian kernel function is used, as shown in formula (4),

where ‖x − x′‖² is the distance measure.
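Formulas (3) and (4) are not reproduced in this text. Given the stated roles of A (amplitude), μ(z_i) (center), σ (width) and the squared distance ‖x − x′‖², the standard Gaussian function and Gaussian kernel would read as below. This is a reconstruction under that assumption, and the symbol names G_i and K are chosen here for illustration.

```latex
G_i(x) = A \exp\!\left( -\frac{\bigl(x - \mu(z_i)\bigr)^2}{2\sigma^2} \right) \tag{3}

K(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right) \tag{4}
```

In this form the value equals the peak A (or 1 for the kernel) when x coincides with the prototype and decays smoothly as the squared distance grows, with σ controlling how quickly.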

In the foregoing method, in step F it is judged whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, the next step is performed. For the support sets s_y with type value set Y and a variable x, the judging process is shown in formula (5),

y* = arg min G_y(x), y ∈ Y (5).
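Steps C, D and F can be sketched together as below. This is a minimal illustration under stated assumptions: small hand-written vectors stand in for the encoder output f_θ, the function names are hypothetical, and the Gaussian value is taken as A·exp(−‖x − z‖² / (2σ²)). Following the wording of step F, the sketch selects the prototype with the largest Gaussian value, which is the same prototype as the one at the smallest distance.

```python
import math
from collections import defaultdict

def prototypes(embeddings, labels):
    """Step C: mean embedding per label (prototype z of each entity type)."""
    groups = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        groups[lab].append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }

def gaussian_similarity(x, z, amplitude=1.0, sigma=1.0):
    """Step D: A * exp(-||x - z||^2 / (2 * sigma^2)); assumed form of the kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return amplitude * math.exp(-sq_dist / (2.0 * sigma ** 2))

def predict(x, protos, amplitude=1.0, sigma=1.0):
    """Step F: return the label whose prototype gives the largest Gaussian value."""
    scores = {
        lab: gaussian_similarity(x, z, amplitude, sigma)
        for lab, z in protos.items()
    }
    return max(scores, key=scores.get), scores

# Toy support set: precomputed 2-d embeddings stand in for the encoder output.
support_vecs = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
support_labels = ["O", "O", "PER", "PER"]
protos = prototypes(support_vecs, support_labels)

label, scores = predict([0.9, 0.9], protos)
print(label, scores)  # the query lies next to the PER prototype
```

Unlike a raw distance, the score is bounded by A, so a far-away outlier contributes a value near zero rather than an arbitrarily large distance.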

In the foregoing method, in step H data labels are predicted on the Etest set with the determined parameters and the learned model, and the recall and precision of the model are compared, completing small-sample named entity recognition on the text data set, as shown in formula (6), formula (7) and formula (8),

where True Positives is the number of positive cases that the model correctly predicts as positive, False Positives is the number of negative cases that the model incorrectly predicts as positive, and False Negatives is the number of positive cases that the model incorrectly predicts as negative.
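Formulas (6) to (8) are not reproduced in this text. With the counts defined above, the standard definitions of precision, recall and the F value are as follows. Which formula number belongs to which metric is an assumption made here.

```latex
\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}} \tag{6}

\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}} \tag{7}

F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}
```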

A Gaussian function-based system for small-sample named entity recognition on text data sets comprises a data set preliminary division module, a data set secondary division module, a prototype obtaining module, a Gaussian function value calculation module, a predicted label obtaining module, a label judging module, a model building module and a named entity recognition module. The data set preliminary division module divides the text data set into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set. The data set secondary division module divides the Etrain set and the Edev set a second time, each into a support set and a query set. The prototype obtaining module learns the model on the Etrain set and calculates, in the support set, the mean of the token embeddings sharing the same type to obtain a prototype z for each entity type. The Gaussian function value calculation module calculates the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z. The predicted label obtaining module predicts, for the query set in the Etrain set, the data labels with the model learned on the support set to obtain predicted labels. The label judging module judges whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, the next step is performed. The model building module runs prediction with the model on the Edev set, adjusts the parameters A and μ, and determines the trained model. The named entity recognition module predicts data labels on the Etest set with the determined parameters and the learned model and compares the recall and precision of the model, completing small-sample named entity recognition on the text data set.

The beneficial effects of the invention are as follows. In the Gaussian function-based method and system for small-sample named entity recognition on text data sets, the text data set is first divided into an Etrain set, an Edev set and an Etest set; the Etrain and Edev sets are then divided a second time, each into a support set and a query set; the model is learned on the Etrain set, and the mean of the token embeddings sharing the same type is calculated in the support set to obtain a prototype for each entity type; the Gaussian function value between each token of the support set in the Etrain set and each prototype is calculated; for the query set in the Etrain set, the labels of unknown data are predicted with the model learned on the support set to obtain predicted labels; it is then judged whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, prediction is run with the model on the Edev set and the parameters are adjusted to determine the trained model; finally, unknown data is predicted on the Etest set with the determined parameters and the learned model, and the recall and precision of the model are compared, completing small-sample named entity recognition on the text data set. By using a Gaussian function instead of a distance function to compute the similarity between a variable and a prototype, the invention improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, and reduces the influence of distribution differences between training data and test data on the prediction result.

Drawings

FIG. 1 is a flow chart of the Gaussian function-based method and system for small-sample named entity recognition on text data sets;

FIG. 2 is a graph comparing precision before and after applying the method of the present invention on the INTER data set;

FIG. 3 is a graph comparing recall before and after applying the method of the present invention on the INTER data set;

FIG. 4 is a graph comparing the F value before and after applying the method of the present invention on the INTER data set;

FIG. 5 is a graph comparing precision before and after applying the method of the present invention on the INTRA data set;

FIG. 6 is a graph comparing recall before and after applying the method of the present invention on the INTRA data set;

FIG. 7 is a graph comparing the F value before and after applying the method of the present invention on the INTRA data set.

Detailed Description

The invention will be further described with reference to the drawings.

As shown in FIG. 1, the Gaussian function-based method for small-sample named entity recognition on text data sets comprises the following steps:

Step A, dividing the text data set into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set; the Etrain, Edev and Etest sets are mutually disjoint subsets of the text data set, the training set Etrain is used to learn the classification method, the validation set Edev is used to adjust the model parameters, and the test set Etest is used to test how well the model generalizes to unknown data;

Step C, learning the model on the Etrain set, and calculating, in the support set, the mean of the token embeddings sharing the same type to obtain a prototype z for each entity type, wherein for the i-th type the prototype is z_i and the support set is s_i, and the relation between the prototype z_i and the support set s_i is shown in formula (1),

h = [h₁, ..., h_n] = f_θ([x₁, ..., x_n]) (2)

where a token is the smallest semantic unit in the text and h is the representation vector that the encoder generates for the given sequence;

Step D, calculating the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z, as follows,

where ‖x − x′‖² is the distance measure;

Step F, judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, proceeding to the next step, wherein for the support sets s_y with type value set Y and a variable x, the judging process is shown in formula (5),

y* = arg min G_y(x), y ∈ Y (5);

Step H, predicting data labels on the Etest set with the determined parameters and the learned model, and comparing the recall and precision of the model, thereby completing small-sample named entity recognition on the text data set, as shown in formula (6), formula (7) and formula (8).

To better illustrate the utility of the present invention, a specific embodiment is described below.

S1, using the three benchmark partitions Etrain, Edev and Etest;

Few-NERD (SUP) is the standard supervised setting: the whole corpus is randomly sampled, with 70% used as the training set, 10% as the validation set and 20% as the test set, and all three sets contain the 66 fine-grained entity classes;

Few-NERD (INTRA) is divided by coarse-grained entity type: the training set contains People, MISC, Art and Product, the validation set contains Event and Building, and the test set contains ORG and LOC;

Few-NERD (INTER) is divided by fine-grained entity type: from each coarse-grained type, 60% of the fine-grained entity classes are randomly selected for the training set, 20% for the validation set and 20% for the test set;

The contents of the Few-NERD (SUP), Few-NERD (INTRA) and Few-NERD (INTER) data sets are shown in Table 1;

Table 1. Contents of the three data sets

S2, performing N-way-K-shot training, using the StructShot model throughout:

N-way-K-shot is performed on the INTER set with custom parameters, dividing the support set, query set, Etrain, Edev and Etest;

N-way-K-shot is performed on the INTRA set with custom parameters, dividing the support set, query set, Etrain, Edev and Etest;

S3, calculating the prototype of each type;

S4, predicting categories in the support set using the Gaussian kernel function;

S5, predicting the labels of unknown data in the query set;

S6, calculating the loss, recall, precision and F value on the Etest set;
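The metric calculation in S6 can be sketched as follows. This is a minimal illustration that assumes token-level scoring with "O" as the non-entity label; the text does not state the scoring granularity, and the function name and toy labels are made up for the example.

```python
def precision_recall_f(gold, pred, negative="O"):
    """Token-level precision, recall and F value.
    Assumption: a token counts as positive when its label is not `negative`."""
    pairs = list(zip(gold, pred))
    # Predicted an entity type and it matches the gold label.
    tp = sum(1 for g, p in pairs if p != negative and p == g)
    # Predicted an entity type that does not match the gold label.
    fp = sum(1 for g, p in pairs if p != negative and p != g)
    # Gold label is an entity type that the prediction missed.
    fn = sum(1 for g, p in pairs if g != negative and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value

# Toy example: five tokens with gold and predicted labels.
gold = ["PER", "O", "LOC", "O", "PER"]
pred = ["PER", "LOC", "O", "O", "LOC"]
p, r, f = precision_recall_f(gold, pred)
print(p, r, f)
```

In the toy example one entity token is predicted correctly, two predicted entity labels are wrong and two gold entity tokens are missed, so precision, recall and the F value all equal 1/3.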

Experimental data obtained with the method of the invention on the Few-NERD (INTRA) and Few-NERD (INTER) data sets are shown in Table 2 and Table 3, respectively.

Table 2. Experimental data on the INTRA data set using the method of the invention

Table 3. Experimental data on the INTER data set using the method of the invention

The Gaussian function-based small-sample named entity recognition method is applied to the INTER data set: in training the 10-way-1-1 model, the precision reaches 0.0641, the recall reaches 0.0322 and the F value reaches 0.0429; in training the 10-way-5-1 model, the precision reaches 0.0448, the recall reaches 0.0179 and the F value reaches 0.0256. During small-sample learning, applying the Gaussian kernel function greatly reduces the training loss of the model, so that predictions on the training data are as close as possible to the actual labels and the training of the model is strengthened. Compared with the Euclidean distance, the performance of the model in the 10-way test is greatly improved: the model predicts positive cases more accurately and recognizes them more reliably.
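The reported F values can be checked against the reported precision and recall with the usual harmonic-mean definition, F = 2PR / (P + R). The snippet below is only an arithmetic check under that assumed definition; the variable names are illustrative.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of the F value)."""
    return 2 * precision * recall / (precision + recall)

# Figures quoted for the INTER data set.
f_first = f_value(0.0641, 0.0322)   # 10-way-1-1 setting
f_second = f_value(0.0448, 0.0179)  # 10-way-5-1 setting
print(round(f_first, 4), round(f_second, 4))
```

Both results round to the quoted values (0.0429 and 0.0256), which confirms that the three figures in each setting are precision, recall and their harmonic mean.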

In summary, in the Gaussian function-based method and system for small-sample named entity recognition on text data sets, the text data set is first divided into an Etrain set, an Edev set and an Etest set; the Etrain and Edev sets are then divided a second time, each into a support set and a query set; the model is learned on the Etrain set, and the mean of the token embeddings sharing the same type is calculated in the support set to obtain a prototype for each entity type; the Gaussian function value between each token of the support set in the Etrain set and each prototype is calculated; for the query set in the Etrain set, the labels of unknown data are predicted with the model learned on the support set to obtain predicted labels; it is then judged whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, prediction is run with the model on the Edev set and the parameters are adjusted to determine the trained model; finally, unknown data is predicted on the Etest set with the determined parameters and the learned model, and the recall and precision of the model are compared, completing small-sample named entity recognition on the text data set. By using a Gaussian function instead of a distance function to compute the similarity between a variable and a prototype, the invention improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, and reduces the influence of distribution differences between training data and test data on the prediction result. Experimental results show that the precision, recall and F value of the optimized model are improved remarkably.

The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A Gaussian function-based method for small-sample named entity recognition on text data sets, characterized by comprising the following steps,

2. The Gaussian function-based method for small-sample named entity recognition on text data sets according to claim 1, characterized in that: in step A, the text data set is divided into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set; the Etrain, Edev and Etest sets are mutually disjoint subsets of the text data set, the training set Etrain is used to learn the classification method, the validation set Edev is used to adjust the model parameters, and the test set Etest is used to test how well the model generalizes to unknown data.

3. The Gaussian function-based method for small-sample named entity recognition on text data sets according to claim 2, characterized in that: in step C, the model is learned on the Etrain set, and the mean of the token embeddings sharing the same type is calculated in the support set to obtain a prototype z for each entity type, wherein for the i-th type the prototype is z_i and the support set is s_i, and the relation between the prototype z_i and the support set s_i is shown in formula (1),

h = [h₁, ..., h_n] = f_θ([x₁, ..., x_n]) (2)

4. The Gaussian function-based method for small-sample named entity recognition on text data sets according to claim 3, characterized in that: in step D, the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z is calculated, as follows,

where ‖x − x′‖² is the distance measure.

5. The Gaussian function-based method for small-sample named entity recognition on text data sets according to claim 4, characterized in that: in step F, it is judged whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, the next step is performed, wherein for the support sets s_y with type value set Y and a variable x, the judging process is shown in formula (5),

y* = arg min G_y(x), y ∈ Y (5).

6. The Gaussian function-based method for small-sample named entity recognition on text data sets according to claim 5, characterized in that: in step H, data labels are predicted on the Etest set with the determined parameters and the learned model, and the recall and precision of the model are compared, completing small-sample named entity recognition on the text data set, as shown in formula (6), formula (7) and formula (8).

7. A Gaussian function-based system for small-sample named entity recognition on text data sets, the recognition process of the system being based on the method according to any one of claims 1-6, characterized in that: the system comprises a data set preliminary division module, a data set secondary division module, a prototype obtaining module, a Gaussian function value calculation module, a predicted label obtaining module, a label judging module, a model building module and a named entity recognition module, wherein the data set preliminary division module is used for dividing the text data set into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set;

the data set secondary division module is used for dividing the Etrain set and the Edev set a second time, each into a support set and a query set;

the prototype obtaining module is used for learning the model on the Etrain set and calculating, in the support set, the mean of the token embeddings sharing the same type to obtain a prototype z for each entity type;

the Gaussian function value calculation module is used for calculating the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z;

the predicted label obtaining module is used for predicting, for the query set in the Etrain set, the data labels with the model learned on the support set to obtain predicted labels;

the label judging module is used for judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, the next step is performed;

the model building module is used for running prediction with the model on the Edev set, adjusting the parameters A and μ, and determining the trained model;

the named entity recognition module is used for predicting data labels on the Etest set with the determined parameters and the learned model and comparing the recall and precision of the model, completing small-sample named entity recognition on the text data set.