CN118114671A - Gaussian function-based text data set small sample named entity recognition method and system - Google Patents

Info

Publication number
CN118114671A
CN118114671A CN202410237440.2A CN202410237440A CN118114671A CN 118114671 A CN118114671 A CN 118114671A CN 202410237440 A CN202410237440 A CN 202410237440A CN 118114671 A CN118114671 A CN 118114671A
Authority
CN
China
Prior art keywords
sets
etrain
model
prototype
gaussian function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410237440.2A
Other languages
Chinese (zh)
Other versions
CN118114671B (en)
Inventor
陈奕阳
黄佳佳
李鹏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING AUDIT UNIVERSITY
Original Assignee
NANJING AUDIT UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING AUDIT UNIVERSITY
Priority to CN202410237440.2A
Publication of CN118114671A
Application granted
Publication of CN118114671B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Gaussian function-based method and system for small-sample (few-shot) named entity recognition on text data sets. The text data set is first divided into an Etrain set, an Edev set and an Etest set; the Etrain and Edev sets are then each divided a second time into a support set and a query set. A model is learned on the Etrain set, and the mean of the token embeddings that share the same type is computed in the support set to obtain a prototype for each entity type; the Gaussian function value between each token of the support set in the Etrain set and each prototype is then computed. By using a Gaussian function in place of a distance function to compute the similarity between a variable and a prototype, the invention improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, and reduces the influence of distribution differences between training data and test data on the prediction result, making it suitable for wide adoption.

Description

Gaussian function-based text data set small sample named entity recognition method and system
Technical Field
The invention relates to the technical field of text data set recognition, in particular to a method and a system for recognizing named entities of small samples of text data sets based on a Gaussian function.
Background
NER is often formulated as a sequence labeling problem, and many approaches built on deep neural networks have achieved great success, which depends largely on large amounts of training data. In practice, however, sufficient sample data often cannot be obtained, or there is not enough manpower and time to label unlabeled data manually. Small-sample learning aims to let a machine learn, as humans do, to solve problems from a small number of samples: categories that have already been learned can help predict new categories that have only one or a few labeled samples.
Currently, commonly used text data sets for named entity recognition include OntoNotes, CoNLL'03 and WNUT'17, which face two challenges: 1. the samples are insufficient; 2. no comparison can be made because a unified benchmark data set is lacking. The data set of "Few-NERD: A Few-shot Named Entity Recognition Dataset" balances the data by selecting paragraphs through a distant dictionary; it uses a nearest-neighbor approach to compute the distance between a variable x and each prototype and predicts, from that distance, the likelihood that x falls into each category. However, the model has poor noise resistance, the classification result is strongly affected by the choice of prototype points, the prediction accuracy for unevenly distributed samples is low, and only coarse classification is possible. A Gaussian function-based method and system for small-sample named entity recognition on text data sets is therefore needed.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art. To address the problems that existing approaches have poor noise resistance, that the classification result is strongly affected by the choice of prototype points, that prediction accuracy is low for unevenly distributed samples, and that only coarse classification is possible, the invention provides a method and system for small-sample named entity recognition on text data sets based on a Gaussian function. A Gaussian function is used in place of a distance function to compute the similarity between a variable and a prototype, which improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, and reduces the influence of distribution differences between training data and test data on the prediction result.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A method for small-sample named entity recognition on a text data set based on a Gaussian function comprises the following steps:
Step A, dividing the text data set into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set;
Step B, dividing the Etrain set and the Edev set a second time, each into a support set and a query set;
Step C, learning the model on the Etrain set, and computing in the support set the mean of the token embeddings that share the same type, to obtain a prototype z for each entity type;
Step D, computing the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z;
Step E, for the query set in the Etrain set, predicting the labels of the unknown data using the model learned on the support set, to obtain predicted labels;
Step F, judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, proceeding to the next step;
Step G, running prediction with the model on the Edev set, adjusting the parameters A and μ, and determining the trained model;
Step H, predicting the data labels on the Etest set using the determined parameters and the learned model, and comparing the recall and precision of the model, thereby completing small-sample named entity recognition on the text data set.
In the foregoing method, in Step A the text data set is divided into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set; the Etrain, Edev and Etest sets are mutually disjoint subsets of the text data set; the training set Etrain is used for learning the classification method, the validation set Edev is used for adjusting the model parameters, and the test set Etest is used for testing the generalization ability of the model on unknown data.
In the foregoing method, in Step C the model is learned on the Etrain set, and the mean of the token embeddings that share the same type is computed in the support set to obtain a prototype z for each entity type, wherein for the i-th type the prototype is z_i and the support set is s_i, and the relation between prototype z_i and support set s_i is shown in formula (1),
$z_i = \frac{1}{|s_i|} \sum_{x \in s_i} f_\theta(x)$ (1)
where f_θ is the encoder. For a sequence x = {x_1, x_2, x_3, …, x_n}, the encoder produces a representation vector for each token x_i of each sentence, as shown in formula (2),
$h = [h_1, \ldots, h_n] = f_\theta([x_1, \ldots, x_n])$ (2)
where a token is the smallest semantic unit in the text and h is the representation generated by the encoder for the given sequence.
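The prototype computation of Step C can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the encoder f_θ has already produced one embedding vector per token, and the function name `compute_prototypes`, the toy vectors and the labels "PER" and "LOC" are hypothetical.

```python
from collections import defaultdict


def compute_prototypes(embeddings, labels):
    """Return {entity type: mean embedding} over the support set.

    `embeddings` is a list of equal-length float lists (assumed to be the
    encoder outputs h_i); `labels` gives the entity type of each token.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        # Accumulate the component-wise sum for this entity type.
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    # Divide each sum by the number of tokens of that type.
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


# Toy support set: two tokens of one type, one token of another.
support_h = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
support_y = ["PER", "PER", "LOC"]
prototypes = compute_prototypes(support_h, support_y)
```

Each prototype is simply the centroid of the support embeddings of its type, so a type with a single support token has that token's embedding as its prototype.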
In the foregoing method, in Step D the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z is computed, specifically as follows,
Step D1, for each x_i in the support set, computing its Gaussian function value with each prototype, as shown in formula (3),
$G_i(x) = A \exp\left(-\frac{(x - \mu(z_i))^2}{2\sigma^2}\right)$ (3)
where x is the independent variable, A is the amplitude, which adjusts the peak of the Gaussian curve, μ(z_i) is the mean, which represents the center of the Gaussian curve, and σ is the standard deviation, which determines the width of the curve; A and σ are adjustable parameters and z_i is the prototype of the i-th type; a distance measure is used to predict the degree of similarity between the variable and each prototype, so that the distance measurement becomes a probability comparison;
Step D2, because the Euclidean distance and the Gaussian function are applied together, the Gaussian kernel function is used, as shown in formula (4),
$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ (4)
where ‖x − x'‖² is the distance measure.
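The two functions described in Step D can be sketched as below. This is an illustrative reading of the text, assuming the standard Gaussian form with amplitude A, center μ and standard deviation σ, and the standard Gaussian (RBF) kernel over the squared Euclidean distance; the function names and default parameter values are hypothetical.

```python
import math


def gaussian(x, mu, amplitude=1.0, sigma=1.0):
    """Scalar Gaussian: amplitude * exp(-(x - mu)^2 / (2 * sigma^2))."""
    return amplitude * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def gaussian_kernel(x, x_prime, sigma=1.0):
    """Gaussian kernel between two vectors, based on squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


peak = gaussian(0.0, 0.0, amplitude=2.0)          # value at the center equals A
near = gaussian_kernel([0.0, 0.0], [1.0, 1.0])    # squared distance 2
far = gaussian_kernel([0.0, 0.0], [3.0, 3.0])     # squared distance 18
```

The kernel value decreases monotonically with distance and stays in (0, 1], which is what lets a distance comparison be read as a similarity comparison.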
In the foregoing method, in Step F it is judged whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, the next step is performed, wherein the decision for the variable x, given the support sets s_y with type set Y, is shown in formula (5),
$y^* = \arg\max_{y \in Y} G_y(x)$ (5).
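A minimal sketch of the Step F check is given below. It follows the wording of the step, which selects the prototype with the maximum Gaussian function value, and assumes a Gaussian of the squared Euclidean distance to each prototype; the helper names, the prototypes and the query vector are hypothetical.

```python
import math


def gaussian_score(x, prototype, amplitude=1.0, sigma=1.0):
    """Gaussian function value between a token embedding and a prototype."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return amplitude * math.exp(-sq_dist / (2.0 * sigma ** 2))


def best_prototype_label(x, prototypes, amplitude=1.0, sigma=1.0):
    """Label of the prototype whose Gaussian value for x is largest."""
    return max(
        prototypes,
        key=lambda lab: gaussian_score(x, prototypes[lab], amplitude, sigma),
    )


def passes_step_f(predicted_label, x, prototypes):
    """True when the model's prediction agrees with the best-scoring prototype."""
    return predicted_label == best_prototype_label(x, prototypes)


prototypes = {"PER": [2.0, 1.0], "LOC": [0.0, 4.0]}
x = [1.5, 1.0]
label = best_prototype_label(x, prototypes)
```

Because the Gaussian value is a decreasing function of distance, taking the maximum Gaussian value picks the same prototype as taking the minimum distance when A and σ are shared across types.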
In the foregoing method, in Step H the data labels are predicted on the Etest set using the determined parameters and the learned model, and the recall and precision of the model are compared, completing small-sample named entity recognition on the text data set, as shown in formula (6), formula (7) and formula (8),
$\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}}$ (6)
$\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}$ (7)
$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (8)
where True Positives is the number of positive cases that the model correctly predicts as positive, False Positives is the number of negative cases that the model incorrectly predicts as positive, and False Negatives is the number of positive cases that the model incorrectly predicts as negative.
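The evaluation in Step H can be sketched from raw counts as follows. The sketch assumes the usual definitions of precision, recall and the F value as their harmonic mean; returning 0.0 when a denominator is zero is a convention chosen for this example, and the counts are made up.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F value) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_value = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts: 8 correct entity predictions, 2 spurious, 8 missed.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
```

With these counts the precision is 8/10, the recall is 8/16 and the F value is 0.8/1.3.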
A system for small-sample named entity recognition on a text data set based on a Gaussian function comprises a data set preliminary division module, a data set secondary division module, a prototype obtaining module, a Gaussian function value calculation module, a predicted label obtaining module, a label judging module, a model building module and a named entity recognition module, wherein the data set preliminary division module is used for dividing the text data set into an Etrain set, an Edev set and an Etest set, the Etrain set being the training set, the Edev set the validation set and the Etest set the test set; the data set secondary division module is used for dividing the Etrain set and the Edev set a second time, each into a support set and a query set; the prototype obtaining module is used for learning the model on the Etrain set and computing in the support set the mean of the token embeddings that share the same type, to obtain a prototype z for each entity type; the Gaussian function value calculation module is used for computing the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z; the predicted label obtaining module is used for predicting, for the query set in the Etrain set, the data labels using the model learned on the support set, to obtain predicted labels; the label judging module is used for judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, proceeding to the next step; the model building module is used for running prediction with the model on the Edev set, adjusting the parameters A and μ, and determining the trained model; the named entity recognition module is used for predicting the data labels on the Etest set using the determined parameters and the learned model, and comparing the recall and precision of the model, to complete small-sample named entity recognition on the text data set.
The beneficial effects of the invention are as follows. In the method and system for small-sample named entity recognition on a text data set based on a Gaussian function, the text data set is first divided into an Etrain set, an Edev set and an Etest set, and the Etrain and Edev sets are each divided a second time into a support set and a query set. The model is learned on the Etrain set, and the mean of the token embeddings that share the same type is computed in the support set to obtain a prototype for each entity type; the Gaussian function value between each token of the support set in the Etrain set and each prototype is then computed. For the query set in the Etrain set, the labels of the unknown data are predicted using the model learned on the support set, and it is judged whether each predicted label is the same as the label of the prototype with the maximum Gaussian function value; if so, prediction is run with the model on the Edev set, the parameters are adjusted, and the trained model is determined. Finally, the unknown data are predicted on the Etest set using the determined parameters and the learned model, and the recall and precision of the model are compared, completing small-sample named entity recognition on the text data set. By using a Gaussian function in place of a distance function to compute the similarity between a variable and a prototype, the invention improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, and reduces the influence of distribution differences between training data and test data on the prediction result.
Drawings
FIG. 1 is a flow chart of the method and system for small-sample named entity recognition on a text data set based on a Gaussian function according to the invention;
FIG. 2 is a comparison of precision before and after applying the proposed method on the INTER data set according to an embodiment of the invention;
FIG. 3 is a comparison of recall before and after applying the proposed method on the INTER data set according to an embodiment of the invention;
FIG. 4 is a comparison of F values before and after applying the proposed method on the INTER data set according to an embodiment of the invention;
FIG. 5 is a comparison of precision before and after applying the proposed method on the INTRA data set according to an embodiment of the invention;
FIG. 6 is a comparison of recall before and after applying the proposed method on the INTRA data set according to an embodiment of the invention;
FIG. 7 is a comparison of F values before and after applying the proposed method on the INTRA data set according to an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the drawings.
As shown in FIG. 1, the method for small-sample named entity recognition on a text data set based on a Gaussian function comprises the following steps,
Step A, dividing the text data set into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set; the Etrain, Edev and Etest sets are mutually disjoint subsets of the text data set; the training set Etrain is used for learning the classification method, the validation set Edev is used for adjusting the model parameters, and the test set Etest is used for testing the generalization ability of the model on unknown data;
Step B, performing secondary division on Etrain sets and Edev sets, and respectively setting a support set and a query set;
Step C, learning the model on the Etrain set, and computing in the support set the mean of the token embeddings that share the same type to obtain a prototype z for each entity type, wherein for the i-th type the prototype is z_i and the support set is s_i, and the relation between prototype z_i and support set s_i is shown in formula (1),
$z_i = \frac{1}{|s_i|} \sum_{x \in s_i} f_\theta(x)$ (1)
where f_θ is the encoder. For a sequence x = {x_1, x_2, x_3, …, x_n}, the encoder produces a representation vector for each token x_i of each sentence, as shown in formula (2),
$h = [h_1, \ldots, h_n] = f_\theta([x_1, \ldots, x_n])$ (2)
where a token is the smallest semantic unit in the text and h is the representation generated by the encoder for the given sequence;
Step D, computing the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z, specifically as follows,
Step D1, for each x_i in the support set, computing its Gaussian function value with each prototype, as shown in formula (3),
$G_i(x) = A \exp\left(-\frac{(x - \mu(z_i))^2}{2\sigma^2}\right)$ (3)
where x is the independent variable, A is the amplitude, which adjusts the peak of the Gaussian curve, μ(z_i) is the mean, which represents the center of the Gaussian curve, and σ is the standard deviation, which determines the width of the curve; A and σ are adjustable parameters and z_i is the prototype of the i-th type; a distance measure is used to predict the degree of similarity between the variable and each prototype, so that the distance measurement becomes a probability comparison;
Step D2, because the Euclidean distance and the Gaussian function are applied together, the Gaussian kernel function is used, as shown in formula (4),
$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ (4)
where ‖x − x'‖² is the distance measure;
Step E, for the query set in the Etrain set, predicting the labels of the unknown data using the model learned on the support set, to obtain predicted labels;
Step F, judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, proceeding to the next step, wherein the decision for the variable x, given the support sets s_y with type set Y, is shown in formula (5),
$y^* = \arg\max_{y \in Y} G_y(x)$ (5);
Step G, running prediction with the model on the Edev set, adjusting the parameters A and μ, and determining the trained model;
Step H, predicting the data labels on the Etest set using the determined parameters and the learned model, and comparing the recall and precision of the model, completing small-sample named entity recognition on the text data set, as shown in formula (6), formula (7) and formula (8),
$\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}}$ (6)
$\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}$ (7)
$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (8)
where True Positives is the number of positive cases that the model correctly predicts as positive, False Positives is the number of negative cases that the model incorrectly predicts as positive, and False Negatives is the number of positive cases that the model incorrectly predicts as negative.
A system for small-sample named entity recognition on a text data set based on a Gaussian function comprises a data set preliminary division module, a data set secondary division module, a prototype obtaining module, a Gaussian function value calculation module, a predicted label obtaining module, a label judging module, a model building module and a named entity recognition module, wherein the data set preliminary division module is used for dividing the text data set into an Etrain set, an Edev set and an Etest set, the Etrain set being the training set, the Edev set the validation set and the Etest set the test set; the data set secondary division module is used for dividing the Etrain set and the Edev set a second time, each into a support set and a query set; the prototype obtaining module is used for learning the model on the Etrain set and computing in the support set the mean of the token embeddings that share the same type, to obtain a prototype z for each entity type; the Gaussian function value calculation module is used for computing the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z; the predicted label obtaining module is used for predicting, for the query set in the Etrain set, the data labels using the model learned on the support set, to obtain predicted labels; the label judging module is used for judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, proceeding to the next step; the model building module is used for running prediction with the model on the Edev set, adjusting the parameters A and μ, and determining the trained model; the named entity recognition module is used for predicting the data labels on the Etest set using the determined parameters and the learned model, and comparing the recall and precision of the model, to complete small-sample named entity recognition on the text data set.
To better illustrate the effect of the invention, a specific embodiment is described below.
S1, using three benchmark partitions, Etrain, Edev and Etest;
Few-NERD (SUP) is the standard supervised setting: the whole corpus is randomly sampled, with 70% used as the training set, 10% as the validation set and 20% as the test set, and all three sets contain the 66 fine-grained entity classes;
Few-NERD (INTRA) is divided by coarse-grained entity type: the training set contains People, MISC, Art and Product, the validation set contains Event and Building, and the test set contains ORG and LOC;
Few-NERD (INTER) is divided by fine-grained type: within each coarse-grained class, 60% of the fine-grained entity classes are randomly selected for the training set, 20% for the validation set and 20% for the test set;
The contents of the Few-NERD (SUP), Few-NERD (INTRA) and Few-NERD (INTER) data sets are shown in Table 1;
Table 1. Contents of the three data sets
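The INTER-style division described in S1 can be sketched as below: within every coarse-grained class, the fine-grained classes are shuffled and split roughly 60/20/20. This is only an illustration of the idea; the function name, the rounding rule, the fixed seed and the example type names are assumptions and are not taken from the patent or from the Few-NERD release.

```python
import random


def split_fine_types(fine_by_coarse, seed=0):
    """Split fine-grained types into disjoint train/dev/test sets per coarse class."""
    rng = random.Random(seed)
    train, dev, test = set(), set(), set()
    for coarse in sorted(fine_by_coarse):
        fine = sorted(fine_by_coarse[coarse])
        rng.shuffle(fine)
        n_train = round(0.6 * len(fine))
        n_dev = round(0.2 * len(fine))
        train.update(fine[:n_train])
        dev.update(fine[n_train:n_train + n_dev])
        test.update(fine[n_train + n_dev:])  # remainder, about 20%
    return train, dev, test


# Hypothetical taxonomy with five fine-grained types per coarse class.
taxonomy = {
    "person": ["person-a", "person-b", "person-c", "person-d", "person-e"],
    "location": ["location-a", "location-b", "location-c", "location-d", "location-e"],
}
etrain_types, edev_types, etest_types = split_fine_types(taxonomy)
```

Because every coarse class contributes to all three sets, the three partitions share coarse-grained information while their fine-grained types stay disjoint, which is the distinguishing feature of the INTER setting compared with INTRA.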
S2, N-way-K-shot training is carried out, uniformly using the StructShot model,
N-way-K-shot is carried out on the INTER set with customized parameters, dividing the support set, the query set, Etrain, Edev and Etest;
N-way-K-shot is carried out on the INTRA set with customized parameters, dividing the support set, the query set, Etrain, Edev and Etest;
S3, computing the prototype of each type;
S4, predicting categories in the support set using the Gaussian kernel function;
S5, predicting the labels of the unknown data in the query set;
S6, computing the loss, recall, precision and F value on the Etest set;
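The division into support and query sets in S2 can be illustrated with a simplified N-way-K-shot episode sampler. This sketch draws individual items per class; sampling whole sentences, as a token-level NER data set would normally require, is not attempted here. The function name, the `q_query` parameter and the toy data are hypothetical.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, q_query, seed=None):
    """Draw one episode: n_way classes, k_shot support and q_query query items per class."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for c in classes:
        # Sample without replacement so support and query never overlap.
        items = rng.sample(data_by_class[c], k_shot + q_query)
        support += [(x, c) for x in items[:k_shot]]
        query += [(x, c) for x in items[k_shot:]]
    return support, query


# Toy data: four classes with five items each.
data = {c: [f"{c}_{i}" for i in range(5)] for c in "ABCD"}
support, query = sample_episode(data, n_way=2, k_shot=1, q_query=2, seed=0)
```

Prototypes would then be computed from `support` only, and `query` would supply the unknown items whose labels are predicted.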
The experimental data obtained with the method of the invention on the Few-NERD (INTRA) and Few-NERD (INTER) data sets are shown in Table 2 and Table 3, respectively.
Table 2. Experimental data on the INTRA data set using the method proposed by the invention
Table 3. Experimental data on the INTER data set using the method proposed by the invention
When the Gaussian function-based small-sample named entity recognition method is used on the INTER data set: in training of the 10-way-1-1 model, the precision reaches 0.0641, the recall reaches 0.0322 and the F value reaches 0.0429; in training of the 10-way-5-1 model, the precision reaches 0.0448, the recall reaches 0.0179 and the F value reaches 0.0256. In small-sample learning, applying the Gaussian kernel function greatly reduces the loss of the model during training, so that predictions on the training data are as close as possible to the actual labels and training is strengthened; compared with the Euclidean distance, the performance of the model in the 10-way test is greatly improved, its predictions of positive cases are more accurate, and its ability to recognize positive cases is stronger.
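As a quick consistency check on the figures above, and assuming the F value is the harmonic mean of precision and recall (the usual F1 definition, which is an assumption here), the reported values agree with each other:

```latex
F_{10\text{-way-}1\text{-}1} = \frac{2 \times 0.0641 \times 0.0322}{0.0641 + 0.0322} = \frac{0.00412804}{0.0963} \approx 0.0429
\qquad
F_{10\text{-way-}5\text{-}1} = \frac{2 \times 0.0448 \times 0.0179}{0.0448 + 0.0179} = \frac{0.00160384}{0.0627} \approx 0.0256
```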
In summary, in the method and system for small-sample named entity recognition on a text data set based on a Gaussian function, the text data set is first divided into an Etrain set, an Edev set and an Etest set, and the Etrain and Edev sets are each divided a second time into a support set and a query set. The model is learned on the Etrain set, and the mean of the token embeddings that share the same type is computed in the support set to obtain a prototype for each entity type; the Gaussian function value between each token of the support set in the Etrain set and each prototype is then computed. For the query set in the Etrain set, the labels of the unknown data are predicted using the model learned on the support set, and it is judged whether each predicted label is the same as the label of the prototype with the maximum Gaussian function value; if so, prediction is run with the model on the Edev set, the parameters are adjusted, and the trained model is determined. Finally, the unknown data are predicted on the Etest set using the determined parameters and the learned model, and the recall and precision of the model are compared, completing small-sample named entity recognition on the text data set. By using a Gaussian function in place of a distance function to compute the similarity between a variable and a prototype, the invention improves the recognition rate and robustness of the model, improves tolerance to noise, outliers and missing data, and reduces the influence of distribution differences between training data and test data on the prediction result. Experimental results show that the precision, recall and F value of the optimized model are improved remarkably.
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A method for small-sample named entity recognition on a text data set based on a Gaussian function, characterized by comprising the following steps:
Step A, dividing the text data set into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set;
Step B, dividing the Etrain set and the Edev set a second time, each into a support set and a query set;
Step C, learning the model on the Etrain set, and computing in the support set the mean of the token embeddings that share the same type, to obtain a prototype z for each entity type;
Step D, computing the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z;
Step E, for the query set in the Etrain set, predicting the labels of the unknown data using the model learned on the support set, to obtain predicted labels;
Step F, judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, proceeding to the next step;
Step G, running prediction with the model on the Edev set, adjusting the parameters A and μ, and determining the trained model;
Step H, predicting the data labels on the Etest set using the determined parameters and the learned model, and comparing the recall and precision of the model, thereby completing small-sample named entity recognition on the text data set.
2. The method for small-sample named entity recognition on a text data set based on a Gaussian function according to claim 1, characterized in that: in Step A, the text data set is divided into an Etrain set, an Edev set and an Etest set, wherein the Etrain set is the training set, the Edev set is the validation set and the Etest set is the test set; the Etrain, Edev and Etest sets are mutually disjoint subsets of the text data set; the training set Etrain is used for learning the classification method, the validation set Edev is used for adjusting the model parameters, and the test set Etest is used for testing the generalization ability of the model on unknown data.
3. The method for small-sample named entity recognition on a text data set based on a Gaussian function according to claim 2, characterized in that: in Step C, the model is learned on the Etrain set, and the mean of the token embeddings that share the same type is computed in the support set to obtain a prototype z for each entity type, wherein for the i-th type the prototype is z_i and the support set is s_i, and the relation between prototype z_i and support set s_i is shown in formula (1),
$z_i = \frac{1}{|s_i|} \sum_{x \in s_i} f_\theta(x)$ (1)
where f_θ is the encoder; for a sequence x = {x_1, x_2, x_3, …, x_n}, the encoder produces a representation vector for each token x_i of each sentence, as shown in formula (2),
$h = [h_1, \ldots, h_n] = f_\theta([x_1, \ldots, x_n])$ (2)
where a token is the smallest semantic unit in the text and h is the representation generated by the encoder for the given sequence.
4. The method for small-sample named entity recognition on a text data set based on a Gaussian function according to claim 3, characterized in that: in Step D, the Gaussian function value between each token x_i of the support set in the Etrain set and each prototype z is computed, specifically as follows,
Step D1, for each x_i in the support set, computing its Gaussian function value with each prototype, as shown in formula (3),
$G_i(x) = A \exp\left(-\frac{(x - \mu(z_i))^2}{2\sigma^2}\right)$ (3)
where x is the independent variable, A is the amplitude, which adjusts the peak of the Gaussian curve, μ(z_i) is the mean, which represents the center of the Gaussian curve, and σ is the standard deviation, which determines the width of the curve; A and σ are adjustable parameters and z_i is the prototype of the i-th type; a distance measure is used to predict the degree of similarity between the variable and each prototype, so that the distance measurement becomes a probability comparison;
Step D2, because the Euclidean distance and the Gaussian function are applied together, the Gaussian kernel function is used, as shown in formula (4),
$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ (4)
where ‖x − x'‖² is the distance measure.
5. The method for identifying named entities in small samples of a text dataset based on a gaussian function according to claim 4, wherein: step F, judging whether the predicted label is the same as the prototype label with the maximum Gaussian function value, if so, performing the next step, wherein the judging process of the support set s y with the type value set Y and the variable x is shown in a formula (5),
$$y^* = \arg\max_{y \in Y} G_y(x) \tag{5}$$
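The label decision of step F can be sketched in Python as below. Following the claim's wording ("the prototype label with the maximum Gaussian function value"), the sketch picks the label whose prototype gives the largest Gaussian value; with a positive amplitude and a shared sigma this is the same as picking the prototype at minimum distance. The prototypes, the query vector and the function names are hypothetical.

```python
import math


def gaussian_value(x, prototype, amplitude=1.0, sigma=1.0):
    """Gaussian similarity between a token vector and a prototype (assumed kernel form)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return amplitude * math.exp(-sq / (2.0 * sigma ** 2))


def predict_label(x, prototypes, amplitude=1.0, sigma=1.0):
    """Return the label whose prototype has the largest Gaussian value, plus all scores."""
    scores = {
        label: gaussian_value(x, z, amplitude, sigma)
        for label, z in prototypes.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical prototypes for three entity types.
prototypes = {"PER": [2.0, 1.0], "LOC": [0.0, 5.0], "O": [2.0, 2.0]}
label, scores = predict_label([1.8, 0.9], prototypes)
print(label, scores)
```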
6. The Gaussian function-based method for small-sample named entity recognition in a text data set according to claim 5, wherein: in step H, the data labels are predicted on the Etest set using the determined parameters and the learned model, and the recall and precision of the model are compared, completing the small-sample named entity recognition operation on the text data set, as shown in formula (6), formula (7) and formula (8),
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{6}$$
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{7}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
where True Positives is the number of positive cases that the model correctly predicts as positive, False Positives is the number of negative cases that the model incorrectly predicts as positive, and False Negatives is the number of positive cases that the model incorrectly predicts as negative.
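The evaluation in step H can be sketched in Python as follows. The sketch assumes that the three formulas are the usual precision, recall and F1 score and that scoring is done at the entity level, where an entity counts as correct only if its span and type both match. The `(start, end, type)` tuple layout and the example entities are hypothetical.

```python
def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level precision, recall and F1 from sets of (start, end, type) tuples."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    tp = len(gold & predicted)   # predicted entities that match a gold entity
    fp = len(predicted - gold)   # predicted entities with no gold match
    fn = len(gold - predicted)   # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = {(0, 1, "PER"), (3, 4, "LOC"), (6, 6, "ORG")}
predicted = {(0, 1, "PER"), (3, 4, "ORG"), (6, 6, "ORG"), (8, 8, "LOC")}
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

In this toy case two of the four predictions are correct and one gold entity is missed, so precision is lower than recall; the guards return 0.0 instead of dividing by zero when nothing is predicted or nothing is annotated.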
7. A Gaussian function-based system for small-sample named entity recognition in a text data set, the recognition process of which is based on the recognition method according to any one of claims 1-6, characterized in that: the system comprises a data set preliminary division module, a data set secondary division module, a prototype obtaining module, a Gaussian function value calculation module, a prediction label obtaining module, a label judging module, a model building module and a named entity recognition module, wherein the data set preliminary division module is used for dividing the text data set into an Etrain set, an Edev set and an Etest set, the Etrain set being the training set, the Edev set the validation set and the Etest set the test set;
The data set secondary division module is used for dividing the Etrain set and the Edev set a second time, giving each a support set and a query set;
The prototype obtaining module is used for learning the model on the Etrain set and calculating the mean of the token embeddings sharing the same type in the support set, to obtain a prototype $z$ for each entity type;
The Gaussian function value calculation module is used for calculating the Gaussian function value between each token $x_i$ of the support set in the Etrain set and each prototype $z$;
The prediction label obtaining module is used for predicting the data labels of the query set in the Etrain set with the model learned on the support set, to obtain the predicted labels;
The label judging module is used for judging whether the predicted label is the same as the label of the prototype with the maximum Gaussian function value, and if so, the next step is carried out;
The model building module is used for running prediction with the model on the Edev set, adjusting the parameters $A$ and $\mu$, and determining the trained model;
The named entity recognition module is used for predicting the data labels on the Etest set using the determined parameters and the learned model, and comparing the recall and precision of the model, to complete the small-sample named entity recognition operation on the text data set.
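The two division modules can be sketched in Python as below. The claims do not state how the division is performed, so the random sentence-level shuffle, the 80/10/10 and 50/50 fractions, the fixed seed and both function names are assumptions made purely for illustration.

```python
import random


def split_dataset(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Preliminary division into Etrain, Edev and Etest (fractions are assumed)."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_dev = int(len(items) * dev_frac)
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]


def split_support_query(subset, support_frac=0.5, seed=0):
    """Secondary division of one subset into a support set and a query set."""
    items = list(subset)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * support_frac)
    return items[:cut], items[cut:]


sentences = [f"sentence {i}" for i in range(20)]
etrain, edev, etest = split_dataset(sentences)
train_support, train_query = split_support_query(etrain)
dev_support, dev_query = split_support_query(edev)
print(len(etrain), len(edev), len(etest), len(train_support), len(train_query))
```

Only the Etrain and Edev sets receive a support/query division, matching the secondary division module; the Etest set is kept whole for the final evaluation.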
CN202410237440.2A 2024-03-01 2024-03-01 Gaussian function-based text data set small sample named entity recognition method and system Active CN118114671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410237440.2A CN118114671B (en) 2024-03-01 2024-03-01 Gaussian function-based text data set small sample named entity recognition method and system

Publications (2)

Publication Number Publication Date
CN118114671A (en) 2024-05-31
CN118114671B (en) 2024-08-27

Family

ID=91217607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410237440.2A Active CN118114671B (en) 2024-03-01 2024-03-01 Gaussian function-based text data set small sample named entity recognition method and system

Country Status (1)

Country Link
CN (1) CN118114671B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105223475A (en) * 2015-08-25 2016-01-06 国家电网公司 Based on the shelf depreciation chromatogram characteristic algorithm for pattern recognition of Gaussian parameter matching
CN114547241A (en) * 2022-02-08 2022-05-27 南华大学 Small sample entity identification method and model combining character perception and sentence perception
CN114780720A (en) * 2022-03-29 2022-07-22 河海大学 Text entity relation classification method based on small sample learning
US20220398502A1 (en) * 2021-06-11 2022-12-15 Palo Alto Research Center Incorporated Method and system for creating an ensemble of machine learning models to defend against adversarial examples

Also Published As

Publication number Publication date
CN118114671B (en) 2024-08-27

Similar Documents

Publication Publication Date Title
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
CN107798033B (en) Case text classification method in public security field
CN109993225B (en) Airspace complexity classification method and device based on unsupervised learning
CN111046930A (en) Power supply service satisfaction influence factor identification method based on decision tree algorithm
CN112926045B (en) Group control equipment identification method based on logistic regression model
CN112417132B (en) New meaning identification method for screening negative samples by using guest information
CN110942099A (en) Abnormal data identification and detection method of DBSCAN based on core point reservation
WO2018006631A1 (en) User level automatic segmentation method and system
CN111667135A (en) Load structure analysis method based on typical feature extraction
CN117478390A (en) Network intrusion detection method based on improved density peak clustering algorithm
CN118114671B (en) Gaussian function-based text data set small sample named entity recognition method and system
CN112465016A (en) Partial multi-mark learning method based on optimal distance between two adjacent marks
CN107909090A (en) Learn semi-supervised music-book on pianoforte difficulty recognition methods based on estimating
CN116090449B (en) Entity relation extraction method and system for quality problem analysis report
CN111950652A (en) Semi-supervised learning data classification algorithm based on similarity
CN111367986A (en) Joint information extraction method based on weak supervised learning
CN115688789A (en) Entity relation extraction model training method and system based on dynamic labels
CN116186266A (en) BERT (binary image analysis) and NER (New image analysis) entity extraction and knowledge graph material classification optimization method and system
CN115937616A (en) Training method and system of image classification model and mobile terminal
CN115455934A (en) Method and system for identifying multiple operation ranges of enterprise
CN115186138A (en) Comparison method and terminal for power distribution network data
CN111783788B (en) Multi-label classification method facing label noise
CN110097126B (en) Method for checking important personnel and house missing registration based on DBSCAN clustering algorithm
TWI754446B (en) System and method for maintaining model inference quality
Dai et al. Self-supervised pairing image clustering and its application in cyber manufacturing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant