CN115310449A - Named entity identification method and device based on small sample and related medium - Google Patents

Named entity identification method and device based on small sample and related medium

Info

Publication number
CN115310449A
CN115310449A
Authority
CN
China
Prior art keywords
text
sample set
sample
label
named entity
Prior art date
Legal status
Pending
Application number
CN202211000683.1A
Other languages
Chinese (zh)
Inventor
张黔
王伟
陈焕坤
Current Assignee
China Resources Digital Technology Co Ltd
Original Assignee
China Resources Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Resources Digital Technology Co Ltd filed Critical China Resources Digital Technology Co Ltd
Priority to CN202211000683.1A priority Critical patent/CN115310449A/en
Publication of CN115310449A publication Critical patent/CN115310449A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention discloses a small-sample-based named entity recognition method and apparatus and a related medium, wherein the method comprises the following steps: acquiring sample data and labeling the sample data with entity labels to construct a first sample set; selecting pivot characters in the first sample set and constructing a label mapping space based on the pivot characters; mapping the first sample set to a second sample set using the label mapping space; fine-tuning a pre-training language model with the second sample set; and performing named entity recognition prediction on a specified text with the fine-tuned pre-training language model. By selecting the most representative pivot characters to construct a label mapping space for mapping the sample data, and then fine-tuning the pre-training language model with the mapped second sample set, the method performs named entity recognition prediction with the fine-tuned model and can thereby improve both the efficiency and the accuracy of named entity recognition.

Description

Named entity identification method and device based on small sample and related medium
Technical Field
The invention relates to the technical field of named entity identification, in particular to a named entity identification method and device based on small samples and a related medium.
Background
Named entity recognition refers to the recognition of entities with specific meanings in text, mainly including names of people, places, organizations, proper nouns, and the like. With the continuous development of the information industry, the number of electronic texts of all kinds has grown sharply, and quickly and efficiently acquiring structured information from them has become increasingly difficult. Named entity recognition technology is therefore applied in various fields to extract key information from text accurately and efficiently.
At present, the mainstream approach to entity recognition tasks is based on deep learning: after the text is encoded, a deep learning model captures its semantic features, which are then fed into a classification layer to recognize and classify the entities in the text. One disadvantage of this approach is that it requires a sufficient number of training samples; only when trained on a large number of samples can the model effectively capture entity information. In some specific fields, however, samples are scarce, difficult to collect, and costly. In view of these problems, the prior art has also proposed neural network models based on prompt learning for small samples. However, such prompt-learning-based methods need to enumerate all potential templates or entities for inference prediction, which consumes a great deal of time, and the inconsistency between the fine-tuning objective and the pre-training objective of the language model also affects the recognition performance to a certain extent.
Disclosure of Invention
The embodiments of the invention provide a small-sample-based named entity recognition method and apparatus, a computer device, and a storage medium, aiming to improve the efficiency and accuracy of named entity recognition.
In a first aspect, an embodiment of the present invention provides a named entity identification method based on a small sample, including:
acquiring sample data, and labeling an entity label on the sample data to construct a first sample set;
selecting pivot characters in the first sample set, and constructing a label mapping space based on the pivot characters;
mapping the first set of samples to a second set of samples using the label mapping space;
fine-tuning a pre-training language model by using the second sample set;
and carrying out named entity recognition prediction on the specified text by adopting the fine-tuned pre-training language model.
In a second aspect, an embodiment of the present invention provides a named entity identification apparatus based on a small sample, including:
the label marking unit is used for acquiring sample data and marking an entity label on the sample data so as to construct a first sample set;
the character selection unit is used for selecting pivot characters in the first sample set and constructing a label mapping space based on the pivot characters;
a sample mapping unit for mapping the first sample set into a second sample set by using the label mapping space;
the model fine-tuning unit is used for fine-tuning a pre-training language model by utilizing the second sample set;
and the recognition prediction unit is used for carrying out named entity recognition prediction on the specified text by adopting the fine-tuned pre-training language model.
In a third aspect, an embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for named entity identification based on small samples according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the method for named entity identification based on a small sample according to the first aspect.
The embodiments of the invention provide a small-sample-based named entity recognition method and apparatus, a computer device, and a storage medium, wherein the method comprises the following steps: acquiring sample data and labeling the sample data with entity labels to construct a first sample set; selecting pivot characters in the first sample set and constructing a label mapping space based on the pivot characters; mapping the first sample set to a second sample set using the label mapping space; fine-tuning a pre-training language model with the second sample set; and performing named entity recognition prediction on a specified text with the fine-tuned pre-training language model. By selecting the most representative pivot characters to construct a label mapping space for mapping the sample data, and then fine-tuning the pre-training language model with the mapped second sample set, the embodiments perform named entity recognition prediction with the fine-tuned model and can thereby improve both the efficiency and the accuracy of named entity recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and that other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a named entity identification method based on a small sample according to an embodiment of the present invention;
fig. 2 is a schematic network structure diagram of a named entity identification method based on a small sample according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a prediction flow of a named entity recognition method based on a small sample according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of a named entity recognition apparatus based on a small sample according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart of a named entity identification method based on a small sample according to an embodiment of the present invention, which specifically includes: steps S101 to S105.
S101, obtaining sample data, and marking an entity label on the sample data to construct a first sample set;
S102, selecting pivot characters in the first sample set, and constructing a label mapping space based on the pivot characters;
S103, mapping the first sample set into a second sample set by using the label mapping space;
S104, fine-tuning a pre-training language model by using the second sample set;
and S105, performing named entity recognition prediction on the specified text by using the fine-tuned pre-training language model.
In this embodiment, entity labeling is performed on a small amount of sample data to obtain a first sample set; the most representative characters are then selected from the first sample set as pivot characters, and a label mapping space is constructed from the pivot characters to map the sample data in the first sample set into a corresponding second sample set. The pre-training language model is then fine-tuned with the second sample set, so that named entity recognition prediction is performed with the fine-tuned pre-training language model, improving both the efficiency and the accuracy of named entity recognition.
In one embodiment, the step S101 includes:
dividing the sample data into a named entity text and a non-named entity text;
labeling the named entity text with an entity label;
marking the non-named entity text as O;
constructing the first sample set $S_1 = \langle\text{text } X, \text{label } Y\rangle$ based on the labeling results.
In this embodiment, when constructing the first sample set, entity labels are attached to a small amount of sample data: named entity text is labeled with the corresponding entity label, such as name (PER), gender (GEN), age (AGE), or date of birth (DOB), while non-named entity text is uniformly labeled O. After labeling, the first sample set in two-tuple form, $S_1 = \langle\text{text } X, \text{label } Y\rangle$, is obtained.
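For illustration only (not part of the claimed method), the following minimal Python sketch shows one way the first sample set could be assembled from span-annotated text, assuming character-level labels and the label names used above; all function and variable names are hypothetical.

```python
# Hypothetical sketch: build the first sample set S1 = <text X, label Y>
# from span-annotated raw samples. Labels are character-level; non-entity
# characters are labeled "O". Names and data layout are assumptions.
def build_first_sample_set(raw_samples):
    """raw_samples: list of (text, spans), where spans maps a
    (start, end) character range to an entity label, e.g. {(0, 2): "PER"}."""
    sample_set = []
    for text, spans in raw_samples:
        labels = ["O"] * len(text)              # default: non-named-entity text
        for (start, end), tag in spans.items():
            for i in range(start, end):
                labels[i] = tag                 # entity characters get the entity label
        sample_set.append((list(text), labels))
    return sample_set

# Example: "张三今年30岁" with a PER span and an AGE span.
s1 = build_first_sample_set([("张三今年30岁", {(0, 2): "PER", (4, 6): "AGE"})])
```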
In an embodiment, the step S102 includes:
the tag mapping space M is constructed as follows:
Figure BDA0003807015600000041
wherein x and y represent text and corresponding entity labels in the first sample set, respectively,
Figure BDA0003807015600000042
indicating the representative degree index of the pivot character w to the entity label Li,
Figure BDA0003807015600000051
indicating that the entity label L is selected from all the characters V i Pivot character w, tf (x = w, y = l) with highest representative degree index i ) Means that all are marked with L i Idf (x = w) represents a measure of the general importance of the pivot character w.
The purpose of this embodiment is to select the most representative character (called pivot character) from the dictionary V for each label of the first sample set, thereby constructing the label mapping space M. With a single arbitrary label l i For example, the following steps are carried out:
Figure BDA0003807015600000052
wherein the content of the first and second substances,
Figure BDA0003807015600000053
defined as pivot character w to label L i Is a representative degree index of (a).
Figure BDA0003807015600000054
Indicating that the label L is selected from all the characters V i The pivot character w with the highest degree index is represented.
tf(x=w,y=l i ) Is defined as all labeled L i The frequency of occurrence of the pivot character w. The higher the frequency, the more the character can represent the label, and the specific formula is as follows:
Figure BDA0003807015600000055
wherein the formula N (-) is used to calculate the number of occurrences of characters within the first sample set that satisfy the condition. In the above formula, the pivot character w of the molecular representation is markedIs signed as i And the denominator indicates all tagged as l i The sum of the number of occurrences of the character.
idf (x = w) is defined as a measure of the general importance of the pivot character w. If the general importance is higher, the character is more common in each label sample, and the representing capability for a single label sample is weaker, the formula is as follows:
Figure BDA0003807015600000056
in the above formula, the numerator represents the number of tag types in the first sample set, and the denominator represents the number of tag types including the pivot word w.
Thus, a tag mapping space M is constructed that is capable of mapping an entity tag to a pivot character representing the tag.
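As a minimal sketch of the pivot-character selection described above (assuming the tf and idf interpretations given in this section; the function name and data layout follow the hypothetical sketch after step S101):

```python
import math
from collections import Counter, defaultdict

# Sketch of pivot-character selection: for each entity label, pick the
# character maximizing tf * idf as defined above. Data layout follows the
# earlier hypothetical sketch: (characters, labels) pairs with "O" for
# non-entity positions.
def build_label_mapping_space(sample_set):
    char_counts = defaultdict(Counter)          # label -> character frequency
    for chars, labels in sample_set:
        for ch, lb in zip(chars, labels):
            if lb != "O":
                char_counts[lb][ch] += 1

    num_labels = len(char_counts)
    mapping = {}
    for label, counts in char_counts.items():
        total = sum(counts.values())            # all character occurrences under this label
        best_char, best_score = None, float("-inf")
        for ch, n in counts.items():
            tf = n / total                      # frequency of ch among characters with this label
            df = sum(1 for c in char_counts.values() if ch in c)
            idf = math.log(num_labels / df)     # fewer label types containing ch -> higher idf
            if tf * idf > best_score:
                best_char, best_score = ch, tf * idf
        mapping[label] = best_char              # the pivot character for this label
    return mapping
```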
In one embodiment, the step S103 includes:
selecting the entity labels in the first sample set;
mapping the text corresponding to the entity labels in the first sample set according to the following formula to obtain a second sample set $S_2 = \langle\text{text } X, \text{target text } X'\rangle$ containing the text and the target text:
$$X' = \{x_1, \ldots, M(y_i), \ldots, x_n\}$$
where X' denotes the target text in the second sample set, M(·) denotes the label mapping space, $y_i$ denotes an entity label in the first sample set, and $x_1$ and $x_n$ denote text characters in the first sample set.
In this embodiment, the first sample set $S_1$ ($X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$) is label-mapped. If $y_i$ is an entity label, it is mapped to a pivot character; otherwise, the original text is retained. Assuming $y_i$ is an entity label, the formula for the target text X' obtained by mapping the original text X is:
$$X' = \{x_1, \ldots, M(y_i), \ldots, x_n\}$$
where M(·) is the label mapping space and the pivot character $M(y_i)$ replaces the original $x_i$. On this basis, the second sample set in two-tuple form, $S_2 = \langle\text{text } X, \text{target text } X'\rangle$, is constructed.
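Continuing the hypothetical sketches above, the label-mapping step itself is a per-position substitution:

```python
# Sketch of constructing S2 = <text X, target text X'>: entity positions
# are replaced by the pivot character of their label; non-entity positions
# keep the original character.
def map_to_second_sample_set(sample_set, mapping):
    second = []
    for chars, labels in sample_set:
        target = [mapping[lb] if lb != "O" else ch
                  for ch, lb in zip(chars, labels)]
        second.append(("".join(chars), "".join(target)))
    return second
```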
In an embodiment, the pre-training language model is a BERT pre-training model. Of course, in other embodiments, other pre-trained language models may be employed, such as the RoBERTa Chinese pre-trained model, the ERNIE pre-trained model, and so forth.
Further, the step S104 includes:
inputting the text in the second sample set into the BERT pre-training model, which outputs the corresponding feature codes;
calculating, based on the feature codes, the probability P that the input text is predicted as the target text:
$$P(x_i = x'_i \mid X) = \mathrm{softmax}(W_{LM} \cdot h_i)$$
where $x_i$ denotes the i-th input text character, $x'_i$ denotes the i-th target text character, X denotes the text of the second sample set, LM denotes the BERT pre-training model, $W_{LM}$ denotes the weight parameters of the last fully connected layer of the BERT pre-training model LM, and $h_i$ denotes the feature code of the i-th character;
performing optimization updates of the fine-tuning training with the loss function $\mathcal{L}_{LM}$ to obtain the fine-tuned BERT pre-training model LM':
$$\mathcal{L}_{LM} = -\sum_{i=1}^{n} \log P(x_i = x'_i \mid X)$$
This embodiment uses the second sample set $S_2$ ($X = \{x_1, \ldots, x_n\}$, $X' = \{x'_1, \ldots, x'_n\}$) to fine-tune the pre-training language model LM, specifically:
The input text $X = \{x_1, \ldots, x_n\}$ is processed by the pre-training language model LM to obtain the feature codes $H = \{h_1, \ldots, h_n\}$; then, based on the feature codes, the probability that the character $x_i$ in the input text is predicted as $x'_i$ in the target text is calculated:
$$P(x_i = x'_i \mid X) = \mathrm{softmax}(W_{LM} \cdot h_i)$$
The loss function $\mathcal{L}_{LM}$ of the fine-tuning training is therefore:
$$\mathcal{L}_{LM} = -\sum_{i=1}^{n} \log P(x_i = x'_i \mid X)$$
The weight parameters of the pre-training language model LM are adaptively updated during fine-tuning, and training finally yields the fine-tuned language model LM'.
As shown in fig. 2, in the training process, entity labeling is performed on a small number of samples to generate the first sample set $S_1(X, Y)$; representative pivot characters are then selected from $S_1(X, Y)$ to construct the label mapping space M, which is used to label-map $S_1(X, Y)$ into the corresponding second sample set $S_2(X, X')$. The second sample set $S_2(X, X')$ is then used to fine-tune the pre-training language model LM, yielding the optimized pre-training language model LM'.
In one embodiment, the step S105 includes:
performing character prediction on the specified text with the fine-tuned pre-training language model according to the following formulas:
$$o_i = \mathrm{softmax}(W_{LM'} \cdot e_i)$$
$$\hat{t}_i = \mathop{\arg\max}(o_i)$$
where $o_i$ denotes the character generation probability, $W_{LM'}$ denotes the weight parameters of the fine-tuned pre-training language model, $e_i$ denotes the feature code of the i-th character in the specified text, and $\hat{t}_i$ denotes the i-th predicted character;
constructing the predicted characters into a predicted text, and mapping the characters in the predicted text into entity labels using the label mapping space.
In this embodiment, referring to fig. 3, the specified text $T = \{t_1, \ldots, t_n\}$ is predicted with the fine-tuned pre-training language model LM', specifically:
The specified text $T = \{t_1, \ldots, t_n\}$ is input into the pre-training language model LM', which outputs the corresponding feature codes $E = \{e_1, \ldots, e_n\}$;
After the fully connected layer of the pre-training language model LM', the character generation probability is calculated with the softmax function, and the most likely character is taken by the argmax operation. The character $\hat{t}_i$ at position i is generated as:
$$o_i = \mathrm{softmax}(W_{LM'} \cdot e_i)$$
$$\hat{t}_i = \mathop{\arg\max}(o_i)$$
The predicted text $\hat{T} = \{\hat{t}_1, \ldots, \hat{t}_n\}$ is constructed from the predicted characters and mapped into labels by the label mapping space M. Specifically, if $\hat{t}_i$ is a pivot character in the label mapping space M, the corresponding entity label $M^{-1}(\hat{t}_i)$ is output; otherwise, the non-entity label O is output. The final predicted label result is:
$$\hat{y}_i = \begin{cases} M^{-1}(\hat{t}_i), & \hat{t}_i \in M \\ O, & \text{otherwise} \end{cases}$$
where $M^{-1}(\cdot)$ denotes the reverse mapping, i.e., obtaining the corresponding entity label from the pivot character. Specifically, the entity labels may be PER, AGE, and the like.
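A corresponding prediction sketch, reusing the model, tokenizer, and `mapping` from the hypothetical sketches above; the one-token-per-character alignment is a simplifying assumption for Chinese text.

```python
# Sketch of inference: take the argmax character at each position of the
# specified text, then reverse-map pivot characters to entity labels;
# every other position gets the non-entity label "O".
@torch.no_grad()
def predict_labels(text, model, tokenizer, mapping):
    reverse = {v: k for k, v in mapping.items()}       # pivot character -> entity label
    enc = tokenizer(text, return_tensors="pt")
    logits = model(**enc).logits                       # scores before softmax; argmax is unchanged
    pred_ids = logits.argmax(dim=-1)[0]                # most likely character per position
    pred_chars = tokenizer.convert_ids_to_tokens(pred_ids.tolist())
    pred_chars = pred_chars[1:1 + len(text)]           # drop [CLS], align with input characters
    return [reverse.get(ch, "O") for ch in pred_chars]
```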
Fig. 4 is a schematic block diagram of a named entity recognition apparatus 400 based on a small sample according to an embodiment of the present invention, where the apparatus 400 includes:
a tag labeling unit 401, configured to acquire sample data and label an entity tag for the sample data, so as to construct a first sample set;
a character selection unit 402, configured to select pivot characters in the first sample set, and construct a label mapping space based on the pivot characters;
a sample mapping unit 403, configured to map the first sample set into a second sample set by using the label mapping space;
a model fine-tuning unit 404, configured to perform fine-tuning on a pre-training language model by using the second sample set;
and the recognition prediction unit 405 is configured to perform named entity recognition prediction on the specified text by using the fine-tuned pre-training language model.
In one embodiment, the label labeling unit 401 includes:
the data dividing unit is used for dividing the sample data into a named entity text and a non-named entity text;
the first text labeling unit is used for labeling the entity label to the named entity text;
the second text labeling unit is used for labeling the non-named entity text as O;
a first sample set construction unit, configured to construct the first sample set $S_1 = \langle\text{text } X, \text{label } Y\rangle$ based on the labeling results.
In one embodiment, the character selecting unit 402 includes:
a space construction unit, configured to construct the label mapping space M according to the following formula:
$$M(l_i) = \mathop{\arg\max}_{w \in V}\; tf(x=w,\, y=l_i) \times idf(x=w)$$
where x and y respectively denote a text character and its corresponding entity label in the first sample set; $tf(x=w, y=l_i) \times idf(x=w)$ denotes the representative degree index of the pivot character w for the entity label $l_i$; $\arg\max_{w \in V}$ denotes selecting, from all characters V, the pivot character w with the highest representative degree index for the entity label $l_i$; $tf(x=w, y=l_i)$ denotes the frequency of occurrence of the pivot character w among all characters labeled $l_i$; and $idf(x=w)$ denotes a measure of the general importance of the pivot character w.
In an embodiment, the sample mapping unit 403 includes:
the label selecting unit is used for selecting the entity labels in the first sample set;
a second sample set construction unit, configured to map the text corresponding to the entity labels in the first sample set according to the following formula to obtain a second sample set $S_2 = \langle\text{text } X, \text{target text } X'\rangle$ containing the text and the target text:
$$X' = \{x_1, \ldots, M(y_i), \ldots, x_n\}$$
where X' denotes the target text in the second sample set, M(·) denotes the label mapping space, $y_i$ denotes an entity label in the first sample set, and $x_1$ and $x_n$ denote text characters in the first sample set.
In one embodiment, the pre-trained language model is a BERT pre-trained model.
In one embodiment, the model fine tuning unit 404 includes:
a text input unit, configured to input the text in the second sample set into the BERT pre-training model, which outputs the corresponding feature codes;
a probability calculation unit, configured to calculate, based on the feature codes, the probability P that the input text is predicted as the target text:
$$P(x_i = x'_i \mid X) = \mathrm{softmax}(W_{LM} \cdot h_i)$$
where $x_i$ denotes the i-th input text character, $x'_i$ denotes the i-th target text character, X denotes the text of the second sample set, LM denotes the BERT pre-training model, $W_{LM}$ denotes the weight parameters of the last fully connected layer of the BERT pre-training model LM, and $h_i$ denotes the feature code of the i-th character;
an optimization updating unit, configured to perform optimization updates of the fine-tuning training with the loss function $\mathcal{L}_{LM}$ to obtain the fine-tuned BERT pre-training model LM':
$$\mathcal{L}_{LM} = -\sum_{i=1}^{n} \log P(x_i = x'_i \mid X)$$
in an embodiment, the identifying a prediction unit 405 comprises:
a character prediction unit, configured to perform character prediction on the specified text with the fine-tuned pre-training language model according to the following formulas:
$$o_i = \mathrm{softmax}(W_{LM'} \cdot e_i)$$
$$\hat{t}_i = \mathop{\arg\max}(o_i)$$
where $o_i$ denotes the character generation probability, $W_{LM'}$ denotes the weight parameters of the fine-tuned pre-training language model, $e_i$ denotes the feature code of the i-th character in the specified text, and $\hat{t}_i$ denotes the i-th predicted character;
and a character mapping unit, configured to construct the predicted characters into a predicted text and map the characters in the predicted text into entity labels using the label mapping space.
Since the embodiment of the apparatus portion and the embodiment of the method portion correspond to each other, please refer to the description of the embodiment of the method portion for the embodiment of the apparatus portion, and details are not repeated here.
Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed, the steps provided by the above embodiments can be implemented. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present invention further provides a computer device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided in the above embodiment when calling the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A named entity recognition method based on small samples is characterized by comprising the following steps:
acquiring sample data, and labeling an entity label on the sample data to construct a first sample set;
selecting pivot characters in the first sample set, and constructing a label mapping space based on the pivot characters;
mapping the first set of samples to a second set of samples using the label mapping space;
fine-tuning a pre-training language model by using the second sample set;
and carrying out named entity recognition prediction on the specified text by adopting the fine-tuned pre-training language model.
2. The method according to claim 1, wherein the obtaining sample data and labeling the sample data with an entity label to construct a first sample set comprises:
dividing the sample data into a named entity text and a non-named entity text;
labeling an entity label on the named entity text;
marking the non-named entity text as O;
constructing the first sample set $S_1 = \langle\text{text } X, \text{label } Y\rangle$ based on the labeling results.
3. The method according to claim 2, wherein selecting a pivot character from the first sample set and constructing a label mapping space based on the pivot character comprises:
constructing the label mapping space M according to the following formula:
$$M(l_i) = \mathop{\arg\max}_{w \in V}\; tf(x=w,\, y=l_i) \times idf(x=w)$$
where x and y respectively denote a text character and a corresponding entity label in the first sample set; $tf(x=w, y=l_i) \times idf(x=w)$ denotes the representative degree index of the pivot character w for the entity label $l_i$; $\arg\max_{w \in V}$ denotes selecting, from all characters V, the pivot character w with the highest representative degree index for the entity label $l_i$; $tf(x=w, y=l_i)$ denotes the frequency of occurrence of the pivot character w among all characters labeled $l_i$; and $idf(x=w)$ denotes a measure of the general importance of the pivot character w.
4. The method according to claim 3, wherein the mapping the first set of samples to a second set of samples using the label mapping space comprises:
selecting the entity labels in the first sample set;
mapping the text corresponding to the entity labels in the first sample set according to the following formula to obtain a second sample set $S_2 = \langle\text{text } X, \text{target text } X'\rangle$ containing the text and the target text:
$$X' = \{x_1, \ldots, M(y_i), \ldots, x_n\}$$
where X' denotes the target text in the second sample set, M(·) denotes the label mapping space, $y_i$ denotes an entity label in the first sample set, and $x_1$ and $x_n$ denote text characters in the first sample set.
5. The small-sample-based named entity recognition method of claim 4, wherein the pre-trained language model is a BERT pre-trained model.
6. The method according to claim 5, wherein the fine-tuning of the pre-trained language model using the second set of samples comprises:
inputting the text in the second sample set into the BERT pre-training model, which outputs the corresponding feature codes;
calculating, based on the feature codes, the probability P that the input text is predicted as the target text:
$$P(x_i = x'_i \mid X) = \mathrm{softmax}(W_{LM} \cdot h_i)$$
where $x_i$ denotes the i-th input text character, $x'_i$ denotes the i-th target text character, X denotes the text of the second sample set, LM denotes the BERT pre-training model, $W_{LM}$ denotes the weight parameters of the last fully connected layer of the BERT pre-training model LM, and $h_i$ denotes the feature code of the i-th character; and
performing optimization updates of the fine-tuning training with the loss function $\mathcal{L}_{LM}$ to obtain the fine-tuned BERT pre-training model LM':
$$\mathcal{L}_{LM} = -\sum_{i=1}^{n} \log P(x_i = x'_i \mid X)$$
7. The small-sample-based named entity recognition method of claim 6, wherein the performing named entity recognition prediction on a specified text using the fine-tuned pre-trained language model comprises:
performing character prediction on the specified text with the fine-tuned pre-training language model according to the following formulas:
$$o_i = \mathrm{softmax}(W_{LM'} \cdot e_i)$$
$$\hat{t}_i = \mathop{\arg\max}(o_i)$$
where $o_i$ denotes the character generation probability, $W_{LM'}$ denotes the weight parameters of the fine-tuned pre-training language model, $e_i$ denotes the feature code of the i-th character in the specified text, and $\hat{t}_i$ denotes the i-th predicted character; and
constructing the predicted characters into a predicted text, and mapping the characters in the predicted text into entity labels using the label mapping space.
8. A named entity recognition apparatus based on a small sample, comprising:
the label marking unit is used for acquiring sample data and marking an entity label on the sample data so as to construct a first sample set;
the character selection unit is used for selecting pivot characters in the first sample set and constructing a label mapping space based on the pivot characters;
a sample mapping unit, configured to map the first sample set into a second sample set by using the label mapping space;
the model fine-tuning unit is used for fine-tuning a pre-training language model by utilizing the second sample set;
and the recognition prediction unit is used for carrying out named entity recognition prediction on the specified text by adopting the fine-tuned pre-training language model.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method for small sample based named entity recognition according to any one of claims 1 to 7 when the computer program is executed.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out a method for small-sample based named entity recognition as claimed in any one of the claims 1 to 7.
CN202211000683.1A 2022-08-19 2022-08-19 Named entity identification method and device based on small sample and related medium Pending CN115310449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211000683.1A CN115310449A (en) 2022-08-19 2022-08-19 Named entity identification method and device based on small sample and related medium


Publications (1)

Publication Number Publication Date
CN115310449A (en) 2022-11-08

Family

ID=83863120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211000683.1A Pending CN115310449A (en) 2022-08-19 2022-08-19 Named entity identification method and device based on small sample and related medium

Country Status (1)

Country Link
CN (1) CN115310449A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861885A (en) * 2023-07-11 2023-10-10 贝壳找房(北京)科技有限公司 Label generation method, device, equipment and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination