CN117076594A - Text structuring method for small sample data in military field - Google Patents

Text structuring method for small sample data in military field

Info

Publication number
CN117076594A
CN117076594A (application CN202211735348.6A)
Authority
CN
China
Prior art keywords
model
constructing
training
text
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211735348.6A
Other languages
Chinese (zh)
Inventor
陈酉明
贾学良
张文峰
纪有书
陈小康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyao Intelligent Technology Co ltd
Original Assignee
Nanjing Xingyao Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xingyao Intelligent Technology Co ltd filed Critical Nanjing Xingyao Intelligent Technology Co ltd
Priority to CN202211735348.6A
Publication of CN117076594A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention provides a text structuring method for small-sample data in the military field, characterized by comprising the following steps: constructing a structured extraction template; constructing prompts that control the generated structure; constructing a generative model; constructing a pre-training model; and encoding, fine-tuning, and training under small-sample conditions, in which small-sample data from the military field are encoded into the structured inputs of the model and the model is fine-tuned with a cross-entropy loss. This text structuring method models different text information extraction tasks through a universal structure, and the structure can collaboratively learn general information extraction capability from different knowledge sources, so that different knowledge structures reinforce one another.

Description

Text structuring method for small sample data in military field
Technical Field
The invention belongs to the technical field of document processing, and specifically provides a text structuring method for small-sample data in the military field.
Background
Text structuring is an important application direction in the field of military intelligence. For example, in some intelligent auditing scenarios, files must be structured to enable subsequent review and warehousing; in intelligence discovery, intelligence texts must be structured and organized to support subsequent collation and judgment. Text structuring typically involves three basic classes of tasks: entity extraction, relation extraction, and event extraction.
Compared with conventional domains, text structuring in the military domain faces greater challenges: uneven data distribution, samples that are difficult to acquire, and text with complicated and irregular patterns. Most current text structuring work is based on deep model design. Research shows that the deeper a neural network is, the stronger its learning ability, but also the more easily it overfits; the model therefore needs to be trained on a large amount of supervised data to improve its generalization and achieve good performance. In military application scenarios, however, supervised samples are too costly to acquire and difficult to label, so how to train a good model under small-sample conditions is a problem to be solved.
Existing small-sample learning strategies can be broadly divided into three categories: methods based on pre-training and prompt learning, methods based on data augmentation, and methods based on ensemble learning and self-training.
1. Methods based on pre-training and prompt learning.
(1) A task-agnostic large model is trained in a self-supervised manner on large amounts of unlabeled data; downstream tasks obtain good contextual representations by invoking this large model.
(2) A small, strongly task-related model is added after the output layer of the large model and trained with a small amount of task-related labeled data, either with the large model's parameters frozen or with the large model's parameters fine-tuned. Because the model to be trained is small, relatively little labeled data is required, which suits small-sample scenarios.
Adding a strongly task-related small model after the pre-trained large model and training it with labeled data, however, splits pre-training from the downstream task and hurts model transferability. Prompt learning aims to eliminate the gap between the downstream task and the pre-training task, completing the downstream task without any additional small model. Since no small model needs training, even less labeled data is required, and prediction can be completed even without labeled data. The text structuring method of this invention for small-sample military data is likewise built on a pre-trained model and prompt learning; the concrete implementation steps of a similar prompt-learning method are detailed below.
For example, a BERT-based sentiment analysis method is implemented as follows:
(1) Set the BERT pre-training task: when training BERT, words in the text are randomly masked at a fixed rate, and the training objective is for the model to predict the masked words from their context.
(2) Construct the downstream input in the same format used when pre-training the model, so the downstream task can be trained without adding a task-specific output layer: build prompt words containing the same masked-word structure as the pre-training task, and have the model fill in the masked positions.
(3) Result mapping: feed the constructed input text into the pre-trained model and map the predicted masked word onto a classification label to obtain the sentiment.
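As a concrete illustration of these three steps, a minimal sketch using the HuggingFace transformers fill-mask pipeline follows; the prompt wording, checkpoint name, and label-word mapping are illustrative assumptions, not part of the BERT method itself.

# A minimal sketch of prompt-based sentiment classification with a masked
# language model. The prompt wording, model checkpoint, and label words
# ("great"/"terrible" etc.) are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def classify_sentiment(text: str) -> str:
    # Step (2): wrap the input in a prompt that mimics the masked-LM
    # pre-training format, so no task-specific output layer is needed.
    prompt = f"{text} It was [MASK]."
    predictions = fill_mask(prompt)  # sorted by score, highest first
    # Step (3): map the predicted mask word onto a classification label.
    label_words = {"great": "positive", "good": "positive",
                   "terrible": "negative", "bad": "negative"}
    for pred in predictions:
        word = pred["token_str"].strip()
        if word in label_words:
            return label_words[word]
    return "unknown"

print(classify_sentiment("The plot was gripping and the acting superb."))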
2. Methods based on data augmentation
Data augmentation generates, without changing the semantics of the original sample, several new samples with the same or similar semantics; the new samples carry the same labels as the originals. Common augmentation methods include synonym replacement, back-translation, random insertion, and random deletion. Through augmentation, a data distribution containing more information can be obtained. Beyond augmenting the sample data itself, enhancement information can also be constructed inside the model: for example, adversarial training during model training finds a maximal perturbation and teaches the model to keep the output under the perturbation consistent with the output without it.
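As a concrete illustration of sample-level augmentation, a minimal sketch follows; the toy synonym table and the probabilities are assumptions standing in for a real thesaurus or a back-translation service.

# A minimal sketch of text data augmentation via synonym replacement and
# random deletion. The synonym table is a toy stand-in; a real system
# would use a thesaurus, word embeddings, or back-translation.
import random

SYNONYMS = {  # assumed toy lexicon
    "aircraft": ["plane", "airplane"],
    "perform": ["carry out", "execute"],
    "area": ["region", "zone"],
}

def synonym_replace(tokens, p=0.2):
    return [random.choice(SYNONYMS[t]) if t in SYNONYMS and random.random() < p
            else t for t in tokens]

def random_delete(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens  # never return an empty sample

sentence = "one aircraft will perform reconnaissance over the area".split()
for _ in range(3):
    print(" ".join(synonym_replace(random_delete(sentence))))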
3. Methods based on ensemble learning and self-training
Ensemble learning follows the idea that the majority is more trustworthy than the minority: an inference result jointly given by most models is considered more reliable than the result of any single model. Self-training jointly trains the model with a small amount of labeled data and a large amount of unlabeled data: first, train the model on the small labeled set; then use the trained model to predict the unlabeled data and select high-confidence outputs as pseudo-labels for the corresponding inputs; finally, add the pseudo-labeled data to the training set, continue training, and repeat these steps.
By combining ensemble learning and self-training, a good model can be trained with only a small amount of labeled data, enabling application in small-sample scenarios.
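A minimal sketch of this self-training loop follows; the model interface (fit / predict_with_confidence) and the 0.9 confidence threshold are illustrative assumptions, not part of the methods surveyed.

# A minimal sketch of the self-training loop: train on labeled data,
# pseudo-label high-confidence unlabeled data, and repeat. The model
# interface and the threshold are illustrative assumptions.
def self_train(model, labeled, unlabeled, rounds=3, threshold=0.9):
    for _ in range(rounds):
        model.fit(labeled)                      # train on current labeled set
        still_unlabeled = []
        for x in unlabeled:
            label, confidence = model.predict_with_confidence(x)
            if confidence >= threshold:
                labeled.append((x, label))      # adopt as a pseudo-label
            else:
                still_unlabeled.append(x)       # revisit in the next round
        unlabeled = still_unlabeled
    return model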
In the prior art, most text structuring methods (e.g., the three classes of small-sample learning strategies above) are task-specific, so independent models and specialized knowledge sources must be designed for each structuring task. In practical applications in the military intelligence field, this hinders rapid system construction and development, effective knowledge sharing, and rapid cross-domain adaptation.
Disclosure of Invention
Technical problem: to solve the problems above, the invention provides a text structuring method for small-sample data in the military field that uniformly models different text structuring tasks, adaptively predicts different extraction structures, migrates quickly to new domains, and effectively alleviates the overfitting problem in small-sample learning.
The technical scheme is as follows: the invention provides a text structuring method for small-sample data in the military field, comprising the following steps:
Step 1, construct a structured extraction template: cast entity recognition, relation extraction, and event extraction into a unified data structure template that serves as the generation target;
Step 2, construct prompts that control the generated structure: based on a schema-based prompting mechanism, build a structural schema instructor that controls which information types and contents are discovered and associated;
Step 3, construct the generative model: build a generative model with an encoder-decoder architecture;
Step 4, construct the pre-training model: train the model's structuring capability with a structured dataset and its masked-language-model semantic representation capability with an unstructured dataset;
Step 5, encode, fine-tune, and train under small-sample conditions: encode small-sample military data into the structured model inputs described in steps 1 and 2, and fine-tune the model with a cross-entropy loss.
In step 1, the data structure template adopts a JSON-style layout, with items divided hierarchically; the data structure template is expressed as:
wherein:
trigger indicates that the elements under the structure describe an event type, and span is a fragment of the original text; the roles level lists all elements related to the event, and associate is the role each related element plays in the event;
subject indicates that the elements under the structure describe a relation type, and span is the subject of the relation; the roles level lists elements related to the subject, and associate denotes the relation type;
entity type indicates that the elements under the structure describe an entity type.
In step 2, the method for constructing prompt words that control the generated structure is as follows: for entity recognition, construct the prompt as "[s] entity type [s] …", where entity type denotes the entity type information to extract from the text; for relation recognition, construct the prompt as "[sub] span [as] relation type"; for event recognition, construct the prompt as "[trigger] event type [as] event argument type".
In step 3, the method for constructing the generative model is as follows: concatenate the prompt and the text and feed them into the encoder for joint encoding; compute hidden-layer representations for each token position of the input prompt and text; decode autoregressively, feeding predictions back one by one for joint encoding during decoding; end prediction when the end-of-sequence symbol is output, and convert the predicted structural template expression into extracted information records.
In step 4, the method for constructing the pre-training model comprises:
1. constructing the dataset: collecting a large-scale dataset, including a structured dataset (e.g., a specialized military knowledge base) and an unstructured dataset (e.g., raw text of military intelligence data);
2. pre-training: training the model's structuring capability with the structured dataset, and training masked-language-model semantic representation capability with the unstructured dataset.
Beneficial effects: the text structuring method for small-sample data in the military field models different text information extraction tasks through a universal structure, and this structure can collaboratively learn general information extraction capability from different knowledge sources, so that different knowledge structures reinforce one another.
The method uniformly encodes different extraction structures with a structured extraction language, adaptively generates the target extraction template through a schema-based prompt mechanism, and learns a text-to-structure generation model through large-scale pre-training to deliver the information extraction capability. This mechanism learns complementary information across samples well; because the pre-training tasks fit the downstream tasks closely, the model transfers effectively to new domains and achieves good performance under small-sample conditions.
In particular, the invention has the following outstanding advantages over the prior art:
1. The generation template designed by the invention unifies different information structured data and supports joint extraction by the model; the generated output structure is very compact, which greatly reduces decoding complexity.
2. The method controls the generated result during generation via the prompt prefix, adapting to different text structuring tasks. This design offers great advantages in real military application scenarios: it allows precise control and incremental training under small-sample conditions, and facilitates rapid extension of the knowledge structure base.
3. The invention designs a pre-training model combining two tasks tailored to the data characteristics of the military intelligence domain. It learns the data distribution and semantic representations of the domain, provides a solid foundation for information knowledge sharing and rapid adaptation to new structuring environments, and significantly improves information extraction performance in supervised, low-resource, and small-sample settings.
4. Compared with existing military-domain information extraction methods, the pre-training used here brings the pre-training tasks closer to the downstream tasks. Existing pre-training methods separate pre-training from downstream tasks and must build an independent loss function for each downstream task, which hurts transfer; they cannot learn a good data distribution under small-sample conditions in a new domain, and model performance depends on the amount and condition of the pre-training data. The pre-training construction of this invention subsumes the downstream tasks and therefore performs better with small samples.
5. Compared with existing text structuring methods, the unified structure designed in this patent avoids designing a separate model for each text structuring task, e.g., a sequence labeling method for entity recognition and a relation classification method for relation extraction. This system uses a generative structure to realize all structuring tasks end to end and controls different task generation with prompt words, giving a more flexible and convenient implementation mechanism.
Drawings
FIG. 1 is a flow chart of the text structuring method for small-sample data in the military field according to the present invention. The method is divided into two parts, training and prediction: the training part is supervised and uses labels to build the structural generation template; the prediction part constructs different prompt words for different tasks and takes the prompt words together with the text as model input to obtain the output structured generation template.
Detailed Description
The present invention will be further described below.
The text structuring method for small-sample data in the military field is described below, taking a piece of military intelligence as an example; it comprises the following steps:
Step 1, construct a structured extraction template: cast entity recognition, relation extraction, and event extraction into a unified data structure template that serves as the generation target;
In this way, different information extraction tasks can be decomposed into a series of text-to-structure conversions, with all information extraction models sharing the same underlying discovery and association capabilities.
The template construction adopts a JSON data structure style, with each item divided hierarchically. The data organization structure is as follows:
The template indicates whether the extracted data structure is an event, a relation, or an entity. span denotes a fragment of the original document; elements under the trigger key are event types, and the roles level lists all elements related to the event. associate is the role that an element related to the event plays in the event. subject indicates that a relation is extracted; the following span is the subject of the relation, and elements associated with the subject also sit at the roles level, where associate denotes the relation type. Apart from trigger and subject, entity type indicates that an entity is extracted.
In addition, the "{}" and "[]" symbols in the structured extraction template form the hierarchy among the structures, making the generated template easy to decode.
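Based on the description above, a plausible sketch of the three template forms follows; since the original template rendering appears only as a figure, the exact key names and nesting here are assumptions reconstructed from the prose.

# A plausible reconstruction of the unified JSON-style extraction templates,
# based only on the prose description; exact key names are assumptions.
event_template = {
    "trigger": "span",                  # span: event trigger fragment in the text
    "roles": [                          # all elements related to the event
        {"associate": "role span"},     # associate: role the element plays
    ],
}
relation_template = {
    "subject": "span",                  # span: subject of the relation
    "roles": [                          # elements related to the subject
        {"associate": "related span"},  # associate: the relation type
    ],
}
entity_template = {
    "entity type": "span",              # span: entity mention in the text
}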
Step 2, construct prompt words that control the generated structure: based on a schema-based prompting mechanism, build a structural schema instructor that controls which information types and contents are discovered and associated;
The framework of this patent uses a prompt-based architecture, so different structures must be generated for different needs. For example, for the sentence "Somebody took office in 2021 as the first president of a certain country", the entity recognition system should generate "[{person: somebody} {country: a certain country} {time: 2021}]", while the event extraction system should generate "[{trigger: take office, roles: [{person: somebody}, {position: president of a certain country}]}]". To this end, this patent designs a structural schema instructor, a schema-based prompt mechanism for controlling which types of information need to be discovered and associated.
The prompts are constructed as follows:
For entity recognition, the prompt is "[s] entity type [s] …", where "entity type" denotes the entity type information to extract from the text. For example, for "Somebody took office in 2021 as the first president of a certain country", the prompt to extract the person and time entities is: "[s] person [s] time".
For relation extraction, the prompt is "[sub] span [as] relation type"; thus, giving the prompt "[sub] somebody [as] position" with the text "Somebody took office in 2021 as the first president of a certain country" controls the generated structural template so that the relation (somebody, position, president) is extracted.
For event extraction, the prompt is "[trigger] event type [as] event argument type". For the example above, the prompt "[trigger] take office [as] person" extracts the person argument of the "take office" event.
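A minimal sketch of these three prompt constructors as plain string templates follows; the helper names are hypothetical, and the bracket tokens simply follow the patterns described above.

# A minimal sketch of the three prompt constructors described above,
# following the "[s] ...", "[sub] ... [as] ...", and
# "[trigger] ... [as] ..." patterns. Function names are hypothetical.
def entity_prompt(entity_types):
    return " ".join(f"[s] {t}" for t in entity_types)

def relation_prompt(subject_span, relation_type):
    return f"[sub] {subject_span} [as] {relation_type}"

def event_prompt(event_type, argument_type):
    return f"[trigger] {event_type} [as] {argument_type}"

print(entity_prompt(["person", "time"]))        # "[s] person [s] time"
print(relation_prompt("somebody", "position"))  # "[sub] somebody [as] position"
print(event_prompt("take office", "person"))    # "[trigger] take office [as] person"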
Step 3, construct the generative model: build a generative model with an encoder-decoder architecture;
The encoder and decoder of the model use the Transformer architecture. The prompt and text are first concatenated and fed into the encoder for joint encoding, and hidden-layer representations are computed for each token position of the input. Decoding then proceeds autoregressively, with each prediction fed back one by one for joint encoding. Finally, prediction ends when the end-of-sequence symbol is output, and the predicted structural template expression is converted into extracted information records.
Step 4, construct the pre-training model: train the model's structuring capability with the structured dataset and masked-language-model semantic representation capability with the unstructured dataset;
The pre-trained model should capture capabilities common to the different text structuring tasks; when the task migrates, the downstream task can then be adapted by fine-tuning alone, even under small-sample conditions. Pre-training model construction consists of two parts:
1. Construct the datasets: collect large-scale datasets, including structured data (e.g., a specialized military knowledge base) and unstructured data (e.g., raw text of military intelligence), and then pre-train uniformly over these heterogeneous datasets.
2. Pre-train: this patent pre-trains the model with two sequence generation tasks. (1) The structuring capability of the model is trained with the structured dataset. (2) A masked language model is trained on ordinary unsupervised raw military intelligence text to keep learning semantic representations, strengthening the model's fit to the military intelligence domain. Task (1) uses the supervised dataset: the model input is "prompt + text" and the output is the structured prompt template. The input of task (2) is masked text with some words masked out, and the output is the complete text in which the masked words are predicted.
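A minimal sketch of how the two pre-training tasks can be mixed into one text-to-text stream follows; the record format, helper names, and 15% masking rate are illustrative assumptions.

# A minimal sketch of mixing the two pre-training tasks into one
# text-to-text stream. Record formats and the masking rate are assumptions.
import random

def structured_example(record):
    # Task (1): input is "prompt + text", target is the structured template.
    return record["prompt"] + " " + record["text"], record["template"]

def masked_lm_example(text, mask_rate=0.15, mask_token="[MASK]"):
    # Task (2): input is text with some words masked, target is the full text.
    tokens = text.split()
    masked = [mask_token if random.random() < mask_rate else t for t in tokens]
    return " ".join(masked), text

def pretraining_stream(structured_data, raw_texts):
    # Yield (source, target) pairs for a single sequence-to-sequence trainer.
    for record in structured_data:
        yield structured_example(record)
    for text in raw_texts:
        yield masked_lm_example(text)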
Step 5, encode, fine-tune, and train under small-sample conditions: encode small-sample military data into the structured model inputs described in steps 1 and 2, and fine-tune the model with a cross-entropy loss;
For example, in a military intelligence event extraction task, newly added event types have so few samples that supervised training of a conventional deep model overfits, and a usable model with good generalization cannot be obtained.
Specifically, the pre-trained model of the invention is trained on a limited number of event categories from the original knowledge base (including support, language, navigation, and the like); these categories cannot cover every real application scenario. If, in practice, a user wants a text structuring model that can quickly recognize "reconnaissance" events, the following operations can be executed:
(1) Collect and label a small amount of reconnaissance-related data to obtain labeled raw data.
(2) Construct a prompt template for each piece of data. For example, given the news item "This morning, one navy reconnaissance aircraft of a certain model from a certain country took off from a certain base and flew to a certain area to perform a reconnaissance action", which contains the event elements main force, starting area, destination area, and time, the prompt template can be constructed per the method of this patent as: "[trigger] reconnaissance [as] main force [as] starting area [as] destination area [as] time". Concatenating the prompt template with the text yields the model input.
(3) And constructing a label with supervised training data, and constructing and generating a data structure template, namely the label of the data, by using the method provided by the patent aiming at each piece of data. The generated data structure templates of the example described in (2) above are:
[
{ trigger } reconnaissance,
roll [ { time: today's morning }, { subject force: a model of scout }, { start area: a base }, { destination area: a ground },
]
(4) Perform fine-tuning training: take the collected prompts and data structure templates as model input and output, and train the model to obtain a recognition model with good generalization that can identify the relevant elements of reconnaissance events. For example, given new news intelligence such as "On a certain day of a certain month of a certain year, a reconnaissance aircraft of a certain model from country M's navy, performing a close-in reconnaissance mission about 104 km (56.1 nautical miles) southeast of a certain location, collided with a fighter of a certain model from country C's naval aviation that was tracking and monitoring it, causing the country-C aircraft to crash, the pilot to die, and the reconnaissance aircraft to make a forced landing at an airfield", the model can extract the contained event category as a reconnaissance event, with event elements: (time: a certain day of a certain month of a certain year), (main force: country M navy), (main force: reconnaissance aircraft of a certain model).
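A minimal sketch of this fine-tuning step (prompt + text in, structured template out, cross-entropy loss) follows; the checkpoint, hyperparameters, and the toy training pair are illustrative assumptions.

# A minimal sketch of fine-tuning the generative model on a few labeled
# examples with cross-entropy loss. Checkpoint, hyperparameters, and the
# training pair are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Each pair is (prompt + text, target structured template).
train_pairs = [
    ("[trigger] reconnaissance [as] main force [as] starting area "
     "[as] destination area [as] time "
     "This morning one reconnaissance aircraft took off from a base ...",
     "[{trigger: reconnaissance, roles: [{time: this morning}, "
     "{main force: reconnaissance aircraft}]}]"),
]

model.train()
for epoch in range(3):
    for source, target in train_pairs:
        inputs = tokenizer(source, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        # Passing labels makes the model return the token-level
        # cross-entropy loss over the target sequence.
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()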
The above examples illustrate only several embodiments of the invention; their descriptions are detailed but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the invention, all of which fall within the protection scope of the invention. Accordingly, the protection scope of the invention is determined by the appended claims.

Claims (5)

1. A text structuring method for small-sample data in the military field, characterized by comprising the following steps:
step 1, construct a structured extraction template: cast entity recognition, relation extraction, and event extraction into a unified data structure template that serves as the generation target;
step 2, construct prompts that control the generated structure: based on a schema-based prompting mechanism, build a structural schema instructor that controls which information types and contents are discovered and associated;
step 3, construct the generative model: build a generative model with an encoder-decoder architecture;
step 4, construct the pre-training model: train the model's structuring capability with a structured dataset and masked-language-model semantic representation capability with an unstructured dataset;
step 5, encode, fine-tune, and train under small-sample conditions: encode small-sample military data into the structured model inputs described in steps 1 and 2, and fine-tune the model with a cross-entropy loss.
2. The text structuring method for small-sample data in the military field according to claim 1, characterized in that: in step 1, the data structure template adopts a JSON-style layout, with items divided hierarchically; the data structure template is expressed as:
wherein:
trigger indicates that the elements under the structure describe an event type, and span is a fragment of the original text; the roles level lists all elements related to the event, and associate is the role each related element plays in the event;
subject indicates that the elements under the structure describe a relation type, and span is the subject of the relation; the roles level lists elements related to the subject, and associate denotes the relation type;
entity type indicates that the elements under the structure describe an entity type.
3. The text structuring method for small-sample data in the military field according to claim 1, characterized in that: in step 2, the method for constructing prompt words that control the generated structure is as follows: for entity recognition, construct the prompt as "[s] entity type [s] …", where entity type denotes the entity type information to extract from the text; for relation recognition, construct the prompt as "[sub] span [as] relation type"; for event recognition, construct the prompt as "[trigger] event type [as] event argument type".
4. The text structuring method for small-sample data in the military field according to claim 1, characterized in that: in step 3, the method for constructing the generative model is as follows: concatenate the prompt and the text and feed them into the encoder for joint encoding; compute hidden-layer representations for each token position of the input prompt and text; decode autoregressively, feeding predictions back one by one for joint encoding; end prediction when the end-of-sequence symbol is output, and convert the predicted structural template expression into extracted information records.
5. The text structuring method for small-sample data in the military field according to claim 1, characterized in that: in step 4, the method for constructing the pre-training model comprises:
1. constructing the dataset: collecting a large-scale dataset, including a structured dataset (e.g., a specialized military knowledge base) and an unstructured dataset (e.g., raw text of military intelligence data); 2. pre-training: training the model's structuring capability with the structured dataset, and training masked-language-model semantic representation capability with the unstructured dataset.
CN202211735348.6A 2022-12-30 2022-12-30 Text structuring method for small sample data in military field Pending CN117076594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211735348.6A CN117076594A (en) 2022-12-30 2022-12-30 Text structuring method for small sample data in military field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211735348.6A CN117076594A (en) 2022-12-30 2022-12-30 Text structuring method for small sample data in military field

Publications (1)

Publication Number Publication Date
CN117076594A true CN117076594A (en) 2023-11-17

Family

ID=88704890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211735348.6A Pending CN117076594A (en) 2022-12-30 2022-12-30 Text structuring method for small sample data in military field

Country Status (1)

Country Link
CN (1) CN117076594A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787422A (en) * 2024-02-27 2024-03-29 四川金信石信息技术有限公司 Switching operation task extraction method and system
CN117787422B (en) * 2024-02-27 2024-04-26 四川金信石信息技术有限公司 Switching operation task extraction method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination