CN113609244B - Structured record extraction method and device based on controllable generation - Google Patents

Structured record extraction method and device based on controllable generation

Info

Publication number
CN113609244B
CN113609244B (application number CN202110637453.5A)
Authority
CN
China
Prior art keywords
record
linear expression
structured
generation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110637453.5A
Other languages
Chinese (zh)
Other versions
CN113609244A (en)
Inventor
陆垚杰
林鸿宇
韩先培
孙乐
唐家龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202110637453.5A priority Critical patent/CN113609244B/en
Publication of CN113609244A publication Critical patent/CN113609244A/en
Application granted granted Critical
Publication of CN113609244B publication Critical patent/CN113609244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a structured record extraction method and device based on controllable generation. The method automatically extracts structured records from unstructured text; the extraction comprises the following steps: for a target text, a sequence-to-structure network first captures the text semantics of the target text with an encoder based on the self-attention mechanism, and then generates a structured representation with a decoder based on a mixed attention mechanism; a controllable decoding algorithm based on a prefix tree constrains the decoding space during generation, injects frame knowledge, and guides the model to decode and generate a linear expression; finally, the linear expression is structurally transformed to generate the structured record. In the model training stage, a two-stage model learning method is adopted to help the model learn efficiently: the first stage performs model learning with substructures and focuses on training the text block extraction capability; the second stage performs model learning with the complete record structure and focuses on training the structure extraction capability.

Description

Structured record extraction method and device based on controllable generation
Technical Field
The invention relates to a structured record extraction method, in particular to a record extraction method and device based on controllable generation, and belongs to the technical field of natural language processing.
Background
Record extraction aims to automatically extract structured record information from unstructured text; such record information includes, but is not limited to, event structures, binary relation structures between entities, n-ary relation structures between entities, and the like. Taking event record extraction as an example, given the sentence "The man returned to Los Angeles from Mexico", a record extraction system should be able to identify a "Transport" event whose trigger word is "returned" and whose arguments are "The man" (subject), "Los Angeles" (destination) and "Mexico" (origin). Taking the binary relation structure between entities as an example, given the sentence "Obama was born in Honolulu", a record extraction system should be able to identify that the relation between "Obama" and "Honolulu" is "born in". Structured record extraction is a key task in knowledge graph construction and natural language understanding.
The difficulty of structured record extraction lies in the complexity of the structural framework and the diversity of text expressions. First, the structural framework is multi-element: a single record structure is made up of a record category and participants in different roles, and different records have different semantic structures. Second, text expressions are diverse: a single record structure can be expressed by many different text expressions.
The traditional record extraction model mainly adopts a decoupling idea to cope with the complexity of the record structure and the diversity of text expression. Taking event record extraction as an example, the conventional method generally decomposes the extraction of a complete event structure into several subtasks (event trigger word detection, entity extraction, argument structure extraction) and combines the results of the subtasks through different combination strategies (pipeline models, joint modeling and joint reasoning methods) to extract the complete event structure. For example, the conventional method first determines that "returned" in the sentence triggers a Transport event, then extracts "The man" as a Person entity, and finally determines whether "The man" is the subject corresponding to the "returned" event. The decoupling method mainly faces two problems: fine-grained data annotation is difficult, and an optimal combination strategy is difficult to design manually. First, the conventional method requires training data annotated at different granularities for the different subtasks. For example, for Transport trigger word detection, Person entity recognition and Transport argument extraction, the traditional method needs training corpora with different fine-grained annotations, which lowers the utilization efficiency of labeled data and increases the difficulty and cost of data acquisition. Furthermore, manually designing the optimal combination of the different subtasks is a very challenging task. For example, pipeline models often suffer from error propagation, and joint models require heuristically predefined information sharing and decision dependencies between trigger detection, argument classification and entity recognition, often resulting in an event extractor that is structurally suboptimal and inflexible in design.
Disclosure of Invention
Aiming at the problems of high difficulty in labeling fine-grained data and difficulty in designing an optimal combination strategy, the invention provides a record extraction method and device based on controllable generation.
The technical scheme adopted by the invention is as follows:
a structured record extraction method based on controllable generation, comprising the steps of:
converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model;
in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and carrying out structural transformation on the generated linear expression to generate a structured record.
Further, the sequence-to-structure based record generation model first captures text semantics of the target text using a self-attention mechanism based encoder and then generates a structured linear expression using a mixed-attention mechanism based decoder.
Further, the sequence-to-structure-based record generation model adopts a two-stage model learning method to perform efficient learning:
the first stage learning process adopts a linear expression of a substructure to perform model learning, and pays attention to training of text block extraction capacity;
the second stage learning process adopts a linear expression of a full structure to perform model learning, and pays attention to training of the extraction capacity of the structure.
Further, the linear expression expresses a multi-layered record structure, wherein: each "()" sub-bracket consists of a text category and a text block string and is the smallest unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in one sentence and may contain several records or none.
Further, the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation; the decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>, and at each generation step the dynamic vocabulary of that step consists of the child nodes of the currently generated node.
Further, the controllable decoding algorithm comprises the following steps:
pruning is automatically carried out on the word list based on the decoding state to generate a dynamic word list: traversing the frame prefix tree to generate a linear expression, wherein the dynamic word list of each decoding state is a child node of a node corresponding to the state in the tree, and the complete word list is all legal natural language words and identifiers "() { } [ ]";
controllably generating a linear expression using the generated dynamic vocabulary: the candidate word with the highest conditional probability P(y_i | y_<i, x) is selected from the dynamic vocabulary and appended to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated linear expression result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
A structured record extraction device based on controlled generation employing the above method, comprising:
a linear expression generation module for converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model; in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
The beneficial effects of the invention are as follows:
1) The linear expression is used for expressing a multi-level record structure, so that the model can generate a complete event structure in an autoregressive mode.
2) The record extraction is directly carried out on the plain text sequence, and the complete structured record is generated, so that the dependence of the model on fine granularity training data is avoided.
3) The generation of the record structure is controlled in a dynamic word list mode, so that the integrity of the generated expression is ensured, and the frame knowledge is effectively injected.
4) Through the two-stage training method, the model is first pre-trained with substructures and then fine-tuned with full structures, which helps the model migrate knowledge from the pre-trained language model.
Drawings
FIG. 1 is a diagram of a controllably generated record extraction framework.
Fig. 2 is an event linear structural expression.
Fig. 3 is an event frame prefix tree.
Fig. 4 is an event category prefix tree.
Detailed Description
In order to make the above features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The event extraction method based on controllable generation provided by the invention, as shown in FIG. 1, mainly comprises: 1) a sequence-to-structure record generation model; 2) a controllable decoding algorithm based on frame prefix tree constraints; 3) a two-stage model learning algorithm based on curriculum learning.
Records mainly take the form of relations and events. Taking event extraction as an example, the technical scheme adopted by the invention is as follows:
first, the present invention uses linear expressions to express different record structures, the linear expressions contain different levels, and can be used for training and decoding of record decimators in different stages, as shown in fig. 2. For example, "[ Transport arrived (Artifact bus) ]" represents a single bus-related Transport (Transport) event; "{ [ Transport arrived (Artifact Bush) ] } represents that a single text contains a transport event associated with Bush; "{ [ Transport arrived (Artifact Bush) ] [ event-Jail Arbitten (Person James) ] } represents that the text contains a plurality of events, one event is a transport event related to" Bush ", and the other event is an Arrest (event-Jail) event related to" policy ". Wherein each "()" sub-bracket consists of a text category and text block string, which is the smallest unit of extraction. Each "[ ]" sub-bracket represents a single event record. Each "{ }" sub-bracket represents an event record in a sentence, and may or may not contain a plurality of event records. When no event record is contained, the linear expression is "{ }.
The invention provides a record generation model based on sequence-to-structure, wherein the input of the model is a plain text sequence, and the output of the model is a linearized event structure expression. The method comprises the following steps:
For a text sequence to be extracted x_1, ..., x_|x|, the encoder based on the self-attention mechanism captures the semantic information of the target text and uses the multi-head attention mechanism to acquire the context information of the target text, obtaining the semantic feature representation:
H = Encoder(x_1, ..., x_|x|)
where x_i (i = 1, ..., |x|) denotes the i-th word in the text sequence to be extracted and |x| denotes the length of the text sequence.
Given the semantic feature representation H of the input sequence, the decoder based on the mixed attention mechanism generates the event record structure autoregressively. The decoder takes "<bos>" as the initial state and generates until "<eos>" is produced. The conditional probability of the whole generated sequence is obtained by multiplying the conditional probabilities at each step:
P(y | x) = ∏_{i=1..|y|} P(y_i | y_<i, x)
where x denotes the text sequence to be extracted, e.g. "The man returned to Los Angeles from Mexico"; y denotes the linear expression representing the extraction result (as shown in FIG. 2); i denotes the i-th position of the generated linear expression; |y| denotes the length of the linear expression; y_i denotes the symbol at the i-th position of the linear expression; y_<i denotes the already generated part of the linear expression; and P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated result y_<i.
At generation step i, the decoder performs a Softmax normalization over the vocabulary V to compute the generation probability P(y_i | y_<i, x). Based on the semantic feature representation H of the target text, the preceding decoder states h_<i and the output y_{i-1} of the previous step, the decoder predicts the output of the current step:
h_i = Decoder(H, h_<i, y_{i-1})
where the decoder combines a self-attention mechanism over its own states h_<i with a cross-attention mechanism over the target text semantic feature representation H, i.e. the decoder adopts a mixed attention mechanism. The self-attention mechanism refers to attention inside the decoder, i.e. attention over the already generated expression states during decoding, and the cross-attention mechanism refers to the decoder's attention over the semantic feature representation H produced by the encoder.
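To make the factorized probability concrete, the sketch below scores a candidate linear expression y against an input text x with a generic pre-trained encoder-decoder model. The choice of t5-small and the HuggingFace transformers API are illustrative assumptions; the patent only requires a self-attention encoder and a mixed-attention (self-attention plus cross-attention) decoder.

```python
# Minimal sketch (assumed setup, not the patent's implementation): compute
# log P(y | x) = sum_i log P(y_i | y_<i, x) for a linear expression y under a
# pre-trained sequence-to-sequence model, via teacher-forced decoding.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The structure identifiers may be missing from the pre-trained vocabulary,
# so they are added here; their embeddings would be learned during fine-tuning.
tokenizer.add_tokens(["{", "}", "[", "]"])
model.resize_token_embeddings(len(tokenizer))

x = "The man returned to Los Angeles from Mexico"
y = "{ [ Transport returned ( Artifact The man ) ] }"    # candidate linear expression

enc = tokenizer(x, return_tensors="pt")                   # encoder input: x_1 ... x_|x|
dec = tokenizer(y, return_tensors="pt")                   # target symbols: y_1 ... y_|y|

with torch.no_grad():
    out = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask,
                labels=dec.input_ids)                      # H = Encoder(x); mixed-attention decoding

log_probs = torch.log_softmax(out.logits, dim=-1)          # per-step distributions over the vocabulary
step_lp = log_probs.gather(-1, dec.input_ids.unsqueeze(-1)).squeeze(-1)
print("log P(y|x) =", step_lp.sum().item())                # sum of log P(y_i | y_<i, x)
```

Before fine-tuning on linear expressions, the score itself is not meaningful; the sketch only shows how the conditional probabilities in the formula above are obtained from the encoder-decoder.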
In the model decoding process, in order to effectively inject frame knowledge (a frame refers to a recorded specific structure, the frame knowledge refers to a role type of a specific participant in the recorded structure, for example, an arrest event should contain a plurality of different role types such as time, place, arrestee and the like) and ensure the integrity of a generated structure, the invention proposes a controllable decoding algorithm based on the limitation of a frame prefix tree to perform controllable generation. The decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation. The complete decoding process may be regarded as a process of searching the frame prefix tree. Specifically, the generation process of the record structure can be divided into three different decoding states:
1. Record frame: the names of event categories (T) and role categories (R) are generated, for example "Transport" and "Artifact" in the above examples.
2. Record mention: the strings (S) of event trigger words and event arguments are generated, for example "arrived" and "Bush" in the above examples.
3. Structure identifier: the structure identifiers "{", "}", "[", "]", "(" and ")" in the event structure linear expression are generated, which combine the record frame and the record mentions.
The controllable decoding algorithm based on the frame prefix tree limitation specifically comprises the following steps:
1) Pruning is automatically carried out on the word list based on the decoding state to generate a dynamic word list, and the specific method is as follows:
traversing the frame prefix tree to generate a linear expression, wherein the dynamic word list of each decoding state is a child node of the node corresponding to the state in the tree. The complete vocabulary is all legal natural language vocabulary and identifiers "() { } [ ]). For example, as shown in fig. 3, for the generated result "< bos > {", the child nodes of the corresponding nodes are "[" and "}", i.e., the dynamic vocabulary is only "[ }"). For example, as shown in fig. 4, when the generated result is "< bos > { [" the dynamic vocabulary corresponding to the state is the set of event categories T.
2) The linear expression is controllably generated by using the generated dynamic word list, and the specific method is as follows:
from dynamic vocabularySelecting conditional probability P (y i |y <i X) the highest candidate word is added at the end of the generated linear expression. For example, FIG. 3, "for a generated string"<bos>{ ", its legal candidate words are" [ "and" } ", the probabilities are P respectively a And P b . When P a ≥P b When generating'<bos>{ [ "; otherwise, generate'<bos>{}”。
The decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>. At each generation step i, the dynamic vocabulary of step i consists of the child nodes of the node reached by the generation so far. As shown in FIG. 3, the candidate vocabulary of the <bos> node is "{". Event category generation, event role generation and event mention generation then search the corresponding subtrees, as shown in FIG. 4.
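The following toy sketch illustrates the dynamic-vocabulary mechanism for the structural skeleton of FIG. 3 only; in the full method the event-category, role and mention subtrees of FIG. 4 hang below these nodes, and the placeholder scoring function stands in for the decoder probability P(y_i | y_<i, x).

```python
# Toy illustration (assumed simplification, not the patent's code) of
# prefix-tree-constrained decoding with a dynamic vocabulary.

class TrieNode:
    def __init__(self):
        self.children = {}                    # symbol -> TrieNode

    def add_path(self, symbols):
        node = self
        for s in symbols:
            node = node.children.setdefault(s, TrieNode())

# Skeleton paths of FIG. 3: "<bos> { } <eos>" (no record) and "<bos> { [ ] } <eos>".
root = TrieNode()
root.add_path(["<bos>", "{", "}", "<eos>"])
root.add_path(["<bos>", "{", "[", "]", "}", "<eos>"])

def dynamic_vocabulary(root, generated):
    """Allowed next symbols = children of the node reached by the generated prefix."""
    node = root
    for s in generated:
        node = node.children[s]
    return list(node.children)

def toy_score(prefix, candidate):
    """Stand-in for the decoder probability P(y_i | y_<i, x)."""
    return 0.9 if candidate == "}" else 0.1

def constrained_greedy_decode(root, score):
    y = ["<bos>"]
    while True:
        candidates = dynamic_vocabulary(root, y)
        if not candidates:                    # the leaf <eos> has been reached
            return y
        y.append(max(candidates, key=lambda c: score(y, c)))

print(dynamic_vocabulary(root, ["<bos>", "{"]))        # ['}', '['], as in FIG. 3
print(constrained_greedy_decode(root, toy_score))      # ['<bos>', '{', '}', '<eos>']
```

With a neural decoder, the same constraint can be attached to a standard generation loop, for example through the prefix_allowed_tokens_fn hook of the HuggingFace generate() method, although the patent does not prescribe a particular implementation.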
After the event record linear expression is generated, the linear record is parsed into a tree, and finally converted into the event record.
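A simplified parser along these lines is sketched below; it assumes that type, trigger and role names are single tokens (mention spans may contain several words) and is not the patent's actual transformation code.

```python
# Illustrative parser (assumed simplification): turn a generated linear
# expression into event records of the form used in the embodiment below.
import re

def parse_linear_expression(expr):
    # Split into structure identifiers and word tokens.
    tokens = re.findall(r"[{}\[\]()]|[^\s{}\[\]()]+", expr)
    records, i = [], 0
    while i < len(tokens):
        if tokens[i] == "[":                              # each "[...]" is one record
            i += 1
            rec = {"Type": tokens[i], "Trigger": tokens[i + 1], "Args": []}
            i += 2
            while tokens[i] != "]":
                if tokens[i] == "(":                      # each "(Role span ...)" is one argument
                    i += 1
                    role, span = tokens[i], []
                    i += 1
                    while tokens[i] != ")":
                        span.append(tokens[i])
                        i += 1
                    rec["Args"].append((role, " ".join(span)))
                i += 1
            records.append(rec)
        i += 1
    return records

expr = ("{[Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico)]"
        " [Arrest-Jail capture (Person The man) (Agent bounty hunters) (Time Tuesday)]}")
for rec in parse_linear_expression(expr):
    print(rec)
# {'Type': 'Transport', 'Trigger': 'returned', 'Args': [('Artifact', 'The man'), ...]}
# {'Type': 'Arrest-Jail', 'Trigger': 'capture', 'Args': [('Person', 'The man'), ...]}
```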
In the model training phase, the invention uses a pre-trained language model for text-oriented sequence-to-sequence generation as the initialization model. Unlike a common text-to-text generation task, the sequence-to-structure framework adopted by the invention faces problems such as the gap between text expressions and linear expressions in the output and the high proportion of semantics-free structure identifiers, which increase the difficulty of model learning and make it hard to effectively use the knowledge in the pre-trained model. To solve these problems, the invention designs a two-stage model learning algorithm based on curriculum learning, which helps the model migrate knowledge from the pre-trained language model and achieve efficient training. The training algorithm comprises the following steps:
the first stage employs a substructure extraction training strategy, with the objective of migrating the "sequence-to-sequence" model (i.e., the initialization model) to a "sequence-to-substructure" model. The stage mainly enables the model to learn the text block extraction capability and not learn the structure extraction capability. This stage uses the linear expression of the substructure event for model training. The sub-structure does not contain a hierarchical event structure, centered on the text block markers. For example, the sub-structure linear expression "{ (Transport arrived) (Artifact Bush) (Destination Saint Petersburg) }" does not contain a hierarchical structure, only different text block sub-brackets.
The second stage employs a full-structure extraction training strategy, with the objective of migrating the "sequence-to-substructure" model to a "sequence-to-structure" model. This stage uses full-structure linear expressions for model training and mainly learns the structure extraction capability.
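As an illustration, the sketch below derives the two training targets from a single annotated event (training example 1 of the embodiment further below); the record fields are assumed, and the training itself is ordinary sequence-to-sequence fine-tuning on (text, target) pairs.

```python
# Sketch (assumed data layout) of building the curriculum-learning targets:
# stage 1 uses a flat substructure expression without record-level nesting,
# stage 2 uses the full hierarchical expression.

def substructure_target(events):
    units = []
    for ev in events:
        units.append(f"({ev['type']} {ev['trigger']})")
        units += [f"({role} {span})" for role, span in ev["args"]]
    return "{" + " ".join(units) + "}"

def full_structure_target(events):
    parts = []
    for ev in events:
        tokens = [f"{ev['type']} {ev['trigger']}"] + [f"({r} {s})" for r, s in ev["args"]]
        parts.append("[" + " ".join(tokens) + "]")
    return "{" + " ".join(parts) + "}"

example = [{"type": "Transport", "trigger": "arrived",
            "args": [("Artifact", "Bush"), ("Destination", "Saint Petersburg")]}]
print(substructure_target(example))
# {(Transport arrived) (Artifact Bush) (Destination Saint Petersburg)}
print(full_structure_target(example))
# {[Transport arrived (Artifact Bush) (Destination Saint Petersburg)]}
```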
In summary, the key steps of the structured record extraction method based on controllable generation of the invention comprise:
1) A record structure linearization representation, wherein a record structure is represented by using a linearization expression, and the linearization expression can be mutually converted with a complete record structure;
2) Based on the representation, converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model;
3) Based on the representation and the generation model, a controllable decoding algorithm based on the limitation of a frame prefix tree is used for restraining the generation process in the generation structure, a dynamic vocabulary is generated by automatically pruning the vocabulary, the integrity of the generated record structure is ensured, and frame knowledge is injected.
4) Aiming at the training process of the generation model, a two-stage model learning algorithm based on curriculum learning is adopted, and the learning of the extraction model is carried out in two stages: substructure learning and full-structure learning.
Wherein, step 1) uses linear expression to express multi-level record structure, so that the model can generate complete event structure by autoregressive mode.
And 2) directly performing record extraction on the plain text sequence to generate a complete structured record, thereby avoiding the dependence of the model on fine-grained training data.
Wherein, step 3) controls the generation of the record structure by means of dynamic word list, thereby ensuring the integrity of the generated expression and effectively injecting the frame knowledge.
The step 4) is to pretrain the model by adopting a substructure through a two-stage training method, and then to use a full structure to finely tune the model so as to help the model to migrate knowledge from the pretrained language model.
An example of this method is as follows:
this embodiment takes as an example the extraction of a "Transport" event triggered by "return" and a "Arrest-Jail" event triggered by "capture" in "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters".
Scenario:
training corpus:
training example 1: "Bush arrived in Saint Petersburg," medium "arived" triggers a transport event, the destination and subject being "Saint Petersburg" and "Bush", respectively, with a full structure event linear expression of "{ [ Transport arrived (Artifact Bush) (Destination Saint Petersburg) ] }" and a substructure event linear expression of "{ (Transport arrived) (Artifact Bush) (Destination Saint Petersburg) }.
Training example 2: "Police arrested James in Coral Springs on Friday" in "Arbitted" triggered an Arrest event, subject, object, time and place are "policy", "James", "Friday" and "Coral Springs", respectively, with a full structure event linear expression of "{ [ Arrest-Jail Arbites (Agent policy) (Person James) (Place Coral Springs) (Time Friday) ] }" and a substructure event linear expression of "{ (Arrest-Jail Arbites) (Agent policy) (Person James) (Place Coral Springs) (Time Friday) }.
Testing corpus:
test example 1: "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters.
The implementation is as follows:
the invention uses text-event record training corpus, uses sequence to structure neural network model and course learning algorithm to construct event record extractor. In the method, an event substructure extraction task is utilized to pretrain a model, and then a complete model is subjected to fine adjustment by using a full-structure linear expression, so that a trained event extraction model is obtained.
And secondly, the invention uses a sequence-to-structure neural network model to generate the event record linear expression based on a controllable decoding algorithm limited by a frame prefix tree. For example, test case 1, the model will generate a linear expression (as shown in FIG. 2) containing two events: { [ Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico) ] [ Arrest-Jail capture (Person The man) (Agent bounty hunters) (Time Tuesday) ] }.
And (III) finally, converting the linear expression generated in the second step into two event records by using structural conversion: { Type: transport, trigger: return, arg1 Role: artifact, arg1: the man, arg2 Role: destination, arg2: los Angeles, arg3 Role: origin, arg3: mexico }; { Type: arrest-Jail, trigger: capture, arg1 Role: person, arg1: the man, arg2 Role: arrest-Jail. Agent, arg2: bounty registers, arg3 Role: time, arg3: tuesday }. In this example, a controllable decoding algorithm based on the framework prefix tree constraint guarantees the integrity of the generated structure and the validity of the event framework.
Based on the same inventive concept, another embodiment of the present invention provides a structured record extraction apparatus based on controllable generation employing the above method, comprising:
a linear expression generation module for converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model; in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps in the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
Other embodiments of the invention:
1) The invention can be equivalently applied to semantic role labeling and relation extraction tasks, i.e. binary and n-ary relations are converted into linear expressions and then extracted; for example, the binary relation in "Obama was born in Honolulu" can be expressed as "{born_in [Arg1 Obama] [Arg2 Honolulu]}" (a minimal sketch follows after this list).
2) The invention can represent records with a variety of linear expressions, including but not limited to swapping the positions of frames and mentions, replacing the identifiers "{}", "[]" and "()" with other identifiers, and so on. For example, test example 1 can be expressed as {[returned Transport (The man Artifact) (Los Angeles Destination) (Mexico Origin)] [capture Arrest-Jail (The man Person) (bounty hunters Agent) (Tuesday Time)]}.
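A minimal sketch for the relation case of item 1), assuming a simple (relation, Arg1, Arg2) triple as the record:

```python
# Illustrative only: linearizing a binary relation record in the format
# shown in item 1); the triple layout is an assumption for the example.
def relation_to_linear(relation, arg1, arg2):
    return f"{{{relation} [Arg1 {arg1}] [Arg2 {arg2}]}}"

print(relation_to_linear("born_in", "Obama", "Honolulu"))
# {born_in [Arg1 Obama] [Arg2 Honolulu]}
```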
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (5)

1. The structured record extraction method based on controllable generation is characterized by comprising the following steps of:
converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model;
in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
carrying out structural transformation on the generated linear expression to generate a structured record;
the sequence-to-structure based record generation model firstly captures text semantics of a target text by using a self-attention mechanism-based encoder, and then generates a structured linear expression by using a mixed-attention mechanism-based decoder;
the linear expression expresses a multi-layered record structure, wherein: each "()" sub-bracket consists of a text category and a text block string and is the smallest unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in a sentence and may contain several records or none;
the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation; the decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>, and at each generation step the dynamic vocabulary of that step consists of the child nodes of the currently generated node;
the decoding status includes:
recording frame: generating names of event categories and role categories;
record mentions: generating an event trigger word and a character string of an event argument;
structure identifier: generating a structure identifier in the event structure linear expression for combining the record frame and the record mention;
the controllable decoding algorithm comprises the following steps:
pruning is automatically carried out on the word list based on the decoding state to generate a dynamic word list: traversing the frame prefix tree to generate a linear expression, wherein the dynamic word list of each decoding state is a child node of a node corresponding to the state in the tree, and the complete word list is all legal natural language words and identifiers "() { } [ ]";
controllably generating a linear expression using the generated dynamic vocabulary: the candidate word with the highest conditional probability P(y_i | y_<i, x) is selected from the dynamic vocabulary and appended to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated linear expression result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
2. The method of claim 1, wherein the sequence-to-structure based record generation model employs a two-stage model learning method for efficient learning:
the first stage learning process adopts a linear expression of a substructure to perform model learning, and pays attention to training of text block extraction capacity;
the second stage learning process adopts a linear expression of a full structure to perform model learning, and pays attention to training of the extraction capacity of the structure.
3. A controllably generated structured record extraction device employing the method of claim 1 or 2, comprising:
a linear expression generation module for converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model; in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
4. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of claim 1 or 2.
5. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of claim 1 or 2.
CN202110637453.5A 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation Active CN113609244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110637453.5A CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110637453.5A CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Publications (2)

Publication Number Publication Date
CN113609244A CN113609244A (en) 2021-11-05
CN113609244B true CN113609244B (en) 2023-09-05

Family

ID=78303478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110637453.5A Active CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Country Status (1)

Country Link
CN (1) CN113609244B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
KR20130023563A (en) * 2011-08-29 2013-03-08 두산동아 주식회사 Apparatus and method for learning from text structure
CN103886080A (en) * 2014-03-25 2014-06-25 中国科学院地理科学与资源研究所 Method for extracting road traffic information from Internet unstructured text
CN108304911A (en) * 2018-01-09 2018-07-20 中国科学院自动化研究所 Knowledge Extraction Method and system based on Memory Neural Networks and equipment
CN109446513A (en) * 2018-09-18 2019-03-08 中国电子科技集团公司第二十八研究所 The abstracting method of event in a kind of text based on natural language understanding
CN111078825A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Structured processing method, structured processing device, computer equipment and medium
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN112487109A (en) * 2020-12-01 2021-03-12 朱胜青 Entity relationship extraction method, terminal and computer readable storage medium
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4141556B2 (en) * 1998-12-18 2008-08-27 株式会社日立製作所 Structured document management method, apparatus for implementing the method, and medium storing the processing program

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
KR20130023563A (en) * 2011-08-29 2013-03-08 두산동아 주식회사 Apparatus and method for learning from text structure
CN103886080A (en) * 2014-03-25 2014-06-25 中国科学院地理科学与资源研究所 Method for extracting road traffic information from Internet unstructured text
CN108304911A (en) * 2018-01-09 2018-07-20 中国科学院自动化研究所 Knowledge Extraction Method and system based on Memory Neural Networks and equipment
CN109446513A (en) * 2018-09-18 2019-03-08 中国电子科技集团公司第二十八研究所 The abstracting method of event in a kind of text based on natural language understanding
CN111078825A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Structured processing method, structured processing device, computer equipment and medium
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN112487109A (en) * 2020-12-01 2021-03-12 朱胜青 Entity relationship extraction method, terminal and computer readable storage medium
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A structured information extraction method for medical text data; Yang Bing; Journal of Chinese Computer Systems (小型微型计算机系统); Vol. 40, No. 7; 1479-1485 *

Also Published As

Publication number Publication date
CN113609244A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
Wu et al. CorefQA: Coreference resolution as query-based span prediction
Vahdat Toward robustness against label noise in training deep discriminative neural networks
CN116820429B (en) Training method and device of code processing model, electronic equipment and storage medium
Shilpa et al. Sentiment analysis using deep learning
CN116501306B (en) Method for generating interface document code based on natural language description
Liang et al. Reinforced iterative knowledge distillation for cross-lingual named entity recognition
CN111813913A (en) Two-stage problem generation system with problem as guide
CN115906815B (en) Error correction method and device for modifying one or more types of error sentences
CN117094325B (en) Named entity identification method in rice pest field
CN113609866A (en) Text marking method, device, equipment and storage medium
CN113609244B (en) Structured record extraction method and device based on controllable generation
CN110008344B (en) Method for automatically marking data structure label on code
Ressmeyer et al. “Deep faking” political twitter using transfer learning and GPT-2
US20230168989A1 (en) BUSINESS LANGUAGE PROCESSING USING LoQoS AND rb-LSTM
CN115759103A (en) Training method and recognition method for small sample named entity recognition model
CN115455937A (en) Negative analysis method based on syntactic structure and comparative learning
He et al. Entire information attentive GRU for text representation
CN115587184A (en) Method and device for training key information extraction model and storage medium thereof
Jin et al. Amr-to-text generation with cache transition systems
Gouws Deep unsupervised feature learning for natural language processing
CN115329740B (en) Data augmentation method and device for contracting documents, computer equipment and storage medium
CN111158640B (en) One-to-many demand analysis and identification method based on deep learning
Zhang et al. MCSN: Multi-graph Collaborative Semantic Network for Chinese NER
CN114492387B (en) Domain self-adaptive aspect term extraction method and system based on syntactic structure
Wang et al. Domain Knowledge Enhanced BERT for Chinese Named Entity Recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant