CN113609244A - Structured record extraction method and device based on controllable generation - Google Patents


Info

Publication number
CN113609244A
Authority
CN
China
Prior art keywords
linear expression
record
generation
model
structured
Prior art date
Legal status
Granted
Application number
CN202110637453.5A
Other languages
Chinese (zh)
Other versions
CN113609244B (en)
Inventor
陆垚杰
林鸿宇
韩先培
孙乐
唐家龙
Current Assignee
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202110637453.5A priority Critical patent/CN113609244B/en
Publication of CN113609244A publication Critical patent/CN113609244A/en
Application granted granted Critical
Publication of CN113609244B publication Critical patent/CN113609244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a structured record extraction method and device based on controllable generation. The method automatically extracts structured records from unstructured text, and the extraction proceeds as follows: for a target text, the sequence-to-structure network first captures the text semantics of the target text with an encoder based on the self-attention mechanism, and then generates a structured representation with a decoder based on a mixed attention mechanism; a controllable decoding algorithm based on a prefix tree constrains the decoding space during generation, injects framework knowledge, and guides the model to decode and generate a linear expression; finally, the linear expression undergoes structural transformation to produce the structured record. In the model training stage, a two-stage model learning method helps the model learn efficiently: the first stage uses sub-structures for model learning and emphasizes training the text-block extraction capability; the second stage uses the complete record structure for model learning and emphasizes training the structure extraction capability.

Description

Structured record extraction method and device based on controllable generation
Technical Field
The invention relates to a structured record extraction method, in particular to a record extraction method and device based on controllable generation, and belongs to the technical field of natural language processing.
Background
Record extraction aims to automatically extract structured record information from unstructured text; such record information includes, but is not limited to, event structures, binary inter-entity relation structures, and multivariate inter-entity relation structures. Taking event record extraction as an example, given the sentence "The man returned to Los Angeles from Mexico", a record extraction system should be able to identify a "Transport" event whose trigger is "returned" and whose arguments are "The man" (Artifact), "Los Angeles" (Destination), and "Mexico" (Origin). Taking the binary relation structure between entities as an example, given the sentence "Obama was born in Honolulu.", a record extraction system should be able to recognize that "Obama" holds the relation "born in" with "Honolulu". Structured record extraction is a key task in knowledge graph construction and natural language understanding.
The difficulty of structured record extraction lies in the complexity of the structural framework and the diversity of text expression. First, the structural framework is multi-element: a single record structure consists of a record category and participants in different roles, and different records have different semantic structures. Second, text expression is diverse: a single record structure can be realized by many different textual expressions.
The traditional record extraction model mainly adopts a decoupling idea to handle the complexity of the record structure and the diversity of text expression. Taking event record extraction as an example, the traditional method usually decomposes the extraction of a complete event structure into several subtasks (event trigger word detection, entity extraction, argument structure extraction) and combines the subtask results through different combination strategies (pipeline models, joint modeling, and joint reasoning methods) to extract the complete event structure. For example, the traditional method first determines that "returned" in the sentence triggers a Transport event, then extracts "The man" as a "Person" entity, and finally determines whether "The man" is the Artifact argument of the "returned" event. The decoupling approach mainly faces two problems: fine-grained data are difficult to label, and an optimal combination strategy is difficult to design manually. First, the traditional method needs training data labeled at different granularities for different subtasks. For example, Transport trigger word detection, Person entity recognition, and Transport argument extraction each require differently labeled training corpora, so the method uses labeled data inefficiently and increases the difficulty and cost of data acquisition. Furthermore, manually designing the optimal combination structure of the different subtasks is very challenging. For example, the pipeline model often causes error propagation, while the joint model needs to heuristically predefine information sharing and decision dependencies among trigger detection, argument classification, and entity recognition, often resulting in poorly structured and inflexible event extractors.
Disclosure of Invention
Aiming at the problems that fine-grained data are difficult to label and an optimal combination strategy is difficult to design, the invention provides a record extraction method and device based on controllable generation.
The technical scheme adopted by the invention is as follows:
a structured record extraction method based on controllable generation comprises the following steps:
converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model;
in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and performing structural transformation on the generated linear expression to generate the structured record.
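The three steps above can be sketched schematically as follows. This is an illustrative reading only, not the patented code: `model_step`, `trie`, and `parse` are hypothetical stand-ins for the sequence-to-structure model, the framework prefix tree, and the structural transformation, none of which the patent specifies at the code level.

```python
def extract_records(text, model_step, trie, parse, max_len=64):
    """Greedy controllable generation: at each step intersect the model's
    scores with the prefix tree's dynamic vocabulary, then transform the
    finished linear expression into structured records."""
    generated = ["<bos>"]
    while generated[-1] != "<eos>" and len(generated) < max_len:
        allowed = trie.dynamic_vocab(generated)   # constrain the decoding step
        dist = model_step(text, generated)        # model scores for this step
        generated.append(max(allowed, key=lambda t: dist.get(t, 0.0)))
    return parse(" ".join(generated[1:-1]))       # structural transformation
```

Any objects exposing `dynamic_vocab`, a per-step score dict, and a parser can be plugged in, which is the point of the decoupled three-step design.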
Further, the sequence-to-structure based record generation model first captures the text semantics of the target text using a self-attention mechanism based encoder, and then generates a structured linear expression using a mixed attention mechanism based decoder.
Further, the record generation model based on the sequence-to-structure adopts a two-stage model learning method to perform efficient learning:
in the first stage, a linear expression of a substructure is adopted for model learning in the learning process, and the training of text block extraction capability is emphasized;
and in the second stage, model learning is carried out by adopting a linear expression of a full structure, and the training of the structure extraction capability is emphasized.
Further, the linear expression expresses a multi-level record structure, wherein: each "()" sub-bracket consists of a text type and a text-block character string and is the minimum unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in one sentence and may contain several records or none.
Furthermore, the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation; the decoding process starts at the root node < bos > of the prefix tree and ends at the leaf node < eos >, at each generation step, the dynamic vocabulary of that step being the child node of the currently generated node.
Further, the controllable decoding algorithm comprises the following steps:
automatically pruning the vocabulary based on the decoding state to generate a dynamic vocabulary: the framework prefix tree is traversed to generate the linear expression, the dynamic vocabulary of each decoding state consists of the child nodes of the node corresponding to that state in the tree, and the complete vocabulary is all legal natural-language words plus the identifiers "(){}[]";
controllably generating a linear expression using the generated dynamic vocabulary: the candidate with the highest conditional probability P(y_i | y_<i, x) in the dynamic vocabulary is selected and appended to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated linear expression result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
A structured record extraction device based on controllable generation, which adopts the above method and comprises:
the linear expression generation module is used for converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model; in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
The invention has the beneficial effects that:
1) A multi-level record structure is expressed with a linear expression, so that the model can generate a complete event structure in an autoregressive manner.
2) Records are extracted directly on the plain text sequence to generate the complete structured record, avoiding the model's dependence on fine-grained training data.
3) The generation of the record structure is controlled through a dynamic vocabulary, ensuring the integrity of the generated expression and effectively injecting framework knowledge.
4) Through the two-stage training method, the model is first pre-trained with sub-structures and then fine-tuned with the full structure, helping the model migrate knowledge from the pre-trained language model.
Drawings
FIG. 1 is a block diagram of a record extraction framework based on controlled generation.
Fig. 2 is an event linear structure expression.
Fig. 3 is an event framework prefix tree.
Fig. 4 is an event category prefix tree.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The event extraction method based on controllable generation, as shown in fig. 1, mainly includes: 1) a sequence-to-structure based record generation model; 2) a controllable decoding algorithm based on framework prefix tree restriction; 3) a two-stage model learning algorithm based on curriculum learning.
The record forms mainly include relations and events. Taking event extraction as an example, the technical scheme adopted by the invention is as follows:
First, the present invention uses a linear expression to express different record structures; the linear expression includes different layers and can be used for training and decoding of the record extractor at different stages, as shown in fig. 2. For example, "[Transport arrived (Artifact Bush)]" represents a single Transport event whose Artifact is "Bush"; "{ [Transport arrived (Artifact Bush)] }" represents that a single text contains one Transport event related to "Bush"; "{ [Transport arrived (Artifact Bush)] [Arrest-Jail arrested (Agent Police) (Person James)] }" represents that the text contains several events, one a Transport event related to "Bush" and the other an Arrest-Jail event related to "Police". Here, each "()" sub-bracket consists of a text type and a text-block character string and is the minimum unit of extraction. Each "[]" sub-bracket represents a single event record. Each "{}" sub-bracket represents the event records in a sentence and may contain several event records or none. When no event record is contained, the linear expression is "{}".
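As an illustration of the bracket conventions above (assuming records are held as simple dicts, a representation the patent does not prescribe), the linearization can be sketched as:

```python
def linearize_record(record):
    """Render one record as "[Type trigger (Role span) ...]"."""
    parts = [f"({role} {span})" for role, span in record["args"]]
    return "[" + " ".join([record["type"], record["trigger"], *parts]) + "]"

def linearize_sentence(records):
    """Render all records of a sentence; a sentence with no records is "{}"."""
    if not records:
        return "{}"
    return "{ " + " ".join(linearize_record(r) for r in records) + " }"
```

Because each level is delimited by a distinct bracket pair, the mapping between a record and its linear expression is invertible, which is what lets the structural transformation step recover full records later.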
The invention provides a record generation model based on sequence-to-structure, wherein the input of the model is a plain text sequence, and the output is a linearized event structure expression. The method comprises the following steps:
For the text sequence to be extracted x_1, ..., x_|x|, the encoder model Encoder, based on the self-attention mechanism, captures the semantic information in the target text, acquires the context information in the target text using a multi-head attention mechanism, and obtains the semantic feature representation H:
H = Encoder(x_1, ..., x_|x|);
where x_i (i = 1, ..., |x|) denotes the i-th word in the text sequence to be extracted and |x| denotes the length of the text sequence.
Given the semantic feature representation H of the input sequence, the decoder Decoder, based on the mixed attention mechanism, generates the event record structure in an autoregressive manner. The decoder model starts from the initial state "<bos>" and generates until it ends with "<eos>". The conditional probability of the whole generated sequence is the product of the conditional probabilities at each step:
P(y | x) = ∏_{i=1}^{|y|} P(y_i | y_<i, x);
where x denotes the text sequence to be extracted, such as "The man returned to Los Angeles from Mexico", y denotes the linear expression obtained by extraction (as shown in fig. 2), i denotes the i-th position of the generated linear expression, |y| denotes the length of the linear expression, y_i denotes the symbol at the i-th position of the linear expression, y_<i denotes the already generated linear expression result, and P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the generated result y_<i.
At the i-th generation step, the decoder performs a Softmax normalization over the vocabulary V to compute the single-step generation probability P(y_i | y_<i, x). Based on the semantic feature representation H of the target text, the decoder states h^d_{<i} of the preceding steps, and the output y_{i-1} of the previous step, the Decoder predicts the output of the current step:
y_i, h^d_i = Decoder([H; h^d_{<i}], y_{i-1});
where the Decoder applies a self-attention mechanism over its decoder states h^d and a cross-attention mechanism over the semantic feature representation H of the target text, i.e., the Decoder adopts a mixed attention mechanism. The self-attention mechanism refers to attention inside the Decoder, i.e., attention over the already generated expression states during decoding; the cross-attention mechanism refers to the Decoder's attention over the semantic feature representation H of the target text produced by the Encoder.
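The autoregressive factorization above can be sketched in a few lines; the per-step distribution `step_prob` is a hypothetical stand-in for the decoder's Softmax output, which the patent does not specify at the code level:

```python
import math

def sequence_log_prob(x_tokens, y_tokens, step_prob):
    """log P(y|x) = sum_i log P(y_i | y_<i, x), accumulated step by step.

    `step_prob(x_tokens, prefix)` must return a dict mapping each candidate
    next symbol to its probability, standing in for the real decoder step.
    """
    total = 0.0
    for i, y_i in enumerate(y_tokens):
        dist = step_prob(x_tokens, y_tokens[:i])  # P(. | y_<i, x)
        total += math.log(dist[y_i])
    return total
```

With a real model, `step_prob` would run one decoder step under the mixed attention mechanism; here any callable with the same shape works.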
In the model decoding process, in order to effectively inject framework knowledge and ensure the integrity of the generated structure, the invention provides a controllable decoding algorithm based on framework prefix tree restriction for controllable generation. (The framework refers to the specific structure of a record; framework knowledge refers to the role types of the specific participants in the record structure; for example, an arrest event should contain a number of different role types such as "Time", "Place", "Person" and "Agent".) The decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation. The complete decoding process can be regarded as a search over the framework prefix tree. Specifically, the generation process of the record structure can be divided into three different decoding states:
1. Record framework: names of the event category (T) and the role category (R) are generated, for example "Transport" and "Artifact" in the above example.
2. Record mention: the character strings (S) of event triggers and event arguments are generated, for example "arrived" and "Bush" in the above example.
3. Structure identifier: the structure identifiers "{", "}", "[", "]", "(" and ")" in the event structure linear expression are generated, which combine the record framework and the record mentions.
The controllable decoding algorithm based on the framework prefix tree limitation specifically comprises the following steps:
1) based on the decoding state, the vocabulary is automatically pruned to generate a dynamic vocabulary, and the specific method comprises the following steps:
The framework prefix tree is traversed to generate the linear expression, and the dynamic vocabulary of each decoding state consists of the child nodes of the node corresponding to that state in the tree. The complete vocabulary is all legal natural-language words plus the identifiers "(){}[]". For example, as shown in fig. 3, for the generated result "<bos>{", the child nodes of the corresponding node are "[" and "}", i.e., the dynamic vocabulary is {"[", "}"}. For example, as shown in fig. 4, when the generated result is "<bos>{[", the dynamic vocabulary of that state is the set of event categories T.
2) And controllably generating a linear expression by utilizing the generated dynamic vocabulary, wherein the specific method comprises the following steps:
The candidate word with the highest conditional probability P(y_i | y_<i, x) in the dynamic vocabulary is selected and appended to the end of the generated linear expression. For example, in fig. 3, for the generated string "<bos>{", the legal candidate words are "[" and "}", with probabilities P_a and P_b respectively. When P_a ≥ P_b, "<bos>{[" is generated; otherwise "<bos>{}" is generated.
The decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>. At each generation step i, the dynamic vocabulary of that step consists of the child nodes of the currently generated node. As shown in fig. 3, the candidate vocabulary of the <bos> node is {"{"}. Event category generation, event role generation, and event mention generation then search the corresponding subtrees, as shown in fig. 4.
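The pruning procedure above can be sketched with a plain dictionary trie. This is an illustrative reading of the algorithm, not the patented implementation:

```python
class PrefixTrie:
    """Prefix tree over legal expression prefixes; the children of the node
    reached by the generated prefix form the dynamic vocabulary."""

    def __init__(self):
        self.root = {}

    def add(self, tokens):
        """Insert one legal token sequence, e.g. ["<bos>", "{", "}", "<eos>"]."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def dynamic_vocab(self, generated):
        """Tokens allowed after `generated` (empty set if the prefix is illegal)."""
        node = self.root
        for tok in generated:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node)

def constrained_step(trie, generated, dist):
    """Pick the allowed candidate with the highest model probability."""
    allowed = trie.dynamic_vocab(generated)
    return max(allowed, key=lambda t: dist.get(t, 0.0))
```

For the generated prefix `<bos>{`, the dynamic vocabulary contains only "[" and "}", mirroring the fig. 3 example.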
After the generation of the event record linear expression is finished, the linear expression is parsed into a tree, and the tree is finally converted into the event records.
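The parsing step can be illustrated with a simplified reader of the bracket notation. It assumes role spans contain no brackets, and is only a sketch of the structural transformation, not the patented code:

```python
import re

def parse_linear_expression(expr):
    """Turn "{ [Type trigger (Role span) ...] ... }" into a list of records."""
    records = []
    for rec in re.findall(r"\[([^\]]*)\]", expr):      # one "[...]" per record
        head, *arg_parts = re.split(r"\s*\(", rec)
        rtype, trigger = head.split()
        args = []
        for part in arg_parts:                         # each "(Role span)" block
            role, span = part.rstrip(") ").split(maxsplit=1)
            args.append((role, span))
        records.append({"type": rtype, "trigger": trigger, "args": args})
    return records
```

Applied to the test example's expression, this yields a Transport record with ("Origin", "Mexico") among its arguments; "{}" yields no records.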
In the model training phase, the invention uses a pre-trained language model for sequence-to-text generation as the initialization model. Unlike the common text-to-text generation task, the sequence-to-structure framework adopted by the invention faces problems such as the gap between the output linear expression and ordinary text expression and the high proportion of structure identifiers that carry no semantics, which increase the difficulty of model learning and make it hard to use the knowledge in the pre-trained model effectively. To solve these problems, the invention designs a two-stage model learning algorithm based on curriculum learning to train the model and help it migrate knowledge from the pre-trained language model, realizing efficient training. The training algorithm comprises the following steps:
The first stage employs a sub-structure extraction training strategy that aims to migrate the "sequence-to-sequence" model (i.e., the initialization model) to a "sequence-to-substructure" model. In this stage the model mainly learns the text-block extraction capability and does not yet learn the structure extraction capability, so linear expressions of sub-structure events are used for training. A sub-structure contains no hierarchical event structure and is centered on text blocks. For example, the sub-structure linear expression "{ (Transport arrived) (Artifact Bush) }" contains no hierarchy, only flat text-block sub-brackets.
The second stage employs a full-structure extraction training strategy that migrates the "sequence-to-substructure" model to a "sequence-to-structure" model. In this stage, linear expressions of the full structure are used for training, and the model mainly learns the structure extraction capability.
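The relation between the two stages' training targets can be illustrated by deriving a sub-structure expression from a full-structure one. This is a hedged sketch under the bracket conventions described earlier:

```python
import re

def to_substructure(full_expr):
    """Flatten "{ [Type trig (Role span) ...] }" into "{ (Type trig) (Role span) ... }",
    dropping the "[]" hierarchy while keeping every text block."""
    blocks = []
    for rec in re.findall(r"\[([^\]]*)\]", full_expr):
        m = re.match(r"\s*(\S+\s+\S+)\s*(.*)", rec)   # "Type trigger", then args
        head, args = m.group(1), m.group(2).strip()
        blocks.append(f"({head})" + (" " + args if args else ""))
    return "{ " + " ".join(blocks) + " }" if blocks else "{}"
```

Deriving the first-stage target mechanically from the second-stage target means only one set of full-structure annotations is needed for both curriculum stages.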
In summary, the key steps of the structured record extraction method based on controllable generation of the present invention include:
1) record structure linear expression: a linear expression is used to express the record structure, and the linear expression can be converted to and from the complete record structure;
2) based on this representation, a sequence-to-structure based record generation model converts the plain text sequence into a structured linear expression;
3) based on this representation and the generation model, a controllable decoding algorithm based on framework prefix tree restriction constrains the generation process, generating a dynamic vocabulary by automatically pruning the vocabulary, thereby ensuring the integrity of the generated record structure and injecting framework knowledge;
4) for the training process of the generation model, a two-stage model learning algorithm based on curriculum learning divides the learning of the extraction model into a sub-structure learning stage and a full-structure learning stage.
In step 1), a multi-level record structure is expressed with a linear expression, so that the model can generate a complete event structure in an autoregressive manner.
In step 2), records are extracted directly on the plain text sequence to generate the complete structured record, avoiding the model's dependence on fine-grained training data.
In step 3), the generation of the record structure is controlled through a dynamic vocabulary, ensuring the integrity of the generated expression and effectively injecting framework knowledge.
In step 4), through the two-stage training method, the model is first pre-trained with sub-structures and then fine-tuned with the full structure, which helps the model migrate knowledge from the pre-trained language model.
An example of the above method is as follows:
This embodiment takes as an example extracting the "Transport" event triggered by "returned" and the "Arrest-Jail" event triggered by "capture" from the sentence "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters."
Scene:
Training corpus:
training example 1: "in" round involved in Saint Petersburg "triggers a Transport event, the Destination and subject are" Saint Petersburg "and" bump ", respectively, a full structure event linear expression is" { [ Transport involved (Destination) Petersburg) ] } ", and a sub structure event linear expression is" { (Transport involved (Destination) Petersburg) } ".
Training example 2: "arrested" in "Police arrested James in Palm Springs on Friday" triggers an Arrest-Jail event, whose Agent, Person, Time and Place are "Police", "James", "Friday" and "Palm Springs" respectively; the full-structure event linear expression is "{ [Arrest-Jail arrested (Agent Police) (Person James) (Place Palm Springs) (Time Friday)] }", and the sub-structure event linear expression is "{ (Arrest-Jail arrested) (Agent Police) (Person James) (Place Palm Springs) (Time Friday) }".
Test corpus:
Test example 1: "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters."
The implementation is as follows:
(I) The invention uses the text-to-event-record training corpus and constructs the event record extractor with a sequence-to-structure neural network model and a curriculum learning algorithm. The model is first pre-trained with the event sub-structure extraction task, and the complete model is then fine-tuned with full-structure linear expressions to obtain the trained event extraction model.
(II) The invention uses the sequence-to-structure neural network model and the controllable decoding algorithm based on framework prefix tree restriction to generate the event record linear expression. For example, for test example 1, the model generates a linear expression containing two events (as shown in fig. 2): "{ [Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico)] [Arrest-Jail capture (Person The man) (Agent bounty hunters) (Time Tuesday)] }".
(III) Finally, the linear expression generated in step (II) is converted into two event records by structural transformation: { Type: Transport, Trigger: returned, Arg1 Role: Artifact, Arg1: The man, Arg2 Role: Destination, Arg2: Los Angeles, Arg3 Role: Origin, Arg3: Mexico }; { Type: Arrest-Jail, Trigger: capture, Arg1 Role: Person, Arg1: The man, Arg2 Role: Agent, Arg2: bounty hunters, Arg3 Role: Time, Arg3: Tuesday }. In this example, the controllable decoding algorithm based on framework prefix tree restriction guarantees the integrity of the generated structure and the validity of the event framework.
Based on the same inventive concept, another embodiment of the present invention provides a structured record extraction device based on controllable generation, which adopts the above method, and comprises:
the linear expression generation module is used for converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model; in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (a computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
Other embodiments of the invention:
1) The invention is equally applicable to the tasks of semantic role labeling and relation extraction, i.e., binary and multivariate relations are converted into linear expressions for extraction; for example, the binary relation in "Obama was born in Honolulu" can be expressed as "{ Born_in [Arg1 Obama] [Arg2 Honolulu] }".
2) The present invention may represent records with a variety of linear expressions, including but not limited to swapping the positions of frameworks and mentions, replacing the identifiers "(){}[]" with different ones, and so on. For example, test example 1 can be expressed as "{ [returned Transport (The man Artifact) (Los Angeles Destination) (Mexico Origin)] [capture Arrest-Jail (The man Person) (bounty hunters Agent) (Tuesday Time)] }".
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A structured record extraction method based on controllable generation is characterized by comprising the following steps:
converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model;
in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and performing structural transformation on the generated linear expression to generate the structured record.
2. The method of claim 1, wherein the sequence-to-structure based record generation model first captures the text semantics of a target text using a self-attention mechanism based encoder and then generates a structured linear expression using a mixed attention mechanism based decoder.
3. The method of claim 1, wherein the sequence-to-structure based record generation model employs a two-stage model learning approach for efficient learning:
in the first stage, model learning is performed with linear expressions of substructures, emphasizing training of the text block extraction capability;
in the second stage, model learning is performed with linear expressions of the full structure, emphasizing training of the structure extraction capability.
4. The method of claim 1, wherein the linear expression expresses a multi-level record structure, wherein: each "()" sub-bracket consists of a text type and a text block character string and is the minimum unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in one sentence, which may contain multiple records or no record at all.
5. The method of claim 4, wherein the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby achieving controllable generation; the decoding process starts at the root node <bos> of the prefix tree and ends at the leaf node <eos>; at each generation step, the dynamic vocabulary of that step consists of the children of the currently generated node.
6. The method of claim 5, wherein the decoding state comprises:
record framework: the names of the event categories and role categories to be generated;
record mention: the character strings of the event trigger words and event arguments to be generated;
structure identifier: the structure identifiers in the event-structure linear expression, used to combine the record framework and the record mentions.
7. The method of claim 5, wherein the controllable decoding algorithm comprises the steps of:
automatically pruning the vocabulary based on the decoding state to generate a dynamic vocabulary: the linear expression is generated by traversing the frame prefix tree, the dynamic vocabulary of each decoding state is the set of children of the node corresponding to that state in the tree, and the complete vocabulary consists of all legal natural language words and the identifiers "(){}[]";
controllably generating a linear expression using the generated dynamic vocabulary: selecting from the dynamic vocabulary the candidate with the highest conditional probability P(y_i | y_<i, x) and appending it to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
8. A structured record extraction device based on controllable generation and adopting the method of any one of claims 1 to 7, characterized by comprising:
the linear expression generation module is used for converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model; in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
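The prefix-tree-constrained decoding of claims 5 to 7 can be sketched as follows. This is a hypothetical minimal illustration rather than the patented implementation: the `score_fn` argument stands in for the model's conditional probability P(y_i | y_<i, x), and greedy selection replaces whatever search strategy the actual decoder uses.

```python
class TrieNode:
    """One node of the frame prefix tree; children map tokens to nodes."""
    def __init__(self):
        self.children = {}

def build_prefix_tree(sequences):
    """Build the prefix tree from all legal linear-expression token sequences."""
    root = TrieNode()
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode())
    return root

def constrained_decode(score_fn, root, max_len=64):
    """Greedy decoding restricted to the prefix tree: at each step the
    dynamic vocabulary is the set of children of the current node, and the
    child maximizing score_fn(prefix, token) is appended."""
    node, output = root, []
    while node.children and len(output) < max_len:
        allowed = list(node.children)        # dynamic vocabulary of this step
        tok = max(allowed, key=lambda t: score_fn(output, t))
        output.append(tok)
        node = node.children[tok]            # descend toward a leaf
    return output

# Toy tree of two legal expressions; a real tree would enumerate all
# event types, roles, and identifier positions of the schema.
root = build_prefix_tree([["{", "[", "Transport", "]", "}"], ["{", "}"]])
out = constrained_decode(lambda prefix, tok: len(tok), root)
```

Because every step only ever scores tokens drawn from the children of the current node, the decoder cannot emit an ill-formed expression, which is the "controllable generation" property the claims describe.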
CN202110637453.5A 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation Active CN113609244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110637453.5A CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Publications (2)

Publication Number Publication Date
CN113609244A true CN113609244A (en) 2021-11-05
CN113609244B CN113609244B (en) 2023-09-05

Family

ID=78303478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110637453.5A Active CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Country Status (1)

Country Link
CN (1) CN113609244B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040205598A1 (en) * 1998-12-18 2004-10-14 Toru Takahashi Method and system for management of structured document and medium having processing program therefor
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
KR20130023563A (en) * 2011-08-29 2013-03-08 두산동아 주식회사 Apparatus and method for learning from text structure
CN103886080A (en) * 2014-03-25 2014-06-25 中国科学院地理科学与资源研究所 Method for extracting road traffic information from Internet unstructured text
CN108304911A (en) * 2018-01-09 2018-07-20 中国科学院自动化研究所 Knowledge Extraction Method and system based on Memory Neural Networks and equipment
CN109446513A (en) * 2018-09-18 2019-03-08 中国电子科技集团公司第二十八研究所 The abstracting method of event in a kind of text based on natural language understanding
CN111078825A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Structured processing method, structured processing device, computer equipment and medium
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN112487109A (en) * 2020-12-01 2021-03-12 朱胜青 Entity relationship extraction method, terminal and computer readable storage medium
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TONGLIANG LI et al.: "AnaSearch: Extract, Retrieve and Visualize Structured Results from Unstructured Text for Analytical Queries", WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining *
YANG Bing: "A Structured Information Extraction Method for Medical Text Data", Journal of Chinese Computer Systems *

Also Published As

Publication number Publication date
CN113609244B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN109492113B (en) Entity and relation combined extraction method for software defect knowledge
CN110334339B (en) Sequence labeling model and labeling method based on position perception self-attention mechanism
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN116820429B (en) Training method and device of code processing model, electronic equipment and storage medium
CN114168749A (en) Question generation system based on knowledge graph and question word drive
CN112749562A (en) Named entity identification method, device, storage medium and electronic equipment
CN113553850A (en) Entity relation extraction method based on ordered structure encoding pointer network decoding
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN115906815B (en) Error correction method and device for modifying one or more types of error sentences
CN111813913A (en) Two-stage problem generation system with problem as guide
CN115688879A (en) Intelligent customer service voice processing system and method based on knowledge graph
CN114238652A (en) Industrial fault knowledge map establishing method for end-to-end scene
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN115115984A (en) Video data processing method, apparatus, program product, computer device, and medium
JP7466784B2 (en) Training Neural Networks Using Graph-Based Temporal Classification
CN113609244A (en) Structured record extraction method and device based on controllable generation
CN112131879A (en) Relationship extraction system, method and device
CN113449517B (en) Entity relationship extraction method based on BERT gated multi-window attention network model
CN115455937A (en) Negative analysis method based on syntactic structure and comparative learning
Liu Task-Oriented Explainable Semantic Communication Based on Semantic Triplets
CN114638238A (en) Training method and device of neural network model
CN110390010A (en) A kind of Method for Automatic Text Summarization
Hu Research on Named Entity Recognition Technology based on pre-trained model
Jiang et al. Automatic Question Answering Method Based on IMGRU-Seq2seq
Maqsood Evaluating NewsQA Dataset With ALBERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant