CN113609244B - Structured record extraction method and device based on controllable generation - Google Patents

Structured record extraction method and device based on controllable generation

Info

Publication number
CN113609244B
CN113609244B (application number CN202110637453.5A)
Authority
CN
China
Prior art keywords
record
linear expression
structured
generation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110637453.5A
Other languages
Chinese (zh)
Other versions
CN113609244A (en)
Inventor
陆垚杰
林鸿宇
韩先培
孙乐
唐家龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202110637453.5A priority Critical patent/CN113609244B/en
Publication of CN113609244A publication Critical patent/CN113609244A/en
Application granted granted Critical
Publication of CN113609244B publication Critical patent/CN113609244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a structured record extraction method and device based on controllable generation. The method automatically extracts structured records from unstructured text; the extraction comprises the following steps: for a target text, a sequence-to-structure network first captures the text semantics of the target text with an encoder based on the self-attention mechanism, and then generates a structured representation with a decoder based on a mixed attention mechanism; a controllable decoding algorithm based on a prefix tree constrains the decoding space during generation, injects frame knowledge, and guides the model to decode and generate a linear expression; finally, the linear expression is structurally transformed to generate the structured record. In the model training stage, a two-stage model learning method is adopted to help the model learn efficiently: the first stage performs model learning with substructures and focuses on training the text block extraction capability; the second stage performs model learning with the complete record structure and focuses on training the structure extraction capability.

Description

Structured record extraction method and device based on controllable generation
Technical Field
The invention relates to a structured record extraction method, in particular to a record extraction method and device based on controllable generation, and belongs to the technical field of natural language processing.
Background
Record extraction aims to automatically extract structured record information from unstructured text; such record information includes, but is not limited to, event structures, binary relation structures between entities, n-ary relation structures between entities, and the like. Taking event record extraction as an example, given the sentence "The man returned to Los Angeles from Mexico", a record extraction system should be able to identify a "Transport" event whose trigger word is "returned" and whose arguments are "The man" (subject), "Los Angeles" (destination) and "Mexico" (origin). Taking the binary relation structure between entities as an example, given the sentence "Obama was born in Honolulu", a record extraction system should be able to identify that the relation between "Obama" and "Honolulu" is "born in". Structured record extraction is a key task in knowledge graph construction and natural language understanding.
The difficulty of structured record extraction lies in the complexity of the structural framework and the diversity of text expressions. First, the structural framework is multi-element: a single record structure is made up of a record category and participants in different roles, and different records have different semantic structures. Second, text expressions are diverse: a single record structure can be expressed by many different text expressions.
The traditional record extraction model mainly adopts a decoupling idea to cope with the complexity of the record structure and the diversity of text expression. Taking event record extraction as an example, the conventional method generally decomposes the extraction of a complete event structure into several subtasks (event trigger word detection, entity extraction, argument structure extraction) and combines the results of the subtasks through different combination strategies (pipeline models, joint modeling and joint reasoning methods) to extract the complete event structure. For example, the conventional method first determines that "returned" in the sentence triggers a Transport event, then extracts "The man" as a Person entity, and finally determines whether "The man" is the subject corresponding to the "returned" event. The decoupling method mainly faces two problems: fine-grained data annotation is difficult, and an optimal combination strategy is difficult to design manually. First, the conventional method requires training data annotated at different granularities for the different subtasks. For example, for Transport trigger word detection, Person entity recognition and Transport argument extraction, the traditional method needs training corpora with different fine-grained annotations, which lowers the utilization efficiency of labeled data and increases the difficulty and cost of data acquisition. Furthermore, manually designing the optimal combination of the different subtasks is a very challenging task. For example, pipeline models often suffer from error propagation, and joint models require heuristically predefined information sharing and decision dependencies between trigger detection, argument classification and entity recognition, often resulting in an event extractor that is structurally suboptimal and inflexible in design.
Disclosure of Invention
Aiming at the problems of high difficulty in labeling fine-grained data and difficulty in designing an optimal combination strategy, the invention provides a record extraction method and device based on controllable generation.
The technical scheme adopted by the invention is as follows:
a structured record extraction method based on controllable generation, comprising the steps of:
converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model;
in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and carrying out structural transformation on the generated linear expression to generate a structured record.
Further, the sequence-to-structure based record generation model first captures text semantics of the target text using a self-attention mechanism based encoder and then generates a structured linear expression using a mixed-attention mechanism based decoder.
Further, the sequence-to-structure-based record generation model adopts a two-stage model learning method to perform efficient learning:
the first stage learning process adopts a linear expression of a substructure to perform model learning, and pays attention to training of text block extraction capacity;
the second stage learning process adopts a linear expression of a full structure to perform model learning, and pays attention to training of the extraction capacity of the structure.
Further, the linear expression expresses a multi-layered record structure, wherein: each "()" sub-bracket consists of a text category and a text block string and is the smallest unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in one sentence and may contain several records or none.
Further, the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation; the decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>, and at each generation step the dynamic vocabulary of that step consists of the child nodes of the currently generated node.
Further, the controllable decoding algorithm comprises the following steps:
pruning is automatically carried out on the word list based on the decoding state to generate a dynamic word list: traversing the frame prefix tree to generate a linear expression, wherein the dynamic word list of each decoding state is a child node of a node corresponding to the state in the tree, and the complete word list is all legal natural language words and identifiers "() { } [ ]";
controllably generating a linear expression using the generated dynamic vocabulary: the candidate word with the highest conditional probability P(y_i | y_<i, x) is selected from the dynamic vocabulary and appended to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated linear expression result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
A structured record extraction device based on controlled generation employing the above method, comprising:
a linear expression generation module for converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model; in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
The beneficial effects of the invention are as follows:
1) The linear expression is used for expressing a multi-level record structure, so that the model can generate a complete event structure in an autoregressive mode.
2) The record extraction is directly carried out on the plain text sequence, and the complete structured record is generated, so that the dependence of the model on fine granularity training data is avoided.
3) The generation of the record structure is controlled in a dynamic word list mode, so that the integrity of the generated expression is ensured, and the frame knowledge is effectively injected.
4) Through the two-stage training method, the model is first pre-trained with substructures and then fine-tuned with full structures, which helps the model migrate knowledge from the pre-trained language model.
Drawings
FIG. 1 is a diagram of a controllably generated record extraction framework.
Fig. 2 is an event linear structural expression.
Fig. 3 is an event frame prefix tree.
Fig. 4 is an event category prefix tree.
Detailed Description
In order to make the above features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The event extraction method based on controllable generation provided by the invention, as shown in FIG. 1, mainly comprises: 1) a sequence-to-structure record generation model; 2) a controllable decoding algorithm based on frame prefix tree constraints; 3) a two-stage model learning algorithm based on curriculum learning.
Records mainly take the form of relations and events. Taking event extraction as an example, the technical scheme adopted by the invention is as follows:
first, the present invention uses linear expressions to express different record structures, the linear expressions contain different levels, and can be used for training and decoding of record decimators in different stages, as shown in fig. 2. For example, "[ Transport arrived (Artifact bus) ]" represents a single bus-related Transport (Transport) event; "{ [ Transport arrived (Artifact Bush) ] } represents that a single text contains a transport event associated with Bush; "{ [ Transport arrived (Artifact Bush) ] [ event-Jail Arbitten (Person James) ] } represents that the text contains a plurality of events, one event is a transport event related to" Bush ", and the other event is an Arrest (event-Jail) event related to" policy ". Wherein each "()" sub-bracket consists of a text category and text block string, which is the smallest unit of extraction. Each "[ ]" sub-bracket represents a single event record. Each "{ }" sub-bracket represents an event record in a sentence, and may or may not contain a plurality of event records. When no event record is contained, the linear expression is "{ }.
The invention provides a record generation model based on sequence-to-structure, wherein the input of the model is a plain text sequence, and the output of the model is a linearized event structure expression. The method comprises the following steps:
For a text sequence to be extracted x_1, ..., x_|x|, the encoder based on the self-attention mechanism captures the semantic information of the target text and uses the multi-head attention mechanism to acquire the context information of the target text, obtaining the semantic feature representation:
H = Encoder(x_1, ..., x_|x|)
where x_i (i = 1, ..., |x|) denotes the i-th word in the text sequence to be extracted and |x| denotes the length of the text sequence.
Given the semantic feature representation H of the input sequence, the decoder based on the mixed attention mechanism generates the event record structure autoregressively. The decoder takes "<bos>" as the initial state and generates until "<eos>" is produced. The conditional probability of the whole generated sequence is obtained by multiplying the conditional probabilities at each step:
P(y | x) = ∏_{i=1..|y|} P(y_i | y_<i, x)
where x denotes the text sequence to be extracted, e.g. "The man returned to Los Angeles from Mexico"; y denotes the linear expression representing the extraction result (as shown in FIG. 2); i denotes the i-th position of the generated linear expression; |y| denotes the length of the linear expression; y_i denotes the symbol at the i-th position of the linear expression; y_<i denotes the already generated part of the linear expression; and P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated result y_<i.
At generation step i, the decoder performs a Softmax normalization over the vocabulary V to compute the generation probability P(y_i | y_<i, x). Based on the semantic feature representation H of the target text, the preceding decoder states h_<i and the output y_{i-1} of the previous step, the decoder predicts the output of the current step:
h_i = Decoder(H, h_<i, y_{i-1})
where the decoder combines a self-attention mechanism over its own states h_<i with a cross-attention mechanism over the target text semantic feature representation H, i.e. the decoder adopts a mixed attention mechanism. The self-attention mechanism refers to attention inside the decoder, i.e. attention over the already generated expression states during decoding, and the cross-attention mechanism refers to the decoder's attention over the semantic feature representation H produced by the encoder.
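To make the factorized probability concrete, the sketch below scores a candidate linear expression y against an input text x with a generic pre-trained encoder-decoder model. The choice of t5-small and the HuggingFace transformers API are illustrative assumptions; the patent only requires a self-attention encoder and a mixed-attention (self-attention plus cross-attention) decoder.

```python
# Minimal sketch (assumed setup, not the patent's implementation): compute
# log P(y | x) = sum_i log P(y_i | y_<i, x) for a linear expression y under a
# pre-trained sequence-to-sequence model, via teacher-forced decoding.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The structure identifiers may be missing from the pre-trained vocabulary,
# so they are added here; their embeddings would be learned during fine-tuning.
tokenizer.add_tokens(["{", "}", "[", "]"])
model.resize_token_embeddings(len(tokenizer))

x = "The man returned to Los Angeles from Mexico"
y = "{ [ Transport returned ( Artifact The man ) ] }"    # candidate linear expression

enc = tokenizer(x, return_tensors="pt")                   # encoder input: x_1 ... x_|x|
dec = tokenizer(y, return_tensors="pt")                   # target symbols: y_1 ... y_|y|

with torch.no_grad():
    out = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask,
                labels=dec.input_ids)                      # H = Encoder(x); mixed-attention decoding

log_probs = torch.log_softmax(out.logits, dim=-1)          # per-step distributions over the vocabulary
step_lp = log_probs.gather(-1, dec.input_ids.unsqueeze(-1)).squeeze(-1)
print("log P(y|x) =", step_lp.sum().item())                # sum of log P(y_i | y_<i, x)
```

Before fine-tuning on linear expressions, the score itself is not meaningful; the sketch only shows how the conditional probabilities in the formula above are obtained from the encoder-decoder.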
In the model decoding process, in order to effectively inject frame knowledge (a frame refers to a recorded specific structure, the frame knowledge refers to a role type of a specific participant in the recorded structure, for example, an arrest event should contain a plurality of different role types such as time, place, arrestee and the like) and ensure the integrity of a generated structure, the invention proposes a controllable decoding algorithm based on the limitation of a frame prefix tree to perform controllable generation. The decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation. The complete decoding process may be regarded as a process of searching the frame prefix tree. Specifically, the generation process of the record structure can be divided into three different decoding states:
1. Record frame: the names of event categories (T) and role categories (R) are generated, for example "Transport" and "Artifact" in the above examples.
2. Record mention: the strings (S) of event trigger words and event arguments are generated, for example "arrived" and "Bush" in the above examples.
3. Structure identifier: the structure identifiers "{", "}", "[", "]", "(" and ")" in the event structure linear expression are generated, which combine the record frame and the record mentions.
The controllable decoding algorithm based on the frame prefix tree limitation specifically comprises the following steps:
1) Pruning is automatically carried out on the word list based on the decoding state to generate a dynamic word list, and the specific method is as follows:
traversing the frame prefix tree to generate a linear expression, wherein the dynamic word list of each decoding state is a child node of the node corresponding to the state in the tree. The complete vocabulary is all legal natural language vocabulary and identifiers "() { } [ ]). For example, as shown in fig. 3, for the generated result "< bos > {", the child nodes of the corresponding nodes are "[" and "}", i.e., the dynamic vocabulary is only "[ }"). For example, as shown in fig. 4, when the generated result is "< bos > { [" the dynamic vocabulary corresponding to the state is the set of event categories T.
2) The linear expression is controllably generated by using the generated dynamic word list, and the specific method is as follows:
from dynamic vocabularySelecting conditional probability P (y i |y <i X) the highest candidate word is added at the end of the generated linear expression. For example, FIG. 3, "for a generated string"<bos>{ ", its legal candidate words are" [ "and" } ", the probabilities are P respectively a And P b . When P a ≥P b When generating'<bos>{ [ "; otherwise, generate'<bos>{}”。
The decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>. At each generation step i, the dynamic vocabulary of step i consists of the child nodes of the node reached by the generation so far. As shown in FIG. 3, the candidate vocabulary of the <bos> node is "{". Event category generation, event role generation and event mention generation then search the corresponding subtrees, as shown in FIG. 4.
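The following toy sketch illustrates the dynamic-vocabulary mechanism for the structural skeleton of FIG. 3 only; in the full method the event-category, role and mention subtrees of FIG. 4 hang below these nodes, and the placeholder scoring function stands in for the decoder probability P(y_i | y_<i, x).

```python
# Toy illustration (assumed simplification, not the patent's code) of
# prefix-tree-constrained decoding with a dynamic vocabulary.

class TrieNode:
    def __init__(self):
        self.children = {}                    # symbol -> TrieNode

    def add_path(self, symbols):
        node = self
        for s in symbols:
            node = node.children.setdefault(s, TrieNode())

# Skeleton paths of FIG. 3: "<bos> { } <eos>" (no record) and "<bos> { [ ] } <eos>".
root = TrieNode()
root.add_path(["<bos>", "{", "}", "<eos>"])
root.add_path(["<bos>", "{", "[", "]", "}", "<eos>"])

def dynamic_vocabulary(root, generated):
    """Allowed next symbols = children of the node reached by the generated prefix."""
    node = root
    for s in generated:
        node = node.children[s]
    return list(node.children)

def toy_score(prefix, candidate):
    """Stand-in for the decoder probability P(y_i | y_<i, x)."""
    return 0.9 if candidate == "}" else 0.1

def constrained_greedy_decode(root, score):
    y = ["<bos>"]
    while True:
        candidates = dynamic_vocabulary(root, y)
        if not candidates:                    # the leaf <eos> has been reached
            return y
        y.append(max(candidates, key=lambda c: score(y, c)))

print(dynamic_vocabulary(root, ["<bos>", "{"]))        # ['}', '['], as in FIG. 3
print(constrained_greedy_decode(root, toy_score))      # ['<bos>', '{', '}', '<eos>']
```

With a neural decoder, the same constraint can be attached to a standard generation loop, for example through the prefix_allowed_tokens_fn hook of the HuggingFace generate() method, although the patent does not prescribe a particular implementation.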
After the event record linear expression is generated, the linear record is parsed into a tree, and finally converted into the event record.
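A simplified parser along these lines is sketched below; it assumes that type, trigger and role names are single tokens (mention spans may contain several words) and is not the patent's actual transformation code.

```python
# Illustrative parser (assumed simplification): turn a generated linear
# expression into event records of the form used in the embodiment below.
import re

def parse_linear_expression(expr):
    # Split into structure identifiers and word tokens.
    tokens = re.findall(r"[{}\[\]()]|[^\s{}\[\]()]+", expr)
    records, i = [], 0
    while i < len(tokens):
        if tokens[i] == "[":                              # each "[...]" is one record
            i += 1
            rec = {"Type": tokens[i], "Trigger": tokens[i + 1], "Args": []}
            i += 2
            while tokens[i] != "]":
                if tokens[i] == "(":                      # each "(Role span ...)" is one argument
                    i += 1
                    role, span = tokens[i], []
                    i += 1
                    while tokens[i] != ")":
                        span.append(tokens[i])
                        i += 1
                    rec["Args"].append((role, " ".join(span)))
                i += 1
            records.append(rec)
        i += 1
    return records

expr = ("{[Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico)]"
        " [Arrest-Jail capture (Person The man) (Agent bounty hunters) (Time Tuesday)]}")
for rec in parse_linear_expression(expr):
    print(rec)
# {'Type': 'Transport', 'Trigger': 'returned', 'Args': [('Artifact', 'The man'), ...]}
# {'Type': 'Arrest-Jail', 'Trigger': 'capture', 'Args': [('Person', 'The man'), ...]}
```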
In the model training phase, the invention uses a pre-trained language model for text-oriented sequence-to-sequence generation as the initialization model. Unlike a common text-to-text generation task, the sequence-to-structure framework adopted by the invention faces problems such as the gap between text expressions and linear expressions in the output and the high proportion of semantics-free structure identifiers, which increase the difficulty of model learning and make it hard to effectively use the knowledge in the pre-trained model. To solve these problems, the invention designs a two-stage model learning algorithm based on curriculum learning, which helps the model migrate knowledge from the pre-trained language model and achieve efficient training. The training algorithm comprises the following steps:
the first stage employs a substructure extraction training strategy, with the objective of migrating the "sequence-to-sequence" model (i.e., the initialization model) to a "sequence-to-substructure" model. The stage mainly enables the model to learn the text block extraction capability and not learn the structure extraction capability. This stage uses the linear expression of the substructure event for model training. The sub-structure does not contain a hierarchical event structure, centered on the text block markers. For example, the sub-structure linear expression "{ (Transport arrived) (Artifact Bush) (Destination Saint Petersburg) }" does not contain a hierarchical structure, only different text block sub-brackets.
The second stage employs a full-structure extraction training strategy, with the objective of migrating the "sequence-to-substructure" model to a "sequence-to-structure" model. This stage uses full-structure linear expressions for model training and mainly learns the structure extraction capability.
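As an illustration, the sketch below derives the two training targets from a single annotated event (training example 1 of the embodiment further below); the record fields are assumed, and the training itself is ordinary sequence-to-sequence fine-tuning on (text, target) pairs.

```python
# Sketch (assumed data layout) of building the curriculum-learning targets:
# stage 1 uses a flat substructure expression without record-level nesting,
# stage 2 uses the full hierarchical expression.

def substructure_target(events):
    units = []
    for ev in events:
        units.append(f"({ev['type']} {ev['trigger']})")
        units += [f"({role} {span})" for role, span in ev["args"]]
    return "{" + " ".join(units) + "}"

def full_structure_target(events):
    parts = []
    for ev in events:
        tokens = [f"{ev['type']} {ev['trigger']}"] + [f"({r} {s})" for r, s in ev["args"]]
        parts.append("[" + " ".join(tokens) + "]")
    return "{" + " ".join(parts) + "}"

example = [{"type": "Transport", "trigger": "arrived",
            "args": [("Artifact", "Bush"), ("Destination", "Saint Petersburg")]}]
print(substructure_target(example))
# {(Transport arrived) (Artifact Bush) (Destination Saint Petersburg)}
print(full_structure_target(example))
# {[Transport arrived (Artifact Bush) (Destination Saint Petersburg)]}
```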
In summary, the key steps of the structured record extraction method based on controllable generation of the invention comprise:
1) A record structure linearization representation, wherein a record structure is represented by using a linearization expression, and the linearization expression can be mutually converted with a complete record structure;
2) Based on the representation, converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model;
3) Based on the representation and the generation model, a controllable decoding algorithm based on the limitation of a frame prefix tree is used for restraining the generation process in the generation structure, a dynamic vocabulary is generated by automatically pruning the vocabulary, the integrity of the generated record structure is ensured, and frame knowledge is injected.
4) Aiming at the training process of the generation model, a two-stage model learning algorithm based on curriculum learning is adopted, and the learning of the extraction model is carried out in two stages: substructure learning and full-structure learning.
Wherein, step 1) uses linear expression to express multi-level record structure, so that the model can generate complete event structure by autoregressive mode.
And 2) directly performing record extraction on the plain text sequence to generate a complete structured record, thereby avoiding the dependence of the model on fine-grained training data.
Wherein, step 3) controls the generation of the record structure by means of dynamic word list, thereby ensuring the integrity of the generated expression and effectively injecting the frame knowledge.
The step 4) is to pretrain the model by adopting a substructure through a two-stage training method, and then to use a full structure to finely tune the model so as to help the model to migrate knowledge from the pretrained language model.
An example of this method is as follows:
this embodiment takes as an example the extraction of a "Transport" event triggered by "return" and a "Arrest-Jail" event triggered by "capture" in "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters".
Scenario:
training corpus:
training example 1: "Bush arrived in Saint Petersburg," medium "arived" triggers a transport event, the destination and subject being "Saint Petersburg" and "Bush", respectively, with a full structure event linear expression of "{ [ Transport arrived (Artifact Bush) (Destination Saint Petersburg) ] }" and a substructure event linear expression of "{ (Transport arrived) (Artifact Bush) (Destination Saint Petersburg) }.
Training example 2: "Police arrested James in Coral Springs on Friday" in "Arbitted" triggered an Arrest event, subject, object, time and place are "policy", "James", "Friday" and "Coral Springs", respectively, with a full structure event linear expression of "{ [ Arrest-Jail Arbites (Agent policy) (Person James) (Place Coral Springs) (Time Friday) ] }" and a substructure event linear expression of "{ (Arrest-Jail Arbites) (Agent policy) (Person James) (Place Coral Springs) (Time Friday) }.
Testing corpus:
test example 1: "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters.
The implementation is as follows:
the invention uses text-event record training corpus, uses sequence to structure neural network model and course learning algorithm to construct event record extractor. In the method, an event substructure extraction task is utilized to pretrain a model, and then a complete model is subjected to fine adjustment by using a full-structure linear expression, so that a trained event extraction model is obtained.
And secondly, the invention uses a sequence-to-structure neural network model to generate the event record linear expression based on a controllable decoding algorithm limited by a frame prefix tree. For example, test case 1, the model will generate a linear expression (as shown in FIG. 2) containing two events: { [ Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico) ] [ Arrest-Jail capture (Person The man) (Agent bounty hunters) (Time Tuesday) ] }.
And (III) finally, converting the linear expression generated in the second step into two event records by using structural conversion: { Type: transport, trigger: return, arg1 Role: artifact, arg1: the man, arg2 Role: destination, arg2: los Angeles, arg3 Role: origin, arg3: mexico }; { Type: arrest-Jail, trigger: capture, arg1 Role: person, arg1: the man, arg2 Role: arrest-Jail. Agent, arg2: bounty registers, arg3 Role: time, arg3: tuesday }. In this example, a controllable decoding algorithm based on the framework prefix tree constraint guarantees the integrity of the generated structure and the validity of the event framework.
Based on the same inventive concept, another embodiment of the present invention provides a structured record extraction apparatus based on controllable generation employing the above method, comprising:
a linear expression generation module for converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model; in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps in the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
Other embodiments of the invention:
1) The invention can be equivalently applied to semantic role labeling and relation extraction tasks, i.e. binary and n-ary relations are converted into linear expressions and then extracted; for example, the binary relation in "Obama was born in Honolulu" can be expressed as "{born_in [Arg1 Obama] [Arg2 Honolulu]}" (a minimal sketch follows after this list).
2) The invention can represent records with a variety of linear expressions, including but not limited to swapping the positions of frames and mentions, replacing the identifiers "{}", "[]" and "()" with other identifiers, and so on. For example, test example 1 can be expressed as {[returned Transport (The man Artifact) (Los Angeles Destination) (Mexico Origin)] [capture Arrest-Jail (The man Person) (bounty hunters Agent) (Tuesday Time)]}.
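A minimal sketch for the relation case of item 1), assuming a simple (relation, Arg1, Arg2) triple as the record:

```python
# Illustrative only: linearizing a binary relation record in the format
# shown in item 1); the triple layout is an assumption for the example.
def relation_to_linear(relation, arg1, arg2):
    return f"{{{relation} [Arg1 {arg1}] [Arg2 {arg2}]}}"

print(relation_to_linear("born_in", "Obama", "Honolulu"))
# {born_in [Arg1 Obama] [Arg2 Honolulu]}
```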
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (5)

1. The structured record extraction method based on controllable generation is characterized by comprising the following steps of:
converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model;
in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
carrying out structural transformation on the generated linear expression to generate a structured record;
the sequence-to-structure based record generation model firstly captures text semantics of a target text by using a self-attention mechanism-based encoder, and then generates a structured linear expression by using a mixed-attention mechanism-based decoder;
the linear expression expresses a multi-layered record structure, wherein: each "()" sub-bracket consists of a text category and a text block string and is the smallest unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in a sentence and may contain several records or none;
the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation; the decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>, and at each generation step the dynamic vocabulary of that step consists of the child nodes of the currently generated node;
the decoding status includes:
recording frame: generating names of event categories and role categories;
record mentions: generating an event trigger word and a character string of an event argument;
structure identifier: generating a structure identifier in the event structure linear expression for combining the record frame and the record mention;
the controllable decoding algorithm comprises the following steps:
pruning is automatically carried out on the word list based on the decoding state to generate a dynamic word list: traversing the frame prefix tree to generate a linear expression, wherein the dynamic word list of each decoding state is a child node of a node corresponding to the state in the tree, and the complete word list is all legal natural language words and identifiers "() { } [ ]";
controllably generating a linear expression using the generated dynamic vocabulary: the candidate word with the highest conditional probability P(y_i | y_<i, x) is selected from the dynamic vocabulary and appended to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated linear expression result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
2. The method of claim 1, wherein the sequence-to-structure based record generation model employs a two-stage model learning method for efficient learning:
the first stage learning process adopts a linear expression of a substructure to perform model learning, and pays attention to training of text block extraction capacity;
the second stage learning process adopts a linear expression of a full structure to perform model learning, and pays attention to training of the extraction capacity of the structure.
3. A controllably generated structured record extraction device employing the method of claim 1 or 2, comprising:
a linear expression generation module for converting the plain text sequence into a structured linear expression using a sequence-to-structure based record generation model; in the process of converting the plain text sequence into the structured linear expression, a controllable decoding algorithm based on the frame prefix tree restriction is used for restricting the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
4. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of claim 1 or 2.
5. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of claim 1 or 2.
CN202110637453.5A 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation Active CN113609244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110637453.5A CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110637453.5A CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Publications (2)

Publication Number Publication Date
CN113609244A CN113609244A (en) 2021-11-05
CN113609244B true CN113609244B (en) 2023-09-05

Family

ID=78303478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110637453.5A Active CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Country Status (1)

Country Link
CN (1) CN113609244B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
KR20130023563A (en) * 2011-08-29 2013-03-08 두산동아 주식회사 Apparatus and method for learning from text structure
CN103886080A (en) * 2014-03-25 2014-06-25 中国科学院地理科学与资源研究所 Method for extracting road traffic information from Internet unstructured text
CN108304911A (en) * 2018-01-09 2018-07-20 中国科学院自动化研究所 Knowledge Extraction Method and system based on Memory Neural Networks and equipment
CN109446513A (en) * 2018-09-18 2019-03-08 中国电子科技集团公司第二十八研究所 The abstracting method of event in a kind of text based on natural language understanding
CN111078825A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Structured processing method, structured processing device, computer equipment and medium
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN112487109A (en) * 2020-12-01 2021-03-12 朱胜青 Entity relationship extraction method, terminal and computer readable storage medium
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4141556B2 (en) * 1998-12-18 2008-08-27 株式会社日立製作所 Structured document management method, apparatus for implementing the method, and medium storing the processing program

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
KR20130023563A (en) * 2011-08-29 2013-03-08 두산동아 주식회사 Apparatus and method for learning from text structure
CN103886080A (en) * 2014-03-25 2014-06-25 中国科学院地理科学与资源研究所 Method for extracting road traffic information from Internet unstructured text
CN108304911A (en) * 2018-01-09 2018-07-20 中国科学院自动化研究所 Knowledge Extraction Method and system based on Memory Neural Networks and equipment
CN109446513A (en) * 2018-09-18 2019-03-08 中国电子科技集团公司第二十八研究所 The abstracting method of event in a kind of text based on natural language understanding
CN111078825A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Structured processing method, structured processing device, computer equipment and medium
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN112487109A (en) * 2020-12-01 2021-03-12 朱胜青 Entity relationship extraction method, terminal and computer readable storage medium
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A structured information extraction method for medical text data; Yang Bing; Journal of Chinese Computer Systems (小型微型计算机系统); Vol. 40, No. 7; 1479-1485 *

Also Published As

Publication number Publication date
CN113609244A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
Wu et al. CorefQA: Coreference resolution as query-based span prediction
Vahdat Toward robustness against label noise in training deep discriminative neural networks
CN116820429B (en) Training method and device of code processing model, electronic equipment and storage medium
Shilpa et al. Sentiment analysis using deep learning
CN116501306B (en) Method for generating interface document code based on natural language description
Liang et al. Reinforced iterative knowledge distillation for cross-lingual named entity recognition
CN111813913A (en) Two-stage problem generation system with problem as guide
CN115906815B (en) Error correction method and device for modifying one or more types of error sentences
CN117094325B (en) Named entity identification method in rice pest field
CN113609866A (en) Text marking method, device, equipment and storage medium
CN113609244B (en) Structured record extraction method and device based on controllable generation
CN110008344B (en) Method for automatically marking data structure label on code
Ressmeyer et al. “Deep faking” political twitter using transfer learning and GPT-2
US20230168989A1 (en) BUSINESS LANGUAGE PROCESSING USING LoQoS AND rb-LSTM
CN115759103A (en) Training method and recognition method for small sample named entity recognition model
CN115455937A (en) Negative analysis method based on syntactic structure and comparative learning
He et al. Entire information attentive GRU for text representation
CN115587184A (en) Method and device for training key information extraction model and storage medium thereof
Jin et al. Amr-to-text generation with cache transition systems
Gouws Deep unsupervised feature learning for natural language processing
CN115329740B (en) Data augmentation method and device for contracting documents, computer equipment and storage medium
CN111158640B (en) One-to-many demand analysis and identification method based on deep learning
Zhang et al. MCSN: Multi-graph Collaborative Semantic Network for Chinese NER
CN114492387B (en) Domain self-adaptive aspect term extraction method and system based on syntactic structure
Wang et al. Domain Knowledge Enhanced BERT for Chinese Named Entity Recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant