CN113609244A - Structured record extraction method and device based on controllable generation - Google Patents


Info

Publication number
CN113609244A
Authority
CN
China
Prior art keywords
linear expression
record
generation
model
structured
Prior art date
Legal status
Granted
Application number
CN202110637453.5A
Other languages
Chinese (zh)
Other versions
CN113609244B (en)
Inventor
陆垚杰
林鸿宇
韩先培
孙乐
唐家龙
Current Assignee
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202110637453.5A priority Critical patent/CN113609244B/en
Publication of CN113609244A publication Critical patent/CN113609244A/en
Application granted granted Critical
Publication of CN113609244B publication Critical patent/CN113609244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a structured record extraction method and device based on controllable generation. The method automatically extracts structured records from unstructured text, and the extraction proceeds as follows: for a target text, the sequence-to-structure network first captures the text semantics of the target text with an encoder based on the self-attention mechanism, and then generates a structured representation with a decoder based on a mixed attention mechanism; a controllable decoding algorithm based on a prefix tree constrains the decoding space during generation, injects framework knowledge, and guides the model to decode and generate a linear expression; finally, the linear expression undergoes structural transformation to produce the structured record. In the model training stage, a two-stage model learning method helps the model learn efficiently: the first stage uses sub-structures for model learning and emphasizes training the text-block extraction capability; the second stage uses the complete record structure for model learning and emphasizes training the structure extraction capability.

Description

Structured record extraction method and device based on controllable generation
Technical Field
The invention relates to a structured record extraction method, in particular to a record extraction method and device based on controllable generation, and belongs to the technical field of natural language processing.
Background
Record extraction aims to automatically extract structured record information from unstructured text; such record information includes, but is not limited to, event structures, binary inter-entity relation structures, and multivariate inter-entity relation structures. Taking event record extraction as an example, given the sentence "The man returned to Los Angeles from Mexico", a record extraction system should be able to identify a "Transport" event whose trigger is "returned" and whose arguments are "The man" (Artifact), "Los Angeles" (Destination), and "Mexico" (Origin). Taking the binary relation structure between entities as an example, given the sentence "Obama was born in Honolulu.", a record extraction system should be able to recognize that "Obama" holds the relation "born in" with "Honolulu". Structured record extraction is a key task in knowledge graph construction and natural language understanding.
The difficulty of structured record extraction lies in the complexity of the structural framework and the diversity of text expression. First, the structural framework is multi-element: a single record structure consists of a record category and participants in different roles, and different records have different semantic structures. Second, text expression is diverse: a single record structure can be realized by many different textual expressions.
The traditional record extraction model mainly adopts a decoupling idea to handle the complexity of the record structure and the diversity of text expression. Taking event record extraction as an example, the traditional method usually decomposes the extraction of a complete event structure into several subtasks (event trigger word detection, entity extraction, argument structure extraction) and combines the subtask results through different combination strategies (pipeline models, joint modeling, and joint reasoning methods) to extract the complete event structure. For example, the traditional method first determines that "returned" in the sentence triggers a Transport event, then extracts "The man" as a "Person" entity, and finally determines whether "The man" is the Artifact argument of the "returned" event. The decoupling approach mainly faces two problems: fine-grained data are difficult to label, and an optimal combination strategy is difficult to design manually. First, the traditional method needs training data labeled at different granularities for different subtasks. For example, Transport trigger word detection, Person entity recognition, and Transport argument extraction each require differently labeled training corpora, so the method uses labeled data inefficiently and increases the difficulty and cost of data acquisition. Furthermore, manually designing the optimal combination structure of the different subtasks is very challenging. For example, the pipeline model often causes error propagation, while the joint model needs to heuristically predefine information sharing and decision dependencies among trigger detection, argument classification, and entity recognition, often resulting in poorly structured and inflexible event extractors.
Disclosure of Invention
Aiming at the problems that fine-grained data are difficult to label and an optimal combination strategy is difficult to design, the invention provides a record extraction method and device based on controllable generation.
The technical scheme adopted by the invention is as follows:
a structured record extraction method based on controllable generation comprises the following steps:
converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model;
in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and performing structural transformation on the generated linear expression to generate the structured record.
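The three steps above can be sketched schematically as follows. This is an illustrative reading only, not the patented code: `model_step`, `trie`, and `parse` are hypothetical stand-ins for the sequence-to-structure model, the framework prefix tree, and the structural transformation, none of which the patent specifies at the code level.

```python
def extract_records(text, model_step, trie, parse, max_len=64):
    """Greedy controllable generation: at each step intersect the model's
    scores with the prefix tree's dynamic vocabulary, then transform the
    finished linear expression into structured records."""
    generated = ["<bos>"]
    while generated[-1] != "<eos>" and len(generated) < max_len:
        allowed = trie.dynamic_vocab(generated)   # constrain the decoding step
        dist = model_step(text, generated)        # model scores for this step
        generated.append(max(allowed, key=lambda t: dist.get(t, 0.0)))
    return parse(" ".join(generated[1:-1]))       # structural transformation
```

Any objects exposing `dynamic_vocab`, a per-step score dict, and a parser can be plugged in, which is the point of the decoupled three-step design.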
Further, the sequence-to-structure based record generation model first captures the text semantics of the target text using a self-attention mechanism based encoder, and then generates a structured linear expression using a mixed attention mechanism based decoder.
Further, the record generation model based on the sequence-to-structure adopts a two-stage model learning method to perform efficient learning:
in the first stage, a linear expression of a substructure is adopted for model learning in the learning process, and the training of text block extraction capability is emphasized;
and in the second stage, model learning is carried out by adopting a linear expression of a full structure, and the training of the structure extraction capability is emphasized.
Further, the linear expression expresses a multi-level record structure, wherein: each "()" sub-bracket consists of a text type and a text-block character string and is the minimum unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in one sentence and may contain several records or none.
Furthermore, the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation; the decoding process starts at the root node < bos > of the prefix tree and ends at the leaf node < eos >, at each generation step, the dynamic vocabulary of that step being the child node of the currently generated node.
Further, the controllable decoding algorithm comprises the following steps:
automatically pruning the vocabulary based on the decoding state to generate a dynamic vocabulary: the framework prefix tree is traversed to generate the linear expression, the dynamic vocabulary of each decoding state consists of the child nodes of the node corresponding to that state in the tree, and the complete vocabulary is all legal natural-language words plus the identifiers "(){}[]";
controllably generating a linear expression using the generated dynamic vocabulary: the candidate with the highest conditional probability P(y_i | y_<i, x) in the dynamic vocabulary is selected and appended to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated linear expression result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
A structured record extraction device based on controllable generation, which adopts the above method and comprises:
the linear expression generation module is used for converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model; in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
The invention has the beneficial effects that:
1) A multi-level record structure is expressed with a linear expression, so that the model can generate a complete event structure in an autoregressive manner.
2) Records are extracted directly on the plain text sequence to generate the complete structured record, avoiding the model's dependence on fine-grained training data.
3) The generation of the record structure is controlled through a dynamic vocabulary, ensuring the integrity of the generated expression and effectively injecting framework knowledge.
4) Through the two-stage training method, the model is first pre-trained with sub-structures and then fine-tuned with the full structure, helping the model migrate knowledge from the pre-trained language model.
Drawings
FIG. 1 is a block diagram of a record extraction framework based on controlled generation.
Fig. 2 is an event linear structure expression.
Fig. 3 is an event framework prefix tree.
Fig. 4 is an event category prefix tree.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The event extraction method based on controllable generation, as shown in fig. 1, mainly includes: 1) a sequence-to-structure based record generation model; 2) a controllable decoding algorithm based on framework prefix tree restriction; 3) a two-stage model learning algorithm based on curriculum learning.
The record forms mainly include relations and events. Taking event extraction as an example, the technical scheme adopted by the invention is as follows:
First, the present invention uses a linear expression to express different record structures; the linear expression includes different layers and can be used for training and decoding of the record extractor at different stages, as shown in fig. 2. For example, "[Transport arrived (Artifact Bush)]" represents a single Transport event whose Artifact is "Bush"; "{ [Transport arrived (Artifact Bush)] }" represents that a single text contains one Transport event related to "Bush"; "{ [Transport arrived (Artifact Bush)] [Arrest-Jail arrested (Agent Police) (Person James)] }" represents that the text contains several events, one a Transport event related to "Bush" and the other an Arrest-Jail event related to "Police". Here, each "()" sub-bracket consists of a text type and a text-block character string and is the minimum unit of extraction. Each "[]" sub-bracket represents a single event record. Each "{}" sub-bracket represents the event records in a sentence and may contain several event records or none. When no event record is contained, the linear expression is "{}".
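As an illustration of the bracket conventions above (assuming records are held as simple dicts, a representation the patent does not prescribe), the linearization can be sketched as:

```python
def linearize_record(record):
    """Render one record as "[Type trigger (Role span) ...]"."""
    parts = [f"({role} {span})" for role, span in record["args"]]
    return "[" + " ".join([record["type"], record["trigger"], *parts]) + "]"

def linearize_sentence(records):
    """Render all records of a sentence; a sentence with no records is "{}"."""
    if not records:
        return "{}"
    return "{ " + " ".join(linearize_record(r) for r in records) + " }"
```

Because each level is delimited by a distinct bracket pair, the mapping between a record and its linear expression is invertible, which is what lets the structural transformation step recover full records later.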
The invention provides a record generation model based on sequence-to-structure, wherein the input of the model is a plain text sequence, and the output is a linearized event structure expression. The method comprises the following steps:
For the text sequence to be extracted x_1, ..., x_|x|, the encoder model Encoder, based on the self-attention mechanism, captures the semantic information in the target text, acquires the context information in the target text using a multi-head attention mechanism, and obtains the semantic feature representation H:
H = Encoder(x_1, ..., x_|x|);
where x_i (i = 1, ..., |x|) denotes the i-th word in the text sequence to be extracted and |x| denotes the length of the text sequence.
Given the semantic feature representation H of the input sequence, the decoder Decoder, based on the mixed attention mechanism, generates the event record structure in an autoregressive manner. The decoder model starts from the initial state "<bos>" and generates until it ends with "<eos>". The conditional probability of the whole generated sequence is the product of the conditional probabilities at each step:
P(y | x) = ∏_{i=1}^{|y|} P(y_i | y_<i, x);
where x denotes the text sequence to be extracted, such as "The man returned to Los Angeles from Mexico", y denotes the linear expression obtained by extraction (as shown in fig. 2), i denotes the i-th position of the generated linear expression, |y| denotes the length of the linear expression, y_i denotes the symbol at the i-th position of the linear expression, y_<i denotes the already generated linear expression result, and P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the generated result y_<i.
At the i-th generation step, the decoder performs a Softmax normalization over the vocabulary V to compute the single-step generation probability P(y_i | y_<i, x). Based on the semantic feature representation H of the target text, the decoder states h^d_{<i} of the preceding steps, and the output y_{i-1} of the previous step, the Decoder predicts the output of the current step:
y_i, h^d_i = Decoder([H; h^d_{<i}], y_{i-1});
where the Decoder applies a self-attention mechanism over its decoder states h^d and a cross-attention mechanism over the semantic feature representation H of the target text, i.e., the Decoder adopts a mixed attention mechanism. The self-attention mechanism refers to attention inside the Decoder, i.e., attention over the already generated expression states during decoding; the cross-attention mechanism refers to the Decoder's attention over the semantic feature representation H of the target text produced by the Encoder.
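The autoregressive factorization above can be sketched in a few lines; the per-step distribution `step_prob` is a hypothetical stand-in for the decoder's Softmax output, which the patent does not specify at the code level:

```python
import math

def sequence_log_prob(x_tokens, y_tokens, step_prob):
    """log P(y|x) = sum_i log P(y_i | y_<i, x), accumulated step by step.

    `step_prob(x_tokens, prefix)` must return a dict mapping each candidate
    next symbol to its probability, standing in for the real decoder step.
    """
    total = 0.0
    for i, y_i in enumerate(y_tokens):
        dist = step_prob(x_tokens, y_tokens[:i])  # P(. | y_<i, x)
        total += math.log(dist[y_i])
    return total
```

With a real model, `step_prob` would run one decoder step under the mixed attention mechanism; here any callable with the same shape works.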
In the model decoding process, in order to effectively inject framework knowledge and ensure the integrity of the generated structure, the invention provides a controllable decoding algorithm based on framework prefix tree restriction for controllable generation. (The framework refers to the specific structure of a record; framework knowledge refers to the role types of the specific participants in the record structure; for example, an arrest event should contain a number of different role types such as "Time", "Place", "Person" and "Agent".) The decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby realizing controllable generation. The complete decoding process can be regarded as a search over the framework prefix tree. Specifically, the generation process of the record structure can be divided into three different decoding states:
1. Record framework: names of the event category (T) and the role category (R) are generated, for example "Transport" and "Artifact" in the above example.
2. Record mention: the character strings (S) of event triggers and event arguments are generated, for example "arrived" and "Bush" in the above example.
3. Structure identifier: the structure identifiers "{", "}", "[", "]", "(" and ")" in the event structure linear expression are generated, which combine the record framework and the record mentions.
The controllable decoding algorithm based on the framework prefix tree limitation specifically comprises the following steps:
1) based on the decoding state, the vocabulary is automatically pruned to generate a dynamic vocabulary, and the specific method comprises the following steps:
The framework prefix tree is traversed to generate the linear expression, and the dynamic vocabulary of each decoding state consists of the child nodes of the node corresponding to that state in the tree. The complete vocabulary is all legal natural-language words plus the identifiers "(){}[]". For example, as shown in fig. 3, for the generated result "<bos>{", the child nodes of the corresponding node are "[" and "}", i.e., the dynamic vocabulary is {"[", "}"}. For example, as shown in fig. 4, when the generated result is "<bos>{[", the dynamic vocabulary of that state is the set of event categories T.
2) And controllably generating a linear expression by utilizing the generated dynamic vocabulary, wherein the specific method comprises the following steps:
The candidate word with the highest conditional probability P(y_i | y_<i, x) in the dynamic vocabulary is selected and appended to the end of the generated linear expression. For example, in fig. 3, for the generated string "<bos>{", the legal candidate words are "[" and "}", with probabilities P_a and P_b respectively. When P_a ≥ P_b, "<bos>{[" is generated; otherwise "<bos>{}" is generated.
The decoding process starts from the root node <bos> of the prefix tree and ends at the leaf node <eos>. At each generation step i, the dynamic vocabulary of that step consists of the child nodes of the currently generated node. As shown in fig. 3, the candidate vocabulary of the <bos> node is {"{"}. Event category generation, event role generation, and event mention generation then search the corresponding subtrees, as shown in fig. 4.
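The pruning procedure above can be sketched with a plain dictionary trie. This is an illustrative reading of the algorithm, not the patented implementation:

```python
class PrefixTrie:
    """Prefix tree over legal expression prefixes; the children of the node
    reached by the generated prefix form the dynamic vocabulary."""

    def __init__(self):
        self.root = {}

    def add(self, tokens):
        """Insert one legal token sequence, e.g. ["<bos>", "{", "}", "<eos>"]."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def dynamic_vocab(self, generated):
        """Tokens allowed after `generated` (empty set if the prefix is illegal)."""
        node = self.root
        for tok in generated:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node)

def constrained_step(trie, generated, dist):
    """Pick the allowed candidate with the highest model probability."""
    allowed = trie.dynamic_vocab(generated)
    return max(allowed, key=lambda t: dist.get(t, 0.0))
```

For the generated prefix `<bos>{`, the dynamic vocabulary contains only "[" and "}", mirroring the fig. 3 example.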
After the generation of the event record linear expression is finished, the linear expression is parsed into a tree, and the tree is finally converted into the event records.
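The parsing step can be illustrated with a simplified reader of the bracket notation. It assumes role spans contain no brackets, and is only a sketch of the structural transformation, not the patented code:

```python
import re

def parse_linear_expression(expr):
    """Turn "{ [Type trigger (Role span) ...] ... }" into a list of records."""
    records = []
    for rec in re.findall(r"\[([^\]]*)\]", expr):      # one "[...]" per record
        head, *arg_parts = re.split(r"\s*\(", rec)
        rtype, trigger = head.split()
        args = []
        for part in arg_parts:                         # each "(Role span)" block
            role, span = part.rstrip(") ").split(maxsplit=1)
            args.append((role, span))
        records.append({"type": rtype, "trigger": trigger, "args": args})
    return records
```

Applied to the test example's expression, this yields a Transport record with ("Origin", "Mexico") among its arguments; "{}" yields no records.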
In the model training phase, the invention uses a pre-trained language model for sequence-to-text generation as the initialization model. Unlike the common text-to-text generation task, the sequence-to-structure framework adopted by the invention faces problems such as the gap between the output linear expression and ordinary text expression and the high proportion of structure identifiers that carry no semantics, which increase the difficulty of model learning and make it hard to use the knowledge in the pre-trained model effectively. To solve these problems, the invention designs a two-stage model learning algorithm based on curriculum learning to train the model and help it migrate knowledge from the pre-trained language model, realizing efficient training. The training algorithm comprises the following steps:
The first stage employs a sub-structure extraction training strategy that aims to migrate the "sequence-to-sequence" model (i.e., the initialization model) to a "sequence-to-substructure" model. In this stage the model mainly learns the text-block extraction capability and does not yet learn the structure extraction capability, so linear expressions of sub-structure events are used for training. A sub-structure contains no hierarchical event structure and is centered on text blocks. For example, the sub-structure linear expression "{ (Transport arrived) (Artifact Bush) }" contains no hierarchy, only flat text-block sub-brackets.
The second stage employs a full-structure extraction training strategy that migrates the "sequence-to-substructure" model to a "sequence-to-structure" model. In this stage, linear expressions of the full structure are used for training, and the model mainly learns the structure extraction capability.
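The relation between the two stages' training targets can be illustrated by deriving a sub-structure expression from a full-structure one. This is a hedged sketch under the bracket conventions described earlier:

```python
import re

def to_substructure(full_expr):
    """Flatten "{ [Type trig (Role span) ...] }" into "{ (Type trig) (Role span) ... }",
    dropping the "[]" hierarchy while keeping every text block."""
    blocks = []
    for rec in re.findall(r"\[([^\]]*)\]", full_expr):
        m = re.match(r"\s*(\S+\s+\S+)\s*(.*)", rec)   # "Type trigger", then args
        head, args = m.group(1), m.group(2).strip()
        blocks.append(f"({head})" + (" " + args if args else ""))
    return "{ " + " ".join(blocks) + " }" if blocks else "{}"
```

Deriving the first-stage target mechanically from the second-stage target means only one set of full-structure annotations is needed for both curriculum stages.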
In summary, the key steps of the structured record extraction method based on controllable generation of the present invention include:
1) record structure linear expression: a linear expression is used to express the record structure, and the linear expression can be converted to and from the complete record structure;
2) based on this representation, a sequence-to-structure based record generation model converts the plain text sequence into a structured linear expression;
3) based on this representation and the generation model, a controllable decoding algorithm based on framework prefix tree restriction constrains the generation process, generating a dynamic vocabulary by automatically pruning the vocabulary, thereby ensuring the integrity of the generated record structure and injecting framework knowledge;
4) for the training process of the generation model, a two-stage model learning algorithm based on curriculum learning divides the learning of the extraction model into a sub-structure learning stage and a full-structure learning stage.
In step 1), a multi-level record structure is expressed with a linear expression, so that the model can generate a complete event structure in an autoregressive manner.
In step 2), records are extracted directly on the plain text sequence to generate the complete structured record, avoiding the model's dependence on fine-grained training data.
In step 3), the generation of the record structure is controlled through a dynamic vocabulary, ensuring the integrity of the generated expression and effectively injecting framework knowledge.
In step 4), through the two-stage training method, the model is first pre-trained with sub-structures and then fine-tuned with the full structure, which helps the model migrate knowledge from the pre-trained language model.
An example of the above method is as follows:
This embodiment takes as an example extracting the "Transport" event triggered by "returned" and the "Arrest-Jail" event triggered by "capture" from the sentence "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters."
Scene:
Training corpus:
training example 1: "in" round involved in Saint Petersburg "triggers a Transport event, the Destination and subject are" Saint Petersburg "and" bump ", respectively, a full structure event linear expression is" { [ Transport involved (Destination) Petersburg) ] } ", and a sub structure event linear expression is" { (Transport involved (Destination) Petersburg) } ".
Training example 2: "arrested" in "Police arrested James in Palm Springs on Friday" triggers an Arrest-Jail event, whose Agent, Person, Time and Place are "Police", "James", "Friday" and "Palm Springs" respectively; the full-structure event linear expression is "{ [Arrest-Jail arrested (Agent Police) (Person James) (Place Palm Springs) (Time Friday)] }", and the sub-structure event linear expression is "{ (Arrest-Jail arrested) (Agent Police) (Person James) (Place Palm Springs) (Time Friday) }".
Test corpus:
Test example 1: "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters."
The implementation is as follows:
(I) The invention uses the text-to-event-record training corpus and constructs the event record extractor with a sequence-to-structure neural network model and a curriculum learning algorithm. The model is first pre-trained with the event sub-structure extraction task, and the complete model is then fine-tuned with full-structure linear expressions to obtain the trained event extraction model.
(II) The invention uses the sequence-to-structure neural network model and the controllable decoding algorithm based on framework prefix tree restriction to generate the event record linear expression. For example, for test example 1, the model generates a linear expression containing two events (as shown in fig. 2): "{ [Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico)] [Arrest-Jail capture (Person The man) (Agent bounty hunters) (Time Tuesday)] }".
(III) Finally, the linear expression generated in step (II) is converted into two event records by structural transformation: { Type: Transport, Trigger: returned, Arg1 Role: Artifact, Arg1: The man, Arg2 Role: Destination, Arg2: Los Angeles, Arg3 Role: Origin, Arg3: Mexico }; { Type: Arrest-Jail, Trigger: capture, Arg1 Role: Person, Arg1: The man, Arg2 Role: Agent, Arg2: bounty hunters, Arg3 Role: Time, Arg3: Tuesday }. In this example, the controllable decoding algorithm based on framework prefix tree restriction guarantees the integrity of the generated structure and the validity of the event framework.
Based on the same inventive concept, another embodiment of the present invention provides a structured record extraction device based on controllable generation, which adopts the above method, and comprises:
the linear expression generation module is used for converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model; in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (a computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
Other embodiments of the invention:
1) The invention is equally applicable to the tasks of semantic role labeling and relation extraction, i.e., binary and multivariate relations are converted into linear expressions for extraction; for example, the binary relation in "Obama was born in Honolulu" can be expressed as "{ Born_in [Arg1 Obama] [Arg2 Honolulu] }".
2) The present invention may represent records with a variety of linear expressions, including but not limited to swapping the positions of frameworks and mentions, replacing the identifiers "(){}[]" with different ones, and so on. For example, test example 1 can be expressed as "{ [returned Transport (The man Artifact) (Los Angeles Destination) (Mexico Origin)] [capture Arrest-Jail (The man Person) (bounty hunters Agent) (Tuesday Time)] }".
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A structured record extraction method based on controllable generation is characterized by comprising the following steps:
converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model;
in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and performing structural transformation on the generated linear expression to generate the structured record.
2. The method of claim 1, wherein the sequence-to-structure based record generation model first captures the text semantics of a target text using a self-attention mechanism based encoder and then generates a structured linear expression using a mixed attention mechanism based decoder.
3. The method of claim 1, wherein the sequence-to-structure based record generation model employs a two-stage model learning approach for efficient learning:
in the first stage, model learning is performed with linear expressions of substructures, emphasizing training of the text block extraction capability;
in the second stage, model learning is performed with linear expressions of the full structure, emphasizing training of the structure extraction capability.
4. The method of claim 1, wherein the linear expression expresses a multi-level record structure, wherein: each "()" sub-bracket consists of a text type and a text block character string and is the minimum unit of extraction; each "[]" sub-bracket represents a single record; each "{}" sub-bracket represents the records in one sentence, which may contain multiple records or no record at all.
5. The method of claim 4, wherein the controllable decoding algorithm automatically prunes the vocabulary based on the decoding state to generate a dynamic vocabulary, thereby achieving controllable generation; the decoding process starts at the root node <bos> of the prefix tree and ends at the leaf node <eos>; at each generation step, the dynamic vocabulary of that step consists of the children of the currently generated node.
6. The method of claim 5, wherein the decoding state comprises:
record framework: the names of the event categories and role categories to be generated;
record mention: the character strings of the event trigger words and event arguments to be generated;
structure identifier: the structure identifiers in the event-structure linear expression, used to combine the record framework and the record mentions.
7. The method of claim 5, wherein the controllable decoding algorithm comprises the steps of:
automatically pruning the vocabulary based on the decoding state to generate a dynamic vocabulary: the linear expression is generated by traversing the frame prefix tree, the dynamic vocabulary of each decoding state is the set of children of the node corresponding to that state in the tree, and the complete vocabulary consists of all legal natural language words and the identifiers "(){}[]";
controllably generating a linear expression using the generated dynamic vocabulary: selecting from the dynamic vocabulary the candidate with the highest conditional probability P(y_i | y_<i, x) and appending it to the end of the generated linear expression, where P(y_i | y_<i, x) denotes the conditional probability of continuing to generate y_i given the text sequence x to be extracted and the already generated result y_<i, and y_i denotes the symbol at the i-th position of the linear expression.
8. A structured record extraction device based on controllable generation and adopting the method of any one of claims 1 to 7, characterized by comprising:
the linear expression generation module is used for converting the plain text sequence into a structured linear expression by using a sequence-to-structure-based record generation model; in the process of converting the plain text sequence into the structured linear expression, using a controllable decoding algorithm based on frame prefix tree limitation to constrain the generation process of the linear expression;
and the structured record generation module is used for carrying out structural transformation on the generated linear expression to generate a structured record.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
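The prefix-tree-constrained decoding of claims 5 to 7 can be sketched as follows. This is a hypothetical minimal illustration rather than the patented implementation: the `score_fn` argument stands in for the model's conditional probability P(y_i | y_<i, x), and greedy selection replaces whatever search strategy the actual decoder uses.

```python
class TrieNode:
    """One node of the frame prefix tree; children map tokens to nodes."""
    def __init__(self):
        self.children = {}

def build_prefix_tree(sequences):
    """Build the prefix tree from all legal linear-expression token sequences."""
    root = TrieNode()
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode())
    return root

def constrained_decode(score_fn, root, max_len=64):
    """Greedy decoding restricted to the prefix tree: at each step the
    dynamic vocabulary is the set of children of the current node, and the
    child maximizing score_fn(prefix, token) is appended."""
    node, output = root, []
    while node.children and len(output) < max_len:
        allowed = list(node.children)        # dynamic vocabulary of this step
        tok = max(allowed, key=lambda t: score_fn(output, t))
        output.append(tok)
        node = node.children[tok]            # descend toward a leaf
    return output

# Toy tree of two legal expressions; a real tree would enumerate all
# event types, roles, and identifier positions of the schema.
root = build_prefix_tree([["{", "[", "Transport", "]", "}"], ["{", "}"]])
out = constrained_decode(lambda prefix, tok: len(tok), root)
```

Because every step only ever scores tokens drawn from the children of the current node, the decoder cannot emit an ill-formed expression, which is the "controllable generation" property the claims describe.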
CN202110637453.5A 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation Active CN113609244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110637453.5A CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Publications (2)

Publication Number Publication Date
CN113609244A true CN113609244A (en) 2021-11-05
CN113609244B CN113609244B (en) 2023-09-05

Family

ID=78303478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110637453.5A Active CN113609244B (en) 2021-06-08 2021-06-08 Structured record extraction method and device based on controllable generation

Country Status (1)

Country Link
CN (1) CN113609244B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040205598A1 (en) * 1998-12-18 2004-10-14 Toru Takahashi Method and system for management of structured document and medium having processing program therefor
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
KR20130023563A (en) * 2011-08-29 2013-03-08 두산동아 주식회사 Apparatus and method for learning from text structure
CN103886080A (en) * 2014-03-25 2014-06-25 中国科学院地理科学与资源研究所 Method for extracting road traffic information from Internet unstructured text
CN108304911A (en) * 2018-01-09 2018-07-20 中国科学院自动化研究所 Knowledge Extraction Method and system based on Memory Neural Networks and equipment
CN109446513A (en) * 2018-09-18 2019-03-08 中国电子科技集团公司第二十八研究所 The abstracting method of event in a kind of text based on natural language understanding
CN111078825A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Structured processing method, structured processing device, computer equipment and medium
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN112487109A (en) * 2020-12-01 2021-03-12 朱胜青 Entity relationship extraction method, terminal and computer readable storage medium
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TONGLIANG LI et al.: "AnaSearch: Extract, Retrieve and Visualize Structured Results from Unstructured Text for Analytical Queries", WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining *
YANG Bing: "A Structured Information Extraction Method for Medical Text Data", Journal of Chinese Computer Systems *

Also Published As

Publication number Publication date
CN113609244B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN109492113B (en) Entity and relation combined extraction method for software defect knowledge
CN110334339B (en) Sequence labeling model and labeling method based on position perception self-attention mechanism
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN116820429B (en) Training method and device of code processing model, electronic equipment and storage medium
CN114168749A (en) Question generation system based on knowledge graph and question word drive
CN112749562A (en) Named entity identification method, device, storage medium and electronic equipment
CN113553850A (en) Entity relation extraction method based on ordered structure encoding pointer network decoding
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN115906815B (en) Error correction method and device for modifying one or more types of error sentences
CN111813913A (en) Two-stage problem generation system with problem as guide
CN115688879A (en) Intelligent customer service voice processing system and method based on knowledge graph
CN114238652A (en) Industrial fault knowledge map establishing method for end-to-end scene
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN115115984A (en) Video data processing method, apparatus, program product, computer device, and medium
JP7466784B2 (en) Training Neural Networks Using Graph-Based Temporal Classification
CN113609244A (en) Structured record extraction method and device based on controllable generation
CN112131879A (en) Relationship extraction system, method and device
CN113449517B (en) Entity relationship extraction method based on BERT gated multi-window attention network model
CN115455937A (en) Negative analysis method based on syntactic structure and comparative learning
Liu Task-Oriented Explainable Semantic Communication Based on Semantic Triplets
CN114638238A (en) Training method and device of neural network model
CN110390010A (en) A kind of Method for Automatic Text Summarization
Hu Research on Named Entity Recognition Technology based on pre-trained model
Jiang et al. Automatic Question Answering Method Based on IMGRU-Seq2seq
Maqsood Evaluating NewsQA Dataset With ALBERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant