WO2023274059A1 - Method for training an alternating sequence generation model and method for extracting a graph from text - Google Patents

Method for training an alternating sequence generation model and method for extracting a graph from text

Info

Publication number
WO2023274059A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
information
text
graph
sequence
Prior art date
Application number
PCT/CN2022/101089
Other languages
English (en)
French (fr)
Inventor
任立椋
Original Assignee
任立椋
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 任立椋 filed Critical 任立椋
Publication of WO2023274059A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • the invention relates to the technical field of information processing, in particular to a method for training an alternating sequence generation model, a method for extracting a graph from a text, an electronic device, and a computer-readable storage medium.
  • because the pairwise scoring method must traverse all possible text pairs, it has high time complexity, while the method using a multi-dimensional recurrent neural network must store hidden representations of the entire graph adjacency table, so it has high space complexity. Representing graph nodes as concrete words or text prevents the node classifier from accurately estimating the probability distribution of uncommon/unseen words, so those words cannot be accurately extracted as graph nodes, which also affects the overall accuracy and precision of graph extraction.
  • the pairwise scoring method treats each edge as an independent element and classifies each one separately, which ignores the dependencies between edges.
  • the triple-sequence generation method classifies edges and nodes independently when generating triples, which ignores the dependencies between edges and nodes. Ignoring these dependencies degrades the overall accuracy and precision of graph extraction.
  • Embodiments of the present invention provide a method for training an alternating sequence generation model, a method for extracting a graph from a text, an electronic device, and a computer-readable storage medium, which are used to solve at least one of the above-mentioned technical problems.
  • an embodiment of the present invention provides a method for training an alternating sequence generation model, including:
  • the training sample pair includes a paired training text and a training information graph, and the training information graph includes a plurality of nodes and at least one edge connecting two nodes among the plurality of nodes;
  • An alternating sequence generation model is trained based on the training text and the training alternating sequence.
  • generating a training alternating sequence containing node information and edge information from the training information graph includes: traversing the training information graph with a preset traversal algorithm to generate a training alternating sequence containing node information and edge information.
  • the training alternating sequence consists of node information and edge information alternating with each other.
  • the node information includes node type information
  • the edge information includes actual edge type information and virtual edge type information.
  • the training information graph includes spans, which serve as addresses of input text segments, and types, which represent abstract concepts, where a type may be a span of length 1 over the vocabulary of node type information, actual edge type information, and virtual edge type information.
  • training the alternating sequence generation model from the training text and the training alternating sequence includes: processing the output distribution of the alternating sequence generation model with an alternation mask to obtain an alternating sequence of node information and edge information alternating with each other.
  • an embodiment of the present invention provides a method for extracting a graph from text, including:
  • a target information graph is generated from the target alternating sequence.
  • the generating of the target information graph from the target alternating sequence includes:
  • the target alternating sequence is processed according to the preset traversal algorithm adopted for training the alternating sequence generation model, to generate a target information graph.
  • the embodiment of the present invention provides a storage medium storing one or more programs containing execution instructions that can be read and executed by an electronic device (including but not limited to a computer, server, or network device) in order to perform any of the above methods for extracting a graph from text of the present invention.
  • an electronic device, which includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor; the instructions are executed by the at least one processor so that the at least one processor can perform any of the above methods for extracting a graph from text of the present invention.
  • an embodiment of the present invention further provides a computer program product; the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above methods for extracting a graph from text.
  • when the information graph is extracted from text by the model, the graph is not modeled directly; instead, the problem of extracting a graph from text is transformed into the problem of extracting an alternating sequence from text, so that the alternating sequence generation model obtained by the method of this embodiment has only linear time and space complexity when used for graph extraction, and time and space efficiency are significantly improved.
  • Fig. 1 is a flowchart of an embodiment of the alternating sequence generation model training method of the present invention;
  • Fig. 2 is a flowchart of an embodiment of the method for extracting a graph from text of the present invention;
  • Fig. 3 is a schematic diagram of an embodiment of an alternating sequence of an information multigraph of the present invention;
  • Fig. 4 is a schematic diagram of an embodiment of the encoder architecture of the present invention;
  • Fig. 5 is a schematic diagram of an embodiment of an alternating sequence of a knowledge graph from the ACE05 dataset of the present invention;
  • Fig. 6 is a schematic diagram of an embodiment of the hybrid span decoder of the present invention;
  • Fig. 7 is a schematic diagram of BFS traversal embeddings of alternating sequences of the present invention;
  • Fig. 8 is a schematic diagram of the distribution of remaining errors on the ACE05 test set of the present invention;
  • Fig. 9 is a schematic structural diagram of the Transformer with hybrid attention layers of the present invention;
  • Fig. 10 is a schematic structural diagram of an embodiment of the electronic device of the present invention.
  • the invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • an embodiment of the present invention provides a method for training an alternating sequence generation model, including:
  • the training sample pair includes a paired training text and a training information graph; the training information graph includes a plurality of nodes and at least one edge connecting two nodes among the plurality of nodes.
  • the training information graph can be viewed as a heterogeneous multigraph (Li et al., 2014; Shi et al., 2017)
  • G = (V, E)
  • V is a set of nodes (usually representing spans (t_s, t_e) in the input document)
  • E is a multiset of edges, with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R.
  • node types and edge types are drawn from a finite vocabulary. Node types can be used to represent entity types (PER, ORG, etc.), while edge types can represent relations between nodes (PHYS, ORG-AFF, etc.).
  • a mapping s^π = f_s(G, π) from G to the sequence space S^π is constructed.
  • f_s depends on the (given) ordering π of nodes and their edges in G, constructed by graph traversal algorithms such as breadth-first search (BFS) or depth-first search (DFS) and an internal ordering of node and edge types.
  • the node information includes node type information
  • the edge information includes actual edge type information and virtual edge type information.
  • the training alternating sequence consists of node information and edge information alternating with each other.
  • the sequence s^π = s_0^π, ..., s_n^π has an alternating structure, where s_0^π, s_2^π, s_4^π, ... represent nodes in V and s_1^π, s_3^π, s_5^π, ... represent actual or virtual edges.
  • this application exploits the fact that BFS visits nodes layer by layer (i.e., in the order p_i, c_i1, ..., c_ik, p_j, where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be one of the children of p_i), and turns the traversal into a sequence of the form p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...,
  • this application uses a special edge type [SEP] to describe the hierarchy in the graph.
  • the name of the specific special edge type can be arbitrary, including but not limited to [SEP] mentioned in this paper.
  • in the case of DFS, the [SEP] type appears immediately after the leaf nodes. This representation allows the original information graph to be recovered unambiguously, provided it is known which graph traversal algorithm (BFS or DFS) the alternating sequence is based on.
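As an illustration of the construction described above, the following is a minimal Python sketch of a BFS-based graph-to-sequence conversion in the spirit of Algorithm 1. The function name, the toy graph, and the use of plain strings for nodes and edge types are illustrative assumptions for the sketch; the patent's actual implementation operates on span and type indices.

```python
from collections import deque

def to_alternating_sequence(edges, start):
    """Sketch: convert a small directed multigraph into an alternating node/edge
    sequence by BFS, appending the virtual edge type "[SEP]" at the end of every
    BFS level. `edges` maps a parent to a list of (edge_type, child) pairs."""
    seq, visited, queue = [], {start}, deque([start])
    while queue:
        parent = queue.popleft()
        seq.append(parent)
        for edge_type, child in sorted(edges.get(parent, [])):
            seq.extend([edge_type, child])
            if child not in visited:
                visited.add(child)
                queue.append(child)
        seq.append("[SEP]")          # marks the end of this BFS level
    return ["[SOS]"] + seq + ["[EOS]"]

# Toy example: A --1--> B, A --2--> C, B --3--> D
edges = {"A": [("1", "B"), ("2", "C")], "B": [("3", "D")]}
print(to_alternating_sequence(edges, "A"))
# ['[SOS]', 'A', '1', 'B', '2', 'C', '[SEP]', 'B', '3', 'D', '[SEP]',
#  'C', '[SEP]', 'D', '[SEP]', '[EOS]']
```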
  • when the information graph is extracted from text by the model, the graph is not modeled directly; instead, the problem of extracting a graph from text is transformed into the problem of extracting an alternating sequence from text, so that the alternating sequence generation model obtained by the method of this embodiment has only linear time and space complexity when used for graph extraction, and time and space efficiency are significantly improved.
  • step S12, generating a training alternating sequence containing node information and edge information from the training information graph, includes: traversing the training information graph with a preset traversal algorithm to generate a training alternating sequence containing node information and edge information.
  • the preset traversal algorithm may be a breadth-first search (BFS) algorithm or a depth-first search (DFS) algorithm, which is not limited in this application.
  • the training information graph includes spans, which serve as addresses of input text segments, and types, which represent abstract concepts, where a type may be a span of length 1 over the vocabulary of node type information, actual edge type information, and virtual edge type information.
  • training the alternating sequence generation model from the training text and the training alternating sequence includes: processing the output distribution of the alternating sequence generation model with an alternation mask to obtain an alternating sequence of node information and edge information alternating with each other.
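To make the alternation mask concrete, here is a hedged sketch of how an output logit distribution could be masked so that node and edge predictions strictly alternate. The split of the output indices into an edge-type block followed by a node/span block is an assumed simplification for the sketch; the patent's actual index layout places all type entries before the text-span entries.

```python
import torch

def apply_alternation_mask(logits: torch.Tensor, step: int, n_edge_types: int) -> torch.Tensor:
    """Add -inf to the logits of the block that is not allowed at this step:
    even steps must emit nodes, odd steps must emit (virtual) edges."""
    mask = torch.zeros_like(logits)
    if step % 2 == 0:                      # node step: forbid edge types
        mask[..., :n_edge_types] = float("-inf")
    else:                                  # edge step: forbid node/span indices
        mask[..., n_edge_types:] = float("-inf")
    return logits + mask

logits = torch.randn(1, 50)                # e.g., 10 edge-type scores + 40 span scores
node_step = apply_alternation_mask(logits, step=0, n_edge_types=10)
assert torch.isinf(node_step[0, :10]).all()
```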
  • the alternating sequence generation model is a neural decoder that is forced to generate alternating sequences by decoding spans and types in a hybrid manner.
  • for each decoding step, the decoder of the present application has only linear space and time complexity with respect to the input sequence length, and, owing to its nature as a sequential decision process, it can capture the interdependencies between referents and types.
  • when the information graph is extracted from text by the model, the graph is not modeled directly; instead, the problem of extracting a graph from text is transformed into the problem of extracting an alternating sequence from text, so that graph extraction from text has only linear time and space complexity, and time and space efficiency are significantly improved.
  • the generating of the target information graph from the target alternating sequence includes:
  • the target alternating sequence is processed according to the preset traversal algorithm used when training the alternating sequence generation model, to generate a target information graph.
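The reverse direction can be sketched as well: given a BFS-based alternating sequence with [SEP] level separators, the (parent, edge type, child) triples of the graph can be read back off. This toy reconstruction works over string tokens; the patent's recovery additionally maps span indices back to text spans via the inverse mapping g_t.

```python
def sequence_to_edges(seq):
    """Sketch: rebuild (parent, edge_type, child) triples from a BFS-based
    alternating sequence such as the one produced by the sketch above."""
    seq = [tok for tok in seq if tok not in {"[SOS]", "[EOS]"}]
    edges, i = [], 0
    while i < len(seq):
        parent = seq[i]
        i += 1
        while i < len(seq) and seq[i] != "[SEP]":
            edges.append((parent, seq[i], seq[i + 1]))   # (parent, edge type, child)
            i += 2
        i += 1                                           # skip the [SEP] closing this level
    return edges

print(sequence_to_edges(
    ["A", "1", "B", "2", "C", "[SEP]", "B", "3", "D", "[SEP]", "C", "[SEP]", "D", "[SEP]"]))
# [('A', '1', 'B'), ('A', '2', 'C'), ('B', '3', 'D')]
```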
  • Text-to-Graph extraction aims to automatically extract information graphs consisting of references (or entities) and types from natural language texts.
  • Existing methods, such as table filling and pairwise scoring, show impressive performance on various information extraction tasks, but due to their second-order space/time complexity with respect to the input length, they are difficult to scale to datasets with longer input texts.
  • this application proposes a Hybrid SPan Generator (HySPA) that maps an information graph to an alternating sequence of nodes and edge types, and directly generates such a sequence through a hybrid span decoder, which can recurrently decode spans and types in linear time and space complexity.
  • HySPA Hybrid SPan Generator
  • IE Information Extraction
  • NER Named Entity Recognition
  • RE Relation Extraction
  • Another approach is to treat the joint information extraction task as a table filling problem (Zhang et al., 2017; Wang and Lu, 2020), and use a multidimensional recurrent neural network to generate a 2D table (Graves et al., 2007). This can capture the interrelationships between entities and relations, but the space complexity grows quadratically with the length of the input text, making this approach impractical for long sequences.
  • Seq2RDF Liu et al., 2018
  • IMoJIE Kolluru et al., 2020
  • this application proposes a first-order method to reversibly map a target graph to an alternating sequence of nodes and edges, and applies a hybrid span generator that directly learns to generate such an alternating sequence.
  • the main contributions of this application are threefold:
  • This application proposes a general technique to do reversible mappings between information graphs and alternating sequences (assuming a given graph traversal algorithm). Generating alternating sequences is equivalent to generating raw information graphs.
  • the present application proposes a new neural decoder that is forced to generate alternating sequences only by decoding spans and types in a hybrid manner.
  • our decoder has only linear space and time complexity with respect to the length of the input sequence, and due to its nature as a sequential decision process, it can capture the interdependencies between referents and types .
  • An information graph can be viewed as a heterogeneous multigraph (Li et al., 2014; Shi et al., 2017)
  • G = (V, E), where V is a set of nodes (usually representing spans (t_s, t_e) in the input document), and E is a multiset of edges with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R.
  • node and edge types are drawn from a finite vocabulary. Node types can be used, for example, to represent entity types (PER, ORG, etc.), while edge types can represent relationships between nodes (PHYS, ORG-AFF, etc.). In this work, we denote node types as separate nodes connected to their corresponding nodes v by special edge types.
  • f_s depends on the (given) ordering π of nodes and their edges in G, constructed by graph traversal algorithms such as breadth-first search (BFS) or depth-first search (DFS) and an internal ordering of node and edge types.
  • this application exploits the fact that BFS visits nodes layer by layer, i.e., in the order p_i, c_i1, ..., c_ik, p_j (where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be one of the children of p_i), and turns s^π into a sequence of the form p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...,
  • this application uses a special edge type [SEP] to describe the hierarchy in the graph.
  • This representation allows the applicant to unambiguously recover the original graph if the applicant knows which type of graph traversal (BFS or DFS) to assume.
  • Algorithm 1 (used in this application to convert graphs in the training data into sequences) shows how an alternating sequence can be constructed from a given graph using BFS traversal.
  • Figure 3 shows an alternating sequence of an information multigraph.
  • the length |s^π| is linearly bounded by the size of the graph: O(|s^π|) = O(|V| + |E|), which is also the complexity of typical graph traversal algorithms such as BFS/DFS.
  • FIG. 3 This application represents a directed multigraph as an alternating sequence of nodes (A, B, C, D, E) and edges (1, 2, 3, 4, [S]).
  • the graph is traversed by breadth-first search (BFS) in ascending order of node and edge types.
  • "[S]" or [SEP] are virtual edge types that mark the end of each BFS level.
  • Node and edge representation: the node and edge representation in this application (explained below) relies on the observation that there are only two kinds of objects in an information graph: spans (as addresses of input text segments) and types (as representations of abstract concepts). Since the application can treat types as special spans of length 1 over the vocabulary of all types (Q ∪ R ∪ U), it only needs O(nm + |Q ∪ R ∪ U|) indices to unambiguously represent spans and types over the concatenation of the type vocabulary and the input text, where n is the maximum input length and m is the maximum span length, m << n.
  • these indices can be reversibly mapped back to types or text spans based on their magnitude (the details of this mapping are explained in Section 3.2).
  • the task of generating an information graph is thus transformed into generating an alternating sequence of mixed spans.
  • the application will discuss in the following sections how to force the sequence generator h to generate sequences only in the space S^π, since the application does not want h to assign non-zero probability to arbitrary sequences without corresponding graphs.
  • HySPA: hybrid span generation for alternating sequences
  • the HySPA model takes as input a piece of text (e.g., a sentence or a paragraph) and predefined node and edge types, and outputs an alternating sequence representation of an information graph. This application forces the alternating generation of this sequence by applying an alternation mask to the output probabilities.
  • the detailed architecture is described in the following subsections.
  • the colored tables on the right represent the meta-type assignments for the different chunks of concatenated word vectors from H_0.
  • the application arranges the type list v as the concatenation of the label names of the edge types, virtual edge types, and node types, where l_p = |R| + |U| + |Q| is the total number of types, W_0 ∈ R^{d_e×d_m} is the weight matrix of the linear projection layer, and d_m is the hidden size of the application's model.
  • This embedding pathway is also used to embed words in the input text x.
  • this application represents each word as the contextual embedding of its first sub-token from the pre-trained language model (LM, e.g., BERT (Devlin et al., 2018)), and fine-tunes the language model in an end-to-end fashion.
  • since H_0 is the concatenation of word vectors from four different kinds of tokens (i.e., edge types, virtual edge types, node types, and text),
  • a meta-type embedding is applied to indicate this type difference between the blocks of vectors in the representation H_0 (as shown in Figure 4).
  • the final context representation H is obtained by element-wise addition of the meta-type embedding and H_0,
  • where l_h = l_p + |x| is the height of the hybrid representation matrix H of the present application.
  • this application transforms a span t into an index k in the representation H via the mapping g_k, with k ≥ l_p, namely k = g_k(t_s, t_e) = t_s·m + t_e − t_s − 1 + l_p.
  • Figure 5 shows a concrete example of an alternating sequence of a knowledge graph from the ACE05 dataset of the present application: an alternating sequence representation (middle) of a knowledge graph (bottom) from the ACE05 training set, for the sentence "He was captured in Baghdad late Monday night". A 1 denotes Algorithm 1; this example uses m = 16 and l_p = 19. In the alternating sequence, "19" is the index of the span (0, 1) of "he", "83" is the index of the span (4, 5) of "Baghdad", and "10" is the virtual edge type [SEP]. The input text of this graph (top) is "He was captured in Baghdad late Monday night".
  • with this property, the mapping g_t can easily be incorporated into our decoder to map the alternating sequence y^π back to spans in the hybrid representation H.
  • Fig. 6 shows the general model architecture of the hybrid span decoder of the present application.
  • the decoder of the present application takes the context representation H as input and recurrently decodes the alternating sequence y^π given a start-of-sequence marker.
  • N is the number of decoder layers, the symbol before the softmax function denotes the concatenation operator, and H_y^N is the hidden representation of the sequence y^π from the last decoder layer.
  • the hybrid span decoder of this application can be understood as an autoregressive model operating in the closed context space and output space defined by H.
  • Traversal embedding: to distinguish between mixed spans at different positions in y^π, a simple approach is to add sinusoidal positional embeddings to H_y (Vaswani et al., 2017). However, this approach treats the alternating sequence as an ordinary sequence and ignores the underlying graph structure it encodes. To alleviate this problem, this application proposes a novel traversal embedding method that captures traversal-level information, parent-child information, and intra-level connection information as a replacement for the original positional embedding.
  • the traversal embeddings of the present application can encode BFS or DFS traversal patterns. As an example, this application assumes BFS traversal here.
  • FIG. 7: example of BFS traversal embeddings for the alternating sequence ["he", type, PER, [SEP], "Baghdad", type, GPE, PHYS, "he"].
  • the BFS traversal embedding of the present application is the pointwise sum of the layer embedding L, the parent-child embedding P, and the tree embedding T given an alternating sequence y,
  • the layer embedding assigns the same embedding vector L_i to every position in BFS traversal level i, and fills the values of the embedding vector with non-parametric sinusoidal position embeddings, because the application wants its embeddings to extrapolate to sequences longer than any sequence in the training set.
  • Parent-child embedding assigns different randomly initialized embedding vectors to the positions of the parent and child nodes in the BFS traversal level to help the model distinguish between the two kinds of nodes.
  • the connections between the nodes within a BFS level can be regarded as a tree of depth 3, where the first depth holds the parent node, the second depth holds the edge types,
  • and the third depth consists of the child nodes corresponding to each edge type.
  • the tree embeddings of our application are then formed by encoding the position information of these depth-3 trees using tree position embeddings (Shiv and Quirk, 2019) at each BFS level.
  • Figure 7 shows a concrete example of how these embeddings work for a given alternating sequence.
  • the obtained traversal embeddings are then added point-wise to the hidden representation of the alternating sequence H y to inject graph-structured traversal information.
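A minimal sketch of the first two components of the traversal embedding is given below: a sinusoidal layer vector shared by every position in a BFS level, plus a learned parent/child vector. The depth-3 tree embedding is omitted for brevity, and all names are illustrative rather than the patent's implementation.

```python
import math
import torch

def sinusoidal(position: int, dim: int) -> torch.Tensor:
    """Non-parametric sinusoidal vector for one position (one BFS level)."""
    vec = torch.zeros(dim)
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        vec[i] = math.sin(angle)
        if i + 1 < dim:
            vec[i + 1] = math.cos(angle)
    return vec

def bfs_traversal_embedding(levels, is_parent, dim, parent_child_emb):
    """levels[i]: BFS level of position i; is_parent[i]: True for parent positions.
    Returns the point-wise sum of the layer embedding and the parent/child embedding."""
    rows = []
    for lvl, parent in zip(levels, is_parent):
        rows.append(sinusoidal(lvl, dim) + parent_child_emb[0 if parent else 1])
    return torch.stack(rows)                       # (sequence length, dim)

pc = torch.nn.Embedding(2, 16).weight              # learned parent/child vectors
# positions of ["he", type, PER, [SEP], "Baghdad", type, GPE, PHYS, "he"]
emb = bfs_traversal_embedding([0, 0, 0, 0, 1, 1, 1, 1, 1],
                              [True, False, False, False, True, False, False, False, False],
                              16, pc)
print(emb.shape)                                   # torch.Size([9, 16])
```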
  • Inner block: using the input text representation H_text sliced from the hybrid representation H and the target sequence representation H_y, we apply an N-layer Transformer structure with hybrid attention (He et al., 2018) to allow our model to make use of different attention layers when decoding the edges or nodes of the alternating sequence. Note that our hybrid span decoder is orthogonal to the actual choice of neural architecture for the inner block; this application chooses the hybrid-attention Transformer design (He et al., 2018) because its layer-wise coordination property is empirically better suited to the heterogeneous decoding of the two different kinds of sequence elements. The detailed structure of the inner block is explained in Appendix E.
  • Hybrid span decoding: for the hybrid span decoding module, this application first slices the hidden representation of the alternating sequence y^π from the output of the N inner-block layers and denotes it H_y^N. Then, for each hidden representation h_yi^N ∈ H_y^N, 0 ≤ i ≤ |y^π|, two different linear layers are applied to obtain the start position representation s_yi and the end position representation e_yi.
  • W_5, W_6 ∈ R^{d_m×d_m} and b_5, b_6 ∈ R^{d_m} are learnable parameters. The application then computes the scores of the type segment of H and of the target spans of the text segment separately, and concatenates them before the final softmax operator to jointly estimate the probabilities of text spans and type spans,
  • h i is the score vector of possible spans in H's type segment
  • t_i is the score vector of the possible spans in H's text segment. Since the span length of a type span is always 1, this application only needs a single element-wise addition between the start position score h_si and the end position score h_ei to compute h_i.
  • each entry of t_i contains the score of a text span, t_si,j + t_ei,k with k − j < m, computed with the help of an unroll function that transforms a vector t_ei ∈ R^n into a stack of n sliding windows of size m (the maximum span length) with stride 1.
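The sliding-window span scoring can be sketched with `Tensor.unfold` as follows; the exact start/end offset convention (end-exclusive spans, with ends ranging from j+1 to j+m for a start position j) is an assumption made for this sketch.

```python
import torch
import torch.nn.functional as F

def span_scores(start_scores: torch.Tensor, end_scores: torch.Tensor, m: int) -> torch.Tensor:
    """Sketch: score all text spans of length <= m from per-token start/end scores,
    producing only O(n*m) candidates via sliding windows over the end scores."""
    n = start_scores.size(0)
    padded = F.pad(end_scores, (0, m), value=float("-inf"))   # pad so every window exists
    windows = padded.unfold(0, m, 1)[1:n + 1]                 # (n, m): end scores j+1 .. j+m
    return (start_scores.unsqueeze(1) + windows).reshape(-1)  # flattened span score vector

scores = span_scores(torch.randn(7), torch.randn(7), m=4)
print(scores.shape)   # torch.Size([28]) = n * m candidate spans
```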
  • the alternation mask m_a ∈ R^{l_p}, m_a′ ∈ R^n is defined as:
  • this application uses F1 as the evaluation metric for NER and RE.
  • for NER, a prediction is marked as correct when both its type and its boundary span match the gold entity.
  • for the RE task, a prediction is correct when both entities' boundaries and the relation type are correct.
  • ALBERT-xxlarge-v1 uses a dropout rate of 0.1,
  • RoBERTa-large uses a hidden dropout rate of 0.2,
  • and the hybrid span decoder of our application also uses a dropout rate of 0.1 during training.
  • Table 1: joint NER and RE F1 scores of IE models on the ACE05 test set, together with the complexity of each model's entity and relation decoding part (n is the length of the input text). The performance of the TabSeq model reported here is based on the same ALBERT-xxlarge (Lan et al., 2020) pre-trained language model as used in this application.
  • Table 1 compares our model with previous state-of-the-art results on the ACE05 test set.
  • the model using ALBERT in this application performs significantly better on both the NER and RE scores, while maintaining linear space complexity that is an order of magnitude smaller than that of TabSeq.
  • the model of the present application is the first joint model with both linear space and time complexity, and thus has the best scalability for large-scale real-world applications.
  • this application conducted an ablation experiment on the ACE05 data set.
  • the RE F1 score drops significantly after our application removes the traversal embedding, which indicates that our traversal embedding helps encode the graph structure and improves relation prediction.
  • both NER F1 and RE F1 scores drop significantly if alternating masking is abandoned, demonstrating the importance of enforcing alternating patterns.
  • This application can observe that the hybrid attention layer contributes significantly to relation extraction. This is because layer-by-layer coordination can help the decoder untangle source features and utilize different layer features between entity and relation predictions.
  • the present application also observes that DFS traversal performs worse than BFS. This application suspects that, due to the nature of knowledge graphs, the alternating sequences produced by DFS are usually longer than those produced by BFS, which increases the learning difficulty.
  • Fig. 8 is a schematic diagram of the distribution of remaining errors on the ACE05 test set. These may require additional features and strategies to address.
  • Inherent ambiguity Many examples have inherent ambiguity, for example the European Union can be classified as an organization or a political entity, while some entities (for example, military bases) can be both a location and an organization or facility.
  • NER is often done jointly with RE to reduce error propagation and learn interrelationships between tasks.
  • One approach is to view the joint task as a square table filling problem (Miwa and Sasaki, 2014; Gupta et al., 2016; Wang and Lu, 2020), where the i-th column or row represents the i-th token.
  • the diagonal of the table holds the sequence tags of the entities, and the other entries hold the relations between token pairs.
  • Another line of work is to perform RE after NER.
  • BiLSTM (Graves et al., 2013)
  • Tree-LSTM based on dependency graph.
  • a method of building a dynamic text span graph is adopted to detect entities and relations.
  • Seq2Seq-based models have been proposed (Zhang et al., 2020; Zeng et al., 2018, 2020; Wei et al., 2019; Zhang et al., 2019) to generate triples (i.e. node- edge-nodes)
  • the model of this application is fundamentally different from them in that: (1) it generates a BFS/DFS traversal of the target graph, which captures the dependencies between nodes and edges and yields a shorter target sequence; (2)
  • the present application models nodes as spans in the text, which are independent of any word vocabulary, so even if a node's mention is an uncommon or unseen word, the present application can still generate a span for it from the context information.
  • this application proposes HySPA (Hybrid Span Generation), the first end-to-end text-to-graph extraction model with linear space and time complexity in the graph decoding stage.
  • the model achieves the current state-of-the-art performance on the ACE05 joint entity and relation extraction task.
  • another method in which the hybrid attention layer is removed and a standard Transformer encoder-decoder structure is used.
  • This version has a simpler structure but worse performance than the version using hybrid attention layers.
  • an alternative method uses DFS traversal instead of BFS traversal to build the alternating sequence representation of the graph; this version also uses DFS traversal embeddings (see Appendix D for details) instead of BFS traversal embeddings.
  • This version of graph extraction is less accurate than BFS traversal.
  • an alternative method in which words in a span are averaged to encode the span instead of attention-based span encoding.
  • This version of the model structure is simpler and has fewer model parameters but the graph extraction accuracy is inferior to the attention-based span encoding.
  • This application uses 100-dimensional GloVe word embeddings trained on 6B tokens as initialization and freezes their updates during training.
  • the character feature embeddings use a 30-dimensional LSTM encoding, and the GloVe embeddings of out-of-vocabulary tokens are replaced with randomly initialized vectors, following Wang and Lu (2020).
  • This application uses a gradient clipping of 0.25 during training.
  • the number of heads for the hybrid attention in this application is set to 8.
  • the beam size and length penalty are determined by a grid search on the validation set of the ACE05 dataset, with beam sizes ranging from 1 to 7 with a step size of 1 and length penalties ranging from 0.7 to 1.2 with a step size of 0.1.
  • This application selects the optimal beam size and length penalty based on the relation extraction F1 score.
  • the model of this application uses the ALBERT-xxlarge pre-trained language model with 236 million parameters. On average, the best model in this application using ALBERT-xxlarge can be trained distributedly for 20 hours on two NVIDIA TITAN X GPUs.
  • the Automatic Content Extraction (ACE) 2005 dataset contains English, Arabic, and Chinese training data for the evaluation of the 2005 Automatic Content Extraction (ACE) technique, providing entity, relation, and event annotations.
  • This application follows Wadden et al. (2019) for preprocessing and data splitting.
  • the preprocessed data contains 7100 relations, 38000 entities and 14500 sentences.
  • the split contains 10051 training samples, 2424 development samples and 2050 testing samples.
  • DFS layer embedding assigns the same embedding vector L_i to each position in DFS traversal level i, but the values of the embedding vectors are initialized randomly instead of being filled with non-parametric sinusoidal position embeddings, because no proximity information exists between adjacent DFS traversal levels.
  • This application encodes this distance information with a sinusoidal position embedding, which becomes our connection embedding and captures intra-layer connection information.
  • Appendix E: Transformer with hybrid attention layers
  • This application first slices the hidden representation of the input text from the hybrid representation H and denotes it H_text, then feeds the input text representation H_text and the hybrid span encoding H_y into a stack of N hybrid attention/feed-forward blocks with the structure shown in Figure 9, where n is the length of the input text and l_m is the total length of the source and target features.
  • The concatenation of the source features H_text and the target features H_y is denoted H_0; source/target embeddings are also added to H_0 before the first hybrid attention layer to allow the model to distinguish features of the source sequence from those of the target sequence.
  • a hybrid attention layer is combined with a feed-forward layer to form a decoder block:
  • W_{q,k,v}, b_{q,k,v}, W_3 ∈ R^{d_m×4d_m}, W_4 ∈ R^{4d_m×d_m}, b_3, b_4 are learnable parameters
  • LayerNorm is the layer normalization layer (Ba et al., 2016).
  • the decoder blocks are stacked N times to obtain the final hidden representation H^N, and the final representation H_y^N of the target sequence is output.
  • the time complexity of hybrid attention is O(n²) when encoding the source features, but thanks to the causal masking of the target features, this application can cache this part of the hidden representation when generating target tokens, thus keeping the time complexity of each decoding step at O(n).
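For illustration, the following is a plain Transformer-style decoder block that attends over the concatenation of source (text) and target (alternating-sequence) features with a causal mask on the target part. It is a stand-in for the hybrid attention/feed-forward block of Appendix E, not the exact hybrid-attention variant of He et al. (2018).

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch: self-attention over [source; target] with causal masking of the target,
    followed by a feed-forward layer, each with a residual connection and LayerNorm."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h_text: torch.Tensor, h_y: torch.Tensor) -> torch.Tensor:
        n, t = h_text.size(1), h_y.size(1)
        h = torch.cat([h_text, h_y], dim=1)                  # (batch, n + t, d_model)
        mask = torch.zeros(n + t, n + t, dtype=torch.bool)
        mask[:n, n:] = True                                   # source does not attend to target
        mask[n:, n:] = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)      # True = position is blocked
        h = self.norm1(h + attn_out)
        return self.norm2(h + self.ff(h))

block = DecoderBlock(d_model=64)
out = block(torch.randn(2, 10, 64), torch.randn(2, 5, 64))
print(out.shape)   # torch.Size([2, 15, 64])
```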
  • the embodiments of the present invention provide a non-volatile computer-readable storage medium storing one or more programs containing execution instructions that can be read and executed by an electronic device (including but not limited to a computer, server, or network device) in order to perform any of the methods for extracting a graph from text of the present invention.
  • the embodiments of the present invention further provide a computer program product
  • the computer program product includes a computer program stored on a non-volatile computer-readable storage medium
  • the computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above methods for extracting a graph from text.
  • the embodiment of the present invention also provides an electronic device, which includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor; the instructions are executed by the at least one processor so that the at least one processor can perform the method for extracting a graph from text.
  • the embodiment of the present invention further provides a storage medium on which a computer program is stored; when the program is executed by a processor, the method for extracting a graph from text is implemented.
  • FIG. 10 is a schematic diagram of the hardware structure of an electronic device that performs the method for extracting a graph from text provided by another embodiment of the present application. As shown in FIG. 10, the device includes:
  • a processor 1010 and a memory 1020; one processor 1010 is taken as an example in FIG. 10.
  • the device for executing the method for extracting graphs from text may further include: an input device 1030 and an output device 1040 .
  • the processor 1010, the memory 1020, the input device 1030, and the output device 1040 may be connected via a bus or in other ways, and connection via a bus is taken as an example in FIG. 10 .
  • the memory 1020 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for extracting a graph from text in the embodiments of the present application.
  • Processor 1010 executes various functional applications and data processing of the server by running non-volatile software programs, instructions and modules stored in memory 1020, that is, implements the method of extracting graphs from text in the above method embodiments.
  • the memory 1020 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the device for extracting a graph from text, and the like.
  • the memory 1020 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices.
  • the memory 1020 may optionally include memories that are remotely located relative to the processor 1010, and these remote memories may be connected to the device for extracting a graph from text through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 1030 may receive input numeric or character information, and generate signals related to user settings and function control of the device for extracting a graph from text.
  • the output device 1040 may include a display device such as a display screen.
  • the one or more modules are stored in the memory 1020 and, when executed by the one or more processors 1010, perform the method for extracting a graph from text in any of the above method embodiments.
  • the electronic equipment of the embodiment of the present application exists in various forms, including but not limited to:
  • Mobile communication equipment This type of equipment is characterized by mobile communication functions, and its main goal is to provide voice and data communication.
  • Such terminals include: smart phones (such as iPhone), multimedia phones, feature phones, and low-end phones.
  • Ultra-mobile personal computer equipment This type of equipment belongs to the category of personal computers, has computing and processing functions, and generally has the characteristics of mobile Internet access.
  • Such terminals include: PDA, MID and UMPC equipment, such as iPad.
  • Portable entertainment equipment This type of equipment can display and play multimedia content.
  • Such devices include: audio and video players (such as iPod), handheld game consoles, e-books, as well as smart toys and portable car navigation devices.
  • the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each embodiment can be implemented by means of software plus a general hardware platform, and of course also by hardware.
  • the essence of the above technical solutions, or the part that contributes over related technologies, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or a CD, and includes several instructions to make a computer device (which may be a personal computer, server, or network device, etc.) execute the methods described in each embodiment or in some parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for training an alternating sequence generation model, including: obtaining a training sample pair from a sample library, the training sample pair including a paired training text and a training information graph, the training information graph including a plurality of nodes and at least one edge connecting two nodes among the plurality of nodes; generating a training alternating sequence containing node information and edge information from the training information graph; and training an alternating sequence generation model from the training text and the training alternating sequence. When the information graph is extracted from text by the model, the graph is not modeled directly; instead, the problem of extracting a graph from text is transformed into the problem of extracting an alternating sequence from text, so that the alternating sequence generation model obtained by the method of this embodiment has only linear time and space complexity when used for graph extraction, significantly improving time and space efficiency.

Description

Method for training an alternating sequence generation model and method for extracting a graph from text — Technical Field
The present invention relates to the technical field of information processing, and in particular to a method for training an alternating sequence generation model, a method for extracting a graph from text, an electronic device, and a computer-readable storage medium.
Background Art
Existing methods for extracting graphs from text usually first encode a piece of text with a neural network and then use pairwise scoring to generate the edges of the graph; or use a multi-dimensional recurrent neural network to generate an adjacency table of the graph; or extract the graph from text by generating a sequence of node-edge-node triples. In addition, some techniques represent graph nodes as concrete text or words.
The time and space complexity of these techniques is usually high (above linear), or they cannot accurately extract nodes containing uncommon/unseen words, or they ignore the dependencies between graph elements (edges and nodes), resulting in low accuracy and precision of graph extraction.
Because the pairwise scoring method must traverse all possible text pairs, it has high time complexity; the method using a multi-dimensional recurrent neural network must store hidden representations of the entire graph adjacency table, so it has high space complexity. Representing graph nodes as concrete words or text prevents the node classifier from accurately estimating the probability distribution of uncommon/unseen words, so those words cannot be accurately extracted as graph nodes, which also affects the overall accuracy and precision of graph extraction. The pairwise scoring method treats each edge as an independent element and classifies each one separately, ignoring the dependencies between edges; the triple-sequence generation method classifies edges and nodes independently when generating triples, ignoring the dependencies between edges and nodes. Ignoring these dependencies degrades the overall accuracy and precision of graph extraction.
In general, when using the existing techniques, the inventor found that these solutions have high time or space complexity while the overall accuracy and precision of graph extraction is low, making them difficult to apply to practical industrial-scale scenarios with large-scale, long texts.
Summary of the Invention
Embodiments of the present invention provide a method for training an alternating sequence generation model, a method for extracting a graph from text, an electronic device, and a computer-readable storage medium, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for training an alternating sequence generation model, including:
obtaining a training sample pair from a sample library, the training sample pair including a paired training text and a training information graph, the training information graph including a plurality of nodes and at least one edge connecting two nodes among the plurality of nodes;
generating a training alternating sequence containing node information and edge information from the training information graph;
training an alternating sequence generation model from the training text and the training alternating sequence.
In some embodiments, generating a training alternating sequence containing node information and edge information from the training information graph includes: traversing the training information graph with a preset traversal algorithm to generate a training alternating sequence containing node information and edge information.
In some embodiments, the training alternating sequence consists of node information and edge information alternating with each other.
In some embodiments, the node information includes node type information, and the edge information includes actual edge type information and virtual edge type information.
In some embodiments, the training information graph includes spans, which serve as addresses of input text segments, and types, which represent abstract concepts; a type may be a span of length 1 over the vocabulary of node type information, actual edge type information, and virtual edge type information.
In some embodiments, training the alternating sequence generation model from the training text and the training alternating sequence includes: processing the output distribution of the alternating sequence generation model with an alternation mask to obtain an alternating sequence of node information and edge information alternating with each other.
In a second aspect, an embodiment of the present invention provides a method for extracting a graph from text, including:
inputting the text to be processed into the alternating sequence generation model trained by the aforementioned method to obtain a target alternating sequence;
generating a target information graph from the target alternating sequence.
In some embodiments, generating the target information graph from the target alternating sequence includes:
processing the target alternating sequence according to the preset traversal algorithm used when training the alternating sequence generation model, to generate the target information graph.
In a third aspect, an embodiment of the present invention provides a storage medium storing one or more programs containing execution instructions that can be read and executed by an electronic device (including but not limited to a computer, server, or network device) in order to perform any of the above methods for extracting a graph from text of the present invention.
In a fourth aspect, an electronic device is provided, including: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor; the instructions are executed by the at least one processor so that the at least one processor can perform any of the above methods for extracting a graph from text of the present invention.
In a fifth aspect, an embodiment of the present invention further provides a computer program product, including a computer program stored on a storage medium; the computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above methods for extracting a graph from text.
In this embodiment, when the information graph is extracted from text by the model, the graph is not modeled directly; instead, the problem of extracting a graph from text is transformed into the problem of extracting an alternating sequence from text, so that the alternating sequence generation model obtained by the method of this embodiment has only linear time and space complexity when used for graph extraction, significantly improving time and space efficiency.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of an embodiment of the alternating sequence generation model training method of the present invention;
Fig. 2 is a flowchart of an embodiment of the method for extracting a graph from text of the present invention;
Fig. 3 is a schematic diagram of an embodiment of an alternating sequence of an information multigraph of the present invention;
Fig. 4 is a schematic diagram of an embodiment of the encoder architecture of the present invention;
Fig. 5 is a schematic diagram of an embodiment of an alternating sequence of a knowledge graph from the ACE05 dataset of the present invention;
Fig. 6 is a schematic diagram of an embodiment of the hybrid span decoder of the present invention;
Fig. 7 is a schematic diagram of BFS traversal embeddings of alternating sequences of the present invention;
Fig. 8 is a schematic diagram of the distribution of remaining errors on the ACE05 test set of the present invention;
Fig. 9 is a schematic structural diagram of the Transformer with hybrid attention layers of the present invention;
Fig. 10 is a schematic structural diagram of an embodiment of the electronic device of the present invention.
Detailed Description of the Embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention. It should be noted that, in the absence of conflict, the embodiments of this application and the features in the embodiments may be combined with each other.
The invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
When using the existing techniques, the inventor found that these solutions have high time or space complexity while the overall accuracy and precision of graph extraction is low, making them difficult to apply to practical industrial-scale scenarios with large-scale, long texts. The inventor therefore proposes a solution with both high performance and high efficiency to suit existing industrial application scenarios.
As shown in Fig. 1, an embodiment of the present invention provides a method for training an alternating sequence generation model, including:
S11. Obtaining a training sample pair from a sample library, the training sample pair including a paired training text and a training information graph, the training information graph including a plurality of nodes and at least one edge connecting two nodes among the plurality of nodes.
Exemplarily, the training information graph can be viewed as a heterogeneous multigraph (Li et al., 2014; Shi et al., 2017) G = (V, E), where V is a set of nodes (usually representing spans (t_s, t_e) in the input document) and E is a multiset of edges, with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R. Node types and edge types are assumed to be drawn from a finite vocabulary. Node types can be used to represent entity types (PER, ORG, etc.), while edge types can represent relations between nodes (PHYS, ORG-AFF, etc.).
S12. Generating a training alternating sequence containing node information and edge information from the training information graph.
In this embodiment, the space of heterogeneous multigraphs G is not modeled directly; instead, a mapping s^π = f_s(G, π) from G to the sequence space S^π is constructed. f_s depends on the (given) ordering π of nodes and their edges in G, constructed by a graph traversal algorithm such as breadth-first search (BFS) or depth-first search (DFS) together with an internal ordering of node and edge types.
In some embodiments, the node information includes node type information, and the edge information includes actual edge type information and virtual edge type information.
This application assumes that the elements s_i^π of the sequence s^π are drawn from the finite set consisting of the node representations V (defined below), the node type set Q, the edge type set R, and the set of "virtual" edge types U, i.e., s_i^π ∈ V ∪ Q ∪ R ∪ U. The virtual edge types U = {[SOS], [EOS], [SEP]} do not represent edges in G but are used to control the generation of the sequence, indicating the start/end of the sequence and the separation of levels in the graph.
Exemplarily, the training alternating sequence consists of node information and edge information alternating with each other. For example, suppose the sequence s^π = s_0^π, ..., s_n^π has an alternating structure, where s_0^π, s_2^π, s_4^π, ... represent nodes in V and s_1^π, s_3^π, s_5^π, ... represent actual or virtual edges. In the case of BFS, this application exploits the fact that BFS visits nodes layer by layer (i.e., in the order p_i, c_i1, ..., c_ik, p_j, where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be one of the children of p_i) and turns it into a sequence,
s^π = p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...
where the special edge type [SEP] is used to delimit the levels in the graph. The specific name of this special edge type can be arbitrary, including but not limited to [SEP] as used herein. In the case of DFS, the [SEP] type appears immediately after the leaf nodes. If it is known which type of graph traversal algorithm (BFS or DFS) the alternating sequence is based on, this representation allows the original information graph to be recovered unambiguously.
S13. Training an alternating sequence generation model from the training text and the training alternating sequence.
In this embodiment, when the information graph is extracted from text by the model, the graph is not modeled directly; instead, the problem of extracting a graph from text is transformed into the problem of extracting an alternating sequence from text, so that the alternating sequence generation model obtained by the method of this embodiment has only linear time and space complexity when used for graph extraction, significantly improving time and space efficiency.
In some embodiments, step S12 of generating a training alternating sequence containing node information and edge information from the training information graph includes: traversing the training information graph with a preset traversal algorithm to generate a training alternating sequence containing node information and edge information. The preset traversal algorithm may be a breadth-first search (BFS) algorithm or a depth-first search (DFS) algorithm, which is not limited in this application.
In some embodiments, the training information graph includes spans, which serve as addresses of input text segments, and types, which represent abstract concepts; a type may be a span of length 1 over the vocabulary of node type information, actual edge type information, and virtual edge type information.
The node and edge representation of the alternating sequence relies on the observation that there are only two kinds of objects in an information graph: spans (as addresses of input text segments) and types (as representations of abstract concepts). Since this application can treat types as special spans of length 1 over the vocabulary of all types (Q ∪ R ∪ U), it defines these ordered sets consisting of text spans and type spans of length 1 as mixed spans. The indices in the ordered set can be reversibly mapped back to types or text spans based on their magnitude. Through this joint indexing of spans and types, the task of generating an information graph is thus transformed into generating an alternating sequence of mixed spans.
In some embodiments, training the alternating sequence generation model from the training text and the training alternating sequence includes: processing the output distribution of the alternating sequence generation model with an alternation mask to obtain an alternating sequence of node information and edge information alternating with each other.
Exemplarily, the alternating sequence generation model is a neural decoder that is forced to generate alternating sequences by decoding spans and types in a hybrid manner. For each decoding step, the decoder of this application has only linear space and time complexity with respect to the input sequence length, and, owing to its nature as a sequential decision process, it can capture the interdependencies between referents and types.
As shown in Fig. 2, which is a flowchart of an embodiment of the method for extracting a graph from text of the present invention, this embodiment includes:
S21. Inputting the text to be processed into the alternating sequence generation model trained by the aforementioned alternating sequence generation model training method, to obtain a target alternating sequence;
S22. Generating a target information graph from the target alternating sequence.
In this embodiment, when the information graph is extracted from text by the model, the graph is not modeled directly; instead, the problem of extracting a graph from text is transformed into the problem of extracting an alternating sequence from text, so that graph extraction from text has only linear time and space complexity, significantly improving time and space efficiency.
In some embodiments, generating the target information graph from the target alternating sequence includes:
processing the target alternating sequence according to the preset traversal algorithm used when training the alternating sequence generation model, to generate the target information graph.
To describe the technical solution of the present invention more clearly, and to demonstrate more directly its practicability and its benefits over the prior art, the technical background, the technical solution, and the experiments of the present invention are described in more detail below.
Abstract
Text-to-graph extraction aims to automatically extract information graphs consisting of mentions (or entities) and types from natural language text. Existing approaches, such as table filling and pairwise scoring, show impressive performance on various information extraction tasks, but they are difficult to scale to datasets with longer input texts because of their second-order space/time complexity with respect to the input length. In this work, this application proposes a Hybrid SPan Generator (HySPA) that maps an information graph to an alternating sequence of nodes and edge types and directly generates such a sequence via a hybrid span decoder, which can recurrently decode spans and types in linear time and space complexity. Extensive experiments on the ACE05 dataset show that the approach of this application also significantly outperforms existing methods on the joint entity and relation extraction task.
1. Introduction
Information extraction (IE) can be viewed as a text-to-graph extraction task, aiming to extract an information graph consisting of mentions (or entities) and types from unstructured text (Li et al., 2014; Shi et al., 2017), where the nodes of the graph are mentions or entity types and the edges are relation types expressing the relations between nodes. Typical approaches to graph extraction decompose the extraction process into sub-tasks, such as named entity recognition (NER) (Florian et al., 2006, 2010) and relation extraction (RE) (Sun et al., 2011; Jiang and Zhai, 2007), and perform them either separately (Chan and Roth, 2011) or jointly (Li and Ji, 2014; Eberts and Ulges, 2019).
Recent joint IE models (Wadden et al., 2019; Wang and Lu, 2020; Lin et al., 2020) show impressive performance on various IE tasks because they can mitigate error propagation and exploit the interdependencies between tasks. Previous work often uses pairwise scoring techniques to identify relation types between entities. However, this approach is computationally inefficient because it must enumerate all possible entity pairs in the document, and, owing to the sparsity of relations between entities, the relation type is null in most cases. Moreover, pairwise scoring evaluates each relation type independently and therefore cannot capture the interrelations between the relation types of different mention pairs.
Another approach treats the joint information extraction task as a table filling problem (Zhang et al., 2017; Wang and Lu, 2020) and uses a multi-dimensional recurrent neural network to generate a 2D table (Graves et al., 2007). This can capture the interrelations between entities and relations, but the space complexity grows quadratically with the length of the input text, making the approach impractical for long sequences.
Some attempts, e.g., Seq2RDF (Liu et al., 2018) and IMoJIE (Kolluru et al., 2020), exploit the power of seq2seq models (Cho et al., 2014) to capture the interrelations between mentions and types with first-order complexity, but they both use a predefined vocabulary for mention prediction, which depends heavily on the target word distribution and cannot handle unseen, out-of-vocabulary words.
To address these problems, this application proposes a first-order method that reversibly maps the target graph to an alternating sequence of nodes and edges, and applies a hybrid span generator that directly learns to generate such alternating sequences. The main contributions of this application are threefold:
· This application proposes a general technique for reversible mapping between information graphs and alternating sequences (given a graph traversal algorithm). Generating the alternating sequence is equivalent to generating the original information graph.
· This application proposes a new neural decoder that is forced to generate alternating sequences by decoding spans and types in a hybrid manner. For each decoding step, the decoder has only linear space and time complexity with respect to the input sequence length, and, owing to its nature as a sequential decision process, it can capture the interdependencies between referents and types.
· This application conducts extensive experiments on the Automatic Content Extraction (ACE) dataset, which show that the model achieves state-of-the-art performance on the joint entity and relation extraction task of extracting a knowledge graph from a piece of unstructured text.
2. Modeling information graphs as alternating sequences
An information graph can be viewed as a heterogeneous multigraph (Li et al., 2014; Shi et al., 2017) G = (V, E), where V is a set of nodes (usually representing spans (t_s, t_e) in the input document) and E is a multiset of edges, with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R. Node and edge types are assumed to be drawn from a finite vocabulary. Node types can be used, for example, to represent entity types (PER, ORG, etc.), while edge types can represent relations between nodes (PHYS, ORG-AFF, etc.). In this work, this application represents node types as separate nodes that are connected to their corresponding nodes v by a special edge type.
Representing an information graph as a sequence: instead of modeling the space of heterogeneous multigraphs G directly, this application constructs a mapping s^π = f_s(G, π) from G to the sequence space S^π. f_s depends on the (given) ordering π of nodes and their edges in G, constructed by a graph traversal algorithm such as breadth-first search (BFS) or depth-first search (DFS) together with an internal ordering of node and edge types. This application assumes that the elements s_i^π of the resulting sequence s^π are drawn from the finite set consisting of the node representations V, the node types Q, the (actual) edge types R, and the "virtual" edge types U: s_i^π ∈ V ∪ Q ∪ R ∪ U. The virtual edge types U = {[SOS], [EOS], [SEP]} do not represent edges in G but are used to control the generation of the sequence, indicating the start/end of the sequence and the division of levels in the graph.
This application further assumes that s^π = s_0^π, ..., s_n^π has an alternating structure, where s_0^π, s_2^π, s_4^π, ... represent nodes in V and s_1^π, s_3^π, s_5^π, ... represent actual or virtual edges. In the case of BFS, this application exploits the fact that BFS visits nodes layer by layer, i.e., in the order p_i, c_i1, ..., c_ik, p_j (where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be one of the children of p_i), and turns s^π into a sequence,
s^π = p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...
where the special edge type [SEP] delimits the levels in the graph. If it is known which type of graph traversal (BFS or DFS) is assumed, this representation allows the original graph to be recovered unambiguously. Algorithm 1 (used in this application to convert the graphs in the training data into sequences) shows how an alternating sequence can be constructed from a given graph using BFS traversal. Figure 3 shows the alternating sequence of an information multigraph. The length |s^π| is linearly bounded by the size of the graph, O(|s^π|) = O(|V| + |E|) (which is also the complexity of typical graph traversal algorithms such as BFS/DFS).
Figure 3: this application represents a directed multigraph as an alternating sequence of nodes (A, B, C, D, E) and edges (1, 2, 3, 4, [S]). Here, the graph is traversed by breadth-first search (BFS) in ascending order of node and edge types. "[S]" or [SEP] are virtual edge types that mark the end of each BFS level.
[Algorithm 1, the BFS-based construction of the alternating sequence, is rendered as an image in the original document.]
Node and edge representation: the node and edge representation of this application (explained below) relies on the observation that there are only two kinds of objects in an information graph: spans (as addresses of input text segments) and types (as representations of abstract concepts). Since this application can treat types as special spans of length 1 over the vocabulary of all types (Q ∪ R ∪ U), it only needs O(nm + |Q ∪ R ∪ U|) indices to unambiguously represent spans and types over the concatenation of the type vocabulary and the input text, where n is the maximum input length and m is the maximum span length, m << n. This application defines these ordered sets consisting of text spans and type spans of length 1 as mixed spans. The indices can be reversibly mapped back to types or text spans based on their magnitude (the details of this mapping are explained in Section 3.2). Through this joint indexing of spans and types, the task of generating an information graph is thus transformed into generating an alternating sequence of mixed spans.
Generating the sequence: this application models the distribution p(s^π) with a sequence generator h with parameters θ (|s| denotes the length of s^π):
p(s^π) = ∏_{i=0}^{|s|} h(s_i^π | s_{<i}^π, x; θ).
This application will discuss in the following sections how to force the sequence generator h to generate sequences only in the space S^π, since it does not want h to assign non-zero probability to arbitrary sequences that have no corresponding graph.
3. HySPA: hybrid span generation for alternating sequences
To directly generate a target sequence that alternates between nodes representing spans in the input and the node/edge types of the extraction task, this application first builds a hybrid representation H, which is the concatenation of the hidden representations of the edge types, node types, and input text. This representation serves both as the context space and as the output space of the decoder. The application then reversibly maps both the spans of the input text and the indices of the types to mixed spans based on the representation H. Finally, the mixed spans are generated autoregressively by the hybrid span decoder to form the alternating sequence y^π ∈ S^π. By converting the graph extraction task into a sequence generation task, this application can easily use beam search decoding to reduce the exposure bias that can arise in sequential decision processes (Wiseman and Rush, 2016), thereby finding globally better graph representations.
High-level overview of HySPA: the HySPA model takes a piece of text (e.g., a sentence or paragraph) and predefined node and edge types as input, and outputs an alternating sequence representation of the information graph. This application forces the alternating generation of this sequence by applying an alternation mask to the output probabilities. The detailed architecture is described in the following subsections.
3.1 Text and type encoder
Figure 4 shows the encoder architecture of the proposed model, where ⊕ denotes the concatenation operator, k is the index of a word vector in H_0, and l_e = |R| + |U|. The colored tables on the right represent the meta-type assignments for the different chunks of concatenated word vectors in H_0. For the node type set Q, the edge type set R, and the virtual edge types U, this application arranges the type list v as the concatenation of the label names of the edge types, virtual edge types, and node types, i.e., v = l_R ⊕ l_U ⊕ l_Q, where l_R, l_U, l_Q are the lists of type names in the sets R, U, Q, respectively. Note that the order in which the type name lists are concatenated can be arbitrary, as long as it is kept consistent throughout the model. Then, as in the embedding part of the table-sequence encoder (Wang and Lu, 2020), for each type v_i this application embeds the label tokens of the type using contextual word embeddings from a pre-trained language model, GloVe embeddings (Pennington et al., 2014), and character feature embeddings; GloVe (Global Vectors for Word Representation) is a word representation tool based on global, count-based co-occurrence statistics:
E_1 = ContextualizedEmbed(v) ∈ R^{l_p×d_c},
E_2 = GloveEmbed(v) ∈ R^{l_p×d_g},
E_3 = CharacterEmbed(v) ∈ R^{l_p×d_k},
E_v = [E_1 ⊕ E_2 ⊕ E_3] W_0 ∈ R^{l_p×d_m},
where l_p = |R| + |U| + |Q| is the total number of types, W_0 ∈ R^{d_e×d_m} is the weight matrix of a linear projection layer, d_e = d_c + d_g + d_k is the total embedding dimension, and d_m is the hidden size of the model. After obtaining the contextual embeddings of the tokens of each type v_i ∈ v, this application takes the average of these token vectors as the representation of v_i and freezes its updates during training. More details are given in Appendix A.
该嵌入途径还用于嵌入输入文本x中的单词。与类型嵌入的途径不同,本申请将单词表示为来自预训练语言模型(LM,egBERT(Devlin et al.,2018))的第一个子标记的上下文嵌入,并以端到端的形式对该语言模型微 调。
在分别获得类型嵌入E_v和文本嵌入E_x后，本申请将它们沿序列长度维度连接起来，形成混合表示H_0。由于H_0是来自四种不同类型标记（即边类型、虚拟边类型、节点类型和文本）的词向量的串联，因此应用元类型嵌入来指示H_0中各向量块之间的这种类型差异（如图4所示）。最终的上下文表示H是通过元类型嵌入和H_0的逐元素相加得到的：

H_0 = E_v ⊕ E_x ∈ R^{l_h×d_m}，

H = H_0 + E_meta ∈ R^{l_h×d_m}（E_meta为按图4所示的块分配的元类型嵌入），

其中，l_h=l_p+|x|是本申请的混合表示矩阵H的高度。
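下面用一段示意性的PyTorch风格代码概括上述混合表示H的构造方式：类型嵌入E_v与文本嵌入E_x沿序列长度维拼接得到H_0，再与按块分配的元类型嵌入逐元素相加。其中的模块划分与接口均为基于上文符号的假设，仅作草图而非确切实现。

```python
import torch
import torch.nn as nn

class MixedRepresentation(nn.Module):
    """H = concat(E_v, E_x) + 元类型嵌入 的示意实现。"""

    def __init__(self, d_m: int):
        super().__init__()
        # 4 种元类型：0=实际边类型, 1=虚拟边类型, 2=节点类型, 3=文本
        self.meta_type_embed = nn.Embedding(4, d_m)

    def forward(self, E_v, E_x, num_R: int, num_U: int, num_Q: int):
        # E_v: (l_p, d_m) 类型标签嵌入，l_p = |R|+|U|+|Q|
        # E_x: (n, d_m)   输入文本的词向量
        H0 = torch.cat([E_v, E_x], dim=0)            # (l_h, d_m), l_h = l_p + n
        meta_ids = torch.cat([
            torch.full((num_R,), 0),
            torch.full((num_U,), 1),
            torch.full((num_Q,), 2),
            torch.full((E_x.size(0),), 3),
        ]).long()
        return H0 + self.meta_type_embed(meta_ids)   # 逐元素相加得到 H
```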
3.2、Span&Types与混合Span之间的可逆映射
给定文本中的跨度t=(t_s, t_e) ∈ N^2，t_s < t_e，本申请通过映射g_k将跨度t转换为表示H中的索引k（k ≥ l_p）：

k = g_k(t_s, t_e) = t_s·m + (t_e − t_s − 1) + l_p ∈ N，

其中，m是跨度的最大长度，l_p=|R|+|U|+|Q|。本申请保持图中的类型索引不变，因为它们小于l_p，而k ≥ l_p。由于对于信息图，指称的最大跨度长度m通常远小于文本的长度，即m<<n，因此本申请可以通过仅考虑长度不超过m的跨度，将k的最大量级从O(n^2)降低到O(nm)，从而使本申请的解码器相对于输入文本长度n保持线性空间复杂度。图5给出了一个具体示例：来自ACE05训练集的知识图（底部）及其交替序列表示（中间），输入文本（顶部）为“他在星期一深夜在巴格达被捕（He was captured in Baghdad late Monday night）”。其中A_1表示算法1，本示例中取m=16和l_p=19。交替序列中的“19”是“他”的跨度(0,1)的索引，“83”是“巴格达”的跨度(4,5)的索引，“10”是虚拟边类型[SEP]的索引。
由于t_s、t_e、k都是自然数，本申请可以构造一个逆映射g_t，将H中的索引k转换回t=(t_s, t_e)。对于文本跨度（k ≥ l_p）：

t_s = g_{t_s}(k) = ⌊(k − l_p)/m⌋，

t_e = g_{t_e}(k) = t_s + ((k − l_p) mod m) + 1，

其中，⌊·⌋是向下取整函数，mod是模运算符。请注意，g_t(k)可以直接应用于H的类型段中的索引（k < l_p），并保持它们的值不变，即g_t(k)=(k, k)，对所有k < l_p成立。有了这个特性，本申请可以轻松地将映射g_t合并到本申请的解码器中，以将交替序列y^π映射回混合表示H中的跨度。
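下面给出跨度与混合跨度索引之间可逆映射的示意代码，并用图5中的数值加以验证（m=16、l_p=19时，“他”的跨度(0,1)对应索引19，“巴格达”的跨度(4,5)对应索引83）。需要说明的是，为使逆映射与正向映射严格互逆，文本跨度的结束位置在代码中加回了1，这一写法是按正向公式推导的假设，代码注释中亦有标注。

```python
def span_to_index(t_s: int, t_e: int, m: int, l_p: int) -> int:
    """正向映射 g_k：文本跨度 (t_s, t_e) -> 混合表示 H 中的索引 k。"""
    assert 0 < t_e - t_s <= m, "仅考虑长度不超过 m 的跨度"
    return t_s * m + (t_e - t_s - 1) + l_p


def index_to_span(k: int, m: int, l_p: int) -> tuple:
    """逆映射 g_t：索引 k -> 跨度 (t_s, t_e)。

    对类型段索引（k < l_p）保持原值不变；对文本跨度，
    此处按正向公式推导逆映射（含 +1 项），与图5中的数值示例一致。
    """
    if k < l_p:                      # 类型跨度：长度为1，直接返回 (k, k)
        return (k, k)
    t_s = (k - l_p) // m
    t_e = t_s + (k - l_p) % m + 1
    return (t_s, t_e)


# 用图5的示例验证：m = 16, l_p = 19
assert span_to_index(0, 1, 16, 19) == 19      # “他”的跨度 (0, 1)
assert span_to_index(4, 5, 16, 19) == 83      # “巴格达”的跨度 (4, 5)
assert index_to_span(83, 16, 19) == (4, 5)
assert index_to_span(10, 16, 19) == (10, 10)  # 虚拟边类型 [SEP] 的索引保持不变
```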
3.3、混合跨度解码器
图6显示了本申请的混合跨度解码器的一般模型架构。本申请的解码器将上下文表示H作为输入，并在给定序列开始标记的情况下循环解码交替序列y^π。N是解码器层数，softmax函数之前的⊕表示连接运算符，H_y^N是来自最后一个解码器层的序列y^π的隐藏表示。本申请的混合跨度解码器可以理解为一个自回归模型，它在由H定义的封闭的上下文空间和输出空间中操作。
基于注意力的混合跨度编码：给定交替序列y^π和映射g_t（第3.2节），本申请的解码器首先将y^π中的每个索引映射到基于表示H的一个跨度，(t_si, t_ei) = g_t(y_i^π)，然后将这些跨度转换为注意力掩码M_0，以允许模型学习把跨度表示为该跨度所引用的上下文词表示片段的加权和：

M_0[i, j] = 0，若 t_si ≤ j < t_ei；否则为 −∞，

H_y = softmax((H_[CLS] W_1 + b_1)(H W_2 + b_2)^T + M_0) H，

其中，H_[CLS] ∈ R^{|y^π|×d_m}是将序列开头标记[CLS]（来自H的文本段）的隐藏表示重复|y^π|次得到的矩阵，H_y是本申请对y^π中混合跨度的最终表示。W_1、W_2、b_1、b_2是可学习的参数，t_si、t_ei是当前正在编码的跨度的开始和结束位置。请注意，对于长度为1的类型跨度，softmax计算的结果将始终为1，这使得其跨度表示恰好是本申请所期望的嵌入向量。
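下面给出基于注意力的混合跨度编码的示意代码：每个跨度的表示被计算为其覆盖的上下文词向量的加权和，跨度之外的位置通过掩码置为负无穷。代码为依据上文描述的草图，线性层与掩码的具体组织方式均为假设，并非确切实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanAttentionEncoder(nn.Module):
    """把混合跨度 (t_s, t_e) 编码为上下文表示 H 上的注意力加权和（示意实现）。"""

    def __init__(self, d_m: int):
        super().__init__()
        self.q_proj = nn.Linear(d_m, d_m)   # 对应文中的 W_1, b_1
        self.k_proj = nn.Linear(d_m, d_m)   # 对应文中的 W_2, b_2

    def forward(self, H, h_cls, spans):
        # H: (l_h, d_m) 混合表示；h_cls: (d_m,) [CLS]标记的隐藏表示
        # spans: 长度为|y|的列表，元素为由 g_t 得到的 (t_s, t_e)
        q = self.q_proj(h_cls).expand(len(spans), -1)   # (|y|, d_m)
        k = self.k_proj(H)                               # (l_h, d_m)
        scores = q @ k.t()                               # (|y|, l_h)
        mask = torch.full_like(scores, float("-inf"))
        for i, (t_s, t_e) in enumerate(spans):
            end = t_e if t_e > t_s else t_s + 1          # 类型跨度(k,k)视为长度1
            mask[i, t_s:end] = 0.0                       # 仅允许关注跨度内的位置
        attn = F.softmax(scores + mask, dim=-1)          # 长度为1的跨度权重恒为1
        return attn @ H                                   # (|y|, d_m)，即 H_y
```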
遍历嵌入：为了区分y^π中不同位置的混合跨度，一种简单的方法是向H_y添加正弦位置嵌入（Vaswani et al.,2017）。然而，这种方法将交替序列视为普通序列，忽略了它所编码的底层图结构。为了缓解这个问题，本申请提出了一种新颖的遍历嵌入方法，该方法捕获遍历层级信息、父子信息和层级内连接信息，作为原始位置嵌入的替代。本申请的遍历嵌入可以编码BFS或DFS遍历模式。作为示例，本申请在此假设BFS遍历。
图7：交替序列[“他”, 类型, PER, [SEP], “巴格达”, 类型, GPE, PHYS, “他”]的BFS遍历嵌入示例。本申请的BFS遍历嵌入是层嵌入L、父-子嵌入P和树嵌入T在给定交替序列y上的逐点总和：

TravEmbed(y) = L(y) + P(y) + T(y) ∈ R^{|y|×d_m}，

其中，层嵌入为BFS遍历层级i中的每个位置分配相同的嵌入向量L_i，且嵌入向量的值用非参数的正弦位置嵌入填充，因为本申请希望嵌入能够外推到比训练集中任何序列都更长的序列。父-子嵌入为BFS遍历层级中父节点和子节点的位置分配不同的、随机初始化的嵌入向量，以帮助模型区分这两种节点。为了对层级内的连接信息进行编码，本申请的见解是：BFS某一层级中各节点之间的连接可以看作一棵深度为3的树，其中第一层深度为父节点，第二层深度为边类型，第三层深度由每个边类型对应的子节点组成。然后，通过对该深度为3的树使用树位置嵌入（Shiv and Quirk,2019）对每个BFS层级的位置信息进行编码，即得到本申请的树嵌入。图7给出了这些嵌入如何作用于给定交替序列的具体示例。随后将得到的遍历嵌入逐点加到交替序列的隐藏表示H_y上，以注入图结构的遍历信息。
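作为对BFS遍历嵌入的直观说明，下面的示意代码展示了如何从一个交替序列推出每个位置的BFS层级编号与父/子/边角色（树嵌入部分从略），随后即可据此查表得到层嵌入与父-子嵌入向量。代码中的角色划分方式为基于上文描述的假设。

```python
def bfs_levels_and_roles(alt_seq, sep="[SEP]"):
    """为交替序列的每个位置分配 BFS 层级编号与父/子/边角色（示意实现）。

    偶数位置为节点，奇数位置为边类型；每个层级内第一个节点视为父节点，
    其余节点视为子节点，[SEP] 标记一个层级的结束。
    """
    levels, roles = [], []
    level, is_parent = 0, True
    for i, tok in enumerate(alt_seq):
        levels.append(level)
        if i % 2 == 0:                       # 节点位置
            roles.append("parent" if is_parent else "child")
            is_parent = False
        else:                                # 边类型位置
            roles.append("edge")
            if tok == sep:                   # 一个BFS层级结束
                level += 1
                is_parent = True
    return levels, roles


# 图7中的示例序列
seq = ["他", "类型", "PER", "[SEP]", "巴格达", "类型", "GPE", "PHYS", "他"]
print(bfs_levels_and_roles(seq))
```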
内部块：利用从混合表示H中切出的输入文本表示H_text和目标序列表示H_y，本申请应用具有混合注意力的N层Transformer结构（He等人，2018），以允许本申请的模型在解码交替序列的边或节点时利用来自不同注意力层的特征。需要注意的是，本申请的混合跨度解码器与内部块神经结构的具体选择是正交的；本申请选择混合注意力Transformer的设计（He et al.,2018），是因为其逐层协调的特性在经验上更适合本申请对两种不同类型序列元素的异构解码。内部块的详细结构在附录E中解释。
混合跨度解码：对于混合跨度解码模块，本申请首先从N层内部块的输出中切出交替序列y^π的隐藏表示，并将其表示为H_y^N。然后对于每个隐藏表示h_yi^N ∈ H_y^N，0 ≤ i ≤ |y^π|，本申请应用两个不同的线性层来获得起始位置表示s_yi和结束位置表示e_yi：

s_yi = W_5 h_yi^N + b_5，

e_yi = W_6 h_yi^N + b_6，
其中，W_5, W_6 ∈ R^{d_m×d_m}和b_5, b_6 ∈ R^{d_m}是可学习的参数。然后本申请分别计算H的类型段和文本段上目标跨度的得分，并在最终的softmax算子之前将它们连接在一起，以联合估计文本跨度和类型跨度的概率：

h_si = H_type·s_yi ∈ R^{l_p}，  h_ei = H_type·e_yi ∈ R^{l_p}，

h_i = h_si + h_ei + m_a ∈ R^{l_p}，

t_si = H_text·s_yi ∈ R^{n}，  t_ei = H_text·e_yi ∈ R^{n}，

t_i = flatten(unfold(t_ei, m) + (t_si + m_a′)) ∈ R^{nm}（t_si与m_a′沿滑动窗口维广播），

p(y_{i+1}^π | y_{≤i}^π, x) = softmax(h_i ⊕ t_i)，
其中，h_i是H的类型段中各可能跨度的得分向量，而t_i是H的文本段中各可能跨度的得分向量。由于类型跨度的长度始终为1，因此本申请只需在起始位置得分h_si和结束位置得分h_ei之间做一次逐元素加法即可计算h_i。t_i的条目包含文本跨度的得分t_si,j + t_ei,k（k−j<m），它们借助展开（unfold）函数计算，该函数将向量t_ei ∈ R^n转换为n个大小为m（最大跨度长度）、步幅为1的滑动窗口的堆栈。交替掩码m_a ∈ R^{l_p}、m_a′ ∈ R^n定义为：

m_a[j] = 0，若（当前步解码边且j<l_e）或（当前步解码节点且j≥l_e）；否则为 −∞，

m_a′[j] = 0，若当前步解码节点；否则为 −∞，

其中，l_e=|R|+|U|是（实际与虚拟）边类型的总数。这样，虽然本申请对节点和边类型进行联合建模，但输出分布由交替掩码强制约束，从而产生节点和边类型的交替解码，这也是本申请将此解码器称为混合跨度解码器的主要原因。
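下面用一段示意代码说明交替掩码的作用方式：在解码边的步骤仅允许产生边类型，在解码节点的步骤屏蔽边类型而允许节点类型与文本跨度。此处假设偶数步生成节点、奇数步生成边，掩码的具体取值方式为依据上文描述的假设。

```python
import torch

def alternating_masks(step: int, l_e: int, l_p: int, n: int):
    """返回类型段掩码 m_a (l_p,) 与文本段掩码 m_a' (n,)（示意实现）。

    step 为当前解码步：此处假设偶数步生成节点、奇数步生成（实际或虚拟）边类型，
    类型段中前 l_e 个索引为边类型，其余为节点类型。
    """
    neg_inf = float("-inf")
    m_a = torch.zeros(l_p)
    m_a_prime = torch.zeros(n)
    if step % 2 == 1:            # 解码边：只允许（实际/虚拟）边类型
        m_a[l_e:] = neg_inf      # 屏蔽节点类型
        m_a_prime[:] = neg_inf   # 屏蔽全部文本跨度
    else:                        # 解码节点：屏蔽边类型，允许节点类型与文本跨度
        m_a[:l_e] = neg_inf
    return m_a, m_a_prime

# 用法示意：类型段得分加 m_a、文本段起始位置得分加 m_a'，
# 再与 unfold 得到的结束位置得分组合、拼接后做 softmax。
```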
4、实验
4.1、实验设置
本申请在LDC分发的ACE 2005数据集上测试本申请的模型。该数据集来自一般新闻领域，包括14,500个句子、38,000个实体（共7种类型）和7,100个关系（共6种类型），详情参见附录C。
遵循之前的工作，本申请使用F1作为NER和RE的评估指标。对于NER任务，当类型和边界跨度都与标注（gold）实体匹配时，预测被标记为正确。对于RE任务，当两个实体的关系类型和边界都正确时，预测才是正确的。
4.2、实现细节
在训练本申请的模型时，本申请使用标签平滑因子为0.1的交叉熵损失。模型以每批次2048个标记（大约相当于28个样本的批大小）进行训练，使用AdamW优化器（Loshchilov和Hutter,2018）训练25000步，学习率为2e-4，权重衰减为0.01，并使用带2000步预热的反平方根调度器。遵循TabSeq模型（Wang和Lu,2020），本申请在训练期间使用RoBERTa-large（Liu等人）或ALBERT-xxlarge-v1（Lan等人,2020）作为预训练语言模型，并将其学习率缩小为0.1倍。RoBERTa-large隐藏状态的dropout率为0.2，ALBERT-xxlarge-v1的dropout率为0.1；本申请的混合跨度解码器在训练期间的dropout率也为0.1。本申请设置最大跨度长度m=16、模型隐藏层大小d_m=256以及解码器块数量N=12。尽管理论上波束搜索应有助于减少曝光偏差，但在对波束大小和长度惩罚进行网格搜索期间，本申请没有在验证集上观察到任何性能增益（详细的网格搜索设置见附录A）。因此，本申请将波束大小设置为1、长度惩罚设置为1，并将这一理论与实验的矛盾留待未来研究。本申请的模型使用FAIRSEQ工具包（Ott et al.,2019）构建，以进行高效的分布式训练，所有实验均在两个NVIDIA TITAN X GPU上进行。
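下面给出按上述超参数组织损失函数、优化器与学习率调度的示意代码（标签平滑0.1、AdamW、学习率2e-4、权重衰减0.01、带2000步预热的反平方根调度）。其中模型对象与函数封装均为假设，仅示意这些数值如何对应到常见的PyTorch接口；预训练语言模型学习率缩小0.1倍可通过参数分组实现，此处从略。

```python
import torch
import torch.nn as nn

def build_optimization(model, lr=2e-4, weight_decay=0.01, warmup=2000):
    """按文中设置构造损失函数、优化器与学习率调度（示意实现）。"""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)      # 标签平滑 0.1
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=lr, weight_decay=weight_decay)

    def inverse_sqrt(step):
        # 前 warmup 步线性升温，其后按 1/sqrt(step) 衰减
        if step < warmup:
            return step / max(1, warmup)
        return (warmup / step) ** 0.5

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inverse_sqrt)
    return criterion, optimizer, scheduler
```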
表1：各IE模型在ACE05测试集上的联合NER和RE的F1分数。复杂度针对各模型的实体和关系解码部分计算（n是输入文本的长度）。此处报告的TabSeq模型的性能基于与本申请相同的ALBERT-xxlarge（Lan等人,2020）预训练语言模型。
表2：在ACE05测试集上的消融研究。“−Traversal-embedding”：本申请去掉遍历嵌入，改用正弦位置嵌入，后续消融实验均基于此模型。“−Masking”：本申请从混合跨度解码器中移除交替掩码。“−BFS”：本申请使用DFS代替BFS作为遍历方式。“−Mixed-attention”：本申请移除混合注意力层，改用标准的Transformer编码器-解码器结构。“−Span-attention”：本申请移除跨度编码模块中的跨度注意力，改为对跨度中的单词取平均。
4.3、结果
表1将本申请的模型与之前在ACE05测试集上的最新结果进行了比较。与之前使用ALBERT预训练语言模型的SOTA、TabSeq(Wang and Lu,2020)相比,本申请使用ALBERT的模型在NER分数和RE分数上都有明显更好的性能,同时保持了比TabSeq小一个数量级的线性空间复杂度。与之前所有联合IE模型相比,本申请的模型是第一个同时具有线性空间和时间复杂性的联合模型,因此对于大规模现实世界应用程序具有最佳的可扩展性。
4.4、消融研究
为了证明本申请方法的有效性，本申请在ACE05数据集上进行了消融实验。如表2所示，在去除遍历嵌入后，RE的F1分数显著下降，这表明遍历嵌入有助于编码图结构并改进关系预测。此外，如果去掉交替掩码，NER和RE的F1分数都会显著下降，这证明了强制执行交替模式的重要性。本申请还可以观察到混合注意力层对关系提取有显著贡献，这是因为逐层协调可以帮助解码器解耦源特征，并在实体预测与关系预测之间利用不同层的特征。此外，DFS遍历的性能比BFS差，本申请推测这是因为受知识图性质的影响，DFS得到的交替序列通常比BFS得到的更长，从而增加了学习难度。
4.5、误差分析
在分析了80个剩余错误后,本申请对以下常见情况进行了分类和讨论(图8为在ACE05测试集上剩余错误的分布示意图)。这些可能需要额外的功能和策略来解决。
上下文不足：在许多示例中，答案实体是一个代词，鉴于上下文有限，无法准确判断其类型：在例句“我们（We）注意到他们说他们不想使用‘销毁’一词，事实上，他们说让别人这样做”中，很难将代词“我们”正确地归类为一个组织。这可以通过使用整个文档作为输入、利用跨句子上下文来缓解。
生僻词：生僻词问题指测试集中的词很少出现在训练集中，且通常也不会出现在词典中。在句子“基地还有海军FA-18和海军Heriers”中，术语“Heriers”（一种被模型错误地归类为人的车辆）既没有出现在训练集中，也没有被预训练的语言模型很好地理解；在这种情况下，模型只能依靠子词级表示。
需要背景知识：通常句子中提到的实体很难从上下文中推断出来，但通过查阅知识库很容易识别：在“空客应该发出更强烈的警报”中，本申请的模型错误地将空客预测为一种车辆，而这里的空客指的是欧洲航空航天公司。本申请的系统也将“联合国安理会”拆分为“联合国”和“安理会”两个实体，产生了一个不存在的关系三元组（安理会是联合国的一部分）。通过查阅知识库（例如DBpedia（Bizer等,2009））或执行实体链接可以避免此类错误。
固有的歧义:许多例子都有固有的歧义,例如欧盟可以被归类为组织或政治实体,而一些实体(例如,军事基地)可以既是地点又是组织或设施。
5、相关工作
NER通常与RE联合完成，以减少错误传播并学习任务之间的相互关系。一种方法是将联合任务视为填充方形表格的问题（Miwa和Sasaki,2014；Gupta等人,2016；Wang和Lu,2020），其中第i列或第i行代表第i个标记；表的对角线为指示实体的序列标注，其余条目表示标记对之间的关系。另一类工作是在NER之后执行RE：在Miwa和Bansal（2016）的工作中，作者使用BiLSTM（Graves等人,2013）进行NER，并使用基于依赖图的Tree-LSTM进行RE。还有一些工作通过构建动态文本跨度图来检测实体和关系，并进一步结合基于跨子任务和跨实例约束的全局特征，旨在将IE结果抽取为图。请注意，本申请的模型与ONEIE（Lin et al.,2020）的不同之处在于：本申请的模型通过自回归生成自动捕获全局关系，而ONEIE使用特征工程模板；此外，ONEIE需要对关系提取进行成对分类，而本申请的方法直接高效地生成已存在的关系和实体。
虽然已经提出了若干基于Seq2Seq的模型（Zhang et al.,2020；Zeng et al.,2018,2020；Wei et al.,2019；Zhang et al.,2019）来生成三元组（即节点-边-节点），但本申请的模型与它们有根本不同：（1）本申请生成目标图的BFS/DFS遍历，能够捕获节点和边之间的依赖关系，并且目标序列更短；（2）本申请将节点建模为文本中的跨度，与词汇表无关，因此即使节点对应的标记是生僻词或未见过的词，本申请仍然可以根据上下文信息生成其跨度。
6、结论
在这项工作中，本申请提出了混合跨度生成（HySPA）模型，这是第一个在图解码阶段具有线性空间和时间复杂度的端到端文本到图提取模型。除了可扩展性之外，该模型还在ACE05联合实体和关系提取任务上实现了当前最先进的性能。鉴于本申请的混合跨度生成器在结构上的灵活性，未来仍有丰富的研究方向，例如结合外部知识进行混合跨度生成、应用更高效的稀疏自注意力，以及开发更好的搜索方法，以找到由交替序列表示的、全局上更合理的图。
在一些实施例中,还提供另一种方法,该方法中移除了混合注意力层而使用了标准的Transformer编码器解码器结构。这种版本的结构更简单但性能要劣于使用了混合注意力层的版本。
在一些实施例中,还提供另一种方法,该方法中使用了DFS遍历而不是BFS遍历来构建图的交替序列表示,同时这种版本还使用了DFS遍历嵌入(详情参见附录D)而不是BFS遍历嵌入。这种版本的图抽取准确度要劣于BFS遍历。
在一些实施例中，还提供另一种方法，该方法中通过对跨度中的单词取平均来编码跨度，而不是进行基于注意力的跨度编码。这种版本的模型结构更为简单且模型参数更少，但图抽取准确度劣于基于注意力的跨度编码。
附录A:超参数
本申请使用在6B标记上训练的100维GloVe词嵌入作为初始化，并在训练期间冻结其更新。字符嵌入为30维，由LSTM编码得到；词汇表外标记的GloVe嵌入被替换为随机初始化的向量，遵循Wang和Lu（2020）。本申请在训练期间使用0.25的梯度裁剪。本申请混合注意力的头数设置为8。波束大小和长度惩罚由在ACE05数据集验证集上的网格搜索决定：波束大小的范围为1到7，步长为1；长度惩罚范围为0.7到1.2，步长为0.1。本申请根据关系提取F1分数选择最佳的波束大小和长度惩罚。
附录B:训练细节
本申请使用ALBERT-xxlarge预训练语言模型的模型共有2.36亿个参数。平均而言，使用ALBERT-xxlarge的最佳模型在两个NVIDIA TITAN X GPU上需要约20小时的分布式训练。
附录C:数据
自动内容提取（ACE）2005数据集包含用于2005年自动内容提取（ACE）技术评估的英语、阿拉伯语和中文训练数据，并提供实体、关系和事件标注。本申请遵循Wadden等人（2019）的预处理和数据拆分方式。预处理后的数据包含7,100个关系、38,000个实体和14,500个句子；拆分后包含10,051个训练样本、2,424个开发样本和2,050个测试样本。
附录D:DFS遍历嵌入
由于父子信息已经包含在DFS遍历的层级内连接中，本申请的DFS遍历嵌入仅为层嵌入与连接嵌入之和。与BFS嵌入类似，DFS层嵌入为DFS遍历层级i中的每个位置分配相同的嵌入向量L_i，但嵌入向量的值是随机初始化的，而不是用非参数正弦位置嵌入填充，因为DFS的遍历层级之间不存在邻近度信息。但是，对于DFS层级中的元素，本申请确实有明确的距离信息：对于DFS层级D=[A,B,C,…,[SEP]]，从A到各元素[A,B,C,…,[SEP]]的距离为[0,1,2,3,…,|D|−1]。本申请用正弦位置嵌入对这一距离信息进行编码，即得到捕获层级内连接信息的连接嵌入。
附录E:具有混合注意力层的转换器
本申请首先从混合表示H中切出输入文本的隐藏表示，并将其表示为H_text，然后将输入文本表示H_text和混合跨度编码的输出H_y输入到具有图9所示结构的、由N个混合注意力/前馈块组成的堆栈中。

由于生成节点和边类型可能需要来自不同层的特征，本申请使用混合注意力（He et al.,2018），这允许本申请的模型在编码文本段H_text和目标表示H_y时利用来自不同注意力层的特征：
Q = H^{l−1} W_q + b_q，K = H^{l−1} W_k + b_k，V = H^{l−1} W_v + b_v，

MixAttn(H^{l−1}) = softmax(QK^T/√d_m + M) V ∈ R^{l_m×d_m}，
其中，n=|x|是输入文本的长度，l_m=|x|+|y^π|是源特征和目标特征的总长度，M是仅对目标特征部分施加因果掩蔽的注意力掩码。将源特征H_text和目标特征H_y的串联表示为H^0；在混合注意力的第一层之前，还向H^0添加了源/目标嵌入（He et al.,2018），以允许模型区分来自源序列和目标序列的特征。混合注意力层与前馈层结合形成解码器块：
H_a^l = LayerNorm(MixAttn(H^{l−1}) + H^{l−1})，

H^l = LayerNorm(max(0, H_a^l W_3 + b_3) W_4 + b_4 + H_a^l)，
其中，W_{q,k,v}、b_{q,k,v}、W_3 ∈ R^{d_m×4d_m}、W_4 ∈ R^{4d_m×d_m}、b_3、b_4是可学习参数，LayerNorm是层归一化层（Ba et al.,2016）。解码器块堆叠N次以获得最终的隐藏表示H^N，并输出目标序列的最终表示H_y^N。混合注意力在编码源特征时的时间复杂度为O(n^2)，但由于对目标特征进行了因果掩蔽，本申请可以在生成目标标记时缓存源特征部分的隐藏表示，从而使每个解码步骤的时间复杂度保持为O(n)。
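附录E所述的混合注意力块可用如下示意代码概括：查询、键、值均来自源与目标特征的串联H^0（源位置仅关注源位置以便缓存，目标位置施加因果掩码），随后接前馈与层归一化。这里采用标准的缩放点积注意力作为近似写法，并非He et al.（2018）逐层协调结构的确切实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttentionBlock(nn.Module):
    """混合注意力 + 前馈 的单个解码器块（示意实现，单头、无dropout）。"""

    def __init__(self, d_m: int):
        super().__init__()
        self.qkv = nn.Linear(d_m, 3 * d_m)                    # W_q, W_k, W_v
        self.ffn = nn.Sequential(nn.Linear(d_m, 4 * d_m), nn.ReLU(),
                                 nn.Linear(4 * d_m, d_m))     # W_3, W_4
        self.norm1 = nn.LayerNorm(d_m)
        self.norm2 = nn.LayerNorm(d_m)

    def forward(self, H_prev, n_src: int):
        # H_prev: (l_m, d_m)，前 n_src 行为源特征 H_text，其余为目标特征 H_y
        l_m, d_m = H_prev.shape
        q, k, v = self.qkv(H_prev).chunk(3, dim=-1)
        scores = q @ k.t() / d_m ** 0.5                        # (l_m, l_m)
        tgt = torch.arange(l_m) >= n_src                       # 目标位置指示
        future = torch.arange(l_m)[None, :] > torch.arange(l_m)[:, None]
        mask = torch.zeros(l_m, l_m)
        mask[(~tgt)[:, None] & tgt[None, :]] = float("-inf")   # 源位置只看源（可缓存）
        mask[tgt[:, None] & tgt[None, :] & future] = float("-inf")  # 目标部分因果掩码
        attn_out = F.softmax(scores + mask, dim=-1) @ v
        H_attn = self.norm1(attn_out + H_prev)                 # 残差 + LayerNorm
        return self.norm2(self.ffn(H_attn) + H_attn)
```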
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作合并,但是本领域技术人员应该知悉,本发明并不受所描述的动作顺序的限制,因为依据本发明,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本发明所必须 的。在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在一些实施例中,本发明实施例提供一种非易失性计算机可读存储介质,所述存储介质中存储有一个或多个包括执行指令的程序,所述执行指令能够被电子设备(包括但不限于计算机,服务器,或者网络设备等)读取并执行,以用于执行本发明上述任一项从文本中抽取图的方法。
在一些实施例中,本发明实施例还提供一种计算机程序产品,所述计算机程序产品包括存储在非易失性计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,使所述计算机执行上述任一项从文本中抽取图的方法。
在一些实施例中,本发明实施例还提供一种电子设备,其包括:至少一个处理器,以及与所述至少一个处理器通信连接的存储器,其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行从文本中抽取图的方法。
在一些实施例中,本发明实施例还提供一种存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现从文本中抽取图的方法。
图10是本申请另一实施例提供的执行从文本中抽取图的方法的电子设备的硬件结构示意图,如图10所示,该设备包括:
一个或多个处理器1010以及存储器1020,图10中以一个处理器1010为例。
执行从文本中抽取图的方法的设备还可以包括:输入装置1030和输出装置1040。
处理器1010、存储器1020、输入装置1030和输出装置1040可以通过总线或者其他方式连接,图10中以通过总线连接为例。
存储器1020作为一种非易失性计算机可读存储介质,可用于存储非易失性软件程序、非易失性计算机可执行程序以及模块,如本申请实施例中的从文本中抽取图的方法对应的程序指令/模块。处理器1010通过运行存储在存储器1020中的非易失性软件程序、指令以及模块,从而执行服 务器的各种功能应用以及数据处理,即实现上述方法实施例从文本中抽取图的方法。
存储器1020可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储根据从文本中抽取图的装置的使用所创建的数据等。此外,存储器1020可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中,存储器1020可选包括相对于处理器1010远程设置的存储器,这些远程存储器可以通过网络连接至从文本中抽取图的装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
输入装置1030可接收输入的数字或字符信息,以及产生与从文本中抽取图的装置的用户设置以及功能控制有关的信号。输出装置1040可包括显示屏等显示设备。
所述一个或者多个模块存储在所述存储器1020中,当被所述一个或者多个处理器1010执行时,执行上述任意方法实施例中的从文本中抽取图的方法。
上述产品可执行本申请实施例所提供的方法,具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节,可参见本申请实施例所提供的方法。
本申请实施例的电子设备以多种形式存在,包括但不限于:
(1)移动通信设备:这类设备的特点是具备移动通信功能,并且以提供话音、数据通信为主要目标。这类终端包括:智能手机(例如iPhone)、多媒体手机、功能性手机,以及低端手机等。
(2)超移动个人计算机设备:这类设备属于个人计算机的范畴,有计算和处理功能,一般也具备移动上网特性。这类终端包括:PDA、MID和UMPC设备等,例如iPad。
(3)便携式娱乐设备:这类设备可以显示和播放多媒体内容。该类设备包括:音频、视频播放器(例如iPod),掌上游戏机,电子书,以及智能玩具和便携式车载导航设备。
(4)其他具有数据交互功能的电子装置。
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。
最后应说明的是:以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (10)

  1. 一种交替序列生成模型训练方法,包括:
    从样本库中获取训练样本对,所述训练样本对包括成对的训练文本和训练信息图,所述训练信息图中包括多个节点和至少一条连接所述多个节点中的两个节点的边;
    根据所述训练信息图生成包含节点信息和边信息的训练交替序列;
    根据所述训练文本和所述训练交替序列训练交替序列生成模型。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述训练信息图生成包含节点信息和边信息的训练交替序列,包括:
    采用预设遍历算法对所述训练信息图进行遍历生成包含节点信息和边信息的训练交替序列。
  3. 根据权利要求1或2所述的方法,其特征在于,所述训练交替序列包括相互间隔的节点信息和边信息。
  4. 根据权利要求3所述的方法,其特征在于,所述节点信息包括节点类型信息,所述边信息包括实际边类型信息和虚拟边类型信息。
  5. 根据权利要求4所述的方法,其特征在于,所述训练信息图中包括作为输入文本片段的地址的跨度和作为抽象概念的表示的类型,其中,所述类型包括节点类型信息、所述实际边类型信息和所述虚拟边类型信息的词汇表的长度为1的跨度。
  6. 根据权利要求3所述的方法,其特征在于,根据所述训练文本和所述训练交替序列训练交替序列生成模型,包括:
    对所述交替序列生成模型的输出分布采用交替掩码进行处理,以得到相互间隔的节点信息和边信息构成的交替序列。
  7. 一种从文本中抽取图的方法,包括:
    将待抽取文本输入采用权利要求1-6所述的方法训练得到的交替序列生成模型得到目标交替序列;
    根据所述目标交替序列生成目标信息图。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述目标交替序列生成目标信息图包括:
    根据训练所述交替序列生成模型所采用的预设遍历算法对所述目标交替序列进行处理,生成所述目标信息图。
  9. 一种电子设备,其包括:至少一个处理器,以及与所述至少一个处理器通信连接的存储器,其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行权利要求7-8中任意一项所述方法的步骤。
  10. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现权利要求7-8中任意一项所述方法的步骤。
PCT/CN2022/101089 2021-06-29 2022-06-24 交替序列生成模型训练方法、从文本中抽取图的方法 WO2023274059A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110725279.XA CN113487024A (zh) 2021-06-29 2021-06-29 交替序列生成模型训练方法、从文本中抽取图的方法
CN202110725279.X 2021-06-29

Publications (1)

Publication Number Publication Date
WO2023274059A1 true WO2023274059A1 (zh) 2023-01-05

Family

ID=77936505

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/101089 WO2023274059A1 (zh) 2021-06-29 2022-06-24 交替序列生成模型训练方法、从文本中抽取图的方法

Country Status (2)

Country Link
CN (1) CN113487024A (zh)
WO (1) WO2023274059A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116860999A (zh) * 2023-07-07 2023-10-10 清华大学 超大语言模型分布式预训练方法、装置、设备及介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487024A (zh) * 2021-06-29 2021-10-08 任立椋 交替序列生成模型训练方法、从文本中抽取图的方法
CN115759098B (zh) * 2022-11-14 2023-07-18 中国科学院空间应用工程与技术中心 一种航天文本数据的中文实体和关系联合抽取方法、系统
CN117332180B (zh) * 2023-12-01 2024-03-12 浙商期货有限公司 基于大语言模型的研报智能写作方法、设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639189A (zh) * 2020-04-29 2020-09-08 西北工业大学 一种基于文本内容特征的文本图构建方法
US20200334416A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented natural language understanding of medical reports
CN112289239A (zh) * 2020-12-28 2021-01-29 之江实验室 一种可动态调整的讲解方法、装置及电子设备
CN112597774A (zh) * 2020-12-14 2021-04-02 山东师范大学 中文医疗命名实体识别方法、系统、存储介质和设备
CN113487024A (zh) * 2021-06-29 2021-10-08 任立椋 交替序列生成模型训练方法、从文本中抽取图的方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101140263B1 (ko) * 2010-07-07 2012-06-13 엔에이치엔(주) 텍스트 패턴 추출을 이용하여 웹문서를 정제하기 위한 방법, 시스템 및 컴퓨터 판독 가능한 기록 매체
CN108415898B (zh) * 2018-01-19 2021-09-24 思必驰科技股份有限公司 深度学习语言模型的词图重打分方法和系统
CN111008266B (zh) * 2019-12-06 2023-09-26 北京金山数字娱乐科技有限公司 文本分析模型的训练方法及装置、文本分析方法及装置
CN111221984B (zh) * 2020-01-15 2024-03-01 北京百度网讯科技有限公司 多模态内容处理方法、装置、设备及存储介质
CN112149400B (zh) * 2020-09-23 2021-07-27 腾讯科技(深圳)有限公司 一种数据处理方法、装置、设备及存储介质
CN112270181A (zh) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 序列标注方法、系统、计算机可读存储介质及计算机设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200334416A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented natural language understanding of medical reports
CN111639189A (zh) * 2020-04-29 2020-09-08 西北工业大学 一种基于文本内容特征的文本图构建方法
CN112597774A (zh) * 2020-12-14 2021-04-02 山东师范大学 中文医疗命名实体识别方法、系统、存储介质和设备
CN112289239A (zh) * 2020-12-28 2021-01-29 之江实验室 一种可动态调整的讲解方法、装置及电子设备
CN113487024A (zh) * 2021-06-29 2021-10-08 任立椋 交替序列生成模型训练方法、从文本中抽取图的方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116860999A (zh) * 2023-07-07 2023-10-10 清华大学 超大语言模型分布式预训练方法、装置、设备及介质
CN116860999B (zh) * 2023-07-07 2024-04-19 清华大学 超大语言模型分布式预训练方法、装置、设备及介质

Also Published As

Publication number Publication date
CN113487024A (zh) 2021-10-08

Similar Documents

Publication Publication Date Title
WO2023274059A1 (zh) 交替序列生成模型训练方法、从文本中抽取图的方法
Bang et al. Explaining a black-box by using a deep variational information bottleneck approach
US20110087668A1 (en) Clustering of near-duplicate documents
Hoai et al. Representation and structural difficulty in genetic programming
CN114565104A (zh) 语言模型的预训练方法、结果推荐方法及相关装置
CN112149400B (zh) 一种数据处理方法、装置、设备及存储介质
US11113470B2 (en) Preserving and processing ambiguity in natural language
Wang et al. The APVA-TURBO approach to question answering in knowledge base
CN115099219A (zh) 一种基于增强图卷积神经网络的方面级情感分析方法
Lee et al. Ensembles of Lasso screening rules
Ren et al. HySPA: Hybrid span generation for scalable text-to-graph extraction
Thapa et al. Hdxplore: Automated blackbox testing of brain-inspired hyperdimensional computing
CN113553411B (zh) 查询语句的生成方法、装置、电子设备和存储介质
CN112925914B (zh) 数据安全分级方法、系统、设备及存储介质
JP2023517518A (ja) ヌル値又は同等の値を有するリレーショナル・テーブルのためのベクトル埋込モデル
CN116432125B (zh) 基于哈希算法的代码分类方法
Frazier et al. Learning from a consistently ignorant teacher
CN111562943B (zh) 一种基于事件嵌入树及gat网络的代码克隆检测方法和装置
CN110147393B (zh) 面向电影信息数据集中数据空间的实体解析方法
CN114579605A (zh) 表格问答数据处理方法、电子设备及计算机存储介质
CN112463161A (zh) 基于联邦学习的代码注释生成方法、系统及装置
Jeon et al. Random forest algorithm for linked data using a parallel processing environment
Morena Mutual stabilization of chaotic systems through entangled cupolets
CN113742455B (zh) 基于人工智能的简历搜索方法、装置、设备及存储介质
Kabra et al. Student’s Performance Prediction Using Genetic Algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831873

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE