CN110889276A - Method, system and computer medium for pointer-based extraction of triple information using complex-number fused features - Google Patents

Method, system and computer medium for pointer-based extraction of triple information using complex-number fused features

Info

Publication number
CN110889276A
Authority
CN
China
Prior art keywords
vector
extracting
extraction
pointer
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911083955.7A
Other languages
Chinese (zh)
Other versions
CN110889276B (en)
Inventor
杨家兵
高怀恩
张学习
龙土志
董海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201911083955.7A priority Critical patent/CN110889276B/en
Publication of CN110889276A publication Critical patent/CN110889276A/en
Application granted granted Critical
Publication of CN110889276B publication Critical patent/CN110889276B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, a device and computer equipment for pointer-based extraction of triple information using complex-number fused features, comprising the following steps: S1: acquiring texts and the corresponding triple SPO labels; S2: training to obtain a character vector for each character; S3: inputting each character of the text, as its character vector, into a network for training and completing feature extraction; S4: inputting the extracted features into a pointer model for training; S5: extracting the SPO triples with the trained model. The invention provides a brand-new model for extracting triples from text: it adopts complex-number fused feature vectors, trains a pointer network model guided in sequence by the subject S, relation P and object O pointers, and then uses the trained model to extract all triples in the target text.

Description

Method, system and computer medium for pointer-based extraction of triple information using complex-number fused features
Technical Field
The present invention relates to the field of text feature extraction and information extraction, and more particularly to a method, a system and a computer medium for pointer-based extraction of triple information using complex-number fused features.
Background
To meet the challenge of information explosion, automated tools are urgently needed to help people quickly find the information they really need in massive information sources. All of this massive information consists of sentences, and each sentence contains a number of "subject-predicate-object" triples (a subject S, an object O, and the relation P between S and O). Take a randomly chosen Baidu Baike (Baidu encyclopedia) entry: "XX Technology Co., Ltd. is a privately owned communication technology company that produces and sells communication equipment; it was formally registered in 1987, and its headquarters is located in the Longgang District of Shenzhen, Guangdong Province, China." In this sentence, the triples are { S: "XX Technology Co., Ltd.", O: "1987", P: "founding time" } and { S: "XX Technology Co., Ltd.", O: "Longgang District, Shenzhen, Guangdong", P: "headquarters location" }. How to extract such key information from online text efficiently and accurately is a major challenge in this field. Existing deep learning methods mostly fall into two types. One is joint extraction: a sentence is input into a joint entity-recognition and relation-extraction model, which turns relation extraction, originally a combination of a sequence labeling task and a classification task, entirely into a sequence labeling problem, and the triples are then obtained directly from an end-to-end neural network model. The other is a two-step method: a sentence is input, named entity recognition is performed first, the recognized subjects S and objects O are then paired, relation extraction is performed to obtain the relation P classification for each (S, O) pair, the triples with their entity relations are taken as input, and finally all triples are stored.
Disclosure of Invention
To address the problems in the prior art that existing deep learning methods cannot extract all triples, that sequence-labeling strategies do not support overlapping entity relations, and that two-step extraction methods cannot effectively handle cases where one subject S corresponds to multiple (P, O) pairs or one (S, O) pair corresponds to multiple relations P, the invention provides a method for pointer-based extraction of triple information using complex-number fused features.
A method for pointer-based extraction of triple information using complex-number fused features comprises the following steps:
S1: obtaining sentences and the corresponding triple labels from various texts, the triple labels being the subject S, the object O and the relation P;
S2: encoding each sentence into vector form, and obtaining a character vector for each character by training a word-position Embedding layer;
S3: inputting each character of the sentence, as its character vector, into a feature extraction network; after training, feature extraction is completed and the feature vector of each sentence is obtained;
S4: inputting the feature vector of each sentence into a pointer model for training;
S5: extracting all subjects S in the target text with the trained model; extracting all corresponding relations P guided by each subject S; and extracting all objects O guided by every (S, P) combination, the extracted objects corresponding one-to-one to the labels.
In a preferred embodiment, in step S1, sentences and corresponding triples are obtained through web crawlers and manual annotations, respectively.
In a preferred embodiment, the specific steps of S2 are as follows:
S21: encoding each character in all sentences, with a different number corresponding to each different character;
S22: determining a fixed sequence length X (here X = 100); when a sentence is longer than 100 characters it is truncated to 100, and when it is shorter than 100 it is padded with 0 at the end until its length is 100, forming a sentence vector the computer can process;
S23: feeding the sentence vectors obtained in step S22 into the word-position encoding layer (Embedding layer) for encoding, obtaining word-position-encoded character vectors.
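As an illustration of steps S21-S23, the following is a minimal sketch in Python, assuming a character-to-index vocabulary built from the training sentences, an embedding dimension of 128, and PyTorch's nn.Embedding as the encoding layer; none of these concrete choices are fixed by the patent, and the positional component of the word-position encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

MAX_LEN = 100  # the fixed sequence length X of step S22

def build_vocab(sentences):
    """S21: assign a distinct non-zero number to every distinct character; 0 is reserved for padding."""
    vocab = {}
    for sent in sentences:
        for ch in sent:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 1
    return vocab

def encode_sentence(sent, vocab):
    """S22: truncate to MAX_LEN, or pad with 0 at the end until the length is MAX_LEN."""
    ids = [vocab.get(ch, 0) for ch in sent[:MAX_LEN]]
    ids += [0] * (MAX_LEN - len(ids))
    return torch.tensor(ids)

sentences = ["XX科技有限公司正式注册成立于1987年。", "总部位于中国广东省深圳市龙岗区。"]
vocab = build_vocab(sentences)
batch = torch.stack([encode_sentence(s, vocab) for s in sentences])  # shape (2, 100)

# S23: the Embedding layer maps every character id to a trainable vector.
embedding = nn.Embedding(num_embeddings=len(vocab) + 1, embedding_dim=128, padding_idx=0)
char_vectors = embedding(batch)  # shape (2, 100, 128)
```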
In a preferred embodiment, the specific steps of S3 are as follows:
S31: feeding the obtained word-position-encoded character vectors into a feature extraction network for training, the feature extraction network comprising a convolutional network and a recurrent network;
S32: the convolutional network extracts a sentence feature vector A = [a_1, a_2, …, a_i]; the recurrent network extracts a sentence feature vector B = [b_1, b_2, …, b_i], where each a and b is a single-character vector;
S33: rewriting A and B into complex form Â and B̂ (the rewriting formula is given only as an embedded image in the original publication), where n is the number of elements in each of the vectors a_1 and b_1;
S34: performing complex addition of Â and B̂ and comparing moduli; if the modulus of the sum is larger than the moduli of both Â and B̂, the features are fused, i.e. h_i = a_i + b_i is selected; otherwise, the one of Â and B̂ with the larger modulus is taken as the final extracted feature, i.e. h_i is chosen from a_i and b_i;
S35: obtaining the final fused feature vector H = [h_1, h_2, h_3, …, h_i].
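A minimal sketch of the fusion rule in S33-S35 follows. The exact complex-number rewriting of S33 is given only as a formula image in the original publication, so this sketch assumes the simplest reading, in which the modulus of the complex form of a word vector reduces to the Euclidean norm of the underlying real vector; under that assumption the decision in S34 becomes a comparison of vector norms. The intuition is that the two branches are summed only when they reinforce each other; when they would partially cancel, the stronger branch is kept.

```python
import torch

def fuse_features(A, B):
    """A, B: (seq_len, d) feature vectors from the convolutional and recurrent networks."""
    fused = []
    for a_i, b_i in zip(A, B):
        norm_sum = torch.linalg.norm(a_i + b_i)      # |â_i + b̂_i| under the assumed reading
        norm_a = torch.linalg.norm(a_i)              # |â_i|
        norm_b = torch.linalg.norm(b_i)              # |b̂_i|
        if norm_sum > norm_a and norm_sum > norm_b:  # the two branches reinforce each other
            h_i = a_i + b_i                          # fuse: h_i = a_i + b_i
        else:                                        # they partially cancel
            h_i = a_i if norm_a >= norm_b else b_i   # keep the stronger of the two
        fused.append(h_i)
    return torch.stack(fused)                        # H = [h_1, ..., h_i]

A = torch.randn(100, 128)  # convolutional-branch features
B = torch.randn(100, 128)  # recurrent-branch features
H = fuse_features(A, B)    # fused feature vector H, shape (100, 128)
```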
In a preferred embodiment, the specific steps of S4 are as follows:
S41: taking the fused feature vector H = [h_1, h_2, h_3, …, h_i], where each h is the vector of one character;
S42: using the current state as the input of the current Attention_1 unit, computing the score A = softmax(V_a^T · tanh(W_a · h_i)), where V_a and W_a are trainable parameters, e denotes the score of the current feature vector and is obtained from the similarity between V_a and W_a·h_i, V_a^T is the transpose of the training parameter vector V_a (the usual transpose for computing similarity between vectors), and the weight A is obtained by normalization;
S43: through training, W_a has dimension d×d, h_i has dimension d×1 and V_a has dimension d×1; Attention_1 yields the final vector C = A × H, i.e. the weighted sum of the original vectors H is taken as the attention value. By representing the text with this attention-value vector, the network can learn from the label information where to attend for predicting the corresponding S, P and O. When predicting S, the subject position vector is in fact a binary classification target whose values are taken from {0, 1}, so the loss function is still the binary cross entropy;
S44: a binary (2-class) classifier is applied to the attention vector H_i = [h_1, h_2, h_3, …, h_i] to extract the corresponding SPO triple information; the model is trained with the Adam optimizer, first with a small learning rate, after which the best training result is loaded and training continues with an even smaller learning rate until the optimal state is reached.
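A sketch of the Attention_1 scoring of S42-S44, following the formula A = softmax(V_a^T tanh(W_a h_i)) given in claim 6. PyTorch, d = 128, and the way the per-position classifier consumes the weighted vectors C are assumptions; the patent leaves these details open.

```python
import torch
import torch.nn as nn

class Attention1(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.Wa = nn.Linear(d, d, bias=False)   # Wa: d x d
        self.Va = nn.Linear(d, 1, bias=False)   # Va: d x 1, applied as Va^T
        self.classifier = nn.Linear(d, 1)       # binary (2-class) pointer classifier

    def forward(self, H):
        # H: (seq_len, d) fused feature vectors
        e = self.Va(torch.tanh(self.Wa(H)))     # e_i = Va^T tanh(Wa h_i), shape (seq_len, 1)
        A = torch.softmax(e, dim=0)             # attention weights, normalized over the sentence
        C = A * H                               # weighted representation C = A x H (kept per position here)
        logits = self.classifier(C).squeeze(-1) # per-character score
        return torch.sigmoid(logits)            # probability of each position being a boundary

model = Attention1()
H = torch.randn(100, 128)
probs = model(H)                                # shape (100,)

# Training uses Adam and binary cross entropy as in S43-S44 (the learning rate is an assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
```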
In a preferred embodiment, the pointer model includes an Attention1 model, an Attention2 model, and an Attention3 model.
In a preferred embodiment, the specific steps of S5 are as follows:
S51: extracting all subjects S in the target text with the trained model, sampling one subject S1 from them, and processing the fused feature vectors and the correspondingly extracted feature vectors with the Attention_1 model;
S52: combining the fused feature vectors according to the position of S1, extracting P with the Attention_2 model, and obtaining the relation P vector through a softmax layer;
S53: for each different (S, P) combination, combining the fused feature vector with the predicted S and P vectors into a new vector, and predicting the position of the corresponding O through the Attention_3 model followed by a sigmoid layer; the Attention_3 model has the same structure as the Attention_1 and Attention_2 models and differs only in its training parameters, its attention being on the weights of the object O positions (the corresponding extraction information), and the sigmoid layer works on the same principle as in the prediction of S;
S54: outputting the corresponding SPO triple information.
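The decoding order of S51-S54 can be summarized as three nested loops. In the sketch below, predict_subjects, predict_relations and predict_objects are hypothetical stand-ins for the trained Attention_1, Attention_2 and Attention_3 models, and the returned strings are illustrative only.

```python
from typing import List, Tuple

def predict_subjects(sentence: str) -> List[str]:
    # Attention_1 plus two sigmoid layers mark the start/end of each subject S (stubbed here).
    return ["XX科技有限公司"]

def predict_relations(sentence: str, subject: str) -> List[str]:
    # Attention_2 plus a softmax layer yields the relations P guided by the subject position.
    return ["成立时间", "总部地点"]

def predict_objects(sentence: str, subject: str, relation: str) -> List[str]:
    # Attention_3 plus a sigmoid layer marks the object O positions guided by the (S, P) pair.
    return ["1987年"] if relation == "成立时间" else ["广东省深圳市龙岗区"]

def extract_triples(sentence: str) -> List[Tuple[str, str, str]]:
    triples = []
    for s in predict_subjects(sentence):               # S51: all subjects S
        for p in predict_relations(sentence, s):       # S52: relations P guided by S
            for o in predict_objects(sentence, s, p):  # S53: objects O guided by (S, P)
                triples.append((s, p, o))              # S54: output SPO triples
    return triples

print(extract_triples("XX科技有限公司成立于1987年，总部位于广东省深圳市龙岗区。"))
```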
The invention further discloses a system for pointer-based extraction of triple information using complex-number fused features: a web crawler acquires the raw data used to extract all triple information of the target object from the corpus; after training, a first extraction module extracts the subject S with the extraction model, a second extraction module extracts the relation P with the extraction model, and a third extraction module extracts the object O with the extraction model.
In a third aspect of the present invention, a computer-readable storage medium stores a program for the method for pointer-based extraction of triple information using complex-number fused features; when the program is executed by a processor, the steps of the method are implemented.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a tree structure visualization method based on node introductivity change, and provides a brand-new model for extracting triples in a text, a pointer network model is trained according to a subject S and an object P pointer after a complex number is adopted to fuse feature vectors, and then all triples in a target are extracted by the trained model.
Drawings
Fig. 1 is a general flow chart of the method for pointer-based extraction of triple information using complex-number fused features provided by the present invention;
Fig. 2 is a schematic diagram of the character-vector processing in step S2 in example 2;
Fig. 3 is a schematic flowchart of step S3 in example 2;
Fig. 4 is a schematic flowchart of step S5 in example 2;
Fig. 5 is an expanded view of the Attention_1 model in step S5 in example 2;
Fig. 6 is a schematic block diagram of the system for pointer-based extraction of triple information using complex-number fused features provided in example 2.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and are used for illustration only, and should not be construed as limiting the patent. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
A method for pointer-based extraction of triple information using complex-number fused features, as shown in fig. 1, comprises the following steps:
S1: obtaining sentences and the corresponding triple labels from various texts, the triple labels being the subject S, the object O and the relation P;
S2: encoding each sentence into vector form, and obtaining a character vector for each character by training a word-position Embedding layer;
S3: inputting each character of the sentence, as its character vector, into a feature extraction network; after training, feature extraction is completed and the feature vector of each sentence is obtained;
S4: inputting the feature vector of each sentence into a pointer model for training;
S5: extracting all subjects S in the target text with the trained model; extracting all corresponding relations P guided by each subject S; and extracting all objects O guided by every (S, P) combination, the extracted objects corresponding one-to-one to the labels.
Example 2
A method for pointer-based extraction of triple information using complex-number fused features comprises the following steps:
S1: obtaining sentences and the corresponding triple labels from various texts, the triple labels being the subject S, the object O and the relation P;
S2: encoding each sentence into vector form, and obtaining a character vector for each character by training a word-position Embedding layer;
S3: inputting each character of the sentence, as its character vector, into a feature extraction network; after training, feature extraction is completed and the feature vector of each sentence is obtained;
S4: inputting the feature vector of each sentence into a pointer model for training;
S5: extracting all subjects S in the target text with the trained model; extracting all corresponding relations P guided by each subject S; and extracting all objects O guided by every (S, P) combination, the extracted objects corresponding one-to-one to the labels.
In a preferred embodiment, in step S1, sentences and corresponding triples are obtained through web crawlers and manual annotations, respectively.
In a preferred embodiment, as shown in fig. 2, the specific steps of S2 are as follows:
S21: encoding each character in all sentences, with a different number corresponding to each different character;
S22: determining a fixed sequence length X (here X = 100); when a sentence is longer than 100 characters it is truncated to 100, and when it is shorter than 100 it is padded with 0 at the end until its length is 100, forming a sentence vector the computer can process;
S23: feeding the sentence vectors obtained in step S22 into the word-position encoding layer (Embedding layer) for encoding, obtaining word-position-encoded character vectors.
In a preferred embodiment, as shown in fig. 3, the specific steps of S3 are as follows:
S31: feeding the obtained word-position-encoded character vectors into a feature extraction network for training, the feature extraction network comprising a convolutional network and a recurrent network;
S32: the convolutional network extracts a sentence feature vector A = [a_1, a_2, …, a_i]; the recurrent network extracts a sentence feature vector B = [b_1, b_2, …, b_i], where each a and b is a single-character vector (a code sketch of these two branches is given after step S35);
S33: rewriting A and B into complex form Â and B̂ (the rewriting formula is given only as an embedded image in the original publication), where n is the number of elements in each of the vectors a_1 and b_1;
S34: performing complex addition of Â and B̂ and comparing moduli; if the modulus of the sum is larger than the moduli of both Â and B̂, the features are fused, i.e. h_i = a_i + b_i is selected; otherwise, the one of Â and B̂ with the larger modulus is taken as the final extracted feature, i.e. h_i is chosen from a_i and b_i;
S35: obtaining the final fused feature vector H = [h_1, h_2, h_3, …, h_i].
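A minimal sketch of the two feature-extraction branches of S31-S32 referenced above: one convolutional network and one recurrent network over the embedded character vectors. The kernel size, hidden size and the choice of a bidirectional LSTM are assumptions; the patent only requires one convolutional branch and one recurrent branch.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)                # convolutional branch -> A
        self.rnn = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)  # recurrent branch -> B

    def forward(self, x):
        # x: (batch, seq_len, d) word-position-encoded character vectors from S2
        A = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, seq_len, d)
        B, _ = self.rnn(x)                                # (batch, seq_len, d)
        return A, B                                       # fused afterwards as in S33-S35

extractor = FeatureExtractor()
x = torch.randn(2, 100, 128)
A, B = extractor(x)
```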
In a preferred embodiment, the specific steps of S4 are as follows:
S41: taking the fused feature vector H = [h_1, h_2, h_3, …, h_i], where each h is the vector of one character;
S42: using the current state as the input of the current Attention_1 unit, computing the score A = softmax(V_a^T · tanh(W_a · h_i)), where V_a and W_a are trainable parameters, e denotes the score of the current feature vector and is obtained from the similarity between V_a and W_a·h_i, V_a^T is the transpose of the training parameter vector V_a (the usual transpose for computing similarity between vectors), and the weight A is obtained by normalization;
S43: through training, W_a has dimension d×d, h_i has dimension d×1 and V_a has dimension d×1; Attention_1 yields the final vector C = A × H, i.e. the weighted sum of the original vectors H is taken as the attention value. By representing the text with this attention-value vector, the network can learn from the label information where to attend for predicting the corresponding S, P and O. When predicting S, the subject position vector is in fact a binary classification target whose values are taken from {0, 1}, so the loss function is still the binary cross entropy;
S44: a binary (2-class) classifier is applied to the attention vector H_i = [h_1, h_2, h_3, …, h_i] to extract the corresponding SPO triple information; the model is trained with the Adam optimizer, first with a small learning rate, after which the best training result is loaded and training continues with an even smaller learning rate until the optimal state is reached.
The Attention_1 model is expanded as shown in fig. 5: an attention network processes the fused feature vectors and the correspondingly extracted feature vectors to produce a final, brand-new sentence feature vector. Two sigmoid layers then predict, respectively, the start position of the first character and the end position of the last character of the subject S, for example [1,0,0,0,0,0] and [0,0,0,0,1,0,0,0,0,0].
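A small sketch of how the two sigmoid outputs of fig. 5 can be decoded back into a subject span. The pairing rule (each predicted start matched to the nearest predicted end at or after it) is an assumption, since the patent does not spell it out.

```python
import torch

def decode_spans(start_probs, end_probs, threshold=0.5):
    """Turn per-character start/end probabilities into (first, last) character index pairs."""
    starts = (start_probs > threshold).nonzero(as_tuple=True)[0]
    ends = (end_probs > threshold).nonzero(as_tuple=True)[0]
    spans = []
    for s in starts:
        later_ends = ends[ends >= s]
        if len(later_ends) > 0:
            spans.append((int(s), int(later_ends[0])))  # (first character, last character) of a subject S
    return spans

start_probs = torch.tensor([0.9, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0])
end_probs   = torch.tensor([0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0])
print(decode_spans(start_probs, end_probs))  # [(0, 4)] -> characters 0..4 form the subject S
```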
In a preferred embodiment, as shown in fig. 4, the specific steps of S5 are as follows:
S51: extracting all subjects S in the target text with the trained model, sampling one subject S1 from them, and processing the fused feature vectors and the correspondingly extracted feature vectors with the Attention_1 model;
S52: combining the fused feature vectors according to the position of S1, extracting P with the Attention_2 model, and obtaining the relation P vector through a softmax layer;
S53: for each different (S, P) combination, combining the fused feature vector with the predicted S and P vectors into a new vector, and predicting the position of the corresponding O through the Attention_3 model followed by a sigmoid layer; the Attention_3 model has the same structure as the Attention_1 and Attention_2 models and differs only in its training parameters, its attention being on the weights of the object O positions (the corresponding extraction information), and the sigmoid layer works on the same principle as in the prediction of S;
S54: outputting the corresponding SPO triple information.
The invention further discloses a system for pointer-based extraction of triple information using complex-number fused features, as shown in fig. 6: a web crawler acquires the raw data used to extract all triple information of the target object from the corpus; after training, a first extraction module extracts the subject S with the extraction model, a second extraction module extracts the relation P with the extraction model, and a third extraction module extracts the object O with the extraction model.
In a third aspect of the present invention, a computer-readable storage medium stores a program for the method for pointer-based extraction of triple information using complex-number fused features; when the program is executed by a processor, the steps of the method are implemented.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (9)

1. A method for pointer-based extraction of triple information using complex-number fused features, characterized by comprising the following steps:
S1: obtaining sentences and the corresponding triple labels from various texts, the triple labels being the subject S, the object O and the relation P;
S2: encoding each sentence into vector form, and obtaining a character vector for each character by training a word-position Embedding layer;
S3: inputting each character of the sentence, as its character vector, into a feature extraction network; after training, feature extraction is completed and the feature vector of each sentence is obtained;
S4: inputting the feature vector of each sentence into a pointer model for training;
S5: extracting all subjects S in the target text with the trained model; extracting all corresponding relations P guided by each subject S; and extracting all objects O guided by every (S, P) combination.
2. The method for pointer-based extraction of triple information using complex-number fused features according to claim 1, wherein in step S1 the sentences and the corresponding triples are obtained through web crawlers and manual annotation, respectively.
3. The method for pointer-based extraction of triple information using complex-number fused features according to claim 2, wherein the specific steps of S2 are as follows:
S21: encoding each character in all sentences, with different numbers corresponding to different characters;
S22: determining a fixed sequence length X; when a sentence is longer than X it is truncated to X, and when it is shorter than X it is padded with 0 at the end until its length is X, forming a sentence vector the computer can process;
S23: feeding the sentence vectors obtained in step S22 into the word-position encoding layer (Embedding layer) for encoding, obtaining word-position-encoded character vectors.
4. The method for pointer-based extraction of triple information using complex-number fused features according to claim 3, wherein the specific steps of S3 are as follows:
S31: feeding the obtained word-position-encoded character vectors into a feature extraction network for training, the feature extraction network comprising a convolutional network and a recurrent network;
S32: the convolutional network extracts a sentence feature vector A = [a_1, a_2, …, a_i]; the recurrent network extracts a sentence feature vector B = [b_1, b_2, …, b_i], where each a and b is a single-character vector;
S33: rewriting A and B into complex form Â and B̂ (the rewriting formula is given only as an embedded image in the original publication), where n is the number of elements in each of the vectors a_1 and b_1;
S34: performing complex addition of Â and B̂ and comparing moduli; if the modulus of the sum is larger than the moduli of both Â and B̂, the features are fused, i.e. h_i = a_i + b_i is selected; otherwise, the one of Â and B̂ with the larger modulus is taken as the final extracted feature, i.e. h_i is chosen from a_i and b_i;
S35: obtaining the final fused feature vector H = [h_1, h_2, h_3, …, h_i].
5. The method for pointer-based extraction of triple information using complex-number fused features according to claim 1, wherein the pointer model comprises an Attention_1 model, an Attention_2 model and an Attention_3 model.
6. The method for pointer-based extraction of triple information using complex-number fused features according to claim 5, wherein the specific steps of S4 are as follows:
S41: taking the fused feature vector H_i = [h_1, h_2, h_3, …, h_i], where each h is the vector of one character;
S42: using the current state as the input of the current Attention_1 unit, computing the score A = softmax(V_a^T · tanh(W_a · H_i)), where V_a and W_a are parameters, e denotes the score of the current feature vector and is obtained from the similarity between V_a and W_a·h_i, V_a^T is the transpose of the training parameter vector V_a (the usual transpose for computing similarity between vectors), and the weight A is obtained by normalization;
S43: through training, W_a has dimension d×d, h_i has dimension d×1 and V_a has dimension d×1; Attention_1 yields the final vector C = A × H, i.e. the weighted sum of the original vectors H is taken as the attention value; by representing the text with this attention-value vector, the network can learn from the label information where to attend for predicting the corresponding S, P and O;
S44: a binary (2-class) classifier is applied to the attention vector H_i = [h_1, h_2, h_3, …, h_i] to extract the corresponding SPO triple information; the model is trained with the Adam optimizer, first with a small learning rate, after which the best training result is loaded and training continues with an even smaller learning rate until the optimal state is reached.
7. The method for pointer-based extraction of triple information using complex-number fused features according to claim 4, wherein the specific steps of S5 are as follows:
S51: extracting all subjects S in the target text with the trained model, sampling one subject S1 from them, and processing the fused feature vectors and the correspondingly extracted feature vectors with the Attention_1 model;
S52: combining the fused feature vectors according to the position of the subject S1, extracting P with the Attention_2 model and obtaining the relation P vector through a softmax layer;
S53: for each different (S, P) combination, combining the fused feature vector with the predicted S and P vectors into a new vector, and predicting the position of the corresponding O through the Attention_3 model followed by a sigmoid layer in turn;
S54: outputting the corresponding SPO triple information.
8. A system for pointer-based extraction of triple information using complex-number fused features, characterized in that raw data are acquired with a web crawler and used for extracting all the triple information of the target object from the corpus; after training, a first extraction module extracts the subject S with the extraction model, a second extraction module extracts the relation P with the extraction model, and a third extraction module extracts the object O with the extraction model.
9. A computer-readable storage medium, characterized by comprising a program for the method for pointer-based extraction of triple information using complex-number fused features, wherein when the program is executed by a processor, the steps of the method for pointer-based extraction of triple information using complex-number fused features according to any one of claims 1 to 7 are implemented.
CN201911083955.7A 2019-11-07 2019-11-07 Method, system and computer medium for pointer-based extraction of triple information using complex-number fused features Active CN110889276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911083955.7A CN110889276B (en) 2019-11-07 2019-11-07 Method, system and computer medium for pointer-based extraction of triple information using complex-number fused features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911083955.7A CN110889276B (en) 2019-11-07 2019-11-07 Method, system and computer medium for pointer-based extraction of triple information using complex-number fused features

Publications (2)

Publication Number Publication Date
CN110889276A true CN110889276A (en) 2020-03-17
CN110889276B CN110889276B (en) 2023-04-25

Family

ID=69747071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911083955.7A Active CN110889276B (en) Method, system and computer medium for pointer-based extraction of triple information using complex-number fused features

Country Status (1)

Country Link
CN (1) CN110889276B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2668306A1 (en) * 2009-06-08 2010-12-08 Stephen R. Germann Method and system for applying metadata to data sets of file objects
CN109902145A (en) * 2019-01-18 2019-06-18 中国科学院信息工程研究所 A kind of entity relationship joint abstracting method and system based on attention mechanism

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859922A (en) * 2020-07-31 2020-10-30 上海银行股份有限公司 Application method of entity relation extraction technology in bank wind control
CN111859922B (en) * 2020-07-31 2023-12-01 上海银行股份有限公司 Application method of entity relation extraction technology in bank wind control
CN113051922A (en) * 2021-04-20 2021-06-29 北京工商大学 Triple extraction method and system based on deep learning

Also Published As

Publication number Publication date
CN110889276B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN110619123B (en) Machine reading understanding method
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN111694924A (en) Event extraction method and system
CN110851596A (en) Text classification method and device and computer readable storage medium
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN110796160A (en) Text classification method, device and storage medium
CN113836866B (en) Text encoding method, text encoding device, computer readable medium and electronic equipment
CN112800225B (en) Microblog comment emotion classification method and system
CN109933792A (en) Viewpoint type problem based on multi-layer biaxially oriented LSTM and verifying model reads understanding method
CN111914553B (en) Financial information negative main body judging method based on machine learning
CN112559734A (en) Presentation generation method and device, electronic equipment and computer readable storage medium
CN114841151B (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN110889276A (en) Method, system and computer medium for extracting pointer-type extraction triple information by complex fusion features
CN114239574A (en) Miner violation knowledge extraction method based on entity and relationship joint learning
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN114065702A (en) Event detection method fusing entity relationship and event element
CN116932661A (en) Event knowledge graph construction method oriented to network security
CN116258137A (en) Text error correction method, device, equipment and storage medium
CN117271759A (en) Text abstract generation model training method, text abstract generation method and device
CN116663566A (en) Aspect-level emotion analysis method and system based on commodity evaluation
CN109858031A (en) Neural network model training, context-prediction method and device
CN113204975A (en) Sensitive character wind identification method based on remote supervision
CN112016493A (en) Image description method and device, electronic equipment and storage medium
US11869130B2 (en) Generating visual feedback
CN112905750A (en) Generation method and device of optimization model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant