CN110889276B

CN110889276B - Method, system and computer medium for pointer-type extraction of triplet information using complex fusion features

Info

Publication number: CN110889276B
Application number: CN201911083955.7A
Authority: CN
Inventors: 杨家兵; 高怀恩; 张学习; 龙土志; 董海涛
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2019-11-07
Filing date: 2019-11-07
Publication date: 2023-04-25
Anticipated expiration: 2039-11-07
Also published as: CN110889276A

Abstract

The invention provides a method, a device and computer equipment for pointer-type extraction of triplet information using complex fusion features, comprising the following steps: S1: acquiring text and the corresponding SPO triplet labels; S2: training to obtain a word vector for each word; S3: inputting each word of the text into a network according to its word vector to complete feature extraction; S4: inputting the extracted features into a pointer model for training; S5: extracting the SPO triplets with the trained model. The invention provides a brand-new model for extracting triples from text: after several feature vectors are fused, the pointer network model is trained in sequence on the subject S and object O pointers, and the trained model then extracts all triples in the target.

Description

Method, system and computer medium for pointer-type extraction of triplet information using complex fusion features

Technical Field

The present invention relates to the field of text feature extraction and information extraction, and more particularly, to a method, a system, and a computer medium for extracting triplet information by using complex fusion features.

Background

To cope with the challenges of the information explosion, automated tools are urgently needed to help people quickly find the information they actually need in massive information sources. These sources consist of sentences, and each sentence consists of several subject-predicate-object triples (a subject S, an object O and the relation P between them). Take a sentence picked at random from Baidu Baike: "XX Technology Co., Ltd. is a private communication technology company that produces and sells communication equipment; it was formally registered in 1987 and is headquartered in Longgang District, Shenzhen, Guangdong, China." All the triples in this sentence are {S: "XX Technology Co., Ltd.", O: "1987", P: "founding date"} and {S: "XX Technology Co., Ltd.", O: "Longgang District, Shenzhen", P: "headquarters location"}. How to extract the key information from online text efficiently and accurately has long been a major challenge in the field. Most current deep learning methods fall into two types. One is joint extraction: a sentence is input and passed through a joint entity-recognition and relation-extraction model, which recasts relation extraction, originally a sequence labeling task plus a classification task, entirely as a sequence labeling problem, so that triples are obtained directly from an end-to-end neural network model. The other is a two-step method: a sentence is input, named entity recognition is performed first, the recognized subjects S and objects O are then combined in pairs, relation classification is performed to obtain the relation P for each (S, O) combination, and finally all triples that have an entity relation are stored.

Disclosure of Invention

In the prior art, deep learning methods cannot extract all triples: sequence labeling strategies do not support overlapping entity relations, and the two-step extraction method cannot effectively handle the cases where one S has several (P, O) pairs or where the same (S, O) pair corresponds to several P. To address these problems, the invention provides a method for pointer-type extraction of triplet information using complex fusion features.
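The two overlap cases named above can be made concrete with a small data-structure sketch. The triples below are illustrative only (the "Person A" / "Film B" entries are invented for the example), and the helper names `group_by_subject` and `relations_per_pair` are hypothetical, not part of the patent.

```python
from collections import defaultdict

# Illustrative (S, P, O) triples. The first two share one subject with
# several (P, O) pairs; the last two share one (S, O) pair with several P.
triples = [
    ("XX Technology Co., Ltd.", "founding date", "1987"),
    ("XX Technology Co., Ltd.", "headquarters location", "Shenzhen"),
    ("Person A", "director", "Film B"),
    ("Person A", "lead actor", "Film B"),
]

def group_by_subject(spo_list):
    """Map each subject S to the list of (P, O) pairs attached to it."""
    grouped = defaultdict(list)
    for s, p, o in spo_list:
        grouped[s].append((p, o))
    return dict(grouped)

def relations_per_pair(spo_list):
    """Map each (S, O) pair to the set of relations P that link it."""
    pairs = defaultdict(set)
    for s, p, o in spo_list:
        pairs[(s, o)].add(p)
    return dict(pairs)

grouped = group_by_subject(triples)
pairs = relations_per_pair(triples)
```

A tagging scheme that assigns a single label per character cannot encode either grouping, which is why the method predicts S first and then P and O conditioned on it.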

A method for pointer-type extraction of triplet information using complex fusion features comprises the following steps:

S1: acquiring sentences and the corresponding triplet labels from various texts, the triplet labels being a subject S, an object O and a relation P;

S2: encoding each sentence into vector format and obtaining the word vector of each word by training a word position Embedding layer;

S3: inputting each word of the sentence into a feature extraction network according to its word vector to complete feature extraction and obtain the feature vector of each sentence;

S4: inputting the feature vector of each sentence into a pointer model for training;

S5: extracting all subjects S in the target with the trained model; extracting all corresponding relations P under the guidance of each subject S; and extracting all objects O under the guidance of each (S, P) combination, wherein the extracted targets correspond one-to-one with the labels.

In a preferred scheme, in step S1, the sentences are obtained by web crawlers and the corresponding triples by manual annotation.

In a preferred embodiment, the specific steps of S2 are as follows:

S21, encoding every character in all sentences, with different numbers corresponding to different characters;

S22, determining a fixed sequence length X (here X = 100): when the sentence length exceeds X, truncating the sentence to length X; when the sentence length is less than X, padding the sentence with 0 at the end until its length is X, thereby forming a sentence vector that a computer can process;

S23, feeding the sentence vector obtained in step S22 into the word position encoding layer to obtain word vectors with word position encoding.
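Steps S21 and S22 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `build_vocab` and `encode` are hypothetical, id 0 is reserved for padding, and characters unseen when the vocabulary was built are not handled. The embedding layer of S23 is not shown.

```python
def build_vocab(sentences):
    """S21: give every distinct character its own positive integer id (0 = padding)."""
    vocab = {}
    for sentence in sentences:
        for ch in sentence:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 1
    return vocab

def encode(sentence, vocab, max_len=100):
    """S22: truncate to max_len, or right-pad with 0 up to max_len."""
    ids = [vocab[ch] for ch in sentence[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Toy usage with a short fixed length so the effect is visible.
sentences = ["abca", "bd"]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab, max_len=3) for s in sentences]
```

With `max_len=3`, the first sentence is truncated and the second is padded, so both come out with the same length.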

In a preferred embodiment, the specific steps of S3 are as follows:

S31, feeding the word vectors with word position encoding into a feature extraction network for training, the feature extraction network comprising a convolutional network and a recurrent network;

S32, the sentence feature vector extracted by the convolutional network is A = [a_1, a_2, …, a_i]; the sentence feature vector extracted by the recurrent network is B = [b_1, b_2, …, b_i], where each a and b is the vector of a single word;

S33, rewriting A and B in complex form, wherein

n is the number of elements in each of the vectors a_1 and b_1;

S34, performing complex addition of A and B and comparing moduli: if the modulus of a_i + b_i is greater than both the modulus of a_i and the modulus of b_i, the two are fused, i.e. h_i = a_i + b_i; otherwise, whichever of a_i and b_i has the larger modulus is taken as the final feature vector, i.e. h_i is selected from a_i and b_i;

S35, obtaining the final fused feature vector H = [h_1, h_2, h_3, …, h_i].
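The selection rule of S34 can be sketched in a few lines. The formula that rewrites A and B in complex form (S33) is not available in this text, so the mapping in `to_complex` (first half of a word vector as real parts, second half as imaginary parts) is purely an assumption made for illustration, as are the function names and the tie-breaking choice of a_i when the two moduli are equal.

```python
import math

def to_complex(vec):
    """Hypothetical S33 mapping: first half -> real parts, second half -> imaginary parts."""
    half = len(vec) // 2
    return [complex(re, im) for re, im in zip(vec[:half], vec[half:])]

def modulus(zvec):
    """Modulus (Euclidean norm) of a complex vector."""
    return math.sqrt(sum(abs(z) ** 2 for z in zvec))

def fuse_word(a_i, b_i):
    """S34: keep a_i + b_i if its modulus exceeds both inputs, else keep the larger input."""
    total = [x + y for x, y in zip(a_i, b_i)]
    if modulus(total) > max(modulus(a_i), modulus(b_i)):
        return total
    return a_i if modulus(a_i) >= modulus(b_i) else b_i

def fuse(A, B):
    """S35: apply the rule word by word to obtain H = [h_1, ..., h_i]."""
    return [fuse_word(to_complex(a), to_complex(b)) for a, b in zip(A, B)]

# Word 1: the two features reinforce each other, so the sum is kept.
# Word 2: the features partly cancel, so the larger one (from A) is kept.
A = [[1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]
B = [[1.0, 0.0, 0.0, 0.0], [-2.0, 0.0, 0.0, 0.0]]
H = fuse(A, B)
```

The rule keeps the sum only where the convolutional and recurrent features agree enough to increase the modulus; where they cancel, the stronger single feature survives.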

In a preferred embodiment, the specific steps of S4 are as follows:

S41, taking the fused feature vector H = [h_1, h_2, h_3, …, h_i], where each h is the vector of one word;

S42, computing the Attention1 score of the current unit from the current state:

where V_a and W_a are trainable parameters and e_i is the score of the current feature vector, obtained by a similarity computation between V_a and W_a h_i; V_a^T is the transpose of the training parameter vector V_a (transposition being the usual way to compute similarity between vectors); the weights A are then obtained by normalization;

S43, obtaining the final vector C = A × H through training, where W_a has dimension d×d, h_i has dimension d×1 and V_a has dimension d×1; the result C of the weighted sum over the original vectors H, computed by Attention1, is the attention value. By representing the text with this attention-value vector, the network learns from the label information to focus attention on the SPO elements to be predicted. Predicting the position vector of the subject S is in fact a binary classification in which each value lies in {0, 1}, so the loss function is the binary cross entropy;

S44, applying a binary classifier to the attention vector H_i = [h_1, h_2, h_3, …, h_i] to extract the corresponding SPO triplet information; the model is trained with the Adam optimizer: it is first trained with a small learning rate chosen by experiment, after which the best training result is loaded and training continues with a smaller learning rate until the optimum is reached.
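The attention weighting of S42 and S43 and the binary cross-entropy loss can be sketched in plain Python. The exact score formula is not available in this text; the sketch assumes e_i = V_a^T (W_a h_i), which matches the stated dimensions (W_a: d×d, h_i: d×1, V_a: d×1), and assumes the normalization is a softmax. All function names are hypothetical, and the Adam training loop of S44 is not shown.

```python
import math

def matvec(W, h):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def attention(H, W_a, V_a):
    """Return the weights A and the attended vector C = A x H."""
    # Assumed score: e_i = V_a^T (W_a h_i), one scalar per word.
    scores = [sum(v * p for v, p in zip(V_a, matvec(W_a, h))) for h in H]
    # Assumed normalization: numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    A = [e / total for e in exps]
    # Weighted sum of the original word vectors.
    d = len(H[0])
    C = [sum(a * h[k] for a, h in zip(A, H)) for k in range(d)]
    return A, C

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy between pointer probabilities and 0/1 labels."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(labels)

# Toy usage: two words with d = 2 and identity projection.
H = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
V_a = [1.0, 1.0]
A, C = attention(H, W_a, V_a)
loss = binary_cross_entropy([0.9, 0.1], [1, 0])
```

In this toy case both words receive the same score, so the weights are uniform and C is the average of the two word vectors.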

In a preferred embodiment, the pointer model includes an Attention1 model, an Attention2 model, and an Attention3 model.

In a preferred embodiment, the specific steps of S5 are as follows:

S51, extracting all subjects S in the target with the trained model and sampling one subject S1, wherein the Attention1 model processes the fused feature vector and the corresponding extracted feature vector;

S52, combining the position of S1 with the fused feature vector, extracting P with the Attention2 model, and obtaining the relation P vector through a softmax layer;

S53, combining the fused feature vector with the predicted S and P vectors to form a new vector, and predicting the corresponding O position by passing it through the Attention3 model followed by a sigmoid layer; the Attention3 model has the same structure as the Attention1 and Attention2 models and differs only in its trained parameters, paying more attention to the weights (the corresponding extraction information) of the object O position; the sigmoid layer works on the same principle as in predicting S.

S54, outputting corresponding SPO triplet information.
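The control flow of S51 to S54 is a cascade: subjects first, then relations conditioned on each subject, then objects conditioned on each (S, P) pair. The sketch below shows only that flow; the three predictor arguments are hypothetical stand-ins for the trained attention models, and the toy lookup functions exist only to make the example runnable.

```python
def extract_triples(sentence, predict_subjects, predict_relations, predict_objects):
    """Cascade S -> P -> O and collect every resulting (S, P, O) triple."""
    triples = []
    for s in predict_subjects(sentence):               # S51: all subjects
        for p in predict_relations(sentence, s):       # S52: relations for this S
            for o in predict_objects(sentence, s, p):  # S53: objects for this (S, P)
                triples.append((s, p, o))              # S54: output SPO
    return triples

# Toy stand-ins (simple lookups, not trained models).
def toy_subjects(sentence):
    return ["XX Co."] if "XX Co." in sentence else []

def toy_relations(sentence, s):
    return ["founding date", "headquarters location"]

def toy_objects(sentence, s, p):
    table = {"founding date": ["1987"], "headquarters location": ["Shenzhen"]}
    return [o for o in table[p] if o in sentence]

text = "XX Co. was registered in 1987 and is headquartered in Shenzhen."
result = extract_triples(text, toy_subjects, toy_relations, toy_objects)
```

Because each inner step is conditioned on the outer one, a single subject can yield several (P, O) pairs and the same (S, O) pair can appear under several relations.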

A second aspect of the invention discloses a system for pointer-type extraction of triplet information using complex fusion features. The system acquires raw data with a web crawler and extracts all triplet information of a target object from the corpus; after training, a first extraction module extracts the subject S with the extraction model, a second extraction module extracts the relation P with the extraction model, and a third extraction module extracts the object O with the extraction model.

A third aspect of the invention provides a computer-readable storage medium storing a program of the method for pointer-type extraction of triplet information using complex fusion features; when the program is executed by a processor, the steps of the method are implemented.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The invention provides a brand-new model for extracting triples from text.

Drawings

FIG. 1 is a general flow diagram of the method for pointer-type extraction of triplet information using complex fusion features;

FIG. 2 is a schematic diagram of the processing of word vectors in step S2 of Example 2;

FIG. 3 is a schematic flow chart of step S3 in Example 2;

FIG. 4 is a schematic flow chart of step S5 of Example 2;

FIG. 5 is an expanded view of the Attention1 model in step S5 of Example 2;

FIG. 6 is a schematic diagram of the system for pointer-type extraction of triplet information using complex fusion features provided in Example 2.

Detailed Description

The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only some embodiments of the present invention, which are only for illustration and not to be construed as limitations of the present patent. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.

Example 1

A method for pointer-type extraction of triplet information using complex fusion features, as shown in FIG. 1, comprises the following steps:

Example 2

In a preferred embodiment, as shown in fig. 2, the specific steps in S2 are as follows:

In a preferred embodiment, as shown in fig. 3, the specific steps of S3 are as follows:

S33, rewriting A and B in complex form, wherein

n is the number of elements in each of the vectors a_1 and b_1;

In a preferred embodiment, the specific steps of S4 are as follows:

The Attention1 model is expanded as shown in FIG. 5: the attention network processes the fused feature vector and the corresponding extracted feature vector to generate the final, new sentence feature vector. Finally, two sigmoid layers predict the start position and the end position (the last character) of the subject S, respectively, for example [1,0,0,0,0,0,0,0,0,0] and [0,0,0,0,0,1,0,0,0,0].
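The two pointer vectors can be turned back into character spans as sketched below. The 0.5 threshold and the rule of pairing each start with the nearest end at or after it are assumptions made for illustration; the function name `decode_spans` is hypothetical.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each predicted start index with the nearest end index at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        later = [e for e in ends if e >= s]
        if later:
            spans.append((s, later[0]))
    return spans

# The example vectors from the text: start at index 0, end at index 5.
start = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
end = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
spans = decode_spans(start, end)
subject = "ABCDEFGHIJ"[spans[0][0]:spans[0][1] + 1]
```

For the example vectors the subject covers characters 0 through 5 inclusive. Because each position is classified independently, several starts and ends can fire in one sentence, which allows more than one subject to be extracted.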

In a preferred embodiment, as shown in FIG. 4, the specific steps of S5 are as follows:

S54, outputting corresponding SPO triplet information.

A second aspect of the invention discloses a system for pointer-type extraction of triplet information using complex fusion features, as shown in FIG. 6. The system acquires raw data with a web crawler and extracts all triplet information of a target object from the corpus; after training, a first extraction module extracts the subject S with the extraction model, a second extraction module extracts the relation P with the extraction model, and a third extraction module extracts the object O with the extraction model.

It should be understood that the above examples are provided only to illustrate the invention clearly and do not limit its embodiments. Other variations or modifications in different forms will be apparent to those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims.

Claims

1. A method for pointer-type extraction of triplet information using complex fusion features, characterized by comprising the following steps:

S5: extracting all subjects S in the target with the trained model; extracting all corresponding relations P under the guidance of each subject S; extracting all objects O under the guidance of each (S, P) combination;

the specific steps in the step S2 are as follows:

S21, encoding every character in all sentences, with different numbers corresponding to different characters;

S22, determining a fixed sequence length X: when the sentence length exceeds X, truncating the sentence to length X; when the sentence length is less than X, padding the sentence with 0 at the end until its length is X, thereby forming a sentence vector that a computer can process;

S23, feeding the sentence vector obtained in step S22 into the word position encoding layer to obtain word vectors with word position encoding;

the specific steps of the S3 are as follows:

S33, rewriting A and B in complex form, wherein

n is the number of elements in each of the vectors a_1 and b_1;

S34, performing complex addition of A and B and comparing moduli: if the modulus of a_i + b_i is greater than both the modulus of a_i and the modulus of b_i, the two are fused, i.e. h_i = a_i + b_i; otherwise, whichever of a_i and b_i has the larger modulus is taken as the final feature vector, i.e. h_i is selected from a_i and b_i;

2. The method for pointer-type extraction of triplet information using complex fusion features according to claim 1, wherein in step S1 the sentences are obtained by web crawlers and the corresponding triples by manual annotation.

3. The method for pointer-type extraction of triplet information using complex fusion features according to claim 1, wherein the pointer model comprises an Attention1 model, an Attention2 model and an Attention3 model.

4. The method for pointer-type extraction of triplet information using complex fusion features according to claim 3, wherein the specific steps of S4 are as follows:

S41, taking the fused feature vector H_i = [h_1, h_2, h_3, …, h_i], where each h is the vector of one word;

V_a^T is the transpose of the training parameter vector V_a (transposition being the usual way to compute similarity between vectors), and the weights A are obtained by normalization;

S43, obtaining the final vector C = A × H through training, where W_a has dimension d×d, h_i has dimension d×1 and V_a has dimension d×1; the result C of the weighted sum over the original vectors H, computed by Attention1, is the attention value; the text is represented by this attention-value vector, and the network learns from the label information to focus attention on the SPO elements to be predicted;

5. The method for pointer-type extraction of triplet information using complex fusion features according to claim 1, wherein the specific steps of S5 are as follows:

S52, combining the position of the subject S1 with the fused feature vector, extracting P with the Attention2 model, and obtaining the relation P vector through a softmax layer;

S53, for the different (S, P) combinations, combining the fused feature vector with the predicted S and P vectors to form a new vector, and predicting the corresponding O position by passing it through the Attention3 model followed by a sigmoid layer;

S54, outputting the corresponding SPO triplet information.

6. A system for pointer-type extraction of triplet information using complex fusion features, implementing the method according to any one of claims 1 to 5, characterized in that the system acquires raw data with a web crawler and extracts all triplet information of a target object from the corpus; after training, a first extraction module extracts the subject S with the extraction model; a second extraction module extracts the relation P with the extraction model; and a third extraction module extracts the object O with the extraction model.

7. A computer-readable storage medium, characterized in that it stores a program of the method for pointer-type extraction of triplet information using complex fusion features, and when the program is executed by a processor, the steps of the method according to any one of claims 1 to 5 are implemented.