CN112487203A - Relation extraction system integrated with dynamic word vectors - Google Patents

Relation extraction system integrated with dynamic word vectors Download PDF

Info

Publication number
CN112487203A
CN112487203A (application CN202011387516.8A)
Authority
CN
China
Prior art keywords
sentence
vector
vectors
relation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011387516.8A
Other languages
Chinese (zh)
Other versions
CN112487203B (en
Inventor
张力文
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Co ltd
Original Assignee
Global Tone Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Co ltd filed Critical Global Tone Communication Technology Co ltd
Priority to CN202011387516.8A priority Critical patent/CN112487203B/en
Publication of CN112487203A publication Critical patent/CN112487203A/en
Application granted granted Critical
Publication of CN112487203B publication Critical patent/CN112487203B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367: Ontology
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides an entity relation extraction method and system that incorporates dynamic word vectors. The system uses remote supervision to align an existing knowledge base with abundant unstructured data and thereby generate a large amount of training data; this alleviates the shortage of manually labeled corpora, reduces dependence on labeled data, and effectively lowers labor costs. To capture as much feature information between entities as possible, the model's basic framework is a piecewise convolutional neural network, and dynamic word vectors are incorporated to further extract the semantic information of example sentences.

Description

Relation extraction system integrated with dynamic word vectors
Technical Field
The invention relates to the field of information extraction, and in particular to a method for mining semantic relations between entities.
Background
Information extraction aims to extract structured information from large-scale unstructured or semi-structured natural language text; its main tasks include entity extraction, relation extraction, and event extraction. Relation extraction (RE) focuses on extracting semantic relations between entities from text content and mining the deep relational structures between entities. It has significant theoretical and research value, and is also foundational work for optimizing search engines, building knowledge graphs, and developing intelligent question-answering systems.
Practice has shown that supervised learning methods can extract more effective features and achieve higher precision and recall, but they depend heavily on natural language processing annotations, such as part-of-speech tags and syntactic parses, to provide classification features. NLP annotation tools often contain many errors, and those errors propagate and amplify through the relation extraction pipeline, ultimately degrading extraction quality. With the rapid development of deep learning, neural network models can learn sentence features automatically without complex feature engineering, and much research work now uses neural network models to solve the relation extraction problem.
Using neural network models faces two major problems: (1) there is not enough labeled data, so training sets cover entities and entity relations poorly and generalize badly, while manually labeling training data costs a great deal of time and effort; (2) because word usage is semantically and grammatically complex and variable, existing models use "static" pre-trained word vectors that cannot change with the linguistic context, so their representational capacity is limited.
Disclosure of Invention
In view of the above, the present invention provides a relation extraction model and system that incorporates dynamic word vectors. It automatically constructs a large amount of training data by aligning a knowledge base with unstructured text via remote supervision, which reduces the model's dependence on manually labeled data and improves its cross-domain adaptability. An attention mechanism is introduced to suppress the noisy data that remote supervision produces, and dynamic word vectors are used to improve the accuracy of relation extraction, thereby overcoming, at least to some extent, one or more limitations of the related art, including insufficient corpora.
To achieve the above object, according to one aspect of the present invention, the following technical solution is provided: a relation extraction method incorporating dynamic word vectors, comprising the following parts:
Part one: obtaining dynamic word vectors. First, a deep bidirectional language model (LM) is pre-trained on a large text corpus, and a function learned from the model's internal states is then used to produce the word vectors; such word vectors are not fixed but vary with the context. This part uses an existing ELMO model or BERT model to generate the word vectors.
The ELMO model, built on a bidirectional language model, represents a word vector as a linear combination of the representations from its layers.
The BERT model uses a Transformer's bidirectional encoder for representation. Unlike other recent language representation models, BERT pre-trains deep bidirectional representations by jointly conditioning on context in all layers. As a result, the pre-trained BERT representation can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks.
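The idea of part one, a word vector computed from the language model's internal states, can be sketched as an ELMO-style weighted combination of layer representations. The function below is an illustrative assumption, not the actual ELMO or BERT implementation; the layer weights and scaling factor are the usual task-specific parameters of that scheme.

```python
import numpy as np

def elmo_style_vector(layer_states, s, gamma=1.0):
    """Combine the L hidden states of one word (each shape [dim]) into a
    single contextual word vector: gamma * sum_j softmax(s)_j * h_j."""
    s = np.asarray(s, dtype=float)
    weights = np.exp(s - s.max())
    weights /= weights.sum()          # softmax-normalized layer weights
    layers = np.stack(layer_states)   # shape [L, dim]
    return gamma * (weights[:, None] * layers).sum(axis=0)

# toy example: one word with 3 layer representations of dimension 4
layers = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
vec = elmo_style_vector(layers, s=[0.0, 0.0, 0.0])  # equal weights -> mean
```

Because the combination depends on the hidden states of the specific sentence, the same word receives different vectors in different contexts, which is the "dynamic" property the patent relies on.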
Part two: training a segmented neural network model that introduces an attention mechanism. From the word vectors obtained above, the corresponding weights are computed and the weighted results are concatenated into a sentence vector representation. The same entity pair may express different relations in different sentences, so extracting sentences by entity pair under remote supervision inevitably introduces noisy data; this part greatly reduces the effect of that noise on relation extraction.
More specifically: the input to the segmented neural network model is a training sentence containing an entity pair. The two entities divide the sentence into three segments, which are mapped into three corresponding segments of word vectors. A convolutional neural network extracts features from each of the three segments to obtain three feature vectors, and the weight of each feature vector and the relation vector are computed; specifically, they are computed by the following formulas:
b_i = Conv(vector_sentence_i), i = 1, 2, 3    (1)
w_i = w_a^T · tanh([b_i; v_relation]) + b_a    (2)
α_i = exp(w_i) / Σ_j exp(w_j), j = 1, 2, 3    (3)
v_relation = v_ent1 - v_ent2    (4)
wherein b_i denotes the feature vector of the i-th segment of the sentence extracted by the convolutional neural network, and vector_sentence_i denotes the word vectors of the i-th segment of the sentence; w_i denotes the score obtained after fusing the relation vector into the i-th feature vector; w_a and b_a are model parameters; α_i is the weight of the i-th feature vector, and v_relation is the relation vector; v_ent1 and v_ent2 denote the first and second entity vectors, respectively;
Each of the three feature vectors is multiplied by its corresponding weight, and the results are concatenated to obtain the final sentence vector representation; after passing through a fully connected layer, the sentence vector is classified by softmax, specifically:
s = concat[b_1·α_1; b_2·α_2; b_3·α_3]    (5)
c = softmax(w·s + b)    (6)
wherein s denotes the new vector formed by concatenating the weighted feature vectors, and c denotes the vector of categories.
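A minimal numerical sketch of formulas (1) through (6) might look like the following. The mean-pooling stand-in for the convolution, the function name, the toy dimensions, and all parameter values are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pcnn_attention(segments, v_ent1, v_ent2, w_a, b_a, W, b):
    """Toy sketch of formulas (1)-(6): per-segment features, relation-aware
    attention weights, weighted concatenation, and softmax classification."""
    v_rel = v_ent1 - v_ent2                                # formula (4)
    # formula (1): stand-in for Conv; mean-pool each segment's word vectors
    feats = [seg.mean(axis=0) for seg in segments]
    # formula (2): score each segment feature against the relation vector
    scores = [w_a @ np.tanh(np.concatenate([b_i, v_rel])) + b_a
              for b_i in feats]
    alpha = softmax(np.array(scores))                      # formula (3)
    # formula (5): concatenate the weighted segment features
    s = np.concatenate([b_i * a for b_i, a in zip(feats, alpha)])
    return softmax(W @ s + b)                              # formula (6)

# toy setup: 4-dim word vectors, 3 segments of 5 words, 2 relation classes
rng = np.random.default_rng(0)
segs = [rng.normal(size=(5, 4)) for _ in range(3)]
probs = pcnn_attention(segs, rng.normal(size=4), rng.normal(size=4),
                       w_a=rng.normal(size=8), b_a=0.0,
                       W=rng.normal(size=(2, 12)), b=np.zeros(2))
```

The output is a probability distribution over relation categories; in the real model the convolution, entity vectors, and classifier weights would all be learned jointly.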
According to another aspect of the present invention, there is also provided a relation extraction system incorporating dynamic word vectors, the system comprising:
and the corpus collection module extracts an entity pair from the manually constructed triple knowledge base in a remote supervision mode, and extracts sentences containing the entity pair from the network text by taking the entity pair as a keyword to serve as training corpuses under the relationship. The module is essentially a web crawler module, automatically selects entity pairs according to the principle of language material category balance, and uses keywords thereof to crawl corresponding language materials as training language materials of the relationship; in addition, the module also has a data cleaning task, filters invalid non-text data and stores the data into a database according to the text length of a sentence;
a dynamic word vector generation module, which converts sentence text into vectors. Because dynamic word vectors are generated per sentence, a sentence length threshold is set: sentences longer than the threshold are truncated, and shorter ones are padded. The aligned sentences are input into the word vector generation model, which outputs a word vector for each word in the sentence; these are concatenated into the sentence vector, which is sent to the relation extraction module for training;
and a relation extraction module, which extracts the relation of the entity pair in a sentence; it takes the sentence vector produced by the second module as input and finally outputs the category of the sentence.
The relation extraction module adopts a segmented neural network model whose input is a training sentence containing an entity pair;
the two entities divide the training sentence into three segments, which are mapped into three corresponding segments of word vectors;
a convolutional neural network extracts features from each of the three segments of word vectors to obtain three feature vectors, and the weight of each feature vector and the relation vector are computed respectively.
Compared with the prior art, embodiments of the invention have the following beneficial effects. The method uses remote supervision to align an existing knowledge base with abundant unstructured data, generating a large amount of training data and alleviating the shortage of manually labeled corpora. To capture as much feature information between entities as possible, a Piecewise Convolutional Neural Network (PCNN) is adopted as the model's basic framework, and dynamic word vectors are incorporated to further extract the semantic information of example sentences. Finally, an attention mechanism is used to reduce the effect of noisy data. The invention automatically acquires training corpora by remote supervision, combines an attention mechanism to eliminate the erroneous labels produced during automatic corpus acquisition, and incorporates dynamic word vectors, further improving the precision and recall of entity relation extraction.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a segmented neural network model incorporating the attention mechanism of the present invention.
FIG. 2 is a schematic diagram of a dynamic word vector production process according to the present invention.
FIG. 3 is a schematic diagram of a data flow of a system for extracting relationships incorporated into dynamic word vectors according to the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; "a plurality" generally means at least two, but does not exclude the case of at least one.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present invention to describe …, these … should not be limited to those terms, which are used only to distinguish one … from another. For example, a first … could also be termed a second …, and similarly a second … could be termed a first …, without departing from the scope of embodiments of the present invention.
The word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined", "in response to determining", "when (a stated condition or event) is detected", or "in response to detecting (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such an article or system. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in an article or system that includes the element.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited. Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the segmented neural network model with attention mechanism introduced in the present invention mainly includes:
the input to the model is a training statement that contains some pair of entities, for example: monday, Youka and potatoes are combined to form a first large video website.
The main body of the model is a convolutional neural network. The two entities in the sentence divide it into three segments, which are mapped into corresponding word vectors, and the convolutional neural network extracts features from each of the three segments separately. With the attention mechanism, the weights of the three feature vectors and the relation vector are computed by the following formulas:
b_i = Conv(vector_sentence_i), i = 1, 2, 3    (1)
w_i = w_a^T · tanh([b_i; v_relation]) + b_a    (2)
α_i = exp(w_i) / Σ_j exp(w_j), j = 1, 2, 3    (3)
v_relation = v_ent1 - v_ent2    (4)
wherein, in formula (1), b_i denotes the feature vector of the i-th segment of the sentence extracted by the convolutional neural network, and vector_sentence_i denotes the word vectors of the i-th segment of the sentence; formula (2) fuses the relation vector so that the feature vector carries more information, w_i being the new score obtained after fusing the relation vector into the i-th feature vector, and w_a, b_a being model parameters; formula (3) computes the attention-level weight α_i of each feature vector, where v_relation is the relation vector; formula (4) computes the relation vector by subtracting the second entity vector from the first, v_ent1 and v_ent2 denoting the first and second entity vectors respectively. The entity vectors are randomly initialized and updated as the model trains: a matrix is initialized in advance in which each row represents an entity vector, each entity carries a label, and the corresponding entity vector is looked up by that label.
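The entity-vector lookup described above can be sketched as follows; the matrix size, the entity names, and the label-to-row mapping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
num_entities, dim = 1000, 50

# Randomly initialized entity-vector matrix; row i is entity i's vector.
# In the real model these rows would be updated during training.
entity_matrix = rng.normal(size=(num_entities, dim))

# Hypothetical mapping from entity label to row index.
entity_id = {"Youku": 0, "Tudou": 1}

v_ent1 = entity_matrix[entity_id["Youku"]]
v_ent2 = entity_matrix[entity_id["Tudou"]]
v_relation = v_ent1 - v_ent2   # formula (4)
```

Looking vectors up by label keeps the relation vector consistent whenever the same entity appears in different training sentences.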
Each of the three feature vectors is multiplied by its corresponding weight, and the results are concatenated to obtain the final sentence vector representation; after passing through a fully connected layer, the sentence vector is classified by softmax, specifically:
s = concat[b_1·α_1; b_2·α_2; b_3·α_3]    (5)
c = softmax(w·s + b)    (6)
wherein s denotes the new vector formed by concatenating the weighted feature vectors, and c denotes the vector of categories.
Fig. 2 shows, by way of example, the generation process of a dynamic word vector.
The core of the dynamic word vector generation module is an ELMO/BERT model. The training corpus is processed in batches of 100 sentences, and the sentence length is set to 200 words. If a sentence contains more words than this threshold, the part beyond 200 words is truncated so that only the first 200 words are kept; if it contains fewer, the sentence is padded with the token "<pad>". The model outputs the word vectors corresponding to the training corpus.
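The truncate-or-pad alignment above can be sketched as a small helper. The 200-word threshold and the "<pad>" token follow the text; the function name is an assumption.

```python
def align_sentence(tokens, max_len=200, pad="<pad>"):
    """Truncate a tokenized sentence to max_len words, or pad it with the
    "<pad>" token, so every sentence in a batch has the same length."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))

# toy threshold of 5 words for illustration
short = align_sentence(["the", "cat"], max_len=5)
long_ = align_sentence(["a", "b", "c", "d", "e", "f", "g"], max_len=5)
```

After alignment every sentence in the batch maps to the same number of word vectors, which is what allows them to be concatenated into fixed-size sentence vectors.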
FIG. 3 illustrates a relationship extraction system that incorporates dynamic word vectors.
The system consists of a corpus collection module, a dynamic word vector generation module and a relation extraction algorithm module.
The core of the corpus collection module is a remote supervision component that randomly extracts relation triples from a knowledge base (the knowledge base is constructed by domain experts and stored as triples of the form (entity 1, entity 2, relation)). Using the entity pair of each triple as keywords, a crawler system crawls sentences containing the pair from massive amounts of web text; these sentences are stored in a database as training corpora. Domain experts update the knowledge base periodically, continually adding relation triples with prior knowledge, while the collection module runs continuously (24 × 7), automatically extracting corpora for the corresponding entity pairs from massive unstructured text and storing them in the database.
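The remote-supervision labeling step can be sketched as follows. The entity names and the matching-by-substring heuristic are illustrative assumptions; a real system would crawl web text and use proper entity matching.

```python
def distant_supervision(triples, corpus):
    """For each (ent1, ent2, relation) knowledge-base triple, label every
    sentence that mentions both entities with that relation. The labels
    are noisy by design: a sentence containing both entities need not
    actually express the relation, which is why the model later applies
    attention to down-weight such sentences."""
    training = []
    for ent1, ent2, rel in triples:
        for sent in corpus:
            if ent1 in sent and ent2 in sent:
                training.append((sent, ent1, ent2, rel))
    return training

# hypothetical knowledge-base triple and crawled sentences
kb = [("Youku", "Tudou", "merged_with")]
texts = ["Youku and Tudou combined to form a large video website.",
         "Tudou was founded in 2005."]
data = distant_supervision(kb, texts)
```

Only the first sentence mentions both entities, so only it is labeled; the second is skipped even though it mentions one entity.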
The dynamic word vector generation module converts sentence text into vectors. During model training, sentences are first extracted from the database and fed into the ELMO/BERT model; the corresponding dynamic word vectors are output and sent to the relation extraction module for training.
The relation extraction module is the core of the system. Its model is a segmented neural network with an attention mechanism, trained on a large number of training sentences and then saved; it is used to predict the entity relation category of new text sentences.
The system uses remote supervision to align the existing knowledge base with abundant unstructured data and thereby generate a large amount of training data; this alleviates the shortage of manually labeled corpora, reduces dependence on labeled data, and effectively lowers labor costs. To capture as much feature information between entities as possible, a Piecewise Convolutional Neural Network (PCNN) is adopted as the model's basic framework, and the dynamic word vector technique currently popular in academia is incorporated to further extract the semantic information of example sentences. Finally, an attention mechanism is used to reduce the effect of noisy data. The invention automatically acquires training corpora by remote supervision, combines an attention mechanism to eliminate the erroneous labels produced during automatic corpus acquisition, and incorporates dynamic word vectors, further improving the precision and recall of entity relation extraction.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (5)

1. A system for extracting relationships fused into dynamic word vectors, the system comprising:
the corpus collection module randomly extracts relation triples from the knowledge base by remote supervision, uses the entity pairs in the triples as keywords to crawl sentences containing those pairs from massive web text with a crawler system, and stores the sentences in a database as training corpora;
the dynamic word vector generation module converts sentence text into vectors: sentence corpora are first extracted from the database and fed into a deep bidirectional language model, which outputs the corresponding dynamic word vectors; sentence vectors are then computed from them and sent to the relation extraction module for training;
the relation extraction module uses a segmented neural network model with an attention mechanism; the model is trained on a large number of training sentences and then saved, and is used to predict the entity relation category of new text sentences;
inputting a training sentence containing an entity pair into the segmented neural network model;
the two entities divide the training sentence into three segments, which are mapped into three corresponding segments of word vectors;
a convolutional neural network extracts features from each of the three segments of word vectors to obtain three feature vectors, and the weight of each feature vector and the relation vector are computed; specifically, they are computed by the following formulas:
b_i = Conv(vector_sentence_i), i = 1, 2, 3    (1)
w_i = w_a^T · tanh([b_i; v_relation]) + b_a    (2)
α_i = exp(w_i) / Σ_j exp(w_j), j = 1, 2, 3    (3)
v_relation = v_ent1 - v_ent2    (4)
wherein b_i denotes the feature vector of the i-th segment of the sentence extracted by the convolutional neural network, and vector_sentence_i denotes the word vectors of the i-th segment of the sentence; w_i denotes the score obtained after fusing the relation vector into the i-th feature vector; w_a and b_a are model parameters; α_i is the weight of the i-th feature vector, and v_relation is the relation vector; v_ent1 and v_ent2 denote the first and second entity vectors, respectively;
each of the three feature vectors is multiplied by its corresponding weight, and the results are concatenated to obtain the final sentence vector representation; after passing through a fully connected layer, the sentence vector is classified by softmax, specifically:
s = concat[b_1·α_1; b_2·α_2; b_3·α_3]    (5)
c = softmax(w·s + b)    (6)
wherein s denotes the new vector formed by concatenating the weighted feature vectors; c denotes the vector of categories; w denotes the new feature vector obtained after a feature vector is fused with the relation vector; b denotes the feature vector of the sentence extracted by the convolutional neural network.
2. The system according to claim 1, wherein the corpus collection module further performs data cleaning: it filters out invalid non-text data and stores the crawled sentences in the database according to sentence text length.
3. The system according to claim 2, wherein the dynamic word vectors are generated per sentence, and a sentence length threshold is set: sentences longer than the threshold are truncated and shorter ones are padded; the aligned sentences are input into the dynamic word vector generation module, which outputs a word vector for each word in the sentence, and these are concatenated to obtain the sentence vector.
4. The system of claim 3, wherein the relationship extraction module takes the sentence vector generated by the dynamic word vector generation module as input, extracts the relationship of the entity pair in the sentence, and outputs the category to which the sentence belongs.
5. The system of claim 4, wherein the deep bi-directional language model is an ELMO model or a BERT model.
CN202011387516.8A 2019-01-25 2019-01-25 Relation extraction system integrated with dynamic word vector Active CN112487203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011387516.8A CN112487203B (en) 2019-01-25 2019-01-25 Relation extraction system integrated with dynamic word vector

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910071228.2A CN109871451B (en) 2019-01-25 2019-01-25 Method and system for extracting relation of dynamic word vectors
CN202011387516.8A CN112487203B (en) 2019-01-25 2019-01-25 Relation extraction system integrated with dynamic word vector

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910071228.2A Division CN109871451B (en) 2019-01-25 2019-01-25 Method and system for extracting relation of dynamic word vectors

Publications (2)

Publication Number Publication Date
CN112487203A true CN112487203A (en) 2021-03-12
CN112487203B CN112487203B (en) 2024-01-16

Family

ID=66918012

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202011387516.8A Active CN112487203B (en) 2019-01-25 2019-01-25 Relation extraction system integrated with dynamic word vector
CN201910071228.2A Active CN109871451B (en) 2019-01-25 2019-01-25 Method and system for extracting relation of dynamic word vectors

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910071228.2A Active CN109871451B (en) 2019-01-25 2019-01-25 Method and system for extracting relation of dynamic word vectors

Country Status (1)

Country Link
CN (2) CN112487203B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800777A (en) * 2021-04-14 2021-05-14 北京育学园健康管理中心有限公司 Semantic determination method
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium
CN113326371A (en) * 2021-04-30 2021-08-31 南京大学 Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
CN113807518A (en) * 2021-08-16 2021-12-17 中央财经大学 Relationship extraction system based on remote supervision
CN114373554A (en) * 2021-12-28 2022-04-19 大连海事大学 Drug interaction relation extraction method using drug knowledge and syntactic dependency relation

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210037B (en) * 2019-06-12 2020-04-07 四川大学 Syndrome-oriented medical field category detection method
CN110287337A (en) * 2019-06-19 2019-09-27 上海交通大学 The system and method for medicine synonym is obtained based on deep learning and knowledge mapping
CN110458397A (en) * 2019-07-05 2019-11-15 苏州热工研究院有限公司 A kind of nuclear material military service performance information extracting method
CN110347835B (en) * 2019-07-11 2021-08-24 招商局金融科技有限公司 Text clustering method, electronic device and storage medium
CN110489521B (en) * 2019-07-15 2021-03-12 北京三快在线科技有限公司 Text type detection method and device, electronic equipment and computer readable medium
CN112396201A (en) * 2019-07-30 2021-02-23 北京国双科技有限公司 Criminal name prediction method and system
CN110390110B (en) * 2019-07-30 2023-06-27 创新先进技术有限公司 Method and apparatus for pre-training generation of sentence vectors for semantic matching
CN110598000B (en) * 2019-08-01 2023-06-09 达而观信息科技(上海)有限公司 Relation extraction and knowledge graph construction method based on deep learning model
CN110414008B (en) * 2019-08-09 2023-06-20 深巨科技(北京)有限公司 Relation extraction system and method based on deep learning
CN110516239B (en) * 2019-08-26 2022-12-09 贵州大学 Segmentation pooling relation extraction method based on convolutional neural network
CN111639152B (en) * 2019-08-29 2021-04-13 上海卓繁信息技术股份有限公司 Intention recognition method
CN110688407B (en) * 2019-09-09 2022-05-17 创新奇智(南京)科技有限公司 Social relationship mining method
CN110599999A (en) * 2019-09-17 2019-12-20 寇晓宇 Data interaction method and device and robot
CN110781312B (en) * 2019-09-19 2022-07-15 平安科技(深圳)有限公司 Text classification method and device based on semantic representation model and computer equipment
CN110728153A (en) * 2019-10-15 2020-01-24 天津理工大学 Multi-category emotion classification method based on model fusion
CN110909541A (en) * 2019-11-08 2020-03-24 杭州依图医疗技术有限公司 Instruction generation method, system, device and medium
CN111222338A (en) * 2020-01-08 2020-06-02 大连理工大学 Biomedical relation extraction method based on pre-training model and self-attention mechanism
CN111241303A (en) * 2020-01-16 2020-06-05 东方红卫星移动通信有限公司 Remote supervision relation extraction method for large-scale unstructured text data
CN111274394B (en) * 2020-01-16 2022-10-25 重庆邮电大学 Method, device and equipment for extracting entity relationship and storage medium
CN111506696A (en) * 2020-03-03 2020-08-07 平安科技(深圳)有限公司 Information extraction method and device based on small number of training samples
CN111597812B (en) * 2020-05-09 2021-09-17 北京合众鼎成科技有限公司 Financial field multiple relation extraction method based on mask language model
CN111813896B (en) * 2020-07-13 2022-12-02 重庆紫光华山智安科技有限公司 Text triple relation identification method and device, training method and electronic equipment
CN112199508B (en) * 2020-08-10 2024-01-19 淮阴工学院 Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision
CN112380328B (en) * 2020-11-11 2024-02-06 广州知图科技有限公司 Interaction method and system for safety emergency response robot
CN112651224A (en) * 2020-12-24 2021-04-13 天津大学 Intelligent search method and device for engineering construction safety management document text

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170200066A1 (en) * 2016-01-13 2017-07-13 Adobe Systems Incorporated Semantic Natural Language Vector Space
US20180189269A1 (en) * 2016-12-30 2018-07-05 Microsoft Technology Licensing, Llc Graph long short term memory for syntactic relationship discovery
CN108733792A (en) * 2018-05-14 2018-11-02 北京大学深圳研究生院 A kind of entity relation extraction method
CN108829722A (en) * 2018-05-08 2018-11-16 国家计算机网络与信息安全管理中心 A kind of Dual-Attention relationship classification method and system of remote supervisory

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6801655B2 (en) * 2001-05-10 2004-10-05 The United States Of America As Represented By The Secretary Of The Navy Spatial image processor
US10733380B2 (en) * 2017-05-15 2020-08-04 Thomson Reuters Enterprise Center Gmbh Neural paraphrase generator
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
CN108763284B (en) * 2018-04-13 2021-07-20 华南理工大学 Question-answering system implementation method based on deep learning and topic model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG LANXIA; HU WENXIN: "Research on Person Relation Extraction from Chinese Text Based on Bidirectional GRU Neural Network and Two-Layer Attention Mechanism", Computer Applications and Software, no. 11 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800777A (en) * 2021-04-14 2021-05-14 北京育学园健康管理中心有限公司 Semantic determination method
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium
CN113326371A (en) * 2021-04-30 2021-08-31 南京大学 Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
CN113326371B (en) * 2021-04-30 2023-12-29 南京大学 Event extraction method integrating pre-training language model and anti-noise interference remote supervision information
CN113807518A (en) * 2021-08-16 2021-12-17 中央财经大学 Relationship extraction system based on remote supervision
CN113807518B (en) * 2021-08-16 2024-04-05 中央财经大学 Relation extraction system based on remote supervision
CN114373554A (en) * 2021-12-28 2022-04-19 大连海事大学 Drug interaction relation extraction method using drug knowledge and syntactic dependency relation

Also Published As

Publication number Publication date
CN112487203B (en) 2024-01-16
CN109871451A (en) 2019-06-11
CN109871451B (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN109871451B (en) Method and system for extracting relation of dynamic word vectors
CN112528672B (en) Aspect-level emotion analysis method and device based on graph convolution neural network
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN114064918B (en) Multi-modal event knowledge graph construction method
CN111143569B (en) Data processing method, device and computer readable storage medium
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN106372061A (en) Short text similarity calculation method based on semantics
WO2023159767A1 (en) Target word detection method and apparatus, electronic device and storage medium
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN111125367A (en) Multi-character relation extraction method based on multi-level attention mechanism
CN114462409A (en) Audit field named entity recognition method based on countermeasure training
Barbella et al. Analogical word sense disambiguation
Lee et al. Detecting suicidality with a contextual graph neural network
CN112183060B (en) Reference resolution method of multi-round dialogue system
Gao et al. Chinese causal event extraction using causality‐associated graph neural network
Behere et al. Text summarization and classification of conversation data between service chatbot and customer
Veisi Central Kurdish Sentiment Analysis Using Deep Learning.
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN114638222B (en) Natural disaster data classification method and model training method and device thereof
Han et al. Unsupervised Word Sense Disambiguation based on Word Embedding and Collocation.
CN112131879A (en) Relationship extraction system, method and device
CN115906816A (en) Text emotion analysis method of two-channel Attention model based on Bert
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant