CN110674642A - Semantic relation extraction method for noisy sparse text - Google Patents


Info

Publication number
CN110674642A
Authority
CN
China
Prior art keywords
semantic
participle
vector
layer
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910806205.1A
Other languages
Chinese (zh)
Other versions
CN110674642B (en)
Inventor
赵翔
庞宁
谭真
郭爱博
殷风景
唐九阳
葛斌
肖卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201910806205.1A
Publication of CN110674642A
Application granted
Publication of CN110674642B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic relation extraction method for noisy sparse text, comprising the following steps: establishing a training sample set; constructing a semantic relation extraction model; training the semantic relation extraction model; establishing a data set whose semantics are to be extracted; and extracting semantic relations from that data set with the trained model. The method uses different convolutional neural networks to extract features from the word segmentation sequence and from the corresponding dependency path respectively, which avoids error accumulation and clearly outperforms traditional feature-based and kernel-based relation extraction. The two information representations of a relation instance are fully exploited and effectively combined by a feature fusion layer, providing more comprehensive information for accurately predicting the semantic relation of the target entity pair. In addition, a multi-instance learning method is added to suppress noise when samples are sparse; unlike an attention mechanism, it does not suffer from under-fitting and is therefore better suited to semantic relation extraction from sparse samples.

Description

Semantic relation extraction method for noisy sparse text
Technical Field
The invention belongs to the field of extraction of semantic relations of Chinese texts, and particularly relates to a method for extracting entity semantic relations in sparse Chinese texts containing noise.
Background
In recent years, knowledge graphs have played an extremely important role in a series of knowledge-driven applications such as machine translation, recommendation systems and question-answering systems, and relation extraction is a key link in constructing knowledge graphs automatically, with important practical significance. Relation extraction is the process of obtaining the semantic relation of a labeled entity pair by understanding the semantic information contained in unstructured text. Currently, the mainstream relation extraction methods are based on supervision and remote supervision.
To avoid the error accumulation that traditional supervised relation extraction inherits from natural language processing tools, neural networks are widely used to embed and represent text and to extract its semantic features automatically. Supervised methods, however, require explicit manual annotation of text, and the annotation process is time-consuming and labor-intensive. To solve this problem, an alternative paradigm, remote supervision, was proposed: an existing knowledge graph such as Freebase provides the supervision, and text is heuristically aligned with Freebase to generate large amounts of weakly annotated data. Clearly, this heuristic alignment introduces noisy data, which can seriously affect the performance of the relation extractor.
To address wrong annotation, multi-instance learning was proposed to alleviate mislabeling under remote supervision; in addition, a selective attention mechanism with trainable parameters can learn to fit the probability distribution over the noise and dynamically weaken the influence of noise instances. However, when data are sparse, conventional attention mechanisms and multi-instance learning fit the probability distribution over the noisy data poorly, so semantic relation extraction from noisy sparse text remains unsatisfactory. Moreover, existing relation extraction methods are well developed for English corpora, while research on Chinese corpora lags behind.
Disclosure of Invention
In view of the above, the present invention provides a semantic relation extraction method for noisy sparse text, which is used for extracting structured knowledge from unstructured corpora, in particular for extracting semantic relations from noisy sparse Chinese text.
Based on the above purpose, the semantic relation extraction method for noisy sparse text provided by the invention comprises the following steps:
step 1, establishing a Chinese text training sample set;
step 2, constructing a semantic relation extraction model;
step 3, training a semantic relation extraction model;
step 4, establishing a data set of semantics to be extracted;
and step 5, extracting the semantic relation from the data set of the semantics to be extracted by using the trained semantic relation extraction model.
The training sample set consists of data weakly labeled by remotely supervising Wikipedia corpora with a knowledge graph; each training instance comprises a target entity pair, a word segmentation sequence, a dependency path and a weak supervision label;
the dependency path is the shortest dependency path and is defined as: shortest paths between pairs of entities in the syntactic analysis dependency tree.
Furthermore, the semantic relation extraction model comprises an input layer, an embedding layer, a convolution layer, a feature fusion layer and a fully connected layer, which are connected in sequence. The input layer provides the input interface for the instance package formed by all word segmentation sequences of an entity pair and the corresponding dependency paths; the embedding layer maps the input word segmentation sequence and the corresponding dependency path into a low-dimensional vector space by representation learning; the convolution layer consists of two independent convolution networks that respectively extract the semantic features of all participle sequences and all corresponding dependency paths in the instance package; the feature fusion layer fuses the complementary semantic features from the word segmentation sequence and the corresponding dependency path; and the fully connected layer maps the instance onto the defined relation set to obtain the semantic relation between the entity pair.
Furthermore, the semantic relation extraction model also comprises a multi-instance learning mechanism module, which takes data from the fully connected layer, feeds the learning result back to the convolution layer and guides the computation of the convolution layer; during model learning, the module selects the best instance in each instance package as the training and prediction instance, discards the other instances, and thereby suppresses the influence of noise instances.
Specifically, in step 3, the semantic relation extraction model is trained as follows: after initialization, with cross entropy as the loss function, the model parameters are updated iteratively by stochastic gradient descent under the multi-instance learning method; the gradient is evaluated at each iteration to find the optimal weights and biases of every network layer, and after multiple iterations the optimal semantic relation extraction model of this training run is obtained.
Thus, in step 5, the trained semantic relation extraction model is used to extract the semantic relations of noisy Chinese text, obtaining structured knowledge from unstructured text data.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention uses different convolutional neural networks to extract the features of the word segmentation sequence and of the corresponding dependency path respectively, automatically generating the embedded representations; this avoids error accumulation and clearly outperforms traditional feature-based and kernel-based relation extraction methods.
(2) The invention fully exploits the two information representations of a relation instance, namely the word segmentation sequence and the dependency path, and combines them effectively through the feature fusion layer, providing more comprehensive information for accurately predicting the semantic relation of the target entity pair.
(3) On top of the model, a multi-instance learning method is added to suppress noise when Chinese samples are sparse; unlike an attention mechanism, it does not suffer from under-fitting and is better suited to semantic relation extraction from sparse samples.
The method thus offers specific solutions to three problems of the prior art, namely that data construction depends on manual labor, that denoising methods under-fit when Chinese samples are sparse, and that semantic information is not fully exploited; it effectively reduces the influence of noise, captures semantic information more fully, predicts relations more accurately, and has high reliability.
Drawings
FIG. 1 is a schematic overall flow chart of an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the semantic relation extraction model of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
As shown in fig. 1, a semantic relationship extraction method for noisy sparse text includes the following steps:
step 1, establishing a Chinese text training sample set;
step 2, constructing a semantic relation extraction model;
step 3, training a semantic relation extraction model;
step 4, establishing a data set of semantics to be extracted;
and step 5, extracting the semantic relation from the data set of the semantics to be extracted by using the trained semantic relation extraction model.
The training sample set consists of data weakly labeled by remotely supervising Wikipedia corpora with a knowledge graph; each training instance comprises a target entity pair, a word segmentation sequence, a dependency path and a weak supervision label. For each Chinese text, the entity pairs it contains are predetermined, a word segmentation sequence of the original text is obtained with a word segmentation tool, a syntactic analysis tree is obtained with a syntactic analysis tool, and the dependency path is extracted from that tree. Instances of the same entity pair are put together to form an instance package, preparing the data for the denoising of the subsequent multi-instance learning mechanism. The dependency path is the shortest dependency path, defined as the shortest path between the entity pair in the syntactic dependency tree.
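For concreteness, the following minimal sketch (in Python) shows how the shortest dependency path of one instance could be extracted once the word segmentation tool and syntactic analysis tool have produced the tokens and their dependency heads; the example sentence, the head indices and the helper name are illustrative assumptions, not taken from the patent.

    # Minimal sketch: shortest dependency path between an entity pair,
    # given segmented tokens and dependency-parse head indices.
    import networkx as nx

    def shortest_dependency_path(tokens, heads, e1_idx, e2_idx):
        """tokens: segmented words; heads[i]: head index of token i (-1 = root);
        e1_idx, e2_idx: token positions of the target entity pair."""
        g = nx.Graph()  # treat the dependency tree as an undirected graph
        g.add_nodes_from(range(len(tokens)))
        for i, h in enumerate(heads):
            if h >= 0:
                g.add_edge(i, h)
        path = nx.shortest_path(g, source=e1_idx, target=e2_idx)
        return [tokens[i] for i in path]

    # Hypothetical parsed sentence: "乔布斯 创立 了 苹果公司"
    tokens = ["乔布斯", "创立", "了", "苹果公司"]
    heads = [1, -1, 1, 1]  # "创立" is the root of the dependency tree
    print(shortest_dependency_path(tokens, heads, 0, 3))
    # -> ['乔布斯', '创立', '苹果公司']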
As shown in fig. 2, the semantic relation extraction model includes an input layer, an embedding layer, a convolution layer, a feature fusion layer and a fully connected layer, connected in sequence. The input layer provides the input interface for the instance package formed by all word segmentation sequences of an entity pair and the corresponding dependency paths; the embedding layer maps the input word segmentation sequence and the corresponding dependency path into a low-dimensional vector space by representation learning; the convolution layer consists of two independent convolution networks that respectively extract the semantic features of all participle sequences and all corresponding dependency paths in the instance package; the feature fusion layer fuses the complementary semantic features from the word segmentation sequence and the corresponding dependency path; and the fully connected layer maps the instance onto the defined relation set to obtain the semantic relation between the entity pair.
The semantic relation extraction model also comprises a multi-instance learning mechanism module, which takes data from the fully connected layer, feeds the learning result back to the convolution layer and guides the computation of the convolution layer; during model learning, the module selects the best instance in each instance package as the training and prediction instance, discards the other instances, and thereby suppresses the influence of noise instances.
Specifically, the input layer provides the input interface for the instance package composed of all word segmentation sequences of an entity pair and the corresponding dependency paths. In this embodiment the number of input interfaces is 2, corresponding to the word segmentation sequence and the dependency path respectively, and the input of each instance is defined as:

x = {x_1, x_2, …, x_m},  s = {s_1, s_2, …, s_n}

where x denotes the input word segmentation sequence, x_i the ith participle in the word segmentation sequence, s the input dependency path, and s_i the ith participle on the dependency path; m and n are set to the fixed values 100 and 40 in this embodiment.
Specifically, the embedding layer maps the input word segmentation sequence and the corresponding dependency path into a low-dimensional vector space by representation learning: each participle on the input word segmentation sequence and the dependency path is mapped to a vector representation. In this embodiment the vector representation of each participle consists of a word vector, a position vector and a part-of-speech tagging vector. The word vector, of dimension 50, is pre-trained with the Word2Vec algorithm and carries the semantic information of the participle; the position vector, of dimension 10, is randomly initialized and carries the position of the participle within the word segmentation sequence or the dependency path; the part-of-speech tagging vector, of dimension 15, is a unit (one-hot) vector carrying the part-of-speech information of the participle. Any participle in the word segmentation sequence or the dependency path can therefore be represented by the vector

w_i = [v_word : v_position : v_tag]

where v_word, v_position and v_tag denote the word vector, the position vector and the part-of-speech tagging vector of the participle respectively; the dimension of w_i is k, which in this embodiment is 75.
The participle vector representations are concatenated horizontally in the order of the word segmentation sequence and of the dependency path, giving the vector representations of the two inputs:

X = [w_1^x : w_2^x : … : w_m^x]
S = [w_1^s : w_2^s : … : w_n^s]

where X denotes the vector representation of the word segmentation sequence after the embedding layer, w_i^x the vector representation of the ith participle in the word segmentation sequence, S the vector representation of the dependency path after the embedding layer, and w_i^s the vector representation of the ith participle on the dependency path.
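The following minimal PyTorch sketch illustrates this embedding layer under the stated dimensions (50 + 10 + 15 = 75); the class and parameter names are illustrative, and the randomly initialized word table merely stands in for the Word2Vec vectors that the patent pre-trains.

    # Minimal sketch of the embedding layer: word vector (50) + position
    # vector (10) + one-hot part-of-speech vector (15) = 75 per participle.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ParticipleEmbedding(nn.Module):
        def __init__(self, vocab_size, max_len=100, n_pos_tags=15):
            super().__init__()
            self.word = nn.Embedding(vocab_size, 50)   # Word2Vec weights in practice
            self.position = nn.Embedding(max_len, 10)  # randomly initialized
            self.n_pos_tags = n_pos_tags

        def forward(self, word_ids, pos_ids, tag_ids):
            tag_onehot = F.one_hot(tag_ids, num_classes=self.n_pos_tags).float()
            # w_i = [v_word : v_position : v_tag], dimension k = 75
            return torch.cat([self.word(word_ids),
                              self.position(pos_ids),
                              tag_onehot], dim=-1)

    emb = ParticipleEmbedding(vocab_size=20000)
    X = emb(torch.randint(0, 20000, (1, 100)),  # word ids of a length-m sequence
            torch.arange(100).unsqueeze(0),     # positions 0 .. m-1
            torch.randint(0, 15, (1, 100)))     # part-of-speech tag ids
    print(X.shape)  # torch.Size([1, 100, 75])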
The convolution layer consists of two independent convolution networks, which extract the semantic features of all participle sequences and of all corresponding dependency paths in the instance package respectively. Since the two convolution networks share the same operation mechanism, the definition and operation of this layer are illustrated only on the word segmentation sequence. To obtain more useful information from the data, each convolution network is equipped with multiple convolution filters, denoted

F = {f_1, f_2, …, f_d},  f_i ∈ R^{w×k}

In this embodiment the number of convolution filters d is set to 230 and the window size w to 3. The convolution operation is defined as

c_ij = f_i · s_{j:j+w-1},  1 ≤ i ≤ d,  1 ≤ j ≤ m − w + 1

where f_i is the ith convolution filter, s_{j:j+w-1} is the horizontal concatenation of the jth through (j+w−1)th participle vector representations, and · denotes the matrix dot-product operation. Each convolution filter finally produces an intermediate feature vector c_i = {c_i1, c_i2, …, c_i(m−w+1)}, so the sequence of intermediate feature vectors produced by all convolution filters is C = {c_1, c_2, …, c_d}. After convolution, max pooling is used to extract the most significant feature in each dimension, defined as

p_i = max_j(c_ij)

where c_ij is the element at the corresponding position of C. This finally produces the feature vector p^x ∈ R^d of each participle sequence; similarly, a feature vector p^s ∈ R^d is produced for each dependency path.
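A minimal PyTorch sketch of one such convolution network, with d = 230 filters and window w = 3, followed by max pooling over time; the class name and the random inputs are illustrative.

    # Minimal sketch of one convolution network of the convolution layer.
    import torch
    import torch.nn as nn

    class ConvEncoder(nn.Module):
        def __init__(self, k=75, d=230, w=3):
            super().__init__()
            self.conv = nn.Conv1d(in_channels=k, out_channels=d, kernel_size=w)

        def forward(self, X):                 # X: (batch, m, k) embedded input
            c = self.conv(X.transpose(1, 2))  # C: (batch, d, m - w + 1)
            p, _ = c.max(dim=2)               # max pooling over each feature map
            return p                          # feature vector, (batch, d)

    enc_seq, enc_dep = ConvEncoder(), ConvEncoder()  # two independent networks
    p_x = enc_seq(torch.randn(4, 100, 75))  # participle-sequence features
    p_s = enc_dep(torch.randn(4, 40, 75))   # dependency-path features
    print(p_x.shape, p_s.shape)             # (4, 230) (4, 230)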
The feature fusion layer fuses the complementary semantic features from the word segmentation sequence and the corresponding dependency path; essentially, it is a weighted sum of the feature vectors of the two, defined as p = α·p^x + (1 − α)·p^s, where α is the weight parameter, set to 0.5 in this embodiment.
The fully connected layer maps the instance onto the defined relation set to obtain the semantic relation between the entity pair, defined as

o = U·p + v

where U ∈ R^{n_r×d} is the coefficient matrix, v ∈ R^{n_r} is the bias, and o ∈ R^{n_r} contains the confidence scores of all relation types, n_r being the number of relations, set to 5 in this embodiment. The relation with the highest confidence score is taken as the semantic relation between the entity pair.
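A minimal sketch of the feature fusion layer and the fully connected layer under the stated settings (α = 0.5, n_r = 5); the random tensors stand in for the outputs p^x and p^s of the two convolution networks above.

    # Minimal sketch of feature fusion and the fully connected layer.
    import torch
    import torch.nn as nn

    p_x, p_s = torch.randn(4, 230), torch.randn(4, 230)  # stand-in features

    alpha = 0.5
    p = alpha * p_x + (1 - alpha) * p_s  # p = alpha * p^x + (1 - alpha) * p^s

    fc = nn.Linear(230, 5)               # o = U p + v, U in R^{5 x 230}
    o = fc(p)                            # confidence scores over relation types
    print(o.argmax(dim=1))               # highest-scoring relation per instance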
The multi-instance learning mechanism module selects the best instance in each instance package as the training and prediction instance during model learning, discards the other instances, and thereby suppresses the influence of noise instances. The training data consist of a series of instance packages, denoted B = {B_1, B_2, …, B_N}, where any instance package B_i contains |B_i| instances. Under this mechanism, the loss function is defined as

J(θ) = −Σ_{i=1}^{N} log p(r_i | b_i^k; θ),  p(r | b^k; θ) = exp(o_kr) / Σ_{j=1}^{n_r} exp(o_kj)

where b_i^k is the best instance selected from instance package B_i, o_kr is the confidence score of that instance for the corresponding relation r, the denominator sums the exponentiated confidence scores o_kj of that instance over all relations j, and θ denotes all parameters of the model. The update principle of θ is

θ ← θ − η · ∂J(θ)/∂θ

where η is the learning rate.
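A minimal sketch of this best-instance objective as reconstructed above: within a bag, only the instance whose score for the bag's weak label is highest contributes to the cross-entropy loss, and the other instances are discarded; the function name and the random scores are illustrative.

    # Minimal sketch of the multi-instance (best-instance) loss for one bag.
    import torch
    import torch.nn.functional as F

    def bag_loss(o, r):
        """o: (|B_i|, n_r) confidence scores of one bag; r: weak relation label."""
        log_p = F.log_softmax(o, dim=1)  # log p(j | b) = o_kj - log sum_j exp(o_kj)
        k = log_p[:, r].argmax()         # select the best instance b^k for r
        return -log_p[k, r]              # -log p(r | b^k; theta)

    o = torch.randn(7, 5, requires_grad=True)  # a bag of 7 instances, n_r = 5
    loss = bag_loss(o, 2)
    loss.backward()  # gradients feed the stochastic gradient descent update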
Therefore, in step 3, the semantic relation extraction model is trained as follows: after initialization, with cross entropy as the loss function, the model parameters are updated iteratively by stochastic gradient descent under the multi-instance learning method; the gradient is evaluated at each iteration to find the optimal weights and biases of every network layer, and after multiple iterations the optimal semantic relation extraction model of this training run is obtained.
Because the model is trained by stochastic gradient descent under different initialization conditions, the prediction results differ from run to run; the predictions of models trained under different initializations can therefore be statistically averaged and taken as the output of the whole system, finally yielding the semantic relation prediction system.
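A minimal sketch of this statistical averaging; `models` would hold the same architecture trained under different random initializations, and the two-argument call signature is an assumption made for illustration.

    # Minimal sketch: average the predicted distributions of several models.
    import torch
    import torch.nn.functional as F

    def ensemble_predict(models, X, S):
        probs = [F.softmax(m(X, S), dim=1) for m in models]  # per-model outputs
        return torch.stack(probs).mean(dim=0).argmax(dim=1)  # averaged prediction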
Specifically, the steps of training the semantic relation extraction model are as follows:
step 301, writing the instance packages in the training sample data set into a data file whose data format conforms to the data-reading interface of the semantic relation extraction model;
step 302, setting training parameters: reading a file path, iteration times and a learning rate, setting the dimension and size of each network layer, and setting an initial training weight and a training bias;
step 303, loading a training file: loading a training set consisting of a semantic relation extraction model definition file, a network layer parameter definition file and training data;
step 304, iteratively updating the semantic relation extraction model by stochastic gradient descent under the multi-instance learning method, checking the gradient at each iteration to find the optimal weights and biases of every network layer, and iterating multiple times to obtain the optimal semantic relation extraction model of this training run (a minimal sketch of this loop follows step 305);
and step 305, taking 30% of the data in the sample set as a test sample set, preprocessing it in the same way as the training sample set, and testing the data in the test sample set with the obtained semantic relation prediction system.
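A minimal self-contained sketch of the loop of steps 302 to 304, with a linear scorer and random bags standing in for the full pipeline and the real instance packages; all hyperparameters are illustrative.

    # Minimal sketch of training with SGD and best-instance selection.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n_r, feat = 5, 230
    model = torch.nn.Linear(feat, n_r)                 # stand-in scorer
    opt = torch.optim.SGD(model.parameters(), lr=0.1)  # eta = 0.1

    bags = [(torch.randn(torch.randint(2, 8, ()).item(), feat),
             torch.randint(0, n_r, ()).item()) for _ in range(20)]

    for epoch in range(5):                             # iteration count illustrative
        for feats, label in bags:
            log_p = F.log_softmax(model(feats), dim=1) # (|B_i|, n_r)
            k = log_p[:, label].argmax()               # best instance in the bag
            loss = -log_p[k, label]                    # cross entropy on it
            opt.zero_grad()
            loss.backward()
            opt.step()                                 # theta <- theta - eta * grad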
Existing relation extraction methods are well developed for English corpora, while research on Chinese corpora lags behind. Existing supervised relation extraction relies on manually annotated data, and the manual annotation process is time-consuming and labor-intensive; to address this, the invention adopts remote supervision, heuristically aligning unlabeled text with a knowledge graph to generate weakly annotated data automatically. Existing remote-supervision-based methods generally use an attention mechanism to suppress the influence of wrongly labeled instances on the extraction result; in essence, the attention mechanism learns the probability distribution over the noisy data from a large amount of data so as to remove noise dynamically. In fact, knowledge graphs in the Chinese domain are developing slowly and are small in scale, so the training data constructed by remote supervision are relatively few and insufficient for an attention mechanism to fit fully; addressing this under-fitting, the invention adopts the multi-instance learning method, a mechanism that needs no learned parameters and is better suited to sparse samples. In addition, current relation extraction methods take a single input, either the word sequence or the dependency path; in fact the two are complementary, the word sequence supplying information the dependency path lacks and the dependency path removing noise participles from the word sequence. The invention uses a knowledge graph of the Chinese entertainment domain and weakly labeled data constructed from the Chinese Wikipedia, and combines the above improvements after preprocessing such as word segmentation and syntactic analysis, thereby solving the existing problems.
The above embodiment is one implementation of the method on noisy sparse Chinese text, but the implementation of the present invention is not limited to it; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent substitution and is included within the scope of the present invention.

Claims (9)

1. A semantic relation extraction method for noisy sparse text is characterized by comprising the following steps:
step 1, establishing a Chinese text training sample set;
step 2, constructing a semantic relation extraction model;
step 3, training a semantic relation extraction model;
step 4, establishing a data set of semantics to be extracted;
and step 5, extracting the semantic relation from the data set of the semantics to be extracted by using the trained semantic relation extraction model.
The training sample set consists of data weakly labeled by remotely supervising Wikipedia corpora with a knowledge graph; each training instance comprises a target entity pair, a word segmentation sequence, a dependency path and a weak supervision label;
the dependency path is the shortest dependency path, defined as the shortest path between the entity pair in the syntactic dependency tree.
2. The semantic relation extraction method according to claim 1, wherein the semantic relation extraction model comprises an input layer, an embedding layer, a convolution layer, a feature fusion layer and a fully connected layer, which are connected in sequence; the input layer provides the input interface for the instance package composed of all word segmentation sequences of a certain entity pair and the corresponding dependency paths; the embedding layer maps the input word segmentation sequence and the corresponding dependency path into a low-dimensional vector space by representation learning; the convolution layer consists of two independent convolution networks that respectively extract the semantic features of all participle sequences and all corresponding dependency paths in the instance package; the feature fusion layer fuses the complementary semantic features from the word segmentation sequence and the corresponding dependency path; and the fully connected layer maps the instance onto the defined relation set to obtain the semantic relation between the entity pair.
3. The semantic relation extraction method according to claim 2, wherein the semantic relation extraction model further comprises a multi-instance learning mechanism module, which takes data from the fully connected layer, feeds the learning result back to the convolution layer and guides the computation of the convolution layer; during model learning, the module selects the best instance in each instance package as the training and prediction instance, discards the other instances, and thereby suppresses the influence of noise instances.
4. The semantic relation extraction method according to claim 3, wherein the semantic relation extraction model is trained as follows: after initialization, with cross entropy as the loss function, the model parameters are updated iteratively by stochastic gradient descent under the multi-instance learning method; the gradient is evaluated at each iteration to find the optimal weights and biases of every network layer, and after multiple iterations the optimal semantic relation extraction model of this training run is obtained.
5. The semantic relation extraction method according to claim 2 or 3, wherein the number of input interfaces of the input layer is 2, the input interfaces corresponding to the word segmentation sequence and the dependency path respectively, and the input of each instance is defined as follows:

x = {x_1, x_2, …, x_m},  s = {s_1, s_2, …, s_n}

where x denotes the input word segmentation sequence, x_i the ith participle in the word segmentation sequence, s the input dependency path, and s_i the ith participle on the dependency path;

the embedding layer maps each participle on the input word segmentation sequence and the dependency path to a vector representation consisting of a word vector, a position vector and a part-of-speech tagging vector, wherein the word vector is pre-trained with the Word2Vec algorithm and carries the semantic information of the participle, the position vector is randomly initialized and carries the position information of the participle within the word segmentation sequence or the dependency path, and the part-of-speech tagging vector is a unit vector carrying the part-of-speech information of the participle; any participle in the word segmentation sequence or the dependency path can be represented by the vector w_i = [v_word : v_position : v_tag], where v_word, v_position and v_tag denote the word vector, the position vector and the part-of-speech tagging vector of the participle respectively, and the dimension of w_i is k;

the participle vector representations are concatenated horizontally in the order of the word segmentation sequence and of the dependency path, giving the vector representations

X = [w_1^x : w_2^x : … : w_m^x],  S = [w_1^s : w_2^s : … : w_n^s]

where X denotes the vector representation of the word segmentation sequence after the embedding layer, w_i^x the vector representation of the ith participle in the word segmentation sequence, S the vector representation of the dependency path after the embedding layer, and w_i^s the vector representation of the ith participle on the dependency path.
6. The semantic relation extraction method according to claim 5, wherein the two independent convolution networks of the convolution layer share the same operation mechanism, each convolution network being equipped with multiple convolution filters, denoted F = {f_1, f_2, …, f_d} with f_i ∈ R^{w×k}, where the number of convolution filters is d and the window size is w; the convolution operation is defined as

c_ij = f_i · s_{j:j+w-1},  1 ≤ i ≤ d,  1 ≤ j ≤ m − w + 1

where f_i is the ith convolution filter, s_{j:j+w-1} is the horizontal concatenation of the jth through (j+w−1)th participle vector representations, and · denotes the matrix dot-product operation; each convolution filter finally produces an intermediate feature vector c_i = {c_i1, c_i2, …, c_i(m−w+1)}, and the sequence of intermediate feature vectors produced by all convolution filters is C = {c_1, c_2, …, c_d}; max pooling, used to extract the most significant feature in each dimension, is defined as

p_i = max_j(c_ij)

where c_ij is the element at the corresponding position of C, finally producing the feature vector p^x of each participle sequence and, likewise, the feature vector p^s of each dependency path.
7. The semantic relation extraction method according to claim 6, wherein the feature fusion layer computes a weighted sum of the feature vectors from the word segmentation sequence and the corresponding dependency path, defined as p = α·p^x + (1 − α)·p^s, where α is the weight parameter, p^s is the feature vector of each dependency path, and p^x is the feature vector of each participle sequence.
8. The semantic relation extraction method according to claim 7, wherein the fully connected layer maps the instance onto the defined relation set to obtain the semantic relation between the entity pair, defined as

o = U·p + v

where U ∈ R^{n_r×d} is the coefficient matrix, v ∈ R^{n_r} is the bias, and o ∈ R^{n_r} contains the confidence scores of all relation types, n_r being the number of relations; the relation with the highest confidence score is taken as the semantic relation between the entity pair.
9. The semantic relation extraction method according to claim 8, wherein the training data in the multi-instance learning mechanism module consist of a series of instance packages, denoted B = {B_1, B_2, …, B_N}, any instance package B_i containing |B_i| instances; under this mechanism, the loss function is defined as

J(θ) = −Σ_{i=1}^{N} log p(r_i | b_i^k; θ),  p(r | b^k; θ) = exp(o_kr) / Σ_{j=1}^{n_r} exp(o_kj)

where b_i^k is the best instance selected from instance package B_i, o_kr is the confidence score of that instance for the corresponding relation r, and θ denotes all parameters of the model; θ is updated according to

θ ← θ − η · ∂J(θ)/∂θ

where η is the learning rate; the process of training the semantic relation extraction model is as follows: after initialization, with cross entropy as the loss function, the model parameters of the semantic relation extraction model are updated iteratively by stochastic gradient descent under the multi-instance learning method, the gradient is evaluated at each iteration to find the optimal weights and biases of every network layer, and after multiple iterations the optimal semantic relation extraction model of this training run is obtained.
CN201910806205.1A, filed 2019-08-29 (priority date 2019-08-29): Semantic relation extraction method for noisy sparse text; granted as CN110674642B, legal status Active.

Priority Applications (1)

CN201910806205.1A (granted as CN110674642B): Semantic relation extraction method for noisy sparse text


Publications (2)

CN110674642A, published 2020-01-10
CN110674642B, published 2023-04-18

Family

Family ID: 69076445

Family Applications (1)

CN201910806205.1A: Semantic relation extraction method for noisy sparse text (priority date 2019-08-29, filing date 2019-08-29, status Active)

Country Status (1)

CN: CN110674642B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009017464A1 (en) * 2007-07-31 2009-02-05 Agency For Science, Technology And Research Relation extraction system
US20190005026A1 (en) * 2016-10-28 2019-01-03 Boe Technology Group Co., Ltd. Information extraction method and apparatus
US20190122145A1 (en) * 2017-10-23 2019-04-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for extracting information
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 A kind of domain entities relation on attributes abstracting method based on distance supervision
CN109783799A (en) * 2018-12-13 2019-05-21 杭州电子科技大学 A kind of relationship extracting method based on semantic dependency figure

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753546A (en) * 2020-06-23 2020-10-09 深圳市华云中盛科技股份有限公司 Document information extraction method and device, computer equipment and storage medium
CN111753546B (en) * 2020-06-23 2024-03-26 深圳市华云中盛科技股份有限公司 Method, device, computer equipment and storage medium for extracting document information
CN113392216A (en) * 2021-06-23 2021-09-14 武汉大学 Remote supervision relation extraction method and device based on consistency text enhancement
CN113392216B (en) * 2021-06-23 2022-06-17 武汉大学 Remote supervision relation extraction method and device based on consistency text enhancement
CN117095825A (en) * 2023-10-20 2023-11-21 鲁东大学 Human immune state prediction method based on multi-instance learning
CN117095825B (en) * 2023-10-20 2024-01-05 鲁东大学 Human immune state prediction method based on multi-instance learning

Also Published As

Publication number Publication date
CN110674642B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN110633467B (en) Semantic relation extraction method based on improved feature fusion
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
WO2020211720A1 (en) Data processing method and pronoun resolution neural network training method
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN111274394A (en) Method, device and equipment for extracting entity relationship and storage medium
CN110674642B (en) Semantic relation extraction method for noisy sparse text
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112270196A (en) Entity relationship identification method and device and electronic equipment
CN112905795A (en) Text intention classification method, device and readable medium
CN112507039A (en) Text understanding method based on external knowledge embedding
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113434683B (en) Text classification method, device, medium and electronic equipment
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN111475622A (en) Text classification method, device, terminal and storage medium
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN115374845A (en) Commodity information reasoning method and device
CN114880307A (en) Structured modeling method for knowledge in open education field
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN114091406A (en) Intelligent text labeling method and system for knowledge extraction
CN113779966A (en) Mongolian emotion analysis method of bidirectional CNN-RNN depth model based on attention
US20240028828A1 (en) Machine learning model architecture and user interface to indicate impact of text ngrams

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant