CN115688752A - Knowledge extraction method based on multi-semantic features - Google Patents
- Publication number
- CN115688752A (application number CN202211131763.0A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- entity
- sentence
- word
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a knowledge extraction method based on multi-semantic features, comprising the following steps: step one, semantic vector representation; step two, feature encoding; step three, entity recognition; step four, relation classification. Compared with the prior art, the invention provides a new relation-triple extraction method. The method first obtains word vector representations from a pre-trained language model, then encodes character-level features with a Bi-LSTM, and encodes contextual semantic information with a multi-head self-attention mechanism to capture the internal structure and long-distance dependencies of a sentence. Semantic features of different levels are then concatenated into an efficient semantic representation, providing more accurate feature vectors for entity recognition and relation classification and effectively improving the performance of relation-triple extraction.
Description
Technical Field
The invention belongs to the field of knowledge graph knowledge extraction research, and particularly relates to a knowledge extraction method based on multi-semantic features.
Background
A knowledge graph (KG) is a structured semantic knowledge base that stores information in symbolic form. The knowledge base consists of entity nodes and relations, represented as triples (h, r, t). Most existing knowledge graphs draw on text data from the internet, and 60% to 70% of internet text data exists as unstructured electronic documents. Knowledge extraction is the important task of automatically extracting effective information from unstructured text; it is also a key step in constructing a knowledge graph and directly affects the quality of the constructed graph and its subsequent applications. In recent years researchers have steadily intensified work on knowledge extraction, the related techniques have gradually improved, and knowledge extraction has become an important technical basis for intelligent applications such as sentiment analysis, intelligent question answering, personalized recommendation, and machine translation. How to automatically and accurately acquire knowledge from heterogeneous mass data sources has become a hot research problem in both academia and industry.
Knowledge extraction refers to detecting entities in an information source and identifying the semantic relations between them. Depending on the order in which the two subtasks, entity recognition and relation classification, are completed, knowledge extraction methods can be divided into pipeline methods and joint learning methods. Pipeline methods are easy to implement, and each component is relatively flexible. Joint learning methods can model the latent dependency between the two tasks and alleviate problems such as lost interaction and information redundancy. Early joint extraction methods were mainly based on traditional machine learning; they achieved some success but required manually constructed features.
Recently, with the development of deep neural networks, knowledge extraction methods have achieved a succession of state-of-the-art results. The long short-term memory (LSTM) network can protect and control the flow of information and effectively capture long-term dependencies in sentences, so knowledge extraction models based on LSTM and its variants are widely applied and have made notable breakthroughs. Nevertheless, for the problem of potential context information loss, the low accuracy of the obtained feature vectors remains an important research focus.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problem of potential context information loss, a knowledge extraction method based on multi-semantic features is provided.
In order to achieve the purpose, the invention adopts the technical scheme that:
A knowledge extraction method based on multi-semantic features, combining a Bi-directional long short-term memory network (Bi-LSTM) with a self-attention mechanism. A rich, multi-level, multi-space semantic vector representation is obtained by concatenating the lexical vector with the context vector, after which the Bi-LSTM better captures bidirectional semantic dependencies. Entity recognition is then realized with a conditional random field (CRF), and the predicted entity labels, concatenated with the underlying feature vectors, are fed into a sigmoid layer to assign one or more relations to each entity.
The knowledge extraction method based on the multi-semantic features comprises the following steps:
step one, semantic vector representation
Given a sentence w = w_1, ..., w_n, where w_i (i = 1, 2, ..., n) denotes a word in the sentence, word-embedding preprocessing is performed with the GloVe pre-trained language model, converting each word w_i into a vector w^Glove so that it can be processed by the neural network model.
Then a character-level vector representation is computed for each word: each character of the word is taken as input, and a Bi-LSTM neural network captures the word's morphological features to obtain the character-level vector w^char. The word embedding and the character embedding are concatenated to obtain the lexical vector X.
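The word-plus-character concatenation above can be sketched as follows. This is a minimal NumPy sketch: the 100-dimensional GloVe size and the 2×25-dimensional character vector follow the embodiment, but the zero vectors merely stand in for real GloVe and char-Bi-LSTM outputs, which are not reproduced here.

```python
import numpy as np

def lexical_vector(w_glove, w_char):
    """Concatenate the GloVe word embedding with the character-level
    vector from the char Bi-LSTM to form the lexical vector X."""
    return np.concatenate([w_glove, w_char], axis=-1)

w_glove = np.zeros(100)  # 100-d GloVe vector (dimension from the embodiment)
w_char = np.zeros(50)    # 2 x 25 hidden units: forward + backward final states
X_w = lexical_vector(w_glove, w_char)  # 150-d lexical vector per word
```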
Next, the invention obtains context embeddings by mapping the context information into multiple semantic spaces through a multi-head self-attention mechanism. The lexical vector matrix X is linearly transformed into a query vector Q ∈ R^{n×d} and a key-value pair K ∈ R^{n×d}, V ∈ R^{n×d}; the scaled dot-product attention is computed as:

Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

Multi-head attention projects the queries, keys and values linearly h times with different initialization matrices, and the attention mechanisms are executed in parallel. For the i-th head, the projection matrices for the query, key and value are W_i^Q ∈ R^{d×(d/h)}, W_i^K ∈ R^{d×(d/h)} and W_i^V ∈ R^{d×(d/h)}. The scaled dot-product attention of each head is then:

Head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)
The output vectors of all h parallel heads are concatenated, and the mixed semantic representation is output:

X_context = Concat(Head_1, Head_2, ..., Head_h) · W_A    (3)

where X_context ∈ R^{n×d} and W_A ∈ R^{d×d} is a weight matrix for the linear projection.
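Equations (1)-(3) can be illustrated with a minimal NumPy sketch. The random weights stand in for the learned projection matrices, and all shapes are illustrative assumptions; splitting the projected d-dimensional vectors into h contiguous slices is equivalent to applying per-head matrices W_i^Q, W_i^K, W_i^V of shape d×(d/h).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_A, h):
    """Equations (1)-(3): per-head scaled dot-product attention,
    concatenation of the h heads, then a linear projection W_A."""
    n, d = X.shape
    dk = d // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each (n, d)
    heads = []
    for i in range(h):
        q = Q[:, i * dk:(i + 1) * dk]             # head i's slice, (n, d/h)
        k = K[:, i * dk:(i + 1) * dk]
        v = V[:, i * dk:(i + 1) * dk]
        A = softmax(q @ k.T / np.sqrt(dk))        # scaled dot-product attention
        heads.append(A @ v)
    return np.concatenate(heads, axis=1) @ W_A    # X_context, (n, d)

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) for _ in range(4)]  # W_Q, W_K, W_V, W_A
X_context = multi_head_self_attention(X, *Ws, h=h)
```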
Finally, to mine the rich semantic information of the sentence, the lexical vector and the context-aware semantic vector are concatenated to obtain the final sentence vector representation, which serves as the input to the next step.
Step two, feature coding
The LSTM unit consists of three multiplicative gates: the forget gate decides which information to discard, the input gate decides which information to update, and the output gate emits the updated state. LSTM protects and controls the information flow mainly through these three gates. The invention obtains past and future information through the Bi-directional extension Bi-LSTM. Let →h_i and ←h_i denote the outputs of the forward and backward LSTM at time i; the overall output of Bi-LSTM at time i is:

h_i = [→h_i ; ←h_i]    (4)
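The Bi-directional wiring of equation (4) can be sketched as follows. Note that a toy tanh recurrent cell stands in for the full gated LSTM (an assumption made to keep the sketch short); only the forward/backward concatenation is illustrated.

```python
import numpy as np

def rnn_scan(X, W, U, b):
    """Toy tanh recurrent cell standing in for an LSTM (assumption:
    the forget/input/output gating is omitted here)."""
    h = np.zeros(W.shape[0])
    out = []
    for x in X:
        h = np.tanh(W @ h + U @ x + b)
        out.append(h)
    return np.stack(out)

def bi_rnn(X, params_fwd, params_bwd):
    fwd = rnn_scan(X, *params_fwd)              # left-to-right pass
    bwd = rnn_scan(X[::-1], *params_bwd)[::-1]  # right-to-left pass, re-aligned
    return np.concatenate([fwd, bwd], axis=1)   # h_i = [fwd_i ; bwd_i], eq. (4)

rng = np.random.default_rng(1)
n, d_in, d_h = 4, 6, 3
X = rng.normal(size=(n, d_in))
pf = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h))
pb = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h))
H = bi_rnn(X, pf, pb)  # (n, 2 * d_h)
```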
step three, entity recognition
The invention formulates the entity recognition task as a sequence tagging problem. An entity generally consists of several consecutive words in a sentence, so the BIO tagging scheme is adopted: B, I and O respectively mark the beginning, inside and outside of an entity, and a tag is assigned to each word. The most likely entity tag for each word is first calculated via a CRF. The tag score is computed as:
s^(e)(h_i) = V^(e) f(W^(e) h_i + b^(e))    (5)

where f(·) is an element-wise activation function, V^(e) ∈ R^{p×l}, W^(e) ∈ R^{l×2d}, b^(e) ∈ R^l, d is the hidden size of the LSTM, p is the number of entity types, and l is the layer width.

The invention then applies a linear-chain CRF, which takes the order between tags into account. For an input sentence w with candidate tag sequence y = y_1, ..., y_n, the sequence score is expressed as:

s(w, y) = Σ_{i=1}^{n} s^(e)(h_i)[y_i] + Σ_{i=1}^{n-1} T[y_i, y_{i+1}]    (6)

where s^(e)(h_i)[y_i] denotes the tag score when w_i is tagged y_i, and T denotes the transition score from y_i to y_{i+1}. Finally, the probability that the sentence w is tagged with the sequence y is calculated as:

Pr(y | w) = exp(s(w, y)) / Σ_{y'} exp(s(w, y'))    (7)
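The scoring of equations (6) and (7) can be illustrated by brute-force enumeration over tag sequences. This is a sketch for tiny inputs with random scores; practical CRFs compute the normalizer with the forward algorithm instead of enumerating.

```python
import numpy as np
from itertools import product

def crf_sequence_score(emissions, T, y):
    """Equation (6): sum of per-word tag scores plus transition scores."""
    s = sum(emissions[i, y[i]] for i in range(len(y)))
    s += sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def crf_log_prob(emissions, T, y):
    """Equation (7): softmax over all candidate tag sequences
    (brute-force; real CRFs use the forward algorithm)."""
    n, l = emissions.shape
    scores = [crf_sequence_score(emissions, T, seq)
              for seq in product(range(l), repeat=n)]
    Z = np.logaddexp.reduce(scores)               # log normalizer
    return crf_sequence_score(emissions, T, y) - Z

rng = np.random.default_rng(2)
emissions = rng.normal(size=(3, 3))  # 3 words, 3 tags (B, I, O)
T = rng.normal(size=(3, 3))          # transition scores between tags
logp = crf_log_prob(emissions, T, (0, 1, 2))
```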
since the prediction of the relationship classification depends to some extent on the result of the entity identification, the obtained entity information is taken as an embedded vector g i Input to the next step, i.e. z i =[h i ;g i ]。
Step four, relation classification
The invention treats the relation extraction task as a multi-head selection problem, which can effectively identify all relation triples in a sentence and thereby extract overlapping relations. In the invention, any relation type may hold between two entities, and each semantic relation is predicted independently.
A sentence w and a set of relation labels R are taken as input in order to identify the relation triples in the sentence. Given a relation label r_k, the score between two entities w_i and w_j is calculated:
s^(r)(w_j, w_i, r_k) = V^(r) f(U^(r) z_j + W^(r) z_i + b^(r))    (8)
where V^(r), U^(r), W^(r) and b^(r) are the weights and bias of the relation classification layer, d is the hidden size of the LSTM, and f(·) denotes the activation function. The probability that words w_j and w_i stand in relation type r_k is defined as
Pr(head = w_j, label = r_k | w_i) = σ(s^(r)(w_j, w_i, r_k))    (9)
where σ is the sigmoid function. The sigmoid assumes that all relations are independent of each other and does not require the probabilities of all relations to sum to 1. When a probability exceeds 0.5, the corresponding relation between the two entities is considered to exist.
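Equations (8) and (9) with the 0.5 decision rule can be sketched as follows. All weights are random placeholders, and tanh is assumed for f(·) (the text does not fix the activation); because each relation gets its own independent sigmoid, a word may take part in several triples, which is how overlapping relations are extracted.

```python
import numpy as np

def relation_probs(Z, U, W, V, b):
    """Equations (8)-(9): score every (head word j, word i, relation k)
    combination and squash each score with an independent sigmoid."""
    n = Z.shape[0]
    probs = np.zeros((n, n, V.shape[0]))
    for i in range(n):
        for j in range(n):
            s = V @ np.tanh(U @ Z[j] + W @ Z[i] + b)  # tanh as f(.) (assumption)
            probs[j, i] = 1.0 / (1.0 + np.exp(-s))    # sigmoid, eq. (9)
    return probs

rng = np.random.default_rng(3)
n, dz, dr, n_rel = 4, 5, 6, 3
Z = rng.normal(size=(n, dz))       # z_i = [h_i ; g_i]: Bi-LSTM state + label embedding
U, W = rng.normal(size=(dr, dz)), rng.normal(size=(dr, dz))
V, b = rng.normal(size=(n_rel, dr)), np.zeros(dr)
P = relation_probs(Z, U, W, V, b)
pred = P > 0.5                     # decision rule from the text
```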
Compared with the prior art, the method has the following advantages:
most of the existing knowledge extraction methods consider the characteristics of a vocabulary level and ignore potential contextual semantic information, but the performance of a model directly depends on the accuracy of the obtained characteristics. Aiming at the problem, the invention provides a new relation triple extraction method. The method comprises the steps of firstly obtaining word vector representation through a pre-training language model, then utilizing Bi-LSTM to carry out feature coding on character-level features, and coding context semantic information through a multi-head self-attention mechanism to obtain the internal structure and long-distance dependency relationship of a sentence. And then, the semantic features of different levels are spliced to obtain efficient semantic representation, more accurate feature vectors are provided for entity identification and relationship classification, and the performance of extracting relationship triples is effectively improved.
Drawings
FIG. 1 is a flow chart of extracting relational triples based on multi-semantic feature vectors according to the present invention.
FIG. 2 is a diagram illustrating a structure of embedding character vectors according to the present invention.
Fig. 3 is a diagram of the self-attention mechanism of the present invention.
FIG. 4 is a diagram of a sentence semantic representation structure according to the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings: the embodiment is implemented on the premise of the technical scheme of the invention, and gives a detailed implementation scheme and a specific operation process.
As shown in fig. 1, the present invention provides a knowledge extraction method based on multi-semantic features, which specifically includes the following steps:
the method comprises the following steps: and (4) preprocessing data. The CoNLL04 dataset consists of 1,441 sentences from the Newswer article, annotated as four entity types (Location, organization, peoples, other) and five relationship types (Work-for, kill, organization-Based-In, lives-In, located-In). These contents are randomly divided into a training set, a validation set, and a test set.
Step two: and (4) semantic vector representation. Firstly, inputting a sentence, carrying out word embedding pretreatment on an original corpus by using Glove, and converting the sentence into a 100-dimensional vector matrixNext, each character of the word is used as input through a Bi-LSTM neural network, as shown in FIG. 2. Set the hidden dimension of the LSTM to 25, and then join the two final states together to get the character-level vector representation w char . And splicing the word vector and the character vector to obtain the vocabulary vector representation.
Fig. 3 shows the structure of the self-attention mechanism. The lexical vector X is input into the multi-head attention mechanism, which maps the context information into multiple semantic spaces to obtain the context embedding X_context.
To mine the rich semantic information of sentences, the output vectors of the different modules are concatenated to obtain the final sentence vector representation; the specific structure is shown in FIG. 4. The context embedding, obtained by feeding the lexical embedding into the multi-head self-attention mechanism, is concatenated with the lexical embedding at a connection layer, yielding a vector representation with multiple levels of semantic information.
Step three: the sentence vectors obtained in step two are input into the Bi-LSTM for feature encoding, yielding a sentence representation h_i that captures long-distance dependencies.
Step four: the most likely entity tag for each word is calculated by the linear-chain CRF. For example, "Grande Isle" is tagged B-Location and I-Location, respectively. If a word does not belong to any entity, it is tagged O.
Step five: the type of relationship between the two entities is calculated by the above equation (8) and equation (9).
To verify the effectiveness of the above method, the inventive method was evaluated on the CoNLL04 dataset and compared with baseline results. The model was implemented with Python and the TensorFlow machine learning library. To avoid overfitting, different dropout rates are used at the input and hidden layers, and the model is optimized with the Adam optimizer. Training stops when the results on the validation set fail to improve for 30 consecutive epochs. More detailed hyper-parameter settings are shown in Table 1 below.
TABLE 1 hyper-parameter settings
Precision (P), recall (R), and F1 score are used to evaluate the predicted results on the dataset. An entity is judged correct if both its boundary and its type are correct; a relation is correct when both the relation type and the argument entities are correct.
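The strict criterion above can be sketched as micro-averaged P/R/F1 over exact-match triples. The example triples below are hypothetical and only illustrate the computation; they are not drawn from the dataset.

```python
def prf1(pred, gold):
    """Micro precision/recall/F1 over predicted vs. gold triples:
    a predicted triple counts only if it matches a gold triple exactly."""
    tp = len(pred & gold)                     # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical predicted and gold relation triples
pred = {("Smith", "Work-For", "ACME"), ("Smith", "Live-In", "Boston")}
gold = {("Smith", "Work-For", "ACME"), ("ACME", "OrgBased-In", "Boston")}
p, r, f1 = prf1(pred, gold)  # one of two predictions is correct
```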
The experimental results on the CoNLL04 dataset are shown in Table 2. The results show that the method successfully shares the associated information of entities and relations and learns complex long-distance correlations. By exploiting the importance of multi-feature semantic information, the invention improves the performance of the knowledge-triple extraction task.
Table 2 Experimental results on the CoNLL04 dataset
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the scope of the present invention is defined by the claims.
Claims (1)
1. A knowledge extraction method based on multi-semantic features is characterized in that: the knowledge extraction method based on the multi-semantic features comprises the following steps:
step one, semantic vector representation
given a sentence w = w_1, ..., w_n, where w_i (i = 1, 2, ..., n) denotes a word in the sentence, word-embedding preprocessing is performed with the GloVe pre-trained language model, and each word w_i is converted into a vector w^Glove so that it can be processed by the neural network model;
then, a character-level vector representation is computed for each word: each character of the word is taken as input, and a Bi-LSTM neural network captures the word's morphological features to obtain the character-level vector w^char; the word embedding and the character embedding are concatenated to obtain the lexical vector X;
then, context embeddings are obtained by mapping the context information into multiple semantic spaces through a multi-head self-attention mechanism; the lexical vector X is linearly transformed into a query vector Q ∈ R^{n×d} and a key-value pair K ∈ R^{n×d}, V ∈ R^{n×d}, and the scaled dot-product attention is computed as:

Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

multi-head attention projects the queries, keys and values linearly h times with different initialization matrices, and the attention mechanisms are executed in parallel; for the i-th head, the projection matrices for the query, key and value are W_i^Q ∈ R^{d×(d/h)}, W_i^K ∈ R^{d×(d/h)} and W_i^V ∈ R^{d×(d/h)}; the scaled dot-product attention of each head is then:

Head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)
the output vectors of all h parallel heads are concatenated, and finally the mixed semantic representation is output:

X_context = Concat(Head_1, Head_2, ..., Head_h) · W_A    (3)

wherein X_context ∈ R^{n×d} and W_A ∈ R^{d×d} is a weight matrix for the linear projection;
finally, in order to mine the rich semantic information of the sentence, the vocabulary vector and the semantic vector containing the context information are spliced to obtain the final sentence vector representation, namely the input of the next step;
step two, feature coding
the LSTM unit consists of three multiplicative gates: the forget gate decides which information to discard, the input gate decides which information to update, and the output gate emits the updated state; LSTM protects and controls the information flow mainly through these three gates, and past and future information is obtained through the Bi-directional extension Bi-LSTM; →h_i and ←h_i denote the outputs of the forward and backward LSTM at time i, and the overall output of Bi-LSTM at time i is:

h_i = [→h_i ; ←h_i]    (4)
step three, entity recognition
the entity recognition task is formulated as a sequence tagging problem; an entity generally consists of several consecutive words in a sentence, so the BIO tagging scheme is adopted, i.e. B, I and O respectively mark the beginning, inside and outside of an entity, and a tag is assigned to each word; the most likely entity tag for each word is first calculated via a CRF, and the tag score is computed as:
s^(e)(h_i) = V^(e) f(W^(e) h_i + b^(e))    (5)

wherein f(·) is an element-wise activation function, V^(e) ∈ R^{p×l}, W^(e) ∈ R^{l×2d}, b^(e) ∈ R^l, d is the hidden size of the LSTM, p is the number of entity types, and l is the layer width;

a linear-chain CRF is then applied, taking the order between tags into account; for an input sentence w with candidate tag sequence y = y_1, ..., y_n, the sequence score is expressed as:

s(w, y) = Σ_{i=1}^{n} s^(e)(h_i)[y_i] + Σ_{i=1}^{n-1} T[y_i, y_{i+1}]    (6)

wherein s^(e)(h_i)[y_i] denotes the tag score when w_i is tagged y_i, and T denotes the transition score from y_i to y_{i+1}; finally, the probability that the sentence w is tagged with the sequence y is calculated as:

Pr(y | w) = exp(s(w, y)) / Σ_{y'} exp(s(w, y'))    (7)
since the prediction of the relationship classification depends to some extent on the result of the entity identification, the obtained entity information is taken as an embedded vector g i Input to the next step, i.e. z i =[h i ;g i ];
Step four, relation classification
the relation extraction task is treated as a multi-head selection problem, which can effectively identify all relation triples in a sentence and thereby extract overlapping relations; any relation type may hold between two entities, and each semantic relation is predicted independently;
a sentence w and a set of relation labels R are taken as input in order to identify the relation triples in the sentence; given a relation label r_k, the score between two entities w_i and w_j is calculated:
s^(r)(w_j, w_i, r_k) = V^(r) f(U^(r) z_j + W^(r) z_i + b^(r))    (8)
wherein V^(r), U^(r), W^(r) and b^(r) are the weights and bias of the relation classification layer, d is the hidden size of the LSTM, and f(·) denotes the activation function; the probability that words w_j and w_i stand in relation type r_k is defined as
Pr(head = w_j, label = r_k | w_i) = σ(s^(r)(w_j, w_i, r_k))    (9)
where σ is the sigmoid function; the sigmoid assumes that all relations are independent of each other and does not require the probabilities of all relations to sum to 1; when a probability exceeds 0.5, the corresponding relation between the two entities is considered to exist.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211131763.0A CN115688752A (en) | 2022-09-16 | 2022-09-16 | Knowledge extraction method based on multi-semantic features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115688752A true CN115688752A (en) | 2023-02-03 |
Family
ID=85063245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211131763.0A Pending CN115688752A (en) | 2022-09-16 | 2022-09-16 | Knowledge extraction method based on multi-semantic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115688752A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116151241A (en) * | 2023-04-19 | 2023-05-23 | 湖南马栏山视频先进技术研究院有限公司 | Entity identification method and device |
CN116629264A (en) * | 2023-05-24 | 2023-08-22 | 成都信息工程大学 | Relation extraction method based on multiple word embedding and multi-head self-attention mechanism |
CN116629264B (en) * | 2023-05-24 | 2024-01-23 | 成都信息工程大学 | Relation extraction method based on multiple word embedding and multi-head self-attention mechanism |
CN116595992A (en) * | 2023-07-19 | 2023-08-15 | 江西师范大学 | Single-step extraction method for terms and types of binary groups and model thereof |
CN116595992B (en) * | 2023-07-19 | 2023-09-19 | 江西师范大学 | Single-step extraction method for terms and types of binary groups and model thereof |
CN117744787A (en) * | 2024-02-20 | 2024-03-22 | 中国电子科技集团公司第十研究所 | Intelligent measurement method for first-order research rule knowledge rationality |
CN117744787B (en) * | 2024-02-20 | 2024-05-07 | 中国电子科技集团公司第十研究所 | Intelligent measurement method for first-order research rule knowledge rationality |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |