CN110866121A - Knowledge graph construction method for power field - Google Patents
Knowledge graph construction method for power field
- Publication number
- CN110866121A CN110866121A CN201910917049.6A CN201910917049A CN110866121A CN 110866121 A CN110866121 A CN 110866121A CN 201910917049 A CN201910917049 A CN 201910917049A CN 110866121 A CN110866121 A CN 110866121A
- Authority
- CN
- China
- Prior art keywords
- relation
- entity
- vector
- relationship
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The method constructs a knowledge graph for the electric power field. It solves the noise-reduction problem that arises when a training set is built by remote supervised learning, improves the accuracy of entity recognition, and can construct an accurate knowledge graph and knowledge base for the electric power field, providing a basis for knowledge discovery and a foundation for extending the knowledge graph in this field.
Description
Technical Field
The invention belongs to the field of electric power planning, and particularly relates to a knowledge graph construction method for the electric power field.
Background
A knowledge graph aims to describe the entities or concepts existing in the real world and the relationships among them, forming a huge semantic network. Two entities together with the relationship between them form a triple of the knowledge graph. In knowledge graph construction, the main tasks are named entity recognition and entity relation extraction.
Named entity recognition, also known as entity extraction or entity partitioning, is a subfield of natural language processing. Its goal is to extract the named entities mentioned in unstructured text, including but not limited to person names, organization names, place names, medical terms, regulatory terms, times, quantities and monetary values. The prior art in named entity recognition is well established: with the appearance of word vectors, convolutional neural network models and bidirectional long short-term memory networks combined with conditional random field models have been applied to named entity recognition and achieve high accuracy. Existing named entity recognition performs well on general text, but in professional fields, owing to the particularity of domain vocabulary, the recognition accuracy for domain proper nouns with specific grammatical structures is not high.
In entity relation extraction, the remote (distant) supervised learning approach has drawn wide attention. Supervised learning presupposes a large amount of manually labeled corpora, and neural-network-based deep learning methods usually need quite large labeled corpora for model training. To address the shortage of labeled data, Mintz et al. proposed a remote supervised learning method that uses existing knowledge in a knowledge base and aligns it with text to automatically generate large amounts of labeled data, which is then used to train a relation extraction neural network model.
Because the training set constructed by the remote supervision method rests on an overly strong assumption, the generated training text is relatively noisy. In addition, if the domain knowledge base is deficient, the method needs considerable manual work to construct an initial corpus, which is difficult to complete without domain experts; the power field currently lacks such a knowledge base.
Disclosure of Invention
The invention aims to construct a knowledge graph for the electric power field. Because structured data in power texts are scarce, the relations between entities must be extracted from a large amount of unstructured data to construct the knowledge graph. The training set is constructed by a remote supervision method, so the first problem to solve is denoising the training set constructed by remote supervised learning. In addition, for named entity recognition, the accuracy of entity recognition is improved by adding a professional dictionary of the power field.
The invention provides a power field knowledge graph construction method, which at least comprises the following steps:
step 1, acquiring structured, semi-structured and unstructured data of the power field from the network;
step 2, manually screening the obtained structured data to serve as extracted triples and as the knowledge base for remote supervised learning, and segmenting the sentences of semi-structured and unstructured data into words with the natural language processing tool LTP;
step 3, performing named entity recognition by a deep learning Chinese named entity recognition method;
and step 4, extracting entity relations by a remote supervised learning method, thereby constructing the knowledge graph for the power field.
A knowledge graph for the electric power field is constructed; because structured data in power texts are scarce, the relations among entities must be extracted from a large amount of unstructured data. The problem to solve first is denoising the training set constructed by remote supervised learning; in addition, for named entity recognition, accuracy is improved by adding a professional dictionary of the power field. In summary, the method solves the noise-reduction problem of the remote-supervision training set, improves the accuracy of entity recognition, and constructs an accurate knowledge graph and knowledge base for the power field, providing a basis for knowledge discovery and a foundation for extending the knowledge graph in this field.
Drawings
FIG. 1 shows the BiLSTM-CRF model used in named entity recognition.
FIG. 2 shows the sentence vector representation module in entity relation extraction.
FIG. 3 shows the remote supervised learning extraction framework.
Detailed Description
For a better understanding of the present invention, the method and system of the present invention will be further described with reference to the following description of the embodiments in conjunction with the accompanying drawings.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be understood by those skilled in the art, however, that the present invention may be practiced without these specific details. In the embodiments, well-known methods, procedures, components, and so forth have not been described in detail as not to unnecessarily obscure the embodiments.
The invention provides a power field knowledge graph construction method, which at least comprises the following steps:
step 1, acquiring structured, semi-structured and unstructured data of the power field from the network;
step 2, manually screening the obtained structured data to serve as extracted triples and as the knowledge base for remote supervised learning, and segmenting the sentences of semi-structured and unstructured data into words with the natural language processing tool LTP;
step 3, performing named entity recognition by a deep learning Chinese named entity recognition method;
and step 4, extracting entity relations by a remote supervised learning method, thereby constructing the knowledge graph for the power field.
Preferably, in step 1, structured, semi-structured and unstructured data of the power field are acquired from the network; specifically, encyclopedia texts of the power field are crawled on a breadth-first principle.
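The breadth-first crawling principle mentioned here can be sketched as follows. This is a minimal illustration with a stubbed link graph: `bfs_crawl`, `get_links` and the page names are invented for the example, and a real crawler would fetch and parse encyclopedia pages instead.

```python
from collections import deque

def bfs_crawl(seed_pages, get_links, max_pages=100):
    """Breadth-first traversal over encyclopedia pages.

    seed_pages: iterable of starting page titles.
    get_links:  callable returning the out-links of a page
                (a real crawler would fetch and parse HTML here).
    """
    visited, order = set(), []
    queue = deque(seed_pages)
    while queue and len(order) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue                  # skip pages already crawled
        visited.add(page)
        order.append(page)            # pages are visited level by level
        queue.extend(get_links(page))
    return order

# Toy link graph standing in for encyclopedia pages on power topics.
links = {
    "transformer": ["winding", "insulation"],
    "winding": ["insulation"],
    "insulation": [],
}
pages = bfs_crawl(["transformer"], lambda p: links.get(p, []))
# pages: ["transformer", "winding", "insulation"]
```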
Preferably, in step 3, named entity recognition is performed by a deep learning Chinese named entity recognition method, which specifically includes:
step 3-1, carrying out distributed expression on the words;
step 3-2, carrying out model training by using a deep learning network under supervision;
and 3-3, labeling each word in the sequence by using the context information.
Preferably, in step 3-2, supervised model training is performed by using a deep learning network, which specifically includes:
Taking a power-field data set as the training corpus, word vectors are trained in Skip-gram mode. The training network is a three-layer neural network consisting of an input layer, a hidden layer and an output layer, the hidden layer being set empirically to 100 neurons. The neural network module adopts BiLSTM, treats the semantic association of words in a sentence as a sequence problem, and stores historical information for learning during training. Taking context correlation into account, sequence labeling is performed with a CRF (conditional random field) model, which handles the correlations at the output level.
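As a side note on the Skip-gram mode named above, the (center, context) pair generation that Skip-gram trains on can be sketched as follows. This is a toy illustration: the function name and tokens are invented, and real training would feed such pairs into a word2vec implementation rather than this hand-rolled sketch.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by Skip-gram:
    each word predicts the words within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["power", "grid", "fault"], window=1)
# pairs: [("power","grid"), ("grid","power"), ("grid","fault"), ("fault","grid")]
```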
Preferably, in step 4, the entity relationship extraction is performed by using a remote supervised learning method, which specifically includes:
step 4-1, entity alignment is carried out, and a relation instance set for training and testing is constructed in an entity alignment mode;
mapping the triple relations in the knowledge base to a training document for entity alignment, and generating a relation instance set Q:
Q = {q_n | q_n = (s_m, e_i, r_k, e_j), s_m ∈ D}  (1)
where e_i and e_j are two entities, r_k is the relation of the two entities in the knowledge base, s_m is a sentence in corpus D containing the entity pair, and q_n is a generated relation instance;
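The entity-alignment step that produces Q can be sketched as follows. This is an illustrative toy, not the patent's implementation: substring matching stands in for real entity mention detection, and the knowledge-base triple and sentences are invented.

```python
def align(triples, sentences):
    """Distant supervision: a sentence containing both entities of a
    KB triple (e_i, r_k, e_j) yields a relation instance q_n = (s_m, e_i, r_k, e_j)."""
    instances = []
    for s in sentences:
        for e_i, r_k, e_j in triples:
            if e_i in s and e_j in s:        # naive co-occurrence check
                instances.append((s, e_i, r_k, e_j))
    return instances

kb = [("transformer", "part_of", "substation")]
corpus = ["the transformer sits inside the substation", "load flow study"]
Q = align(kb, corpus)
# Q holds one relation instance: the first sentence mentions both entities.
```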
and 4-2, performing intra-sentence relation extraction by adopting a relation extraction model based on an attention mechanism.
Preferably, in step 4-1, entity alignment is performed and the relation instance sets for training and testing are constructed by entity alignment, specifically including:
step 4-1-1, a mapping step: each entity is mapped into the sentences of the text, the co-occurrence of a pair of entities in one sentence is taken as a relation instance, and several relation instances sharing the same relation form a relation bag;
step 4-1-2, a training step: entities are aligned using the triples extracted from the structured data together with the encyclopedia texts;
and step 4-1-3, a testing step: candidate relation pairs are generated from all entities in the test set by pairwise permutation, and corresponding test instances and relation instances are generated from the candidate pairs and the test corpus by entity alignment.
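The pairwise permutation of test-set entities in the testing step can be sketched with the standard library; the entity names are invented for the example.

```python
from itertools import permutations

def candidate_pairs(entities):
    """All ordered entity pairs from the test set; each pair is later
    aligned against the test corpus to build test instances."""
    return list(permutations(entities, 2))

pairs = candidate_pairs(["breaker", "busbar", "relay"])
# 3 entities -> 3 * 2 = 6 ordered candidate pairs
```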
Preferably, in step 4-2, intra-sentence relation extraction is performed with a relation extraction model based on an attention mechanism;
the attention-based relation extraction model mainly comprises two parts: a sentence vector representation module and a sentence-level attention mechanism module;
the sentence vector representation module is used to obtain the feature representation of each relation instance in a relation bag;
and the sentence-level attention mechanism module is used to measure the importance of each relation instance relative to its relation bag.
Preferably, in the sentence vector representation module, word vectors are obtained with the word2vec method, and word position vectors capture the relative position of each word with respect to the two entities in the sentence;
the word vector of the ith word in the sentence is denoted w_i, p_i^1 and p_i^2 denote the position vectors of word w_i relative to the two entities, and t_i, the final representation of word w_i, is their concatenation, as shown in equation (2):
t_i = [w_i ; p_i^1 ; p_i^2]  (2)
BiLSTM is used to obtain the forward state h_i^→ and backward state h_i^← of each word, and their concatenation is taken as the state h_i of the word, as shown in equation (3):
h_i = [h_i^→ ; h_i^←]  (3)
after the state information of all words is obtained, the vector s_i of the sentence is determined jointly by all the word states inside it (equation (4)).
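The concatenated word representation t_i described above can be sketched as follows. This is a toy with randomly initialized embeddings; all dimensions and the position-offset range are assumed for illustration and are not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_p, n_words = 4, 2, 5          # toy embedding sizes (assumed)
W = rng.normal(size=(n_words, d_w))  # word embedding table
P = rng.normal(size=(11, d_p))       # position embeddings for offsets -5..5

def token_repr(word_id, pos_e1, pos_e2):
    """t_i = [w_i ; p_i^1 ; p_i^2]: the word vector concatenated with the
    relative-position vectors to the two entities (offsets shifted to >= 0)."""
    return np.concatenate([W[word_id], P[pos_e1 + 5], P[pos_e2 + 5]])

t = token_repr(2, -1, 3)             # word 2, one before entity 1, three after entity 2
# t has dimension d_w + 2 * d_p = 8
```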
Preferably, the sentence-level attention mechanism module comprises: an attention calculation unit, an entity feature representation layer, a relation-bag feature representation layer, a hidden layer and an output layer;
the attention calculation unit computes the weights of the different instances in a relation bag to obtain a vector representation of each bag. In the weight calculation, three kinds of feature information are fused on top of the sentence vector: a concept vector, a sentence marker vector and a target relation vector, where the concept vector e_i includes a descriptor vector c_i and a hypernym-hyponym marker vector q_i;
the relation-bag feature representation layer obtains the features of a relation bag, determined jointly by the relation instances in the bag; a bag S consists of n relation instances, S = {s_1, s_2, ..., s_n}, and the feature vector u of bag S is obtained by formula (5):
u = Σ_k α_k s_k  (5)
where α_k is the weight of the kth relation instance and s_k is its feature vector;
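The attention-weighted bag feature of formula (5) can be sketched as follows. This is a minimal numeric illustration: the instance vectors and scores are invented, and a real model would compute the scores from the fused feature information described above.

```python
import numpy as np

def bag_feature(S, scores):
    """u = sum_k alpha_k * s_k with alpha = softmax(scores): the bag
    vector is the attention-weighted sum of its instance vectors."""
    a = np.exp(scores - scores.max())    # stable softmax over instance scores
    alpha = a / a.sum()
    return alpha @ S, alpha              # weighted sum over rows of S

S = np.array([[1.0, 0.0],                # instance vector s_1
              [0.0, 1.0]])               # instance vector s_2
u, alpha = bag_feature(S, np.array([2.0, 0.0]))
# s_1 scores higher, so it dominates the bag vector u.
```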
the entity feature representation layer obtains abstract features of the entities using BiLSTM, specifically:
the vectors e_1 and e_2 of the two entities in the relation are taken together, and BiLSTM is used to obtain the forward state e_j^→ and backward state e_j^← of each entity; the forward and backward states of an entity are merged as shown in equation (6):
ē_j = [e_j^→ ; e_j^←]  (6)
after the state vectors of the two entities are obtained, they are summed to give the final feature representation e_f of the entity pair (equation (7)):
e_f = ē_1 + ē_2  (7)
after the feature representation e_f of the entity pair and the feature representation u of the relation bag are obtained, they are concatenated into a new feature vector k = [e_f ; u], which is fed into the hidden layer;
the hidden layer receives the new feature vector and produces the final feature representation z through the linear and nonlinear transformation of the hidden layer:
z = tanh(W_h k + b_e)  (8)
where W_h is a parameter matrix, b_e is a bias, and k = [e_f ; u] is the concatenated feature vector.
The output layer outputs the final classification result, specifically:
the feature vector z obtained from the hidden layer is linearly transformed, and a SoftMax transformation yields the probability score of each relation category, as in formula (9):
o = softmax(W_o z + b_o)  (9)
where W_o is a parameter matrix, b_o is a bias, and o is the output of the whole network.
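The hidden-layer and output-layer computation of formulas (8) and (9) can be sketched as follows; the shapes and random parameters are assumed for illustration only.

```python
import numpy as np

def classify(e_f, u, W_h, b_h, W_o, b_o):
    """k = [e_f ; u] -> z = tanh(W_h k + b) -> o = softmax(W_o z + b_o),
    giving one probability score per relation category."""
    k = np.concatenate([e_f, u])         # formula: k = [e_f ; u]
    z = np.tanh(W_h @ k + b_h)           # formula (8)
    s = W_o @ z + b_o
    e = np.exp(s - s.max())              # stable softmax, formula (9)
    return e / e.sum()

rng = np.random.default_rng(1)
o = classify(rng.normal(size=3), rng.normal(size=2),   # e_f, u (toy sizes)
             rng.normal(size=(4, 5)), np.zeros(4),     # W_h, b_h
             rng.normal(size=(3, 4)), np.zeros(3))     # W_o, b_o: 3 relation classes
# o is a probability distribution over the 3 relation categories.
```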
Preferably, during model training, new entity relations are obtained from the test corpus to form triples, which are updated into the knowledge base.
Only preferred embodiments of the invention have been described herein, but this is not intended to limit the scope, applicability or configuration of the invention in any way. Rather, the detailed description of the embodiments is presented to enable any person skilled in the art to make and use them. It will be understood that various changes and modifications in detail may be effected without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A knowledge graph construction method for the power field, characterized by at least comprising the following steps:
step 1, acquiring structured data and semi-structured and unstructured data in the power field on a network;
step 2, manually screening the obtained structured data by keyword to serve as extracted triples and as the knowledge base for remote supervised learning; and for semi-structured and unstructured data, segmenting the sentences into words using the natural language processing tool LTP;
step 3, named entity recognition is carried out through a deep learning Chinese named entity recognition method;
and 4, extracting entity relations by adopting a remote supervised learning method, and realizing construction of the knowledge graph facing the power field.
2. The method according to claim 1, wherein in the step 1, structured data and semi-structured and unstructured data of the electric power field are acquired on a network, and encyclopedia texts of the electric power field are crawled on a breadth-first principle.
3. The method according to claim 1, wherein the step 3 of performing named entity recognition by a deep learning Chinese named entity recognition method specifically comprises:
step 3-1, carrying out distributed expression on the words;
step 3-2, carrying out model training by using a deep learning network under supervision;
and 3-3, labeling each word in the sequence by using the context information.
4. The method according to claim 1, wherein the step 3-2 of supervised model training using a deep learning network specifically comprises:
Taking a power-field data set as the training corpus, word vectors are trained in Skip-gram mode. The training network is a three-layer neural network consisting of an input layer, a hidden layer and an output layer, the hidden layer being set empirically to 100 neurons. The neural network module adopts BiLSTM, treats the semantic association of words in a sentence as a sequence problem, and stores historical information for learning during training. Taking context correlation into account, sequence labeling is performed with a CRF (conditional random field) model, which handles the correlations at the output level.
5. The method according to claim 2, wherein the step 4 of performing entity relationship extraction by using a remote supervised learning method specifically comprises:
step 4-1, entity alignment is carried out, and a relation instance set for training and testing is constructed in an entity alignment mode;
mapping the triple relations in the knowledge base to a training document for entity alignment, and generating a relation instance set Q:
Q = {q_n | q_n = (s_m, e_i, r_k, e_j), s_m ∈ D}  (1)
where e_i and e_j are two entities, r_k is the relation of the two entities in the knowledge base, s_m is a sentence in corpus D containing the entity pair, and q_n is a generated relation instance;
and 4-2, performing intra-sentence relation extraction by adopting a relation extraction model based on an attention mechanism.
6. The method according to claim 5, wherein the step 4-1 of performing entity alignment and constructing the relation instance sets for training and testing by entity alignment specifically comprises:
step 4-1-1, a mapping step: each entity is mapped into the sentences of the text, the co-occurrence of a pair of entities in one sentence is taken as a relation instance, and several relation instances sharing the same relation form a relation bag;
step 4-1-2, a training step: entities are aligned using the triples extracted from the structured data together with the encyclopedia texts;
and step 4-1-3, a testing step: candidate relation pairs are generated from all entities in the test set by pairwise permutation, and corresponding test instances and relation instances are generated from the candidate pairs and the test corpus by entity alignment.
7. The method according to claim 5, wherein in the step 4-2, intra-sentence relation extraction is performed with a relation extraction model based on an attention mechanism;
the attention-based relation extraction model mainly comprises two parts: a sentence vector representation module and a sentence-level attention mechanism module;
the sentence vector representation module is used to obtain the feature representation of each relation instance in a relation bag;
and the sentence-level attention mechanism module is used to measure the importance of each relation instance relative to its relation bag.
8. The method of claim 7, wherein in the sentence vector representation module, word vectors are obtained with the word2vec method, and word position vectors capture the relative position of each word with respect to the two entities in the sentence;
the word vector of the ith word in the sentence is denoted w_i, p_i^1 and p_i^2 denote the position vectors of word w_i relative to the two entities, and t_i, the final representation of word w_i, is their concatenation, as shown in equation (2):
t_i = [w_i ; p_i^1 ; p_i^2]  (2)
BiLSTM is used to obtain the forward state h_i^→ and backward state h_i^← of each word, and their concatenation is taken as the state h_i of the word, as shown in equation (3):
h_i = [h_i^→ ; h_i^←]  (3)
after the state information of all words is obtained, the vector s_i of the sentence is determined jointly by all the word states inside it (equation (4)).
9. The method of claim 7, wherein the sentence-level attention mechanism module comprises: an attention calculation unit, an entity feature representation layer, a relation-bag feature representation layer, a hidden layer and an output layer;
the attention calculation unit computes the weights of the different instances in a relation bag to obtain a vector representation of each bag; in the weight calculation, three kinds of feature information are fused on top of the sentence vector: a concept vector, a sentence marker vector and a target relation vector, where the concept vector e_i includes a descriptor vector c_i and a hypernym-hyponym marker vector q_i;
the relation-bag feature representation layer obtains the features of a relation bag, determined jointly by the relation instances in the bag; a bag S consists of n relation instances, S = {s_1, s_2, ..., s_n}, and the feature vector u of bag S is obtained by formula (5):
u = Σ_k α_k s_k  (5)
where α_k is the weight of the kth relation instance and s_k is its feature vector;
the entity feature representation layer obtains abstract features of the entities using BiLSTM, specifically:
the vectors e_1 and e_2 of the two entities in the relation are taken together, and BiLSTM is used to obtain the forward state e_j^→ and backward state e_j^← of each entity; the forward and backward states of an entity are merged as shown in equation (6):
ē_j = [e_j^→ ; e_j^←]  (6)
after the state vectors of the two entities are obtained, they are summed to give the final feature representation e_f of the entity pair (equation (7)):
e_f = ē_1 + ē_2  (7)
after the feature representation e_f of the entity pair and the feature representation u of the relation bag are obtained, they are concatenated into a new feature vector k = [e_f ; u], which is fed into the hidden layer;
the hidden layer receives the new feature vector and produces the final feature representation z through the linear and nonlinear transformation of the hidden layer:
z = tanh(W_h k + b_e)  (8)
where W_h is a parameter matrix, b_e is a bias, and k = [e_f ; u] is the concatenated feature vector.
The output layer outputs the final classification result, specifically:
the feature vector z obtained from the hidden layer is linearly transformed, and a SoftMax transformation yields the probability score of each relation category, as in formula (9):
o = softmax(W_o z + b_o)  (9)
where W_o is a parameter matrix, b_o is a bias, and o is the output of the whole network.
10. The method according to claim 4, wherein, during model training, new entity-relation triples are obtained from the test corpus and updated into the knowledge base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910917049.6A CN110866121A (en) | 2019-09-26 | 2019-09-26 | Knowledge graph construction method for power field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910917049.6A CN110866121A (en) | 2019-09-26 | 2019-09-26 | Knowledge graph construction method for power field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110866121A true CN110866121A (en) | 2020-03-06 |
Family
ID=69652230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910917049.6A Pending CN110866121A (en) | 2019-09-26 | 2019-09-26 | Knowledge graph construction method for power field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110866121A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019839A (en) * | 2018-01-03 | 2019-07-16 | 中国科学院计算技术研究所 | Medical knowledge map construction method and system based on neural network and remote supervisory |
CN110110335A (en) * | 2019-05-09 | 2019-08-09 | 南京大学 | A kind of name entity recognition method based on Overlay model |
Non-Patent Citations (2)
Title |
---|
顾溢: "Research on Complex Chinese Named Entity Recognition Based on BiLSTM-CRF", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, no. 2019, 15 July 2019 (2019-07-15), pages 138 - 1489 * |
顾静航: "Research on Entity Relation Extraction for the Biomedical Domain", China Doctoral Dissertations Full-text Database (Electronic Journal), Information Science and Technology, no. 2018, 15 April 2018 (2018-04-15), pages 138 - 15 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111552817A (en) * | 2020-04-14 | 2020-08-18 | 国网内蒙古东部电力有限公司 | Electric power scientific and technological achievement knowledge map completion method |
CN111597804B (en) * | 2020-05-15 | 2023-03-10 | 腾讯科技(深圳)有限公司 | Method and related device for training entity recognition model |
CN111597804A (en) * | 2020-05-15 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Entity recognition model training method and related device |
CN111931483A (en) * | 2020-06-22 | 2020-11-13 | 中国电力科学研究院有限公司 | Extraction method and device for structuring electric power equipment information |
CN111737496A (en) * | 2020-06-29 | 2020-10-02 | 东北电力大学 | Power equipment fault knowledge map construction method |
CN112307767A (en) * | 2020-11-09 | 2021-02-02 | 国网福建省电力有限公司 | Bi-LSTM technology-based regulation and control knowledge modeling method |
CN112364654A (en) * | 2020-11-11 | 2021-02-12 | 安徽工业大学 | Education-field-oriented entity and relation combined extraction method |
CN112860908A (en) * | 2021-01-27 | 2021-05-28 | 云南电网有限责任公司电力科学研究院 | Knowledge graph automatic construction method based on multi-source heterogeneous power equipment data |
CN113378574A (en) * | 2021-06-30 | 2021-09-10 | 武汉大学 | Named entity identification method based on KGANN |
CN113378574B (en) * | 2021-06-30 | 2023-10-24 | 武汉大学 | KGANN-based named entity identification method |
CN113255917A (en) * | 2021-07-14 | 2021-08-13 | 国网浙江省电力有限公司杭州供电公司 | Data access and integration method based on electric brain |
CN115510245A (en) * | 2022-10-14 | 2022-12-23 | 北京理工大学 | Unstructured data oriented domain knowledge extraction method |
CN115510245B (en) * | 2022-10-14 | 2024-05-14 | 北京理工大学 | Unstructured data-oriented domain knowledge extraction method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110866121A (en) | Knowledge graph construction method for power field | |
CN110825881A (en) | Method for establishing electric power knowledge graph | |
CN107239446B (en) | Intelligent relation extraction method based on neural networks and attention mechanism | |
CN108416065B (en) | Hierarchical neural network-based image-sentence description generation system and method | |
WO2021031480A1 (en) | Text generation method and device | |
CN107315737A (en) | Semantic logic processing method and system | |
CN111026842A (en) | Natural language processing method, natural language processing device and intelligent question-answering system | |
CN107526799A (en) | Knowledge graph construction method based on deep learning | |
CN105843801B (en) | Construction system for multi-translation parallel corpora | |
CN110321418A (en) | Domain and intent recognition and slot filling method based on deep learning | |
CN110489554B (en) | Attribute-level emotion classification method based on location-aware mutual attention network model | |
CN105868187B (en) | Construction method for multi-translation parallel corpora | |
CN111858940A (en) | Multi-head attention-based legal case similarity calculation method and system | |
CN114818717B (en) | Chinese named entity recognition method and system integrating vocabulary and syntax information | |
CN108874896A (en) | Humor recognition method based on neural networks and humor features | |
CN107688583A (en) | Method and apparatus for creating training data for a natural language processing device | |
CN112349294B (en) | Voice processing method and device, computer readable medium and electronic equipment | |
CN116628186B (en) | Text abstract generation method and system | |
CN111523328B (en) | Intelligent customer service semantic processing method | |
CN111597349B (en) | Rail transit standard entity relation automatic completion method based on artificial intelligence | |
CN113656564A (en) | Power grid service dialogue data emotion detection method based on graph neural network | |
CN113901208A (en) | Method for analyzing sentiment tendency of cross-language comments incorporating topic features | |
CN116258147A (en) | Multimodal comment sentiment analysis method and system based on heterogeneous graph convolution | |
Atef et al. | AQAD: 17,000+ Arabic questions for machine comprehension of text | |
CN111563148A (en) | Dialog generation method based on phrase diversity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||