CN115934955A

CN115934955A - Electric power standard knowledge graph construction method, knowledge question answering system and device

Info

Publication number: CN115934955A
Application number: CN202211320954.1A
Authority: CN
Inventors: 周育忠; 林正平; 王冕; 涂亮; 杨宇亮
Original assignee: CSG Electric Power Research Institute; Guizhou Power Grid Co Ltd
Current assignee: CSG Electric Power Research Institute; Guizhou Power Grid Co Ltd
Priority date: 2022-10-26
Filing date: 2022-10-26
Publication date: 2023-04-07

Abstract

The invention discloses a power standard knowledge graph construction method, a knowledge question-answering system and a device, which comprises the steps of constructing a body structure of a power standard knowledge graph through collected power standard data, wherein the body structure comprises entities, attributes and relationships among the entities; acquiring basic data containing power standard knowledge, extracting the basic data to extract entities, attributes and relationships among the entities; and performing knowledge fusion based on the extracted knowledge, storing the fused knowledge, and constructing the power standard knowledge map. The invention effectively solves the problem of difficulty in extracting the knowledge of the electric power standard through the designed model aiming at the knowledge extraction of the text information and the image information, thereby not only ensuring the reliability of the knowledge extraction, but also ensuring the extraction efficiency.

Description

Electric power standard knowledge graph construction method, knowledge question answering system and device

Technical Field

The invention relates to the technical field of electric power, in particular to a construction method of an electric power standard knowledge graph, a knowledge question-answering system and a knowledge question-answering device.

Background

The knowledge graph is a modern theory which achieves the aim of multi-discipline fusion by combining theories and methods of applying subjects such as mathematics, graphics, information visualization technology, information science and the like with methods such as metrology introduction analysis, co-occurrence analysis and the like and utilizing a visualized graph to vividly display the core structure, development history, frontier field and overall knowledge framework of the subjects. The method can display the complex knowledge field through data mining, information processing, knowledge measurement and graph drawing, reveal the dynamic development rule of the knowledge field, and provide practical and valuable reference for subject research.

Applications based on knowledge graphs are many, such as intelligent question answering, personalized recommendation, knowledge reasoning, visualization, and the like. The knowledge question-answering system is similar to a search engine and is also an information retrieval tool, but the knowledge question-answering system can understand and process natural language questions at a semantic level and directly returns answers to the questions to realize semantic retrieval. If the knowledge map is used as a knowledge source of the question-answering system, the knowledge question-answering system based on the knowledge base is formed, the questions in natural language forms can be accepted, the meaning of the questions is understood through semantic analysis, and then the answers of the questions are inquired and returned in the knowledge base.

The existing related knowledge of the power industry generally depends on a search engine, and an intelligent question and answer system in the vertical field is not seen. The reason is that the difficulty in constructing the knowledge graph is caused by high difficulty in extracting relevant knowledge in the process of constructing the power standard knowledge graph.

Disclosure of Invention

This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.

The present invention has been made in view of the above and/or problems occurring in the prior art related to the power industry.

Therefore, the problem to be solved by the invention is how to extract relevant knowledge in the process of constructing the power standard knowledge graph.

In order to solve the technical problems, the invention provides the following technical scheme:

in a first aspect, an embodiment of the present invention provides a power standard knowledge graph construction method, which includes,

constructing an ontology structure of the power standard knowledge graph through the acquired power standard data, wherein the ontology structure comprises entities, attributes and relationships among the entities;

acquiring basic data containing power standard knowledge, extracting the basic data to extract entities, attributes and relationships among the entities;

and performing knowledge fusion based on the extracted knowledge, storing the fused knowledge, and constructing the power standard knowledge map.

As a preferable scheme of the electric power standard knowledge graph construction method of the invention, the method comprises the following steps: the acquiring of basic data containing power standard knowledge and the knowledge extraction of the basic data comprise,

preprocessing the basic data to obtain a plurality of text messages or obtain a plurality of text messages and at least one image message;

for each text message, segmenting the text message and inputting the segmented text message into a Bert submodel to obtain a corresponding vector sequence, inputting the vector sequence into a BGRU submodel, outputting a state matrix for revealing scores of labels corresponding to words in the text message, inputting the state matrix into a CRF submodel, calculating an optimal label sequence, and realizing extraction of an entity and extraction of attributes;

aiming at each piece of image information, inputting the image information into a formula recognition sub tool called by the outside to obtain converted text information, processing the converted text information to obtain at least one formula text, inputting each formula text into a WordBert sub-model together to obtain a corresponding vector sequence, inputting the vector sequence into a BGRU sub-model, outputting a state matrix for revealing each label score corresponding to each formula text in the converted text information, inputting the state matrix into a CRF sub-model, calculating an optimal label sequence, and realizing the extraction of attributes;

and processing the vector sequences of the extracted entities and attributes and then inputting the processed vector sequences into a relation extraction submodel to realize the extraction of the relation between the entities.

As a preferable scheme of the electric power standard knowledge graph construction method of the invention, the method comprises the following steps: the knowledge extraction method for each text message includes,

segmenting the text information to obtain a segmented text w with the length of n;

segmenting the word text w = ([ CLS)],w ₁ ,w ₂ ,…,w _n ,[SEP]) Inputting the word segmentation text w into a Bert submodel to obtain a vector sequence l = (l) corresponding to the word segmentation text w ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 )，l _i ∈R ^n×L Wherein i is epsilon [0, n +1]Vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) For the hidden state corresponding to the participle text w in the last layer of the Bert submodel, [ CLS]As an initiator, [ SEP]For the terminator, L is the hidden state dimension of the Bert submodel;

vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) In each word vector sequence l _i As input for each time step in the BGRU submodel;

hidden state sequence for outputting forward GRU in BGRU submodel

And hidden state sequence of reverse GRU output

Calculating to obtain a hidden state sequence h corresponding to the vector sequence l _n+1 ，h _n+1 ∈R ^n×H H is the hidden state dimension of the BGRU submodel;

will hide the state sequence h _n+1 Mapping from H dimension to k dimension, wherein k is the number of labels;

calculating the label scores of each participle classified into k labels to obtain a state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _n ,e _n+1 ) In which e is _i ∈R ^k Is a column vector;

and inputting the state matrix into a CRF submodel, and calculating an optimal label sequence.

As a preferable scheme of the electric power standard knowledge graph construction method of the invention, the method comprises the following steps: inputting the state matrix into a CRF submodel, calculating an optimal label sequence comprises,

state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _n ,e _n+1 ) Inputting the CRF model into a CRF sub-model;

calculating each label sequence based on a constraint matrix F introduced in a CRF submodel and an input state matrix E

Total score of (c):

/>

wherein F ∈ R ^(k+2)×(k+2) ，

Represents the tag sequence->

Alpha is an adjustment factor, and>

represents the probability that the ith participle is classified into the jth tag in the state matrix E, and ` is greater or lesser than `>

Indicates that it is based on the tag sequence->

The probability of the jth label transitioning to the j +1 th label;

based on each tag sequence

Is greater than or equal to>

Calculating an optimal tag sequence->

Wherein the content of the first and second substances,

is a collection of all possible tag sequences.

As a preferable scheme of the electric power standard knowledge graph construction method of the invention, the method comprises the following steps: the knowledge extraction method for each image information includes,

identifying the converted text information, and determining whether a target symbol "=" exists;

if the target symbol "=" does not exist, determining the converted text information as a formula text;

if the target symbol "=" exists, splitting the converted text information by using the target symbol "=" to obtain a plurality of formula texts;

combining formula text with v = ([ CLS)],v ₁ ,v ₂ ,…,v _m ,[SEP]) Inputting the vector sequence into a WordBert submodel to obtain a vector sequence l = (l) corresponding to a formula text combination v ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 )，l _i ∈R ^m×L Wherein i is epsilon [0, m +1 +]Vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) Is a hidden state corresponding to a formula text combination v in the last layer of the WordBert submodel, [ CLS]As initiator, [ SEP]For the terminator, L is the hidden state dimension of the WordBert submodel;

vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) In each formula vector sequence l _i As input for each time step in the BGRU submodel; hidden state sequence for outputting forward GRU in BGRU submodel

And a hidden state sequence output in reverse GRU>

Calculating to obtain a hidden state sequence h corresponding to the vector sequence l _m+1 ，h _m+1 ∈R ^m×H H is the hidden state dimension of the BGRU submodel; will hide the state sequence h _m+1 Mapping from H dimension to k dimension, wherein k is the number of labels; calculating the label scores of k labels classified by each formula to obtain a state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _m ,e _m+1 )，e _i ∈R ^k Is a column vector;

state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _m ,e _m+1 ) Inputting the CRF model into a CRF sub-model;

calculating each tag sequence based on the input state matrix E

Total score of (c):

wherein the content of the first and second substances,

represents the tag sequence->

Is based on the total score of->

Representing the probability of classifying the ith component into the jth label in the state matrix E;

based on each tag sequence

Is greater than or equal to>

Calculating an optimal tag sequence->

Wherein it is present>

Is a collection of all possible tag sequences.

As a preferable scheme of the electric power standard knowledge graph construction method of the invention, the method comprises the following steps: the vector sequence of the extracted entities and attributes is processed and then input into a relation extraction submodel to realize the extraction of the relation between the entities,

based on the extracted entities, the vector sequence l = (l) corresponding to the segmented text w ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) The corresponding vector in (1) is marked;

inputting the marked vector sequence l' into a relation extraction submodel;

performing binary mutual grouping on all the marker vectors aiming at the marker vectors with the markers in the vector sequence l' so as to enable each marker vector and other marker vectors to have a paired combination relationship;

splicing two marker vectors of the marker vector pair to obtain a combined vector aiming at each marker vector pair with a combination relation;

calculating the score of each combination vector under each relation category;

and respectively obtaining the optimal scores corresponding to each combined vector, sorting, eliminating the last optimal score in the sorting, and determining the relationship between the entities with the corresponding relationship categories between the entities corresponding to the combined vectors aiming at each residual optimal score to realize the extraction of the relationship between the entities.

In a second aspect, an embodiment of the present invention provides a power standard knowledge graph building system, which includes:

the data layer comprises a pre-constructed power standard knowledge graph and a word segmentation dictionary constructed based on entities and attributes in the power standard knowledge graph;

the Web layer is used for receiving question information of a user and generating and displaying answer information based on a query result of the query layer, wherein the question information is in a natural language form;

and the query layer is used for converting the question information into Cypher query sentences, sending the Cypher query sentences to the Neo4j graph database for query and acquiring query results.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and where: the processor, when executing the computer program, performs any of the steps of the above-described method.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein: which when executed by a processor performs any of the steps of the above-described method.

The invention has the beneficial effects that: the knowledge extraction of the text information realizes the combined extraction of the entities and the attributes in the power standard knowledge through a designed model, so that the reliability of the knowledge extraction can be ensured, and the extraction efficiency can be ensured; the problem that power standard knowledge is difficult to extract (because formula images are used for representing data of relevant information such as numerical value limitation and calculation modes, effective knowledge extraction cannot be realized in the prior art) is effectively solved through a designed model aiming at knowledge extraction of image information, and not only can extraction of relevant knowledge in the formula images be guaranteed, but also reliability of extraction of the knowledge can be guaranteed. And moreover, the designed WordBert submodel is adopted for the formula text, word segmentation operation is not involved, the processing process can be reduced, information can be effectively reserved, and the problem of wrong formula information extraction caused by word segmentation in the traditional Bert model is solved. The vector sequence of the extracted entities and the attributes is processed and then input into the relation extraction submodel, so that the extraction of the relation between the entities is realized, the subsequent relation processing can be carried out by utilizing the vector sequence obtained by processing the Bert submodel, and the extraction of the relation can be carried out after the corresponding processing, so that the workload of knowledge extraction can be effectively reduced (because the repeated entity extraction process is not required), and the entity is determined, so that the work can be doubled with half the effort in the relation extraction process.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:

FIG. 1 is a flow chart of a power standard knowledge graph construction method.

FIG. 2 is a schematic diagram of a power standard knowledge graph building model.

FIG. 3 is a schematic diagram of a power standard knowledge question-answering system.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.

Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected" and "connected" in the present invention are to be construed broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

Example 1

Referring to fig. 1 and 2, a first embodiment of the present invention provides a power standard knowledge graph construction method, including:

s100: and constructing an ontology structure of the power standard knowledge graph through the collected power standard data, wherein the ontology structure comprises entities, attributes and relationships among the entities.

It should be noted that, considering the field of power standard knowledge graph, the ontology structure may be constructed in a top-down and bottom-up manner, and a part of the ontology structure is designed in advance, such as: the method comprises the following steps of obtaining the standard name of the electric power (such as building lightning protection design specifications), indexes (such as lightning protection devices), indexes (lightning protection devices), lower-layer indexes (lightning protection lines) and the like, and finding and adding a new body structure in the follow-up knowledge extraction process.

S200: acquiring basic data containing power standard knowledge, extracting the basic data to extract entities, attributes and relationships among the entities.

It should be noted that, acquiring basic data containing power standard knowledge may be implemented by collecting documents, crawling web pages, and the like. For example, the information about the power standard knowledge in the web page can be crawled, and can also be obtained from the constructed data set (since the power standard knowledge belongs to a very vertical field and the knowledge is relatively stable). The basic data containing the power standard knowledge may be plain text data (e.g., a word document, a PDF document, a TXT document, etc.), or may be a combination of text data and a formula image (e.g., a PDF document containing a formula, a word document containing a formula image, etc.), and the basic data may be a document obtained by processing and then sorting data obtained by crawling a web page.

It should be noted that, for text data in the base data, the text data in the base data may be split into a plurality of text information based on the sentence separator.

It should be noted that the acquired basic data may be preprocessed to obtain a plurality of text messages, or obtain a plurality of text messages and at least one image message.

Further, if a formula image exists in the basic data, the formula image may be processed for each formula image in the basic data to obtain corresponding image information. For example, a formula image can be input into Mathpix to obtain an output formula, the output Latex format can be converted into tex, and then MathType is used to convert the Latex into MathML format, i.e. plain text format, which can be used to obtain a Word document.

Further, for each image information, a number may be assigned to the image information, and the same number may be assigned to all text information corresponding to the paragraph in which the formula image is located and the adjacent paragraph in the text data, so as to establish an association relationship between the image information and the text information. By the method, the incidence relation can be established for the text information and the image information, so that the entity object to which the attribute belongs can be conveniently determined subsequently, and the accuracy and the reliability of the knowledge graph are ensured.

It should be noted that, in order to implement knowledge extraction (joint extraction of entities and attributes) on text information, for each text information, the text information may be segmented and input to the Bert submodel, so as to obtain a corresponding vector sequence.

Specifically, the text information is segmented to obtain a segmented text w with the length of n, and then the segmented text w = ([ CLS)],w ₁ ,w ₂ ,…,w _n ,[SEP]) Transfusion deviceEntering a Bert submodel to obtain a vector sequence l = (l) corresponding to the word segmentation text w ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 )，l _i ∈R ^n×L ；

Wherein i belongs to [0, n +1 ]]Vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) For the hidden state corresponding to the participle text w in the last layer of the Bert submodel, [ CLS]As an initiator, [ SEP]For the terminator, L is the hidden state dimension (e.g., 100 dimensions, 200 dimensions, etc.) of the Bert submodel.

Further, after the vector sequence l output by the Bert submodel is obtained, the vector sequence l can be input into the BGRU submodel, and the BGRU submodel outputs a state matrix for revealing scores of the labels corresponding to the words in the text information.

Specifically, the vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) In each word vector sequence l _i Respectively used as the input of each time step (n +2 time steps are needed) in the BGRU submodel, and then the hidden state sequence of the forward GRU output in the BGRU submodel

And a hidden state sequence of inverted GRU outputs>

Calculating to obtain a hidden state sequence h corresponding to the vector sequence l _n+1 ，h _n+1 ∈R ^n×H Where H is the hidden state dimension of the BGRU submodel.

It should be noted that the hidden state sequence is output to the GRU in the forward direction

Hidden state sequence with reverse GRU output

Averaging after bit-wise addition (to further improve the precision, averaging can also be performed in a bit-wise weighted addition manner) to obtainHidden state sequence h _n+1 。

Further, a hidden state sequence h _n+1 Mapping from H dimension to k dimension, wherein k is the number of labels, and calculating the label score of each participle classification to k labels to obtain a state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _n ,e _n+1 ) Wherein e is _i ∈R ^k ，e _i Is a column vector;

and inputting the state matrix into a CRF submodel, calculating an optimal label sequence, and realizing the extraction of the entity and the attribute.

Specifically, the state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _n ,e _n+1 ) Inputting the state matrix E into a CRF submodel based on a constraint matrix F introduced into the CRF submodel and an input state matrix E, wherein F belongs to R ^(k+2)×(k+2) Each tag sequence is calculated using the following formula

Total score of (c):

wherein the content of the first and second substances,

represents the tag sequence->

Alpha is a regulatory factor, based on the total score of (a)>

Indicates that it is based on the tag sequence->

Probability of the jth label in (j) to transition to the j +1 th label.

Then, the sequence can be based on each label

In total score +>

Substituting the following formula to calculate the optimal tag sequence

Wherein, the first and the second end of the pipe are connected with each other,

is a collection of all possible tag sequences.

In addition, in order to guarantee the applicability of the introduced constraint matrix F, the following loss function can be added in the CRF submodel, and in the training phase, the constraint matrix F is learned by minimizing the loss function.

Wherein y is the correct tag sequence,

is a collection of all possible tag sequences.

The method realizes the combined extraction of the entities and the attributes in the power standard knowledge through the designed model, not only can ensure the reliability of the knowledge extraction, but also can ensure the extraction efficiency. Due to the adoption of the construction of the Bert + BGRU + CRF model, the method can perform word segmentation and then utilize the Bert model for processing, realize the combined extraction of the entity and the attribute, improve the accuracy of the entity and the attribute extraction, and reduce the model designDifficulty. And the introduced constraint matrix F is used for constraining the state matrix E, so that illegal tag sequences can be prevented from being output. And, calculating each tag sequence

The adjustment factor alpha is introduced during the total score, so that the method has stronger applicability in the process of entity and attribute combined extraction, ensures the accuracy of entity and attribute extraction, and overcomes the problem caused by the difference of a constraint matrix F required in the process of entity and attribute extraction (the problem that the entity extraction precision is high but the attribute extraction precision is low or the entity extraction precision is low but the attribute extraction precision is high because the attribute and the entity adopt the constraint matrix of the same standard).

In order to realize knowledge extraction (extraction of attributes) of image information, for each image information, the image information can be input into an externally called formula identification sub-tool, and converted text information is obtained. The converted text information may then be processed to obtain at least one formula text.

It should be noted that the converted text information may be recognized to determine whether the target symbol "=" is present therein. If the target symbol "=" does not exist, determining the converted text information as a formula text; if the target symbol "=" exists, the converted text information is split by the target symbol "=" to obtain a plurality of formula texts (if there are 4 target symbols "=", the formula texts can be split into 5 formula texts).

The attribute related to the formula can be split into an attribute identification part (for example, symbolic representation of the attribute) and an attribute definition part (for example, value definition of the attribute, parameter value range definition, and the like), and even an intermediate derivation process including the attribute is available.

For each formula text, each formula text can be input into the WordPert submodel together to obtain a corresponding vector sequence. Here, each formula text represents a formula text obtained by splitting text information obtained by converting the same image data.

Specifically, the formula text is combined with v = ([ CLS)],v ₁ ,v ₂ ,…,v _m ,[SEP]) Inputting the vector sequence into a WordBert submodel to obtain a vector sequence l = (l) corresponding to a formula text combination v ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 )，l _i ∈R ^m×L Wherein i is epsilon [0, m +1 +]Vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) Is a hidden state corresponding to a formula text combination v in the last layer of the WordBert submodel, [ CLS]As an initiator, [ SEP]For the terminator, L is the hidden state dimension of the WordBert submodel (e.g., 100, 200, consistent with the hidden state dimension of the Bert submodel).

The obtained vector sequence can be input into a BGRU submodel, and a state matrix for revealing each label score corresponding to each formula text in the converted text information is output.

Specifically, a vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) In each formula vector sequence l _i As the input of each time step in the BGRU submodel, the hidden state sequence of the forward GRU output in the BGRU submodel

And a hidden state sequence of inverted GRU outputs>

Calculating to obtain a hidden state sequence h corresponding to the vector sequence l _m+1 ，h _m+1 ∈R ^m×H And H is the hidden state dimension of the BGRU submodel. Then the hidden state sequence h _m+1 Mapping from H dimension to k dimension, wherein k is the number of labels, calculating the label score of each formula classification to k labels to obtain a state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _m ,e _m+1 )，e _i ∈R ^k ，e _i Is a column vector. The process is similar to the operation of the BGRU submodel described above, and thus will not be described again.

Get the state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _m ,e _m+1 ) The state matrix may then be input to a CRF submodel to compute an optimal tag sequence.

Specifically, the state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _m ,e _m+1 ) Inputting the CRF model into a CRF sub-model; calculating each tag sequence based on the input state matrix E

Total score of (c):

represents the tag sequence->

Is based on the total score of->

Representing the probability of the ith component in the state matrix E being classified to the jth label.

Based on each tag sequence

Is greater than or equal to>

Calculating an optimal label sequence>

Wherein the content of the first and second substances,

is a collection of all possible tag sequences.

It should be noted that, in the present embodiment, l = (l) for the vector sequence output based on the WordBert sub-model ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) Determined state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _m ,e _m+1 ) The vector sequence l = (l) output based on the Bert submodel is adopted ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) Determined state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _n ,e _n+1 ) Computing tag sequences in different computing manners

The total score of (2) is due to the fact that the state matrix obtained in the two cases is better in effect by adopting a differentiation calculation method. Of course, the vector sequence output based on the WordBert submodel l = (l) ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) Determined state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _m ,e _m+1 ) In calculating a tag sequence >>

The method of calculating the total score according to equation (1) can also be used because the method of equation (1) also introduces the adjustment factor α in consideration of the difference between the entity and the attribute (particularly, the formula of text information), but relatively speaking, when only the method is used for attribute extraction, the effect of equation (1) is slightly inferior to that of equation (4), but the method of equation (1) performs tag sequence transformation on the entity and the attribute without differentiation processing>

The total score of (a) is calculated, the performance will be much better.

Thereby, the extraction of the attribute based on the formula image can be realized.

In such a mode, the extraction of the relevant knowledge in the power standard knowledge (both belonging to attributes) in the formula image can be realized through the designed model, the problem of difficulty in extracting the power standard knowledge (due to the fact that the formula image is used for representing data of relevant information such as numerical value limitation and calculation modes, effective knowledge extraction cannot be realized in the prior art) is effectively solved, the extraction of the relevant knowledge in the formula image can be ensured, and the reliability of the extraction of the knowledge can also be ensured. In addition, word segmentation operation is not needed, and the WordBert submodel does not need to be trained by using segmented texts but by using whole sentences (particularly formulas, characters, operators and the like), so that the accuracy of formula class attribute extraction can be greatly improved. The designed WordBert submodel is adopted for the formula text, word segmentation operation is not involved, the processing process can be reduced, information can be effectively reserved, and the problem of wrong formula information extraction caused by word segmentation in the traditional Bert model is solved.

Furthermore, after the extraction of the entities and the attributes is realized, the vector sequences of the extracted entities and the attributes can be processed and then input into the relationship extraction submodel, so that the extraction of the relationship between the entities is realized.

The vector sequence of the extracted entities and the attributes is processed and then input into the relation extraction submodel, so that the extraction of the relation between the entities is realized, the subsequent relation processing can be carried out by utilizing the vector sequence obtained by processing the Bert submodel, and the extraction of the relation can be carried out after the corresponding processing, so that the workload of knowledge extraction can be effectively reduced (because the repeated entity extraction process is not required), and the entity is determined, so that the work can be doubled with half the effort in the relation extraction process.

Specifically, based on the extracted entities, the vector sequence l = (l) corresponding to the segmented text w ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) The corresponding vector in (1) is labeled.

For example, the vector sequence l = (l) corresponding to the participle text w ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) In, l ₁ ,l ₃ ,l ₅ Corresponding participles are extracted as entities, then, can be pairedVector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) Marking the corresponding vector in (1) to obtain a marked vector sequence l' = (l) ₀ ,l′ ₁ ,l ₂ ,l′ ₃ ,l ₄ ,l′ ₅ ,…,l _n ,l _n+1 )。

The marked vector sequence l' may then be input into the relation extraction submodel. The relationship extraction submodel here belongs to a relationship extraction model based on Bert.

And for the marked marker vectors in the vector sequence l', performing binary mutual grouping on all the marker vectors so that each marker vector has a paired combination relationship with other marker vectors, and so that each marker vector has a paired combination relationship with other marker vectors to represent the marker vector pairs.

Following the previous example, for the marked vector sequence l' = (l) ₀ ,l′ ₁ ,l ₂ ,l′ ₃ ,l ₄ ,l′ ₅ ,…,l _n ,l _n+1 ) For all the token vectors l' ₁ ,l′ ₃ ,l′ ₅ ) And performing binary mutual grouping to obtain three grouped mark vector pairs: (l' ₁ ,l′ ₃ )、(l′ ₁ ,l′ ₅ ) And (l' ₃ ,l′ ₅ )。

And splicing the two mark vectors of the mark vector pair to obtain a combined vector aiming at each mark vector pair with a combination relation. The splicing mode here can be: and splicing the two mark vectors of the mark vector pair end to obtain a corresponding combined vector. For example, a pair of marker vectors (l' ₁ ,l′ ₃ ) Obtaining a combined vector l after splicing _c1 Vector pair of (/ ')' ₁ ,l′ ₅ ) Obtaining a combined vector l after splicing _c2 Number vector pair (l' ₃ ,l′ ₅ ) Obtaining a combined vector l after splicing _c3 。

Then, the score of each combined vector under each relation category can be calculated, and a corresponding score vector can be obtained

Wherein, P ^x And q is the number of relation categories.

May then be based on the vector sequence p ^x And determining the optimal score, wherein the relation category corresponding to the optimal score represents the relation category between the combination vectors. And sorting the optimal scores, and removing the optimal score at the tail of sorting.

And aiming at each residual optimal score, determining the relationship between the entities with the corresponding relationship categories between the entities corresponding to the combination vector, and realizing the extraction of the relationship between the entities. Therefore, the extraction of the relationship among the entities can be quickly, efficiently and accurately realized in a multi-entity binary mutual group mode, and the relationship among the entities can be considered.

For the corresponding relation between the attribute and the entity, the entity and the attribute can be corresponded in the combined extraction process; or after determining the entity and the attribute, attribution division can be carried out on the attribute and the entity; the attribution relationship between the entity and the attribute can also be extracted from the webpage by means of a wrapper (for example, by inputting a URL, using a tool to crawl the webpage, and using the wrapper to extract the attribute corresponding to the entity provided by the webpage and then attributing the extracted attribute).

It should be noted that, for the data source of the power standard knowledge, for each document (especially, the document whose content belongs to the normative file, such as overvoltage protection design specification of industrial and civil power devices, grounding design specification of industrial and civil power devices, lightning protection design specification of buildings, design specification of power devices in explosion and fire hazard places, etc.), the title can be extracted separately, a basic entity object is extracted, and key attributes such as formulation time, application scene, publishing unit, etc. are extracted as important factors in the application of the power standard knowledge graph such as subsequent intelligent question answering, personalized recommendation, etc.

S300: and performing knowledge fusion based on the extracted knowledge, and storing the knowledge after the knowledge fusion by adopting a Neo4j graph database so as to construct the power standard knowledge graph.

It should be noted that there are many ways of knowledge fusion, mainly requiring entity alignment and entity disambiguation. For example, the Jaccard algorithm based on string similarity can be employed to achieve entity alignment and entity disambiguation.

Further, a strategy of extracting and storing at the same time can be adopted: and temporarily storing the knowledge extraction result in a memory by JSON format data, and then submitting the knowledge extraction result to a Neo4j graph database through a py2Neo library of Python to realize persistent storage.

In this way, the construction of the power standard knowledge graph can be realized.

Further, the present embodiment also provides an electric power standard knowledge question-answering system, which includes:

the Web layer is used for receiving question information of a user and generating and displaying answer information based on a query result of the Web layer, wherein the question information is in a natural language form;

The embodiment also provides a computer device, which is suitable for the case of the power standard knowledge graph construction method, and includes:

a memory and a processor; the memory is used for storing computer executable instructions, and the processor is used for executing the computer executable instructions to realize the power station area change relationship identification method provided by the embodiment.

The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and an input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for communicating with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

The present embodiment also provides a storage medium on which a computer program is stored, which when executed by a processor implements the method for implementing a power standard knowledge graph as set forth in the above embodiments.

The storage medium proposed by the present embodiment belongs to the same inventive concept as the data storage method proposed by the above embodiments, and technical details that are not described in detail in the present embodiment can be referred to the above embodiments, and the present embodiment has the same beneficial effects as the above embodiments.

Example 2

Referring to fig. 1 to 3, a second embodiment of the present invention provides a method for constructing a power standard knowledge graph, and scientific demonstration is performed through experiments in order to verify the beneficial effects of the present invention.

In this embodiment, 984 labeled basic data are used to construct a data set, and the data set is 7:2:1, dividing the model into a training set (689), a verification set (197) and a test set (98), training, verifying and testing the model, taking accuracy, recall rate and F1 value as evaluation indexes, and verifying the effect of the model:

(1) The accuracy rate P represents the accuracy degree of model prediction, and the calculation formula is as follows:

where M represents the sample set for which the model predicts to be positive and T represents the sample set that is truly positive.

(2) The recall ratio R represents the comprehensive degree of model prediction, and the calculation formula is as follows:

(3) The F1 value is the combination of the precision P and the recall ratio R, and the calculation formula is as follows:

based on the effect verification on the model, the obtained relevant evaluation data is as follows: the precision rate P is approximately equal to 0.84, the recall rate R is approximately equal to 0.90, and F1 is approximately equal to 0.87. Therefore, the model is well represented, and the effect of extracting the power standard knowledge is good.

And based on the constructed power standard knowledge graph, a power standard knowledge question-answering system can be further constructed.

It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims

1. A power standard knowledge graph construction method is characterized by comprising the following steps: comprises the steps of (a) preparing a mixture of a plurality of raw materials,

and performing knowledge fusion based on the extracted knowledge, storing the fused knowledge, and constructing a power standard knowledge map.

2. The power standard knowledge-graph construction method of claim 1, wherein: the acquiring of basic data containing power standard knowledge and the knowledge extraction of the basic data comprise,

inputting the image information into a formula identification submodel called from the outside aiming at each image information to obtain converted text information, processing the converted text information to obtain at least one formula text, inputting each formula text into a WordBert submodel together to obtain a corresponding vector sequence, then inputting the vector sequence into a BGRU submodel, outputting a state matrix for revealing each label score corresponding to each formula text in the converted text information, inputting the state matrix into a CRF submodel, calculating an optimal label sequence, and realizing the extraction of attributes;

and processing the vector sequences of the extracted entities and attributes and then inputting the processed vector sequences into the relation extraction submodel to realize the extraction of the relation between the entities.

3. The power standard knowledge graph construction method according to claim 2, wherein: the knowledge extraction method for each text message includes,

segmenting the word text w = ([ CLS)],w ₁ ,w ₂ ,…,w _n ,[SEP]) Is inputted intoThe Bert submodel obtains a vector sequence l = (l) corresponding to the participle text w ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 )，l _i ∈R ^n×L Wherein i is epsilon [0, n +1]Vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) For the hidden state corresponding to the participle text w in the last layer of the Bert submodel, [ CLS]As an initiator, [ SEP]For the terminator, L is the hidden state dimension of the Bert submodel;

the vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _n ,l _n+1 ) In each word vector sequence l _i As input for each time step in the BGRU submodel;

hidden state sequence for outputting forward GRU in BGRU submodel

And a hidden state sequence of inverted GRU outputs>

will hide the state sequence h _n+1 Mapping from H dimension to k dimension, wherein k is the label number;

calculating the label scores of each participle classified into k labels to obtain a state matrix E = (E) ₀ ,e ₁ ,e ₂ ,…,e _n ,e _n+1 ) Wherein e is _i ∈R ^k Is a column vector;

4. The power standard knowledge graph construction method of claim 3, wherein: inputting the state matrix into a CRF submodel, calculating an optimal label sequence comprises,

Total score of (c):

wherein F ∈ R ^(k+2)×(k+2) ，

Represents the tag sequence->

Alpha is a regulatory factor, based on the total score of (a)>

Represents a number of times greater than or equal to a tag sequence>

The probability of the jth label transitioning to the j +1 th label;

based on each tag sequence

Is greater than or equal to>

Calculating an optimal label sequence>

Wherein the content of the first and second substances,

is a collection of all possible tag sequences.

5. The power standard knowledge graph construction method of claim 4, wherein: the knowledge extraction method for each image information includes,

combining formula text by v = ([ CLS)],v ₁ ,v ₂ ,…,v _m ,[SEP]) Inputting the vector sequence into a WordBert submodel to obtain a vector sequence l = (l) corresponding to a formula text combination v ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 )，l _i ∈R ^m×L Wherein i belongs to [0, m +1 ]]Vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) Is a hidden state corresponding to a formula text combination v in the last layer of the WordPert submodel, [ CLS ]]As an initiator, [ SEP]For the terminator, L is the hidden state dimension of the WordBert submodel;

the vector sequence l = (l) ₀ ,l ₁ ,l ₂ ,…,l _m ,l _m+1 ) In each formula vector sequence l _i As input for each time step in the BGRU submodel; hidden state sequence for outputting forward GRU in BGRU submodel

And a hidden state sequence of inverted GRU outputs>

6. The power standard knowledge graph construction method of claim 5, wherein: inputting the state matrix into a CRF submodel, calculating an optimal label sequence comprises,

calculating each tag sequence based on the input state matrix E

Total score of (c):

/>

wherein the content of the first and second substances,

represents a tag sequence +>

Total score ofValue,. Or>

based on each tag sequence

In total score +>

Calculating an optimal tag sequence->

Wherein the content of the first and second substances,

is a collection of all possible tag sequences.

7. The power standard knowledge graph construction method of claim 6, wherein: the vector sequence of the extracted entities and attributes is processed and then input into a relation extraction submodel to realize the extraction of the relation between the entities,

inputting the marked vector sequence l' into a relation extraction submodel;

splicing two mark vectors of the mark vector pair to obtain a combined vector aiming at each mark vector pair with a combination relation;

calculating the score of each combination vector under each relation category;

8. An electric power standard knowledge question-answering system based on the electric power standard knowledge graph construction method of claims 1-7, characterized in that: comprises the steps of (a) preparing a mixture of a plurality of raw materials,

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that: the processor, when executing the computer program, performs the steps of the method of any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.