CN115587190A - Construction method and device of knowledge graph in power field and electronic equipment - Google Patents
- Publication number
- CN115587190A, CN202211193515.9A
- Authority
- CN
- China
- Prior art keywords
- entity
- document
- concept
- data
- knowledge graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The invention provides a method and a device for constructing a knowledge graph in the power field, and electronic equipment. The method comprises the following steps: obtaining a plurality of entity triples based on a remote supervision method; assigning each entity triple a retained or not-retained state through a policy network, generating a denoised training set from the retained triples, and training a relation extraction model on it; if the extraction accuracy of the trained relation extraction model is smaller than a preset value, obtaining a feedback value through feedback calculation based on a preset verification set; based on the feedback value, obtaining the state combination that maximizes the expected feedback value of the policy network; taking the labeled data set generated from that state combination as the denoised labeled data set, and training the relation extraction model until its extraction accuracy is greater than or equal to the preset value; and extracting power field documents through the relation extraction model to obtain the knowledge graph. Through the policy network, the method reduces the wrongly labeled noise data in the training set and improves the extraction accuracy of the relation extraction model.
Description
Technical Field
The invention relates to the technical field of knowledge graphs, in particular to a method and a device for constructing a knowledge graph in the power field and electronic equipment.
Background
A knowledge graph is a semantic network used to abstractly represent concepts, entities, and relationships of the real world. Knowledge instances are extracted from text and expressed as entity-attribute-value or entity-relation-entity triples. Domain-specific knowledge graphs, such as those for the power field, are typically constructed in a top-down manner: domain experts first sort out the concepts, divide the entity categories, and determine the relationships between entity concepts. Knowledge instances, namely entity triples, are then extracted from massive power field documents by a relation extraction model to form the knowledge graph.
Generally, training a relation extraction model requires a large amount of document-based entity triple labeled data as a training set. However, little data is available in the power field, and collecting, labeling, and checking it consumes a great deal of manpower, material resources, and time. To obtain a large labeled data set, a remote supervision method is usually adopted, which assumes that if two entities have a relationship in a given knowledge base, then any unstructured sentence containing both entities expresses that relationship. Because this assumption is so strong, the resulting training set contains a large amount of wrongly labeled noise data, and the noise data in the training set reduces the extraction accuracy of the relation extraction model.
Disclosure of Invention
The embodiment of the invention provides a method and a device for constructing a knowledge graph in the power field and electronic equipment, and aims to solve the problem of low extraction accuracy of a relation extraction model caused by noise data.
In a first aspect, an embodiment of the present invention provides a method for constructing a knowledge graph in the power field, including: obtaining a plurality of entity triples based on a remote supervision method.
Assigning, through a policy network, a retained or not-retained state to each entity triple, and generating a denoised labeled data set from the retained entity triples.
Training a relation extraction model with the denoised labeled data set as the training set.
If the extraction accuracy of the trained relation extraction model is smaller than a preset value, obtaining a feedback value through feedback calculation based on a preset verification set.
Adjusting the state combination of the entity triples based on the feedback value to obtain the state combination that maximizes the expected feedback value of the policy network.
Taking the labeled data set generated from that state combination as the denoised labeled data set, and training the relation extraction model until the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value.
Extracting power field documents through the relation extraction model to obtain the knowledge graph.
In one possible implementation, the expected feedback values of the policy network are:
wherein $J(\Theta)$ is the expected feedback value, $E$ represents the state combination of the entity triples, $s$ represents the current state of the entity triples, $a$ represents the action executed by the policy network, and $r(s_i \mid T)$ represents the feedback value of the retained entity triples.
In one possible implementation, the obtaining a plurality of entity triples based on a remote supervision method includes:
Based on the first sample document, constructing a document concept graph through word segmentation and LDA topic clustering to obtain a plurality of document concept triples.
Calculating the confidence of each document concept triple by the following formula:
$W_j = C + \alpha \cdot \mathrm{COUNT}(D_j)$
wherein $W_j$ represents the confidence of the document concept triple, $C$ and $\alpha$ represent preset weights, $D_j$ represents the document concept triple, and $\mathrm{COUNT}(D_j)$ represents the number of times the document concept triple occurs in the first sample document.
If the confidence of a document concept triple is lower than a preset value, deleting the document concept triple to obtain the screened document concept graph.
Obtaining a plurality of entity triples by the remote supervision method based on the second sample document and the screened document concept graph.
In one possible implementation, the first sample document includes a plurality of documents, and the confidence of the document concept triples is:
$W_j = C + \sum_i \alpha_i \cdot \mathrm{COUNT}(D_{ij})$
wherein $\alpha_i$ represents the preset weight of the document concept triple $D_j$ in the i-th first sample document, and $\mathrm{COUNT}(D_{ij})$ represents the number of times $D_j$ occurs in that document.
In a possible implementation manner, the extracting, by the relationship extraction model, the power domain document to obtain a knowledge graph includes:
Extracting the power field documents through the relation extraction model to obtain document entity triples, wherein the document entity triples comprise document entities and document entity relations.
Obtaining data concept triples based on the fields of a power field relational database and the preset relations among the fields, wherein the data concept triples comprise data concepts and data concept relations.
Obtaining data entity triples based on the attribute values of the fields and the data concept triples, wherein the data entity triples comprise data entities and data entity relations.
Calculating the concept similarity between a document concept and a data concept based on the document entities corresponding to the document concept and the data entities corresponding to the data concept.
Merging the document concept and the data concept according to the concept similarity to obtain the knowledge graph after concept triple fusion.
In a possible implementation manner, after merging the document concept and the data concept according to the concept similarity to obtain the knowledge graph after the concept triple fusion, the method further includes:
Calculating the entity similarity between the document entity and the data entity by the following formula:
$\mathrm{sim}(x, y) = \alpha \sum_i s(x_i, y_i) + \beta \sum_i s(\mathrm{Ner}(x)_i, \mathrm{Ner}(y)_i)$
wherein $\mathrm{sim}(x, y)$ represents the entity similarity, $s$ represents the similarity of entity attributes, $x$ represents the document entity, $y$ represents the data entity, $x_i$ represents an attribute value contained in the document entity, $y_i$ represents an attribute value contained in the data entity, $\mathrm{Ner}(x)_i$ represents an associated entity of the document entity, $\mathrm{Ner}(y)_i$ represents an associated entity of the data entity, and $\alpha$ and $\beta$ represent preset weights.
Merging the document entity and the data entity according to the entity similarity to obtain the knowledge graph after entity triple fusion.
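The entity similarity formula can be illustrated with a minimal sketch. The text does not specify the attribute similarity s, so a string ratio from Python's `difflib` is used here purely as an assumption; the function names, the index-aligned attribute lists, and the weights 0.6 and 0.4 are likewise hypothetical.

```python
from difflib import SequenceMatcher


def s(a: str, b: str) -> float:
    """Assumed attribute similarity: character-level match ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def entity_similarity(x_attrs, y_attrs, x_neighbors, y_neighbors,
                      alpha=0.6, beta=0.4):
    """sim(x, y) = alpha * sum_i s(x_i, y_i) + beta * sum_i s(Ner(x)_i, Ner(y)_i).

    Attribute values and associated entities are assumed to be aligned by index.
    """
    attr_term = sum(s(xi, yi) for xi, yi in zip(x_attrs, y_attrs))
    neighbor_term = sum(s(nx, ny) for nx, ny in zip(x_neighbors, y_neighbors))
    return alpha * attr_term + beta * neighbor_term


# Hypothetical document entity and data entity describing the same transformer.
same = entity_similarity(["10kV", "T1"], ["10kV", "T1"], ["line L3"], ["line L3"])
# Hypothetical pair with nothing in common.
different = entity_similarity(["abc", "def"], ["xyz", "uvw"], ["ghi"], ["rst"])
```

In practice a merge decision would compare the score against a threshold; the text leaves that threshold unspecified, so none is assumed here.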
In a second aspect, an embodiment of the present invention provides an apparatus for constructing a knowledge graph in the power domain, including:
A remote supervision module, configured to obtain a plurality of entity triples based on a remote supervision method.
A first denoising module, configured to assign, through a policy network, a retained or not-retained state to each entity triple and to generate a denoised labeled data set from the retained entity triples.
A training module, configured to train a relation extraction model with the denoised labeled data set as the training set.
A feedback module, configured to obtain a feedback value through feedback calculation based on a preset verification set if the extraction accuracy of the trained relation extraction model is smaller than a preset value.
A state combination module, configured to adjust the state combination of the entity triples based on the feedback value to obtain the state combination that maximizes the expected feedback value of the policy network.
A second denoising module, configured to take the labeled data set generated from that state combination as the denoised labeled data set and to train the relation extraction model until the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value.
An extraction module, configured to extract power field documents through the relation extraction model to obtain the knowledge graph.
In one possible implementation, the expected feedback values of the policy network are:
wherein $J(\Theta)$ is the expected feedback value, $E$ represents the state combination of the entity triples, $s$ represents the current state of the entity triples, $a$ represents the action executed by the policy network, and $r(s_i \mid T)$ represents the feedback value of the retained entity triples.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method according to the first aspect or any one of the possible implementation manners of the first aspect.
The embodiment of the invention provides a method and a device for constructing a knowledge graph in the power field, and electronic equipment. A plurality of entity triples is obtained based on a remote supervision method. A policy network assigns each entity triple a retained or not-retained state, and a denoised labeled data set is generated from the retained triples. A relation extraction model is trained with the denoised labeled data set as the training set. If the extraction accuracy of the trained relation extraction model is smaller than a preset value, a feedback value is obtained through feedback calculation based on a preset verification set. The state combination of the entity triples is adjusted based on the feedback value to obtain the state combination that maximizes the expected feedback value of the policy network. The labeled data set generated from that state combination is taken as the denoised labeled data set, and the relation extraction model is trained until its extraction accuracy is greater than or equal to the preset value. Power field documents are then extracted through the relation extraction model to obtain the knowledge graph. In this method, the policy network treats whether each entity triple is retained as its state; the retained triples are used to generate a training set and train the relation extraction model; the output of the trained model is used as a feedback value to optimize the policy network and obtain the optimal combination of entity triple states; and the retained triples are again used to generate a training set and train the relation extraction model, until the extraction accuracy reaches the preset value. Through the policy network, the wrongly labeled noise data in the training set is reduced and the extraction accuracy of the relation extraction model is improved.
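The retain-or-discard loop summarized above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the policy is a one-feature logistic model updated with a REINFORCE-style rule, `train_and_score` is a hypothetical stand-in for training the relation extraction model and measuring its accuracy on the verification set, and the `feature` and `clean` fields, the learning rate, and the target accuracy are all invented for the simulation.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_and_score(kept):
    """Hypothetical stand-in for 'train the relation extraction model, then
    score it on the verification set': the fraction of retained triples whose
    label is actually correct."""
    if not kept:
        return 0.0
    return sum(t["clean"] for t in kept) / len(kept)


def denoise(triples, target=0.9, lr=0.5, max_rounds=500, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0   # policy parameters (Theta)
    baseline = 0.0    # running average of the feedback value
    best_kept, best_acc = list(triples), train_and_score(triples)
    n = len(triples)
    for _ in range(max_rounds):
        if best_acc >= target:  # extraction accuracy reached the preset value
            break
        # Action: sample a retained (1) / not-retained (0) state per triple.
        probs = [sigmoid(w * t["feature"] + b) for t in triples]
        actions = [1 if rng.random() < p else 0 for p in probs]
        kept = [t for t, a in zip(triples, actions) if a]
        reward = train_and_score(kept)  # feedback value
        if reward > best_acc:
            best_kept, best_acc = kept, reward
        # Policy-gradient step: d log pi(a) / d w = (a - p) * feature.
        advantage = reward - baseline
        w += lr * advantage * sum(
            (a - p) * t["feature"] for a, p, t in zip(actions, probs, triples)) / n
        b += lr * advantage * sum(a - p for a, p in zip(actions, probs)) / n
        baseline = 0.9 * baseline + 0.1 * reward
    return best_kept, best_acc


# Simulated labeled set: 60 correctly labeled and 40 noisy triples, each with
# one noisy feature that is only weakly informative about label quality.
data_rng = random.Random(1)
triples = [{"feature": data_rng.gauss(1.0 if clean else -1.0, 0.5), "clean": clean}
           for clean in [1] * 60 + [0] * 40]
kept, acc = denoise(triples)
```

The policy never reads the `clean` flag; it only sees the feature and the scalar feedback, which mirrors the idea that the verification-set result is the sole training signal for the policy network.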
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the embodiments or the prior art description will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings may be obtained according to these drawings without inventive labor.
FIG. 1 is a flowchart of an implementation of a method for constructing a knowledge graph in the power domain according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an implementation of another method for constructing a knowledge graph in the power domain according to an embodiment of the present invention;
FIG. 3 is a main interface diagram of a power domain knowledge graph modeling tool platform provided by an embodiment of the invention;
FIG. 4 is an input interface diagram of a power domain knowledge graph modeling tool platform according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a knowledge graph of a power domain knowledge graph modeling tool platform according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for constructing a knowledge graph in the power domain according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following description is made by way of specific embodiments with reference to the accompanying drawings.
In the existing mode of organizing and managing power system knowledge, the proportion of manual processing is high, the level of automation and intelligence is insufficient, processing efficiency is low, result accuracy is poor, and data extraction is difficult. A knowledge base therefore needs to be constructed to show the direct relations among pieces of knowledge, explore the potential relations among them, mine the rich connotation of power system knowledge, and make the organization and management of power system knowledge intelligent.
A knowledge graph is a semantic network used to abstractly represent concepts, entities, and relationships of the real world. With the introduction of the knowledge graph concept and the development of language models based on deep learning, many research efforts and applications represent knowledge in the form of triples (entity, attribute, attribute value) or (entity, relationship, entity), and extract knowledge instances from text using supervised or semi-supervised learning methods. For example, entities, relationships, attributes and the like are expressed in vector form, and a pre-trained language model is used to obtain related information such as entity categories and semantic similarity. The knowledge graph extracts knowledge instances from the text and represents the knowledge in the form of entity-attribute-value or entity-relationship-entity triples.
Knowledge graphs can be divided into domain-specific knowledge graphs and general knowledge graphs (usually encyclopedic). Specific domains may include the power field and the power marketing field. Unlike an encyclopedic knowledge graph, which has massive data and is constructed bottom-up, a domain-specific knowledge graph such as a power field knowledge graph is generally constructed top-down. Domain experts are required to sort out concepts, divide entity categories, determine the relationships between entity concepts, and set the specifications and constraints for the subsequent construction of the knowledge graph. Information extraction technology, including named entity recognition and relation extraction, is used to preliminarily screen and extract the domain document information, so that the entities defined in the knowledge graph and the relationships among them are accurately screened out from massive texts. Knowledge graph data is typically stored in a graph database, such as Neo4j. For example, knowledge instances, namely entity triples, are extracted from massive power field documents based on a relation extraction model to form the knowledge graph.
Generally, training a relation extraction model requires a large amount of document-based entity triple labeled data (i.e., sample data) as a training set. The training set may take the form of a set of entity triples, and typically needs to contain tens or even hundreds of thousands of triples. In a specific field such as the power field, little data is available, and collecting, labeling, and checking it consumes a great deal of manpower, material resources, and time. To obtain a large labeled data set, a remote supervision method is usually adopted, which assumes that if two entities have a relationship in a given knowledge base, then any unstructured sentence containing both entities expresses that relationship. That is, sentences containing the same entity pair are assumed to express similar relationships. Because this assumption is so strong, the resulting training set contains a large amount of wrongly labeled noise data. For example, two sentences may each contain the same entity pair yet express different relationships between the entities. The noise data in the training set reduces the extraction accuracy of the relation extraction model. The embodiment of the invention provides a method for constructing a knowledge graph in the power field, which aims to solve the problem of low extraction accuracy of the relation extraction model caused by noise data in the training set.
Referring to FIG. 1, which shows an implementation flowchart of a method for constructing a knowledge graph in the power field according to an embodiment of the present invention, the method is detailed as follows:
In step S1, a plurality of entity triples is obtained based on a remote supervision method.
The remote supervision method (distant supervision) maps an existing knowledge graph onto a large amount of unstructured data, thereby generating a large amount of training data for training the relation extraction model. That is, the remote supervision method generates a labeled data set based on a large amount of power field document data and a power field document concept graph. The labeled data set contains a plurality of entity triples and can be used as the training set of the relation extraction model. Because the remote supervision method relies on the document concept graph, how accurately the document concept graph describes the power field affects the accuracy of the training set the method generates.
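The labeling assumption behind remote supervision can be shown in a few lines. The sketch below is illustrative only: the function name, the substring matching, and the example entities and sentences are hypothetical, and a real pipeline would use entity recognition rather than plain substring tests.

```python
def distant_label(sentences, kb_triples):
    """Label every sentence that mentions both entities of a known triple."""
    labeled = []
    for sentence in sentences:
        for head, relation, tail in kb_triples:
            if head in sentence and tail in sentence:
                labeled.append({"sentence": sentence,
                                "triple": (head, relation, tail)})
    return labeled


# Hypothetical knowledge-base triple and sentences.
kb = [("transformer T1", "connection", "line L3")]
sentences = [
    "transformer T1 is connected to line L3 at the substation.",
    # Mentions both entities but does not state a connection: a noisy label.
    "transformer T1 was inspected after line L3 tripped.",
    "line L3 was repaired yesterday.",
]
labeled = distant_label(sentences, kb)
```

The second sentence receives the "connection" label even though it does not express that relationship, which is exactly the kind of wrongly labeled noise data the method later tries to remove.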
In one possible implementation, obtaining the plurality of entity triples based on the remote supervision method includes:
In step S11, a document concept graph is constructed through word segmentation and LDA topic clustering based on the first sample document, and a plurality of document concept triples are obtained. The formatting of the collected first sample document is removed, and the document is processed into text files with a uniform format. Preprocessing operations such as word segmentation and stop-word removal are performed on the text files to obtain a corpus containing a large number of words. An LDA topic clustering model projects the first sample document and the segmented words onto a set of topics, yielding a topic-word probability matrix and a document-topic probability distribution. The probability distribution of the segmented words on each topic is counted to obtain the words with a high occurrence probability on that topic; these words describe the meaning of the topic more accurately. For example, a power field expert may determine, from a topic and its high-probability words, the concepts corresponding to the topic and the links between the concepts, that is, document concept triples of the form concept-relationship-concept. A plurality of document concept triples constitutes the document concept graph. Furthermore, other documents in the same business field can be screened out according to the topic to enrich the description of the concepts and complete the connections between them.
For example, the first sample document may be an electric power industry text document. The first sample document may include, but is not limited to, one or more of the following: text documents such as power industry knowledge manuals, specifications, work reports and common problem sets. By way of example, concepts may include, but are not limited to, one or more of the following: units (classification), industries (classification), lines, transformer areas, users, metering points, transformers, electric energy meters, catalog electricity prices, electricity utilization strategies, statistical periods and the like. Relationships may include, but are not limited to, one or more of the following: hierarchy, containment, attribute, execution, service, management, connection, statistics, and the like.
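Two of the steps in S11, stop-word removal and picking the high-probability words of each topic, can be sketched without any topic-modeling library. Everything here is an assumption made for illustration: the whitespace tokenizer stands in for a real word segmenter, the stop-word list is invented, and the topic-word matrix is a hand-written stand-in for what a fitted LDA model would produce.

```python
STOP_WORDS = {"the", "a", "of", "is", "to"}  # hypothetical stop-word list


def preprocess(text):
    """Whitespace tokenization plus stop-word removal (stand-in for a segmenter)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def top_words(topic_word, vocab, k=3):
    """Return the k highest-probability words for each topic row."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        result.append([vocab[i] for i in ranked[:k]])
    return result


tokens = preprocess("The meter of the user")

# Hand-written topic-word probabilities; each row is one topic.
vocab = ["transformer", "line", "meter", "price", "user", "tariff"]
topic_word = [
    [0.40, 0.30, 0.20, 0.03, 0.04, 0.03],
    [0.02, 0.03, 0.10, 0.35, 0.20, 0.30],
]
topics = top_words(topic_word, vocab)
```

A domain expert would then read each word list and name the concepts and relationships it suggests; that manual step is not something the code attempts.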
Measured against the power field or the requirements of a specific project, the document concept graph obtained through word segmentation and LDA topic clustering contains triples of low importance, so further screening is needed.
In step S12, the confidence of each document concept triple is calculated by the following formula:
$W_j = C + \alpha \cdot \mathrm{COUNT}(D_j)$
wherein $W_j$ represents the confidence of the document concept triple, $C$ and $\alpha$ represent preset weights, $D_j$ represents the document concept triple, and $\mathrm{COUNT}(D_j)$ represents the number of times the document concept triple occurs in the first sample document. The document concept triples appearing in the document are counted and their weights adjusted on the principle that the more often a triple appears, the higher its confidence. In this way the confidence $W_j$ of the j-th document concept triple $D_j$ is computed.
In step S13, if the confidence of the document concept triple is lower than the preset value, the document concept triple is deleted, and the document concept graph after screening is obtained.
Illustratively, document concept triples are filtered against a confidence threshold, where $\sigma$ denotes the preset value: document concept triples $D_j$ whose confidence $W_j$ is lower than $\sigma$ are deleted, and document concept triples $D_j$ whose confidence $W_j$ is greater than or equal to $\sigma$ are retained, giving the screened document concept graph.
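Steps S12 and S13 amount to a count, a linear score, and a threshold. The following minimal sketch shows that arithmetic; the function names, the example triples, and the values chosen for C, α, and σ are all hypothetical.

```python
from collections import Counter


def triple_confidence(occurrences, C=0.1, alpha=0.05):
    """W_j = C + alpha * COUNT(D_j) for every distinct concept triple."""
    counts = Counter(occurrences)
    return {triple: C + alpha * n for triple, n in counts.items()}


def screen(confidences, sigma):
    """Keep triples whose confidence is greater than or equal to sigma."""
    return {t for t, w in confidences.items() if w >= sigma}


# Hypothetical occurrences of concept triples found in the first sample document.
occurrences = [
    ("user", "contains", "metering point"),
    ("user", "contains", "metering point"),
    ("user", "contains", "metering point"),
    ("line", "connection", "transformer"),
]
conf = triple_confidence(occurrences)
kept = screen(conf, sigma=0.2)
```

With these values the triple seen three times scores about 0.25 and is retained, while the triple seen once scores about 0.15 and is deleted.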
Illustratively, document concept triples may also be filtered against predefined concept triples $K_i$. Power field experts determine the predefined concept triples $K_i$ related to the domain. Document concept triples $D_j$ that do not belong to the concept triple set $\{K_i\}$ are deleted, and document concept triples $D_j$ that belong to $\{K_i\}$ are retained.
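Filtering against the expert-defined set is a plain membership test, sketched below with hypothetical triples and a hypothetical function name.

```python
def screen_by_predefined(document_triples, predefined):
    """Keep only document concept triples that appear in the predefined set {K_i}."""
    allowed = set(predefined)
    return [t for t in document_triples if t in allowed]


# Hypothetical expert-defined concept triples and extracted document triples.
predefined = [("user", "contains", "metering point"),
              ("line", "connection", "transformer")]
document_triples = [("user", "contains", "metering point"),
                    ("report", "mentions", "weather")]
screened = screen_by_predefined(document_triples, predefined)
```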
In step S14, a plurality of entity triples are obtained by a remote supervision method based on the second sample document and the screened document concept graph. For example, the second sample document, like the first sample document, may be an electric power industry text document. Illustratively, the data volume of the second sample document is greater than that of the first sample document.
According to the method for constructing the knowledge graph in the power field provided by the embodiment of the invention, the document concept graph is constructed through word segmentation and LDA topic clustering, and the confidence of each document concept triple is calculated on the premise that the more often a triple occurs, the higher its confidence. Document concept triples with low confidence are removed, which reduces the irrelevant data in the document concept graph, improves the accuracy of the training set obtained by the remote supervision method, and in turn improves the extraction accuracy of the trained relation extraction model.
The above approach is suitable when the number of first sample documents is small. When the number is larger, the growing absolute number of low-importance documents necessarily raises the confidence of the triples they contain, so the confidence no longer fully reflects the importance of the documents. The differences in importance between documents therefore need to be taken into account. Illustratively, different confidence preset values σ and weights α are set for different types of documents.
In one possible implementation, the first sample document includes a plurality of documents, and the confidence of the document concept triplets is:
$$W_j = C + \sum_i \alpha_i \cdot \mathrm{COUNT}(D_{ij})$$
where $\alpha_i$ denotes the preset weight of document concept triple $D_j$ in the $i$-th first sample document, and $\mathrm{COUNT}(D_{ij})$ denotes the number of times $D_j$ occurs in that document. The first sample document includes a plurality of documents, and a different weight $\alpha_i$ is set according to the degree of importance of each document, so the same document concept triple $D_j$ has a different weight $\alpha_i$ in documents of different importance. Illustratively, the confidence of the domain knowledge contained in a national standard document or an industry specification document should be higher than that of the domain knowledge contained in a general process record document. Illustratively, different confidence preset values $\sigma$ are set according to the degree of importance of each document.
According to the method for constructing the knowledge graph in the power field, provided by the embodiment of the invention, the confidence of more accurate document concept triples is obtained by presetting different calculation weights for documents with different importance degrees.
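The per-document weighting can be sketched as follows. This is a minimal illustration that assumes each document is supplied together with its weight α_i; the function name, the weights, and the sample triples are hypothetical.

```python
from collections import Counter


def weighted_confidence(documents, c=0.0):
    """Compute W_j = C + sum_i(alpha_i * COUNT(D_ij)).

    `documents` is a list of (alpha_i, triples_in_document_i) pairs.
    """
    weighted = {}
    for alpha_i, triples in documents:
        for triple, n in Counter(triples).items():
            weighted[triple] = weighted.get(triple, 0.0) + alpha_i * n
    return {triple: c + w for triple, w in weighted.items()}


meter_of_user = ("electric energy meter", "belongs to", "user")
report_of_line = ("work report", "mentions", "line")

documents = [
    (2.0, [meter_of_user]),                                 # assumed weight for an industry specification
    (0.5, [meter_of_user, meter_of_user, report_of_line]),  # assumed weight for a process record
]

confidence = weighted_confidence(documents, c=0.0)
```

In this example a single occurrence in the heavily weighted specification contributes as much as four occurrences in the process record would.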
In step S2, a denoised annotation data set is generated through a policy network, with retained or not retained as the state of each entity triple.
The plurality of entity triples obtained by the remote supervision method form a triple set, namely the annotation data set. A policy network is a network that, through learning, produces a certain output for a given input. Each entity triple has a state of retained or not retained, and the retained entity triples constitute the annotation data set. The states of all entity triples together constitute a state combination, and different state combinations generate different annotation data sets. Illustratively, in step S2, the annotation data set generated under an arbitrary state combination is taken as the denoised annotation data set; that is, any state combination may serve as the initial state combination.
In step S3, the denoised annotation data set is used as the training set to train the relation extraction model.
The relation extraction model completes training after multiple iterations on the training set. The extraction accuracy of the trained relation extraction model can then be calculated on a preset verification set.
In step S4, if the extraction accuracy of the trained relation extraction model is smaller than a preset value, a feedback value is obtained through feedback calculation based on the preset verification set. Illustratively, the trained relation extraction model performs extraction on the preset verification set to obtain an extracted triple set; the extracted triple set is compared with the standard triple set annotated in the verification set to obtain a performance evaluation of the relation extraction model, such as the extraction accuracy; and the performance evaluation is used as the feedback value for optimizing the parameters of the policy network.
In step S5, based on the feedback values, the state combinations of the entity triplets are adjusted to obtain the state combination that maximizes the expected feedback value of the policy network. The desired feedback value is maximized by optimizing the parameters of the policy network, i.e. the state combinations.
In one possible implementation, the expected feedback values for the policy network are:
where $J(\Theta)$ is the expected feedback value, $E$ represents the state combination of the entity triples, $s$ represents the current state of an entity triple, $a$ represents the action executed by the policy network, and $r(s_i \mid T)$ represents the feedback value of a retained entity triple. Given the current state $s$ of the entity triples, the policy network executes action $a$ and generates a new state combination $E$ of the entity triples. The expected feedback value $J(\Theta)$ is calculated from the state combination $E$ together with the feedback values $r(s_i \mid T)$ of the retained entity triples. Finally, the state combination $E$ of the entity triples that maximizes the expected feedback value $J(\Theta)$ is determined.
In step S6, the relation extraction model is trained by using the labeled data set generated based on the state combination as the denoised labeled data set, until the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value. The entity triples retained under the state combination with the maximum expected feedback value form the denoised labeled data set, and the relation extraction model is trained again on it. If the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value, training ends and step S7 is executed. If the extraction accuracy is smaller than the preset value, steps S4, S5 and S6 are repeated until the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value.
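The control flow of steps S2 to S6 can be sketched as a loop. The sketch below is a toy under stated assumptions: the "model" simply memorizes its training triples, the accuracy is the fraction of memorized triples found in a hypothetical gold set, and a greedy search (`greedy_adjust`) stands in for the policy network, whose actual form is not reproduced here.

```python
def denoise(triples, states):
    """Build the annotation data set from the triples whose state is 'retained'."""
    return [t for t, keep in zip(triples, states) if keep]


def build_model(triples, train, evaluate, adjust, threshold, max_rounds=50):
    states = [True] * len(triples)                # S2: arbitrary initial state combination

    def feedback(candidate_states):               # S4: feedback value of a state combination
        return evaluate(train(denoise(triples, candidate_states)))

    for _ in range(max_rounds):
        model = train(denoise(triples, states))   # S3 / S6: train on the denoised set
        accuracy = evaluate(model)
        if accuracy >= threshold:                 # stop once the preset value is reached
            break
        states = adjust(states, feedback)         # S5: move to a better state combination
    return model, states, accuracy


# Toy stand-ins (assumptions, not the embodiment's models).
gold = {("meter", "belongs_to", "user"), ("user", "belongs_to", "transformer area")}
noisy = [("meter", "belongs_to", "user"),
         ("meter", "manages", "line"),            # wrongly labeled triple
         ("user", "belongs_to", "transformer area")]


def train(dataset):
    return set(dataset)


def evaluate(model):
    return len(model & gold) / len(model) if model else 0.0


def greedy_adjust(states, feedback):
    """Drop the single retained triple whose removal gives the highest feedback."""
    candidates = [states[:i] + [False] + states[i + 1:]
                  for i, keep in enumerate(states) if keep]
    return max(candidates, key=feedback) if candidates else states


model, states, accuracy = build_model(noisy, train, evaluate, greedy_adjust, threshold=1.0)
```

In the toy run the wrongly labeled triple is dropped after one adjustment, at which point the accuracy reaches the threshold and the loop stops.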
In step S7, the power domain document is extracted by the relationship extraction model to obtain a knowledge graph. And the relation extraction model extracts entity triples from a large number of electric power field documents, and the obtained entity triplet sets form a knowledge graph.
Illustratively, the state is defined as a feature representation of the current sentence and of the entities in the triple of that sentence. Fig. 2 is a flowchart of an implementation of another method for constructing a knowledge graph in the power domain according to an embodiment of the present invention. Referring to Fig. 2: x represents a sentence, t represents the corresponding triple in the sentence, and s is the current state of the entity triple. Specifically, in each state, an action (keep Y / delete N) is sampled for the current triple t in sentence $x_m$, and an action sequence is generated once actions have been sampled for all triples in the data set. The denoised data set is generated from the original data and the action sequence. A feedback function receives the feedback value from the triple extraction model and guides the optimization of the policy network. The policy network identifies noisy triples in the sentences to form the denoised data set, which is input into the relation extraction model for the next iteration.
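The action sampling described above can be sketched as follows. The logistic policy, the feature vectors, and the weights are assumptions introduced only to make the sampling concrete; the structure of the actual policy network is not specified here.

```python
import math
import random


def keep_probability(state_features, weights):
    """Probability of action Y (keep) under an assumed logistic policy."""
    z = sum(w * f for w, f in zip(weights, state_features))
    z = max(-60.0, min(60.0, z))          # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def sample_actions(states, weights, rng):
    """Sample one action (Y = keep, N = delete) per triple state."""
    return ["Y" if rng.random() < keep_probability(s, weights) else "N"
            for s in states]


def apply_actions(data, actions):
    """Generate the denoised data set from the original data and the action sequence."""
    return [item for item, action in zip(data, actions) if action == "Y"]


rng = random.Random(0)                    # fixed seed so the sketch is repeatable
states = [[100.0], [-100.0], [0.0]]       # hypothetical one-dimensional state features
actions = sample_actions(states, weights=[1.0], rng=rng)
denoised = apply_actions(["t1", "t2", "t3"], actions)
```

A feedback value from the relation extraction model would then be used to update `weights`; that update step is left out because its form depends on the expected feedback formula.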
In the embodiment of the invention, the policy network decides whether each entity triple is retained, and the retained triples form a training set used to train the relation extraction model. The output of the trained relation extraction model is used as a feedback value to optimize the policy network and obtain the optimal state combination of the entity triples, and the retained triples are again used to form a training set and train the relation extraction model until the extraction accuracy reaches the preset value. In each training iteration, wrongly labeled triple samples are filtered out of the given sample set, so that the relation extraction model becomes more accurate on the verification set. The policy network thus reduces the wrongly labeled noise data in the training set and improves the extraction accuracy of the relation extraction model.
The data sources for constructing the knowledge graph may include not only documents but also relational databases in the power domain. Generally, large business systems are built around production, marketing, and operation and maintenance in the power field, and their data are modeled as two-dimensional tables and stored in relational databases. The business database contains a rich domain knowledge structure: a conceptual model of the knowledge graph can be built from the E-R model of the business system database, and the knowledge graph can be built quickly from the relationships between basic data once domain experts have sorted them out. Illustratively, power domain knowledge is contained not only in a large number of domain documents but also in the business database. Documents provide diverse expressions of business concepts, rich concept attribute information, and relations between concepts. The concept graph obtained from the business relational database is not consistent with the concept graph formed from the domain documents, so the two need to be integrated. Whether two concepts, one from the business data graph and one from the domain document graph, should be merged into one class can be determined by judging the similarity of the instance data they contain; the domain document knowledge graph can then be fused into the business knowledge graph library through operations such as entity alignment, entity disambiguation, and relation completion.
In one possible implementation manner, extracting the electric power domain document through the relation extraction model to obtain the knowledge graph comprises the following steps:
In step S71, the power domain document is extracted through the relation extraction model to obtain document entity triples, where each document entity triple includes document entities and a document entity relationship. The set of document entity triples is the knowledge graph of document entities.
In step S72, data concept triples are obtained based on the preset relationships between the fields of the power domain relational database, where each data concept triple includes data concepts and a data concept relationship. The set of data concept triples is the data concept graph. Generally, a table of a relational database can be modeled as a type of concept in the concept graph, the fields of the table represent attributes of the concept, and the relationships between concepts can be obtained through the primary keys and foreign keys of the tables. Illustratively, the table name of a database table serves as a concept and the fields of the table serve as attributes of the concept. The primary keys and foreign keys between data tables are obtained from the E-R model of the business system database to derive the relationships between concepts and construct the corresponding triples. With the goal of constructing a knowledge graph, logical concepts are derived from the existing table structure using the information in the business database, and the required entities and the attributes and relationships to be retained are aggregated and screened, yielding the knowledge structure, ontology, and triple model of the business data concept graph.
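The mapping from table structure to concept triples can be sketched as below. The schema dictionary, the table and field names, and the relation labels `attribute` and `references` are all hypothetical; a real system would read this information from the E-R model of the business database.

```python
# Hypothetical description of two business tables.
schema = {
    "transformer_area": {
        "fields": ["area_id", "area_name"],
        "foreign_keys": {},
    },
    "customer": {
        "fields": ["customer_id", "customer_name", "area_id"],
        "foreign_keys": {"area_id": "transformer_area"},   # field -> referenced table
    },
}


def concept_triples(schema):
    """Treat each table as a concept, its fields as attributes,
    and each foreign key as a relationship between two concepts."""
    attribute_triples, relation_triples = [], []
    for table, info in schema.items():
        for field in info["fields"]:
            if field not in info["foreign_keys"]:
                attribute_triples.append((table, "attribute", field))
        for target in info["foreign_keys"].values():
            relation_triples.append((table, "references", target))
    return attribute_triples, relation_triples


attributes, relations = concept_triples(schema)
```

Domain experts would typically rename the generic `references` relation to something meaningful for the business, such as a containment or management relation.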
In step S73, a data entity triplet is obtained based on the attribute values of the fields and the data concept triplet, wherein the data entity triplet includes the data entity and the data entity relationship. A collection of triples of data entities is a knowledge-graph of data entities.
The knowledge graph of data entities is established on the basis of the data concept graph. Illustratively, the business database tables are exported as CSV files using a database management tool, or the instance data in the different tables are extracted with SQL statements by connecting to the database from Python. The data obtained from the database tables are cleaned with the data analysis package Pandas. Data cleaning may include matching strings with regular expressions, removing duplicate or redundant data, removing abnormal values, and the like, and is used to screen the instance data. Illustratively, feature selection may be performed according to business requirements to eliminate redundant or irrelevant attributes. Illustratively, to suit storage in a graph database, the primary key attribute is recorded and two attributes are added, namely the label of the entity and the entity name; the entities extracted from the same data table, together with the attribute values to be retained, are stored in a single entity file. Two or more entity files are then merged according to the triple relationships of the business data concept graph, and only the primary key attributes of the subject and the object and the relationship attribute between the entities are retained, forming a triple instance file, namely the data entity triple set.
Illustratively, the attribute values of the fields are used as data entities, and the data concept relationships in the data concept triples are used as the relationships between data entities. Illustratively, the data tables of the business database are traversed, and data cleaning, data screening, feature selection, and the like are performed based on the data concept graph to obtain the data table corresponding to each concept. Instances are extracted from the set of sub-tables of the source data tables: entities and their attributes can be divided and selected directly by table name, and relationships can be screened according to the primary keys and foreign keys that exist between the data tables in the business database. When instances are added to the knowledge graph, the entities and their attribute values are first extracted from the database, and the triple instances between entities are then stored, forming the knowledge graph library of the business data.
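The extraction of entities and triple instances with SQL from Python can be sketched with the standard-library `sqlite3` module. This is an illustration only: the in-memory database, the table and column names, and the `belongs_to` relation are hypothetical, and the cleaning step is reduced to dropping rows without a name instead of the Pandas-based cleaning mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transformer_area (area_id TEXT PRIMARY KEY, area_name TEXT);
CREATE TABLE customer (customer_id TEXT PRIMARY KEY, customer_name TEXT,
                       area_id TEXT REFERENCES transformer_area(area_id));
INSERT INTO transformer_area VALUES ('A1', 'North Area');
INSERT INTO customer VALUES ('U1', 'Factory 1', 'A1');
INSERT INTO customer VALUES ('U2', 'Factory 2', 'A1');
INSERT INTO customer VALUES ('U3', NULL, 'A1');
""")


def entity_rows(conn, table, key, name):
    """Extract entities with a label, primary key and entity name.
    Identifiers are assumed to come from the trusted concept graph."""
    query = (f"SELECT {key}, {name} FROM {table} "
             f"WHERE {name} IS NOT NULL ORDER BY {key}")
    return [{"label": table, "id": k, "name": n} for k, n in conn.execute(query)]


def relation_rows(conn, table, key, foreign_key, relation):
    """Build (subject key, relation, object key) triples from a foreign key."""
    query = (f"SELECT {key}, {foreign_key} FROM {table} "
             f"WHERE {foreign_key} IS NOT NULL ORDER BY {key}")
    return [(s, relation, o) for s, o in conn.execute(query)]


customers = entity_rows(conn, "customer", "customer_id", "customer_name")
areas = entity_rows(conn, "transformer_area", "area_id", "area_name")
kept_ids = {e["id"] for e in customers + areas}

# Keep only triples whose subject and object survived the cleaning step.
triples = [t for t in relation_rows(conn, "customer", "customer_id", "area_id", "belongs_to")
           if t[0] in kept_ids and t[2] in kept_ids]
```

The entity lists correspond to the entity files and `triples` to the triple instance file described above.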
In step S74, a concept similarity between the document concept and the data concept is calculated based on the document entity corresponding to the document concept and the data entity corresponding to the data concept. Each document concept corresponds to a set of document entities. Each data concept corresponds to a set of data entities. The document concept and the data concept represent that the acquisition source of the concept is a document and a relational database. The document concept and the data concept representing the same concept may have different names, and the similarity between the two concepts can be calculated based on the entity corresponding to the concept. The types of the document entities and the data entities can be numerical types and/or text types.
Illustratively, the document concept corresponds to a group of numerical document entities, the data concept corresponds to a group of numerical data entities, variance analysis is performed on entity data corresponding to the two concepts, and whether the two concepts belong to the same concept is judged by comparing the results with the set confidence.
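One way to picture the variance analysis is a one-way ANOVA F statistic over the two groups of numerical entities. The sketch below is an assumption about how the comparison could be carried out: the function names are hypothetical, and the critical value 5.32 is approximately the 5% critical value of the F distribution for (1, 8) degrees of freedom, which matches the sample sizes used here.

```python
from statistics import mean


def anova_f(group_a, group_b):
    """One-way ANOVA F statistic for two groups of numerical entity values."""
    groups = [group_a, group_b]
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def same_concept(doc_values, data_values, f_critical):
    """Treat the two concepts as one when the group means do not differ significantly."""
    return anova_f(doc_values, data_values) < f_critical


doc_voltage = [220, 221, 219, 220, 220]      # hypothetical values from documents
db_voltage = [220, 219, 221, 220, 221]       # hypothetical values from the database
db_current = [10, 11, 9, 10, 10]

voltage_match = same_concept(doc_voltage, db_voltage, f_critical=5.32)
current_match = same_concept(doc_voltage, db_current, f_critical=5.32)
```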
Illustratively, the document concept corresponds to a group of text-type document entities, the data concept corresponds to a group of text-type data entities, and semantic similarity calculation is performed on entity data corresponding to the two concepts to judge whether the two concepts belong to the same concept.
In step S75, the document concept and the data concept are merged according to the concept similarity, and a knowledge graph after the concept triple fusion is obtained.
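The merging step can be sketched once a concept similarity is available. In the sketch below, the Jaccard overlap of entity sets is used purely as a stand-in for the semantic similarity calculation, which is not specified in detail; the concept names, entity identifiers, and threshold are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two entity sets, used here as a stand-in similarity measure."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_concepts(doc_concepts, data_concepts, threshold=0.5):
    """Map each document concept to its most similar data concept,
    provided the similarity reaches the threshold."""
    mapping = {}
    for doc_name, doc_entities in doc_concepts.items():
        best = max(data_concepts,
                   key=lambda name: jaccard(doc_entities, data_concepts[name]),
                   default=None)
        if best is not None and jaccard(doc_entities, data_concepts[best]) >= threshold:
            mapping[doc_name] = best
    return mapping


doc_concepts = {
    "electricity meter": {"M001", "M002", "M003"},
    "work report": {"R1"},
}
data_concepts = {
    "electric_energy_meter": {"M001", "M002", "M004"},
    "line": {"L1"},
}

merged = merge_concepts(doc_concepts, data_concepts, threshold=0.5)
```

Document concepts that find no sufficiently similar data concept, such as "work report" here, stay as separate concepts in the fused graph.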
According to the construction method of the knowledge graph in the power field, provided by the embodiment of the invention, whether the concepts belong to the same concept is judged by calculating the similarity of entities contained in the concepts of different acquisition sources, so that the data redundancy is reduced, and the usability of the knowledge graph is improved.
After the document concept and the data concept describing the same concept are fused, the corresponding entity sets are also merged. Entities with different names may also exist in the merged entity set to describe the same object, and the entities need to be further fused.
In a possible implementation manner, after merging the document concept and the data concept according to the concept similarity to obtain the knowledge graph after the concept triple fusion, the method further includes:
The entity similarity between the document entity and the data entity is calculated by the following formula:
$$\mathrm{sim}(x, y) = \alpha \sum_i s(x_i, y_i) + \beta \sum_i s(\mathrm{Ner}(x)_i, \mathrm{Ner}(y)_i)$$
where $\mathrm{sim}(x, y)$ denotes the entity similarity, $s$ denotes the similarity of entity attributes, $x$ denotes the document entity, $y$ denotes the data entity, $x_i$ denotes an attribute value of the document entity, $y_i$ denotes an attribute value of the data entity, $\mathrm{Ner}(x)_i$ denotes an associated entity of the document entity, $\mathrm{Ner}(y)_i$ denotes an associated entity of the data entity, and $\alpha$ and $\beta$ denote preset weights. Typically, an entity includes a plurality of attributes and attribute values. The similarity of two entities is judged comprehensively from the similarity of the attributes and attribute values they contain and the similarity of their associated entities. When the entity similarity exceeds a preset threshold, the two entities are judged to be the same entity.
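The entity similarity formula can be sketched as follows. The pairwise similarity s is not fixed by the formula, so the sketch substitutes the string ratio from the standard-library `difflib.SequenceMatcher`; pairing attributes by shared attribute name and associated entities by position is likewise an assumption, as are the sample entities and weights.

```python
from difflib import SequenceMatcher


def s(a, b):
    """Stand-in pairwise similarity in [0, 1] based on string matching."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def entity_similarity(x, y, alpha=0.5, beta=0.5):
    """sim(x, y) = alpha * sum_i s(x_i, y_i) + beta * sum_i s(Ner(x)_i, Ner(y)_i)."""
    shared = x["attributes"].keys() & y["attributes"].keys()
    attribute_term = sum(s(x["attributes"][k], y["attributes"][k]) for k in shared)
    neighbor_term = sum(s(p, q) for p, q in zip(x["neighbors"], y["neighbors"]))
    return alpha * attribute_term + beta * neighbor_term


# Hypothetical document entity and data entities.
doc_entity = {"attributes": {"name": "No. 1 Transformer", "capacity": "400kVA"},
              "neighbors": ["North Area"]}
data_entity = {"attributes": {"name": "No. 1 Transformer", "capacity": "400kVA"},
               "neighbors": ["North Area"]}
other_entity = {"attributes": {"name": "Meter 7", "capacity": "5A"},
                "neighbors": ["South Line"]}

same_score = entity_similarity(doc_entity, data_entity)
other_score = entity_similarity(doc_entity, other_entity)
```

Because the formula sums over attributes without normalizing, the preset threshold has to be chosen with the number of compared attributes in mind.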
Illustratively, if the types of the two entities are numerical types, variance analysis is performed on the attribute values of the two entities to obtain the similarity of the attributes of the entities.
Illustratively, if the types of the two entities are text types, semantic similarity calculation is performed on the attribute values of the two entities to obtain the similarity of the attributes of the entities.
And combining the document entity and the data entity according to the entity similarity to obtain the knowledge graph after the entity triple fusion.
According to the construction method of the knowledge graph in the power field provided by the embodiment of the invention, the similarity of two entities is judged comprehensively from the similarity of the attributes and attribute values they contain and the similarity of their associated entities, so as to decide whether the two entities are the same entity. This reduces data redundancy and improves the usability of the knowledge graph.
Illustratively, based on the fused concept graph, the entities and relationship instances in the business data knowledge graph and the domain document knowledge graph are mapped into low-dimensional vectors. The degree of similarity between entities is judged from the similarity of the attributes and attribute values they contain, with the associated node information of each node used as an auxiliary criterion, and entities are fused when their similarity exceeds a preset threshold. A synonym library is constructed for equivalent instances and equivalent attributes, which makes the knowledge graph library more intelligent.
Illustratively, the construction method of the knowledge graph in the power field provided by the embodiment of the invention can be applied to business data in the power marketing field: a knowledge base is constructed with deep learning models and algorithms, multi-source heterogeneous data in the field are integrated and reasoned over to form structured knowledge, and a graph database is used for efficient storage. At the macro level it supports management decision-making, and at the micro level it provides front-line staff with associated data displays and summary information based on business management logic. By automatically constructing a knowledge graph for power marketing, marketing operation and management knowledge is organized and accumulated, the intelligent management needs of power enterprises are met, scientific management and informatization of power enterprises are promoted, and their management and service levels are effectively improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
FIG. 3 is a main interface diagram of a knowledge-graph modeling tool platform provided by an embodiment of the invention. FIG. 4 is an input interface diagram of a knowledge-graph modeling tool platform provided by an embodiment of the invention. FIG. 5 is a diagram of an example knowledge graph of a knowledge graph modeling tool platform provided by an embodiment of the invention. Referring to fig. 3, 4 and 5: the embodiment of the invention also provides a knowledge graph modeling tool platform which provides functions of entity extraction and triple generation based on database files, triple screening based on documents, concept graph management, knowledge graph instance generation and the like. The knowledge graph modeling tool platform is developed based on a Django framework, and mainly comprises 4 parts of functions, namely graph generation based on database files, graph generation based on domain document files, concept graph management and knowledge graph instance generation.
For database files input into the modeling tool platform, the relationships between concepts can be obtained by analyzing the field dependencies within and between the table files. For input domain document files, concept descriptions can be enriched and triple relationships completed by analyzing the relational facts contained in natural language. A knowledge graph system built by extracting instance data based on a concept graph is dynamic and updatable, and can describe real-world knowledge objects at fine granularity. When the method for establishing the document concept graph in step S2 is applied to domain knowledge, only information such as the number of occurrences of each triple needs to be recorded, and the historical documents and database files do not all need to be stored, which reduces the storage overhead of the platform server. The weights can be modified as the number of documents grows, and the concept graph and the knowledge graph can be modified through mandatory rules, so the knowledge graph is easy to rebuild and update.
Graph generation based on database files: a corresponding business data concept graph is generated from the database table information, fields, and primary and foreign key dependency information in the business database files, and the entities with their attribute values and the triple instance data are stored in entity files and relation triple instance files, respectively.
Graph generation based on document files: different types of document files are processed accordingly, a joint entity and relation extraction algorithm for long texts is applied to extract the concept graph structure and knowledge graph instances of the relevant knowledge within the business domain, and the descriptions of concepts in the documents are stored.
Concept graph management: the extracted knowledge triples are imported into the concept graph to define the relationship types between concepts, which serve as the schema and prior knowledge for knowledge graph instance generation. Management of the concept graph is also provided, which makes it convenient to extend and modify the graph.
Knowledge graph instance generation: on the basis of the concept graph, the instance files are imported into the graph database Neo4j to form a knowledge graph library, which supports knowledge retrieval and professional demonstration applications.
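The import of instance files into Neo4j can be pictured as generating parameterized Cypher statements from the entity and triple records. The sketch below only builds the statement text and parameters; how they are sent to the database (for example through a Neo4j client) is left out. The helper names, labels, and relationship type are hypothetical.

```python
def node_statement(label, entity_id, name):
    """Cypher MERGE for one entity. Labels cannot be passed as Cypher parameters,
    so the label is interpolated and assumed to come from the trusted concept graph."""
    query = f"MERGE (n:{label} {{id: $id}}) SET n.name = $name"
    return query, {"id": entity_id, "name": name}


def relation_statement(rel_type, subject_id, object_id):
    """Cypher MERGE for one triple instance between two existing nodes."""
    query = (f"MATCH (a {{id: $src}}), (b {{id: $dst}}) "
             f"MERGE (a)-[:{rel_type}]->(b)")
    return query, {"src": subject_id, "dst": object_id}


entities = [("Customer", "U1", "Factory 1"), ("TransformerArea", "A1", "North Area")]
triples = [("U1", "BELONGS_TO", "A1")]

statements = [node_statement(*e) for e in entities]
statements += [relation_statement(rel, s, o) for s, rel, o in triples]
```

Using MERGE rather than CREATE keeps repeated imports from duplicating nodes and relationships, which suits a graph that is rebuilt and updated over time.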
The embodiment of the invention provides a knowledge graph modeling tool platform, which is based on the field knowledge graph construction steps and is used for constructing a Web-based high-efficiency reusable modeling tool platform. A user does not need to know details and technical flows in the process of establishing the knowledge graph, and the end-to-end knowledge graph establishment can be completed only by selecting an original document and a database file. The tool reduces manual participation, facilitates the updating and utilization of the knowledge graph, thereby improving the automation and intelligence degree of the establishment of the domain knowledge graph and reducing the cost of establishing the electric power marketing knowledge graph.
The following are embodiments of the apparatus of the invention, reference being made to the corresponding method embodiments described above for details which are not described in detail therein.
Fig. 6 shows a schematic structural diagram of an apparatus for constructing a knowledge graph in the power domain according to an embodiment of the present invention, and for convenience of description, only the parts related to the embodiment of the present invention are shown, which is detailed as follows:
As shown in fig. 6, an apparatus 2 for constructing a power domain knowledge graph includes:
A remote supervision module 21, configured to obtain a plurality of entity triples based on a remote supervision method.
A first denoising module 22, configured to generate a denoised annotation data set through a policy network, with retained or not retained as the state of each entity triple.
A training module 23, configured to train a relation extraction model using the denoised annotation data set as the training set.
A feedback module 24, configured to obtain a feedback value through feedback calculation based on a preset verification set if the extraction accuracy of the trained relation extraction model is smaller than a preset value.
A state combination module 25, configured to adjust the state combination of the entity triples based on the feedback value, so as to obtain the state combination that maximizes the expected feedback value of the policy network.
A second denoising module 26, configured to train the relation extraction model using the annotation data set generated based on the state combination as the denoised annotation data set, until the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value.
An extraction module 27, configured to extract the power domain document through the relation extraction model to obtain the knowledge graph.
In the embodiment of the invention, the policy network decides whether each entity triple is retained, and the retained triples form a training set used to train the relation extraction model. The output of the trained relation extraction model is used as a feedback value to optimize the policy network and obtain the optimal state combination of the entity triples, and the retained triples are again used to form a training set and train the relation extraction model until the extraction accuracy reaches the preset value. In each training iteration, wrongly labeled triple samples are filtered out of the given sample set, so that the relation extraction model becomes more accurate on the verification set. The policy network thus reduces the wrongly labeled noise data in the training set and improves the extraction accuracy of the relation extraction model.
In one possible implementation, the expected feedback values for the policy network are:
where $J(\Theta)$ is the expected feedback value, $E$ represents the state combination of the entity triples, $s$ represents the current state of an entity triple, $a$ represents the action executed by the policy network, and $r(s_i \mid T)$ represents the feedback value of a retained entity triple.
Fig. 7 is a schematic diagram of an electronic device provided in an embodiment of the present invention. As shown in fig. 7, the electronic apparatus 3 of this embodiment includes: a processor 30, a memory 31 and a computer program 32 stored in said memory 31 and executable on said processor 30. The processor 30, when executing the computer program 32, implements the steps in each of the above-mentioned embodiments of the method for constructing a power domain knowledge graph, such as the steps S1 to S7 shown in fig. 1. Alternatively, the processor 30, when executing the computer program 32, implements the functions of the modules/units in the device embodiments described above, such as the modules 21 to 27 shown in fig. 6.
Illustratively, the computer program 32 may be partitioned into one or more modules/units that are stored in the memory 31 and executed by the processor 30 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 32 in the electronic device 3. For example, the computer program 32 may be divided into the modules 21 to 27 shown in fig. 6.
The electronic device 3 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The electronic device 3 may include, but is not limited to, a processor 30, a memory 31. Those skilled in the art will appreciate that fig. 7 is merely an example of the electronic device 3, and does not constitute a limitation of the electronic device 3, and may include more or fewer components than those shown, or some of the components may be combined, or different components, e.g., the electronic device may also include an input-output device, a network access device, a bus, etc.
The Processor 30 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 31 may be an internal storage unit of the electronic device 3, such as a hard disk or a memory of the electronic device 3. The memory 31 may also be an external storage device of the electronic device 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the electronic device 3. The memory 31 is used for storing the computer program and other programs and data required by the electronic device. The memory 31 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other ways. The apparatus/electronic device embodiments described above are merely illustrative. For example, the division into modules or units is only a logical division, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated module/unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, it implements the steps of each of the above method embodiments for constructing a power domain knowledge graph. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.
Claims (10)
1. A construction method of a knowledge graph in the power field is characterized by comprising the following steps:
obtaining a plurality of entity triples based on a remote supervision method;
assigning, through a policy network, a state of retained or not retained to each entity triple, and generating a denoised labeled data set;
training a relation extraction model with the denoised labeled data set as the training set;
if the extraction accuracy of the trained relation extraction model is lower than a preset value, obtaining a feedback value through feedback calculation based on a preset verification set; based on the feedback value, adjusting the state combination of the entity triples to obtain the state combination that maximizes the expected feedback value of the policy network;
taking the labeled data set generated from that state combination as the denoised labeled data set and training the relation extraction model, until the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value;
and performing extraction on power domain documents through the relation extraction model to obtain a knowledge graph.
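The keep/drop loop of claim 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the policy is modeled as one independent Bernoulli keep-probability per triple, it is updated with a REINFORCE-style rule, and `evaluate` is a hypothetical stand-in for "train the relation extraction model on the kept triples and return its accuracy on the verification set". The claim does not fix any of these choices.

```python
import math
import random


def denoise_triples(triples, evaluate, threshold=0.9, lr=0.5, max_rounds=200, seed=0):
    """Search for a keep/drop state combination with high verification accuracy.

    `evaluate(kept)` is a hypothetical callback: it should train the relation
    extraction model on `kept` and return its accuracy (the feedback value).
    """
    rng = random.Random(seed)
    logits = [0.0] * len(triples)   # policy parameters, one per triple (assumption)
    baseline = 0.0                  # running mean of rewards, reduces variance
    best_kept = list(triples)       # start from the unfiltered data set
    best_acc = evaluate(best_kept)
    for _ in range(max_rounds):
        if best_acc >= threshold:   # accuracy reached the preset value: stop
            break
        probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
        actions = [1 if rng.random() < p else 0 for p in probs]  # 1 = retain
        kept = [t for t, a in zip(triples, actions) if a]
        reward = evaluate(kept)     # feedback value for this state combination
        if reward > best_acc:
            best_kept, best_acc = kept, reward
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # For a Bernoulli policy, d log pi(a) / d logit = (a - p).
        logits = [z + lr * advantage * (a - p)
                  for z, a, p in zip(logits, actions, probs)]
    return best_kept, best_acc


# Toy stand-in for training plus verification: share of non-noisy triples kept.
NOISY = {("breaker", "located_in", "transformer")}


def toy_accuracy(kept):
    if not kept:
        return 0.0
    return sum(1 for t in kept if t not in NOISY) / len(kept)
```

In a real system `evaluate` would be expensive, since each call retrains the extractor, so the number of rounds and the reward definition would need far more care than this toy suggests.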
2. The method for constructing the power domain knowledge graph according to claim 1, wherein the expected feedback values of the policy network are as follows:
wherein $J(\Theta)$ is the expected feedback value, $E$ represents the state combination of the entity triples, $s$ represents the current state of an entity triple, $a$ represents the action executed by the policy network, and $r(s_i \mid T)$ represents the feedback value of a retained entity triple.
3. The method for constructing a power domain knowledge graph according to claim 1, wherein the obtaining a plurality of entity triples based on a remote supervision method includes:
constructing a document concept map through word segmentation and LDA topic clustering based on a first sample document to obtain a plurality of document concept triples;
calculating the confidence of each document concept triple by the following formula:
$$W_j = C + \alpha \cdot \mathrm{COUNT}(D_j)$$
wherein $W_j$ represents the confidence of the document concept triple, $C$ and $\alpha$ represent preset weights, $D_j$ represents the document concept triple, and $\mathrm{COUNT}(D_j)$ represents the number of times the document concept triple appears in the first sample document;
if the confidence of a document concept triple is lower than a preset value, deleting the document concept triple to obtain a screened document concept map;
and obtaining a plurality of entity triples by a remote supervision method based on the second sample document and the screened document concept map.
4. The method for constructing a power domain knowledge graph according to claim 3, wherein the first sample document comprises a plurality of documents, and the confidence of a document concept triple is:
$$W_j = C + \sum_i \alpha_i \cdot \mathrm{COUNT}(D_{ij})$$
wherein $\alpha_i$ represents the preset weight of the document concept triple $D_j$ in the $i$-th first sample document.
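The confidence screening of claims 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions: each sample document is represented as a list of already-extracted concept triples, COUNT(D_ij) is read as the number of occurrences of triple D_j in document i, and the function names and the default threshold are hypothetical. With a single document and a single weight it reduces to the formula of claim 3.

```python
from collections import Counter


def triple_confidences(docs, alphas, c=0.0):
    """Compute W_j = C + sum_i alpha_i * COUNT(D_ij) for every candidate triple.

    docs   : list of documents, each a list of (head, relation, tail) triples
    alphas : one preset weight per document
    c      : preset constant weight C
    """
    counts = [Counter(doc) for doc in docs]
    candidates = set().union(*counts)  # every triple seen in any document
    return {
        t: c + sum(a * cnt[t] for a, cnt in zip(alphas, counts))
        for t in candidates
    }


def screen_triples(docs, alphas, c=0.0, min_confidence=1.0):
    """Delete triples whose confidence is lower than the preset value."""
    scores = triple_confidences(docs, alphas, c)
    return {t: w for t, w in scores.items() if w >= min_confidence}
```

For example, a triple that appears twice in a document weighted 0.5 and once in a document weighted 1.0, with C = 0.1, scores 0.1 + 1.0 + 1.0 = 2.1 and is retained at a threshold of 1.0.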
5. The method for constructing the power domain knowledge graph according to claim 3, wherein the extracting the power domain documents through the relationship extraction model to obtain the knowledge graph comprises:
extracting the documents in the power field through the relation extraction model to obtain document entity triples, wherein the document entity triples comprise document entities and document entity relations;
acquiring data concept triples based on fields of a power field relational database and preset relations among the fields, wherein the data concept triples comprise data concepts and data concept relations;
obtaining a data entity triple based on the attribute value of the field and the data concept triple, wherein the data entity triple comprises a data entity and a data entity relation;
calculating the concept similarity between the document concept and the data concept based on the document entity corresponding to the document concept and the data entity corresponding to the data concept;
and combining the document concept and the data concept according to the concept similarity to obtain the knowledge graph after the concept triple fusion.
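Claim 5 states that concept similarity is computed from the entities belonging to each concept, but it does not give a formula. As one possible reading, the sketch below uses Jaccard overlap between the two entity sets and maps each document concept to its best-matching data concept above a threshold. The Jaccard measure, the threshold, and all names are assumptions, not the claimed method.

```python
def concept_similarity(doc_entities, data_entities):
    """Jaccard overlap between the entity sets of two concepts (assumed measure)."""
    a, b = set(doc_entities), set(data_entities)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_concepts(doc_concepts, data_concepts, threshold=0.5):
    """Map each document concept to its most similar data concept, if similar enough.

    doc_concepts / data_concepts: dict of concept name -> set of entity names.
    Returns a dict {document concept: data concept} of pairs to merge.
    """
    mapping = {}
    for doc_name, doc_entities in doc_concepts.items():
        scored = [
            (concept_similarity(doc_entities, entities), data_name)
            for data_name, entities in data_concepts.items()
        ]
        if not scored:
            continue
        best_score, best_name = max(scored)
        if best_score >= threshold:
            mapping[doc_name] = best_name
    return mapping
```

A document concept whose entities are {T1, T2, T3} and a data concept whose entities are {T1, T2} share two of three distinct entities, so their similarity is 2/3 and they would be merged at the default threshold.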
6. The method for constructing a knowledge graph in the power domain according to claim 5, wherein after the document concept and the data concept are merged according to the concept similarity to obtain the knowledge graph after concept triple fusion, the method further comprises:
calculating the entity similarity between the document entity and the data entity by the following formula:
$$\mathrm{sim}(x, y) = \alpha \sum_i s(x_i, y_i) + \beta \sum_i s(\mathrm{Ner}(x)_i, \mathrm{Ner}(y)_i)$$
wherein $\mathrm{sim}(x, y)$ represents the entity similarity, $s$ represents the similarity of entity attributes, $x$ represents the document entity, $y$ represents the data entity, $x_i$ represents an attribute value contained by the document entity, $y_i$ represents an attribute value contained by the data entity, $\mathrm{Ner}(x)_i$ represents an associated entity of the document entity, $\mathrm{Ner}(y)_i$ represents an associated entity of the data entity, and $\alpha$ and $\beta$ represent preset weights;
and combining the document entity and the data entity according to the entity similarity to obtain the knowledge graph after the entity triple fusion.
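The entity similarity of claim 6 can be sketched as below. The claim leaves the base similarity s and the pairing index i unspecified, so this example makes two labeled assumptions: s is the string ratio from Python's `difflib.SequenceMatcher`, and values are paired by shared attribute name (for attributes) or shared relation name (for associated entities). The dictionary layout of an entity is likewise hypothetical.

```python
from difflib import SequenceMatcher


def s(u, v):
    """Base similarity of two values; a string ratio in [0, 1] (assumed choice)."""
    return SequenceMatcher(None, str(u), str(v)).ratio()


def entity_similarity(x, y, alpha=0.5, beta=0.5):
    """sim(x, y) = alpha * sum_i s(x_i, y_i) + beta * sum_i s(Ner(x)_i, Ner(y)_i).

    x, y: dicts with
      "attrs"     -> {attribute name: value}
      "neighbors" -> {relation name: associated entity name}
    Only attributes and relations present on both entities are paired.
    """
    attr_keys = x["attrs"].keys() & y["attrs"].keys()
    attr_term = sum(s(x["attrs"][k], y["attrs"][k]) for k in attr_keys)
    rel_keys = x["neighbors"].keys() & y["neighbors"].keys()
    ner_term = sum(s(x["neighbors"][k], y["neighbors"][k]) for k in rel_keys)
    return alpha * attr_term + beta * ner_term
```

Because the sums are not normalized, entities with more shared attributes can score above 1; a merge threshold would therefore have to be chosen relative to the number of compared attributes, or the sums averaged, which the claim does not address.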
7. An apparatus for constructing a knowledge graph in the power domain, comprising:
the remote supervision module is used for obtaining a plurality of entity triples based on a remote supervision method;
the first denoising module is used for assigning, through a policy network, a state of retained or not retained to each entity triple, and generating a denoised labeled data set;
the training module is used for training a relation extraction model with the denoised labeled data set as the training set;
the feedback module is used for obtaining a feedback value through feedback calculation based on a preset verification set if the extraction accuracy of the trained relation extraction model is lower than a preset value;
the state combination module is used for adjusting the state combination of the entity triples based on the feedback value to obtain the state combination that maximizes the expected feedback value of the policy network;
the second denoising module is used for taking the labeled data set generated from that state combination as the denoised labeled data set and training the relation extraction model, until the extraction accuracy of the trained relation extraction model is greater than or equal to the preset value;
and the extraction module is used for performing extraction on power domain documents through the relation extraction model to obtain the knowledge graph.
8. The apparatus for constructing a power domain knowledge graph according to claim 7, wherein the expected feedback values of the policy network are:
wherein $J(\Theta)$ is the expected feedback value, $E$ represents the state combination of the entity triples, $s$ represents the current state of an entity triple, $a$ represents the action executed by the policy network, and $r(s_i \mid T)$ represents the feedback value of a retained entity triple.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for constructing a power domain knowledge graph as claimed in any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the method for constructing a power domain knowledge graph as claimed in any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211193515.9A CN115587190A (en) | 2022-09-28 | 2022-09-28 | Construction method and device of knowledge graph in power field and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211193515.9A CN115587190A (en) | 2022-09-28 | 2022-09-28 | Construction method and device of knowledge graph in power field and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115587190A (en) | 2023-01-10 |
Family
ID=84778225
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211193515.9A Pending CN115587190A (en) | 2022-09-28 | 2022-09-28 | Construction method and device of knowledge graph in power field and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115587190A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117271802A (en) * | 2023-10-26 | 2023-12-22 | 研祥智能科技股份有限公司 | Knowledge graph construction method, knowledge graph construction device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110825882B (en) | Knowledge graph-based information system management method | |
CN109918511B (en) | BFS and LPA based knowledge graph anti-fraud feature extraction method | |
WO2021103492A1 (en) | Risk prediction method and system for business operations | |
CN111708773A (en) | Multi-source scientific and creative resource data fusion method | |
CN108304382B (en) | Quality analysis method and system based on text data mining in manufacturing process | |
CN116245107B (en) | Electric power audit text entity identification method, device, equipment and storage medium | |
CN111427974A (en) | Data quality evaluation management method and device | |
CN114840685A (en) | Emergency plan knowledge graph construction method | |
Yang et al. | User story clustering in agile development: a framework and an empirical study | |
CN116561264A (en) | Knowledge graph-based intelligent question-answering system construction method | |
CN115587190A (en) | Construction method and device of knowledge graph in power field and electronic equipment | |
CN117453805B (en) | Visual analysis method for uncertainty data | |
CN112036150A (en) | Electricity price policy term analysis method, storage medium and computer | |
CN117744769A (en) | Knowledge graph construction method and device for industrial chain data, electronic equipment and medium | |
CN114722159B (en) | Multi-source heterogeneous data processing method and system for numerical control machine tool manufacturing resources | |
CN117130938A (en) | Method and device for generating test cases based on knowledge graph | |
CN115827885A (en) | Operation and maintenance knowledge graph construction method and device and electronic equipment | |
Yang et al. | Evaluation and assessment of machine learning based user story grouping: A framework and empirical studies | |
CN116431828A (en) | Construction method of power grid center data asset knowledge graph database constructed based on neural network technology | |
Hu et al. | A classification model of power operation inspection defect texts based on graph convolutional network | |
CN115757735A (en) | Intelligent retrieval method and system for power grid digital construction result resources | |
CN116226371A (en) | Digital economic patent classification method | |
CN111242520B (en) | Feature synthesis model generation method and device and electronic equipment | |
CN114547477A (en) | Data processing method and device, electronic equipment and storage medium | |
CN112417220A (en) | Heterogeneous data integration method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||