CN116822625A - Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method - Google Patents


Publication number
CN116822625A
Authority
CN
China
Prior art keywords
entity
data
attribute
knowledge
fan
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310557369.1A
Other languages
Chinese (zh)
Inventor
满于维
卜俊文
王正海
李泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Zhuojie Power Engineering Maintenance Co ltd
Original Assignee
Guangxi Zhuojie Power Engineering Maintenance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Zhuojie Power Engineering Maintenance Co ltd
Priority to CN202310557369.1A
Publication of CN116822625A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/901: Indexing; Data structures therefor; Storage structures
    • G06F16/9024: Graphs; Linked lists
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/0442: Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/01: Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for constructing and retrieving a divergent-type associated knowledge graph for fan equipment operation and detection, and belongs to the fields of fan equipment operation and detection and of knowledge graphs. For structured data, such as data in a relational database, the method maps the structured data to the knowledge graph, thereby converting the database into a knowledge graph. For unstructured data, it mainly uses deep learning to extract knowledge from the text and web page information generated during fan operation and detection, completing entity recognition and relation extraction. The fused knowledge is stored in Neo4j; the Neo4j graph database provides visual display of the knowledge graph, and semantic queries can be run in the Cypher query language. Operation and maintenance personnel can thus query operation and maintenance knowledge and mine operation and maintenance data conveniently and quickly.

Description

Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method
Technical Field
The invention belongs to the fields of fan equipment operation and detection and of knowledge graphs, and particularly relates to a method for constructing and retrieving a divergent-type associated knowledge graph for fan equipment operation and detection.
Background
As technologies such as the Internet of Things and artificial intelligence are increasingly introduced into fan operation and detection, the new equipment and new methods involved also make the operation and detection process more complex. The fan operation and detection business therefore faces growing challenges. The process produces a large amount of multi-source heterogeneous data, the management categories are numerous and complicated, and management efficiency is low.
A knowledge graph is a semantic network that represents things and the relationships between them in a structured form, and it can make effective use of large amounts of structured, semi-structured and unstructured data. Knowledge graph construction comprises knowledge extraction, knowledge fusion, knowledge representation and so on. Knowledge graphs are divided into general knowledge graphs and domain knowledge graphs. General knowledge graphs are mainly applied in search engines; domain knowledge graphs are applied in specific fields, are more specialized, and are used in fields such as medicine, law, finance and e-commerce.
A key challenge in constructing a domain knowledge graph is that data sets for the domain are scarce while its terms and concepts are numerous. Traditional rule-based or template-based knowledge extraction requires a large number of rule templates to be built by hand, has a limited range of application, and is difficult to adapt to complex requirements.
Therefore, a method for constructing and retrieving a divergent-type associated knowledge graph for fan equipment operation and detection needs to be designed to solve the above problems.
Disclosure of Invention
The invention aims to provide a method for constructing and retrieving a divergent-type associated knowledge graph for fan equipment operation and detection, so as to solve the above technical problems in the prior art: it constructs a knowledge graph, reduces manual labor, acquires knowledge automatically from raw data, and stores the knowledge for visualization in a Neo4j graph database.
To achieve the above purpose, the technical scheme of the invention is as follows:
A divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method comprises the following steps:
S1: obtaining raw data such as documents, forms and news through a data acquisition module, and preprocessing the raw data to obtain preprocessed data comprising structured data and semi-structured/unstructured data; the semi-structured/unstructured data are then structured by entity extraction with a BERT-BiLSTM-CRF model, together with relation extraction and attribute extraction.
S2: performing entity recognition and entity disambiguation on the unstructured data, and determining the span of each entity in a sentence by labeling the sentence. Using a similarity comparison method based on named-entity attribute relationships, the named entities common to each group of multi-source data and their selected attributes are stored in a table, a different weight is set for each attribute according to conditions, and the similarity of entities is judged by calculating the weighted value of all attributes.
S3: performing knowledge reasoning with a Path-RNN model: using a path reasoning method, the paths between target entities are converted into inputs of an RNN, on which knowledge reasoning is carried out.
S4: constructing the entity part of the fan fault knowledge graph: terms are recognized by combining the TextRank and TF-IDF techniques, and concept entities are created. The operation terms, accident handling terms, operation terms and fault terms are created from the keywords extracted by the two algorithms through correction, fusion, screening and classification. Screening uses the source material to complete the terms, explains each term in detail, and adds the relevant scheduling and safety rules by searching and matching.
S5: storing, displaying and querying the knowledge graph: the various entities are structured according to the entity framework, and Neo4j-web and Neo4j-import in Neo4j are used flexibly to control and import the standard structured data obtained after data cleaning, such as the defects, defect causes, equipment and parts of each fan. Semantic queries are run in the Cypher query language, which connects the application with the graph database and lets them interact; on the basis of the graph database, the various semantic types, relations, node objects and relation objects can be queried, displayed and modified.
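As an illustrative, non-limiting sketch of the keyword extraction mentioned in step S4, the TF-IDF half of the TextRank/TF-IDF combination can be written as follows. The function name, the tokenized input format and the example tokens are assumptions made for illustration; TextRank and the later correction, fusion, screening and classification are not shown.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Rank tokens of each document by TF-IDF.

    docs is a list of token lists (tokenization is assumed to be done
    beforehand). Returns, per document, the top_k highest-scoring tokens.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    result = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in term_freq.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        result.append(ranked[:top_k])
    return result
```

Tokens that occur in every document receive a score of zero, so generic words fall to the bottom of each ranking.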
The step S1 specifically includes: using web crawler technology, documents published by power generation companies or equipment manufacturers, and tables produced in the fan operation and detection process, are lawfully obtained and downloaded; the data contained in the Word, Excel and PDF files are then read with the open-source modules python-docx, xlrd and pdfminer respectively, according to the file format.
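By way of a hedged example, the format-dependent reading could be organized as below. The helper names `reader_for` and `read_text` are hypothetical. The library calls assume python-docx, xlrd and pdfminer.six are installed; note that xlrd 2.x reads legacy `.xls` files only.

```python
from pathlib import Path

# File extension -> reader library named in the text.
READERS = {".docx": "python-docx", ".xls": "xlrd", ".pdf": "pdfminer"}


def reader_for(path):
    """Return the name of the library that should read the given file."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file format: {suffix!r}")
    return READERS[suffix]


def read_text(path):
    """Extract plain text from a Word, Excel or PDF file.

    Third-party imports are deferred so that only the library needed
    for the given file has to be installed.
    """
    reader = reader_for(path)
    if reader == "python-docx":
        import docx  # provided by the python-docx package
        return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)
    if reader == "xlrd":
        import xlrd
        book = xlrd.open_workbook(str(path))
        rows = []
        for sheet in book.sheets():
            for i in range(sheet.nrows):
                rows.append("\t".join(str(v) for v in sheet.row_values(i)))
        return "\n".join(rows)
    from pdfminer.high_level import extract_text  # pdfminer.six
    return extract_text(str(path))
```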
Then, the raw data obtained from the text are transformed and encoded into a vector form suitable for computer processing. The invention uses a skip-gram model to optimize the word vector matrix L and learn an accurate word vector representation for each word. Given an arbitrary n-gram $(w, C) = w_{n-C} \dots w_{n-1} w_n w_{n+1} \dots w_{n+C}$, the model uses the word vector $e(w_n)$ to predict the $t$-th word $w_t$ in the context, with probability:
In the above, $w_n$ represents the center word; $e(w_n) \in R^d$ represents the $d$-dimensional word vector of $w_n$, obtained by lookup in the vector matrix L; $C$ is the window size of the context. The objective function of the model is as follows:
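The two formulas referred to above are not reproduced in this text. Under the standard skip-gram formulation they would take the following form, given here only as an assumed reconstruction. The output vector $e'(\cdot)$, the vocabulary $V$ and the corpus length $N$ are symbols introduced for illustration and do not appear in the original.

```latex
P(w_t \mid w_n) =
  \frac{\exp\!\left(e'(w_t)^{\top} e(w_n)\right)}
       {\sum_{w' \in V} \exp\!\left(e'(w')^{\top} e(w_n)\right)},
\qquad
J = \frac{1}{N} \sum_{n=1}^{N} \;
    \sum_{\substack{-C \le j \le C \\ j \ne 0}} \log P(w_{n+j} \mid w_n)
```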
After model training is completed, an optimized word vector matrix is obtained, containing the distributed vector representations of all words in the vocabulary.
Knowledge extraction for text data: this patent uses a bidirectional long short-term memory neural network (Bidirectional Long Short-Term Memory, BiLSTM) combined with a conditional random field (Conditional Random Field, CRF) to recognize named entities.
Given a lexical sequence $x = x_0 x_1 \dots x_n$, the word vector $e_n \in R^{d_1}$ corresponding to each word is looked up in the trained word vector table, where $d_1$ is the dimension of the vector. An LSTM is controlled by a memory cell and three gates. Its inputs are the hidden-layer representation $h_{n-1}$ and the word $w_{n-1}$ of the previous time step, and its output is the hidden-layer representation $h_n$ of the current time step. It is calculated as follows:
$i_n = \sigma(W_i e(w_{n-1}) + U_i h_{n-1} + V_i c_{n-1} + b_i)$
$f_n = \sigma(W_f e(w_{n-1}) + U_f h_{n-1} + V_f c_{n-1} + b_f)$
$o_n = \sigma(W_o e(w_{n-1}) + U_o h_{n-1} + V_o c_{n-1} + b_o)$
$h_n = o_n \odot \tanh(c_n)$
where $i_n$, $f_n$ and $o_n$ represent the input, forget and output gates respectively; $c_n$ represents the memory cell; $W$, $U$, $V$ and $b_i$, $b_f$, $b_o$ are the coefficients and offsets of the linear relationships; $\sigma(\cdot)$ is the activation function; and $\odot$ denotes the element-wise product.
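For illustration only, one time step of the gate equations above may be sketched in plain Python. Two points are assumptions and not taken from the text: the memory-cell update, which the text does not give and which is filled in here with the standard rule $c_n = f_n \odot c_{n-1} + i_n \odot \tanh(W_c x + U_c h_{n-1} + b_c)$, and the treatment of each $V$ as a diagonal matrix stored as a list. The parameter layout `p` is likewise hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(row, vec):
    return sum(a * b for a, b in zip(row, vec))


def gate(params, x, h_prev, c_prev):
    """sigma(W x + U h_prev + V c_prev + b), with V diagonal (a list)."""
    W, U, v, b = params
    return [
        sigmoid(dot(W[k], x) + dot(U[k], h_prev) + v[k] * c_prev[k] + b[k])
        for k in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; x is the input word vector, p holds the parameters."""
    i = gate(p["i"], x, h_prev, c_prev)  # input gate
    f = gate(p["f"], x, h_prev, c_prev)  # forget gate
    o = gate(p["o"], x, h_prev, c_prev)  # output gate
    # Assumed cell update (not stated in the text): standard LSTM rule.
    Wc, Uc, bc = p["c"]
    size = len(bc)
    cand = [math.tanh(dot(Wc[k], x) + dot(Uc[k], h_prev) + bc[k]) for k in range(size)]
    c = [f[k] * c_prev[k] + i[k] * cand[k] for k in range(size)]
    h = [o[k] * math.tanh(c[k]) for k in range(size)]  # h_n = o_n * tanh(c_n)
    return h, c
```

A bidirectional layer would run this step once left to right and once right to left over the sequence and concatenate the two hidden vectors at each position.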
The forward LSTM yields a hidden-layer representation for each character, written here as $\overrightarrow{h_i}$.
Similarly, the backward LSTM yields the hidden-layer representation written here as $\overleftarrow{h_i}$.
The forward hidden layer captures $e(i)$ together with the information to its left, $e(0)$ to $e(i-1)$; the backward hidden layer captures $e(i)$ together with the information to its right, $e(i+1)$ to $e(T)$. The BiLSTM concatenates the forward and backward hidden layers, and the conditional probability $P(y \mid x)$ is finally modeled by the following formula:
In the above formula, $\rho_k$ are the parameters, and $f_k(y_{i+1}, y_i, M, i)$ is a transition feature function defined on adjacent positions of the sequence $M$. Decoding with the model gives the following result:
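The decoding step is commonly realized with the Viterbi algorithm. The following is a minimal sketch under that assumption; the text does not name the decoder. The per-position label scores (`emissions`, for example produced by the BiLSTM) and label-to-label scores (`transitions`) are assumed to be given, and the label names used with it are hypothetical.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y]      score of label y at position t
    transitions[yp][y]   score of moving from label yp to label y
    """
    n_labels = len(labels)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n_labels):
            best_prev = max(
                range(n_labels), key=lambda yp: score[yp] + transitions[yp][y]
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + transitions[best_prev][y] + emissions[t][y]
            )
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    y = max(range(n_labels), key=lambda j: score[j])
    path = [y]
    for pointers in reversed(backpointers):
        y = pointers[y]
        path.append(y)
    path.reverse()
    return [labels[j] for j in path]
```

A strongly negative transition score, for instance from an outside label directly to an inside label, lets the decoder rule out label sequences that would not form a valid entity span.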
The step S2 specifically includes: dependency parsing (Dependency Parsing, DP) analyzes the head-dependent relationships between the words of a sentence and presents the structure of the whole sentence; that is, it analyzes the grammatical components contained in the sentence, such as subject, predicate and object, and summarizes the relationships between them. This patent uses the maximum spanning tree (MST) approach for dependency parsing. An acyclic directed graph of the input sentence is constructed, comprising a node set of the corresponding words and a set of directed edges. In the dependency structure graph, each edge links a governing word (head) to its dependent, and the root is the core predicate node of the sentence, which does not depend on any other word. Two nodes may be linked by several directed edges that have the same direction but different dependency types. The MST approach turns the search for the best dependency structure into finding the highest-scoring dependency tree in the directed graph.
Assume that the parse of sentence $a$ is $b$ and that the model parameters are $\varepsilon$. During training, the conditional probability model $SC(a_i \mid b_i; \varepsilon)$ is fitted by finding the value of $\varepsilon$ that maximizes the model over the samples $i = 1$ to $N$.
MSTParser defines the score of the entire syntactic tree as a weighted sum of the arc scores in the tree:
In the above formula: $s$ is the score; $b$ is one of the dependency trees of sentence $a$; $w$ is the weight vector of the feature function $f(\cdot)$.
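The score formula itself is not reproduced in this text. In the arc-factored form usually associated with MSTParser it would read as follows, given here as an assumed reconstruction using the symbols defined above. The arc notation $(i, j)$, for an edge from word $i$ to word $j$, and the candidate set $B(a)$ are introduced for illustration.

```latex
s(a, b) = \sum_{(i,j) \in b} s(i, j) = \sum_{(i,j) \in b} w \cdot f(i, j),
\qquad
b^{*} = \arg\max_{b \in B(a)} s(a, b)
```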
The step S3 specifically includes: the Path-RNN model decomposes each path into a sequence of relations and feeds it into the RNN to build a vector representation of the path; the relevance of the path to a candidate relation is then calculated as a dot product with the path vector representation. The first step is to convert all input entities and relations into vectors through an embedding matrix, in the same way as in step S1. The Path Ranking Algorithm (PRA) is first used to obtain, for a training instance $(e_s, r, e_t)$, the relation paths most relevant to the relation $r$. PRA performs random walks over paths for a given triple. All connecting relations from the head entity to the tail entity are recorded, giving several relation paths $\{r_1, r_2, \dots, r_n\}$; adding the intermediate entities gives the random path
$K = [e_s, r_1, e_1, \dots, e_t]$, which fully expands the path. The model is shown in Fig. 1.
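As a hedged illustration of how paths of the form $K = [e_s, r_1, e_1, \dots, e_t]$ can be collected, the sketch below enumerates all paths between a head and a tail entity up to a maximum number of relations. It is a deterministic stand-in for the random walks of PRA, not PRA itself; the function name, the `max_len` bound and the example triples are assumptions.

```python
from collections import defaultdict


def relation_paths(triples, head, tail, max_len=3):
    """Enumerate paths [head, r1, e1, ..., tail] with at most max_len relations.

    triples is an iterable of (subject, relation, object) tuples.
    """
    out_edges = defaultdict(list)
    for subj, rel, obj in triples:
        out_edges[subj].append((rel, obj))

    paths = []

    def walk(path, visited):
        node = path[-1]
        if node == tail and len(path) > 1:
            paths.append(path)
            return
        if (len(path) - 1) // 2 >= max_len:  # relation budget used up
            return
        for rel, nxt in out_edges[node]:
            if nxt not in visited:  # never revisit an entity
                walk(path + [rel, nxt], visited | {nxt})

    walk([head], {head})
    return paths
```

Dropping the entities at the even positions of each returned path gives the pure relation sequence $\{r_1, r_2, \dots, r_n\}$ that is fed to the RNN.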
When the search space of the path representation is large, combining all paths does not provide enough evidence to infer the relationships between entities. To narrow the search, the model is therefore extended to perform multi-step reasoning over the path distribution. Multi-step reasoning means applying the attention mechanism several times to the path vectors obtained from the BiLSTM, each time applying attention again to the result of the previous application, so as to improve the accuracy of the reasoning result. Each reasoning step generates a new relation embedding vector $u$ that represents the evidence of the inference:
$u_{z+1} = W_o (o_z + u_z)$
After the paths are acquired, entity disambiguation is performed; the flow is shown in Fig. 2. Using a similarity comparison method based on named-entity attribute relationships, the named entities common to each group of multi-source data and their selected attributes are stored in a table, a different weight is set for each attribute according to conditions, and the similarity of entities is judged by calculating the weighted value of all attributes. The semantic similarity of two entities is calculated by taking the name, relation and numerical attributes of the entities in the fan knowledge base as the features to be analyzed. The calculation is as follows:
where: $A_0$, $B_0$ are the entity names of entity A and entity B; $A_i$, $B_i$ are the numerical attribute values of entity A and entity B; $A_j$, $B_j$ are the object attribute values of entity A and entity B; $\mathrm{Sim}(A, B)$ is the semantic similarity of two attribute values; $\alpha + \beta + \gamma = 1$, where $\alpha$, $\beta$ and $\gamma$ are the weights of the entity name similarity, the numerical attribute value similarity and the object attribute value similarity respectively. For a numerical attribute, the similarity is calculated with the following formula:
For a set-valued attribute, the similarity is calculated with the following formula:
For a text attribute, the similarity is calculated with the following formula:
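The weighted combination described above can be sketched as follows. The similarity formulas themselves are not reproduced in this text, so the three per-attribute measures used here (relative difference for numbers, Jaccard overlap for sets, `difflib` ratio for text) are stand-ins chosen for illustration. The default weights and the entity dictionary layout are likewise assumptions.

```python
from difflib import SequenceMatcher


def text_sim(a, b):
    """Similarity of two strings in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def numeric_sim(a, b):
    """1 minus the relative difference, clamped to [0, 1] (assumed measure)."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def set_sim(a, b):
    """Jaccard overlap of two collections (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def mean(values):
    return sum(values) / len(values) if values else 0.0


def entity_sim(ea, eb, alpha=0.5, beta=0.25, gamma=0.25):
    """alpha * name + beta * numeric + gamma * object similarity."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    name = text_sim(ea["name"], eb["name"])
    num_keys = sorted(ea["numeric"].keys() & eb["numeric"].keys())
    obj_keys = sorted(ea["object"].keys() & eb["object"].keys())
    numeric = mean([numeric_sim(ea["numeric"][k], eb["numeric"][k]) for k in num_keys])
    obj = mean([set_sim(ea["object"][k], eb["object"][k]) for k in obj_keys])
    return alpha * name + beta * numeric + gamma * obj
```

Two records whose score exceeds a chosen threshold would then be treated as the same entity and merged during knowledge fusion; the threshold is a tuning parameter that the text does not specify.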
The step S4 specifically includes: a central conceptual model of the fan operation and detection field is created, and an ontology framework is built on it, as shown in Fig. 3. Constructing the ontology of the fan equipment fault knowledge graph is a key task in the whole process. Constructing the fan equipment ontology includes defining the concepts, hierarchies, categories and concept attribute relationships. The ontology concepts are classified mainly by classifying and defining the fault types of the equipment; according to their internal elements, they can be divided into the following classes: equipment, parts, fault causes, suggestions and measures. Defining the concept attribute relationships refines the ontology and yields a well-structured classification hierarchy: each fault class consists of equipment, parts, fault causes, suggestions and measures, and can be abstracted into a description in the form of entities and entity states. The nodes in the wind farm topology are thus precisely defined and used as entities in the knowledge graph, the switches and lines are represented by relationships in the knowledge graph, and the information of the nodes is stored in the knowledge graph as attributes. The fan equipment information is converted into structured triple data.
Production management knowledge covers the departments involved in dispatching fan faults, the business-process relationships between the handling departments when a fan fault occurs, and the responsible person of each department.
The departments include: department [name, task, location, responsible person, telephone].
The personnel include: personnel [name, department, age, position, professional skills, telephone].
Construction of the fan fault handling event part:
The fan faults include: fault [name, alias, cause, attribute, manifestation, processing method, expert experience, corresponding responsible person].
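To illustrate how a record following one of the above schemas might be turned into structured triple data, a minimal sketch is given below. The function name, the use of a `"type"` relation and the example field values are assumptions; the attribute names follow the fault schema listed above.

```python
def record_to_triples(entity_type, record, key="name"):
    """Convert one schema record into (subject, relation, object) triples.

    The value stored under `key` becomes the subject. List-valued
    attributes yield one triple per element; empty values are skipped.
    """
    subject = record[key]
    triples = [(subject, "type", entity_type)]
    for attr, value in record.items():
        if attr == key or value in (None, "", []):
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        triples.extend((subject, attr, v) for v in values)
    return triples
```

Each resulting triple maps to a node-relationship-node pattern or to a node property when imported into the graph database.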
The step S5 specifically includes: the various entities are structured according to the entity framework, and Neo4j-web and Neo4j-import in Neo4j are used flexibly to control and import the standard structured data obtained after data cleaning, such as the defects, defect causes, equipment and parts of each fan.
Semantic queries are run in the Cypher query language, which connects the application with the graph database and lets them interact; on the basis of the graph database, the various semantic types, relations, node objects and relation objects can be queried, displayed and modified. An operator enters the query content, which is converted through semantic understanding and question template matching into Cypher that the computer can execute. Cypher then retrieves the entity relations directly from the knowledge base, and the results are displayed as entity-relation-attribute triples using the Data-Driven Documents (D3.js) technology.
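The question template matching can be pictured with the following non-limiting sketch, which maps an operator's question to a parameterized Cypher statement. The templates, node labels (`Fault`, `Cause`, `Equipment`, `Part`) and relationship types (`HAS_CAUSE`, `HAS_PART`) are hypothetical and are not taken from the patent.

```python
import re

# Hypothetical question templates: (pattern, parameterized Cypher query).
TEMPLATES = [
    (
        re.compile(r"what causes (?P<fault>.+?)\??$", re.IGNORECASE),
        "MATCH (f:Fault {name: $fault})-[:HAS_CAUSE]->(c:Cause) "
        "RETURN f.name, c.name",
    ),
    (
        re.compile(r"which parts does (?P<equipment>.+?) have\??$", re.IGNORECASE),
        "MATCH (e:Equipment {name: $equipment})-[:HAS_PART]->(p:Part) "
        "RETURN e.name, p.name",
    ),
]


def question_to_cypher(question):
    """Return (cypher, parameters) for the first matching template, else None."""
    text = question.strip()
    for pattern, cypher in TEMPLATES:
        match = pattern.match(text)
        if match:
            return cypher, match.groupdict()
    return None
```

Passing the extracted values as query parameters, rather than pasting them into the query string, keeps operator input from altering the structure of the Cypher statement. With the official Neo4j Python driver, the returned pair could be executed as `session.run(cypher, params)`.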
Compared with the prior art, the invention has the following beneficial effects:
One beneficial effect of the scheme is that the invention provides a method for constructing a divergent-type associated knowledge graph for fan equipment operation and detection. For structured data, such as data in a relational database, the method maps the structured data to the knowledge graph, thereby converting the database into a knowledge graph. For unstructured data, it mainly uses deep learning to extract knowledge from the text and web page information generated during fan operation and detection, completing entity recognition and relation extraction. The fused knowledge is stored in Neo4j; the Neo4j graph database provides visual display of the knowledge graph, and semantic queries can be run in the Cypher query language. Operation and maintenance personnel can thus query operation and maintenance knowledge and mine operation and maintenance data conveniently and quickly.
Drawings
FIG. 1 is a schematic diagram of knowledge reasoning using the PATH-RNN model.
Fig. 2 is a schematic diagram of a defective entity matching process.
Fig. 3 is a diagram of a knowledge graph ontology framework relationship.
FIG. 4 is a general framework diagram of a knowledge graph construction method using BiLSTM-CRF in combination with BERT model.
Fig. 5 is a schematic diagram of a Neo4j graph database presentation.
Detailed Description
For the purpose of making the technical solution and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
A divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method comprises the following steps:
S1: obtaining raw data such as documents, forms and news through a data acquisition module, and preprocessing the raw data to obtain preprocessed data comprising structured data and semi-structured/unstructured data; the semi-structured/unstructured data are then structured by entity extraction with a BERT-BiLSTM-CRF model, together with relation extraction and attribute extraction.
S2: performing entity recognition and entity disambiguation on the unstructured data, and determining the span of each entity in a sentence by labeling the sentence. Using a similarity comparison method based on named-entity attribute relationships, the named entities common to each group of multi-source data and their selected attributes are stored in a table, a different weight is set for each attribute according to conditions, and the similarity of entities is judged by calculating the weighted value of all attributes.
S3: performing knowledge reasoning with a Path-RNN model: using a path reasoning method, the paths between target entities are converted into inputs of an RNN, on which knowledge reasoning is carried out.
S4: constructing the entity part of the fan fault knowledge graph: terms are recognized by combining the TextRank and TF-IDF techniques, and concept entities are created. The operation terms, accident handling terms, operation terms and fault terms are created from the keywords extracted by the two algorithms through correction, fusion, screening and classification. Screening uses the source material to complete the terms, explains each term in detail, and adds the relevant scheduling and safety rules by searching and matching.
S5: storing, displaying and querying the knowledge graph: the various entities are structured according to the entity framework, and Neo4j-web and Neo4j-import in Neo4j are used flexibly to control and import the standard structured data obtained after data cleaning, such as the defects, defect causes, equipment and parts of each fan. Semantic queries are run in the Cypher query language, which connects the application with the graph database and lets them interact; on the basis of the graph database, the various semantic types, relations, node objects and relation objects can be queried, displayed and modified.
The step S1 specifically includes: using web crawler technology, documents published by power generation companies or equipment manufacturers, and tables produced in the fan operation and detection process, are lawfully obtained and downloaded; the data contained in the Word, Excel and PDF files are then read with the open-source modules python-docx, xlrd and pdfminer respectively, according to the file format.
Then, the raw data obtained from the text are transformed and encoded into a vector form suitable for computer processing. The invention uses a skip-gram model to optimize the word vector matrix L and learn an accurate word vector representation for each word. Given an arbitrary n-gram $(w, C) = w_{n-C} \dots w_{n-1} w_n w_{n+1} \dots w_{n+C}$, the model uses the word vector $e(w_n)$ to predict the $t$-th word $w_t$ in the context, with probability:
In the above, $w_n$ represents the center word; $e(w_n) \in R^d$ represents the $d$-dimensional word vector of $w_n$, obtained by lookup in the vector matrix L; $C$ is the window size of the context. The objective function of the model is as follows:
After model training is completed, an optimized word vector matrix is obtained, containing the distributed vector representations of all words in the vocabulary.
Knowledge extraction for text data: this patent uses a bidirectional long short-term memory neural network (Bidirectional Long Short-Term Memory, BiLSTM) combined with a conditional random field (Conditional Random Field, CRF) to recognize named entities.
Given a lexical sequence $x = x_0 x_1 \dots x_n$, the word vector $e_n \in R^{d_1}$ corresponding to each word is looked up in the trained word vector table, where $d_1$ is the dimension of the vector. An LSTM is controlled by a memory cell and three gates. Its inputs are the hidden-layer representation $h_{n-1}$ and the word $w_{n-1}$ of the previous time step, and its output is the hidden-layer representation $h_n$ of the current time step. It is calculated as follows:
$i_n = \sigma(W_i e(w_{n-1}) + U_i h_{n-1} + V_i c_{n-1} + b_i)$
$f_n = \sigma(W_f e(w_{n-1}) + U_f h_{n-1} + V_f c_{n-1} + b_f)$
$o_n = \sigma(W_o e(w_{n-1}) + U_o h_{n-1} + V_o c_{n-1} + b_o)$
$h_n = o_n \odot \tanh(c_n)$
where $i_n$, $f_n$ and $o_n$ represent the input, forget and output gates respectively; $c_n$ represents the memory cell; $W$, $U$, $V$ and $b_i$, $b_f$, $b_o$ are the coefficients and offsets of the linear relationships; $\sigma(\cdot)$ is the activation function; and $\odot$ denotes the element-wise product.
The forward LSTM yields a hidden-layer representation for each character, written here as $\overrightarrow{h_i}$.
Similarly, the backward LSTM yields the hidden-layer representation written here as $\overleftarrow{h_i}$.
The forward hidden layer captures $e(i)$ together with the information to its left, $e(0)$ to $e(i-1)$; the backward hidden layer captures $e(i)$ together with the information to its right, $e(i+1)$ to $e(T)$. The BiLSTM concatenates the forward and backward hidden layers, and the conditional probability $P(y \mid x)$ is finally modeled by the following formula:
In the above formula, $\rho_k$ are the parameters, and $f_k(y_{i+1}, y_i, M, i)$ is a transition feature function defined on adjacent positions of the sequence $M$. Decoding with the model gives the following result:
the step S2 specifically includes: the subject and subject relationships between the plurality of words in the sentence are analyzed by the dependency grammar (Dependency Parsing, DP), and the structure of the entire sentence is presented, that is, the relationships between the components are summarized by analyzing the grammatical components such as subject, predicate, object, subject, object, and the like contained in the sentence. This patent uses MST for syntactic analysis of dependencies. And constructing an acyclic directed graph of the input sentence, wherein the acyclic directed graph comprises a node set and a directed edge set of corresponding words. The head of the dominant in the dependency structure diagram is the dependency of the dominant, and the dominant is the core predicate node in the sentence without depending on other words. There may be directional edges between two nodes that are in the same direction but have different dependencies. The MST converts the get best dependency structure into finding the highest scoring dependency tree in the directed graph.
Assume that the analysis result of sentence a is b and that the model parameter is ε. For the conditional probability model SC(a_i | b_i; ε), training searches for the value of ε that maximizes the model over the training samples i = 1 to N.
MSTParser defines the score of the entire syntactic tree as the weighted sum of the scores of the arcs in the tree:
In the above formula, s is the score and b is one of the dependency trees of sentence a; w is the weight vector of the feature f(·).
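The arc-factored scoring described here can be sketched as follows, assuming the usual reading in which each arc (head, dependent) has a feature vector f(i, j) and the tree score is the sum of w · f(i, j) over its arcs. The function names and the toy features are hypothetical; a real MST parser would search the graph with a spanning-tree algorithm such as Chu-Liu/Edmonds rather than compare a handful of candidate trees.

```python
def arc_score(w, f):
    """Score of one arc: dot product of weight vector w and feature vector f."""
    return sum(wk * fk for wk, fk in zip(w, f))

def tree_score(w, tree, features):
    """Score of a dependency tree given as an iterable of (head, dependent) arcs."""
    return sum(arc_score(w, features[arc]) for arc in tree)

def best_tree(w, candidate_trees, features):
    """Pick the highest-scoring tree among explicitly listed candidates."""
    return max(candidate_trees, key=lambda t: tree_score(w, t, features))

# Toy example: token 0 is an artificial ROOT, tokens 1 and 2 are words.
w = [1.0, 0.5]
features = {(0, 2): [1, 0], (2, 1): [1, 1], (0, 1): [0, 1], (1, 2): [0, 0]}
tree_a = [(0, 2), (2, 1)]  # ROOT -> word 2 -> word 1
tree_b = [(0, 1), (1, 2)]  # ROOT -> word 1 -> word 2
```

With these toy values, `tree_a` scores 2.5 and `tree_b` scores 0.5, so `best_tree` returns `tree_a`.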
The step S3 specifically includes: the Path-RNN model decomposes each path into a sequence of relations and feeds it into the RNN to construct a vector representation of the path; the correlation between the path and a candidate relation is then calculated by a dot product with the path vector representation. The first step is to convert all input entities and relations into vectors through an embedding matrix, in the same way as in step S1. The Path Ranking Algorithm (PRA) is first used to obtain, for a training instance (es, r, et), the relation paths most relevant to the relation r. A PRA random walk over paths is performed for the given triple: all connecting relations from the head entity to the tail entity are recorded, yielding several relation paths {r1, r2, ..., rn}; adding the intermediate entities gives a random path K = [es, r1, e1, ..., et], so that the path is fully expanded. The model is shown in fig. 1.
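The expanded path form K = [es, r1, e1, ..., et] can be illustrated with a small sketch. Note that it swaps PRA's random walks for an exhaustive, length-bounded depth-first enumeration so that the output is deterministic; the triples, entity names, and relation names are hypothetical.

```python
def relation_paths(triples, es, et, max_len=3):
    """Enumerate paths [es, r1, e1, ..., et] with at most max_len relations."""
    out_edges = {}
    for head, rel, tail in triples:
        out_edges.setdefault(head, []).append((rel, tail))

    paths = []

    def walk(path, visited):
        node = path[-1]
        if node == et and len(path) > 1:
            paths.append(path)
            return
        if (len(path) - 1) // 2 >= max_len:  # number of relations used so far
            return
        for rel, tail in out_edges.get(node, []):
            if tail not in visited:  # avoid revisiting entities (no cycles)
                walk(path + [rel, tail], visited | {tail})

    walk([es], {es})
    return paths

# Hypothetical fan-equipment triples.
triples = [
    ("fan_01", "has_part", "gearbox"),
    ("gearbox", "has_fault", "oil_leak"),
    ("fan_01", "has_fault", "oil_leak"),
    ("oil_leak", "handled_by", "maintenance_dept"),
]
paths = relation_paths(triples, "fan_01", "oil_leak")
```

Dropping the entities from each path (keeping every second element starting at index 1) gives the relation sequence that would be fed to the RNN.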
When the search space of path representations is large, combining all paths does not provide enough evidence to infer the relationships between entities. To narrow the search, the model is therefore extended to perform multi-step reasoning over the path distribution. Multi-step reasoning applies the attention mechanism repeatedly to the path vectors obtained from the BiLSTM: each application takes the result of the previous one as input, which improves the accuracy of the reasoning result. Each reasoning step generates a new relation embedding vector u that represents the reasoning evidence.
$$u_{z+1} = W_o (o_z + u_z)$$
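A minimal sketch of this multi-step attention, reading the update as u_{z+1} = W_o(o_z + u_z). The attention itself is not specified in the text, so the sketch assumes dot-product scores followed by a softmax, with o_z the attention-weighted sum of the path vectors; all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def attention_step(path_vectors, u, W_o):
    """One reasoning step: attend over the paths, then update u."""
    # ASSUMPTION: dot-product attention; the source does not give the scoring.
    weights = softmax([dot(p, u) for p in path_vectors])
    o = [sum(w * p[d] for w, p in zip(weights, path_vectors))
         for d in range(len(u))]
    u_next = matvec(W_o, [ok + uk for ok, uk in zip(o, u)])
    return u_next, weights

def multi_step_reasoning(path_vectors, u0, W_o, steps=2):
    """Apply the attention step repeatedly, feeding each result into the next."""
    u, weights = u0, None
    for _ in range(steps):
        u, weights = attention_step(path_vectors, u, W_o)
    return u, weights
```

Each call to `attention_step` plays the role of one inference that produces a new relation embedding vector u.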
After the paths are acquired, entity disambiguation is performed; the flow is shown in fig. 2. Using a similarity comparison method based on entity names, attributes, and relations, the common named entities and selected attributes of each group of multi-source data are stored in a table, different weights are set for the attributes according to the conditions, and the similarity of two entities is judged from the weighted value of all attributes. The name, relation, and numerical attributes of the entities in the fan knowledge base are taken as the feature quantities for calculating the semantic similarity of two entities. The calculation is as follows:
where $A_0$ and $B_0$ are the entity names of entity A and entity B; $A_i$ and $B_i$ are their numerical attribute values; $A_j$ and $B_j$ are their object attribute values; Sim(A, B) is the semantic similarity of two attribute values; and α + β + γ = 1, where α, β, and γ are the weights of the entity-name similarity, the numerical-attribute similarity, and the object-attribute similarity, respectively. For entities with numerical attributes, the similarity is calculated with the following formula:
For entities with set-valued (aggregate) attributes, the following formula is used:
For entities with text attributes, the following formula is used:
The step S4 specifically includes: a central conceptual model of the fan operation and detection field is created, and an ontology framework is built on the basis of it, as shown in fig. 3. Constructing the ontology of the fan equipment fault knowledge graph is a key task in the whole flow. It includes defining the concepts, hierarchy, and categories, and defining the attribute relationships between concepts. Classifying the ontology concepts mainly means classifying and defining the fault types of the equipment; according to their internal elements, they fall into the following classes: equipment, parts, fault causes, suggestions, and measures. Defining the concept attribute relationships refines the ontology and yields a well-structured classification hierarchy: each fault class consists of equipment, parts, fault causes, suggestions, and measures, and can be abstracted into a description in the form of entities and entity states. With these definitions in place, the nodes of the wind farm topology serve as entities in the knowledge graph, switches and circuits are represented as relationships, and node information is stored as attributes. The fan equipment information is thereby converted into structured triple data.
Production management knowledge construction covers the departments involved in fan fault dispatching, the business-process relationships among the handling departments when a fan fault occurs, and the responsible person of each department.
The department schema is: Department [name, task, location, responsible person, telephone].
The personnel schema is: Personnel [name, department, age, position, professional skills, telephone].
Construction of the fan fault handling event part:
The fan fault schema is: Fault [name, alias, cause, attribute, manifestation, handling method, expert experience, corresponding person].
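As an illustration of how such schemas could be turned into structured triples, the sketch below models two of them as dataclasses and emits (entity, attribute, value) triples, skipping empty fields. The class layout, field names, and sample values are assumptions made for the example, not part of the patent.

```python
from dataclasses import dataclass, asdict

@dataclass
class Department:
    name: str
    task: str = ""
    location: str = ""
    responsible_person: str = ""
    phone: str = ""

@dataclass
class Fault:
    name: str
    alias: str = ""
    cause: str = ""
    attribute: str = ""
    manifestation: str = ""
    handling_method: str = ""
    expert_experience: str = ""
    corresponding_person: str = ""

def to_triples(entity):
    """Turn one record into (entity name, attribute, value) triples."""
    data = asdict(entity)
    subject = data.pop("name")
    return [(subject, key, value) for key, value in data.items() if value]

# Hypothetical records.
dept = Department(name="Maintenance Dept", task="fault repair",
                  responsible_person="person_A")
fault = Fault(name="gearbox oil leak", cause="seal wear",
              handling_method="replace seal")
```

The Personnel schema would follow the same pattern with its own fields.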
The step S5 specifically includes: the various entities are structured according to the entity framework; Neo4j-web and Neo4j-import in Neo4j are used to manage and import the standardized structured data obtained after data cleaning, such as the defects, defect causes, equipment, and parts of each fan.
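A minimal sketch of preparing cleaned data for bulk import, assuming the CSV header conventions of Neo4j's bulk import tooling (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). The column names, labels, and sample rows are hypothetical, and the exact import command depends on the Neo4j version in use.

```python
import csv
import io

def nodes_csv(nodes):
    """nodes: iterable of (node_id, name, label) tuples."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id:ID", "name", ":LABEL"])
    for node_id, name, label in nodes:
        writer.writerow([node_id, name, label])
    return buf.getvalue()

def relationships_csv(rels):
    """rels: iterable of (start_id, end_id, rel_type) tuples."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for start_id, end_id, rel_type in rels:
        writer.writerow([start_id, end_id, rel_type])
    return buf.getvalue()

# Hypothetical cleaned records.
node_text = nodes_csv([("e1", "fan_01", "Equipment"), ("e2", "gearbox", "Part")])
rel_text = relationships_csv([("e1", "e2", "HAS_PART")])
```

Writing the two strings to files gives one node file and one relationship file per import run.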
Semantic queries are written in the Cypher query language, which connects the application to the graph database and lets them interact; on this basis, the semantic types, relations, node objects, and relation objects in the graph database can be queried, displayed, and modified. An operator enters a query, which is converted through semantic understanding and question-template matching into a Cypher statement that the computer can execute; Cypher then retrieves the entity relations directly from the knowledge base, and the results are displayed as entity-relation-attribute triples using Data-Driven Documents (D3.js).
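The question-template matching step can be sketched as a small lookup from regular-expression templates to parameterized Cypher statements. The templates, node labels (`WindFarm`, `Equipment`, `Fault`, `Cause`), and relationship types are hypothetical; a real system would derive them from the ontology and would pass the returned statement and parameters to a Neo4j driver.

```python
import re

# Each entry pairs a question pattern with a parameterized Cypher statement.
TEMPLATES = [
    (re.compile(r"^which (?:devices|equipment) does (?P<name>.+?) contain\??$", re.I),
     "MATCH (w:WindFarm {name: $name})-[:CONTAINS]->(d:Equipment) "
     "RETURN w.name, 'CONTAINS', d.name"),
    (re.compile(r"^what causes (?P<name>.+?)\??$", re.I),
     "MATCH (f:Fault {name: $name})-[:CAUSED_BY]->(c:Cause) "
     "RETURN f.name, 'CAUSED_BY', c.name"),
]

def question_to_cypher(question):
    """Return (cypher, params) for the first matching template, else None."""
    q = question.strip()
    for pattern, cypher in TEMPLATES:
        match = pattern.match(q)
        if match:
            return cypher, {"name": match.group("name")}
    return None

result = question_to_cypher("Which devices does Wind Farm A contain?")
```

Keeping the entity name as a `$name` parameter, rather than formatting it into the query string, avoids quoting problems and query injection.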
Case analysis:
Fig. 4 is a schematic diagram of the construction flow of the operation and detection knowledge graph provided in an embodiment of the present invention.
The fan equipment knowledge graph model constructed by this method was applied in a pilot project at a power generation company in Sichuan, with data taken from one of its wind farms. The unstructured data are the design specifications, work audit reports, and equipment test reports of the fans (Word files); the semi-structured data are the equipment inventories of the wind farm (Excel files). After preprocessing such as data cleaning, the experimental data comprise 2426 Chinese text strings and 31 lists of length 142, from which information is extracted according to the method described in step S1.
The natural-language segments are encoded as word vectors by the skip-gram model and used as the input of the Bi-LSTM-CRF model, which performs word segmentation, part-of-speech tagging, and named entity recognition at the same time. The loss function is cross entropy, and the performance of the model is measured by precision, recall, and the F value. Taking word segmentation as an example, the three indexes are calculated as follows:
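A minimal sketch of these three indexes for word segmentation, assuming the standard span-based definitions: a predicted word counts as correct only if its start and end offsets both match a gold word, and F is the harmonic mean of precision and recall. The function names and the example sentence are hypothetical.

```python
def to_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) of a predicted segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = ["风机", "齿轮箱", "漏油"]
pred = ["风机", "齿轮", "箱", "漏油"]
p, r, f = segmentation_prf(gold, pred)
```

In this example two of the four predicted words match gold spans, giving a precision of 0.5, a recall of 2/3, and an F value of 4/7.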
Taking the equipment test report as the test set, the Bi-LSTM-CRF model is compared with other models. The reported segmentation counts are 9862 words in total and 8960 words; the results are shown in the following table:
Dependency analysis is performed, according to step S2, on the character strings whose attribute is of text type, and the attribute relationships among the entities are extracted. The accuracy of the dependency parsing is evaluated with the labeled attachment score (LAS), which takes the dependency relation type into account, and the unlabeled attachment score (UAS), which ignores it. The calculation formulas are as follows:
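Under the standard definitions, UAS is the fraction of words whose predicted head is correct, and LAS is the fraction whose head and relation label are both correct. A minimal sketch, with a hypothetical function name and toy labels:

```python
def attachment_scores(gold, pred):
    """gold and pred: one (head_index, relation_label) pair per word.

    Returns (UAS, LAS).
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    head_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    full_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return head_hits / n, full_hits / n

# Three words; the third has the right head but the wrong relation label.
gold_arcs = [(2, "SBV"), (0, "HED"), (2, "VOB")]
pred_arcs = [(2, "SBV"), (0, "HED"), (2, "ADV")]
uas, las = attachment_scores(gold_arcs, pred_arcs)
```

Because every correctly labeled arc also has the correct head, LAS can never exceed UAS.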
Knowledge nodes and attribute relations in the semi-structured lists are extracted by the methods described in steps S3, S4, and S5, yielding 519 entities of 4 types (fan and related equipment names, equipment types, material resource names, and asset descriptions) and 368 attribute relation edges, which form 392 triples. Finally, duplicate entity nodes are filtered out and the result is displayed in the Neo4j graph database. From the Word and Excel files together, there are 812 entity nodes of 8 types and 765 attribute relation edges of 7 types between entities. In the Neo4j graph database, the user can click the label of a node or relation with the mouse to view the fan equipment data in visual form, and can thus mine and analyze associations in the data in a more intuitive, graphical way. For example, when the user queries the equipment contained in the wind farm, the returned result is as shown in fig. 5.
The above is a preferred embodiment of the present invention. All changes made according to the technical solution of the present invention, whose functional effects do not exceed the scope of that technical solution, fall within the protection scope of the present invention.

Claims (6)

1. A divergent-type associated fan equipment operation detection knowledge graph construction and retrieval method is characterized by comprising the following steps:
S1: obtaining original data through a data acquisition module, and preprocessing the original data to obtain preprocessed data comprising structured data and semi-structured/unstructured data; performing structuring and entity extraction using the BERT-BiLSTM-CRF model, relation extraction, and attribute extraction;
S2: performing entity recognition and entity disambiguation on the unstructured data, and determining the span of each entity in a sentence by labeling the sentence; based on a similarity comparison method using entity names, attributes, and relations, storing the common named entities and selected attributes of each group of multi-source data in a table, setting different weights for the attributes according to the conditions, and judging the similarity of entities by calculating the weighted value of all attributes;
S3: performing knowledge reasoning with the Path-RNN model, in which a path reasoning method converts the paths between target entities into inputs of an RNN;
S4: constructing the entity part of the fan fault knowledge graph, combining TextRank and TF-IDF to identify terms and create concept entities; the operation terms, the accident handling terms, the operation terms, and the fault terms are created from the extracted keywords through correction, fusion, screening, and classification; the screening combines the data materials to finalize the terms, interprets the technical terms in detail, and adds the relevant dispatching and safety regulations through search and matching;
S5: storing, displaying, and querying the knowledge graph: structuring the various entities according to the entity framework; using Neo4j-web and Neo4j-import in Neo4j to manage and import the standardized structured data obtained after data cleaning, namely the defects, defect causes, equipment, and parts of each fan; performing semantic queries with the Cypher query language to connect the application to the graph database; and querying, displaying, and modifying the semantic types, relations, node objects, and relation objects in the graph database.
2. The method for constructing and retrieving the operational inspection knowledge graph of the divergent-type associated fan equipment according to claim 1, wherein the step S1 is specifically as follows: web crawler technology is applied to lawfully obtain and download documents published by power generation companies or equipment manufacturers and tables of fan operation and detection processes; then, according to the file format, the data contained in Word, Excel, and PDF files are read with the open-source software modules python-docx, xlrd, and pdfminer, respectively;
the original data obtained from the text are transformed and encoded into a vector form suitable for computer processing, and the word vector matrix L is optimized with the skip-gram model so that an accurate word vector representation is learned for each word; given an arbitrary n-tuple $(w, C) = w_{n-C} \ldots w_{n-1} w_n w_{n+1} \ldots w_{n+C}$, the model uses the word vector $e(w_n)$ to predict a word $w_t$ in the text, with probability:
in the above, $w_n$ is the center word; $e(w_n) \in R^d$ is the d-dimensional word vector of $w_n$, obtained by lookup in the vector matrix L; C is the context window size; the objective function of the model is as follows:
after model training is completed, the optimized word vector matrix is obtained, which contains the distributed vector representations of all words in the table;
knowledge extraction is performed on the text data; named entities are recognized with a BiLSTM model combined with a conditional random field (CRF);
given a lexical sequence $x = x_0 x_1 \ldots x_n$, the word vector $e_n \in R^{d1}$ corresponding to each word is looked up in the trained word vector table, where d1 is the dimension of the vector; the LSTM is controlled by a memory cell and three gates; its input is the hidden-layer representation $h_{i-1}$ and the output $w_{i-1}$ at the previous time step, and its output is the hidden-layer representation $h_i$ at the current time step; the calculation is as follows:
$$i_n = \sigma(W_i e(w_{n-1}) + U_i h_{n-1} + V_i c_{n-1} + b_i)$$
$$f_n = \sigma(W_f e(w_{n-1}) + U_f h_{n-1} + V_f c_{n-1} + b_f)$$
$$o_n = \sigma(W_o e(w_{n-1}) + U_o h_{n-1} + V_o c_{n-1} + b_o)$$
$$h_n = o_n \odot \tanh(c_n)$$
where $i_n$, $f_n$, and $o_n$ denote the input, forget, and output gates, respectively; $c_n$ denotes the memory cell; $W$, $U$, $V$ and $b_i$, $b_f$, $b_o$ are the coefficients and offsets of the linear relationships; $\sigma(x)$ is the activation function; and $\odot$ denotes the element-wise (dot) product;
the hidden-layer representation of each character obtained by the forward LSTM is
similarly, the hidden-layer representation obtained by the backward LSTM is
the forward hidden layer captures e(i) together with its left context e(0) to e(i-1), and the backward hidden layer captures e(i) together with its right context e(i+1) to e(T); the BiLSTM concatenates the forward and backward hidden layers, and the conditional probability P(y|x) is finally modeled by the following formula:
in the above formula, $\rho_k$ is a model parameter; $f_k(y_{i+1}, y_i, M, i)$ is a transfer function defined on adjacent positions of the sequence M.
3. The method for constructing and retrieving the operational inspection knowledge graph of the divergent-type associated fan equipment according to claim 2, wherein the step S2 is specifically as follows: dependency parsing is used to analyze the governor-dependent relationships among the words of a sentence and to present the structure of the whole sentence, that is, the relationships between components are summarized by analyzing the grammatical components contained in the sentence: subject, predicate, object, attributive, adverbial, and complement; the MST parser is used for dependency syntactic analysis; a directed acyclic graph of the input sentence is constructed, comprising a node set of the corresponding words and a set of directed edges; in the dependency structure graph, the governing word is the head and the governed word is the dependent, and the node that does not depend on any other word is the core predicate node of the sentence; between two nodes there may be several directed edges with the same direction but different dependency relations; the MST approach converts the search for the best dependency structure into finding the highest-scoring dependency tree in the directed graph;
assume that the analysis result of sentence a is b and that the model parameter is ε; for the conditional probability model SC(a_i | b_i; ε), training searches for the value of ε that maximizes the model over the training samples i = 1 to N;
MSTParser defines the score of the entire syntactic tree as the weighted sum of the scores of the arcs in the tree:
in the above formula, s is the score and b is one of the dependency trees of sentence a; w is the weight vector of the feature f(·).
4. The method for constructing and retrieving a knowledge graph of operational inspection of fan equipment associated with divergent type according to claim 3, wherein the step S3 is specifically as follows: the Path-RNN model decomposes each path into a sequence of relations and feeds it into the RNN to construct a vector representation of the path, and the correlation between the path and a candidate relation is calculated by a dot product with the path vector representation; first, all input entities and relations are converted into vectors through an embedding matrix, in the same way as in step S1; the Path Ranking Algorithm (PRA) is first used to obtain the relation paths of the training instances most relevant to the relation r; a PRA random walk over paths is performed for a given triple; all connecting relations from the head entity to the tail entity are recorded, yielding several relation paths {r1, r2, ..., rn}; adding the intermediate entities gives a random path K = [es, r1, e1, ..., et], so that the path is fully expanded;
when the search space of path representations is large, combining all paths does not provide enough evidence to infer the relationships between entities, so the model is extended to perform multi-step reasoning over the path distribution; multi-step reasoning applies the attention mechanism repeatedly to the path vectors obtained from the BiLSTM, each application taking the result of the previous one as input, which improves the accuracy of the reasoning result; each reasoning step generates a new relation embedding vector u that represents the reasoning evidence;
$$u_{z+1} = W_o (o_z + u_z)$$
after the paths are acquired, entity disambiguation is performed: using a similarity comparison method based on entity names, attributes, and relations, the common named entities and selected attributes of each group of multi-source data are stored in a table, different weights are set for the attributes according to the conditions, and the similarity of two entities is judged from the weighted value of all attributes; the name, relation, and numerical attributes of the entities in the fan knowledge base are taken as the feature quantities for calculating the semantic similarity of two entities; the calculation is as follows:
where $A_0$ and $B_0$ are the entity names of entity A and entity B; $A_i$ and $B_i$ are their numerical attribute values; $A_j$ and $B_j$ are their object attribute values; Sim(A, B) is the semantic similarity of two attribute values; and α + β + γ = 1, where α, β, and γ are the weights of the entity-name similarity, the numerical-attribute similarity, and the object-attribute similarity, respectively; for entities with numerical attributes, the similarity is calculated with the following formula:
for entities with set-valued (aggregate) attributes, the following formula is used:
for entities with text attributes, the following formula is used:
5. The method for constructing and retrieving the operational inspection knowledge graph of the divergent-type associated fan equipment according to claim 4, wherein the step S4 is specifically as follows: a central conceptual model of the fan operation and detection field is created, and an ontology framework is built on the basis of it; constructing the ontology of the fan equipment fault knowledge graph is a key task in the whole flow; constructing the fan equipment ontology includes defining the concepts, hierarchy, and categories, and defining the attribute relationships between concepts; classifying the ontology concepts means classifying and defining the fault types of the equipment, which according to their internal elements fall into the following classes: equipment, parts, fault causes, suggestions, and measures; defining the concept attribute relationships refines the ontology and yields a classification hierarchy, in which each fault class consists of equipment, parts, fault causes, suggestions, and measures and can be abstracted into a description in the form of entities and entity states; with these definitions in place, the nodes of the wind farm topology serve as entities in the knowledge graph, switches and circuits are represented as relationships, and node information is stored as attributes; the fan equipment information is thereby converted into structured triple data;
production management knowledge construction covers the business-process relationships among the departments when a fan fault occurs and the responsible person of each department;
the department schema includes: name, task, location, responsible person, telephone;
the personnel schema includes: name, department, age, position, professional skills, telephone;
the fan fault handling event part is constructed;
the fan fault schema includes: name, alias, cause, attribute, manifestation, handling method, expert experience, corresponding person.
6. The method for constructing and retrieving a knowledge graph of operational inspection of fan equipment associated with divergent type according to claim 5, wherein the step S5 is specifically as follows: the various entities are structured according to the entity framework; Neo4j-web and Neo4j-import in Neo4j are used to manage and import the standardized structured data obtained after data cleaning, namely the defects, defect causes, equipment, and parts of each fan;
semantic queries are written in the Cypher query language, which connects the application to the graph database and lets them interact; the semantic types, relations, node objects, and relation objects in the graph database can be queried, displayed, and modified;
an operator enters a query, which is converted through semantic understanding and question-template matching into a Cypher statement that the computer can execute; Cypher then retrieves the entity relations directly from the knowledge base, and the results are displayed as entity-relation-attribute triples using a data visualization library.
CN202310557369.1A 2023-05-17 2023-05-17 Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method Pending CN116822625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310557369.1A CN116822625A (en) 2023-05-17 2023-05-17 Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310557369.1A CN116822625A (en) 2023-05-17 2023-05-17 Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method

Publications (1)

Publication Number Publication Date
CN116822625A true CN116822625A (en) 2023-09-29

Family

ID=88123102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310557369.1A Pending CN116822625A (en) 2023-05-17 2023-05-17 Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method

Country Status (1)

Country Link
CN (1) CN116822625A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033816A (en) * 2023-10-08 2023-11-10 湖北省长投智慧停车有限公司 Parking recommendation method and device, electronic equipment and storage medium
CN117131929A (en) * 2023-10-27 2023-11-28 北京华控智加科技有限公司 Operation and maintenance data management method and device
CN117390139A (en) * 2023-11-27 2024-01-12 国网江苏省电力有限公司扬州供电分公司 Method for evaluating working content accuracy of substation working ticket based on knowledge graph
CN117390139B (en) * 2023-11-27 2024-05-24 国网江苏省电力有限公司扬州供电分公司 Method for evaluating working content accuracy of substation working ticket based on knowledge graph
CN117851614A (en) * 2024-03-04 2024-04-09 创意信息技术股份有限公司 Searching method, device and system for mass data and storage medium
CN117851614B (en) * 2024-03-04 2024-05-14 创意信息技术股份有限公司 Searching method, device and system for mass data and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination