CN109033284A - Power information operation and maintenance system database construction method based on knowledge graph - Google Patents
Power information operation and maintenance system database construction method based on knowledge graph
- Publication number
- CN109033284A (application CN201810762686.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- knowledge
- data
- power information
- work order
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/20—Administration of product repair or maintenance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02E—REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
- Y02E40/00—Technologies for an efficient electrical power generation, transmission or distribution
- Y02E40/70—Smart grids as climate change mitigation technology in the energy generation sector
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The present invention proposes a knowledge-graph-based method for constructing a power information operation and maintenance system database: power system work order data is converted to text and segmented into words, the results are classified by similarity, and the output is stored as a graph database. The invention improves the effective use of power data, brings out the value hidden in the data, and raises the level of intelligent application in the power grid. It replaces the earlier low-dimensional, mutually isolated way of presenting information: by modeling power information data, the relationships between items of power knowledge can be presented more intuitively. On this basis, a knowledge base for the power information operation and maintenance system can be constructed that converts historical operation and maintenance data into knowledge, supports users in consulting and solving problems online on their own, realizes intelligent question answering, and improves operation and maintenance efficiency. This is of significant value in power grid maintenance work, and it also provides an optimized storage technique for knowledge models in the power information operation and maintenance domain.
Description
Technical Field
The invention belongs to the field of power operation and maintenance management, and particularly relates to a power information operation and maintenance system database construction method based on a knowledge graph.
Background
The current electric power operation and maintenance customer service system receives feedback from a massive number of customers every day. Enterprises and individual users submit business consultations, fault repair requests, complaints, reports, suggestions, opinions, commendations, and subscriptions to power grid information and policies through channels such as WeChat official accounts, Weibo, the Palm Power app, and the 95598 intelligent interaction website. This places higher demands on how services are provided and consumed: the customers' needs must be understood promptly and accurately and met in the shortest time with the best service, while customers in turn expect friendlier, more convenient, and faster service. The customer service system therefore needs continuously improved information support to raise service quality and efficiency. In the internet era, however, a conventional call center customer service system requires a large amount of manpower and material resources.
In 2015, State Grid's information and communication department issued a requirement to carry out activities for improving the operation management level of information systems. It called for a practical improvement in operation management and required information operation and maintenance work to follow the model of "two-level scheduling, three-layer maintenance, integrated operation, unified customer service, and a three-line information technology support center". As a result, a large amount of operation and maintenance work order data has accumulated. As public expectations for the reliability of power production keep rising, State Grid companies set higher requirements and standards for the quality of power supply services. Power production engineers and managers need ways to continuously improve the professional competence of power information operation and maintenance personnel, to handle information system operation and maintenance problems in production quickly and accurately, and to search for and locate defects quickly in daily production so that they can be prevented before they occur.
Building a common knowledge model for the power industry faces many technical difficulties. Analysis of operation and maintenance knowledge is weak; the value of data resources has so far been realized only at a coarse level; data resources have not been effectively converted into knowledge assets; a virtuous cycle of knowledge has not formed; and the capacity to support business requirements and to drive management change is weak. These problems hinder the organization and construction of a power operation and maintenance knowledge model. In particular, the company's "one order and two tickets" system holds a large amount of usable data that can be effectively mined and that provides the basic data and theoretical basis for constructing an operation and maintenance knowledge model.
Disclosure of Invention
In view of the problems and gaps in the prior art, the invention adopts the following technical solution:
A power information operation and maintenance system database construction method based on a knowledge graph, characterized by comprising the following steps:
Step 1: acquiring power system work order data, converting the work order data into text format, constructing an ontology using the seven-step method, and dividing it into several text fields according to the semantic attributes of the text;
Step 2: performing word segmentation on the work order text data, taking each work order as a unit;
Step 3: grouping the text fields;
Step 4: performing field-level word segmentation on each text field, segmenting the content of each group with a word segmentation method based on string matching;
Step 5: filtering out invalid words and sensitive words according to the invalid-word list;
Step 6: comparing the valid words with the vocabulary list in the knowledge base, adding new words to the knowledge base vocabulary list, and accumulating the occurrence counts of existing words;
Step 7: extracting entity relationships for the added words: entity relationships are extracted on the basis of predefined entity relationship types and entity-based features; word feature vectors are processed with word2vec, the similarity between word vectors is calculated, and entity relationships are classified according to the similarity;
Step 8: importing the classified entities and entity relationships into a Neo4j graph database.
Preferably, in step 2 and step 4, word segmentation is performed with the ICTCLAS system of the Chinese Academy of Sciences.
All words obtained from segmentation form a word table $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ denotes a word and $i \in [1, n]$. The word feature vector of each word $e$ is represented as $V = \{v_1, v_2, \ldots, v_n\}$, where $v_i$ indicates whether the word corresponds to $d_i$ in the word table $D$; $v_i$ is calculated as follows:
$$v_i = \begin{cases} 1, & e = d_i \\ 0, & e \neq d_i \end{cases}$$
Preferably, in step 2 and step 4, word segmentation further includes constructing part-of-speech features, which are built in the same way as the word features.
Preferably, for the power system work order data, the content of the unstructured text data is collected with Python's urllib2 package, the collected content is parsed with the Beautiful Soup package, and word segmentation is performed with the Rwordseg package in the R environment.
Compared with the prior art, the invention introduces deep learning algorithms into the construction of a knowledge model for the power domain and uses two machine learning tasks, named entity recognition and entity relationship extraction, to solve the two problems of extracting knowledge units and extracting the relationships between them. In addition, the invention incorporates a graph database into the knowledge model construction system and uses it to store and display knowledge units, which provides a new way of drawing the knowledge model.
The method improves the effective use of power data, brings out the value hidden in the data, and raises the level of intelligent application in the power grid. Through knowledge representation based on the knowledge model, it replaces the earlier low-dimensional, mutually isolated way of displaying information and presents the relationships between items of power knowledge more intuitively. A knowledge base for the power information operation and maintenance system is constructed that converts historical operation and maintenance data into knowledge, supports users in consulting and solving problems online on their own, realizes intelligent question answering, and improves operation and maintenance efficiency. This plays an important role in power grid operation and maintenance and, on this basis, provides an optimized storage technique for knowledge models in the power information operation and maintenance domain.
The knowledge model constructed by the invention addresses problems such as repetitive manual answering and staff shortages. It makes it possible to understand the intent of the person submitting an operation and maintenance work order quickly and accurately, and answering users' questions automatically can greatly improve operation and maintenance efficiency.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a knowledge model construction process based on a deep learning algorithm;
FIG. 3 is a schematic diagram of the implementation and use of a Neo4j graph database according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a seven-step process in an embodiment of the present invention;
FIG. 5 is a schematic diagram of two processes used by word2vec in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a CBOW model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a Skip-gram model according to an embodiment of the present invention.
Detailed Description
To make the features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the figures.
As shown in FIG. 1, the method of this embodiment comprises the following steps:
Step 1: acquiring power system work order data, converting the work order data into text format, constructing an ontology using the seven-step method, and dividing it into several text fields according to the semantic attributes of the text;
Step 2: performing word segmentation on the work order text data, taking each work order as a unit;
Step 3: grouping the text fields;
Step 4: performing field-level word segmentation on each text field, segmenting the content of each group with a word segmentation method based on string matching;
Step 5: filtering out invalid words and sensitive words according to the invalid-word list;
Step 6: comparing the valid words with the vocabulary list in the knowledge base, adding new words to the knowledge base vocabulary list, and accumulating the occurrence counts of existing words;
Step 7: extracting entity relationships for the added words: entity relationships are extracted on the basis of predefined entity relationship types and entity-based features; word feature vectors are processed with word2vec, the similarity between word vectors is calculated, and entity relationships are classified according to the similarity;
Step 8: importing the classified entities and entity relationships into a Neo4j graph database.
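Steps 4 to 6 can be illustrated with a minimal Python sketch. It assumes forward maximum matching as the string-matching segmentation method (the text does not name a specific variant), and the dictionary, invalid-word list, and sample text are hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=6):
    """Segment text by greedily taking the longest dictionary entry (step 4)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def update_vocabulary(tokens, invalid_words, vocabulary):
    """Drop invalid words (step 5), then add new words and accumulate counts (step 6)."""
    valid = [t for t in tokens if t not in invalid_words]
    vocabulary.update(valid)
    return valid


# Hypothetical domain dictionary, invalid-word list, and work order text.
DOMAIN_DICT = {"虚拟服务器", "服务器", "重启", "申请"}
INVALID_WORDS = {"的"}
vocabulary = Counter()

tokens = forward_max_match("申请重启虚拟服务器的服务器", DOMAIN_DICT)
valid_tokens = update_vocabulary(tokens, INVALID_WORDS, vocabulary)
```

A `Counter` is used for the vocabulary list because it treats an unseen word as a new entry and an existing word as a frequency increment in the same call.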
FIG. 2 is a schematic diagram of a conventional deep-learning-based knowledge model construction process. The main innovation of this embodiment relative to that process is a new word segmentation strategy that addresses the problem of feature selection in entity extraction.
In step 2 and step 4, word segmentation is performed with the ICTCLAS system of the Chinese Academy of Sciences.
All words obtained from segmentation form a word table $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ denotes a word and $i \in [1, n]$. The word feature vector of each word $e$ is represented as $V = \{v_1, v_2, \ldots, v_n\}$, where $v_i$ indicates whether the word corresponds to $d_i$ in the word table $D$; $v_i$ is calculated as follows:
$$v_i = \begin{cases} 1, & e = d_i \\ 0, & e \neq d_i \end{cases}$$
In step 2 and step 4, word segmentation also includes constructing part-of-speech features, which are built in the same way as the word features.
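The word feature vector can be sketched in a few lines of Python, assuming that each component is a 0/1 indicator of whether the word equals the corresponding entry of the word table. The word table and the part-of-speech tag set below are hypothetical.

```python
def indicator_vector(item, table):
    """Return V = [v_1, ..., v_n] with v_i = 1 where item equals table[i], else 0."""
    return [1 if item == entry else 0 for entry in table]


# Hypothetical word table D and part-of-speech tag table.
D = ["服务器", "变压器", "重启"]
POS_TAGS = ["n", "v", "adj"]

word_vector = indicator_vector("变压器", D)
# Part-of-speech features are built the same way, over the tag table.
pos_vector = indicator_vector("n", POS_TAGS)
```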
In step 7, word feature vectors are processed with word2vec, the similarity between word vectors is calculated, and entity relationships are classified according to that similarity. Constructing the pattern set in this way removes a large amount of manual work and makes feature extraction simpler and more effective.
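The similarity-based classification in step 7 might look like the following sketch. The text does not say which similarity measure is used; cosine similarity is assumed here, and the prototype vectors, relation labels, and threshold are hypothetical stand-ins for vectors that word2vec would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_relation(vector, prototypes, threshold=0.8):
    """Return the predefined relation type most similar to vector, or None below threshold."""
    best_label, best_score = None, threshold
    for label, prototype in prototypes.items():
        score = cosine_similarity(vector, prototype)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical prototype vectors for two predefined relation types.
PROTOTYPES = {
    "operates_on": [1.0, 0.0, 0.2],
    "located_in": [0.0, 1.0, 0.1],
}

label = classify_relation([0.9, 0.1, 0.2], PROTOTYPES)
```

Returning `None` below the threshold keeps weakly matching candidates out of the graph instead of forcing them into the nearest relation type.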
For the power system work order data, the content of the unstructured text data is collected with Python's urllib2 package, the collected content is parsed with the Beautiful Soup package, and word segmentation is performed with the Rwordseg package in the R environment.
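The collection and parsing stage can be sketched as follows. To stay runnable without third-party packages, the sketch swaps in Python 3 standard-library stand-ins: `urllib.request` (the successor to urllib2) for collection and `html.parser` in place of Beautiful Soup for parsing. The sample page is hypothetical, and `fetch_page` is defined but not called.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_page(url, encoding="utf-8"):
    """Download one page of work order data (not called in this sketch)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding)


class TextExtractor(HTMLParser):
    """Collect visible text fragments, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.fragments.append(text)


def extract_text(html):
    """Return the visible text fragments of an HTML document in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


# Hypothetical work order page.
SAMPLE = (
    "<html><body><h1>Work order 001</h1>"
    "<script>var x = 1;</script>"
    "<p>Restart the virtual server.</p></body></html>"
)
fragments = extract_text(SAMPLE)
```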
In the Neo4j database, the knowledge model consists of the following elements:
1) Entity. Each entity or concept is identified by a globally unique ID, called its identifier. The most typical examples are the three general entity types: person names, place names, and organization names. In the power information operation and maintenance domain there are richer entities beyond these general ones, such as servers, operation and maintenance systems, power grids, transformers, substations, transmission lines, distribution networks, and main networks.
2) Attribute-value. Each attribute-value pair characterizes an intrinsic property of an entity. Attribute value types include numerical (e.g., work order number, phone number), enumerated (e.g., department, unit), short text (e.g., work order event title), and long text (e.g., work order event details).
3) Relationship. A relationship connects two entities and describes the association between them. A typical relationship extraction method iterates the process "template generation, then instance extraction" until convergence. For example, the triple instance (State Grid, headquarters address, Beijing) may first be extracted with the template "X is the headquarters address of Y". More matching templates can then be found from the entity pair "State Grid, Beijing" in the triple, such as "the headquarters address of Y is X" and "X is the center of Y". The newly found templates are used to extract further triple instances, and new instances and templates keep being extracted through repeated iteration. Relationships between entities can also be extracted by identifying phrases that express semantic relationships. For example, syntactic analysis of the text can reveal relationships between "State Grid" and "Beijing" such as (State Grid, headquartered in, Beijing). Relationships extracted in this way are rich and unconstrained, and they are generally phrases with a verb at their core.
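The iterative "template generation, then instance extraction" loop can be sketched in Python as below. This is a toy illustration under strong assumptions: templates are whole-sentence patterns with placeholders X and Y, new templates are induced by plain string replacement, and the sample sentences are hypothetical.

```python
import re


def template_to_regex(template):
    """Turn a template such as 'X is the headquarters address of Y' into a regex."""
    parts = re.split(r"\b([XY])\b", template)
    pattern = "".join(
        f"(?P<{p}>.+?)" if p in ("X", "Y") else re.escape(p) for p in parts
    )
    return re.compile(f"^{pattern}$")


def bootstrap(sentences, seed_templates, relation, max_rounds=5):
    """Alternate instance extraction and template generation until nothing new appears."""
    templates, triples = set(seed_templates), set()
    for _round in range(max_rounds):
        # Instance extraction: apply every known template to every sentence.
        new_triples = set()
        for template in templates:
            regex = template_to_regex(template)
            for sentence in sentences:
                match = regex.match(sentence)
                if match:
                    new_triples.add((match.group("Y"), relation, match.group("X")))
        # Template generation: generalize sentences that mention a known pair.
        new_templates = set()
        for subject, _rel, obj in new_triples | triples:
            for sentence in sentences:
                if subject in sentence and obj in sentence:
                    new_templates.add(
                        sentence.replace(subject, "Y").replace(obj, "X")
                    )
        if new_triples <= triples and new_templates <= templates:
            break  # converged
        triples |= new_triples
        templates |= new_templates
    return triples, templates


# Hypothetical sentences.
SENTENCES = [
    "Beijing is the headquarters address of State Grid",
    "The headquarters address of State Grid is Beijing",
    "The headquarters address of Fujian Electric Power is Fuzhou",
]
triples, templates = bootstrap(
    SENTENCES, ["X is the headquarters address of Y"], "headquarters address"
)
```

In this toy run the seed template yields the first triple, the second sentence contributes a new template, and that template in turn yields the triple for the third sentence.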
In this embodiment, the power system work order data specifically means the "one order and two tickets" data, that is, service plan work orders, work tickets, and operation tickets. The unstructured text data of the "one order and two tickets" is collected, and a knowledge model of the power operation and maintenance domain is built from the power entities in the data, the relationships among those entities, and the attributes and attribute values of each entity, so that people outside the domain can understand its current state and development trends through the knowledge model. The collected text data is segmented and tagged with parts of speech; power operation and maintenance entities are identified by named entity recognition; relationships among the entities are extracted by entity relationship extraction; and the identified data is stored in the graph database Neo4j. The knowledge model is represented with the property graph model, and the data stored in the graph database is presented through graph visualization to display the entity relationships of the knowledge model.
The graph database used in this embodiment is the Neo4j graph database.
Neo4j supports several ways of importing data. Data can be entered manually or imported directly from CSV files, and Neo4j also supports importing data from mainstream relational databases. In this embodiment, data is imported into Neo4j by CSV batch import. After the import, the data stored in the database can be queried with Cypher and displayed graphically.
As shown in FIG. 3, specifically:
(1) Load the data into the database: the extracted named entities and entity relationships are imported into the graph database by batch import.
(2) Query all nodes and relationships with the Cypher query language to obtain an overview of the whole knowledge graph.
(3) Use Cypher to search for the required node and relationship information, which makes it possible to provide personalized knowledge services to users.
(4) Call the REST API of Neo4j programmatically to further develop knowledge graph interfaces.
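The CSV batch import and the Cypher queries in items (1) to (3) can be sketched as follows. The file names, column names, the `Entity` label, and the sample rows are assumptions made for illustration; the Cypher statements are shown as strings and are not executed here.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize rows to CSV text suitable for Neo4j's LOAD CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted entities and relationships.
entities = [
    ("e1", "virtual server", "operation object"),
    ("e2", "zhanghui", "issuer"),
]
relations = [("e1", "e2", "relatedBody")]

# These strings would be saved as entities.csv and relations.csv
# in the Neo4j import directory.
entities_csv = to_csv(["id", "name", "type"], entities)
relations_csv = to_csv(["start_id", "end_id", "relation"], relations)

# (1) Batch import of nodes and relationships.
LOAD_ENTITIES = """
LOAD CSV WITH HEADERS FROM 'file:///entities.csv' AS row
MERGE (e:Entity {id: row.id})
SET e.name = row.name, e.type = row.type
"""
LOAD_RELATIONS = """
LOAD CSV WITH HEADERS FROM 'file:///relations.csv' AS row
MATCH (a:Entity {id: row.start_id}), (b:Entity {id: row.end_id})
MERGE (a)-[:relatedBody]->(b)
"""
# (2) Overview of all nodes and relationships.
MATCH_ALL = "MATCH (n) OPTIONAL MATCH (n)-[r]->(m) RETURN n, r, m"
# (3) Targeted lookup of one node and its relationships.
MATCH_ONE = (
    "MATCH (n:Entity {name: 'virtual server'})-[r]-(m) RETURN n, r, m"
)
```

`MERGE` is used instead of `CREATE` so that re-running the import does not duplicate nodes or relationships.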
Specifically, in this embodiment, deep learning algorithms are computationally complex: the feature dimension often reaches the tens of thousands and the computing resource overhead is extremely high. Because deep learning algorithms can still achieve good representational power on a limited sample size, a limited number of samples is selected. For the power system work order data, the content of the unstructured text data is collected with Python's urllib2 package and the collected content is parsed with the Beautiful Soup package.
A significant difference between Chinese and English in natural language processing is that English words can be separated by spaces, whereas Chinese word boundaries are ambiguous, so Chinese text must first be segmented into words. This embodiment uses the Rwordseg package in the R environment for word segmentation. Rwordseg is written on the basis of the ICTCLAS Chinese word segmentation algorithm of the Chinese Academy of Sciences; it provides word segmentation, multi-level part-of-speech tagging, keyword extraction, and similar functions, and it can load user-defined dictionaries for segmentation.
The experimental data comes from the "one order and two tickets" unstructured text data in the "State Grid Fujian Electric Power Co., Ltd. centralized authentication service single sign-on platform", so a domain dictionary must be added when segmenting with the Rwordseg package to improve segmentation accuracy. The power industry is a relatively closed field for which no complete domain dictionary currently exists, so the dictionary has to be constructed manually. The domain dictionary is built by combining machine collection with manual collection: entries in the "one order and two tickets" data are first collected automatically by a program, and the dictionary is then extended by manual collection.
The dictionary is constructed as follows:
(1) The tags of the "one order and two tickets" entries are analyzed, and several tags such as "issuer", "issue time", and "issue content" are selected.
(2) A crawler is written in Python to crawl the entries under the tags determined in (1), and the collected entries are deduplicated.
(3) Because the entries are collected from the "one order and two tickets" tag pages, and a tag page lists only some of the entries, unlisted entries cannot be obtained directly by the program. The experiment therefore supplements the uncollected entries by manually inspecting the content of the data set.
The self-built dictionary can be installed with installDict in the Rwordseg package and then used to segment the data set. The segmented text data is used for knowledge entity recognition. The content obtained after segmentation and the self-built dictionary are both stored, and are then combined for further processing.
In this embodiment, coreference resolution is handled by dictionary construction and concept matching: before the knowledge model is built, the dictionary is used to map different expressions of the same concept to the same entity. For example, for the entity "State Grid", the experiment automatically identifies expressions such as "SGCC" and "State Grid Corporation" and converts them to "State Grid".
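A dictionary-based mapping of this kind can be sketched in Python as follows. The alias table and the token list are hypothetical, and the English names stand in for the original Chinese expressions.

```python
# Hypothetical alias dictionary: surface form -> canonical entity name.
ALIASES = {
    "SGCC": "State Grid",
    "State Grid Corporation": "State Grid",
}


def normalize_entities(tokens, aliases):
    """Map every known alias to its canonical entity; leave other tokens unchanged."""
    return [aliases.get(token, token) for token in tokens]


tokens = ["SGCC", "restart", "State Grid Corporation", "server"]
normalized = normalize_entities(tokens, ALIASES)
```

Running the normalization before graph construction means each concept produces a single node rather than one node per spelling.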
The concept layer of the power domain knowledge model comprises the issuer, the time, the application content, and the operation object. For the concepts in the knowledge model and the entities under each concept, this embodiment takes a top-down approach to constructing the power domain knowledge model. In this approach the top-level relationship ontology is not first constructed manually; instead, it is expressed directly in the property graph in the form of attribute key-value pairs and entity-pair relationships.
1) Attribute key-value pairs and label representation in the property graph
For example, the "virtual server" entity belongs to the operation object class, so a type attribute with the value "operation object" is added to the "virtual server" node in the property graph.
2) Entity-pair relationships
For example, the "virtual server" entity is linked to the "zhanghui" and "zhangchao" entities under the "issuer" concept: the issuer performs some operation on the "virtual server", so the "virtual server" entity is connected to the "zhanghui" entity by an entity relationship, and a relatedBody association is established.
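The two representations can be illustrated with a small in-memory property graph in Python. The `PropertyGraph` class is a hypothetical stand-in for Neo4j, and the "issuer" type value is an assumed rendering of the concept name.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Minimal property graph: nodes carry key-value properties, edges carry a type."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name, **properties):
        self.nodes.setdefault(name, {}).update(properties)

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    def neighbors(self, name, relation):
        return [t for s, r, t in self.edges if s == name and r == relation]


graph = PropertyGraph()
# 1) The concept is stored as a type attribute on the node.
graph.add_node("virtual server", type="operation object")
graph.add_node("zhanghui", type="issuer")
graph.add_node("zhangchao", type="issuer")
# 2) Entity-pair relationships connect the operation object to its issuers.
graph.add_edge("virtual server", "relatedBody", "zhanghui")
graph.add_edge("virtual server", "relatedBody", "zhangchao")

issuers = graph.neighbors("virtual server", "relatedBody")
```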
As shown in fig. 4, the seven-step method adopted in the present embodiment is developed by the medical college of stanford university, and is mainly applied to a method for domain ontology construction, and the specific steps and thought are as follows:
1) Determine the scope: developing a domain ontology is not an end in itself; it builds a model of a specific domain for a specific purpose. It is therefore necessary to settle questions such as which domain the ontology covers, what the ontology will be used for, and how it will be maintained;
2) Consider reuse: as ontologies have come into wide use, an ontology is rarely defined from scratch. It is therefore necessary to determine whether an existing ontology is available from a third party as a starting point for developing one's own;
3) List the terms: write an unstructured list of all relevant terms expected to appear in the ontology, where nouns form the basis of class names and verbs or verb phrases form the basis of attribute names;
4) Define the classification: after the relevant terms are determined, organize them into a hierarchy made up of class and subclass relationships. The hierarchy can be built top-down, bottom-up, or by combining the two approaches so that a large number of concepts can be integrated and associated;
5) Define the attributes: this step usually interleaves with the previous one, because the attributes that connect classes must be established while the class hierarchy is being built. Since classes inherit attributes, each attribute should be attached to the highest-level class to which it applies, and its domain and range are determined when it is attached;
6) Define the facets: to make attribute definitions more complete, the facets of each attribute are further defined. There are three types: cardinality, which specifies whether a certain number of distinct values is allowed or required; specific values, meaning that a class may be defined by a certain attribute having a particular value, or by its value being drawn from a given class without being one specific value; and relation characteristics, namely symmetry, transitivity, inverse attributes, and functional values;
7) Define the instances: an ontology is typically used to organize a collection of instances. Populating the ontology with instances is a separate step, and the instances are usually extracted automatically from the data source.
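The class, attribute, facet, and instance steps above can be illustrated with a small data structure. This is only a minimal sketch under stated assumptions: the names `Attribute`, `OntologyClass`, `Ontology`, and the top-level class `Equipment` are hypothetical and do not come from the embodiment; only "virtual server", "drawer", "zhanghui", and relatedBody are taken from the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Attribute:
    name: str                               # step 3: verbs become attribute names
    domain: str                             # step 5: class the attribute is attached to
    value_range: str                        # step 5: class its values are drawn from
    max_cardinality: Optional[int] = None   # step 6 facet; None means unbounded
    symmetric: bool = False                 # step 6 facet: relation characteristic


@dataclass
class OntologyClass:
    name: str                               # step 3: nouns become class names
    parent: Optional[str] = None            # step 4: class/subclass hierarchy
    attributes: List[Attribute] = field(default_factory=list)


class Ontology:
    def __init__(self) -> None:
        self.classes: Dict[str, OntologyClass] = {}
        self.instances: Dict[str, str] = {}  # instance name -> class name

    def add_class(self, name: str, parent: Optional[str] = None) -> None:
        if parent is not None and parent not in self.classes:
            raise KeyError(f"unknown parent class: {parent}")
        self.classes[name] = OntologyClass(name, parent)

    def add_attribute(self, attribute: Attribute) -> None:
        # Attach the attribute to the class named as its domain.
        self.classes[attribute.domain].attributes.append(attribute)

    def attributes_of(self, class_name: str) -> List[Attribute]:
        """Return the class's own attributes plus those inherited from ancestors."""
        found: List[Attribute] = []
        current: Optional[str] = class_name
        while current is not None:
            cls = self.classes[current]
            found.extend(cls.attributes)
            current = cls.parent
        return found

    def add_instance(self, name: str, class_name: str) -> None:
        # Step 7: populate the ontology with instances.
        if class_name not in self.classes:
            raise KeyError(f"unknown class: {class_name}")
        self.instances[name] = class_name


onto = Ontology()
onto.add_class("Equipment")                        # hypothetical top-level class
onto.add_class("VirtualServer", parent="Equipment")
onto.add_class("Drawer")
onto.add_attribute(Attribute("relatedBody", domain="Equipment", value_range="Drawer"))
onto.add_instance("zhanghui", "Drawer")
print([a.name for a in onto.attributes_of("VirtualServer")])
```

Attaching relatedBody to the hypothetical parent class rather than to `VirtualServer` mirrors the advice in step 5 to place an attribute on the highest-level class to which it applies, so that subclasses inherit it.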
In this embodiment, word feature vectors are processed and trained with the word2vec tool, which Google released in 2013. Its main function is to map each word to a vector, so that natural language processing problems can be converted into vector processing problems. Word vectors allow natural language processing tasks to be modeled more effectively. word2vec mainly uses two models: CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model).
As shown in fig. 5, both models comprise three layers: an input layer, a mapping layer, and an output layer. w(t) represents the current word, and w(t-2), w(t-1), w(t+1), w(t+2) represent the context of w(t). The CBOW model predicts the current word w(t) given its context w(t-2), w(t-1), w(t+1), and w(t+2), whereas the Skip-gram model predicts the context w(t-2), w(t-1), w(t+1), and w(t+2) given the current word w(t).
The CBOW and Skip-gram models based on Hierarchical Softmax are taken as examples below:
(1) CBOW model
The main idea of the CBOW model is to predict the current word w(t) given its context. The network structure of the CBOW model is shown in fig. 6.
As can be seen from fig. 6, the CBOW model consists of an input layer, a mapping layer, and an output layer. The input layer takes the word vectors of the 2c words in the context of the current word w(t). The mapping layer sums the word vectors of these 2c words and stores the result in x_w. The output layer corresponds to a Huffman tree in which each leaf node represents a word in the corpus and each non-leaf node represents a binary classifier. Starting from the root node, x_w passes through k binary classifiers (k is the number of non-leaf nodes on the path from the root node to the leaf node of the current word). After each classification, the error with respect to the correct branch is accumulated into a vector neu1e. After k classifications, x_w reaches the leaf node of the current word, and the information stored in the error accumulation vector neu1e is then used to update each of the 2c word vectors.
In the CBOW model, the most critical process is the operation on the output layer. The output layer corresponds to a Huffman tree in which each non-leaf node can be regarded as a binary classification problem, which can be handled with the logistic regression function:

$$\sigma(x_w^{\top}\theta) = \frac{1}{1 + e^{-x_w^{\top}\theta}}$$

where x_w is the output of the mapping layer and θ is the parameter vector of the node. The value range of the logistic regression function is (0, 1). With the threshold set to 0.5, values greater than 0.5 indicate positive examples and values less than 0.5 indicate negative examples, so the function handles binary classification problems well.
For each word w(t) there is a unique path l from the root node to the leaf node of w(t). If the path l contains k non-leaf nodes, k binary classifications are required. To make these k classifications as accurate as possible, the objective function can be taken as the maximum likelihood estimate:

$$L = \prod_{j=1}^{k} p(d_j \mid x_w, \theta_j)$$

Here θ_j denotes the undetermined parameter stored in the j-th non-leaf node on the path; the model is trained once these parameters are determined. The conditional probability p(d_j | x_w, θ_j) is the probability of the classification result d_j given the vector x_w and the parameter θ_j, where d_j = 1 represents a positive example and d_j = 0 represents a negative example.
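The CBOW procedure described above (sum the context vectors, walk the Huffman path, accumulate the error, then update the context vectors) can be sketched as a single training step. This is a minimal illustration under stated assumptions, not the word2vec source: the function name `cbow_hs_step`, the learning rate, and the toy vectors are hypothetical, and the labels follow the convention in the text that 1 is a positive example and 0 a negative example.

```python
import math


def sigmoid(x):
    """Logistic regression function with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cbow_hs_step(context_vecs, path_nodes, path_codes, theta, lr=0.025):
    """One CBOW update for a single target word under hierarchical softmax.

    context_vecs: the 2c context word vectors (lists of floats), updated in place
    path_nodes:   indices of the k non-leaf nodes on the root-to-leaf path
    path_codes:   label d_j at each node (1 = positive, 0 = negative)
    theta:        dict mapping node index -> parameter vector, updated in place
    Returns the log-likelihood of the path measured before the update.
    """
    dim = len(context_vecs[0])
    # Mapping layer: x_w is the sum of the context word vectors.
    x_w = [sum(vec[i] for vec in context_vecs) for i in range(dim)]
    neu1e = [0.0] * dim  # error accumulation vector
    log_likelihood = 0.0
    for node, d in zip(path_nodes, path_codes):
        node_theta = theta[node]
        q = sigmoid(sum(a * b for a, b in zip(x_w, node_theta)))
        log_likelihood += math.log(q) if d == 1 else math.log(1.0 - q)
        g = lr * (d - q)  # gradient of the log-likelihood, scaled by the learning rate
        for i in range(dim):
            neu1e[i] += g * node_theta[i]  # accumulate error for the word vectors
            node_theta[i] += g * x_w[i]    # update the parameter of this non-leaf node
    # After k classifications, propagate the accumulated error to all 2c word vectors.
    for vec in context_vecs:
        for i in range(dim):
            vec[i] += neu1e[i]
    return log_likelihood


# Toy example: two context words, a path with two non-leaf nodes.
context = [[0.1, -0.2], [0.05, 0.3]]
params = {0: [0.0, 0.0], 1: [0.0, 0.0]}
first = cbow_hs_step(context, [0, 1], [1, 0], params)
for _ in range(50):
    last = cbow_hs_step(context, [0, 1], [1, 0], params)
print(first, last)
```

Repeating the step on the same example raises the path log-likelihood, which is the behaviour the maximum likelihood objective asks for.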
(2) Skip-gram model
The main idea of the Skip-gram model is to predict context information using the current word. The network structure is shown in fig. 7.
For comparison with CBOW, a mapping layer is drawn in fig. 7, but in fact the Skip-gram model has no mapping layer. The Skip-gram model therefore consists of two layers, an input layer and an output layer. The input layer is the word vector v(w) of the current word, and the output layer corresponds to a Huffman tree, as in the CBOW model. The difference from CBOW is that after v(w) passes through the k classifiers starting from the root node, the information held in the error accumulation vector neu1e is used to update the word vector v(w) of the current word itself.
The objective function of the Skip-gram model differs from that of the CBOW model:

$$L = \prod_{u \in \mathrm{Context}(w)} \prod_{j=1}^{k_u} p(d_j^{u} \mid v(w), \theta_j^{u})$$

Here θ_j^u denotes the undetermined parameter stored in the j-th non-leaf node on the path to u; the model is trained once these parameters are determined. The conditional probability p(d_j^u | v(w), θ_j^u) is the probability of the classification result d_j^u given the vector v(w) and the parameter θ_j^u, where d_j^u = 1 represents a positive example and d_j^u = 0 represents a negative example, u is a word in the context of the current word w, and k_u is the number of non-leaf nodes on the path to u. Because the Skip-gram model predicts the context given the current word w, and the context contains several words, each training step must, unlike in the CBOW model, predict win paths (win is the number of words in Context(w)), each running from the root node to the leaf node of one word in Context(w). Once it is determined which context word u is being predicted, the path is uniquely determined.
For example, if the context is w(t-2), w(t-1), w(t+1), w(t+2), the Skip-gram model must compute p(w(t-2) | w(t)), p(w(t-1) | w(t)), p(w(t+1) | w(t)), and p(w(t+2) | w(t)) separately, and after each computation the values in the error accumulation vector neu1e are used to update the current word vector v(w).
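The per-context-word path probabilities described above can be sketched as follows. This is an illustrative sketch only: the four-word vocabulary, the tree layout, the parameter values, and the names `path_probability` and `skipgram_log_likelihood` are all hypothetical, and the labels follow the convention in the text that 1 is a positive example and 0 a negative example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def path_probability(v_w, path, theta):
    """p(u | w): product of the binary decisions on the root-to-leaf path of u.

    path is a list of (node_index, d) pairs, where d = 1 marks a positive
    example and d = 0 a negative example at that non-leaf node.
    """
    p = 1.0
    for node, d in path:
        q = sigmoid(sum(a * b for a, b in zip(v_w, theta[node])))
        p *= q if d == 1 else 1.0 - q
    return p


def skipgram_log_likelihood(v_w, context_words, paths, theta):
    """Sum of log p(u | w) over every context word u, one path per context word."""
    return sum(math.log(path_probability(v_w, paths[u], theta)) for u in context_words)


# Hypothetical vocabulary arranged as a complete binary tree:
# node 0 is the root, node 1 its positive child, node 2 its negative child.
paths = {
    "server": [(0, 1), (1, 1)],
    "restart": [(0, 1), (1, 0)],
    "fault": [(0, 0), (2, 1)],
    "repair": [(0, 0), (2, 0)],
}
theta = {0: [0.5, -0.3], 1: [0.2, 0.1], 2: [-0.4, 0.6]}
v_w = [1.0, 2.0]  # word vector of the current word

total = sum(path_probability(v_w, path, theta) for path in paths.values())
score = skipgram_log_likelihood(v_w, ["server", "fault"], paths, theta)
print(total, score)
```

Because every non-leaf node splits its probability mass between two branches, the path probabilities of all leaves sum to 1, which is why a Huffman tree can stand in for a full softmax over the vocabulary.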
word2vec implements the CBOW and Skip-gram models based not only on Hierarchical Softmax but also on Negative Sampling. In addition, word2vec applies dedicated handling to low-frequency words and high-frequency words and adjusts the learning rate, which yields good results.
After words are represented as vectors, different weights must be assigned to them, because each word contributes differently to an entity. TF-IDF, the method used to compute the weight of a term in a vector, is the product of TF (term frequency) and IDF (inverse document frequency):

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

The term frequency (TF) is the number of occurrences of a feature word divided by the total number of words in the article:

$$\text{TF}(w) = \frac{\text{number of occurrences of } w \text{ in the article}}{\text{total number of words in the article}}$$

The inverse document frequency (IDF) is the logarithm of the number of all documents divided by the number of documents containing the term:

$$\text{IDF}(w) = \log\frac{|D|}{|\{d \in D : w \in d\}|}$$

where |D| is the number of all documents and |{d ∈ D : w ∈ d}| is the number of documents containing the term w.
Because words such as "is" and "this" occur very frequently, the IDF value is needed to reduce their weight. Dimension reduction, which in document similarity calculation means reducing the number of words, serves the same purpose. The words typically removed are function words and stop words (such as "the"); in many cases this strategy improves not only efficiency but also precision.
The larger the final TF-IDF weight, the more important the term is to the text.
A set of similar articles is therefore required to calculate the IDF value. The weights of the n keywords in a similar article D = (w1, w2, ..., wn) are calculated in turn. Given an article E, the same method yields E = (q1, q2, ..., qn), and the similarity between D and E is then calculated.
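The TF-IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the unsmoothed definitions given in the text (term count divided by document length, and the logarithm of the document count divided by the number of documents containing the term); the function name `tf_idf` and the sample token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for already segmented documents.

    documents: list of token lists, one per document.
    Returns one dict per document mapping term -> weight.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical segmented work orders.
orders = [
    ["server", "restart", "server"],
    ["server", "fault"],
    ["database", "fault", "repair"],
]
for row in tf_idf(orders):
    print(row)
```

A term that appears in every document receives an IDF of log(1) = 0, which is how very common words lose their weight under this scheme.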
The similarity between documents is described by the cosine of the angle between the two vectors. The similarity of the texts D1 and D2 is:

$$\mathrm{sim}(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$$

where the numerator is the dot product of the two vectors and the denominator is the product of their moduli.
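The cosine similarity just described (dot product divided by the product of the moduli) can be computed directly. The sketch below is illustrative; the function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions, not part of the embodiment.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: a document with no weighted terms matches nothing
    return dot / (norm1 * norm2)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Vectors that point in the same direction score close to 1 regardless of their length, so a long and a short article with the same keyword profile are still judged similar.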
Using the word2vec tool, the corpus text is input into the CBOW and Skip-gram models, which convert the text content into vectors in a vector space. The similarity between the vectors is then calculated, and vectors with high similarity are clustered into topic clusters of different granularities, which represent the degree of semantic similarity between texts.
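The text does not name a particular clustering algorithm, so the following is only one possible sketch: a single-pass, threshold-based grouping in which the similarity threshold controls the granularity of the resulting topic clusters. The function names, the toy two-dimensional vectors, and the thresholds are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_by_similarity(vectors, threshold):
    """Assign each item to the first cluster whose seed vector is similar enough.

    vectors:   dict mapping a word (or text id) to its vector
    threshold: minimum cosine similarity to the cluster seed; a lower value
               gives coarser clusters, a higher value finer ones
    """
    clusters = []  # each cluster is a list of keys; the first key is the seed
    for key, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(key)
                break
        else:
            clusters.append([key])  # no similar cluster found: start a new one
    return clusters


# Hypothetical word vectors.
word_vectors = {
    "server": [1.0, 0.1],
    "host": [0.9, 0.2],
    "fault": [0.1, 1.0],
    "failure": [0.2, 0.9],
}
fine = cluster_by_similarity(word_vectors, threshold=0.9)
coarse = cluster_by_similarity(word_vectors, threshold=0.1)
print(fine)
print(coarse)
```

Running the same data at two thresholds shows how one set of vectors can yield topic clusters of different granularities.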
The present invention is not limited to the preferred embodiment described above; other knowledge-graph-based methods for constructing a power information operation and maintenance system database can be derived from the teaching of the present invention.
Claims (4)
1. A power information operation and maintenance system database construction method based on a knowledge graph, characterized by comprising the following steps:
step 1: acquiring power system work order data, converting the work order data into a text format, constructing an ontology by adopting the seven-step method, and dividing the ontology into a plurality of text fields according to text meaning attributes;
step 2: taking the work order as a unit, performing word segmentation on the work order text data;
step 3: grouping the text fields;
step 4: performing field-level word segmentation on each text field, segmenting the content of each group with a word segmentation method based on character string matching;
step 5: filtering out invalid words and sensitive words according to the invalid word list;
step 6: comparing the valid vocabulary with the vocabulary list in the knowledge base, adding new words to the vocabulary list of the knowledge base, and accumulating the occurrence frequency of existing words;
step 7: extracting entity relations for the added vocabulary: extracting entity relations based on the entities by predefining entity relation types and features, processing word feature vectors with word2vec, calculating the similarity between word vectors, and classifying the entity relations according to the similarity;
step 8: importing the entities and the classified entity relations into a Neo4j graph database.
2. The power information operation and maintenance system database construction method based on the knowledge graph according to claim 1, characterized in that: in step 2 and step 4, the ICTCLAS system of the Chinese Academy of Sciences is used for word segmentation;
all the segmented words form a word table D = {d_1, d_2, ..., d_n}, where d_i represents a word and i ∈ [1, n]; the word feature vector of each word E is represented as V = {v_1, v_2, ..., v_n}, where v_i indicates whether the word corresponds to d_i in the word table D, and v_i is calculated as follows:

$$v_i = \begin{cases} 1, & E = d_i \\ 0, & E \neq d_i \end{cases}$$
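The word table and indicator-style word feature vector described in claim 2 could be sketched as below. This is a hedged illustration that assumes each component is 1 when the word equals the corresponding table entry and 0 otherwise; the function names and the sample words are hypothetical.

```python
def build_word_table(segmented_orders):
    """Collect the distinct words of all segmented work orders in first-seen order."""
    table = []
    seen = set()
    for words in segmented_orders:
        for word in words:
            if word not in seen:
                seen.add(word)
                table.append(word)
    return table


def word_feature_vector(word, table):
    """Indicator vector: component i is 1 if the word equals table[i], else 0."""
    return [1 if word == entry else 0 for entry in table]


# Hypothetical segmented work orders.
table = build_word_table([["server", "restart"], ["server", "fault"]])
print(table)
print(word_feature_vector("fault", table))
```

According to claim 3, part-of-speech features are built in the same way, so the same two functions could be applied to a table of part-of-speech tags.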
3. The power information operation and maintenance system database construction method based on the knowledge graph according to claim 2, characterized in that: in step 2 and step 4, the word segmentation further comprises constructing part-of-speech features, which are constructed in the same way as the word features.
4. The power information operation and maintenance system database construction method based on the knowledge graph according to claim 1, characterized in that: the unstructured text content of the power system work order data is acquired with the urllib2 package of Python; the collected content is parsed with the Beautiful Soup package; and word segmentation is performed with the Rwordseg package in the R environment.
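Steps 4 to 6 of claim 1 (string-matching segmentation, invalid-word filtering, and vocabulary accumulation) can be illustrated with a short sketch. The claims do not specify which string-matching algorithm is used; forward maximum matching is assumed here as one common choice, and the dictionary, invalid word list, and function names are hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=5):
    """Segment text by repeatedly taking the longest dictionary word at the cursor.

    A single character is emitted when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def filter_invalid(words, invalid_words):
    """Step 5: drop every word that appears in the invalid word list."""
    return [word for word in words if word not in invalid_words]


def update_vocabulary(vocabulary, words):
    """Step 6: add new words to the vocabulary and accumulate occurrence counts.

    Returns the words that were not in the vocabulary before this call.
    """
    new_words = []
    for word in words:
        if word not in vocabulary:
            new_words.append(word)
        vocabulary[word] += 1
    return new_words


# Hypothetical data: "虚拟服务器" = virtual server, "故障" = fault, "的" = a particle.
dictionary = {"虚拟", "服务器", "虚拟服务器", "故障"}
invalid = {"的"}
vocabulary = Counter({"故障": 3})

segmented = forward_max_match("虚拟服务器的故障", dictionary)
valid = filter_invalid(segmented, invalid)
added = update_vocabulary(vocabulary, valid)
print(segmented, valid, added, vocabulary["故障"])
```

Preferring the longest match keeps the compound term "虚拟服务器" intact instead of splitting it into "虚拟" and "服务器", which matters when the compound is itself an entity in the knowledge base.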
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810762686.6A CN109033284A (en) | 2018-07-12 | 2018-07-12 | The power information operational system database construction method of knowledge based map |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033284A true CN109033284A (en) | 2018-12-18 |
Family
ID=64640896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810762686.6A Pending CN109033284A (en) | 2018-07-12 | 2018-07-12 | The power information operational system database construction method of knowledge based map |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033284A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106600298A (en) * | 2016-12-23 | 2017-04-26 | 国网山东省电力公司信息通信公司 | Electric power information system customer service knowledge base construction method based on work order data analysis |
CN107609052A (en) * | 2017-08-23 | 2018-01-19 | 中国科学院软件研究所 | A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle |
CN108460136A (en) * | 2018-03-08 | 2018-08-28 | 国网福建省电力有限公司 | Electric power O&M information knowledge map construction method |
Non-Patent Citations (1)
Title |
---|
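Wang Renwu et al.: "Exploratory research on constructing a Chinese business knowledge graph based on deep learning and graph databases", Library and Information (图书与情报) * |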
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |