CN109033284A - Knowledge-graph-based method for constructing a power information operation and maintenance system database - Google Patents


Info

Publication number
CN109033284A
Authority
CN
China
Prior art keywords
word
knowledge
vocabulary
power information
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810762686.6A
Other languages
Chinese (zh)
Inventor
陈倩
吴飞
罗富财
李霆
杨启航
林伟
刘心
魏煜
温丽清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Fujian Electric Power Co Ltd
Fujian Yirong Information Technology Co Ltd
Original Assignee
State Grid Fujian Electric Power Co Ltd
Fujian Yirong Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Fujian Electric Power Co Ltd, Fujian Yirong Information Technology Co Ltd filed Critical State Grid Fujian Electric Power Co Ltd
Priority to CN201810762686.6A priority Critical patent/CN109033284A/en
Publication of CN109033284A publication Critical patent/CN109033284A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/20Administration of product repair or maintenance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/70Smart grids as climate change mitigation technology in the energy generation sector
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Primary Health Care (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a knowledge-graph-based method for constructing a database for a power information operation and maintenance (O&M) system. Power system work order data are converted to text and segmented into words, the extracted entity relationships are classified by similarity, and the results are stored as a graph database. The invention makes more effective use of electric power data, brings out the value hidden in the data, and raises the level of intelligent applications in the power grid. By modeling power information data, it replaces the earlier low-dimensional, mutually isolated presentation of information and shows the associations between items of electric power knowledge more intuitively. On this basis, a knowledge base for the power information O&M system can be built that converts historical O&M data into knowledge, lets users consult and resolve problems online by themselves, enables intelligent question answering, and improves O&M efficiency. This is of considerable value in power grid maintenance work, and the invention also completes the domain knowledge model for power information O&M together with an optimized storage technique.

Description

Knowledge-graph-based method for constructing a power information operation and maintenance system database
Technical field
The invention belongs to the field of electric power operation and maintenance (O&M) management, and in particular relates to a knowledge-graph-based method for constructing a power information O&M system database.
Background art
Every day, the current electric power O&M customer service system faces a massive volume of requests from customers on all sides. Enterprise and individual users use channels such as WeChat official accounts, Weibo, the Palm Electric Power app, and the 95598 intelligent interactive website for business inquiries, fault repair reports, complaints, tip-offs, suggestions, comments and praise, and for subscribing to grid information and policies. This places higher demands on service delivery: customer requests must be understood promptly and accurately and met in the shortest time with the best service, while customers also expect the services provided to be friendlier, more convenient, and more efficient to use. Customer service quality and efficiency therefore have to be raised by continually strengthening information support services and improving the customer service system. In the Internet era, however, a traditional call-center customer service system requires a great deal of manpower and material resources.
In 2015 the State Grid Information and Communication Department issued a notice on raising the operation and management level of information systems, setting out practical requirements for doing so. Information systems are to be maintained under the model of "two-level dispatching, three-layer maintenance, integrated operation, unified customer service, and a three-line information technology support center", and a large volume of work order data has accumulated as a result. As the public's requirements for the reliability of power production keep rising, State Grid has also set higher standards for power supply quality. How to keep improving the professional competence of power information O&M personnel, how to handle information system O&M problems in production quickly and accurately, and how to find and locate defects quickly in daily production so that faults are prevented before they occur are needs shared by power production engineers and managers.
Building a common knowledge model for the power industry involves many technical difficulties: analysis of O&M knowledge is weak; the exploitation of data resources remains at a coarse stage overall; data resources have not been converted efficiently into knowledge assets; no virtuous cycle of knowledge has formed; and the ability to support business needs and to drive management change is weak. All of these hinder the organization and construction of an electric power O&M knowledge model. In particular, the company's "one order, two tickets" system holds a large amount of usable data that can be mined and exploited effectively, providing the basic data and theoretical basis for building the O&M knowledge model.
Summary of the invention
In view of the problems and gaps in the prior art, the invention adopts the following technical solution:
A knowledge-graph-based method for constructing a power information O&M system database, characterized by comprising the following steps:
Step 1: collect power system work order data and convert them to text format; build the ontology using the seven-step method; and divide the text into multiple text fields according to the semantic attributes of the text;
Step 2: perform word segmentation on the work order text data, taking each work order as a unit;
Step 3: group the text fields;
Step 4: perform field-level word segmentation on each text field, splitting the content of each group into words using a segmentation method based on string matching;
Step 5: filter against an invalid-word list, removing invalid words and sensitive words;
Step 6: compare the valid words with the vocabulary in the knowledge base; add new words to the knowledge base vocabulary list, and for words that already exist, accumulate their frequency of occurrence;
Step 7: extract the entity relationships of the added words: extract entity relationships based on predefined entity relationship types and the features of the entities, process the word feature vectors with word2vec, compute the similarity between word vectors, and classify the entity relationships according to that similarity;
Step 8: import the entities and the classified entity relationships into a Neo4j graph database.
Preferably, in step 2 and step 4, word segmentation uses the ICTCLAS system of the Chinese Academy of Sciences;
All words obtained after segmentation form a word list D = {d_1, d_2, ..., d_n}, where d_i denotes a word and i ∈ [1, n]. The word feature vector of each word E is expressed as V = {v_1, v_2, ..., v_n}, where v_i indicates whether the word corresponds to d_i in the word list D. v_i is calculated as follows:
\[ v_i = \begin{cases} 1, & E = d_i \\ 0, & E \neq d_i \end{cases} \]
Preferably, in step 2 and step 4, word segmentation further includes the construction of part-of-speech features, which are built in the same way as the word features.
Preferably, the urllib2 package of Python is used to collect the unstructured text content of the power system work order data; the BeautifulSoup package is used to parse the collected content; and the Rwordseg package in the R environment is used for word segmentation.
Compared with the prior art, the present invention introduces deep learning algorithms into the construction of a power domain knowledge model and uses two major machine learning tasks, named entity recognition and entity relation extraction, to solve the two problems of knowledge unit extraction and knowledge unit relation extraction. In addition, the invention brings a graph database into the system for building the knowledge model: knowledge units are stored and displayed with the graph database, which provides a new way of drawing the knowledge model.
The present invention makes more effective use of electric power data, brings out the value hidden in the data, and raises the level of intelligent applications in the power grid. Knowledge representation based on the knowledge model replaces the earlier low-dimensional, mutually isolated presentation of information and shows the associations between items of electric power knowledge more intuitively. Building the power information O&M system knowledge base converts historical O&M data into knowledge, lets users consult and resolve problems online by themselves, enables intelligent question answering, and improves O&M efficiency. It is of considerable value in power grid maintenance work, and on this basis the domain knowledge model for power information O&M and an optimized storage technique are completed.
The knowledge model built with the present invention addresses problems such as repetitive manual answering and staff shortages: it can quickly and accurately identify the intent of the work order applicant and answer the user automatically, substantially improving O&M efficiency.
Description of the drawings
The present invention is described in more detail below with reference to the accompanying drawings and specific embodiments:
Fig. 1 is a schematic diagram of the main flow of the method of an embodiment of the present invention;
Fig. 2 is a flow diagram of knowledge model construction based on deep learning algorithms;
Fig. 3 is a schematic diagram of the implementation and application of the embodiment of the present invention in the Neo4j graph database;
Fig. 4 is a flow diagram of the seven-step method in the embodiment of the present invention;
Fig. 5 is a schematic diagram of the two models used by word2vec in the embodiment of the present invention;
Fig. 6 is a schematic diagram of the CBOW model in the embodiment of the present invention;
Fig. 7 is a schematic diagram of the Skip-gram model in the embodiment of the present invention.
Specific embodiments
To make the features and advantages of this patent clearer and easier to understand, an embodiment is described in detail below:
As shown in Fig. 1, the present embodiment comprises the following steps:
Step 1: collect power system work order data and convert them to text format; build the ontology using the seven-step method; and divide the text into multiple text fields according to the semantic attributes of the text;
Step 2: perform word segmentation on the work order text data, taking each work order as a unit;
Step 3: group the text fields;
Step 4: perform field-level word segmentation on each text field, splitting the content of each group into words using a segmentation method based on string matching;
Step 5: filter against an invalid-word list, removing invalid words and sensitive words;
Step 6: compare the valid words with the vocabulary in the knowledge base; add new words to the knowledge base vocabulary list, and for words that already exist, accumulate their frequency of occurrence;
Step 7: extract the entity relationships of the added words: extract entity relationships based on predefined entity relationship types and the features of the entities, process the word feature vectors with word2vec, compute the similarity between word vectors, and classify the entity relationships according to that similarity;
Step 8: import the entities and the classified entity relationships into a Neo4j graph database.
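Steps 5 and 6 can be illustrated with a minimal Python sketch. The function name, the word lists, and the use of a Counter as the knowledge base vocabulary are assumptions made for illustration; the patent does not specify how the vocabulary is stored.

```python
from collections import Counter

def update_vocabulary(tokens, invalid_words, sensitive_words, knowledge_vocab):
    """Filter tokens (step 5) and merge them into the vocabulary (step 6).

    knowledge_vocab is a Counter mapping word -> accumulated frequency
    (a hypothetical stand-in for the knowledge base vocabulary list).
    Returns the words that were newly added.
    """
    blocked = set(invalid_words) | set(sensitive_words)
    new_terms = []
    for term in tokens:
        if not term or term in blocked:
            continue  # step 5: drop invalid and sensitive words
        if term not in knowledge_vocab:
            new_terms.append(term)  # step 6: a new word joins the vocabulary
        knowledge_vocab[term] += 1  # step 6: accumulate the frequency
    return new_terms

# Illustrative data only.
vocab = Counter({"server": 3})
added = update_vocabulary(
    ["server", "the", "restart", "restart", "secret"],
    invalid_words={"the"},
    sensitive_words={"secret"},
    knowledge_vocab=vocab,
)
```

Here "restart" is added as a new word with frequency 2, while the count for the existing word "server" rises from 3 to 4.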
Fig. 2 shows the conventional flow of knowledge model construction based on deep learning algorithms. The main innovations of the present embodiment relative to that flow are a new word segmentation strategy and a solution to the feature selection problem in entity extraction.
In step 2 and step 4, word segmentation uses the ICTCLAS system of the Chinese Academy of Sciences;
All words obtained after segmentation form a word list D = {d_1, d_2, ..., d_n}, where d_i denotes a word and i ∈ [1, n]. The word feature vector of each word E is expressed as V = {v_1, v_2, ..., v_n}, where v_i indicates whether the word corresponds to d_i in the word list D. v_i is calculated as follows:
\[ v_i = \begin{cases} 1, & E = d_i \\ 0, & E \neq d_i \end{cases} \]
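Read this way, the word feature vector is an indicator (one-hot) vector over the word list D. A minimal Python sketch, with a hypothetical three-word list:

```python
def word_feature_vector(word, word_list):
    """Return V where V[i] is 1 if word equals word_list[i], otherwise 0."""
    return [1 if word == d else 0 for d in word_list]

# Hypothetical word list D built from segmented work orders.
D = ["server", "restart", "transformer"]
V = word_feature_vector("restart", D)
```

A word that does not appear in D yields an all-zero vector under this reading.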
In step 2 and step 4, word segmentation further includes the construction of part-of-speech features, which are built in the same way as the word features.
In step 7, the word feature vectors are processed with word2vec, the similarity between word vectors is computed, and the entity relationships are classified according to that similarity. Forming the pattern set in this way greatly reduces manual involvement and makes feature extraction simpler and more effective.
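The similarity-based classification in step 7 might look like the following sketch. The patent does not name the similarity measure or a decision rule, so cosine similarity, the threshold of 0.8, the prototype vectors, and the relation labels are all assumptions chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_relation(vector, prototypes, threshold=0.8):
    """Return the label of the most similar prototype, or None if no
    prototype reaches the (assumed) similarity threshold."""
    best_label, best_score = None, threshold
    for label, prototype in prototypes.items():
        score = cosine_similarity(vector, prototype)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical prototype vectors, one per predefined relation type.
prototypes = {
    "located_in": [1.0, 0.1, 0.0],
    "operates": [0.0, 0.2, 1.0],
}
label = classify_relation([0.9, 0.2, 0.1], prototypes)
```

In practice the vectors would come from a trained word2vec model rather than being written by hand.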
For the power system work order data, the urllib2 package of Python is used to collect the unstructured text content; the BeautifulSoup package is used to parse the collected content; and the Rwordseg package in the R environment is used for word segmentation.
For the Neo4j graph database, the stored elements are as follows:
1) Entities. Each entity or concept is identified by a globally unique ID, called its identifier. The most typical are the three general entity classes: person names, place names, and organization names. In the power information O&M field there are, besides the general entities, richer entities such as servers, O&M systems, the power grid, transformers, substations, transmission lines, the distribution network, and the main network.
2) Attribute-value pairs. Each attribute-value pair describes an intrinsic characteristic of an entity. Attribute value types include: numeric (such as work order number and telephone number), enumerated (such as department and unit), short text (such as the work order event title), and long text (such as the detailed description of the work order event).
3) Relationships. A relationship connects two entities and describes the association between them. A typical relation extraction method iterates the process of "template generation, then instance extraction" until convergence. For example, the template "X is the headquarters address of Y" can first be used to extract the triple instance (State Grid, headquarters address, Beijing); the entity pair "State Grid - Beijing" in that triple can then be used to discover further matching templates, such as "the headquarters address of Y is X" and "X is the center of Y"; the newly discovered templates are in turn used to extract more new triple instances, and new instances and templates keep being extracted through repeated iteration. Relationships between entities can also be extracted by recognizing phrases that express semantic relations. For example, syntactic analysis of text can find the following relationships between "State Grid" and "Beijing": (State Grid, headquarters located in, Beijing), (State Grid, headquarters established in, Beijing), and (State Grid, headquarters built in, Beijing). Relationships between entities extracted in this way are rich and free in form, and are usually phrases with a verb at the core.
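The iterative "template, then instance" loop described in 3) can be sketched as follows. This is a toy illustration under stated assumptions: the English sentences, the <X>/<Y> slot notation, and the exact-string template induction are invented for the example and are much simpler than a production relation extractor.

```python
import re

SLOTS = {"<X>": r"(?P<x>.+?)", "<Y>": r"(?P<y>.+?)"}

def template_to_regex(template):
    """Turn a template such as '<X> is the capital of <Y>' into a regex."""
    parts = re.split(r"(<X>|<Y>)", template)
    body = "".join(SLOTS.get(p, re.escape(p)) for p in parts)
    return re.compile("^" + body + "$")

def extract_pairs(templates, sentences):
    """Apply every template to every sentence and collect (x, y) pairs."""
    pairs = set()
    for template in templates:
        regex = template_to_regex(template)
        for sentence in sentences:
            match = regex.match(sentence)
            if match:
                pairs.add((match.group("x"), match.group("y")))
    return pairs

def induce_templates(pairs, sentences):
    """Generalize sentences that mention a known pair into new templates."""
    templates = set()
    for x, y in pairs:
        for sentence in sentences:
            if x in sentence and y in sentence:
                templates.add(sentence.replace(x, "<X>").replace(y, "<Y>"))
    return templates

def bootstrap(seed_templates, sentences, max_rounds=10):
    """Alternate extraction and template induction until nothing changes."""
    templates, pairs = set(seed_templates), set()
    for _ in range(max_rounds):
        new_pairs = extract_pairs(templates, sentences)
        new_templates = templates | induce_templates(new_pairs, sentences)
        if new_pairs == pairs and new_templates == templates:
            break  # converged
        pairs, templates = new_pairs, new_templates
    return pairs, templates

# Illustrative corpus; the second organization and city are made up.
sentences = [
    "Beijing is the headquarters address of State Grid",
    "The headquarters of State Grid is located in Beijing",
    "The headquarters of Fujian Electric Power is located in Fuzhou",
]
pairs, templates = bootstrap(["<X> is the headquarters address of <Y>"], sentences)
```

Starting from one seed template, the loop learns a second template from the "State Grid - Beijing" pair and uses it to extract a pair the seed alone would have missed.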
In the present embodiment, the power system work order data specifically refers to the "one order, two tickets", namely the maintenance plan work order, the work ticket, and the operation ticket. The unstructured text data of the "one order, two tickets" are collected, and a domain knowledge model for electric power O&M is built from the power entities in these data, the relationships between the power entities, and the attributes and attribute values of each power entity, so that people outside the field can use the knowledge model to understand the field's current state and development trends. The collected text data are segmented and part-of-speech tagged; power O&M entities are identified by a named entity recognition method; the relationships between entities are then extracted by an entity relation extraction method; the recognized data are stored in the graph database Neo4j; the knowledge model is represented with a property graph model; and graph visualization is applied to the data stored in the graph database to display the entity relationships of the knowledge model.
The graph database used in the present embodiment is Neo4j.
Neo4j supports several ways of importing data: data can be entered manually or imported directly from CSV files, and Neo4j also supports importing data from mainstream relational databases. The present embodiment imports data into Neo4j by CSV batch import. Once the data have been batch-imported into Neo4j from CSV, the data stored in the database can be queried with Cypher and displayed graphically.
As shown in Fig. 3, specifically:
(1) Data loading: the extracted named entities and entity relationships are imported into the graph database by batch import.
(2) Querying all nodes and relationships with the Cypher query language gives an overall view of the entire knowledge graph.
(3) Retrieving the required node and relationship information with Cypher makes it possible to provide users with personalized knowledge services.
(4) The REST API of Neo4j can be called programmatically to develop further knowledge graph interfaces.
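As a rough sketch of the CSV batch import and the Cypher queries in (1) to (3), the Python below writes entity and relationship rows to CSV text and lists Cypher statements that could load and query them. The file names, column names, the Entity label, and the direction of the relatedBody relationship are assumptions; the patent only states that CSV batch import and Cypher are used.

```python
import csv
import io

def to_csv(rows, header):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Example entities and relationship taken from the embodiment's own example.
entities = [("virtual server", "operation object"), ("Zhang Jianhui", "ticket issuer")]
relations = [("Zhang Jianhui", "relatedBody", "virtual server")]

entities_csv = to_csv(entities, ["name", "type"])
relations_csv = to_csv(relations, ["source", "relation", "target"])

# Cypher statements, assuming the CSV text above is saved under Neo4j's
# import directory as entities.csv and relations.csv (hypothetical names).
LOAD_ENTITIES = """
LOAD CSV WITH HEADERS FROM 'file:///entities.csv' AS row
MERGE (e:Entity {name: row.name})
SET e.type = row.type
"""
LOAD_RELATIONS = """
LOAD CSV WITH HEADERS FROM 'file:///relations.csv' AS row
MATCH (a:Entity {name: row.source}), (b:Entity {name: row.target})
MERGE (a)-[:relatedBody]->(b)
"""
# (2) overall view of the graph; (3) a targeted lookup by entity type.
QUERY_ALL = "MATCH (n)-[r]->(m) RETURN n, r, m"
QUERY_BY_TYPE = "MATCH (e:Entity {type: 'operation object'}) RETURN e.name"
```

The relationship type is fixed in the statement because plain Cypher does not take a relationship type from a CSV column; a file per relationship type is one simple way around that.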
Specifically, in the present embodiment, deep learning algorithms are computationally complex: feature dimensions often reach the tens of thousands, which places a heavy resource burden on the computer. Because deep learning algorithms also retain good representational ability on limited sample sizes, a limited number of samples is selected, and for the power system work order data the urllib2 package of Python is used to collect the unstructured text content; the BeautifulSoup package is used to parse the collected content.
A marked difference between Chinese and English in natural language processing is that English words can be split on spaces, whereas word boundaries in Chinese are not explicit, so Chinese text must first be segmented into words. In the present embodiment, segmentation uses the Rwordseg package in the R environment. Rwordseg is written on the basis of the ICTCLAS Chinese word segmentation algorithm of the Chinese Academy of Sciences; it provides word segmentation, multi-level part-of-speech tagging, and keyword extraction, and can also import custom dictionaries for segmentation.
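To show why a dictionary matters for Chinese segmentation, here is a minimal forward maximum matching segmenter in Python. It is a simple string-matching stand-in, not the ICTCLAS algorithm or the Rwordseg implementation, and the dictionary entries are illustrative.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical domain dictionary containing the compound "虚拟服务器".
domain_dict = {"虚拟", "服务器", "虚拟服务器", "重启"}
tokens = forward_max_match("重启虚拟服务器", domain_dict)
```

With the compound term in the dictionary the text is split into "重启" and "虚拟服务器"; without it, the same text is split into three shorter words, which is the kind of error a domain dictionary is meant to prevent.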
Because the experimental data come from the "one order, two tickets" unstructured text data of the State Grid Fujian Electric Power Co., Ltd. unified authentication service single sign-on platform, a dictionary for the field also has to be added when segmenting with the Rwordseg package in order to improve segmentation accuracy. The power industry is a comparatively closed field and no complete domain dictionary exists at present, so the dictionary has to be built by hand. It is built by combining machine collection with manual collection: a program first collects the entries in the "one order, two tickets" data automatically, and manual collection is then used to expand the dictionary further.
The dictionary is built as follows:
(1) The entries of the "one order, two tickets" are annotated and analyzed, and multiple labels such as "issuer", "issue time", and "issue content" are selected.
(2) A crawler written in Python crawls the entries under the labels determined in (1), and the collected entries are deduplicated.
(3) The entries collected are those on the label pages of the "one order, two tickets", and a label page lists at most some of the entries; entries that are not listed cannot be collected directly by the program. For this reason, the experiment supplements the uncollected entries by manually inspecting the content of the data set.
The self-built dictionary can be installed with installDict in the Rwordseg package, after which the data set is segmented. The collected text data are segmented, and the resulting words are then used for knowledge entity recognition. The content obtained after segmentation is stored together with the self-built dictionary so that the two can be combined in further processing.
In the present embodiment, the coreference resolution problem is solved by building a dictionary and matching concepts: before the knowledge model is built, the dictionary is used to map different expressions of the same concept to the same entity. For example, for the entity "State Grid", the experiment automatically recognizes expressions such as "SGCC", "State Grid", and "State Grid Corporation of China" and converts them all into the single expression "State Grid".
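The dictionary-based mapping of different expressions to one entity can be sketched in a few lines of Python. The alias table and the case-insensitive matching are assumptions made for the example; the patent does not describe the dictionary's format.

```python
def build_alias_index(alias_dict):
    """Map every known expression (case-insensitively) to its canonical entity."""
    index = {}
    for canonical, aliases in alias_dict.items():
        for name in [canonical, *aliases]:
            index[name.casefold()] = canonical
    return index

def normalize_entities(tokens, index):
    """Replace each token that is a known alias with its canonical entity."""
    return [index.get(token.casefold(), token) for token in tokens]

# Hypothetical alias dictionary for one entity.
ALIASES = {"State Grid": ["SGCC", "State Grid Corporation of China"]}
index = build_alias_index(ALIASES)
normalized = normalize_entities(
    ["SGCC", "transformer", "state grid corporation of china"], index
)
```

Tokens that are not in the alias dictionary, such as "transformer", pass through unchanged.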
The conceptual layer of the power domain knowledge model includes: issuer, time, application content, and operation object. For the concepts in the knowledge model and the entities under them, the present embodiment builds the power domain knowledge model top-down. In this approach, no top-level relation ontology is first built by hand; instead, relationships are represented directly in the property graph as attribute key-value pairs and entity pairs.
1) Attribute key-value pairs and labels in the property graph
For example, the "virtual server" entity belongs to the operation object class, so a type attribute with the value "operation object" is added to the "virtual server" node in the property graph.
2) Entity-pair relationships
For example, the "virtual server" entity is related to the entity "Zhang Jianhui" under the "ticket issuer" concept: the ticket issuer performs some operation on the "virtual server". We therefore connect "virtual server" and "Zhang Jianhui" with an entity relationship, establishing a relatedBody relationship.
As shown in Fig. 4, the seven-step method used in the present embodiment was developed by the Stanford University School of Medicine and is mainly used for building domain ontologies. Its steps and rationale are as follows:
1) Determine the scope: developing a domain ontology is not an end in itself; the ontology models a particular domain for a particular purpose. It is therefore necessary to determine the domain the ontology covers, the purpose for which it will be used, and how it will be maintained;
2) Consider reuse: as ontologies have come into wide use, an ontology is seldom defined from scratch. It is therefore necessary to determine whether an existing ontology can be obtained from a third party as the starting point for one's own;
3) Enumerate terms: write an unstructured list of all the relevant terms expected to appear in the ontology; nouns form the basis of class names, and verbs or verb phrases form the basis of property names;
4) Define the classes: once the relevant terms have been determined, organize them into a hierarchy, either top-down or bottom-up, making sure the hierarchy consists of class-subclass relationships; the two approaches can also be combined to integrate a large number of concepts and relate them to one another;
5) Define the properties: this step is usually interleaved with the previous one. While the class hierarchy is being built, the properties that link the classes must be established at the same time. Because classes inherit, each property should be attached to the highest class to which it applies, and when a property is added to a class, its domain and range should be determined;
6) Define the facets: to make the property definitions more complete, the facets of each property are further defined. There are three kinds: cardinality, which specifies whether a certain number of distinct values is allowed or required; required values, where a class is defined by a particular value of some property, and that value may be drawn from a given class rather than being one specific value; and relational characteristics, i.e. symmetry, transitivity, inverse properties, and functional values;
7) Define the instances: an ontology is usually defined in order to organize a set of instances. Populating the ontology with these instances is a separate step, and the instance set is usually extracted automatically from a data source.
In the present embodiment, word feature vectors are processed and trained with the word2vec tool, which Google released as open source in 2013. Its main function is to map a word to a vector so that the vector represents the word; natural language processing problems can thereby be converted into problems of processing vectors. With word vectors, natural language processing tasks can be modeled more effectively. word2vec mainly uses two models: CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model).
As shown in Fig. 5, both models consist of three layers: an input layer, a mapping layer, and an output layer. w(t) denotes the current word, and w(t-2), w(t-1), w(t+1), w(t+2) denote the context of w(t). The CBOW model predicts the word w(t) given its context w(t-2), w(t-1), w(t+1), w(t+2); the Skip-gram model does the opposite, predicting the context w(t-2), w(t-1), w(t+1), w(t+2) given the current word w(t).
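The difference between the two models is easiest to see in the training examples each one consumes. The sketch below builds them for a window of c = 2 words on either side; the function names and the sample sentence are made up for illustration, and no actual training is performed.

```python
def context_window(tokens, t, c):
    """Up to c words on each side of position t, excluding the word itself."""
    return tokens[max(0, t - c):t] + tokens[t + 1:t + 1 + c]

def cbow_examples(tokens, c=2):
    """CBOW: (context words, target word) - predict w(t) from its context."""
    return [(context_window(tokens, t, c), tokens[t]) for t in range(len(tokens))]

def skipgram_examples(tokens, c=2):
    """Skip-gram: (current word, one context word) - predict context from w(t)."""
    return [
        (tokens[t], context_word)
        for t in range(len(tokens))
        for context_word in context_window(tokens, t, c)
    ]

sentence = ["restart", "the", "virtual", "server", "now"]
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)
```

For the middle word "virtual", CBOW produces one example whose input is the four surrounding words, while Skip-gram produces four examples, one per surrounding word.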
Take the CBOW and Skip-gram models based on Hierarchical Softmax as examples:
(1) CBOW model
The main idea of the CBOW model is to predict the word w(t) given the context of the current word w(t). The network structure of the CBOW model is shown in Fig. 6.
As Fig. 6 shows, the CBOW model consists of three parts: an input layer, a mapping layer, and an output layer. The input layer takes the word vectors of the 2c context words of the current word w(t). The mapping layer sums the 2c word vectors and stores the result in x_w. The output layer corresponds to a Huffman tree in which each leaf node represents a word in the corpus and each non-leaf node represents a binary classifier. Starting from the root node, x_w is classified by k binary classifiers (k being the number of non-leaf nodes on the path from the root node to the leaf node of the current word). Each time it passes a classifier, the error with respect to the correct branch is accumulated into the vector neu1e. After the k classifications, x_w has reached the leaf node of the current word, and the information saved in the error accumulation vector neu1e is then used to update each of the 2c word vectors.
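One training step of this procedure can be sketched in plain Python. This is a simplified illustration, not the word2vec source: the learning rate, the tiny two-dimensional vectors, and the convention that d = 1 marks the positive branch are assumptions, and the Huffman path is given directly as lists of node parameters and branch codes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cbow_step(context_vectors, thetas, codes, lr=0.1):
    """One CBOW update with hierarchical softmax along a single Huffman path.

    context_vectors: word vectors of the context words (updated in place)
    thetas: parameter vectors of the k non-leaf nodes on the path (in place)
    codes: branch labels d_j in {0, 1} for those nodes
    Returns the accumulated error vector (called neu1e in word2vec).
    """
    dim = len(context_vectors[0])
    # Mapping layer: sum the context word vectors into x_w.
    x_w = [sum(vec[i] for vec in context_vectors) for i in range(dim)]
    neu1e = [0.0] * dim
    for theta, d in zip(thetas, codes):
        g = lr * (d - sigmoid(dot(x_w, theta)))  # scaled prediction error
        for i in range(dim):
            neu1e[i] += g * theta[i]  # accumulate error for the word vectors
            theta[i] += g * x_w[i]    # update this node's classifier
    for vec in context_vectors:
        for i in range(dim):
            vec[i] += neu1e[i]        # push the accumulated error back
    return neu1e

# Tiny illustrative setup: 2 context words, a path with k = 2 nodes.
context = [[0.1, -0.2], [0.3, 0.1]]
thetas = [[0.2, 0.1], [-0.1, 0.3]]
codes = [1, 0]
error = cbow_step(context, thetas, codes)
```

After the step, both the node parameters and the context word vectors have moved in the direction that makes the observed branch decisions more likely.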
In the CBOW model, the key processing takes place in the output layer. Each non-leaf node of the Huffman tree corresponding to the output layer can be regarded as a binary classification problem, which can be solved with the logistic regression function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The logistic regression function takes values in (0, 1). With the threshold set to 0.5, values greater than 0.5 represent positive examples and values less than 0.5 represent negative examples, so the function handles binary classification problems well.
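The thresholding rule can be sketched in a few lines. The function names `sigmoid` and `classify` are illustrative, and treating a value of exactly 0.5 as a negative example is an assumption, since the text does not specify that case.

```python
import math


def sigmoid(x):
    """Logistic function; its values lie in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow in exp() for very negative x
    z = math.exp(x)
    return z / (1.0 + z)


def classify(x, threshold=0.5):
    """Return 1 (positive example) if sigmoid(x) exceeds the threshold, else 0 (negative example)."""
    return 1 if sigmoid(x) > threshold else 0
```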
For each word w(t) there is a unique path l from the root node to the leaf node of w(t). Assuming that the path l contains k non-leaf nodes, k binary classifications are required. To make the k binary classifications perform as well as possible, the objective function can be taken as the likelihood to be maximized, which in the standard Hierarchical Softmax form reads:
$$L = \prod_{j=1}^{k} p(d_j \mid x_w, \theta_j), \qquad p(d_j \mid x_w, \theta_j) = \left[\sigma(x_w^{\top}\theta_j)\right]^{d_j}\left[1 - \sigma(x_w^{\top}\theta_j)\right]^{1 - d_j}$$
Here θ denotes the parameters to be determined, whose values are stored in the non-leaf nodes (θ_j for the j-th non-leaf node on the path); once these parameters are determined, model training is complete. The conditional probability p(d_j | x_w, θ_j) is the probability of being classified as d_j given the vector x_w and the parameter θ_j, where d_j = 1 represents a positive example and d_j = 0 represents a negative example.
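One gradient-ascent step on this likelihood can be sketched as follows, under the convention stated above that d_j = 1 is the positive class. The function name `cbow_hs_step`, the list-based vectors, and the default learning rate are assumptions for illustration, and this is not word2vec's actual source code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cbow_hs_step(context_vecs, path_thetas, labels, lr=0.025):
    """One CBOW update along a single Huffman path.

    context_vecs: the 2c context word vectors (updated in place)
    path_thetas:  parameter vectors of the k non-leaf nodes (updated in place)
    labels:       d_j for each node on the path (1 = positive, 0 = negative)
    Returns the path likelihood computed before the update.
    """
    dim = len(context_vecs[0])
    # Mapping layer: x_w is the sum of the context word vectors
    x_w = [sum(v[i] for v in context_vecs) for i in range(dim)]
    neu1e = [0.0] * dim                 # error accumulation vector
    likelihood = 1.0
    for theta, d in zip(path_thetas, labels):
        q = sigmoid(sum(x * t for x, t in zip(x_w, theta)))
        likelihood *= q if d == 1 else (1.0 - q)   # p(d_j | x_w, theta_j)
        g = lr * (d - q)                # scaled gradient of the log-likelihood
        for i in range(dim):
            neu1e[i] += g * theta[i]    # uses theta[i] before it is updated
            theta[i] += g * x_w[i]      # update the classifier at this node
    for v in context_vecs:              # write neu1e back into every context vector
        for i in range(dim):
            v[i] += neu1e[i]
    return likelihood
```

Repeating the step on the same sample should increase the returned likelihood, which is a quick sanity check on the signs of the updates.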
(2) Skip-gram model
The main idea of the Skip-gram model is to use the current word to predict its context information. Its network structure is shown in Figure 7.
For comparison with CBOW, a mapping layer is drawn in Figure 7, but in fact the Skip-gram model has no mapping layer. The Skip-gram model therefore consists of two layers, an input layer and an output layer. The input layer is the word vector v(w) of the current word, and the output layer, as in the CBOW model, corresponds to a Huffman tree. Unlike CBOW, after v(w) passes through k binary classifiers starting from the root node, the information stored in the error accumulation vector neu1e is updated into the word vector v(w) of the current word.
The objective function of the Skip-gram model differs from that of the CBOW model. In the standard Hierarchical Softmax form it reads:
$$L = \prod_{u \in Context(w)} \prod_{j=1}^{k_u} p(d_j^u \mid v(w), \theta_j^u), \qquad p(d_j^u \mid v(w), \theta_j^u) = \left[\sigma(v(w)^{\top}\theta_j^u)\right]^{d_j^u}\left[1 - \sigma(v(w)^{\top}\theta_j^u)\right]^{1 - d_j^u}$$
Here θ denotes the parameters to be determined, whose values are stored in the non-leaf nodes; once θ is determined, training of the final model is complete. The conditional probability is the probability of being classified as d_j^u given the vector v(w) and the parameter θ, where d_j^u = 1 represents a positive example, d_j^u = 0 represents a negative example, u is a word in the context of the current word w, and k_u is the number of non-leaf nodes on the path to u. Because the Skip-gram model predicts context information given the known current word w, and the context contains multiple words, each training step of the Skip-gram model, unlike that of CBOW, has to predict win paths (win being the number of words in the context Context(w)). Each path runs from the root node to the leaf node of one word in Context(w). Once the context word u to be predicted is chosen, the path is uniquely determined, and so are the labels d_j^u along it.
For example, if the context is w(t-2), w(t-1), ..., w(t+1), w(t+2), the Skip-gram model needs to compute p(w(t-2) | w(t)), p(w(t-1) | w(t)), ..., p(w(t+1) | w(t)), p(w(t+2) | w(t)) separately, and to update the value of each error accumulation vector neu1e into the current word vector v(w).
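The per-context-word procedure described above can be sketched as follows: one Huffman path is processed for each context word u, and each path's error accumulation vector is written back into v(w). The function name `skipgram_hs_step` and the data layout are illustrative assumptions; the sketch follows the description in this text rather than any particular word2vec release.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def skipgram_hs_step(v_w, context_paths, lr=0.025):
    """Update v(w) in place using one Huffman path per context word u.

    context_paths: list of (path_thetas, labels) pairs, one per context word,
                   with labels d = 1 for a positive and d = 0 for a negative example.
    Returns the product of the path probabilities p(u | w), each computed
    just before its own update.
    """
    dim = len(v_w)
    total = 1.0
    for path_thetas, labels in context_paths:
        neu1e = [0.0] * dim                      # one error vector per path
        for theta, d in zip(path_thetas, labels):
            q = sigmoid(sum(x * t for x, t in zip(v_w, theta)))
            total *= q if d == 1 else (1.0 - q)
            g = lr * (d - q)
            for i in range(dim):
                neu1e[i] += g * theta[i]         # uses theta[i] before it is updated
                theta[i] += g * v_w[i]
        for i in range(dim):                     # write this path's error into v(w)
            v_w[i] += neu1e[i]
    return total
```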
word2vec implements not only the CBOW and Skip-gram models based on Hierarchical Softmax, but also CBOW and Skip-gram models based on Negative Sampling. In addition, word2vec applies dedicated handling to low-frequency words, high-frequency words, and learning-rate adjustment, with good results.
Once the words are represented as vectors, they must be assigned different weights, because each word contributes differently to an entity. The weight of a term in the vector is computed with TF-IDF, the product of TF (term frequency) and IDF (inverse document frequency):
TF-IDF = term frequency (TF) × inverse document frequency (IDF)
Term frequency (TF) is the number of occurrences of the feature word divided by the total number of words in the article:
$$TF(w, d) = \frac{\text{number of occurrences of } w \text{ in } d}{\text{total number of words in } d}$$
Here TF is the frequency with which a keyword occurs, and IDF is the logarithm of the total number of documents divided by the number of documents containing the word:
$$IDF(w) = \log\frac{|D|}{|w \in d|}$$
|D| is the total number of documents, and |w ∈ d| is the number of documents containing the word w.
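A minimal sketch of this weighting, using exactly the TF and IDF definitions above (no smoothing term), could look like this. The function name `tf_idf` and the token-list input format are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    df = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF = count / total words in the document; IDF = log(|D| / df)
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return weights
```

A word that occurs in every document receives IDF = log(1) = 0, so its weight vanishes regardless of how often it appears.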
Because words such as "is" and "this" occur very frequently, the IDF value is needed to reduce their weight. Dimensionality reduction simply means reducing the number of dimensions; for document similarity computation, it means reducing the number of words. The words commonly removed for dimensionality reduction are mainly function words and stop words (e.g. "and", "this"). In many cases, a dimensionality-reduction strategy improves not only efficiency but also precision.
Finally, the larger the computed TF-IDF weight, the more important the term is to the text.
Consequently, a set of similar articles is required before IDF values can be computed. The weights of the n keywords of an article D = (w1, w2, ..., wn) are computed in turn. When an article E is supplied, E = (q1, q2, ..., qn) is computed by the same method, and the similarity between D and E is then calculated.
The similarity between two documents is described by the cosine of the angle between their vectors. The similarity of texts D1 and D2 is given by:
$$\cos\theta = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$$
where the numerator is the dot product of the two vectors and the denominator is the product of their norms.
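The cosine measure can be sketched for sparse keyword-weight vectors as follows. The function name `cosine_similarity` and the dictionary representation are assumptions, as is the choice to return 0.0 when either vector has zero norm.

```python
import math


def cosine_similarity(d1, d2):
    """d1, d2: {word: weight} dicts treated as sparse vectors."""
    # Numerator: dot product over the words the two documents share
    dot = sum(w * d2.get(k, 0.0) for k, w in d1.items())
    # Denominator: product of the two vector norms
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0   # assumption: an all-zero vector is similar to nothing
    return dot / (norm1 * norm2)
```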
Using the word2vec tool, the corpus text is fed into the CBOW and Skip-gram models, which converts the text content into vectors in a vector space. The similarity between the vectors is computed, and vectors with higher similarity are clustered to form topic clusters of differing granularity, which express the degree of semantic similarity between texts.
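The text does not name a clustering algorithm, so the following is only a sketch of one simple possibility: a greedy single-pass grouping by cosine similarity against each cluster's first member. The function names, the seed-based rule, and the `threshold` parameter are all assumptions; lowering the threshold yields coarser clusters, raising it yields finer ones.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_by_similarity(vectors, threshold):
    """Greedy single-pass clustering (illustrative, not the patent's stated method).

    Each vector joins the first cluster whose seed (first member) is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])   # no sufficiently similar seed found
    return clusters
```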
This patent is not limited to the preferred embodiments above. In light of this patent, anyone may derive other forms of the knowledge-graph-based method for constructing a power information operation and maintenance system database; all equivalent changes and modifications made within the scope of the claims of this patent shall fall within its coverage.

Claims (4)

1. A knowledge-graph-based method for constructing a power information operation and maintenance system database, characterized by comprising the following steps:
Step 1: collect work orders from the power system work order data and convert them to text format; build the ontology using the seven-step method, dividing the text into multiple text fields according to its semantic attributes;
Step 2: perform word segmentation on the work order text data, taking the work order as the unit;
Step 3: group the text fields;
Step 4: perform field-wise word segmentation on each text field, using a string-matching-based segmentation method to segment the content of each group;
Step 5: filter invalid words according to an invalid-word list, filtering out invalid words and sensitive words;
Step 6: compare the valid words with the vocabulary in the knowledge base; add new words to the word list of the knowledge base, and for words that already exist, accumulate their frequency of occurrence;
Step 7: extract the entity relations of the added words: extract entity relations by means of predefined entity relation types and entity features, process the word feature vectors using word2vec, compute the similarity between word vectors, and classify the entity relations according to the similarity;
Step 8: import the entities and the classified entity relations into the Neo4j graph database.
2. The knowledge-graph-based method for constructing a power information operation and maintenance system database according to claim 1, characterized in that: in step 2 and step 4, word segmentation uses the ICTCLAS system of the Chinese Academy of Sciences;
All words obtained after segmentation form a word list D, D = {d_1, d_2, ..., d_n}, where d_i denotes a word and i ∈ [1, n]; the word feature vector of each word E is expressed as V = {v_1, v_2, ..., v_n}, where v_i indicates whether the word corresponds to d_i in the word list D, and v_i is computed as follows:
$$v_i = \begin{cases} 1, & E = d_i \\ 0, & \text{otherwise} \end{cases}$$
3. The knowledge-graph-based method for constructing a power information operation and maintenance system database according to claim 2, characterized in that: in step 2 and step 4, word segmentation further includes constructing part-of-speech features, which are constructed in the same way as the word features.
4. The knowledge-graph-based method for constructing a power information operation and maintenance system database according to claim 1, characterized in that: the unstructured text content of the power system work order data is collected using the urllib2 package of Python; the collected content is parsed using the BeautifulSoup package; and word segmentation is performed using the Rwordseg package in the R environment.
CN201810762686.6A 2018-07-12 2018-07-12 The power information operational system database construction method of knowledge based map Pending CN109033284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810762686.6A CN109033284A (en) 2018-07-12 2018-07-12 The power information operational system database construction method of knowledge based map

Publications (1)

Publication Number Publication Date
CN109033284A true CN109033284A (en) 2018-12-18

Family

ID=64640896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810762686.6A Pending CN109033284A (en) 2018-07-12 2018-07-12 The power information operational system database construction method of knowledge based map

Country Status (1)

Country Link
CN (1) CN109033284A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600298A (en) * 2016-12-23 2017-04-26 国网山东省电力公司信息通信公司 Electric power information system customer service knowledge base construction method based on work order data analysis
CN107609052A (en) * 2017-08-23 2018-01-19 中国科学院软件研究所 A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle
CN108460136A (en) * 2018-03-08 2018-08-28 国网福建省电力有限公司 Electric power O&M information knowledge map construction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王仁武 等: "基于深度学习与图数据库构建中文商业知识图谱的探索研究", 《图书与情报》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858451A (en) * 2019-02-14 2019-06-07 清华大学深圳研究生院 A kind of non-cooperation hand detection method
CN109947949A (en) * 2019-03-12 2019-06-28 国家电网有限公司 Knowledge information intelligent management, device and server
CN113711146A (en) * 2019-04-17 2021-11-26 阿自倍尔株式会社 Management system
CN110263251A (en) * 2019-06-17 2019-09-20 广东电网有限责任公司 A kind of O&M knowledge method for pushing and device based on context model
CN110243834A (en) * 2019-07-11 2019-09-17 西南交通大学 The transformer equipment defect analysis method of knowledge based map
CN110243834B (en) * 2019-07-11 2020-03-31 西南交通大学 Transformer equipment defect analysis method based on knowledge graph
CN111537831A (en) * 2020-04-01 2020-08-14 华中科技大学鄂州工业技术研究院 Power distribution network line fault positioning method and device
CN111537831B (en) * 2020-04-01 2022-06-24 华中科技大学鄂州工业技术研究院 Power distribution network line fault positioning method and device
CN111680222A (en) * 2020-04-16 2020-09-18 陕西师范大学 Exploration type searching method of knowledge graph based on social platform
CN111930774A (en) * 2020-08-06 2020-11-13 全球能源互联网研究院有限公司 Automatic construction method and system for power knowledge graph ontology
CN111930774B (en) * 2020-08-06 2024-03-29 全球能源互联网研究院有限公司 Automatic construction method and system for electric power knowledge graph body
CN111966836A (en) * 2020-08-29 2020-11-20 深圳呗佬智能有限公司 Knowledge graph vector representation method and device, computer equipment and storage medium
CN112101592A (en) * 2020-09-08 2020-12-18 中国电力科学研究院有限公司 Power secondary device defect diagnosis method, system, device and storage medium
CN112559739A (en) * 2020-12-01 2021-03-26 广东电网有限责任公司广州供电局 Method for processing insulation state data of power equipment
CN113157860B (en) * 2021-04-07 2022-03-11 国网山东省电力公司信息通信公司 Electric power equipment maintenance knowledge graph construction method based on small-scale data
CN113157860A (en) * 2021-04-07 2021-07-23 国网山东省电力公司信息通信公司 Electric power equipment maintenance knowledge graph construction method based on small-scale data
CN114266364A (en) * 2021-11-24 2022-04-01 国网北京市电力公司 Power grid fault processing method and device and computer readable storage medium
CN114780756A (en) * 2022-06-07 2022-07-22 国网浙江省电力有限公司信息通信分公司 Entity alignment method and device based on noise detection and noise perception
CN115146712A (en) * 2022-06-15 2022-10-04 北京天融信网络安全技术有限公司 Internet of things asset identification method, device, equipment and storage medium
CN115146712B (en) * 2022-06-15 2023-04-28 北京天融信网络安全技术有限公司 Internet of things asset identification method, device, equipment and storage medium
CN115714009A (en) * 2022-11-17 2023-02-24 中澄明(北京)商务服务有限公司 System and method for intelligently recommending hospital of higher level to hospital of lower level according to hospital saturation
CN115714009B (en) * 2022-11-17 2023-08-25 北京恒生芸泰网络科技有限公司 System and method for intelligent recommendation of hospitalization saturation quantity of superior hospital to subordinate hospital
CN117390139A (en) * 2023-11-27 2024-01-12 国网江苏省电力有限公司扬州供电分公司 Method for evaluating working content accuracy of substation working ticket based on knowledge graph
CN117390139B (en) * 2023-11-27 2024-05-24 国网江苏省电力有限公司扬州供电分公司 Method for evaluating working content accuracy of substation working ticket based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218