CN109213752A - A kind of data cleansing conversion method based on CIM - Google Patents


Info

Publication number
CN109213752A
Authority
CN
China
Prior art keywords
data
cim
operable
node
distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810887270.7A
Other languages
Chinese (zh)
Inventor
李晖
陈清族
陈世春
邹墨
何德明
许梓明
马汉斌
陈珺
谢驰
程友平
温天宝
林超
周暖青
林永辉
刘化龙
李雪梅
谢妙红
林朝灯
李建平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, and Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Priority to CN201810887270.7A
Publication of CN109213752A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The present invention provides a data cleansing and conversion method based on CIM. The method comprises: capturing operation data of a power system; cleaning and converting the captured operation data to obtain data that conforms to a unified CIM-based standard, and storing it in a distributed file system; and extracting data from the distributed file system to build a CIM-based distributed data warehouse. Supported by an improved grid operation data model and a distributed data platform, the method extracts, cleans, and integrates source data, safeguards data quality and reliability, and produces unified, standardized, database-backed data output. It supports clustered deployment and concurrency, which makes it broadly applicable, and it provides reliable support for the automated integration and analysis of grid data.

Description

A data cleansing and conversion method based on CIM
Technical field
The invention belongs to the field of power grid big data, and in particular relates to a data cleansing and conversion method based on CIM.
Background
With the widespread use of power transmission and transformation equipment of all kinds, the volume of grid operation data is growing geometrically. Analyzing and processing this massive operation data quickly, and detecting and mining anomalous data in it, raises the challenge of how to process and analyze grid operation big data both effectively and efficiently. Because the hardware and software systems and the resources of the individual grid and provincial companies differ considerably, building a data log analysis platform is difficult. Traditional grid operation data platforms can no longer meet enterprise needs for optimized storage and parallel processing of operation data. Traditional data storage structures are intuitive, but their notable drawback is a large amount of data redundancy: operation information is stored repeatedly, which complicates merge operations across different operation data tables and makes queries on operation data inefficient.
Summary of the invention
The purpose of the invention is to process grid operation big data effectively, so that an enterprise can integrate and merge multiple grid systems and achieve unified, efficient big data analysis. By establishing a distributed data cleansing and conversion framework and an operable data area, the invention avoids conflicts between data conversion processing and data queries. Grid data mining is then carried out on top of the resulting data warehouse, including association analysis and anomalous data identification using an improved grid data model.
To solve the above problems, the invention proposes a data cleansing and conversion method based on CIM, comprising:
capturing operation data of a power system;
cleaning and converting the captured operation data of the power system to obtain data that conforms to a unified CIM-based standard, and storing it in a distributed file system; and
extracting data from the distributed file system and building a CIM-based distributed data warehouse.
Preferably, the operation data includes equipment ledger information, operation and maintenance data, fault data, power flow topology data, and GIS equipment information.
Preferably, the model information of the power system and the metadata of the CIM-based distributed data warehouse are stored in MangoDB.
Preferably, after the CIM-based distributed data warehouse decomposes a task through MapReduce, data is extracted directly from the distributed file system and analyzed; data management and data access are handled uniformly, and model data mapping and performance optimization are implemented.
Preferably, the model data mapping comprises mapping between the attributes of the power system business model and the model data of the different types of underlying data sources.
Preferably, the cleaning and conversion comprises two stages: in the first stage, data is extracted from the data sources into an operable data buffer; in the second stage, data is extracted from the operable data buffer into the CIM-based data warehouse. (1) In the first stage, the heterogeneous data sources are extracted into the operable data buffer, so that a backup copy of the power system operation data with identical structure and identical content is established in the operable data buffer. (2) In the second stage, the data in the operable data buffer is aggregated, merged, and summarized, and is stored in the CIM-based data warehouse by incremental loading. Data extraction is incremental; if the increment cannot be determined at extraction time, it is computed at load time, and a time label is added when the data is loaded into the CIM-based data warehouse. In the extraction flow from the operable data to the CIM-based data warehouse, data read from the operable data buffer first undergoes unified information coding, after which fact table data and dimension table data are processed differently. For changes to fact table data, the incremental loading mode is chosen according to how the data changes: if the data changes over time, a timestamp-based increment is used; if the data changes randomly, the increment is obtained by a full-table comparison. For changes to dimension table data, the latest CIM-based data overwrites the offline data.
Preferably, extracting data from the distributed file system for analysis further comprises a similarity analysis of the power system operation data.
Preferably, in the similarity analysis of the power system operation data, the relationship between different sequences is judged from the shape of the sequence curves, and time-feature correlation factors are selected as the samples for calculating the correlation degree. The specific calculation steps are as follows:
(1) Take the current time sequence $Y = \{Y(m) \mid m = 1, 2, \ldots, p\}$ as the reference sequence and the historical operation data sequences $X_i = \{X_i(m) \mid m = 1, 2, \ldots, p\}$, $i = 1, 2, \ldots, k$, as the comparison sequences, where $p$ is the number of elements in each sequence;
(2) Calculate: [formula not reproduced]
(3) Calculate the correlation coefficient $\zeta_i(m)$: [formula not reproduced]
where $\zeta_i(m)$ is the correlation coefficient of $Y(m)$ at $X_i(m)$,
$\Delta_i(m) = |Y(m) - X_i(m)|$, and $\rho$ is the resolution coefficient, with values in the interval $(0, 1)$;
(4) Calculate the correlation degree: [formula not reproduced]
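The formulas for steps (2) to (4) are not reproduced in this text. The definitions of Δi(m) and the resolution coefficient ρ resemble standard grey relational analysis, so the following Python sketch assumes that form: the coefficient is (Δmin + ρ·Δmax) / (Δi(m) + ρ·Δmax), where Δmin and Δmax are the smallest and largest differences over all sequences and positions, and the correlation degree is the mean of the coefficients over m. The function name and the default ρ = 0.5 are illustrative assumptions, not taken from the patent.

```python
from typing import List, Sequence


def grey_relational_degrees(reference: Sequence[float],
                            comparisons: Sequence[Sequence[float]],
                            rho: float = 0.5) -> List[float]:
    """Return one correlation degree per comparison sequence.

    Assumes the standard grey relational form; the patent's own
    formulas are not available in this text.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in the open interval (0, 1)")
    # Delta_i(m) = |Y(m) - X_i(m)| for every sequence i and position m.
    deltas = [[abs(y - x) for y, x in zip(reference, seq)]
              for seq in comparisons]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:
        # Every comparison sequence is identical to the reference.
        return [1.0] * len(comparisons)
    degrees = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(coeffs) / len(coeffs))
    return degrees
```

A historical sequence with a larger degree has a curve shape closer to the current sequence, so ranking the degrees gives the most similar historical periods.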
Compared with the prior art, the present invention has the following advantages:
The invention proposes a data cleansing and conversion method based on CIM. Supported by an improved grid operation data model and a distributed data platform, the method extracts, cleans, and integrates source data, safeguards data quality and reliability, and produces unified, standardized, database-backed data output. It supports clustered deployment and concurrency, which makes it broadly applicable, and it provides reliable support for the automated integration and analysis of grid data.
Brief description of the drawings
Fig. 1 is a flowchart of the data cleansing and conversion method based on CIM according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawing. The description presents, by way of example and not limitation, specific embodiments consistent with the principles of the invention. The embodiments are described in enough detail for those skilled in the art to practice the invention; other embodiments may be used, and the structure of individual elements may be changed or replaced, without departing from the scope and spirit of the invention. The following detailed description should therefore not be understood in a limiting sense. To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to the specific drawing.
One aspect of the present invention provides a data cleansing and conversion method based on CIM. Fig. 1 is a flowchart of the method according to an embodiment of the invention.
The CIM-based power system operation data monitoring platform of the invention comprises a data collection server, a data processing and storage server, and a data analysis server. The data collection server captures power system operation data through sensors and similar devices. The operation data includes equipment ledger information, operation and maintenance data, fault data, power flow topology data, and GIS equipment information, as well as unstructured images and video. This massive heterogeneous data contains large amounts of complex, redundant, and erroneous data, from which data conforming to the unified standard must be extracted within a short time. The data processing and storage server integrates a distributed file system with a MangoDB database: monitoring data that conforms to the unified standard is stored in the distributed file system, while the model information of the power system and the metadata of the CIM-based distributed data warehouse are stored in MangoDB. The tables and fields created by the CIM-based distributed data warehouse are also stored in MangoDB. Whenever a data operation is executed, the MangoDB engine is started to verify that the metadata exists. The data analysis server performs the distributed similarity analysis of the power system operation data. After the CIM-based distributed data warehouse decomposes a task through MapReduce, data is extracted directly from the distributed file system and analyzed, and data management and data access are handled uniformly. Model data mapping and performance optimization are implemented at this layer. The model data mapping maps each attribute of the power system business model to the model data of the different types of underlying data sources, supports access to the CIM-based data warehouse, relational databases, and non-relational databases, and provides a unified query and update API based on the business model. The performance optimization provides second-level caching and asynchronous parallel data queries.
On top of the distributed architecture described above, the data processing and storage server sets up a CIM-based data cleansing and conversion framework comprising a semantic parsing module, a MangoDB rule base, a scheduling module, and a cleaning conversion module. The power system receives data cleansing and conversion tasks requested by users and interprets each task as a workflow graph with a DAG structure in a unified format. Because the execution of the power system's data conversion semantics is not logically optimal, the optimization is handed over to the semantic parsing module.
The semantic parsing module analyzes and optimizes the cleaning and conversion workflow graph formatted by the power system. By traversing each node in the graph, it determines the attributes of the activities in the workflow, transforms the workflow graph, and finally sends the optimized workflow graph to the coordination unit for execution. The detailed process is as follows:
1. Traverse each node of the workflow graph in a loop. A node with in-degree 0 is a power system data source: determine the data volume of the data source and record the relevant CIM-based information in the MangoDB rule base. A node with out-degree 0 is an operable data set: record the relevant metadata in the MangoDB rule base. For an activity node whose out-degree and in-degree are both greater than 0, determine its activity type; for a binary activity node, which serves as a dividing point of the workflow, record the attributes and position of the activity.
2. After the traversal, apply optimizations such as swapping and transformation to the nodes in the workflow to reduce the data exchanged between activity nodes.
3. Split the optimized workflow into multiple sub-workflows, using the binary activities as boundaries. Within each sub-workflow, group multiple unary activities together, label the group, and send it to the coordination unit ready for execution, giving the coordination unit a reference for dynamically optimizing the workflow.
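The traversal and grouping described in these steps can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the edge-list representation, the rule that a binary activity is a node with two or more inputs, and the names `classify_nodes` and `unary_groups` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def classify_nodes(edges: List[Edge]) -> Dict[str, str]:
    """Label each workflow node by its in-degree and out-degree."""
    nodes = list(dict.fromkeys(n for edge in edges for n in edge))
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    kinds = {}
    for n in nodes:
        if in_deg[n] == 0:
            kinds[n] = "source"    # power system data source
        elif out_deg[n] == 0:
            kinds[n] = "target"    # operable data set
        elif in_deg[n] >= 2:
            kinds[n] = "binary"    # assumed: two or more inputs
        else:
            kinds[n] = "unary"
    return kinds


def unary_groups(edges: List[Edge]) -> List[List[str]]:
    """Group consecutive unary activities, bounded by other node kinds."""
    kinds = classify_nodes(edges)
    succ, pred = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)
    groups = []
    for node, kind in kinds.items():
        if kind != "unary":
            continue
        parent = pred[node][0]  # a unary node has exactly one input
        if kinds[parent] == "unary" and len(succ[parent]) == 1:
            continue  # not the head of a chain
        chain = [node]
        while (len(succ[chain[-1]]) == 1
               and kinds[succ[chain[-1]][0]] == "unary"):
            chain.append(succ[chain[-1]][0])
        groups.append(chain)
    return groups
```

Each returned group is a run of unary activities that could be packaged and handed to the coordination unit as one unit of work.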
The coordination unit further comprises a divide-and-conquer module and a scheduling module for the cleaning conversion modules. The divide-and-conquer module partitions data so that existing resources are fully used and the performance advantage of parallel computing is realized; data is partitioned horizontally into multiple data streams according to partitioning rules. The scheduling module then packages the corresponding data cleansing and conversion activities and distributes them to the distributed parallel cleaning conversion modules for execution. The coordination unit receives execution information from the cleaning conversion modules so as to track the execution progress of the data cleansing activity nodes in real time. The divide-and-conquer module and the scheduling module work in coordination: the former aggregates and analyzes the execution information from the cleaning conversion modules to derive an optimized data partitioning strategy, and the latter assigns tasks to the cleaning conversion modules according to the optimization results obtained in real time.
The cleaning conversion module executes, in a distributed computing environment, the computation packages distributed by the coordination unit, and caches intermediate data conversion results locally. Network bandwidth is used for data transfer only when the output data of multiple nodes needs to be gathered.
The optimization based on semantic logic is completed in the semantic parsing stage. After all activity nodes of the data cleansing and conversion workflow have been traversed, the semantic parsing module, according to the attributes of the different activity nodes, applies strategies such as swapping the execution order of activities and merging semantically duplicate activities, and modifies the data cleansing and conversion workflow without changing the execution result of the data workflow. The optimized workflow reduces the volume of data flowing between different nodes. For data cleansing and conversion tasks described on the basis of CIM, the framework adopts optimization strategies based on relational databases.
The divide-and-conquer module partitions data as follows. A data source $T$ is partitioned horizontally into $T_1$ and $T_2$, so that $T = T_1 \cup T_2$. Each activity $Act_i$ ($i \in [1, m]$) in the CIM-based data cleansing process is regarded as a function mapping on $T$. The module checks whether the data cleansing and conversion workflow contains a serial sequence of activities of the following kind: if $Act_m(Act_{m-1}(\ldots(Act_i(T)))) = \{D_1, D_2, \ldots, D_m\}$, where $D_1, D_2, \ldots, D_m$ are the workflow subgraphs that result from each activity's function mapping on $T$, then $Act_m(Act_{m-1}(\ldots(Act_i(T_1)))) \cup Act_m(Act_{m-1}(\ldots(Act_i(T_2)))) = \{D_1, D_2, \ldots, D_m\}$. If $M$ is a relational operation on $T$, it satisfies $M(T) = M(T_1) \cup M(T_2)$. If such a sequence exists in the workflow, its activities are merged into one group. By partitioning the data in this way, the activity groups are assigned to different cleaning conversion modules for asynchronous execution, which produces a pipeline effect.
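The property M(T) = M(T1) ∪ M(T2) can be checked on a small example. The Python sketch below assumes row-wise activities (a filter and a per-row conversion), for which the property holds; the activity names, the `(id, value)` row layout, and the round-robin split are hypothetical choices made for illustration.

```python
from functools import reduce
from typing import Callable, List, Optional, Sequence, Set, Tuple

Row = Tuple[int, Optional[float]]
Activity = Callable[[Set[Row]], Set[Row]]


def drop_nulls(rows: Set[Row]) -> Set[Row]:
    """Unary activity: discard rows whose measurement is missing."""
    return {r for r in rows if r[1] is not None}


def to_kilovolts(rows: Set[Row]) -> Set[Row]:
    """Unary activity: convert volts to kilovolts, row by row."""
    return {(rid, volts / 1000.0) for rid, volts in rows}


def run_chain(activities: Sequence[Activity], rows: Set[Row]) -> Set[Row]:
    """Apply Act_1, then Act_2, and so on, to the whole input."""
    return reduce(lambda data, act: act(data), activities, rows)


def split(rows: Set[Row], parts: int) -> List[Set[Row]]:
    """Horizontal partition of T into `parts` disjoint subsets."""
    ordered = sorted(rows, key=lambda r: r[0])
    return [set(ordered[i::parts]) for i in range(parts)]


def run_partitioned(activities: Sequence[Activity], rows: Set[Row],
                    parts: int = 2) -> Set[Row]:
    """Run the chain on each partition and union the partial results."""
    partial = [run_chain(activities, part) for part in split(rows, parts)]
    return set().union(*partial)
```

Activities that need to see all rows at once, such as a global aggregate, do not satisfy this property and would not be merged into such a group.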
For data cleansing and conversion workflows executed in parallel, the data distribution may not match the resources of the MapReduce nodes. In that case, after the system has started executing computation tasks, it decides whether to partition the data based on the current execution progress. Data partitioning is triggered when a MapReduce node becomes idle or when a new unary activity task starts executing. The strategy proceeds as follows. First, the cleaning conversion modules on the MapReduce nodes that are currently executing computation tasks are visited, and the activity with the latest completion time is selected as the object whose data will be partitioned. Next, the idle cleaning conversion modules on the MapReduce nodes are visited to check whether each satisfies the execution condition, namely that the node's idle time window is greater than the sum of the cross-machine transfer time and the computation time of the partitioned data; otherwise partitioning is pointless. The qualifying cleaning conversion modules are recorded. Finally, the amount of data to partition is calculated so that the time needed to transfer and process this data on the idle cleaning conversion modules is shortest.
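The execution condition can be illustrated with a small Python sketch. It assumes a linear cost model in which transfer time and computation time are both proportional to the number of rows; that model, the `IdleModule` fields, and the function names are assumptions for illustration and are not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IdleModule:
    name: str
    idle_window_s: float        # length of the idle time window, seconds
    transfer_rows_per_s: float  # assumed cross-machine transfer rate
    compute_rows_per_s: float   # assumed processing rate


def completion_time(module: IdleModule, rows: int) -> float:
    """Transfer time plus computation time for `rows` rows."""
    return (rows / module.transfer_rows_per_s
            + rows / module.compute_rows_per_s)


def meets_condition(module: IdleModule, rows: int) -> bool:
    """Idle window must exceed transfer time plus computation time."""
    return module.idle_window_s > completion_time(module, rows)


def pick_module(modules: List[IdleModule], rows: int) -> Optional[IdleModule]:
    """Among qualifying idle modules, pick the one that finishes soonest."""
    eligible = [m for m in modules if meets_condition(m, rows)]
    return min(eligible, key=lambda m: completion_time(m, rows), default=None)
```

When `pick_module` returns `None`, no idle module can finish the partition within its window, which corresponds to the case where partitioning is pointless.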
When an idle cleaning conversion module appears on a MapReduce node and an activity is waiting to be scheduled, the system activates the data partitioning algorithm. Data processing in the cleaning conversion module is divided into two main stages: in the first stage, data is extracted from the data sources into the operable data buffer; in the second stage, data is extracted from the operable data buffer into the CIM-based data warehouse:
(1) In the first stage, the heterogeneous data sources are extracted into the operable data buffer, so that a backup copy of the power system operation data with identical structure and identical content is established in the operable data buffer.
(2) In the second stage, the data in the operable data buffer is aggregated, merged, and summarized, and is stored in the CIM-based data warehouse by incremental loading. Data extraction is incremental; if the increment cannot be determined at extraction time, it is computed at load time, and a time label is added when the data is loaded into the CIM-based data warehouse. In the extraction flow from the operable data to the CIM-based data warehouse, data read from the operable data buffer first undergoes unified information coding, after which fact table data and dimension table data are processed differently. For changes to fact table data, the incremental loading mode is chosen according to how the data changes: if the data changes over time, a timestamp-based increment is used; if the data changes randomly, the increment is obtained by a full-table comparison. For changes to dimension table data, the latest CIM-based data overwrites the offline data.
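The three loading modes in the second stage can be sketched in Python as follows. Rows are modeled as dictionaries with hypothetical `id` and `ts` fields, and the function names are illustrative; the sketch shows only the selection logic and not a real warehouse loader.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def timestamp_increment(source: List[Row], last_loaded_ts: int) -> List[Row]:
    """Fact data that changes over time: take rows newer than the last load."""
    return [r for r in source if r["ts"] > last_loaded_ts]


def full_compare_increment(source: List[Row], warehouse: List[Row],
                           key: str = "id") -> List[Row]:
    """Fact data that changes randomly: compare the full table by key."""
    existing = {r[key]: r for r in warehouse}
    return [r for r in source if existing.get(r[key]) != r]


def overwrite_dimension(dimension: List[Row], latest: List[Row],
                        key: str = "id") -> List[Row]:
    """Dimension data: the latest rows overwrite the stored ones."""
    merged = {r[key]: r for r in dimension}
    merged.update({r[key]: r for r in latest})
    return list(merged.values())


def add_load_label(rows: List[Row], load_time: int) -> List[Row]:
    """Attach a time label as rows are loaded into the warehouse."""
    return [{**r, "load_ts": load_time} for r in rows]
```

The full-table comparison is more expensive than the timestamp filter, which is consistent with reserving it for data whose changes carry no usable time ordering.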
The operable data buffer serves as a backup of the power system database management system and backs up power system operation data such as production defects and network loads. During cleaning and conversion, the backup of the grid operation data in the operable data buffer can be used as the data source; after conversion and cleaning, the data is loaded into the topic models of the CIM-based data warehouse. All power system operation data destined for the CIM-based data warehouse is first sent directly to the operable data buffer, and is then cleaned, converted, and mapped from the buffer into the target topic in the CIM-based data warehouse. The data in the operable data buffer is deleted once processing is complete.
The data staging store of the operable data buffer holds the raw data of the power system and the raw data transferred in from each heterogeneous system. Grid operation data is stored by topic. The data in the staging store is cleaned and then stored, organized by topic and by data model, in topic-oriented data marts. The data in the topic data marts goes through one more conversion step before entering the CIM-based data warehouse. The CIM-based data warehouse is divided into multiple topic models and dimension table models.
As a further embodiment, the CIM-based cleaning conversion module of the invention performs the following steps during data conversion:
(1) Determine where in the power system data sources conversion and cleaning are required. Capture null field values, then load them or replace them with other meaningful data, and route the data to different target stores according to the null field values.
(2) Extract data samples from the data sources, analyze whether the extracted data is consistent with its definition, look for the formats and structures of anomalous data, and define CIM business rules. Standardize data formats by applying constraint definitions to field formats, and load the numeric, time, and character values in the data sources in user-defined formats. Split fields according to CIM business requirements.
(3) Verify data correctness with lookup tables, then replace invalid and missing data; specify in advance the strategy for handling lost data.
(4) Transform the data into a standard data model, standardizing data values and formats according to the definitions. While constraints are being established, invalid data that does not satisfy them is replaced or exported to an error data set, and the uniqueness of the data's primary keys is guaranteed.
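A compact Python sketch of these cleaning steps is shown below. The record fields (`id`, `device_type`, `voltage`), the lookup set, and the default value used for null replacement are hypothetical; the sketch covers null replacement, lookup validation, export to an error data set, and primary key uniqueness.

```python
from typing import Any, Dict, List, Set, Tuple

Record = Dict[str, Any]


def clean_records(records: List[Record], valid_types: Set[str],
                  default_voltage: float = 0.0
                  ) -> Tuple[List[Record], List[Record]]:
    """Return (clean rows, error data set) for a batch of records."""
    clean: List[Record] = []
    errors: List[Record] = []
    seen_ids: Set[Any] = set()
    for record in records:
        row = dict(record)  # leave the input untouched
        # Step (1): replace a null field value with a meaningful default.
        if row.get("voltage") is None:
            row["voltage"] = default_voltage
        # Step (3): verify correctness against a lookup table.
        if row.get("device_type") not in valid_types:
            errors.append(row)
            continue
        # Step (4): keep primary keys unique; duplicates go to the error set.
        if row["id"] in seen_ids:
            errors.append(row)
            continue
        seen_ids.add(row["id"])
        clean.append(row)
    return clean, errors
```

Exporting rejected rows instead of silently dropping them keeps the error data set available for later inspection.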
To minimize the impact of query conflicts, the invention further divides the CIM-based data cleansing and conversion flow into asynchronous conversion and synchronous conversion, which handle offline data and real-time grid operation data, respectively. Asynchronous conversion loads the offline operation data in the power system data sources, which is no longer real-time, into the data warehouse in batches at a predetermined period. Synchronous conversion actively captures operation data that changes in real time in the power system and loads it into the operable data storage area. Once the query and analysis operations on the latest data in the operable data storage area are complete and certain system conditions are triggered, the data is imported in batches into the CIM-based data warehouse. The operable data storage area consists of multiple data replicas and a replica index based on double links. Replicas are data storage spaces with the same logical structure and physical structure, and they are created dynamically in the operable data storage area.
When a replica is created, a corresponding replica file is saved in the operable data storage area, and real-time grid operation data is loaded into the replica in order. The replica index consists of two queues, one horizontal and one vertical. A horizontal queue consists of replica nodes that share the same data item ID but have different timestamps; the vertical queue consists of the head nodes of the replica queues for the different data item IDs.
A replica queue consists of a queue head node and queue nodes. The queue head node has two attributes: the data item ID and the first address. The data item ID identifies the source of the data. Within one replica queue, the data of all replica nodes comes from the same data source and therefore has the same data item ID; data with the same data item ID is called same-source data. The first address stores an address that points to the first replica node of the queue.
A queue node has five attributes: the replica node size, the replica node data timestamp, the operation flag, the data storage address, and the address of the next node in the queue. The node size indicates how much space is occupied by the data that the current replica node points to. Replica nodes are sorted by timestamp in descending order. The operation flag records which operation is being performed on the data of the current replica node: if real-time grid operation data from the source is being loaded into the operable data storage area, the flag is set to 0; if the data that the node points to needs to be batch loaded from the operable data storage area into the CIM-based data warehouse, the flag is set to 1. The data storage address points to the location where the replica node's data is stored.
All replicas from the same data source form one replica queue, called a replica cluster. The first address of a replica cluster is the address of the queue head node. If the operable data storage area holds data with n different data item IDs, there are n replica clusters. The replica clusters are themselves organized as a queue, and the replica cluster queue has no header node. If no replica cluster queue exists in the operable data storage area, that is, no replica cluster exists yet, then the area currently holds no real-time grid operation data.
Creating a replica is the process of storing real-time grid operation data in the real-time storage area, specifically:
(1) When captured real-time grid operation data needs to be loaded into the operable data storage area, the replica management module allocates a block of space in the operable data storage area, stores the data in it, and then creates a replica that points to this space.
(2) The replica cluster queue is a queue structure and can only be searched sequentially. Traverse each node in the queue and check whether the replica cluster queue contains a replica cluster node with the same data item ID as the new data, that is, whether the operable data storage area already holds data from the same source as the newly arrived real-time grid operation data. If so, go to (3); if not, go to (9).
(3) Using the first address stored in the replica cluster node, locate the head node of the replica queue in the current replica cluster.
(4) Initialize the newly created replica node and set its operation flag to 0.
(5) Insert the newly created replica node into the replica queue. Starting from the first replica node of the queue, compare the data timestamp of the new node with each node in turn until a replica node is found whose timestamp is greater than that of the new node while the timestamp of its next node is less than that of the new node; insert the new replica node immediately after that node.
(6) If the real-time grid operation data that a replica node points to has expired, or a system command is received and the batch data in the replica node needs to be imported from the operable data storage area into the CIM-based data warehouse, set the flag of this replica node to 1 and load the batch data in the replica node into the CIM-based data warehouse in order.
(7) If a data update request is received, allocate storage space in the operable data storage area, create a new replica node and initialize it, and then check whether the operable data storage area has a replica queue corresponding to the data item ID of the newly arrived data. If so, go to step (8); if not, go to (9).
(8) Build a new replica queue for the newly created replica node. Initialize the queue, assign the data item ID of the new replica node to the data item ID of the header node, and point the first address of the header node to the first address of the new replica node.
(9) Insert the newly created replica node into the replica queue and update the replica cluster queue. If the cluster queue has no cluster node corresponding to the data item ID of the new replica queue, create a replica cluster and initialize it, assign the data item ID of the new replica queue to the data item ID of the cluster node, point the cluster node to the head node of the replica queue, and insert the new cluster node at the tail of the replica cluster queue.
(10) The double-queue replica index is updated accordingly.
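The replica index can be sketched in Python as follows. For brevity the sketch swaps the linked queues described above for a dictionary keyed by data item ID (standing in for the replica cluster queue) whose values are lists kept in descending timestamp order (standing in for the replica queues). The class and method names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ReplicaNode:
    timestamp: int
    payload: Any
    op_flag: int = 0  # 0: loading into the operable store, 1: batch load


class OperableStore:
    """Replica clusters keyed by data item ID, newest replica first."""

    def __init__(self) -> None:
        self.clusters: Dict[str, List[ReplicaNode]] = {}

    def load(self, data_item_id: str, timestamp: int,
             payload: Any) -> ReplicaNode:
        """Create a replica node and insert it by descending timestamp."""
        node = ReplicaNode(timestamp, payload)
        # A missing cluster is created on first use.
        queue = self.clusters.setdefault(data_item_id, [])
        pos = 0
        while pos < len(queue) and queue[pos].timestamp > timestamp:
            pos += 1
        queue.insert(pos, node)
        return node

    def flush(self, data_item_id: str) -> List[Any]:
        """Flag a cluster's nodes for batch load and hand back their data."""
        queue = self.clusters.pop(data_item_id, [])
        for node in queue:
            node.op_flag = 1
        return [node.payload for node in queue]
```

A dictionary gives constant-time lookup by data item ID, whereas the queue structure in the text requires a sequential search; the ordering of replicas within a cluster is the same in both.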
In terms of data model, the present invention uses improved Operation of Electric Systems data model.For same parent member Uniform rules coding, in true table and is stored in distributed document for the level coding Information Compression in operation data dimension table In system, for executing big data analysis on a large scale, on distribution Mapreduce node in Operation of Electric Systems monitoring.Classification Coding takes sequential encoding and splicing coding.The sequential encoding is according to predefined sequence using the decimal system to each in dimension Attribute is encoded, and the corresponding relationship before dimensional attribute cannot be directly acquired.And splice coding by the splicing of coding, pass through Dimension traversal is realized in the shifting function of coding.Coding rule is as follows:
All detail data is classified into a non-overlapping data structure. Let d denote any dimension in a dimension table; it has the following properties:
1) Each d has one and only one subject.
2) d is a set of n levels, denoted $l_1, l_2, \dots, l_n$; each level $l_i$ contains a single dimension attribute and $m_i$ values.
3) Any dimension can be viewed as a tree structure formed by the values of its levels.
If $l_i$ is any level of dimension d, the set of all its $m_i$ values is the universe of level $l_i$. Level $l_{i-1}$ is the parent level of level $l_i$, and the parent of the highest level is defined as the subject to which it belongs. The set of value members of level $l_i$ that share a common parent p is called a subset domain of level $l_i$. Sibling nodes are members that belong to the same parent node.
Each dimension can be treated as a special single hierarchy tree, and the path to any node of the tree follows a pre-order traversal. The universe-level code of a node is the code obtained by concatenating the subset-domain level codes of every node on the path.
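Spliced encoding with shift-based traversal can be sketched as follows. The fixed bit widths in `LEVEL_BITS` and the helper names `splice` and `ancestor` are assumptions; the text does not specify how many digits or bits each level receives.

```python
# Assumed bit widths for levels l1, l2, l3 of one dimension (illustrative only).
LEVEL_BITS = [4, 6, 8]


def splice(path_codes, level_bits=LEVEL_BITS):
    """Concatenate the per-level (subset-domain) codes along a root-to-node path."""
    code = 0
    for local, bits in zip(path_codes, level_bits):
        if not 0 <= local < (1 << bits):
            raise ValueError("local code does not fit its level width")
        code = (code << bits) | local
    return code


def ancestor(code, depth, target_depth, level_bits=LEVEL_BITS):
    """Recover the code of the ancestor at target_depth by shifting off lower levels."""
    shift = sum(level_bits[target_depth:depth])
    return code >> shift
```

For example, `splice([3, 5, 9])` yields the code of a third-level member, and `ancestor(code, 3, 2)` returns the code of its parent without a dimension-table lookup, which is the property that makes roll-up on MapReduce nodes cheap.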
The data analysis server is also used to package grid operating data and its metadata into a unified format. It comprises a metadata packaging module and a shift-combination module. The metadata packaging module packages the grid information metadata, and the metadata is used to clean and check the data. The shift-combination module recombines the grid operating data and the metadata using segmented encryption, which improves the security of data transmission and exchange and allows the data to be processed uniformly.
By recording the grid information metadata, the power system information, and the information generated as data is transmitted, data that does not satisfy the rules is blocked under the rule constraints of CIM data conversion, and the data is cleaned in this way. Rule-based cleaning cleans the data by extracting the basic metadata values and the additional security level information of the power system.
After cleaning, the basic metadata of the operation data, the additional security level information of the power system, and the system operations are packaged into the final grid information metadata, which is encapsulated as key-value pairs. The shift-combination module encapsulates the data together with its metadata into a conversion protocol using segmented encryption. The data carrying metadata is packaged into unified-format data, and CIM data conversion is then performed.
During CIM data conversion, the data encapsulated by the metadata packaging module and the shift-combination module is parsed to recover the grid data and its metadata, respectively. The data is then cleaned with rules according to the metadata, removing data that does not conform to the specification.
Cleaning data with rules means cleaning the data according to the grid information metadata provided by the power system. The rules provide a unified rule description that processes the metadata information in order to filter the data. Each rule is designed as a custom mapping rule expression composed of variable values and operators. The variable values are extracted from the grid information metadata. During cleaning, the metadata is substituted for the variable values, the rule expression is evaluated, and the result is output. When the rules from a data source to a target data table are defined, they are recorded as mapping expressions. From a mapping expression, the system parses the positions of the one or more source table fields that make up the source of the target field, as well as complex conditional rules and data filtering conditions, stores the parsed conversion rules in the rule base in a predetermined format, and then submits them to the corresponding conversion module for processing. Once the mapping expressions have been parsed, the scattered user-defined conversion rules are consolidated in the rule base. When data extraction is executed, the conversion rules are read from the rule base and the corresponding conversion components are called to complete the extraction.
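A rule expression built from a variable, an operator, and a value can be evaluated against key-value metadata as in the sketch below. The three-token rule syntax, the metadata keys `security_level` and `voltage_kv`, and the function names are hypothetical; the patent does not define a concrete expression grammar.

```python
import operator

# Operators allowed in the assumed "variable operator literal" rule syntax.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def parse_rule(text):
    """Parse a rule such as 'security_level <= 3' into (variable, function, literal)."""
    variable, op, literal = text.split()
    return variable, OPS[op], float(literal)


def passes(metadata, rules):
    """Substitute metadata values for the variables and evaluate every rule."""
    for variable, op, literal in rules:
        if variable not in metadata or not op(metadata[variable], literal):
            return False
    return True


def clean(records, rule_texts):
    """Keep only the records whose metadata satisfies all rules."""
    rules = [parse_rule(text) for text in rule_texts]
    return [record for record in records if passes(record["meta"], rules)]
```

Parsing the rules once and storing the parsed form mirrors the rule base described above: the text form is what a user defines, and the parsed tuples are what the conversion step reads back at extraction time.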
When the data analysis server performs distributed similarity analysis on power system operation data, this specifically includes association analysis between abnormal power system behavior and power transmission and transformation monitoring data. Before association analysis, the power transmission and transformation monitoring data and the abnormal behavior data are preprocessed separately. Preprocessing of the abnormal behavior data includes two steps: 1) from all power system operation data, select the abnormal behavior data of equipment on which a monitoring terminal is installed, and summarize the failure frequency of each type of equipment at each monitoring terminal; 2) normalize the summarized data. Existing association analysis considers only the spatial characteristics of monitoring data and ignores its temporal characteristics. Here, the positions at which the corresponding equipment failed are identified, the monitoring data of their monitoring terminals is obtained, and the monitoring data is preprocessed in the following steps: 1) for each power transmission and transformation qualification rate, the value measured by the monitoring terminal over the entire monitoring period is taken as the qualification rate of that position; 2) the monthly averages of each power transmission and transformation indicator recorded by the monitoring terminal are averaged over the entire monitoring period to obtain the average value of each indicator at that position; 3) every indicator value calculated in the preceding steps is normalized so that all data lies in [0, 1]. After preprocessing, the abnormal behavior data and every monitored power transmission and transformation indicator are mapped to values in the interval [0, 1].
The correlation coefficients between the variables are calculated, giving an m × n association matrix A made up of correlation coefficients:
$$A = \begin{bmatrix} \rho_{x_1,y_1} & \cdots & \rho_{x_1,y_n} \\ \vdots & \ddots & \vdots \\ \rho_{x_m,y_1} & \cdots & \rho_{x_m,y_n} \end{bmatrix}$$
In this formula, the row variables of the matrix are the abnormal behavior statistics of the power system, denoted $x_i$, $i = 1, \dots, m$, and the column variables are the power transmission and transformation monitoring data, denoted $y_j$, $j = 1, \dots, n$. $\rho_{x_i,y_j}$ is the correlation coefficient of $x_i$ and $y_j$.
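The preprocessing and the association matrix can be sketched in a few lines of Python. The text does not say which correlation coefficient is used; this sketch assumes the Pearson coefficient and min-max normalization, and the function names are illustrative.

```python
import math


def min_max(values):
    """Map a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def association_matrix(xs, ys):
    """Rows: abnormal-behavior series x_i; columns: monitoring series y_j."""
    xs = [min_max(x) for x in xs]
    ys = [min_max(y) for y in ys]
    return [[pearson(x, y) for y in ys] for x in xs]
```

Because the Pearson coefficient is invariant under positive linear rescaling, the normalization does not change the matrix here; it is kept only to follow the order of the steps in the description.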
For the structured CIM-based grid operating data, the present invention represents the extraction and conversion behavior for grid structure data as a behavior model four-tuple N = (P, W, O, M), where P denotes the data set of the grid system data sources, W denotes the data set of the CIM-based data warehouse, O denotes a set of mutually independent extraction tasks, and M denotes the metadata set of the CIM-based data warehouse model. The extraction tasks are $O = \{O_1, O_2, O_3\}$. $O_1$ is the data cleaning task, which extracts preprocessed data from the power system according to the metadata of the CIM-based data warehouse. $O_2$ is the data loading task, which maps the data tables in the interface file area to the data tables in the transition file area of the CIM-based data warehouse and performs the related data conversion and loading. $O_3$ is the integration task, which verifies and maps the data in the buffer area according to the CIM-based data warehouse model and integrates the verified data into the CIM-based data warehouse.
Let T be the source table of the data conversion process and $T_i$ the replica of T in the buffer area of the CIM-based data warehouse at time i, with $T_i = \{D, T\}$, where D denotes the timestamp. Let I be the change replica of T from time i to time i+1; then $I = \{L_{sn}, M, T_o, T_n\}$, where $L_{sn}$ denotes the log sequence number of the data change, M denotes the data change operation, $T_o$ denotes the data before the change or before deletion, and $T_n$ denotes the changed or newly added data. Obtaining $T_{i+1}$ in this way affects the performance of the source database less than running a select operation directly on table T. In the buffer area of the CIM-based data warehouse, to map $T_{i+1}$ to fact table S, the fact data of $T_{i+1}$ in the period [i, i+1] is obtained first, and the related aggregation is then performed according to the metadata definition of the CIM-based data warehouse.
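Deriving $T_{i+1}$ from $T_i$ and the change replica I, rather than re-reading the source table, can be sketched as below. The `Change` record mirrors the four fields $L_{sn}$, M, $T_o$, $T_n$; the row key `id`, the operation names, and the function name are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    lsn: int              # L_sn: log sequence number of the change
    op: str               # M: "insert", "update", or "delete" (assumed vocabulary)
    old: Optional[dict]   # T_o: row before the change or deletion
    new: Optional[dict]   # T_n: row after the change, or the newly added row


def apply_changes(snapshot, changes, key="id"):
    """Replay the change log on snapshot T_i, in log order, to obtain T_{i+1}."""
    rows = {row[key]: dict(row) for row in snapshot}
    for change in sorted(changes, key=lambda c: c.lsn):
        if change.op == "delete":
            rows.pop(change.old[key], None)
        elif change.op in ("insert", "update"):
            rows[change.new[key]] = dict(change.new)
        else:
            raise ValueError(f"unknown operation {change.op!r}")
    return sorted(rows.values(), key=lambda r: r[key])
```

Only the log entries for the interval [i, i+1] are read, so the cost scales with the number of changes rather than with the size of T.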
The data cleaning process further uses similarity analysis to determine a set of similar samples of operation data that are strongly associated with the current time, then obtains typical feature sequences by hierarchical clustering, identifies fault data in the sequence under test with the feature sequences as reference, and finally corrects the identified fault data by moving the normal data of the feature sequence into the fault segment of the sequence under test. Through clustering, different typical feature sequences are extracted and used as references to identify and correct sequences under test that may contain fault data.
To retrieve data tuples faster, the present invention builds an in-memory index on the tuples of the relation set. The most frequently accessed relation tuples are then placed in a cache to reduce I/O overhead. The frequently accessed part of relation set R is stored in the cache, and relation set R and the real-time replica data D held in the operational data storage area are the inputs. In each iteration, a partition $P_i$ of relation set R is used as a probe input. A hash join is performed: all tuples in the cached relation data area are traversed and looked up in the hash table at the same time. Whenever a match succeeds, the matched data stream tuple is output. After the cached relation data area has been processed, the algorithm reads new tuples from the real-time grid operating data source, loads them into the hash table, and inserts their identifiers into a queue. To choose the next partition of R, the join attribute of the data tuple with the smallest timestamp in the queue is looked up first. The partition of R that holds this join attribute is then loaded into the cached relation data area using the index. In this way, every new partition is guaranteed to match at least one data tuple.
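The join described above can be approximated by a simplified two-phase sketch: stream tuples whose join key is in the cache are matched immediately, and the rest wait in a hash table while a queue of keys, oldest first, decides which partition of R is fetched through the index next. The dictionaries standing in for the cache and the disk index, the join key name `k`, and the function name are assumptions, and the iteration structure is condensed compared with the description.

```python
from collections import defaultdict, deque


def stream_join(stream, relation_index, cache, key="k"):
    """Join stream tuples with relation R.

    relation_index: join key -> rows of R (stands in for the indexed partitions).
    cache:          join key -> rows of R for the frequently accessed keys.
    """
    pending = defaultdict(list)  # hash table of stream tuples waiting for R
    queue = deque()              # join keys in arrival order (oldest first)
    output = []

    for tup in stream:
        if tup[key] in cache:
            # Cache phase: frequently accessed keys are matched without disk I/O.
            output.extend((tup, row) for row in cache[tup[key]])
        else:
            pending[tup[key]].append(tup)
            queue.append(tup[key])

    while queue:
        # The oldest waiting tuple decides which partition of R is loaded next.
        k = queue.popleft()
        waiting = pending.pop(k, [])
        if not waiting:
            continue  # this key was already served by an earlier partition load
        for row in relation_index.get(k, []):
            output.extend((tup, row) for tup in waiting)
    return output
```

Choosing the partition from the oldest queued key is what guarantees that each loaded partition matches at least one waiting tuple, provided R contains that key.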
In the similarity analysis of the grid data model by the data analysis server, the relationship between different sequences is judged from the shape of the sequence curves. Time-feature-related factors are selected as the samples for calculating the degree of association. The specific calculation steps are as follows:
(1) Let the current time sequence $Y = \{Y(m) \mid m = 1, 2, \dots, p\}$ be the reference sequence and the historical operation data sequences $X_i = \{X_i(m) \mid m = 1, 2, \dots, p\}$, $i = 1, 2, \dots, k$, be the comparison sequences, where p is the number of sequence elements.
(2) Calculate
(3) Calculate the association coefficient
where $\zeta_i(m)$ is the association coefficient of Y(m) with respect to $X_i(m)$, $\Delta_i(m) = |y(m) - x_i(m)|$, and ρ is the resolution coefficient, whose value lies in the interval (0, 1).
(4) Calculate the degree of association:
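The formulas for steps (2) to (4) are not reproduced in this text. The surviving definitions ($\Delta_i(m)$, a resolution coefficient ρ in (0, 1), an association coefficient $\zeta_i(m)$) match standard grey relational analysis, so the sketch below assumes that method: $\zeta_i(m) = (\Delta_{min} + \rho\,\Delta_{max}) / (\Delta_i(m) + \rho\,\Delta_{max})$, with the degree of association taken as the mean of $\zeta_i(m)$ over m. The normalization of step (2) is unknown, so the inputs are assumed to be already normalized.

```python
def grey_relational_degrees(reference, comparisons, rho=0.5):
    """Degree of association of each comparison sequence with the reference.

    Assumes standard grey relational analysis and pre-normalized inputs.
    rho is the resolution coefficient, 0 < rho < 1.
    """
    # Delta_i(m) = |y(m) - x_i(m)| for every comparison sequence i.
    deltas = [[abs(y - x) for y, x in zip(reference, seq)] for seq in comparisons]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        # Every comparison sequence coincides with the reference.
        return [1.0 for _ in comparisons]
    degrees = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(coeffs) / len(coeffs))
    return degrees
```

Sequences with a degree above a chosen threshold would form the similar sample set used in the later clustering stage; the threshold itself is not given in the text.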
In the hierarchical clustering stage described above, let the data set be $X = \{x_1, x_2, \dots, x_n\}$, where n is the number of elements in X. Each element is a p-dimensional vector, and X contains k classes. Let the center of the i-th class be $v_i = \{v_{i1}, v_{i2}, \dots, v_{ip}\}$; the feature sequences are defined as the cluster centers. The membership degree of the j-th element of X in the i-th class center is $u_{ij}$; let $U = \{u_{ij}\}$ and $V = \{v_{ij}\}$.
The formula for $u_{ij}$ is:
where m is the weighting exponent and $d_{ij} = \lVert x_j - v_i \rVert$ is the distance from the j-th element to the i-th class center. The cluster center $v_i$ can be calculated as follows:
The clustering iteration finds the cluster centers and the membership matrix at which the objective function reaches its minimum. Let the objective function J be:
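The three formulas referred to above are not reproduced in this text. Although the stage is called hierarchical clustering, the definitions that survive (membership degrees $u_{ij}$, a weighting exponent m, distances $d_{ij}$ to class centers, and an objective function minimized by iteration) match the standard fuzzy c-means algorithm. Under that assumption, and only as a reconstruction, the membership update, the center update, and the objective function would read:

```latex
u_{ij} = \frac{1}{\sum_{c=1}^{k} \left( \dfrac{d_{ij}}{d_{cj}} \right)^{\frac{2}{m-1}}},
\qquad
v_i = \frac{\sum_{j=1}^{n} u_{ij}^{\,m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{\,m}},
\qquad
J(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^{\,m}\, d_{ij}^{\,2}
```

Here $m > 1$, and the memberships of each element sum to one over the k classes. The iteration alternates the two updates until J changes by less than a chosen tolerance.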
The clustering result is analyzed to determine the optimal partition. Suppose a data set of n sequences is divided into k classes $(C_1, C_2, \dots, C_k)$. For the i-th sequence x(i) in $C_a$, calculate the average distance a(i) between x(i) and the other sequences in its class. Let $d(i, C_b)$ be the average distance from x(i) to all sequences of another class $C_b$, and define $b(i) = \min\{d(i, C_b)\}$, $b = 1, 2, \dots, k$, $a \ne b$. From the average distance of each sequence to the samples in its own class and its dissimilarity to the sequences in other classes, the value for each sequence i is calculated as:
The average Dissim value over all samples in the data set is used to evaluate the quality of the clustering result; the maximum of the index corresponds to the optimal number of clusters.
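The per-sequence formula is missing from this text. The quantities a(i) and b(i) are defined exactly as in the silhouette coefficient, so the sketch below assumes $s(i) = (b(i) - a(i)) / \max\{a(i), b(i)\}$ and averages it over all sequences. The Euclidean distance and the function names are likewise assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def silhouette(sequences, labels):
    """Average silhouette value of a partition; requires at least two classes."""
    classes = set(labels)
    scores = []
    for i, x in enumerate(sequences):
        own = [distance(x, y) for j, y in enumerate(sequences)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # singleton class: score is conventionally 0
            continue
        a = mean(own)  # a(i): average distance within the sequence's own class
        b = min(       # b(i): smallest average distance to any other class
            mean([distance(x, y) for j, y in enumerate(sequences) if labels[j] == c])
            for c in classes if c != labels[i]
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(scores)
```

Running the clustering for several values of k and keeping the k with the largest average value corresponds to the selection rule stated above.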
For fault data judgment, suppose the similar sample set obtained by similarity analysis covers d days and each class n contains $d_n$ days, with n taking values from 1 to k. The maximum rate of change of the operation data at time t is denoted $\alpha_{\max}(t, d_n)$:
$$\alpha_{\max}(t, d_n) = \max_{i = 1, \dots, d_n} \left\{ \frac{L(d-i, t) - L(d-i, t-1)}{L(d-i, t-1)} \right\}$$
where the function L(d, t) is the operation data at time t on day d.
Let the sequence under test be $X_d = (x_{d1}, x_{d2}, \dots, x_{dm})$, where m is the number of samples per day, and let the feature sequence with the maximum membership degree be $X_t$. At sampling time t, the rate of change of $X_d$ relative to the feature sequence $X_t$ is:
$$\delta_t = \frac{x_{dt} - x_{tt}}{x_{tt}}$$
If $\delta_t > \alpha_{\max}(t, d_n)$, the sample is considered fault data. This method reduces the workload and improves computation speed and the working efficiency of the model.
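The fault test compares the deviation of the sequence under test from its feature sequence with the largest historical step change at the same time of day. A minimal sketch follows, assuming all values are nonzero, that `history` holds the $d_n$ daily curves of the matching class, and that the first sample of the day, which has no predecessor, is never flagged; the function names are illustrative.

```python
def max_change_rate(history, t):
    """alpha_max(t, d_n): largest relative change from t-1 to t over the class's days."""
    return max((day[t] - day[t - 1]) / day[t - 1] for day in history)


def flag_faults(sequence, feature, history):
    """Return one boolean per sample: True where delta_t exceeds alpha_max(t, d_n)."""
    flags = [False]  # t = 0 has no previous sample, so it is not tested here
    for t in range(1, len(sequence)):
        delta = (sequence[t] - feature[t]) / feature[t]
        flags.append(delta > max_change_rate(history, t))
    return flags
```

As in the description, the comparison is one-sided; detecting abnormally low values would need the absolute value of $\delta_t$, which the text does not state.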
If the points between p and q of some sequence $X_d$ are detected as fault data, let the two feature sequences with the largest membership values be $X_{t1}$ and $X_{t2}$. The feature sequences with the largest membership degrees are used in the actual correction. The correction formulas are as follows:
$$X'_d(i) = X'_{t1}(i)\,\frac{u_{t1,i}}{u_{t1,i} + u_{t2,i}} + X'_{t2}(i)\,\frac{u_{t2,i}}{u_{t1,i} + u_{t2,i}}$$
$$X'_{t1}(i) = X'_{t1}(i) \times \left[ \frac{X_d(p-1)}{X_{t1}(p-1)} + \frac{X_d(q+1)}{X_{t1}(q+1)} \right]$$
$$X'_{t2}(i) = X'_{t2}(i) \times \left[ \frac{X_d(p-1)}{X_{t2}(p-1)} + \frac{X_d(q+1)}{X_{t2}(q+1)} \right]$$
where i = p, p+1, ..., q.
In conclusion the invention proposes a kind of data cleansing conversion method based on CIM, in improved operation of power networks number Under support according to model and distributed data platform, source data is extracted, cleaned, is integrated, ensures the quality of data and reliable Property, it realizes the unified standard data output based on database, there is the broad applicability for supporting clustered deploy(ment) and concurrent, it can It is integrated for electric network data automation and analysis provides reliable support.
Obviously, it should be appreciated by those skilled in the art, each module of the above invention or each steps can be with general Computing system realize that they can be concentrated in single computing system, or be distributed in multiple computing systems and formed Network on, optionally, they can be realized with the program code that computing system can be performed, it is thus possible to they are stored It is executed within the storage system by computing system.In this way, the present invention is not limited to any specific hardware and softwares to combine.
It should be understood that above-mentioned specific embodiment of the invention is used only for exemplary illustration or explains of the invention Principle, but not to limit the present invention.Therefore, that is done without departing from the spirit and scope of the present invention is any Modification, equivalent replacement, improvement etc., should all be included in the protection scope of the present invention.In addition, appended claims purport of the present invention Covering the whole variations fallen into attached claim scope and boundary or this range and the equivalent form on boundary and is repairing Change example.

Claims (8)

1. A CIM-based data cleaning and conversion method, characterized by comprising:
capturing operation data of a power system;
cleaning and converting the captured operation data of the power system to obtain data conforming to the CIM unified standard, and storing it in a distributed file system;
extracting data from the distributed file system and building a CIM-based distributed data warehouse.
2. The method according to claim 1, characterized in that the operation data includes equipment ledger information, operation and maintenance data, fault data, power flow topology data, and GIS equipment information.
3. The method according to claim 1, characterized by further comprising: storing the model information of the power system and the metadata of the CIM-based distributed data warehouse in MongoDB.
4. The method according to claim 3, characterized in that the CIM-based distributed data warehouse, after tasks are decomposed by MapReduce, extracts data directly from the distributed file system for analysis, performs data management and data access in a unified way, and implements model data mapping and performance optimization.
5. The method according to claim 4, characterized in that the model data mapping includes mapping between the attributes of the power system business model and the model data of the different types of underlying data sources.
6. The method according to claim 1, characterized in that the cleaning and conversion comprises two stages: in the first stage, data is extracted from the data sources into the operational data buffer area; in the second stage, data is extracted from the operational data buffer area into the CIM-based data warehouse: (1) the first stage extracts the heterogeneous data sources into the operational data buffer area, creating in the buffer area a backup replica of the power system operation data with identical structure and identical content; (2) the second stage performs statistical merging and summarization on the data in the operational data buffer area and stores the data in the CIM-based data warehouse by incremental loading; the data extraction is incremental extraction, and if the increment cannot be determined at extraction time, it is calculated at load time; a time label is added when data is loaded into the CIM-based data warehouse; in the extraction process from the operational data to the CIM-based data warehouse, after the data is read from the operational data buffer area, unified information encoding is performed first, and fact table data and dimension table data are then processed differently; for changes to fact table data, the incremental loading mode is chosen according to the kind of change: if the data changes over time, a timestamp increment is used, and if the data changes randomly, a full-table comparison is performed to obtain the data increment; for changes to dimension table data, the offline data is overwritten with the latest CIM-based data.
7. The method according to claim 4, characterized in that the extraction of data from the distributed file system for analysis further comprises similarity analysis of the power system operation data.
8. The method according to claim 7, characterized in that, in the similarity analysis of the power system operation data, the relationship between different sequences is judged from the shape of the sequence curves, and time-feature-related factors are selected as the samples for calculating the degree of association; the specific calculation steps are as follows:
(1) let the current time sequence $Y = \{Y(m) \mid m = 1, 2, \dots, p\}$ be the reference sequence and the historical operation data sequences $X_i = \{X_i(m) \mid m = 1, 2, \dots, p\}$, $i = 1, 2, \dots, k$, be the comparison sequences, where p is the number of sequence elements;
(2) calculate
(3) calculate the association coefficient $\zeta_i(m)$:
where $\zeta_i(m)$ is the association coefficient of Y(m) with respect to $X_i(m)$:
where $\Delta_i(m) = |y(m) - x_i(m)|$, and ρ is the resolution coefficient, whose value lies in the interval (0, 1):
(4) calculate the degree of association:

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810887270.7A CN109213752A (en) 2018-08-06 2018-08-06 A kind of data cleansing conversion method based on CIM

Publications (1)

Publication Number Publication Date
CN109213752A true CN109213752A (en) 2019-01-15

Family

ID=64987594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810887270.7A Pending CN109213752A (en) 2018-08-06 2018-08-06 A kind of data cleansing conversion method based on CIM

Country Status (1)

Country Link
CN (1) CN109213752A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework
CN104636204A (en) * 2014-12-04 2015-05-20 中国联合网络通信集团有限公司 Task scheduling method and device
CN105138405A (en) * 2015-08-06 2015-12-09 湖南大学 To-be-released resource list based MapReduce task speculation execution method and apparatus
CN106528880A (en) * 2016-12-14 2017-03-22 云南电网有限责任公司电力科学研究院 Normalizing method and system for data structure format of multi-source power service data
CN107451622A (en) * 2017-08-18 2017-12-08 长安大学 A kind of tunnel operation state division methods based on big data cluster analysis
US9906604B2 (en) * 2015-03-09 2018-02-27 Dell Products L.P. System and method for dynamic discovery of web services for a management console
CN107766541A (en) * 2017-10-30 2018-03-06 北京国电通网络技术有限公司 With electricity consumption overall situation full dose data transfer and storage method, device, electronic equipment
CN107798139A (en) * 2017-11-23 2018-03-13 国网上海市电力公司 A kind of master/slave data isomery method based on CIM/XML

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
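叶彬; 曾伟民; 肖治华; 郭创新; 朱乘治; 曹一家: "Application of data warehouses in power systems", Proceedings of the CSU-EPSA (《电力系统及其自动化学报》) *
周秀文: "Research and application of grey relational degree", China Master's Theses Full-text Database, Basic Sciences (《中国优秀硕士论文全文数据库 基础科学辑》) *
尚博祥; 王扬; 孙轶凡: "Application of the Common Information Model (CIM) in smart grid informatization", Chinese Society for Electrical Engineering 2012 Power Industry Informatization Annual Conference (《中国电机工程学会2012电力行业信息化年会》) *
赵林; 张令涛; 马仲佳: "Dispatch-side grid model management and analysis architecture based on big data technology", Power System Technology (《电网技术》) *
钟庆; 陈伟坤; 许中: "Association analysis of equipment fault statistics and power quality monitoring data", Power Capacitor & Reactive Power Compensation (《电力电容器与无功补偿》) *
陈盛荣, 刘广钟: "Research on optimization strategies for ETL systems in a distributed environment", Modern Computer (Professional Edition) (《现代计算机(专业版)》) *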

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918367A (en) * 2019-03-19 2019-06-21 北京百度网讯科技有限公司 A kind of cleaning method of structural data, device, electronic equipment and storage medium
CN111177126A (en) * 2019-08-01 2020-05-19 腾讯科技(深圳)有限公司 Information processing method, device and equipment
CN110968627A (en) * 2019-11-11 2020-04-07 南京峰凯云歌数据科技有限公司 Big data analysis method and system
CN111177128B (en) * 2019-12-11 2023-10-27 国网天津市电力公司电力科学研究院 Metering big data batch processing method and system based on improved outlier detection algorithm
CN111177128A (en) * 2019-12-11 2020-05-19 国网天津市电力公司电力科学研究院 Batch processing method and system for big metering data based on improved outlier detection algorithm
CN111506640A (en) * 2020-04-21 2020-08-07 北京中电普华信息技术有限公司 Mapping method and device
CN112650744A (en) * 2020-12-31 2021-04-13 广州晟能软件科技有限公司 Data management method for preventing secondary pollution of data
CN112650744B (en) * 2020-12-31 2024-04-30 广州晟能软件科技有限公司 Data treatment method for preventing secondary pollution of data
CN112948203A (en) * 2021-02-03 2021-06-11 刘靖宇 Elevator intelligent inspection method based on big data
CN112948203B (en) * 2021-02-03 2023-04-07 刘靖宇 Elevator intelligent inspection method based on big data
CN113742086A (en) * 2021-09-17 2021-12-03 中环曼普科技(南京)有限公司 Distributed parallel analysis type data cluster management method and system
CN116821223A (en) * 2023-08-25 2023-09-29 云南三耳科技有限公司 Industrial visual control platform and method based on digital twinning
CN116821223B (en) * 2023-08-25 2023-11-24 云南三耳科技有限公司 Industrial visual control platform and method based on digital twinning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190115