CN109213752A

CN109213752A - A CIM-based data cleaning and conversion method

Info

Publication number: CN109213752A
Application number: CN201810887270.7A
Authority: CN
Inventors: 李晖; 陈清族; 陈世春; 邹墨; 何德明; 许梓明; 马汉斌; 陈珺; 谢驰; 程友平; 温天宝; 林超; 周暖青; 林永辉; 刘化龙; 李雪梅; 谢妙红; 林朝灯; 李建平
Original assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Priority date: 2018-08-06
Filing date: 2018-08-06
Publication date: 2019-01-15

Abstract

The present invention provides a CIM-based data cleaning and conversion method. The method comprises: capturing operation data of a power system; cleaning and converting the captured operation data to obtain data conforming to a unified CIM-based standard, and storing the data in a distributed file system; and extracting data from the distributed file system to construct a CIM-based distributed data warehouse. Supported by an improved grid operation data model and a distributed data platform, the method extracts, cleans, and integrates source data, ensures data quality and reliability, and outputs unified standard data based on the database. It supports cluster deployment and concurrency, which makes it broadly applicable, and it provides reliable support for the automated integration and analysis of grid data.

Description

A CIM-based data cleaning and conversion method

Technical field

The invention belongs to the field of power grid big data, and in particular relates to a CIM-based data cleaning and conversion method.

Background

With the widespread use of all kinds of power transmission and transformation equipment, the volume of grid operation data is growing geometrically. Analyzing massive operation data quickly, and detecting and mining abnormal data, raises the challenge of how to process and analyze grid operation big data effectively and efficiently. Because the software and hardware systems and the resources of the individual regional and provincial grid companies differ considerably, building a data log analysis platform is difficult. Traditional grid operation data platforms can no longer meet enterprise needs for optimized storage and parallel processing of operation data. Traditional data storage structures are intuitive, but their significant drawback is a large amount of data redundancy: operation information is stored repeatedly, which complicates combined operations across different operation data tables and lowers the query efficiency of operation data.

Summary of the invention

The object of the invention is to process grid operation big data effectively, so that an enterprise can integrate and merge multiple network systems and achieve unified, efficient big data analysis. By establishing a distributed data cleaning and conversion framework and an operable data area, the invention avoids conflicts between data conversion processing and data queries. Grid data mining is then carried out on the established data warehouse, including association analysis and abnormal data identification using an improved grid data model.

To solve the above problems, the invention proposes a CIM-based data cleaning and conversion method, comprising:

capturing operation data of a power system;

cleaning and converting the captured operation data of the power system to obtain data conforming to a unified CIM-based standard, and storing the data in a distributed file system;

extracting data from the distributed file system to construct a CIM-based distributed data warehouse.

Preferably, the operation data includes equipment ledger information, operation and maintenance data, fault data, power flow topology data, and GIS equipment information.

Preferably, the model information of the power system and the metadata of the CIM-based distributed data warehouse are stored in MangoDB.

Preferably, after decomposing a task with MapReduce, the CIM-based distributed data warehouse extracts data directly from the distributed file system for analysis, performs data management and data access in a unified manner, and implements model data mapping and performance optimization.

Preferably, the model data mapping maps each attribute of the power system business model to the model data of the underlying data sources of different types.

Preferably, the cleaning and conversion comprises two stages: in the first stage, data is extracted from the data sources into an operable data buffer; in the second stage, data is extracted from the operable data buffer into the CIM-based data warehouse. (1) First stage: the heterogeneous data sources are extracted into the operable data buffer; this stage creates, in the operable data buffer, a backup copy of the power system operation data with identical structure and content. (2) Second stage: the data in the operable data buffer is aggregated, merged, and summarized, and is stored in the CIM-based data warehouse by incremental loading. The extraction is incremental; if the increment cannot be determined at extraction time, it is calculated at load time, and a time tag is added when data is loaded into the CIM-based data warehouse. In the extraction flow from the operable data buffer to the CIM-based data warehouse, data read from the buffer first undergoes unified information coding, and then fact table data and dimension table data are processed differently. For changes to fact table data, the incremental loading mode is chosen according to the kind of change: if the data changes over time, a timestamp increment is used; if the data changes irregularly, the increment is determined by a full-table comparison. For changes to dimension table data, the latest CIM-based data overwrites the offline data.

Preferably, extracting data from the distributed file system for analysis further comprises similarity analysis of power system operation data.

Preferably, in the similarity analysis of power system operation data, the relationship between different sequences is judged from the shape of the sequence curves, and time-feature-related factors are selected as the samples for computing the degree of association. The specific calculation steps are as follows:

(1) Let the current time sequence Y = {Y(m) | m = 1, 2, …, p} be the reference sequence, and let the historical operation data sequences X_i = {X_i(m) | m = 1, 2, …, p}, i = 1, 2, …, k, be the comparison sequences, where p is the number of sequence elements.

(2) Calculate: [formula not reproduced in the source text]

(3) Calculate the correlation coefficient ζ_i(m): [formula not reproduced in the source text]

where ζ_i(m) is the correlation coefficient between Y(m) and X_i(m) at point m, Δ_i(m) = |Y(m) − X_i(m)|, and ρ is the resolution coefficient, with values in the interval (0, 1).

(4) Calculate the degree of association: [formula not reproduced in the source text]
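
The formulas for steps (2) to (4) are missing from this text. The symbols that survive (Δ_i(m), the resolution coefficient ρ in (0, 1), a per-point coefficient followed by an overall degree) match conventional grey relational analysis, so the sketch below assumes the conventional form: ζ_i(m) = (Δmin + ρ·Δmax) / (Δ_i(m) + ρ·Δmax), with the degree of association taken as the mean of ζ_i(m) over m. Both formulas, the function name `grey_relational_degrees`, and the default ρ = 0.5 are assumptions, not the patent's stated method.

```python
from typing import List, Sequence


def grey_relational_degrees(
    reference: Sequence[float],
    comparisons: Sequence[Sequence[float]],
    rho: float = 0.5,
) -> List[float]:
    """Return one degree of association per comparison sequence X_i.

    Assumes the conventional grey relational formulas; the source text
    omits its own formulas, so this is an illustration only.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    # Delta_i(m) = |Y(m) - X_i(m)|
    deltas = [[abs(y - x) for y, x in zip(reference, seq)] for seq in comparisons]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0.0:
        # Every comparison sequence equals the reference sequence.
        return [1.0 for _ in comparisons]
    degrees = []
    for row in deltas:
        # zeta_i(m) = (d_min + rho * d_max) / (Delta_i(m) + rho * d_max)
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        # Assumed degree of association: mean coefficient over all m.
        degrees.append(sum(coeffs) / len(coeffs))
    return degrees
```

A historical sequence identical to the reference scores 1.0, and the score falls as the curves diverge, which is the behaviour the similarity analysis relies on.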

Compared with the prior art, the present invention has the following advantages:

The invention proposes a CIM-based data cleaning and conversion method. Supported by an improved grid operation data model and a distributed data platform, it extracts, cleans, and integrates source data, ensures data quality and reliability, and outputs unified standard data based on the database. It supports cluster deployment and concurrency, which makes it broadly applicable, and it provides reliable support for the automated integration and analysis of grid data.

Brief description of the drawings

Fig. 1 is a flow chart of the CIM-based data cleaning and conversion method according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings. The description presents, by way of example and not limitation, specific implementations consistent with the principles of the invention. The embodiments are described in enough detail for those skilled in the art to practice the invention; other embodiments may be used, and the structure of individual elements may be changed or replaced, without departing from the scope and spirit of the invention. The following detailed description should therefore not be understood in a limiting sense. To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to the drawings.

One aspect of the present invention provides a CIM-based data cleaning and conversion method. Fig. 1 is a flow chart of the method according to an embodiment of the present invention.

The CIM-based power system operation data monitoring platform of the present invention comprises a data collection server, a data processing and storage server, and a data analysis server.

The data collection server captures power system operation data through sensors and similar devices. The operation data includes equipment ledger information, operation and maintenance data, fault data, power flow topology data, and GIS equipment information, and also includes unstructured images and video. The massive heterogeneous data contains a large amount of complex, redundant, and erroneous data, from which data conforming to a unified standard must be extracted in a short time.

The data processing and storage server integrates a distributed file system with a MangoDB database. Monitoring data that conforms to the unified standard is stored in the distributed file system. The model information of the power system and the metadata of the CIM-based distributed data warehouse are stored in MangoDB, as are the tables and fields created by the data warehouse. Whenever a data operation is executed, the MangoDB engine is started to verify that the metadata exists.

The data analysis server performs distributed similarity analysis of power system operation data. After the CIM-based distributed data warehouse decomposes a task with MapReduce, it extracts data directly from the distributed file system for analysis and performs data management and data access in a unified manner. Model data mapping and performance optimization are implemented at this layer. The model data mapping maps each attribute of the power system business model to the model data of the underlying data sources of different types, supports access to the CIM-based data warehouse, relational databases, and non-relational databases, and provides a unified query and update API based on the business model. The performance optimization provides a two-level cache and asynchronous, parallel data queries.

On top of the distributed architecture described above, the data processing and storage server provides a CIM-based data cleaning and conversion framework comprising a semantic analysis module, a MangoDB rule base, a scheduling module, and a cleaning and conversion module. The power system receives data cleaning and conversion tasks requested by users and interprets each task as a workflow graph with a DAG structure in a unified format. Because the execution semantics of the data conversion are not necessarily logically optimal, the workflow graph is handed to the semantic analysis module for optimization.

The semantic analysis module analyzes and optimizes the cleaning and conversion workflow graph formatted by the power system. By traversing each node in the graph, it determines the attributes of the activities in the workflow, transforms the workflow graph, and finally sends the optimized workflow graph to the coordination unit for execution. The detailed process is as follows:

1. Traverse each node of the workflow graph in a loop. A node with in-degree 0 is a power system data source: determine the data volume of the data source and record the relevant CIM-based information in the MangoDB rule base. A node with out-degree 0 is an operable data set: record its metadata in the MangoDB rule base. A node whose out-degree and in-degree are both greater than 0 is an activity node: determine its activity type and, for a binary activity node that will be used to divide the workflow, record the attributes and position of the activity.

2. After traversing the nodes, apply optimizations such as exchanging the order of nodes in the workflow to reduce the data exchanged between activity nodes.

3. Divide the optimized workflow into multiple sub-workflows, using binary activities as the boundaries. Within each sub-workflow, collect the unary activities into a group, mark the group, and send it to the coordination unit to await execution. The group marking gives the coordination unit a reference for dynamically optimizing the workflow.
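
The node classification in step 1 can be sketched as follows. This is a minimal illustration, assuming the workflow graph is given as a list of directed edges and that a "binary activity" is an activity node with two or more inputs; the function name `classify_nodes` and the group labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def classify_nodes(edges: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group workflow nodes into sources, sinks, and activities by degree."""
    in_deg: Dict[str, int] = defaultdict(int)
    out_deg: Dict[str, int] = defaultdict(int)
    nodes: List[str] = []
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        for name in (src, dst):
            if name not in nodes:
                nodes.append(name)
    groups: Dict[str, List[str]] = {
        "sources": [], "sinks": [], "unary": [], "binary": [],
    }
    for name in nodes:
        if in_deg[name] == 0:
            groups["sources"].append(name)   # in-degree 0: data source
        elif out_deg[name] == 0:
            groups["sinks"].append(name)     # out-degree 0: operable data set
        elif in_deg[name] >= 2:
            groups["binary"].append(name)    # assumption: 2+ inputs = binary activity
        else:
            groups["unary"].append(name)     # single-input activity
    return groups
```

The binary nodes returned here are the boundaries at which step 3 would cut the workflow into sub-workflows.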

The coordination unit further comprises a divide-and-conquer module and a scheduling module. The divide-and-conquer module partitions data so that existing resources are fully used and the performance advantage of parallel computing is realized; it horizontally partitions the data into multiple data streams according to partition rules. The scheduling module then packages the corresponding data cleaning and conversion activities and distributes them to the distributed parallel cleaning and conversion modules for execution. The coordination unit receives execution information from the cleaning and conversion modules to track the progress of the data cleaning activity nodes in real time. The two modules work in coordination: the former aggregates and analyzes the execution information from the cleaning and conversion modules to derive an optimized data partitioning strategy; the latter assigns tasks to the cleaning and conversion modules according to the real-time optimization results.

The cleaning and conversion module executes the computation packages distributed by the coordination unit in a distributed computing environment and caches intermediate conversion results locally. Network bandwidth is used for data transmission only when the output data of multiple nodes must be collected.

The optimization based on semantic logic is completed in the semantic analysis phase. After traversing all activity nodes of the data cleaning and conversion workflow, the semantic analysis module modifies the workflow according to the attributes of the different activity nodes, using strategies such as exchanging the execution order of activities and merging semantically duplicate activities, without changing the result of the data workflow. The optimized workflow reduces the volume of data flowing between nodes. For data cleaning and conversion tasks described on the basis of CIM, the framework uses an optimization strategy based on relational databases.

The divide-and-conquer module partitions data as follows. Let the data source T be horizontally partitioned into T₁ and T₂, so that T = T₁ ∪ T₂. Each activity Act_i (i ∈ [1, m]) in the CIM-based data cleaning process is regarded as a function mapping on T. The module checks whether the data cleaning and conversion workflow contains a serial sequence of activities with the following property: if Act_m(Act_{m-1}(…(Act_i(T)))) = {D₁, D₂, …, D_m}, where D₁, D₂, …, D_m are the results of the function mappings that the activities in the workflow subgraph apply to T, then Act_m(Act_{m-1}(…(Act_i(T₁)))) ∪ Act_m(Act_{m-1}(…(Act_i(T₂)))) = {D₁, D₂, …, D_m}. In other words, if M is a relational operation on T, it satisfies M(T) = M(T₁) ∪ M(T₂). If such a sequence exists in the workflow, its activities are merged into one group. By partitioning the data, the activity groups are assigned to different cleaning and conversion modules for asynchronous execution, which produces a pipeline effect.
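
The property M(T) = M(T₁) ∪ M(T₂) holds for row-wise activities such as filters and per-row conversions. The sketch below illustrates it with two hypothetical activities, `drop_negative` and `to_kilowatts`, applied to rows of the assumed form `(id, value)`; none of these names come from the patent.

```python
from typing import Callable, FrozenSet, List, Tuple

Row = Tuple[int, float]
Activity = Callable[[FrozenSet[Row]], FrozenSet[Row]]


def drop_negative(rows: FrozenSet[Row]) -> FrozenSet[Row]:
    """Row-wise filter: remove readings below zero."""
    return frozenset(r for r in rows if r[1] >= 0)


def to_kilowatts(rows: FrozenSet[Row]) -> FrozenSet[Row]:
    """Row-wise map: convert watts to kilowatts."""
    return frozenset((rid, value / 1000.0) for rid, value in rows)


def run_chain(rows: FrozenSet[Row], chain: List[Activity]) -> FrozenSet[Row]:
    """Apply the serial activities Act_i ... Act_m in order to one partition."""
    for activity in chain:
        rows = activity(rows)
    return rows


def run_partitioned(
    t1: FrozenSet[Row], t2: FrozenSet[Row], chain: List[Activity]
) -> FrozenSet[Row]:
    """Process T1 and T2 independently and take the union of the results."""
    return run_chain(t1, chain) | run_chain(t2, chain)
```

Activities that need to see all rows at once, such as a global aggregate or a deduplication across partitions, do not satisfy the property and would not be merged into such a group.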

For a data cleaning and conversion workflow executed in parallel, the data distribution may not match the MapReduce node resources. In that case, after the system starts executing the computation task, it decides whether to partition the data based on the current execution progress. Data partitioning is triggered when a MapReduce node becomes idle or when a new unary activity task starts. The strategy runs as follows. First, visit the cleaning and conversion modules in the MapReduce nodes that are executing computation tasks and select the activity with the latest completion time as the object whose data will be partitioned. Next, visit the idle cleaning and conversion modules in the MapReduce nodes and check whether each meets the execution condition, namely that the node's idle time window is greater than the sum of the cross-machine transmission time and the computation time of the partitioned data; otherwise partitioning is pointless. Record the cleaning and conversion modules that qualify. Finally, calculate the volume of data to partition so that the time needed to transmit and process that data on the idle cleaning and conversion module is as short as possible.
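
The selection logic of this strategy can be sketched as below. It is a simplified illustration: the amount of data to move (`rows_to_move`) is passed in rather than optimized, and the classes `RunningModule` and `IdleModule`, their fields, and the linear cost model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RunningModule:
    name: str
    remaining_rows: int        # rows still to be processed
    rows_per_second: float     # observed processing rate

    @property
    def finish_in(self) -> float:
        """Estimated seconds until this module completes its activity."""
        return self.remaining_rows / self.rows_per_second


@dataclass
class IdleModule:
    name: str
    idle_window: float                # seconds the node is expected to stay idle
    rows_per_second: float            # processing rate
    transfer_rows_per_second: float   # cross-machine transfer rate

    def cost(self, rows: int) -> float:
        """Cross-machine transfer time plus computation time for `rows` rows."""
        return rows / self.transfer_rows_per_second + rows / self.rows_per_second


def plan_split(
    running: List[RunningModule], idle: List[IdleModule], rows_to_move: int
) -> Optional[Tuple[str, str]]:
    """Return (donor, receiver) names, or None if partitioning is not worthwhile."""
    if not running:
        return None
    # The activity that would finish last is the one whose data gets partitioned.
    donor = max(running, key=lambda m: m.finish_in)
    # Execution condition: idle window > transfer time + computation time.
    eligible = [m for m in idle if m.idle_window > m.cost(rows_to_move)]
    if not eligible:
        return None
    receiver = min(eligible, key=lambda m: m.cost(rows_to_move))
    return donor.name, receiver.name
```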

When an idle cleaning and conversion module appears in a MapReduce node and an activity is waiting to be scheduled, the system activates the data partitioning algorithm. Data processing in the cleaning and conversion module is divided into two main stages: in the first stage, data is extracted from the data sources into the operable data buffer; in the second stage, data is extracted from the operable data buffer into the CIM-based data warehouse:

(1) First stage: the heterogeneous data sources are extracted into the operable data buffer. This stage creates, in the operable data buffer, a backup copy of the power system operation data with identical structure and content.

(2) Second stage: the data in the operable data buffer is aggregated, merged, and summarized, and is stored in the CIM-based data warehouse by incremental loading. The extraction is incremental; if the increment cannot be determined at extraction time, it is calculated at load time, and a time tag is added when data is loaded into the CIM-based data warehouse. In the extraction flow from the operable data buffer to the CIM-based data warehouse, data read from the buffer first undergoes unified information coding, and then fact table data and dimension table data are processed differently. For changes to fact table data, the incremental loading mode is chosen according to the kind of change: if the data changes over time, a timestamp increment is used; if the data changes irregularly, the increment is determined by a full-table comparison. For changes to dimension table data, the latest CIM-based data overwrites the offline data.
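
The three loading modes of the second stage can be illustrated with plain dictionaries standing in for table rows. This is a minimal sketch; the field names `id`, `ts`, and `load_ts` and all function names are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def timestamp_increment(buffer_rows: List[Row], last_loaded_ts: float) -> List[Row]:
    """Fact data that changes over time: keep rows newer than the last load."""
    return [r for r in buffer_rows if r["ts"] > last_loaded_ts]


def full_compare_increment(
    buffer_rows: List[Row], warehouse_rows: List[Row], key: str = "id"
) -> List[Row]:
    """Fact data that changes irregularly: compare against the whole table."""
    existing = {r[key]: r for r in warehouse_rows}
    return [r for r in buffer_rows if existing.get(r[key]) != r]


def refresh_dimension(
    warehouse_dim: List[Row], latest_dim: List[Row], key: str = "id"
) -> List[Row]:
    """Dimension data: the latest record overwrites the stored one."""
    merged = {r[key]: r for r in warehouse_dim}
    merged.update({r[key]: r for r in latest_dim})
    return list(merged.values())


def with_load_time(rows: List[Row], load_ts: float) -> List[Row]:
    """Add a time tag to each row as it is loaded into the warehouse."""
    return [{**r, "load_ts": load_ts} for r in rows]
```

The full-table comparison is the more expensive mode, which is consistent with the text reserving it for data whose changes cannot be tracked by timestamp.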

The operable data buffer serves as a backup of the power system's database management system and backs up power system operation data such as production defects and network load. During cleaning and conversion, the backup of grid operation data in the operable data buffer can be used as the data source; after conversion and cleaning, the data is loaded into the topic models of the CIM-based data warehouse. All power system operation data that is to enter the CIM-based data warehouse is first transmitted directly to the operable data buffer, then cleaned, converted, and mapped, and then transferred from the buffer to the target topic in the CIM-based data warehouse. The data in the operable data buffer is deleted once processing is complete.

The temporary data store of the operable data buffer holds the raw data of the power system and the raw data transferred from each heterogeneous system. Grid operation data is stored by topic. The data in the temporary store is cleaned and then stored in topic-oriented data marts according to the data model. The data in the topic data marts passes through one further conversion step before entering the CIM-based data warehouse, which is divided into multiple topic models and dimension table models.

In a further embodiment, the cleaning and conversion module performs the following steps when processing the CIM-based data conversion:

(1) Determine where in the power system data source conversion and cleaning are to be performed. Capture null field values, then load them or replace them with data carrying another meaning, and route records to different target stores according to the null field values.

(2) Extract data samples from the data source, analyze whether the extracted data is consistent with its definition, look for abnormal data formats and structures, and define CIM business rules. Standardize the data format by defining constraints on field formats, and load the numeric values, times, and characters in the data source in a user-defined format. Split fields according to CIM business requirements.

(3) Verify data correctness using a lookup table, then replace invalid and missing data, and specify in advance the strategy for handling lost data.

(4) Transform the data into a standard data model with standardized data values and formats based on the definitions. While establishing constraints, replace invalid data that does not meet the conditions or export it to an error data set, and guarantee the uniqueness of the data primary key.
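
The four steps above can be sketched on a small record set. The sketch is an illustration under assumptions: records are dictionaries with hypothetical fields `id` and `device_type`, the lookup table is a set of valid type names, and null types are replaced with a hypothetical placeholder `UNKNOWN`.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def clean_rows(
    rows: List[Row], valid_types: Set[str], default_type: str = "UNKNOWN"
) -> Tuple[List[Row], List[Row]]:
    """Return (clean, errors) after null handling, normalization, and checks."""
    clean: List[Row] = []
    errors: List[Row] = []
    seen_keys = set()
    for original in rows:
        row = dict(original)  # leave the caller's data untouched
        # Step 1: replace a null field with a placeholder value.
        if row.get("device_type") is None:
            row["device_type"] = default_type
        # Step 2: normalize the field format.
        row["device_type"] = str(row["device_type"]).strip().upper()
        # Step 3: verify the value against the lookup table.
        if row["device_type"] != default_type and row["device_type"] not in valid_types:
            errors.append(row)
            continue
        # Step 4: keep primary keys unique; offenders go to the error set.
        if row.get("id") is None or row["id"] in seen_keys:
            errors.append(row)
            continue
        seen_keys.add(row["id"])
        clean.append(row)
    return clean, errors
```

Rows that fail a check are exported to the error set rather than silently dropped, so they remain available for later inspection.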

To minimize the impact of query conflicts, the present invention further divides the CIM-based data cleaning and conversion flow into synchronous conversion and asynchronous conversion, which process real-time grid operation data and offline data respectively. Asynchronous conversion loads offline operation data that has lost its real-time character from the power system data source into the data warehouse in batches at a predetermined period. Synchronous conversion actively captures operation data that changes in real time in the power system and loads it into the operable data storage area. After the query and analysis operations on the latest data in the operable data storage area are complete, and once certain system conditions are triggered, the data is imported in batches into the CIM-based data warehouse. The operable data storage area consists of multiple data replicas and a replica index based on a double queue structure. A replica is a data storage space with the same logical and physical structure, and can be created dynamically in the operable data storage area.

When a replica is created, a corresponding replica file is saved in the operable data storage area, and real-time grid operation data is loaded into the replica in order. The replica index consists of two queues, one horizontal and one vertical. A horizontal queue consists of replica nodes that share the same data item ID but have different timestamps; the vertical queue consists of the head nodes of the replica queues for the different data item IDs.

A replica queue consists of a queue head node and queue nodes. The queue head node has two attributes: a data item ID and a first address. The data item ID identifies the source of the data. Within one replica queue, the data of all replica nodes comes from the same data source and therefore shares the same data item ID; data with the same data item ID is called homologous data. The first address stores an address that points to the first replica node of the queue.

A queue node has five attributes: the replica node size, the replica node data timestamp, the operation flag, the data storage address, and the address of the next node in the queue. The node size identifies the amount of space occupied by the data that the current replica node points to. Replica nodes are sorted by timestamp in descending order. The operation flag marks which operation the data in the current replica node is undergoing: if the replica node is loading real-time grid operation data from the source into the operable data storage area, its flag is set to 0; if the data it points to needs to be batch-loaded from the operable data storage area into the CIM-based data warehouse, the flag is set to 1. The data storage address points to the location where the replica node's data is stored.

All replicas from the same data source form one replica queue, called a replica cluster. The first address of a replica cluster is the address of its queue head node. If the operable data storage area stores data with n different data item IDs, there are n replica clusters. The replica clusters are themselves organized as a queue, and the replica cluster queue has no header node. If no replica cluster queue exists in the operable data storage area, that is, no replica cluster exists yet, then the operable data storage area currently stores no real-time grid operation data.
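
The node layout described above can be written down as a few data classes. This is a sketch of the structure only; the class and field names (`ReplicaNode`, `QueueHead`, `ClusterQueue`, `op_flag`, and so on) are hypothetical, and a Python list stands in for the linked cluster queue.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReplicaNode:
    """Queue node with the five attributes listed in the text."""
    size: int                 # space occupied by the data the node points to
    timestamp: float          # data timestamp of the replica
    op_flag: int              # 0 = loading into the storage area, 1 = batch load to warehouse
    data_address: int         # where the replica's data is stored
    next: Optional["ReplicaNode"] = None  # next node in the replica queue


@dataclass
class QueueHead:
    """Head node of one replica queue (one replica cluster)."""
    data_item_id: str
    first: Optional[ReplicaNode] = None   # first address: first replica node


@dataclass
class ClusterQueue:
    """Queue of replica clusters, one per data item ID, without a header node."""
    clusters: List[QueueHead] = field(default_factory=list)

    def find(self, data_item_id: str) -> Optional[QueueHead]:
        # Sequential search: the queue structure offers no faster lookup.
        for head in self.clusters:
            if head.data_item_id == data_item_id:
                return head
        return None

    def is_empty(self) -> bool:
        """True when no real-time data is stored in the storage area."""
        return not self.clusters
```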

The process of creating a replica is the process of storing real-time grid operation data in the real-time storage area. Specifically:

(1) When captured real-time grid operation data needs to be loaded into the operable data storage area, the replica management module allocates a block of space in the operable data storage area, stores the data in it, and creates a replica that points to this block.

(2) Because the replica cluster queue is a queue structure, only sequential search can be used. Traverse each node in the replica cluster queue and check whether it contains a replica cluster node with the same data item ID as the new data, that is, whether the operable data storage area already holds data homologous with the newly arrived real-time grid operation data. If so, go to (3); if not, go to (9).

(3) Use the cluster first address of the node in the replica cluster queue to locate the head node of the replica queue in the current replica cluster.

(4) Initialize the newly created replica node and set its operation flag to 0.

(5) Insert the new replica node into the replica queue. Starting from the first replica node of the queue, compare the data timestamp of the new node with each node in turn until a replica node is found whose timestamp is greater than that of the new node while the timestamp of its next node is less than that of the new node; insert the new replica node immediately after that node.

(6) If the real-time grid operation data that a replica node points to expires, or if a system command requires the batch data in the replica node to be imported from the operable data storage area into the CIM-based data warehouse, set the flag of this replica node to 1 and load the batch data in the replica node into the CIM-based data warehouse in order.

(7) If a data update request is received, allocate storage space in the operable data storage area, create a new replica node, and initialize it. Then check whether the operable data storage area has a replica queue corresponding to the data item ID of the new data. If so, go to (8); if not, go to (9).

(8) Create a new replica queue for the new replica node. Initialize the queue: assign the data item ID of the new replica node to the data item ID of the header node, and point the first address of the header node to the first address of the new replica node.

(9) Insert the new replica node into the replica queue and update the replica cluster queue. If the cluster queue has no cluster node corresponding to the data item ID of the new replica queue, create a replica cluster and initialize it: assign the data item ID of the new replica queue to the data item ID of the cluster node, point the cluster first address of the cluster node to the head node of the replica queue, and insert the new cluster node at the tail of the replica cluster queue.

(10) Update the double-queue replica index accordingly.
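
Step (5), inserting a new replica node so that the queue stays sorted from newest to oldest, can be sketched with a singly linked list. The `Node` class and the helper names are hypothetical, and the handling of an empty queue and of equal timestamps is an assumption, since the text does not cover those cases.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    timestamp: float
    op_flag: int = 0                  # step (4): a new replica starts with flag 0
    next: Optional["Node"] = None


def insert_descending(head: Optional[Node], new: Node) -> Node:
    """Insert `new` so timestamps stay in descending order; return the head."""
    # Assumed edge cases: empty queue, or new node is the newest so far.
    if head is None or new.timestamp >= head.timestamp:
        new.next = head
        return new
    cur = head
    # Advance while the next node is still newer than the new node.
    while cur.next is not None and cur.next.timestamp > new.timestamp:
        cur = cur.next
    # cur is newer than `new` and cur.next (if any) is not: insert after cur.
    new.next, cur.next = cur.next, new
    return head


def timestamps(head: Optional[Node]) -> List[float]:
    """Collect the timestamps in queue order, for inspection."""
    out: List[float] = []
    while head is not None:
        out.append(head.timestamp)
        head = head.next
    return out
```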

In terms of the data model, the present invention uses an improved power system operation data model. Members with the same parent are encoded according to uniform rules. The hierarchical coding information of the operation data dimension tables is compressed into the fact table and stored in the distributed file system, so that large-scale big data analysis for power system operation monitoring can be executed on the distributed MapReduce nodes. Hierarchical coding uses sequential coding and splicing coding. Sequential coding encodes each attribute in a dimension with decimal numbers in a predefined order; with it, the correspondence between dimension attributes cannot be obtained directly. Splicing coding concatenates codes, and dimension traversal is achieved through shift operations on the code. The coding rules are as follows:

All detail data is classified into a non-overlapping data structure. Let d denote any dimension in the dimension tables; it has the following characteristics:

1) Each d contains one and only one topic.

2) d is a set of n levels, denoted l₁, l₂, …, l_n; each level l_i contains a single dimension attribute and m_i values.

3) Any dimension can be regarded as a tree structure formed by the values of its levels.

Let l_i be any level of dimension d, and let the set of all m_i values corresponding to it be the universe of level l_i. Then level l_{i-1} is the parent level of level l_i, and the parent of the highest level is defined as the topic to which the dimension belongs. The set of value members of level l_i that share a common parent p is called a subset domain of level l_i. Sibling nodes are member nodes that belong to the same class.

Each dimension can be regarded as a special single hierarchy tree, and the path to any node of the tree follows a pre-order traversal. The universe hierarchical code of a node is the code obtained by concatenating the subset-domain hierarchical codes of the nodes along its path.
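
One way to realize splicing coding is to give each level a fixed number of bits, number siblings sequentially within their subset domain, and concatenate the per-level codes with shifts. The sketch below follows that reading; the fixed bit widths and the function names `splice`, `ancestor`, and `unsplice` are assumptions, not details given in the patent.

```python
from typing import List, Sequence


def splice(codes: Sequence[int], widths: Sequence[int]) -> int:
    """Concatenate per-level subset-domain codes, top level first."""
    value = 0
    for code, width in zip(codes, widths):
        if not 0 <= code < (1 << width):
            raise ValueError("code does not fit in its level width")
        value = (value << width) | code
    return value


def ancestor(value: int, widths: Sequence[int], level: int) -> int:
    """Code of the ancestor at `level` (1-based) of a full-depth code."""
    # Shifting out the lower levels walks up the dimension tree.
    return value >> sum(widths[level:])


def unsplice(value: int, widths: Sequence[int]) -> List[int]:
    """Recover the per-level codes from a full-depth spliced code."""
    codes: List[int] = []
    for width in reversed(widths):
        codes.append(value & ((1 << width) - 1))
        value >>= width
    return codes[::-1]
```

With widths `[4, 4, 8]`, the path with sibling codes 2, 3, 17 becomes a single integer, and its level-2 ancestor is recovered by one right shift of 8 bits, which is the kind of shift-based traversal the text attributes to splicing coding.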

The data analysis server is also used to package grid operation data and its metadata into a unified format. It includes a metadata encapsulation module and a shifting combination module. The metadata encapsulation module encapsulates grid information metadata, and data is cleaned and checked by means of the metadata. The shifting combination module recombines the grid operation data and metadata using segmented encryption, which improves the security of data transmission and exchange and allows the data to be processed uniformly.

The grid information metadata, the power system information, and the information generated during data transmission are recorded. Under the rule constraints of CIM data conversion, data that does not satisfy the rules cannot pass, and the data is cleaned in this way. Rule-based cleaning cleans data by extracting basic metadata values and the additional security level information of the power system.

After cleaning is complete, the basic metadata of the operation data, the additional security level information of the power system, and the system operations are packaged into the final grid information metadata, which is encapsulated as key-value pairs. The shifting combination module encapsulates the data and its metadata together under a conversion protocol using segmented encryption. The data carrying metadata is packaged into unified-format data, and CIM data conversion is then performed.

In CIM data conversion, the data encapsulated by the metadata package module and the shifting combination module are parsed to recover the grid data and its metadata, respectively; the data are then cleaned using rules according to the metadata, so that data not conforming to the specification are removed.

Cleaning the data using rules means cleaning the data according to the grid information metadata provided by the power system. The rules provide a unified rule description, and data filtering is realized by processing the metadata information. A rule is designed as a user-defined mapping rule expression composed of variable values and operators. The variable values are extracted from the grid information metadata. During cleaning, the variables are replaced with the metadata values, the rule expression is then evaluated, and finally the result is output. When the rules from a data source to a target data table are defined, they are recorded as mapping expressions. From a mapping expression the system parses out the positions of the one or more source table fields that make up a target field, parses the complex conditional rules and data filtering conditions, stores the parsed transformation rules in the rule base in a predetermined format, and then submits them to the corresponding conversion module for processing. Once the mapping expressions have been parsed, the scattered user-defined transformation rules are all incorporated into the rule base. When data extraction is executed, the transformation rules in the rule base are read and the corresponding conversion components are called to complete the extraction.
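A minimal sketch of such rule evaluation is given below. The expression syntax `variable OP number`, the names `parse_rule` and `passes`, and the metadata fields `voltage_kv` and `security_level` are assumptions made for illustration; the invention does not fix a concrete rule grammar.

```python
import operator

# Operators allowed in a rule expression (an assumed, minimal set).
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def parse_rule(text):
    """Parse 'variable OP number' into (variable, comparison, threshold)."""
    name, symbol, literal = text.split()
    return name, OPS[symbol], float(literal)


def passes(metadata, rule_base):
    """A record passes only if every rule holds for its metadata values."""
    for name, compare, threshold in rule_base:
        # Substitute the metadata value for the variable, then evaluate.
        if name not in metadata or not compare(metadata[name], threshold):
            return False
    return True


# Parsed rules are kept together in a rule base (here simply a list).
rule_base = [parse_rule(r) for r in (
    "voltage_kv > 0",
    "voltage_kv <= 1000",
    "security_level <= 3",
)]

records = [
    {"voltage_kv": 220.0, "security_level": 2},
    {"voltage_kv": -5.0, "security_level": 2},   # violates voltage_kv > 0
    {"voltage_kv": 110.0, "security_level": 4},  # violates security_level <= 3
]
cleaned = [r for r in records if passes(r, rule_base)]
```

Keeping the parsed rules in one list mirrors the idea that scattered user-defined rules are gathered into a rule base before extraction runs.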

When the data analytics server performs the distributed similarity analysis on the power system operation data, this specifically includes association analysis between abnormal power system behavior and power transmission and transformation monitoring data. Before the association analysis, the transmission and transformation monitoring data and the abnormal behavior data are each preprocessed. Preprocessing of the abnormal behavior data comprises two steps: 1) from all power system operation data, select the abnormal equipment behavior data of equipment on which a monitoring terminal is installed, and summarize the failure frequency of each type of equipment at each monitoring terminal; 2) normalize the summarized data. Existing association analysis considers only the spatial characteristics of the monitoring data and ignores their temporal characteristics. The locations where the corresponding equipment failed are filtered out, the monitoring data of their monitoring terminals are obtained, and the monitoring data are preprocessed in the following steps: 1) take the qualification rate of each transmission and transformation index measured by the monitoring terminal over the whole monitoring period as the qualification rate at that location; 2) average the monthly mean values of each transmission and transformation index over the whole monitoring period to obtain the mean value of each index at that location; 3) normalize each index value calculated in the steps above, converting all data into [0, 1]. After preprocessing, the abnormal behavior data and the monitored transmission and transformation indices are all mapped to values in the interval [0, 1].
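The normalization step can be sketched as follows, assuming min-max scaling, which is one common way to map values into [0, 1]; the text does not name the scaling method. The function name and the sample failure counts are hypothetical.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical failure counts summarized at four monitoring terminals
failure_counts = [3, 12, 7, 0]
normalized = min_max_normalize(failure_counts)
```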

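The correlation coefficients between the variables are calculated, giving the m × n association matrix A composed of the correlation coefficients, as follows.

$$
A = \begin{bmatrix}
\rho_{x_1, y_1} & \rho_{x_1, y_2} & \cdots & \rho_{x_1, y_n} \\
\rho_{x_2, y_1} & \rho_{x_2, y_2} & \cdots & \rho_{x_2, y_n} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{x_m, y_1} & \rho_{x_m, y_2} & \cdots & \rho_{x_m, y_n}
\end{bmatrix}
$$

In the formula, the row variables of the matrix are the abnormal power system behavior statistics, denoted x_i, i = 1, ..., m, and the column variables are the power transmission and transformation monitoring data, denoted y_j, j = 1, ..., n. ρ_{x_i, y_j} is the correlation coefficient between x_i and y_j.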
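A minimal sketch of building the association matrix is shown below. It assumes the Pearson correlation coefficient, since the text does not state which correlation coefficient is used; the function names and the sample values are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant sequence has no defined correlation
    return cov / sqrt(var_a * var_b)


def association_matrix(xs, ys):
    """Rows: abnormal-behavior variables x_i; columns: monitoring indices y_j."""
    return [[pearson(x, y) for y in ys] for x in xs]


# Hypothetical normalized observations at three monitored locations
xs = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]                    # m = 2
ys = [[0.0, 0.5, 1.0], [0.2, 0.2, 0.9], [1.0, 0.0, 1.0]]   # n = 3
A = association_matrix(xs, ys)
```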

For the structured operation data of the CIM-based power grid, the present invention expresses the extraction and conversion of grid structure data as a behavior model four-tuple N = (P, W, O, M), where P denotes the data set of the grid system data sources, W the data set of the CIM-based data warehouse, O the set of mutually independent extraction tasks, and M the metadata set of the CIM-based data warehouse model. The extraction tasks are O = {O₁, O₂, O₃}: O₁ is the data cleansing task, which extracts preprocessed data from the power system according to the metadata of the CIM-based data warehouse; O₂ is the data loading task, which maps the data tables in the interface file area to the data tables in the transition file area of the CIM-based data warehouse and performs the related data conversion and loading; O₃ is the integration task, which, according to the CIM-based data warehouse model, performs data verification and data mapping on the data in the buffer area and integrates the verified data into the CIM-based data warehouse.

Let T be a data source table of the data conversion process, and let T_i be the copy of T in the CIM-based data warehouse buffer area at time i, T_i = {D, T}, where D denotes the timestamp. Let I be the change copy of T from time i to time i+1; then I = {L_sn, M, T_o, T_n}, where L_sn is the log sequence number of the data change, M is the data change operation, T_o is the data before modification or before deletion, and T_n is the modified or newly added data. Obtaining T_{i+1} in this way, from T_i and the change copy I, affects the performance of the source database less than running a selection directly on table T. In the CIM-based data warehouse buffer area, to map T_{i+1} to the fact table S, the fact data of T_{i+1} within the period [i, i+1] are obtained first, and the related aggregation is then performed according to the metadata definitions of the CIM-based data warehouse.
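The derivation of T_{i+1} from T_i and the change copy I can be sketched as follows. The record layout (dictionaries with `lsn`, `op`, `old`, `new`), the primary-key field `id`, and the function name are assumptions for illustration; they stand in for L_sn, M, T_o, and T_n respectively.

```python
def apply_changes(snapshot, changes, key="id"):
    """Return T_{i+1} given T_i (dict: key -> row) and change records I.

    Each change is {"lsn": L_sn, "op": M, "old": T_o, "new": T_n};
    `key` names an assumed primary-key column.
    """
    result = dict(snapshot)  # leave the T_i copy untouched
    for change in sorted(changes, key=lambda c: c["lsn"]):  # replay in log order
        op = change["op"]
        if op == "insert":
            result[change["new"][key]] = change["new"]
        elif op == "update":
            result.pop(change["old"][key], None)
            result[change["new"][key]] = change["new"]
        elif op == "delete":
            result.pop(change["old"][key], None)
        else:
            raise ValueError(f"unknown change operation: {op!r}")
    return result


t_i = {
    1: {"id": 1, "load_mw": 50.0},
    2: {"id": 2, "load_mw": 30.0},
}
changes = [
    {"lsn": 102, "op": "delete", "old": {"id": 2, "load_mw": 30.0}, "new": None},
    {"lsn": 101, "op": "update", "old": {"id": 1, "load_mw": 50.0},
     "new": {"id": 1, "load_mw": 55.0}},
    {"lsn": 103, "op": "insert", "old": None, "new": {"id": 3, "load_mw": 12.5}},
]
t_next = apply_changes(t_i, changes)
```

Only the change records are read from the source side, which is the reason the approach loads the source database less than a full selection on T.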

The data cleansing process further uses similarity analysis to determine the set of similar samples of operation data that are strongly associated with the current time, then obtains typical characteristic sequences by hierarchical clustering, identifies fault data in the sequence to be detected using the characteristic sequences as reference, and finally corrects the identified fault data by moving the normal data corresponding to the characteristic sequence into the fault data segment of the sequence to be detected. Through the clustering process, the different typical characteristic sequences are extracted and, with these as reference, sequences to be detected that may contain fault data are identified and corrected.

To retrieve data tuples faster, the present invention builds an in-memory index over the tuples of the relation set. The most frequently accessed tuples of the relation set are then placed in a cache to reduce I/O overhead. The frequently accessed relation set R is stored in the cache, and R together with the real-time copy data D stored in the operable data storage area serves as input. In each iteration, a block P_i of the relation set R is used as a probe input. A hash join is performed: it traverses all tuples in the cached relation data area and at the same time looks them up in the hash table. Whenever a match succeeds, the matched data stream tuple is output. After the cached relation data area has been processed, the algorithm reads new tuples from the real-time grid operation data source, loads them into the hash table, and inserts their identifiers into a queue. To select the next block of R, it first finds the join attribute of the data tuple with the smallest timestamp in the queue. The block of R containing that join attribute is then loaded into the cached relation data area using the index. In this way, each new block can be matched with at least one data tuple.
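The join loop described above can be sketched in simplified form as follows. This is an illustration under assumptions, not the implementation of the invention: the relation R is modeled as in-memory blocks, the index as a dictionary from join key to block identifier, and the cache of frequently accessed tuples is omitted. The function name, `batch_size`, and the sample data are hypothetical.

```python
from collections import defaultdict, deque


def stream_relation_join(stream, relation_blocks, index, batch_size=2):
    """Join stream tuples (key, payload) against relation R held in blocks.

    relation_blocks: {block id: [(key, attributes), ...]}
    index:           {join key: block id}  (assumed consistent with the blocks)
    """
    hash_table = defaultdict(list)  # join key -> pending stream payloads
    queue = deque()                 # join keys in arrival (timestamp) order
    source = iter(stream)
    output = []
    exhausted = False
    while True:
        # Read new tuples from the stream into the hash table and the queue.
        for _ in range(batch_size):
            try:
                key, payload = next(source)
            except StopIteration:
                exhausted = True
                break
            hash_table[key].append(payload)
            queue.append(key)
        # Skip queue entries whose tuples were already matched.
        while queue and queue[0] not in hash_table:
            queue.popleft()
        if not queue:
            if exhausted:
                break
            continue
        oldest_key = queue.popleft()  # tuple with the smallest timestamp
        block_id = index.get(oldest_key)
        if block_id is not None:
            # Probe: traverse the loaded block and look each key up in the hash table.
            for key, attributes in relation_blocks[block_id]:
                for payload in hash_table.pop(key, []):
                    output.append((key, payload, attributes))
        hash_table.pop(oldest_key, None)  # no partner in R: drop the tuple
    return output


relation_blocks = {0: [("A", "line-A"), ("B", "line-B")], 1: [("C", "line-C")]}
index = {"A": 0, "B": 0, "C": 1}
stream = [("C", 1), ("A", 2), ("X", 3), ("B", 4)]
joined = stream_relation_join(stream, relation_blocks, index)
```

Because the block is chosen from the oldest queued tuple, every loaded block yields at least one match whenever the index entry is valid, which is the property the paragraph above relies on.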

In the similarity analysis of the grid data model, the data analytics server judges the relationship between different sequences from the shape of the sequence curves. Time-feature-related factors are selected as the samples for calculating the correlation degree. The specific calculation steps are as follows:

(1) Let the current time sequence Y = {Y(m) | m = 1, 2, ..., p} be the reference sequence, and let the historical operation data sequences X_i = {X_i(m) | m = 1, 2, ..., p}, i = 1, 2, ..., k, be the comparison sequences, where p is the number of sequence elements.

(2) Calculate:

(3) Calculate the incidence coefficient:

In the formula, ζ_i(m) is the incidence coefficient of Y(m) with respect to X_i(m), where Δ_i(m) = |y(m) - x_i(m)| and ρ is the resolution coefficient, whose value lies in the interval (0, 1).

(4) Calculate the correlation degree:
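A minimal sketch of steps (1), (3), and (4) follows. It assumes the standard grey relational analysis form of the incidence coefficient, which is consistent with the symbols Δ_i(m) and ρ defined above, and it assumes that the correlation degree is the mean of the coefficients over m. The sequences are assumed to be on a comparable scale already; the function name and sample values are hypothetical.

```python
def grey_relational_degrees(reference, comparisons, rho=0.5):
    """Correlation degree of each comparison sequence X_i to the reference Y.

    Assumed form of the incidence coefficient:
        zeta_i(m) = (d_min + rho * d_max) / (delta_i(m) + rho * d_max)
    where delta_i(m) = |y(m) - x_i(m)| and d_min, d_max are the smallest and
    largest delta over all sequences i and all positions m.
    """
    deltas = [[abs(y - x) for y, x in zip(reference, seq)] for seq in comparisons]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        return [1.0] * len(comparisons)  # every sequence equals the reference
    degrees = []
    for row in deltas:
        coefficients = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(coefficients) / len(coefficients))  # mean over m
    return degrees


y = [1.0, 2.0, 3.0]                              # current time sequence Y
history = [[1.0, 2.0, 3.0], [2.0, 3.0, 5.0]]     # historical sequences X_1, X_2
degrees = grey_relational_degrees(y, history)
```

Historical sequences with the largest correlation degree would then form the similar sample set used in the clustering stage.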

In the hierarchical clustering stage above, let the data set be X = {x₁, x₂, ..., x_n}, where n is the number of elements in X. Each element is a p-dimensional vector, and X contains k classes. Let the center of the i-th class be v_i = {v_i1, v_i2, ..., v_ip}; the characteristic sequences are defined as the cluster centers. The membership degree of the j-th element of X to the i-th class center is u_ij; let U = {u_ij} and V = {v_ij}.

u_ij is calculated as:

In the formula, m is the weighting exponent and d_ij = ||x_j - v_i|| is the distance from the j-th element to the i-th class center. The cluster center v_i is calculated as follows:

The clustering iteration seeks the cluster centers and the membership matrix at which the objective function reaches its minimum. The objective function J is:
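The iteration can be sketched as follows, assuming the standard fuzzy c-means formulation, which matches the membership degree, weighting exponent, class center, and objective function described above: u_ij = 1 / Σ_c (d_ij / d_cj)^(2/(m-1)), v_i = Σ_j u_ij^m x_j / Σ_j u_ij^m, and J = Σ_i Σ_j u_ij^m d_ij². The fixed iteration count, the choice of initial centers, and the sample curves are hypothetical.

```python
def dist(a, b):
    """Euclidean distance d_ij = ||x_j - v_i||."""
    return sum((s - t) ** 2 for s, t in zip(a, b)) ** 0.5


def memberships(points, centers, m):
    """Membership u[i][j] of element j in class i (assumed fuzzy c-means form)."""
    u = [[0.0] * len(points) for _ in centers]
    for j, x in enumerate(points):
        d = [dist(x, v) for v in centers]
        on_center = [i for i, value in enumerate(d) if value == 0.0]
        for i in range(len(centers)):
            if on_center:  # element coincides with a center
                u[i][j] = 1.0 / len(on_center) if i in on_center else 0.0
            else:
                u[i][j] = 1.0 / sum((d[i] / d[c]) ** (2.0 / (m - 1.0))
                                    for c in range(len(centers)))
    return u


def update_centers(points, u, m, old_centers):
    """Class center v_i as the u^m-weighted mean of all elements."""
    centers = []
    for row, old in zip(u, old_centers):
        weights = [w ** m for w in row]
        total = sum(weights)
        if total == 0.0:
            centers.append(old)  # no element belongs to this class
            continue
        centers.append(tuple(
            sum(w * x[k] for w, x in zip(weights, points)) / total
            for k in range(len(points[0]))))
    return centers


def objective(points, centers, u, m):
    """Objective function J = sum_i sum_j u_ij^m * d_ij^2."""
    return sum(u[i][j] ** m * dist(points[j], centers[i]) ** 2
               for i in range(len(centers)) for j in range(len(points)))


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    for _ in range(iterations):
        u = memberships(points, centers, m)
        centers = update_centers(points, u, m, centers)
    u = memberships(points, centers, m)
    return centers, u, objective(points, centers, u, m)


# Hypothetical two-sample "daily curves" forming two obvious groups
curves = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (5.0, 5.0), (5.2, 5.1), (5.1, 5.2)]
centers, u, j_value = fuzzy_c_means(curves, centers=[curves[0], curves[3]])
```

The returned centers play the role of the characteristic sequences, and each column of `u` gives the membership of one sequence in every class.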

The clustering result is then analyzed to determine the optimal partition. Suppose a data set containing n sequences is divided into k classes (C₁, C₂, ..., C_k). For the i-th sequence x(i) in C_a, calculate the average distance a(i) between x(i) and the other sequences in its class. Let d(i, C_b) be the average distance from x(i) to all sequences of another class C_b, and define b(i) = min{d(i, C_b)}, b = 1, 2, ..., k, b ≠ a. The dissimilarity of each sequence is calculated from its average distance to the samples in its own class and to the sequences in the other classes; for each sequence i the formula is:

The mean Dissim value over all samples of the data set is used to evaluate the quality of the clustering result; the maximum of the index corresponds to the optimal number of clusters.
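A minimal sketch of this evaluation is given below. It assumes that the Dissim index takes the standard silhouette form (b(i) - a(i)) / max{a(i), b(i)}, which is consistent with the definitions of a(i) and b(i) above but is not confirmed by the text; it also assumes at least two classes. The function names and sample data are hypothetical.

```python
def euclidean(a, b):
    return sum((s - t) ** 2 for s, t in zip(a, b)) ** 0.5


def mean_dissim(clusters, distance=euclidean):
    """Mean of (b(i) - a(i)) / max(a(i), b(i)) over all sequences.

    `clusters` is a list of at least two non-empty classes, each a list of sequences.
    """
    values = []
    for a_idx, cluster in enumerate(clusters):
        for pos, x in enumerate(cluster):
            own = [y for j, y in enumerate(cluster) if j != pos]
            if not own:
                values.append(0.0)  # a singleton class carries no information
                continue
            a_i = sum(distance(x, y) for y in own) / len(own)
            b_i = min(sum(distance(x, y) for y in other) / len(other)
                      for b_idx, other in enumerate(clusters) if b_idx != a_idx)
            scale = max(a_i, b_i)
            values.append(0.0 if scale == 0 else (b_i - a_i) / scale)
    return sum(values) / len(values)


good = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]   # compact, well separated
bad = [[(0.0,), (10.0,)], [(1.0,), (11.0,)]]    # classes mixed together
good_score = mean_dissim(good)
bad_score = mean_dissim(bad)
```

Running the clustering for several values of k and keeping the k with the largest mean value corresponds to selecting the optimal number of classes.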

For fault data judgment, suppose the similar sample set obtained by similarity analysis spans d days, of which class n contains d_n days, n = 1, ..., k; the maximum rate of change of the operation data at time t is denoted α_max(t, d_n).

α_max(t, d_n) = max{[L(d-i, t) - L(d-i, t-1)] / L(d-i, t-1)}, i = 1, ..., d_n

where the function L(d, t) is the operation data at time t on day d.

Let the sequence to be detected be X_d = (x_d1, x_d2, ..., x_dm), where m is the number of samples per day. Let the characteristic sequence with the maximum membership degree be X_t; at sampling time t, the rate of change of X_d relative to the characteristic sequence X_t is:

δ_t=(x_dt-x_tt)/x_tt

If δ_t > α_max(t, d_n), the point is considered to be fault data. This method reduces the workload and improves the computing speed and the working efficiency of the model.
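The judgment can be sketched as follows. Two assumptions are made and should be noted: the magnitudes of both rates are compared (the formulas above are written with signed values), and all operation data are strictly positive so that the divisions are defined. The function names and the sample curves are hypothetical.

```python
def max_change_rate(history, t):
    """alpha_max at sampling index t over the daily curves of one class.

    Assumption: the magnitude of the relative change is used.
    """
    return max(abs(day[t] - day[t - 1]) / day[t - 1] for day in history)


def detect_faults(candidate, feature, history):
    """Return the sampling indices of `candidate` judged to be fault data."""
    faults = []
    for t in range(1, len(candidate)):  # t = 0 has no previous sample
        delta = abs(candidate[t] - feature[t]) / feature[t]
        if delta > max_change_rate(history, t):
            faults.append(t)
    return faults


history = [[100.0, 110.0, 120.0, 115.0],    # similar days of the same class
           [100.0, 105.0, 118.0, 112.0]]
feature = [100.0, 108.0, 119.0, 114.0]      # characteristic sequence X_t
candidate = [101.0, 109.0, 180.0, 113.0]    # sequence to be detected X_d
fault_points = detect_faults(candidate, feature, history)
```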

Suppose the points from p to q of a sequence X_d are detected as fault data, and the two characteristic sequences with the largest membership values are X_t1 and X_t2. The maximum-membership characteristic sequences are used in the actual correction. The correction formula is as follows.

X'_d(i)=X'_t1(i)(u_{T1, i}/(u_{T1, i}+u_{T2, i}))+X'_t2(i)(u_{T2, i}/(u_{T1, i}+u_{T2, i}))

X'_t1(i)=X'_t1(i)×[X_d(p-1)/X_t1(p-1)+X_d(q+1)/X_t1(q+1)]

X'_t2(i)=X'_t2(i)×[X_d(p-1)/X_t2(p-1)+X_d(q+1)/X_t2(q+1)],

where i = p, p+1, ..., q.

In conclusion, the present invention proposes a CIM-based data cleansing and conversion method. Supported by an improved grid operation data model and a distributed data platform, it extracts, cleans, and integrates source data, ensures data quality and reliability, and produces database-based output in a unified standard. It supports clustered deployment and concurrency, is broadly applicable, and provides reliable support for the automated integration and analysis of grid data.

Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system; the modules or steps can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is therefore not limited to any specific combination of hardware and software.

It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present invention shall fall within its protection scope. In addition, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or equivalents of such scope and boundaries.

Claims

1. A CIM-based data cleansing and conversion method, characterized by comprising:

capturing the operation data of a power system;

cleaning and converting the captured operation data of the power system to obtain data based on the CIM unified standard, and storing the data in a distributed file system;

2. The method according to claim 1, wherein the operation data include equipment ledger information, operation and maintenance data, fault data, power flow topology data, and GIS equipment information.

3. The method according to claim 1, further comprising: storing the model information of the power system and the metadata of the CIM-based distributed data warehouse in MangoDB.

4. The method according to claim 3, wherein the CIM-based distributed data warehouse, after decomposing tasks with MapReduce, extracts data directly from the distributed file system for analysis, performs data management and data access in a unified manner, and implements model data mapping and performance optimization.

5. The method according to claim 4, wherein the model data mapping comprises mapping between the attributes of the power system service model and the model data of the different types of underlying data sources.

6. The method according to claim 1, wherein the cleaning and conversion comprise two stages: in the first stage, data are extracted from the data sources into an operable data buffer area, and in the second stage, data are extracted from the operable data buffer area into the CIM-based data warehouse: (1) the first stage extracts the heterogeneous data sources into the operable data buffer area, establishing in the operable data buffer area a backup copy of the power system operation data with identical structure and identical content; (2) the second stage performs statistical merging and summarization on the data in the operable data buffer area and stores the data in the CIM-based data warehouse by incremental loading; the data extraction is incremental extraction, and if the increment cannot be determined at extraction time, it is calculated at load time; a time tag is added to the data when they are loaded into the CIM-based data warehouse; in the extraction from the operable data to the CIM-based data warehouse, after the data are read from the operable data buffer area, unified information coding is performed first, and the fact table data and the dimension table data are then processed differently; for changes in the fact table data, different incremental loading modes are selected according to the type of change: if the data change over time, timestamp-based increments are used, and if the data change randomly, the increment is obtained by a full-table comparison; for changes in the dimension table data, the latest CIM-based data overwrite the offline data.

7. The method according to claim 4, wherein extracting data from the distributed file system for analysis further comprises similarity analysis of the power system operation data.

8. The method according to claim 7, wherein in the similarity analysis of the power system operation data, the relationship between different sequences is judged from the shape of the sequence curves, and time-feature-related factors are selected as the samples for calculating the correlation degree; the specific calculation steps are as follows:

(1) let the current time sequence Y = {Y(m) | m = 1, 2, ..., p} be the reference sequence, and let the historical operation data sequences X_i = {X_i(m) | m = 1, 2, ..., p}, i = 1, 2, ..., k, be the comparison sequences, where p is the number of sequence elements;

(2) calculate:

(3) calculate the incidence coefficient ζ_i(m):

where ζ_i(m) is the incidence coefficient of Y(m) with respect to X_i(m);

(4) calculate the correlation degree: