CN107679146A - Method and system for verifying power grid data quality - Google Patents

Info

Publication number
CN107679146A
Authority
CN
China
Prior art keywords
data
data record
power network
search index
original data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710876201.1A
Other languages
Chinese (zh)
Inventor
黄文琦
许爱东
陈晓
陈华军
李果
蒋屹新
杨航
张福铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CSG Electric Power Research Institute
Research Institute of Southern Power Grid Co Ltd
Original Assignee
Research Institute of Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Institute of Southern Power Grid Co Ltd filed Critical Research Institute of Southern Power Grid Co Ltd
Priority to CN201710876201.1A priority Critical patent/CN107679146A/en
Publication of CN107679146A publication Critical patent/CN107679146A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

The present invention relates to a method and system for verifying power grid data quality. Raw power grid data records are obtained and stored in a distributed storage system, which also stores a first query index table for the raw power grid data records. Multiple parallel tasks are created; in each parallel task, the check field of a target verification rule is obtained, the first query index table is searched according to the check field to obtain the first raw power grid data record corresponding to the check field, the comparison data and reference data in the first raw power grid data record are extracted, and the extracted comparison data is verified against the extracted reference data. Storing the power grid data records in a distributed manner gives the verification process good scalability, and the query index that relates the check fields of the verification rules to the data records supports efficient query processing when the verification rules are executed.

Description

Method and system for verifying power grid data quality
Technical field
The present invention relates to the field of power grid technology, and in particular to a method and system for verifying power grid data quality.
Background art
Traditional relational data management systems pursue a high degree of consistency and correctness. To meet the demand for analyzing massive data, they scale up (vertical scaling), that is, they raise the capacity of a single node by upgrading its hardware (CPU, memory, hard disk, etc.), which severely limits their scalability and performance.
As the scale of power grid business data and the complexity of data quality monitoring rules keep growing, existing data quality monitoring systems built on traditional data management and computing platforms face a serious bottleneck in processing capacity. Their processing efficiency is low, they cannot complete data quality monitoring and verification quickly, and they increasingly fail to meet the needs of daily production management and business decision-making.
Summary of the invention
Accordingly, to address the low efficiency of data quality monitoring and verification in data quality monitoring systems built on traditional data management and computing platforms, it is necessary to provide a method and system for verifying power grid data quality.
A method for verifying power grid data quality comprises the following steps:
Raw power grid data records are obtained and stored in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
Multiple parallel tasks are created, and the following operations are performed in each parallel task: the check field of a target verification rule is obtained; the first query index table of the raw power grid data records is searched according to the check field to obtain the first raw power grid data record corresponding to the check field; the comparison data and reference data in the first raw power grid data record are extracted; and the extracted comparison data is verified against the extracted reference data, wherein the first query index table is stored in the distributed storage system;
The verification results of the multiple parallel tasks are output.
A system for verifying power grid data quality comprises:
A data storage unit, configured to obtain raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
A task creation unit, configured to create multiple parallel tasks;
A query index unit, configured to obtain, in each parallel task, the check field of a target verification rule, and to search the first query index table of the raw power grid data records according to the check field to obtain the first raw power grid data record corresponding to the check field;
An extraction and comparison unit, configured to extract the comparison data and reference data in the first raw power grid data record and to verify the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;
A result output unit, configured to output the verification results of the multiple parallel tasks.
According to the above method and system for verifying power grid data quality, raw power grid data records are obtained and stored in a distributed storage system, which also stores a first query index table for the raw power grid data records. Multiple parallel tasks are created; in each parallel task, the check field of a target verification rule is obtained, the first query index table is searched according to the check field to obtain the first raw power grid data record corresponding to the check field, the comparison data and reference data in that record are extracted, and the extracted comparison data is verified against the extracted reference data. In this scheme, storing the power grid data records in a distributed manner gives the verification process good scalability, and the query index that relates the check fields of the verification rules to the data records supports efficient query processing when the verification rules are executed. In addition, the multiple parallel tasks allow every verification rule to be processed in parallel, which improves the efficiency of power grid data quality verification.
A readable storage medium stores an executable program which, when executed by a processor, implements the steps of the above method for verifying power grid data quality.
A verification device comprises a memory, a processor, and an executable program that is stored in the memory and can run on the processor; when the processor executes the program, the steps of the above method for verifying power grid data quality are implemented.
In accordance with the above method for verifying power grid data quality, the present invention also provides a readable storage medium and a verification device, which implement the above method by means of a program.
Brief description of the drawings
Fig. 1 is a flow chart of the method for verifying power grid data quality in one embodiment of the present invention;
Fig. 2 is a structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 3 is a structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 4 is a structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 5 is a structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 6 is an overall diagram of the verification in a specific embodiment of the present invention;
Fig. 7 is a diagram of incremental data storage and indexing in a specific embodiment of the present invention;
Fig. 8 is a diagram of batch historical data storage and indexing in a specific embodiment of the present invention;
Fig. 9 is a diagram of the parallelized processing of verification rules in a specific embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here only explain the present invention and do not limit its scope of protection.
Fig. 1 is a flow chart of the method for verifying power grid data quality in one embodiment of the present invention. The method in this embodiment comprises the following steps:
Step S110: Raw power grid data records are obtained and stored in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
In this step, the distributed storage system stores the raw power grid data in a distributed manner, which makes it easy to add or delete power grid data and gives the verification process good scalability; the reference data records used for verification are the standard against which the comparison data records are verified;
Step S120: Multiple parallel tasks are created, and the following operations are performed in each parallel task: the check field of a target verification rule is obtained; the first query index table of the raw power grid data records is searched according to the check field to obtain the first raw power grid data record corresponding to the check field; the comparison data and reference data in the first raw power grid data record are extracted; and the extracted comparison data is verified against the extracted reference data, wherein the first query index table is stored in the distributed storage system;
In this step, each parallel task performs verification according to its target verification rule; by searching the query index, the comparison data and reference data corresponding to the check field can be obtained and then verified;
Step S130: The verification results of the multiple parallel tasks are output.
In this embodiment, raw power grid data records are obtained and stored in a distributed storage system, which also stores a first query index table for the raw power grid data records. Multiple parallel tasks are created; in each parallel task, the check field of a target verification rule is obtained, the first query index table is searched according to the check field to obtain the first raw power grid data record corresponding to the check field, the comparison data and reference data in that record are extracted, and the extracted comparison data is verified against the extracted reference data. In this scheme, storing the power grid data records in a distributed manner gives the verification process good scalability, and the query index that relates the check fields of the verification rules to the data records supports efficient query processing when the verification rules are executed. In addition, the multiple parallel tasks allow every verification rule to be processed in parallel, which improves the efficiency of power grid data quality verification.
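Steps S110 to S130 can be illustrated with a minimal single-machine sketch. Everything named here is an assumption for illustration only: the in-memory dictionaries stand in for the distributed record table and the first query index table, a thread pool stands in for the parallel tasks, and the rule semantics (a relative tolerance between comparison and reference values) are hypothetical, since the patent does not fix a particular comparison.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the distributed record table (S110) and the
# first query index table (check field value -> record primary keys).
RECORDS = {
    "r1": {"meter_id": "M1", "reading": 98.0, "baseline": 100.0},
    "r2": {"meter_id": "M2", "reading": 150.0, "baseline": 100.0},
}
FIRST_INDEX = {"M1": ["r1"], "M2": ["r2"]}

# Hypothetical verification rules; field names are not from the patent.
RULES = [
    {"name": "m1_reading", "check_field_value": "M1",
     "comparison_column": "reading", "reference_column": "baseline",
     "tolerance": 0.05},
    {"name": "m2_reading", "check_field_value": "M2",
     "comparison_column": "reading", "reference_column": "baseline",
     "tolerance": 0.05},
]


def verify_rule(rule):
    """One parallel task (S120): index lookup, extraction, comparison."""
    failures = []
    for key in FIRST_INDEX.get(rule["check_field_value"], []):
        record = RECORDS[key]
        comparison = record[rule["comparison_column"]]
        reference = record[rule["reference_column"]]
        # Assumed semantics: relative deviation must stay within tolerance.
        if abs(comparison - reference) > rule["tolerance"] * abs(reference):
            failures.append(key)
    return rule["name"], failures


def run_parallel(rules, workers=4):
    """Create the parallel tasks and collect their results (S130)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(verify_rule, rules))
```

Calling `run_parallel(RULES)` returns, for each rule name, the primary keys of the records that failed that rule.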
Optionally, the distributed storage system may be an HBase distributed storage system. HBase provides management of big data tables based on a column-oriented storage model; it can store and manage more than billions of data records, each of which may contain more than a million data columns. HBase provides random, real-time read/write access to data, together with strong scalability, high availability, fault tolerance, load balancing, and real-time data query capability.
In one of the embodiments, the method for verifying power grid data quality further comprises the following step:
The first query index table is established in the distributed storage system; the primary key of the first query index table is the check field of the various verification rules, and the column value of the first query index table is the primary key of the raw power grid data records.
In this embodiment, the first query index table can be established in the distributed storage system, with the check fields of the various verification rules as its primary key and the primary keys of the raw power grid data records as its column values. Through the first query index table, the first raw power grid data record corresponding to a check field can be found quickly.
It should be noted that, once the primary key of the raw power grid data record corresponding to the check field has been obtained, the corresponding first raw power grid data record can be found with that primary key, and the comparison data and reference data can be extracted from it.
Optionally, the check field may be the primary key of the raw power grid data records or any attribute column. The extracted comparison data is the actual field corresponding to the check field, which may be the check field itself or another data field.
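The two-step lookup described above (check field to primary key, then primary key to record) can be sketched as follows. This is an illustrative in-memory model, not HBase code; the function names and the `substation` column are hypothetical.

```python
from collections import defaultdict


def build_first_index(record_table, check_column):
    """Build the first query index: check field value -> record primary keys."""
    index = defaultdict(list)
    for primary_key, record in record_table.items():
        index[record[check_column]].append(primary_key)
    return dict(index)


def lookup_records(record_table, index, check_value):
    """Step 1: index lookup yields primary keys. Step 2: fetch the records."""
    return [record_table[key] for key in index.get(check_value, [])]


# Hypothetical raw records keyed by primary key.
record_table = {
    "r1": {"substation": "S-A", "voltage": 219.0, "rated_voltage": 220.0},
    "r2": {"substation": "S-B", "voltage": 231.5, "rated_voltage": 220.0},
    "r3": {"substation": "S-A", "voltage": 221.0, "rated_voltage": 220.0},
}
first_index = build_first_index(record_table, "substation")
```

Because the index is keyed by a non-primary-key attribute, a rule that checks `substation` avoids scanning the whole record table and touches only the rows listed under that value.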
In one of the embodiments, the method for verifying power grid data quality further comprises the following step:
In each parallel task, the timestamp range of the target verification rule is obtained; the second query index table of the raw power grid data records is searched according to the timestamp range to obtain the second raw power grid data record corresponding to the timestamp range; the comparison data and reference data in the second raw power grid data record are extracted; and the extracted comparison data is verified against the extracted reference data, wherein the second query index table is stored in the distributed storage system.
In this embodiment, the second raw power grid data record can be looked up by timestamp, and the comparison data and reference data in it can be extracted and verified, which enables power grid data quality verification based on a time window.
It should be noted that, when the second raw power grid data record is looked up by timestamp, the timestamp corresponds to the comparison data in that record, and the reference data corresponds to the comparison data; the reference data has no direct connection with the timestamp.
In one of the embodiments, the method for verifying power grid data quality further comprises the following step:
The second query index table is established in the distributed storage system; the primary key of the second query index table is the timestamp of the various verification rules, and the column value of the second query index table is the primary key of the raw power grid data records.
In this embodiment, the second query index table can be established in the distributed storage system, with the timestamps of the various verification rules as its primary key and the primary keys of the raw power grid data records as its column values. Through the second query index table, the second raw power grid data record corresponding to a timestamp can be found quickly.
It should be noted that, once the primary key of the raw power grid data record corresponding to the timestamp has been obtained, the corresponding second raw power grid data record can be found with that primary key, and the comparison data and reference data can be extracted from it.
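A timestamp-keyed index lends itself to range queries when its entries are kept in sorted order. The sketch below models the second query index table as a sorted list searched by bisection; the class name, the half-open range convention, and the integer timestamps are assumptions made for illustration.

```python
import bisect


class TimestampIndex:
    """Second query index sketch: sorted (timestamp, primary key) entries."""

    def __init__(self):
        self._entries = []

    def add(self, timestamp, primary_key):
        # insort keeps the list ordered by timestamp, then by primary key.
        bisect.insort(self._entries, (timestamp, primary_key))

    def keys_in_range(self, start, end):
        """Return primary keys whose timestamp t satisfies start <= t < end."""
        # A 1-tuple sorts before any 2-tuple with the same first element,
        # so (start,) locates the first entry with timestamp >= start.
        low = bisect.bisect_left(self._entries, (start,))
        high = bisect.bisect_left(self._entries, (end,))
        return [key for _, key in self._entries[low:high]]


index = TimestampIndex()
index.add(1000, "r1")
index.add(3000, "r3")
index.add(2000, "r2")
window_keys = index.keys_in_range(1000, 3000)
```

The returned primary keys are then used to fetch the second raw power grid data records, exactly as with the first query index table.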
In one of the embodiments, the method for verifying power grid data quality further comprises the following steps:
An index file is established in a distributed file system according to the first query index table and the second query index table; the index file is read into memory; the operation log file that the distributed file system keeps for the index file is read; the operation records in the operation log file are applied to the in-memory index; the in-memory index is written back to the index file; raw power grid data records are loaded in batches according to the index file into which the in-memory index has been written; and, for each batch, the comparison data is verified against the reference data in that batch's raw power grid data records.
In this embodiment, an index file can be established in the distributed file system. During raw power grid data quality verification, the index file is read into memory, the operation log is read and applied to the in-memory index, the in-memory index is written back to the index file, and verification is performed based on the updated index file. In this way, verification data can be loaded quickly during batch verification of raw power grid data quality, which improves verification performance.
Optionally, the distributed file system may be HDFS (Hadoop Distributed File System). HDFS has a good multi-replica data storage mechanism and robust mechanisms for detecting data node errors and recovering from node failures.
Optionally, after the in-memory index has been written back to the index file, the operation log file can be deleted to free storage space and speed up verification.
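The load, replay, write-back, and delete sequence described above resembles a checkpoint. The following sketch uses local JSON files in place of HDFS files; the file layout (a JSON object for the index, one JSON operation per log line) and the operation names `put` and `delete` are assumptions, not details from the patent.

```python
import json
import os


def load_index(index_path):
    """Read the index file into memory (value -> list of primary keys)."""
    with open(index_path, encoding="utf-8") as handle:
        return json.load(handle)


def apply_log(index, log_path):
    """Replay the operation records of the log onto the in-memory index."""
    with open(log_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            operation = json.loads(line)
            keys = index.setdefault(operation["field"], [])
            if operation["op"] == "put" and operation["key"] not in keys:
                keys.append(operation["key"])
            elif operation["op"] == "delete" and operation["key"] in keys:
                keys.remove(operation["key"])
    return index


def checkpoint(index_path, log_path):
    """Merge the log into the index file, then delete the log."""
    index = apply_log(load_index(index_path), log_path)
    temporary_path = index_path + ".tmp"
    with open(temporary_path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)
    os.replace(temporary_path, index_path)  # swap in the new index file
    os.remove(log_path)  # optional step: free the space used by the log
    return index
```

Writing to a temporary file and renaming it is a design choice of this sketch so that a crash mid-write cannot leave a half-written index file; the patent itself does not describe how the write-back is made safe.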
In one of the embodiments, the method for verifying power grid data quality further comprises the following step:
When an incremental power grid data record is detected, a first query index entry is generated from the check field of the incremental record and added to the first query index table, and a second query index entry is generated from the timestamp of the incremental record and added to the second query index table.
In this embodiment, when an incremental power grid data record is detected, the corresponding first query index entry can be added to the first query index table and the corresponding second query index entry to the second query index table. This keeps the index tables complete and allows the full set of power grid data to be verified.
Optionally, when verifying incremental power grid data, because the timestamps of incremental data differ markedly from those of the original data, the incremental power grid data in the power grid data records can be queried by timestamp range and then verified.
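Keeping both index tables current on every incremental record can be sketched as a single hook. The class below is a hypothetical in-memory model; the column names passed to the constructor and the record layout are assumptions.

```python
import bisect


class QueryIndexes:
    """Holds sketches of both query index tables and updates them together."""

    def __init__(self, check_column, timestamp_column):
        self.check_column = check_column
        self.timestamp_column = timestamp_column
        self.first = {}    # check field value -> list of primary keys
        self.second = []   # sorted list of (timestamp, primary key)

    def on_incremental_record(self, primary_key, record):
        """Called when a new record is detected; adds one entry per index."""
        check_value = record[self.check_column]
        self.first.setdefault(check_value, []).append(primary_key)
        bisect.insort(self.second, (record[self.timestamp_column], primary_key))


indexes = QueryIndexes(check_column="meter_id", timestamp_column="ts")
indexes.on_incremental_record("r2", {"meter_id": "M1", "ts": 2000})
indexes.on_incremental_record("r1", {"meter_id": "M1", "ts": 1000})
```

Because the second index stays sorted by timestamp, the most recent increment can later be selected with a timestamp range query rather than a scan of all records.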
In one of the embodiments, the step of creating multiple parallel tasks comprises the following steps:
Multiple parallel tasks are created in the MapReduce parallel computing framework; instruction files are established in the distributed file system for all verification rules; each parallel task reads its corresponding instruction file; and the parameters and processing logic for executing the verification rules are configured for each parallel task according to its instruction file.
In this embodiment, the parallel tasks can be created with the MapReduce parallel computing framework, in which all Map nodes can execute different verification rules concurrently. If a failure occurs during execution, the framework can automatically start a new task on another node to retry the failed verification rule, which effectively handles load balancing and fault tolerance throughout the parallel process. The parameters and processing logic of the verification rules are stored in the instruction files, which can be retrieved from the distributed file system; with the instruction files as a basis, parallel tasks can be established quickly.
Optionally, before the parallel tasks are executed, a configuration file may also be read. The configuration file specifies the verification type and the verification timestamp range, so that the specific verification type and timestamp range are determined at verification time.
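The concurrent execution and automatic retry that the MapReduce framework provides can be imitated on one machine for illustration. The sketch below is not MapReduce code: a thread pool stands in for the Map nodes, and the retry loop stands in for the framework rescheduling a failed task on another node. The function names and the default of three attempts are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, argument, max_attempts=3):
    """Re-run a failed task, loosely imitating framework-level retries."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(argument)
        except Exception as error:  # any failure triggers another attempt
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def run_map_tasks(task, instruction_files, workers=4):
    """Run one task per instruction file concurrently and gather results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(run_with_retry, task, name)
            for name in instruction_files
        }
        return {name: future.result() for name, future in futures.items()}
```

Here `task` would be a function that reads the named instruction file and executes its rules; a task that fails on its first attempt is simply run again, so a transient failure does not lose the rule's result.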
In one of the embodiments, an instruction file corresponds to one or more verification rules.
In this embodiment, an instruction file can correspond to a single verification rule or to several. If it corresponds to a single rule, the parallel task verifies against that rule; if it corresponds to several rules, the parallel task can verify against all of them in parallel, which improves the processing efficiency of the verification rules.
Optionally, the verification rules in one instruction file all belong to the same attribute type.
In one of the embodiments, each instruction file corresponds to one parallel task.
In this embodiment, each instruction file corresponds to one parallel task and is processed by that task, so that the instruction files can be processed in parallel, which improves the efficiency of processing them.
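One way to realize "rules of the same attribute type share an instruction file, and each instruction file gets one parallel task" is to group the rules before the tasks are created. This sketch is illustrative; the `attribute_type` field and the file naming scheme are hypothetical.

```python
from collections import defaultdict


def group_rules_into_instruction_files(rules):
    """Group rules by attribute type; each group becomes one instruction file."""
    grouped = defaultdict(list)
    for rule in rules:
        grouped[rule["attribute_type"]].append(rule["rule_name"])
    # Hypothetical naming: one file per attribute type, one task per file.
    return {f"{attribute}.json": names for attribute, names in grouped.items()}


rules = [
    {"rule_name": "voltage_range", "attribute_type": "voltage"},
    {"rule_name": "voltage_not_null", "attribute_type": "voltage"},
    {"rule_name": "current_range", "attribute_type": "current"},
]
instruction_files = group_rules_into_instruction_files(rules)
```

With this grouping, the number of parallel tasks equals the number of instruction files, and rules that read the same attribute are evaluated by the same task.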
In accordance with the above method for verifying power grid data quality, the present invention also provides a system for verifying power grid data quality, embodiments of which are described in detail below.
Fig. 2 is a structural diagram of the system for verifying power grid data quality in one embodiment of the present invention. The system in this embodiment comprises:
A data storage unit 210, configured to obtain raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
A task creation unit 220, configured to create multiple parallel tasks;
A query index unit 230, configured to obtain, in each parallel task, the check field of a target verification rule, and to search the first query index table of the raw power grid data records according to the check field to obtain the first raw power grid data record corresponding to the check field;
An extraction and comparison unit 240, configured to extract the comparison data and reference data in the first raw power grid data record and to verify the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;
A result output unit 250, configured to output the verification results of the multiple parallel tasks.
In one of the embodiments, as shown in Fig. 3, the system for verifying power grid data quality further comprises an index establishing unit 260, configured to establish the first query index table in the distributed storage system; the primary key of the first query index table is the check field of the various verification rules, and the column value of the first query index table is the primary key of the raw power grid data records.
In one of the embodiments, the query index unit 230 is further configured to obtain, in each parallel task, the timestamp range of the target verification rule; to search the second query index table of the raw power grid data records according to the timestamp range to obtain the second raw power grid data record corresponding to the timestamp range; to extract the comparison data and reference data in the second raw power grid data record; and to verify the extracted comparison data against the extracted reference data, wherein the second query index table is stored in the distributed storage system.
In one of the embodiments, the index establishing unit 260 is further configured to establish the second query index table in the distributed storage system; the primary key of the second query index table is the timestamp of the various verification rules, and the column value of the second query index table is the primary key of the raw power grid data records.
In one of the embodiments, as shown in Fig. 4, the system for verifying power grid data quality further comprises a file index unit 270, configured to establish an index file in a distributed file system according to the first query index table and the second query index table; to read the index file into memory; to read the operation log file that the distributed file system keeps for the index file; to apply the operation records in the operation log file to the in-memory index; to write the in-memory index back to the index file; to load raw power grid data records in batches according to the index file into which the in-memory index has been written; and, for each batch, to verify the comparison data against the reference data in that batch's raw power grid data records.
In one of the embodiments, as shown in Fig. 5, the system for verifying power grid data quality further comprises an index adjustment unit 280, configured, when an incremental power grid data record is detected, to generate a first query index entry from the check field of the incremental record and add it to the first query index table, and to generate a second query index entry from the timestamp of the incremental record and add it to the second query index table.
In one of the embodiments, the task creation unit 220 creates multiple parallel tasks in the MapReduce parallel computing framework, has each parallel task read the instruction file of the verification rules established in the distributed file system, and configures the parameters and processing logic for executing the verification rules for each parallel task according to its instruction file.
In one of the embodiments, an instruction file corresponds to one or more verification rules.
In one of the embodiments, each instruction file corresponds to one parallel task.
The system for verifying power grid data quality of the present invention corresponds one-to-one with the method for verifying power grid data quality of the present invention; the technical features and advantages described in the embodiments of the method apply equally to the embodiments of the system.
In accordance with the above method for verifying power grid data quality, embodiments of the present invention also provide a readable storage medium and a verification device. The readable storage medium stores an executable program which, when executed by a processor, implements the steps of the above method. The verification device comprises a memory, a processor, and an executable program that is stored in the memory and can run on the processor; when the processor executes the program, the steps of the above method are implemented.
In a specific embodiment, the method for verifying power grid data quality is based on distributed storage and parallel processing, and solves the problems of existing methods based on relational database systems: high computation latency, difficulty of scaling, and low cost-effectiveness.
The main ideas of the technical solution adopted by the present invention are:
All raw data records are stored using a distributed storage method;
Check fields are indexed using a non-primary-key indexing method. During verification, the index table is searched according to the check field involved in the verification rule to obtain the primary key of the corresponding raw data record; the raw data record table is then searched with that primary key to obtain the raw data record, and the comparison field is extracted and compared;
An index on the creation timestamp of the raw data records is built using HBase. For incremental data quality verification, or for time-window-based data quality verification at fine time granularity, the raw data record table is queried by timestamp range to determine the range of data to be verified, and verification is then performed;
HDFS is used to store the secondary index file and the operation log file of the data records, so that verification data can be loaded quickly during full raw data quality verification, which improves verification performance. During full raw data quality verification, the secondary index file is read into memory, the operation log is read and applied to the in-memory index, and verification is then performed based on the in-memory index;
The verification rules are executed quickly using MapReduce-based parallelization.
Further, the distributed storage method is based on HBase, which supports storage of massive amounts of verification data and can be scaled out as needed. Further, the verification rules are parallelized verification rules based on MapReduce, so the system can be scaled according to the volume of verification data and the number of verification rules, giving controllable response performance and good cost-effectiveness.
Further, check fields are indexed using a non-primary-key indexing method, so that verification rules on non-primary-key fields can be queried and processed.
Further, the check field is the primary key of the original data record or any attribute column; the comparison field is a field corresponding to the check field, which may be the check field itself or another field.
Further, the creation timestamps of original data records are indexed. For incremental data quality verification, or for fine-grained data quality verification based on a time window, the timestamp index table is queried by timestamp to obtain the primary keys of original data records, and the original data record table is then queried to obtain the original data records for verification.
Further, HDFS secondary index files are built for the full original data and an operation log is built for incremental data. During verification of full historical data, the HDFS secondary index file is read into memory, the operation log is applied to the in-memory index, and verification is then performed against the in-memory index.
Further, instruction files are created for all verification rules. An instruction file contains all the parameters needed to execute its verification rules, including the rule name, the rule execution logic identifier, the input data table, and the output data table. A Map task reads the corresponding instruction file, obtains the parameters needed to execute the corresponding verification rules, and calls the corresponding processing logic to perform verification.
Further, each instruction file corresponds to one or more verification rules, and the execution parameters of the verification rules are written in the instruction file; these include the rule name, the rule execution logic identifier, the input data table, and the output data table.
Further, each instruction file is processed by one Map task.
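As an illustration of what such an instruction file might contain, the following minimal sketch serializes one as JSON and checks that each rule carries the parameters listed above. The JSON format, the key names (`rule_name`, `logic_id`, `input_table`, `output_table`), and the sample values are assumptions made for illustration; the specification names the kinds of parameters but not a file format.

```python
import json

# Hypothetical instruction file content. The specification lists the kinds of
# parameters but does not define a format or key names; these are assumed.
INSTRUCTION_TEXT = """
{
  "rules": [
    {"rule_name": "meter_id_consistency",
     "logic_id": "equals",
     "input_table": "raw_records",
     "output_table": "check_results"}
  ]
}
"""

REQUIRED_KEYS = {"rule_name", "logic_id", "input_table", "output_table"}


def load_instruction(text):
    """Parse an instruction file and confirm every rule has the required parameters."""
    rules = json.loads(text)["rules"]
    for rule in rules:
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            raise ValueError(f"rule is missing parameters: {sorted(missing)}")
    return rules


rules = load_instruction(INSTRUCTION_TEXT)
```

Keeping one or more rules per file matches the description that a single Map task may process one or several verification rules.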
The solution of the present invention verifies power grid data quality efficiently and scalably. First, power grid data is stored in a distributed manner, which gives the system good scalability. Second, auxiliary query indexes are built on the fields involved in the verification rules, which supports efficient query processing when rules are executed. Third, a MapReduce-based parallel processing method for verification rules is designed so that the verification rules can be processed in parallel, effectively improving system response performance.
HBase is a distributed storage system in the Hadoop ecosystem. The distributed file system HDFS (Hadoop Distributed File System) lacks support for storing structured and semi-structured data and for random reads and writes. Built on top of HDFS, HBase provides a distributed solution to the problem of storing and accessing large-scale structured and semi-structured data. HBase manages large data tables in a column-oriented storage model; it can store and manage more than billions of data records, each of which can contain more than a million columns. HBase aims to provide random, real-time read and write access, together with high scalability, high availability, fault tolerance, load balancing, and real-time data query capability.
HBase stores its underlying data in HDFS and therefore depends entirely on the underlying HDFS. Because HDFS provides a multi-replica data storage mechanism together with robust data node failure detection and node failure recovery mechanisms, HBase naturally inherits the high reliability and fault tolerance of HDFS data storage.
Hadoop MapReduce provides a large but well-designed software framework for distributed data storage and parallel computing. It can automatically manage the storage of distributed mass data, partition the data to be computed, schedule computing tasks, distribute and execute subtasks on cluster nodes, and collect the results. Many low-level details of parallel computing, such as distributed data storage, data communication, and fault tolerance, are handled by the system, which greatly reduces the burden on software developers.
As shown in Fig. 6, the present invention stores data in the distributed data storage and management system HBase. Original data records are stored in HBase so that they can be queried quickly by primary key. Query indexes are built on the check fields involved in the verification rules so that records can be queried quickly by check field value. A timestamp-based secondary index is built for original data records to support data quality verification based on a time window. For the accumulated full historical data, index files are also built and stored on the distributed file system HDFS so that they can be loaded quickly during batch data quality verification, which avoids a full table scan of HBase. For incremental data flowing in in real time, an operation log is kept, which solves the problem of maintaining the index files when data records are added, deleted, or modified; the operation log and the index files are merged periodically to reduce the merge overhead during batch data quality verification. Verification rules are executed in parallel, with each parallel task processing one or more verification rules.
The flow for storing and indexing batch data comprises the following steps:
(1) The reference data table and the comparison data table to be verified, both in CSV format, are stored in HBase. The primary key of an original data record serves as the row key of the HBase table, and the non-primary-key attributes of the record serve as columns of the HBase table, with different attributes placed in different column families. Because HBase storage is column-oriented (data in the same column family is stored together), this improves response performance when a particular column is queried;
(2) The query index table based on the check fields of the verification rules is stored in HBase. The check field serves as the row key of the HBase query index table and the primary keys of the original data records serve as column names of the query index table, all within the same column family. This data layout makes it easy to add, delete, modify, and query the records of the query index table;
(3) The query index table based on data record timestamps is stored in HBase. The data record timestamp serves as the row key of the HBase query index table, and the primary keys of the original data records are stored as column values of the query index table;
(4) When the query index table based on the check fields of the verification rules is stored in HBase, it is also written to the index file on HDFS.
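The table layout in steps (1) to (3) can be sketched with plain dictionaries standing in for the HBase tables. This is only an illustration under that assumption: the names `record_table`, `field_index`, `timestamp_index`, `store_record`, and `lookup_by_field`, as well as the sample records, are hypothetical, and no HBase client is involved.

```python
from collections import defaultdict

# In-memory stand-ins for the three HBase tables (an assumption for
# illustration; a real deployment would use an HBase client instead).
record_table = {}                    # record primary key -> {column: value}
field_index = defaultdict(dict)      # check-field value -> {record primary key: ""}
timestamp_index = defaultdict(list)  # record timestamp -> [record primary keys]


def store_record(row_key, columns, check_field, timestamp):
    """Store one record and maintain both index tables."""
    # Step (1): the record primary key is the row key of the record table.
    record_table[row_key] = dict(columns)
    # Step (2): the check-field value is the index row key and the record
    # primary keys act as column names.
    field_index[columns[check_field]][row_key] = ""
    # Step (3): the timestamp is the index row key and the record primary
    # keys are stored as values.
    timestamp_index[timestamp].append(row_key)


def lookup_by_field(value):
    """Check-field value -> primary keys -> original records."""
    return [record_table[key] for key in field_index.get(value, {})]


store_record("r1", {"A": "line-7", "B": "110"}, check_field="A", timestamp=1000)
store_record("r2", {"A": "line-7", "B": "220"}, check_field="A", timestamp=1005)
matches = lookup_by_field("line-7")
```

Using the record primary keys as column names in the check-field index means that adding or removing one record touches a single cell of the index row rather than rewriting a list of keys.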
The flow for storing and indexing incremental data comprises the following steps:
(1) The incremental data records are inserted into the original data record table in HBase;
(2) The query index entries of the incremental data records, based on the check fields of the verification rules, are inserted into the HBase query index table;
(3) The timestamp-based query index entries of the incremental data records are inserted into the HBase secondary index;
(4) The operation log entries for the incremental data records are appended to the operation log file on HDFS.
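A minimal sketch of the four incremental steps follows, again with dictionaries standing in for the HBase tables and a list standing in for the append-only log file on HDFS. The log entry layout (`op`, `field_value`, `row_key`) is an assumption; the specification does not define the format of an operation log entry.

```python
import json

# Dictionaries and a list stand in for the HBase tables and the HDFS
# operation log file (assumptions for illustration).
record_table = {}
field_index = {}
timestamp_index = {}
operation_log = []


def insert_incremental(row_key, columns, check_field, timestamp):
    """Apply steps (1) to (4) for one incremental record."""
    record_table[row_key] = dict(columns)                       # step (1)
    value = columns[check_field]
    field_index.setdefault(value, set()).add(row_key)           # step (2)
    timestamp_index.setdefault(timestamp, set()).add(row_key)   # step (3)
    # Step (4): append a log entry so the HDFS index file can be brought up
    # to date later without rescanning HBase. The entry layout is hypothetical.
    entry = {"op": "put", "field_value": value, "row_key": row_key}
    operation_log.append(json.dumps(entry))


insert_incremental("r3", {"A": "line-9"}, check_field="A", timestamp=2000)
```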
The flow for merging the operation log into the index file comprises the following steps:
(1) The index file on HDFS is read into memory;
(2) The operation log file on HDFS is read, and its operations are applied to the in-memory index one by one;
(3) The in-memory index is written back to the index file on HDFS;
(4) The operation log file on HDFS is deleted.
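The merge steps above can be sketched as follows. Local files in a temporary directory stand in for the HDFS index and operation log files, and the JSON encodings, the `put`/`delete` operation names, and the function name `merge_log_into_index` are assumptions for illustration rather than formats defined by the specification.

```python
import json
import os
import tempfile


def merge_log_into_index(index_path, log_path):
    """Replay the operation log onto the index file, then remove the log."""
    # (1) Read the index file into memory: check-field value -> set of record keys.
    with open(index_path) as f:
        index = {value: set(keys) for value, keys in json.load(f).items()}
    # (2) Apply the logged operations to the in-memory index one by one.
    with open(log_path) as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            keys = index.setdefault(entry["field_value"], set())
            if entry["op"] == "put":
                keys.add(entry["row_key"])
            elif entry["op"] == "delete":
                keys.discard(entry["row_key"])
            if not keys:
                del index[entry["field_value"]]
    # (3) Write the in-memory index back to the index file.
    with open(index_path, "w") as f:
        json.dump({value: sorted(keys) for value, keys in index.items()}, f)
    # (4) Delete the operation log file.
    os.remove(log_path)
    return index


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "index.json")
    log_path = os.path.join(tmp, "oplog.jsonl")
    with open(index_path, "w") as f:
        json.dump({"x": ["r1"]}, f)
    with open(log_path, "w") as f:
        f.write(json.dumps({"op": "put", "field_value": "x", "row_key": "r2"}) + "\n")
        f.write(json.dumps({"op": "delete", "field_value": "x", "row_key": "r1"}) + "\n")
    merged = merge_log_into_index(index_path, log_path)
    log_removed = not os.path.exists(log_path)
    with open(index_path) as f:
        rewritten = json.load(f)
```

Replaying the log in order matters: a `put` followed by a `delete` of the same key must leave the key absent, which a set-based index handles naturally.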
The parallelized verification rule processing flow is as follows:
(1) The verification type and the verification timestamp range are written to a configuration file;
(2) A MapReduce job is started to begin data quality verification;
(3) Each Map task reads one instruction file to obtain parameters such as the rule name, the rule execution logic identifier, the input data table, and the output data table, and reads the verification type and the verification timestamp range from the configuration file;
(4) For batch verification, verification follows the batch data single-rule verification flow;
(5) For verification based on a time window, verification follows the incremental data single-rule verification flow, according to the timestamp range.
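Steps (3) to (5) amount to a dispatch on the verification type. The sketch below shows that dispatch only; the configuration keys (`check_type`, `start_ts`, `end_ts`) and the two placeholder procedures are hypothetical, and the function `map_task` is an ordinary Python function rather than a Hadoop Map task.

```python
def batch_check(rule):
    """Placeholder for the batch data single-rule verification flow."""
    return f"batch:{rule['rule_name']}"


def window_check(rule, start_ts, end_ts):
    """Placeholder for the incremental data single-rule verification flow."""
    return f"window:{rule['rule_name']}:{start_ts}-{end_ts}"


def map_task(instruction, config):
    """Process the rules of one instruction file under the shared configuration."""
    results = []
    for rule in instruction["rules"]:
        if config["check_type"] == "batch":
            results.append(batch_check(rule))
        elif config["check_type"] == "window":
            results.append(window_check(rule, config["start_ts"], config["end_ts"]))
        else:
            raise ValueError(f"unknown verification type: {config['check_type']}")
    return results


instruction = {"rules": [{"rule_name": "rule_a"}]}
batch_result = map_task(instruction, {"check_type": "batch"})
window_result = map_task(
    instruction, {"check_type": "window", "start_ts": 100, "end_ts": 200}
)
```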
The batch data single-rule verification flow is as follows:
(1) The query index table on HDFS is read into memory, the operation log is read and applied to the in-memory query index table, and the operation log file is deleted;
(2) The in-memory query index table is traversed and rule verification is performed.
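Step (2) can be illustrated with a simple equality rule. The sketch assumes that each in-memory index maps a check-field value directly to the value of the comparison field, so that no HBase lookup is needed during traversal; the specification does not spell out the index entry layout, and the name `batch_check` and the sample values are hypothetical.

```python
def batch_check(reference_index, comparison_index):
    """Return (check_value, expected, actual) for every comparison entry that fails.

    Both arguments map a check-field value to a comparison-field value
    (an assumed index layout used only for this illustration).
    """
    failures = []
    for check_value, actual in comparison_index.items():
        # A missing reference entry yields None and is reported as a failure.
        expected = reference_index.get(check_value)
        if expected != actual:
            failures.append((check_value, expected, actual))
    return failures


reference_index = {"m1": "110kV", "m2": "220kV"}
comparison_index = {"m1": "110kV", "m2": "35kV", "m3": "10kV"}
failures = batch_check(reference_index, comparison_index)
```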
The incremental data single-rule verification flow is as follows:
(1) The timestamp index table is queried by start timestamp and end timestamp to obtain the IDs of all records within the incremental time window; the original data record table is then queried to obtain the corresponding set of check fields;
(2) The secondary index table is queried by the field values in the check field set to obtain the comparison field values, and verification is performed.
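The two steps above chain three lookups: timestamp index, record table, and check-field index. The following sketch traces that chain with dictionaries standing in for the HBase tables and an equality rule as the example check; all table names, field names, and sample values are assumptions for illustration.

```python
def window_check(start_ts, end_ts, timestamp_index, record_table,
                 check_field, reference_index, reference_table, compare_field):
    """Verify records whose timestamp lies in [start_ts, end_ts]."""
    failures = []
    # Step (1): timestamp range -> record IDs -> records and their check-field values.
    row_keys = [key
                for ts, keys in sorted(timestamp_index.items())
                if start_ts <= ts <= end_ts
                for key in keys]
    for row_key in row_keys:
        record = record_table[row_key]
        # Step (2): check-field value -> reference record keys via the secondary
        # index, then compare the comparison field of both records.
        for ref_key in reference_index.get(record[check_field], []):
            expected = reference_table[ref_key][compare_field]
            if record[compare_field] != expected:
                failures.append((row_key, ref_key))
    return failures


timestamp_index = {100: ["c1"], 150: ["c2"], 300: ["c3"]}
record_table = {
    "c1": {"meter": "m1", "voltage": "110kV"},
    "c2": {"meter": "m2", "voltage": "35kV"},
    "c3": {"meter": "m1", "voltage": "0kV"},
}
reference_index = {"m1": ["b1"], "m2": ["b2"]}
reference_table = {
    "b1": {"meter": "m1", "voltage": "110kV"},
    "b2": {"meter": "m2", "voltage": "220kV"},
}
failures = window_check(100, 200, timestamp_index, record_table,
                        "meter", reference_index, reference_table, "voltage")
```

Record `c3` also disagrees with its reference, but it falls outside the window and is therefore not examined, which is the point of restricting the check by timestamp range.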
The above incremental data single-rule verification flow is also applicable to the verification of original data.
As shown in Fig. 7, the distributed storage and indexing method of the present invention is embodied as follows. To process a large number of data records and a large number of verification rules quickly, in addition to storing the original data tables in HBase, dedicated fast data index tables are designed for the fields involved in the verification rules and stored in HBase. For example, in original data tables 1 and 2, the primary key (the rowkey field) is the ID of each record. If field A of original data table 1 and field B of original data table 2 need to be verified, index tables are built for field A and for field B respectively so that they can be looked up quickly during verification. To support incremental data quality verification and data quality verification at a fine time granularity based on a time window, a timestamp query index is built for the original data record table so that the range of data to be verified can be delimited by timestamp range. As shown in Fig. 8, to improve verification performance for full historical data, auxiliary HDFS index files and an operation log are built for the data record table so that verification data can be loaded into memory quickly for full data verification.
The parallel processing of verification rules in the present invention is embodied as follows. To process a large number of data records and a large number of verification rules quickly, a MapReduce-based parallel execution mechanism is used. As shown in Fig. 9, the ID, parameters, and other settings of each verification rule are first written to separate HDFS files (referred to as instruction files), and the MapReduce job contains implementations of the processing modules for all of these verification rules. Under the default operating mechanism of Hadoop MapReduce, each Map task reads and processes only one instruction file, and the choice of processing module is determined by the instruction file that the task reads.
In this way, all Map nodes in the cluster can execute different verification rules concurrently. If a failure occurs during execution, Hadoop MapReduce automatically starts new Map tasks on other nodes to retry these verification rules. Load balancing, fault tolerance, and similar concerns across the whole parallel process are handled by the Hadoop MapReduce framework.
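The execution model described above, one task per instruction file with automatic retry on failure, can be imitated on a single machine. The sketch below uses a thread pool from the Python standard library in place of Hadoop MapReduce, so it illustrates the idea only: the retry happens in the same process rather than on another node, and the names `run_with_retry`, `run_all`, and `flaky_task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, instruction, attempts=2):
    """Run one task, retrying on failure (Hadoop would reschedule it on another node)."""
    last_error = None
    for _ in range(attempts):
        try:
            return task(instruction)
        except Exception as error:
            last_error = error
    raise last_error


def run_all(task, instructions, workers=4):
    """Process each instruction in its own task, mirroring one Map task per file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_with_retry, task, ins) for ins in instructions]
        return [future.result() for future in futures]


seen = set()


def flaky_task(instruction):
    """Example task that fails once for rule_b to exercise the retry path."""
    name = instruction["rule_name"]
    if name == "rule_b" and name not in seen:
        seen.add(name)
        raise RuntimeError("simulated node failure")
    return f"checked:{name}"


results = run_all(flaky_task, [{"rule_name": "rule_a"}, {"rule_name": "rule_b"}])
```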
The present invention implements a prototype system based on existing open-source software. Distributed storage and indexing use HBase, and parallel processing of verification rules uses HDFS and MapReduce; these three pieces of software are not themselves part of the present invention. The prototype system was tested against an existing relational data management system using real power grid business data and verification rules. The prototype outperformed the traditional relational data management system in response performance and scalability, which demonstrates the effectiveness of the power grid data quality verification method based on distributed storage and parallel processing.
The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features is not contradictory, it should be considered within the scope of this specification.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a readable storage medium and, when executed, performs the steps described in the above methods. The storage medium includes ROM/RAM, magnetic disks, optical discs, and the like.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be determined by the appended claims.

Claims (10)

1. A verification method for power grid data quality, characterized by comprising the following steps:
obtaining power grid original data records and storing the power grid original data records in a distributed storage system, wherein the power grid original data records include comparison data records to be verified and reference data records used for verification;
creating multiple parallel tasks and, in each parallel task, performing the following operations: obtaining the check field of a target verification rule; searching a first query index table of the power grid original data records according to the check field to obtain a first power grid original data record corresponding to the check field; extracting the comparison data and the reference data from the first power grid original data record; and verifying the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;
outputting the verification results of the multiple parallel tasks.
2. The verification method for power grid data quality according to claim 1, characterized by further comprising the following step:
establishing the first query index table in the distributed storage system, wherein the primary key of the first query index table is the check field of each verification rule, and the column values of the first query index table are the primary keys of the power grid original data records.
3. The verification method for power grid data quality according to claim 1, characterized by further comprising the following step:
in each parallel task, obtaining the timestamp range of the target verification rule; searching a second query index table of the power grid original data records according to the timestamp range to obtain a second power grid original data record corresponding to the timestamp range; extracting the comparison data and the reference data from the second power grid original data record; and verifying the extracted comparison data against the extracted reference data, wherein the second query index table is stored in the distributed storage system.
4. The verification method for power grid data quality according to claim 3, characterized by further comprising the following step:
establishing the second query index table in the distributed storage system, wherein the primary key of the second query index table is the timestamp of each verification rule, and the column values of the second query index table are the primary keys of the power grid original data records.
5. The verification method for power grid data quality according to claim 3, characterized by further comprising the following steps:
establishing an index file in a distributed file system according to the first query index table and the second query index table; reading the index file into memory; reading the operation log file of the distributed file system for the index file; applying the operation records in the operation log file to the in-memory index; writing the in-memory index to the index file; loading the power grid original data records in batch according to the index file written from the in-memory index; and verifying the comparison data against the reference data in the batch of power grid original data records.
6. The verification method for power grid data quality according to claim 3, characterized by further comprising the following step:
when a power grid incremental data record is detected, generating a first query index entry, based on the check field, for the power grid incremental data record and adding it to the first query index table, and generating a second query index entry, based on the timestamp, for the power grid incremental data record and adding it to the second query index table.
7. The verification method for power grid data quality according to claim 3, characterized in that the step of creating multiple parallel tasks comprises the following steps:
creating multiple parallel tasks in the MapReduce parallel computing framework; establishing instruction files in the distributed file system for all verification rules; having each parallel task read its corresponding instruction file; and configuring, for each parallel task, the parameters and processing logic for executing the verification rules according to the corresponding instruction file.
8. The verification method for power grid data quality according to claim 7, characterized in that each instruction file corresponds to one or more verification rules.
9. The verification method for power grid data quality according to claim 7, characterized in that each instruction file corresponds to one parallel task.
10. A verification system for power grid data quality, characterized by comprising:
a data storage unit, configured to obtain power grid original data records and store the power grid original data records in a distributed storage system, wherein the power grid original data records include comparison data records to be verified and reference data records used for verification;
a task creation unit, configured to create multiple parallel tasks;
a query index unit, configured to, in each parallel task, obtain the check field of a target verification rule and search a first query index table of the power grid original data records according to the check field to obtain a first power grid original data record corresponding to the check field;
an extraction and comparison unit, configured to extract the comparison data and the reference data from the first power grid original data record and to verify the extracted comparison data against the extracted reference data, wherein the first query index table is stored in the distributed storage system;
a result output unit, configured to output the verification results of the multiple parallel tasks.
CN201710876201.1A 2017-09-25 2017-09-25 The method of calibration and system of electric network data quality Pending CN107679146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710876201.1A CN107679146A (en) 2017-09-25 2017-09-25 The method of calibration and system of electric network data quality

Publications (1)

Publication Number Publication Date
CN107679146A true CN107679146A (en) 2018-02-09

Family

ID=61138126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710876201.1A Pending CN107679146A (en) 2017-09-25 2017-09-25 The method of calibration and system of electric network data quality

Country Status (1)

Country Link
CN (1) CN107679146A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595664A (en) * 2018-04-28 2018-09-28 尚谷科技(天津)有限公司 A kind of agricultural data monitoring method under hadoop environment
CN108762933A (en) * 2018-05-31 2018-11-06 成都四方伟业软件股份有限公司 Quality of data method of calibration and device
CN109462517A (en) * 2018-10-24 2019-03-12 云南电网有限责任公司信息中心 A kind of method, system and the equipment of the data monitoring towards digital electric network business
CN109460995A (en) * 2018-09-26 2019-03-12 平安国际融资租赁有限公司 Financial accreditation method, apparatus, computer equipment and storage medium
CN109635300A (en) * 2018-12-14 2019-04-16 泰康保险集团股份有限公司 Data verification method and device
CN110704404A (en) * 2019-08-29 2020-01-17 苏宁云计算有限公司 Data quality checking method, device and system
CN111209597A (en) * 2018-11-22 2020-05-29 迈普通信技术股份有限公司 Data verification method and application system
CN112540987A (en) * 2020-12-08 2021-03-23 湖州中朔信息技术有限公司 Big data management system of distribution and utilization electricity based on data mart
CN112667618A (en) * 2020-12-30 2021-04-16 湖南长城医疗科技有限公司 Public area sanitation platform quality control system and method
CN112799945A (en) * 2021-01-29 2021-05-14 中国工商银行股份有限公司 Batch file verification method and device
CN112860769A (en) * 2021-03-10 2021-05-28 广东电网有限责任公司 Energy planning data management system
CN112910086A (en) * 2021-01-18 2021-06-04 国网山东省电力公司青岛供电公司 Intelligent substation data verification method and system
CN115099713A (en) * 2022-08-01 2022-09-23 武汉胜天地消防工程有限公司 Smart power grid operation log collection and analysis management system based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024046A (en) * 2010-12-14 2011-04-20 成都市华为赛门铁克科技有限公司 Data repeatability checking method and device as well as system
CN102799746A (en) * 2012-05-07 2012-11-28 山东电力集团公司青岛供电公司 Power grid information checking method and system, and power grid planning auxiliary system
CN104391903A (en) * 2014-11-14 2015-03-04 广州科腾信息技术有限公司 Distributed storage and parallel calculation-based power grid data quality detection method
US20160314026A1 (en) * 2015-04-27 2016-10-27 Microsoft Technology Licensing, Llc Establishing causality order of computer trace records

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595664B (en) * 2018-04-28 2022-05-31 上海左岸芯慧电子科技有限公司 Agricultural data monitoring method in hadoop environment
CN108595664A (en) * 2018-04-28 2018-09-28 尚谷科技(天津)有限公司 A kind of agricultural data monitoring method under hadoop environment
CN108762933A (en) * 2018-05-31 2018-11-06 成都四方伟业软件股份有限公司 Quality of data method of calibration and device
CN109460995A (en) * 2018-09-26 2019-03-12 平安国际融资租赁有限公司 Financial accreditation method, apparatus, computer equipment and storage medium
CN109460995B (en) * 2018-09-26 2024-02-06 平安国际融资租赁有限公司 Financial certification method, device, computer equipment and storage medium
CN109462517A (en) * 2018-10-24 2019-03-12 云南电网有限责任公司信息中心 A kind of method, system and the equipment of the data monitoring towards digital electric network business
CN111209597A (en) * 2018-11-22 2020-05-29 迈普通信技术股份有限公司 Data verification method and application system
CN109635300A (en) * 2018-12-14 2019-04-16 泰康保险集团股份有限公司 Data verification method and device
CN109635300B (en) * 2018-12-14 2023-12-19 泰康保险集团股份有限公司 Data verification method and device
CN110704404B (en) * 2019-08-29 2023-04-28 苏宁云计算有限公司 Data quality verification method, device and system
CN110704404A (en) * 2019-08-29 2020-01-17 苏宁云计算有限公司 Data quality checking method, device and system
CN112540987A (en) * 2020-12-08 2021-03-23 湖州中朔信息技术有限公司 Big data management system of distribution and utilization electricity based on data mart
CN112667618A (en) * 2020-12-30 2021-04-16 湖南长城医疗科技有限公司 Public area sanitation platform quality control system and method
CN112667618B (en) * 2020-12-30 2023-06-06 湖南长城医疗科技有限公司 Public area sanitary platform quality control system and method
CN112910086A (en) * 2021-01-18 2021-06-04 国网山东省电力公司青岛供电公司 Intelligent substation data verification method and system
CN112799945A (en) * 2021-01-29 2021-05-14 中国工商银行股份有限公司 Batch file verification method and device
CN112799945B (en) * 2021-01-29 2024-03-15 中国工商银行股份有限公司 Batch file verification method and device
CN112860769A (en) * 2021-03-10 2021-05-28 广东电网有限责任公司 Energy planning data management system
CN115099713A (en) * 2022-08-01 2022-09-23 武汉胜天地消防工程有限公司 Smart power grid operation log collection and analysis management system based on big data
CN115099713B (en) * 2022-08-01 2023-04-07 河南蓝通信息技术有限公司 Smart power grid operation log acquisition and analysis management system based on big data

Similar Documents

Publication Publication Date Title
CN107679146A (en) The method of calibration and system of electric network data quality
CN108255712B (en) Test system and test method of data system
CN102968374B (en) A kind of data warehouse method of testing
CN104866426A (en) Software test integrated control method and system
CN104866580A (en) Method for quickly detecting impact caused by database modification to current service
CN104391903A (en) Distributed storage and parallel calculation-based power grid data quality detection method
CN104036029B (en) Large data consistency control methods and system
US10331657B1 (en) Contention analysis for journal-based databases
CN104252481A (en) Dynamic check method and device for consistency of main and salve databases
CN104239377A (en) Platform-crossing data retrieval method and device
CN107491487A (en) A kind of full-text database framework and bitmap index establishment, data query method, server and medium
CN106682036A (en) Data exchange system and exchange method thereof
CN102236672A (en) Method and device for importing data
CN110891000B (en) GPU bandwidth performance detection method, system and related device
CN111240968A (en) Automatic test management method and system
CN108664388A (en) Dynamic field data return to test system, method, electronic equipment and the readable storage medium storing program for executing of interface
CN114600094A (en) Generating hash trees for database architectures
CN104778179A (en) Data migration test method and system
CN108519856A (en) Based on the data block copy laying method under isomery Hadoop cluster environment
CN107122238A (en) Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame
CN111026709B (en) Data processing method and device based on cluster access
CN105868956A (en) Data processing method and device
CN112948473A (en) Data processing method, device and system of data warehouse and storage medium
CN105335459B (en) Consolidated accounts data pick-up method based on XBRL intelligence reporting platform
CN115329011A (en) Data model construction method, data query method, data model construction device and data query device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180209