CN107679146A - Method and system for verifying power grid data quality
- Publication number: CN107679146A
- Application number: CN201710876201.1A
- Authority: CN (China)
- Prior art keywords: data, data record, power network, search index, original data
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Abstract
The present invention relates to a method and system for verifying power grid data quality. Raw power grid data records are obtained and stored in a distributed storage system, which also stores a first search index table for those records. Multiple parallel tasks are created. In each parallel task, the check field of a target verification rule is obtained; the first search index table is searched by the check field to obtain the corresponding first raw power grid data record; the comparison data and reference data in that record are extracted; and the extracted comparison data is verified against the extracted reference data. Storing the power grid data records in a distributed manner gives the verification process good scalability, and the search index relating the check fields used by the verification rules to the data records supports efficient query processing when the rules are executed.
Description
Technical field
The present invention relates to the field of power grid technology, and in particular to a method and system for verifying power grid data quality.
Background
Traditional relational data management systems pursue a high degree of consistency and correctness. When faced with analysis demands on massive data, they rely on vertical scaling (scale up), that is, raising the capability of a single node by upgrading its hardware (CPU, memory, hard disk, etc.), which severely limits their scalability and performance.
As the scale of power grid business data and the complexity of data quality monitoring rules continue to grow, existing data quality monitoring systems built on traditional data management and computing platforms face a serious processing bottleneck. Their processing efficiency is low, they struggle to complete data quality monitoring and verification quickly, and they are increasingly unable to meet the needs of daily production management and business decision-making.
Summary of the invention
In view of this, and to address the low efficiency of data quality monitoring and verification in data quality monitoring systems based on traditional data management and computing platforms, it is necessary to provide a method and system for verifying power grid data quality.
A method for verifying power grid data quality comprises the following steps:
Obtaining raw power grid data records and storing them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
Creating multiple parallel tasks and, in each parallel task, performing the following operations: obtaining the check field of a target verification rule; searching the first search index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field; extracting the comparison data and reference data from the first raw power grid data record; and verifying the extracted comparison data against the extracted reference data, wherein the first search index table is stored in the distributed storage system;
Outputting the verification results of the multiple parallel tasks.
A system for verifying power grid data quality includes:
A data storage unit, configured to obtain raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
A task creation unit, configured to create multiple parallel tasks;
A search index unit, configured to obtain, in each parallel task, the check field of a target verification rule, and to search the first search index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field;
An extraction and comparison unit, configured to extract the comparison data and reference data from the first raw power grid data record and to verify the extracted comparison data against the extracted reference data, wherein the first search index table is stored in the distributed storage system;
A result output unit, configured to output the verification results of the multiple parallel tasks.
According to the above method and system for verifying power grid data quality, raw power grid data records are obtained and stored in a distributed storage system, which also stores the first search index table of the raw power grid data records. Multiple parallel tasks are created; in each parallel task, the check field of a target verification rule is obtained, the first search index table is searched by the check field to obtain the corresponding first raw power grid data record, the comparison data and reference data in that record are extracted, and the extracted comparison data is verified against the extracted reference data. In this scheme, storing the power grid data records in a distributed manner gives the verification process good scalability; the search index relating the check fields used by the verification rules to the data records supports efficient query processing when the rules are executed; and the multiple parallel tasks allow every verification rule to be processed in parallel, which improves the efficiency of power grid data quality verification.
A readable storage medium stores an executable program which, when executed by a processor, implements the steps of the above method for verifying power grid data quality.
A verification device includes a memory, a processor, and an executable program stored in the memory and runnable on the processor; the processor implements the steps of the above method for verifying power grid data quality when executing the program.
Based on the above method for verifying power grid data quality, the present invention also provides a readable storage medium and a verification device that implement the method by means of a program.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method for verifying power grid data quality in one embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the present invention;
Fig. 6 is an overall schematic diagram of verification in a specific embodiment of the present invention;
Fig. 7 is a schematic diagram of incremental data storage and indexing in a specific embodiment of the present invention;
Fig. 8 is a schematic diagram of batch historical data storage and indexing in a specific embodiment of the present invention;
Fig. 9 is a schematic diagram of parallel processing of verification rules in a specific embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to explain the invention and do not limit its scope of protection.
Fig. 1 is a schematic flowchart of the method for verifying power grid data quality in one embodiment of the present invention. The method in this embodiment comprises the following steps:
Step S110: Obtain raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
In this step, the distributed storage system stores the raw power grid data in a distributed manner, which makes it easy to add or delete power grid data and gives the verification process good scalability; the reference data records used for verification are the standard against which the comparison data records are verified;
Step S120: Create multiple parallel tasks and, in each parallel task, perform the following operations: obtain the check field of a target verification rule; search the first search index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field; extract the comparison data and reference data from the first raw power grid data record; and verify the extracted comparison data against the extracted reference data, wherein the first search index table is stored in the distributed storage system;
In this step, each parallel task performs verification according to a target verification rule; by searching the search index, it obtains the comparison data and reference data corresponding to the check field and then verifies them;
Step S130: Output the verification results of the multiple parallel tasks.
In this embodiment, raw power grid data records are obtained and stored in a distributed storage system, which also stores the first search index table of the raw power grid data records. Multiple parallel tasks are created; in each parallel task, the check field of a target verification rule is obtained, the first search index table is searched by the check field to obtain the corresponding first raw power grid data record, the comparison data and reference data in that record are extracted, and the extracted comparison data is verified against the extracted reference data. In this scheme, storing the power grid data records in a distributed manner gives the verification process good scalability; the search index relating the check fields used by the verification rules to the data records supports efficient query processing when the rules are executed; and the multiple parallel tasks allow every verification rule to be processed in parallel, which improves the efficiency of power grid data quality verification.
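The flow of steps S110 to S130 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: plain Python dictionaries stand in for the distributed storage system, a thread pool stands in for the parallel tasks, and the field names (`meter_id`, `comparison`, `reference`) and the tolerance-based comparison are hypothetical, since the text does not specify how comparison data is checked against reference data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the distributed storage system.
# raw_records: primary key -> record holding comparison and reference data.
raw_records = {
    "r1": {"meter_id": "M001", "comparison": 220.0, "reference": 220.0},
    "r2": {"meter_id": "M002", "comparison": 231.5, "reference": 220.0},
    "r3": {"meter_id": "M003", "comparison": 219.8, "reference": 219.8},
}
# first_index: check-field value -> primary keys of the matching raw records.
first_index = {"M001": ["r1"], "M002": ["r2"], "M003": ["r3"]}


def run_rule(rule):
    """One parallel task: index lookup, extraction, then comparison."""
    results = []
    for key in first_index.get(rule["check_field_value"], []):
        record = raw_records[key]
        # Assumed comparison: pass when the values agree within a tolerance.
        passed = abs(record["comparison"] - record["reference"]) <= rule["tolerance"]
        results.append((rule["name"], key, passed))
    return results


def verify(rules, max_workers=4):
    """Create one task per rule and collect the results of all tasks."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_task = list(pool.map(run_rule, rules))
    return [item for task in per_task for item in task]


rules = [
    {"name": "voltage_M001", "check_field_value": "M001", "tolerance": 0.5},
    {"name": "voltage_M002", "check_field_value": "M002", "tolerance": 0.5},
]
results = verify(rules)
```

Each task touches only the records its index lookup returns, which is the property that lets the rules run independently of one another.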
Optionally, the distributed storage system can be HBase. HBase provides big-table data management based on a column-oriented storage model and can store and manage billions of data records or more, each of which can contain millions of data columns or more. HBase offers random, real-time read and write access to data, along with strong scalability, high availability, fault tolerance, load balancing, and real-time data query capability.
In one embodiment, the method for verifying power grid data quality further comprises the following step:
Establishing the first search index table in the distributed storage system, wherein the primary keys of the first search index table are the check fields of the various verification rules, and the column values of the first search index table are the primary keys of the raw power grid data records.
In this embodiment, the first search index table can be established in the distributed storage system, with the check fields of the various verification rules as its primary keys and the primary keys of the raw power grid data records as its column values. Through the first search index table, the first raw power grid data record corresponding to a check field can be found quickly.
It should be noted that, once the primary key of the raw power grid data record corresponding to the check field has been obtained, the corresponding first raw power grid data record can be found by that primary key, and the comparison data and reference data can be extracted from it.
Optionally, the check field can be the primary key of a raw power grid data record or any attribute column. The extracted comparison data is the actual field corresponding to the check field, which can be the check field itself or another data field.
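The structure of the first search index table, and the two-step read it implies, can be illustrated with a short sketch. It assumes an in-memory dictionary in place of a table in the distributed storage system; the function names and the `(field name, field value)` key layout are hypothetical choices made for this example.

```python
from collections import defaultdict


def build_first_index(raw_records, check_fields):
    """Map (field name, field value) to the primary keys of matching records."""
    index = defaultdict(list)
    for primary_key, record in raw_records.items():
        for field in check_fields:
            if field in record:
                index[(field, record[field])].append(primary_key)
    return dict(index)


def lookup(index, raw_records, field, value):
    """Two-step read: index row first, then the raw records it points to."""
    return [raw_records[pk] for pk in index.get((field, value), [])]
```

The index row is keyed by the check field and stores only primary keys, so a rule that filters on a non-primary-key attribute avoids scanning the whole record table.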
In one embodiment, the method for verifying power grid data quality further comprises the following step:
In each parallel task, obtaining the timestamp range of a target verification rule; searching the second search index table of the raw power grid data records by the timestamp range to obtain the second raw power grid data record corresponding to the timestamp range; extracting the comparison data and reference data from the second raw power grid data record; and verifying the extracted comparison data against the extracted reference data, wherein the second search index table is stored in the distributed storage system.
In this embodiment, the second raw power grid data record can be looked up by timestamp, and the comparison data and reference data in it can be extracted and verified, which enables time-window-based verification of power grid data quality.
It should be noted that, when the second raw power grid data record is looked up by timestamp, the timestamp corresponds to the comparison data in the second raw power grid data record, and the reference data corresponds to the comparison data; the reference data has no direct relation to the timestamp.
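A timestamp range lookup of this kind can be sketched with a sorted list and binary search. The class below is a hypothetical in-memory stand-in for the second search index table; the half-open range convention (`start <= timestamp < end`) is an assumption of this example, not something the text specifies.

```python
import bisect


class TimestampIndex:
    """Sorted (timestamp, primary key) pairs, standing in for the second index table."""

    def __init__(self):
        self._entries = []

    def add(self, timestamp, primary_key):
        # insort keeps the list ordered by timestamp, then by primary key.
        bisect.insort(self._entries, (timestamp, primary_key))

    def range(self, start, end):
        """Return primary keys with start <= timestamp < end."""
        # A 1-tuple sorts before any 2-tuple that shares its first element.
        lo = bisect.bisect_left(self._entries, (start,))
        hi = bisect.bisect_left(self._entries, (end,))
        return [pk for _, pk in self._entries[lo:hi]]
```

The returned primary keys would then be used to fetch the raw records, exactly as with the check-field index.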
In one embodiment, the method for verifying power grid data quality further comprises the following step:
Establishing the second search index table in the distributed storage system, wherein the primary keys of the second search index table are the timestamps of the various verification rules, and the column values of the second search index table are the primary keys of the raw power grid data records.
In this embodiment, the second search index table can be established in the distributed storage system, with the timestamps of the various verification rules as its primary keys and the primary keys of the raw power grid data records as its column values. Through the second search index table, the second raw power grid data record corresponding to a timestamp can be found quickly.
It should be noted that, once the primary key of the raw power grid data record corresponding to the timestamp has been obtained, the corresponding second raw power grid data record can be found by that primary key, and the comparison data and reference data can be extracted from it.
In one embodiment, the method for verifying power grid data quality further comprises the following step:
Establishing an index file in a distributed file system according to the first search index table and the second search index table; reading the index file into memory; reading the operation log file that the distributed file system keeps for the index file; applying the operation records in the operation log file to the in-memory index; writing the in-memory index to the index file; loading batches of raw power grid data records according to the index file to which the in-memory index has been written; and verifying the comparison data against the reference data in each batch of raw power grid data records.
In this embodiment, an index file can be established in the distributed file system. When the quality of raw power grid data is verified, the index file is read into memory, the operation log is read and applied to the in-memory index, the in-memory index is written to the index file, and verification is carried out on the basis of that updated index file. In this way, the data to be verified can be loaded quickly when raw power grid data is verified in batches, which improves verification performance.
Optionally, the distributed file system can be HDFS (Hadoop Distributed File System), which has a sound multi-replica data storage mechanism and robust mechanisms for data node error detection and node failure recovery.
Optionally, after the in-memory index has been written to the index file, the operation log file can be deleted to free storage space and improve verification speed.
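The index file and operation log cycle described above resembles a conventional checkpoint-and-replay scheme, which the following sketch illustrates. It makes several assumptions: local files stand in for files in the distributed file system, the index is serialized as JSON, and each log line is a JSON object with hypothetical `op`, `key`, and `primary_key` fields. None of these formats come from the text.

```python
import json
import os


def load_index(index_path, log_path):
    """Read the index file into memory and replay the operation log on it."""
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                op = json.loads(line)
                if op["op"] == "put":
                    index.setdefault(op["key"], []).append(op["primary_key"])
                elif op["op"] == "delete":
                    index.pop(op["key"], None)
    return index


def checkpoint(index, index_path, log_path):
    """Write the merged in-memory index back, then drop the replayed log."""
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    os.replace(tmp_path, index_path)  # swap in the new index file in one step
    if os.path.exists(log_path):
        os.remove(log_path)
```

Writing to a temporary file before replacing the index file is a design choice of this sketch; it keeps the previous index readable if the write is interrupted.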
In one embodiment, the method for verifying power grid data quality further comprises the following step:
When an incremental power grid data record is detected, generating a first search index based on the check field of the incremental power grid data record and adding it to the first search index table, and generating a second search index based on the timestamp of the incremental power grid data record and adding it to the second search index table.
In this embodiment, when an incremental power grid data record is detected, the corresponding first search index can be added to the first search index table and the corresponding second search index to the second search index table. This keeps the index tables complete and allows all power grid data to be verified.
Optionally, when incremental power grid data is verified, because the timestamps of incremental data differ clearly from those of the original data, the incremental power grid data can be selected from the power grid data records by timestamp range and then verified.
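The index maintenance for incremental records can be sketched as a single write path that updates the record table and both indexes together. As before, dictionaries stand in for the tables in the distributed storage system, and the function names and the `ts` timestamp field are hypothetical.

```python
def on_incremental_record(primary_key, record, raw_records, first_index,
                          second_index, check_fields, timestamp_field="ts"):
    """Store a new record and register it in both index structures."""
    raw_records[primary_key] = record
    for field in check_fields:
        if field in record:
            first_index.setdefault((field, record[field]), []).append(primary_key)
    second_index.setdefault(record[timestamp_field], []).append(primary_key)


def keys_in_range(second_index, start, end):
    """Primary keys whose timestamp satisfies start <= ts < end."""
    return [pk for ts in sorted(second_index) if start <= ts < end
            for pk in second_index[ts]]
```

Selecting incremental data for verification then reduces to calling `keys_in_range` with a window that covers only the newly arrived timestamps.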
In one embodiment, the step of creating multiple parallel tasks comprises the following step:
Creating multiple parallel tasks in the MapReduce parallel computing framework; establishing instruction files in the distributed file system for all verification rules; reading the corresponding instruction file for each parallel task; and configuring, for each parallel task, the parameters and processing logic for executing the verification rule according to the corresponding instruction file.
In this embodiment, the parallel tasks can be created with the MapReduce parallel computing framework, in which all Map nodes can execute different verification rules concurrently. If a failure occurs during execution, the framework can automatically start a new task on another node to retry the failed verification rule, which effectively addresses load balancing and fault tolerance across the whole parallel process. The parameters and processing logic of the verification rules are stored in instruction files, which can be fetched from the distributed file system, so parallel tasks can be set up quickly on the basis of the instruction files.
Optionally, before the parallel tasks are executed, a configuration file can also be read. The configuration file specifies the verification type and the verification timestamp range, so the specific verification type and timestamp range can be determined at verification time.
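How an instruction file might configure a task can be sketched as follows. This example does not use the MapReduce API: a thread pool stands in for the Map tasks, and the JSON layout, the field names (`rule_name`, `logic_id`, `input_table`, `output_table`, `params`), and the two sample checks are all assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def check_equal(row, params):
    return row["comparison"] == row["reference"]


def check_within(row, params):
    return abs(row["comparison"] - row["reference"]) <= params["tolerance"]


# Hypothetical registry: logic identifier in the instruction file -> function.
LOGIC = {"equal": check_equal, "within": check_within}


def run_instruction(instruction_text, tables):
    """One parallel task: parse one instruction file and run its rules."""
    instruction = json.loads(instruction_text)
    results = []
    for rule in instruction["rules"]:
        logic = LOGIC[rule["logic_id"]]
        rows = tables[rule["input_table"]]
        failed = sorted(pk for pk, row in rows.items()
                        if not logic(row, rule.get("params", {})))
        results.append((rule["output_table"], rule["rule_name"], failed))
    return results


def run_all(instruction_texts, tables, workers=4):
    """Run one task per instruction file, then gather results per output table."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_task = list(pool.map(lambda text: run_instruction(text, tables),
                                 instruction_texts))
    output = {}
    for task_results in per_task:
        for output_table, rule_name, failed in task_results:
            output.setdefault(output_table, {})[rule_name] = failed
    return output
```

Keeping the rule parameters in data and the processing logic in a registry means that a new rule of an existing kind needs only a new instruction file, not new task code.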
In one embodiment, an instruction file corresponds to one or more verification rules.
In this embodiment, an instruction file can correspond to a single verification rule or to several. If it corresponds to one verification rule, the parallel task verifies against that rule; if it corresponds to several, the parallel task can verify against all of them in parallel, which improves the efficiency of processing verification rules.
Optionally, the several verification rules in one instruction file are verification rules of the same attribute type.
In one embodiment, each instruction file corresponds to one parallel task.
In this embodiment, each instruction file corresponds to one parallel task and is processed by that task, so the instruction files can be processed in parallel, which improves the efficiency of processing them.
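One way to arrive at instruction files that each hold rules of a single attribute type is to group the rules before writing the files. The sketch below assumes each rule carries a hypothetical `attribute_type` field and that file names are derived from it; neither detail comes from the text.

```python
from collections import defaultdict


def build_instruction_files(rules):
    """Group rules by attribute type; each group becomes one instruction file."""
    groups = defaultdict(list)
    for rule in rules:
        groups[rule["attribute_type"]].append(rule["rule_name"])
    return [
        {"file_name": f"instruction_{attribute_type}.json", "rules": names}
        for attribute_type, names in sorted(groups.items())
    ]
```

Since each instruction file is handled by one parallel task, the number of groups produced here also determines the number of tasks.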
Based on the above method for verifying power grid data quality, the present invention also provides a system for verifying power grid data quality. Embodiments of the system are described in detail below.
Fig. 2 is a schematic structural diagram of the system for verifying power grid data quality in one embodiment of the present invention. The system in this embodiment includes:
A data storage unit 210, configured to obtain raw power grid data records and store them in a distributed storage system, wherein the raw power grid data records include comparison data records to be verified and reference data records used for verification;
A task creation unit 220, configured to create multiple parallel tasks;
A search index unit 230, configured to obtain, in each parallel task, the check field of a target verification rule, and to search the first search index table of the raw power grid data records by the check field to obtain the first raw power grid data record corresponding to the check field;
An extraction and comparison unit 240, configured to extract the comparison data and reference data from the first raw power grid data record and to verify the extracted comparison data against the extracted reference data, wherein the first search index table is stored in the distributed storage system;
A result output unit 250, configured to output the verification results of the multiple parallel tasks.
In one embodiment, as shown in Fig. 3, the system for verifying power grid data quality further includes an index establishment unit 260, configured to establish the first search index table in the distributed storage system, wherein the primary keys of the first search index table are the check fields of the various verification rules, and the column values of the first search index table are the primary keys of the raw power grid data records.
In one embodiment, the search index unit 230 is further configured to obtain, in each parallel task, the timestamp range of a target verification rule, and to search the second search index table of the raw power grid data records by the timestamp range to obtain the second raw power grid data record corresponding to the timestamp range; the comparison data and reference data in the second raw power grid data record are extracted, and the extracted comparison data is verified against the extracted reference data, wherein the second search index table is stored in the distributed storage system.
In one embodiment, the index establishment unit 260 is further configured to establish the second search index table in the distributed storage system, wherein the primary keys of the second search index table are the timestamps of the various verification rules, and the column values of the second search index table are the primary keys of the raw power grid data records.
In one embodiment, as shown in Fig. 4, the system for verifying power grid data quality further includes a file index unit 270, configured to establish an index file in a distributed file system according to the first search index table and the second search index table, read the index file into memory, read the operation log file that the distributed file system keeps for the index file, apply the operation records in the operation log file to the in-memory index, write the in-memory index to the index file, load batches of raw power grid data records according to the index file to which the in-memory index has been written, and verify the comparison data against the reference data in each batch of raw power grid data records.
In one embodiment, as shown in Fig. 5, the system for verifying power grid data quality further includes an index adjustment unit 280, configured to, when an incremental power grid data record is detected, generate a first search index based on the check field of the incremental power grid data record and add it to the first search index table, and generate a second search index based on the timestamp of the incremental power grid data record and add it to the second search index table.
In one embodiment, the task creation unit 220 creates multiple parallel tasks in the MapReduce parallel computing framework, reads for each parallel task the instruction file of the verification rules established in the distributed file system, and configures for each parallel task the parameters and processing logic for executing the verification rule according to the corresponding instruction file.
In one embodiment, an instruction file corresponds to one or more verification rules.
In one embodiment, each instruction file corresponds to one parallel task.
The system for verifying power grid data quality of the present invention corresponds one to one with the method for verifying power grid data quality of the present invention; the technical features and advantages described in the embodiments of the method apply equally to the embodiments of the system.
Based on the above method for verifying power grid data quality, embodiments of the present invention also provide a readable storage medium and a verification device. The readable storage medium stores an executable program which, when executed by a processor, implements the steps of the above method for verifying power grid data quality. The verification device includes a memory, a processor, and an executable program stored in the memory and runnable on the processor; the processor implements the steps of the above method when executing the program.
In a specific embodiment, the method for verifying power grid data quality is based on distributed storage and parallel processing. It addresses the problems of existing methods based on relational database systems: high computation latency, difficulty of scaling, and low cost-effectiveness.
The main ideas of the technical solution adopted by the present invention are:
All raw data records are stored using a distributed storage method;
Check fields are indexed using a non-primary-key indexing method. During verification, the index table is searched by the check field that the verification rule involves to obtain the primary key of the corresponding raw data record; the raw data record table is then searched by that primary key to obtain the raw data record, and the comparison field is extracted and compared;
A timestamp index is built on the raw data records using HBase. For incremental data quality verification, or for time-window-based data quality verification at fine time granularity, the raw data record table is queried by timestamp range, and verification is performed once the range of data to verify has been determined;
The secondary index file and operation log file of the data records are stored in HDFS, so that the data to be verified can be loaded quickly during full raw data quality verification, which improves verification performance. During full raw data quality verification, the secondary index file is read into memory, the operation log is read and applied to the in-memory index, and verification is then performed on the basis of the in-memory index;
The verification rules are executed quickly using MapReduce-based parallelization.
Further, the distribution storage method is the distribution storage method based on HBase, can support magnanimity verification data
Storage, and extension can be facilitated according to demand.Further, the verification rule is the parallelization verification based on MapReduce
Rule.According to verification data amount and the convenient extension of regular quantity can be verified, response performance is controllable, cost-effective.
Further, check field is indexed using the method based on non-master key index, to realize based on non-master
The verification rule query processing of key field.
Further, check field is original data record major key or any attribute column;Comparing field is and the school
Test a certain field corresponding to field, can be check field in itself or other fields.
Further, the creation timestamps of original data records are indexed. During incremental data quality verification, or fine-time-granularity data quality verification based on a time window, the timestamp index table is queried by timestamp to obtain the primary keys of the original data records, and the original data record table is then queried to obtain the original data records for verification.
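A time-window query of this kind can be sketched with a sorted list standing in for the timestamp index table. The sample timestamps, record contents, and the function name `keys_in_window` are illustrative assumptions; a real deployment would issue a range scan against the HBase index table instead.

```python
from bisect import bisect_left, bisect_right

# Timestamp index: (creation timestamp, primary key) pairs kept sorted by timestamp,
# standing in for an index table whose row key is the timestamp.
ts_index = sorted([(100, "r1"), (105, "r2"), (110, "r3"), (120, "r4")])
records = {"r1": {"v": 1}, "r2": {"v": 2}, "r3": {"v": 3}, "r4": {"v": 4}}

def keys_in_window(index, start, end):
    """Return the primary keys whose timestamp lies in the closed range [start, end]."""
    timestamps = [ts for ts, _ in index]
    lo = bisect_left(timestamps, start)   # first entry with ts >= start
    hi = bisect_right(timestamps, end)    # one past the last entry with ts <= end
    return [pk for _, pk in index[lo:hi]]

# Step 1: timestamp range -> primary keys. Step 2: primary keys -> original records.
window_keys = keys_in_window(ts_index, 105, 110)
window_records = [records[pk] for pk in window_keys]
```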
Further, an HDFS secondary index file is established for the full volume of original data, and an operation log is established for incremental data. When the full volume of historical data is verified, the HDFS secondary index file is read into memory, the operation log is applied to the in-memory index, and verification is then performed on the in-memory index.
Further, instruction files are established for all verification rules. An instruction file contains all the parameters needed to execute a verification rule, including the rule name, the rule execution logic identifier, the input data table, and the output data table. Each Map task reads its corresponding instruction file, obtains the parameters needed to execute the corresponding verification rule, and calls the corresponding processing logic to perform the verification.
Further, each instruction file corresponds to one or more verification rules. The execution parameters of the verification rules are written in the instruction file and include the verification rule name, the rule execution logic identifier, the input data table, the output data table, and similar parameters.
Further, each instruction file is processed by one Map task.
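One way a Map task might interpret an instruction file is sketched below. The patent lists the parameters (rule name, execution logic identifier, input table, output table) but does not specify a file format, so the JSON encoding, the key names, the two handler functions, and the table names are all assumptions made for illustration.

```python
import json

# Hypothetical processing logic, selected by the rule execution logic identifier.
def check_equal(compare_value, reference_value):
    return compare_value == reference_value

def check_within_5_percent(compare_value, reference_value):
    return abs(compare_value - reference_value) <= 0.05 * abs(reference_value)

HANDLERS = {"equal": check_equal, "within_5_percent": check_within_5_percent}

# Contents of one instruction file (the JSON layout is an assumption).
instruction_text = json.dumps({
    "rule_name": "voltage_matches_rating",
    "logic_id": "within_5_percent",
    "input_table": "meter_readings",
    "output_table": "check_results",
})

def run_map_task(text, rows):
    """Parse one instruction file and apply its rule to (pk, compare, reference) rows."""
    params = json.loads(text)
    handler = HANDLERS[params["logic_id"]]
    return [(params["rule_name"], pk, handler(c, r)) for pk, c, r in rows]

# In a real job the rows would be read from params["input_table"].
results = run_map_task(instruction_text, [("r1", 225, 220), ("r2", 260, 220)])
```

Keeping the rule parameters in a file and the logic in a lookup table means a new rule that reuses existing logic needs only a new instruction file, not new code.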
The solution of the present invention verifies power grid data quality efficiently and scalably. First, the power grid data are stored in a distributed manner, which gives the system good scalability. Second, auxiliary query indexes are established for the fields involved in the verification rules, which supports efficient query processing when the rules are executed. Third, a MapReduce-based parallel processing method for verification rules is designed so that every verification rule can be processed in parallel, effectively improving system response performance.
HBase is a distributed storage system in the Hadoop ecosystem. The distributed file system HDFS (Hadoop Distributed File System) lacks structured and semi-structured data storage and random read/write access; to address this, HBase provides, on top of HDFS, a distributed storage layer that solves the problem of storing and accessing large-scale structured and semi-structured data. HBase manages large data tables using a column-based storage model; it can store and manage billions of data records or more, and each record can contain more than a million data columns. HBase aims to provide random, real-time read/write data access, together with high scalability, high availability, fault tolerance, load balancing, and real-time data query capability.
The underlying data of HBase are stored in HDFS, so HBase depends entirely on the underlying HDFS. Because HDFS employs a multi-replica data storage mechanism together with robust data node failure detection and node failure recovery mechanisms, HBase, being built on HDFS, naturally inherits the high reliability and fault tolerance of HDFS data storage.
Hadoop MapReduce provides a large but well-designed software framework for distributed data storage and parallel computing. It automatically manages the storage of distributed mass data, automatically partitions the data to be computed and schedules computing tasks, and automatically distributes and executes subtasks on cluster nodes and collects the results. Many low-level details of parallel computing, such as distributed data storage, data communication, and fault tolerance, are handled by the system, which greatly reduces the burden on software developers.
As shown in Fig. 6, the present invention uses the distributed data storage and management system HBase to store data. Original data records are stored in HBase so that they can be accessed quickly by primary key. A query index is established for the check fields involved in the verification rules so that records can be accessed quickly by check-field value. A timestamp-based secondary index is established for the original data records to support data quality verification based on a time window. For the full volume of historically accumulated data, an index file is also established and stored on the distributed file system HDFS so that it can be loaded quickly during batch data quality verification, avoiding a full table scan of HBase. For incremental data flowing in in real time, an operation log is established, which solves the problem of maintaining the index file when data records are added, deleted, or modified; the operation log and the index file are merged periodically to reduce the merging overhead during batch data quality verification. Verification rules are executed in parallel, with one parallel task handling one or more verification rules.
The flow for storing and indexing batch data comprises the following steps:
(1) The reference data table and the comparison data table to be verified, both in CSV format, are stored in HBase. The primary key of the original data record serves as the primary key of the HBase table, and the non-primary-key attributes of the original data record serve as the columns of the HBase table, with different attributes belonging to different column families; the column-oriented storage of HBase (data of the same column family are stored together) improves response performance when a particular column is queried;
(2) The query index table based on the check fields of the verification rules is stored in HBase. The check field serves as the primary key of the HBase query index table, and the primary keys of the original data records serve as the column names of the query index table, with all primary keys belonging to the same column family; this data layout makes it convenient to add, delete, modify, and query index entries;
(3) The query index table based on data record timestamps is stored in HBase. The data record timestamp serves as the primary key of the HBase query index table, and the primary key of the original data record is stored as the column value of the query index table;
(4) When the query index table based on the check fields of the verification rules is stored in HBase, it is also written to the index file on HDFS.
The flow for storing and indexing incremental data comprises the following steps:
(1) The incremental data records are inserted into the original data record table in HBase;
(2) The query index entries of the incremental data records, based on the check fields of the verification rules, are inserted into the HBase query index;
(3) The query index entries of the incremental data records, based on the data record timestamps, are inserted into the HBase secondary index;
(4) The operation log entries of the incremental data records are appended to the operation log file on HDFS.
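The four incremental steps above can be sketched as a single insert routine. Dictionaries and a list stand in for the HBase tables and the HDFS log file; the column family name `cf`, the log entry layout, and the function name `insert_increment` are assumptions for illustration. The timestamp index here keeps a list of keys per timestamp so that two records created at the same instant do not overwrite each other, which is a choice made for this sketch rather than something the patent specifies.

```python
import json

record_table = {}  # primary key -> {column family -> {column -> value}}
check_index = {}   # check-field value -> {primary key: ""} (keys stored as column names)
ts_index = {}      # timestamp -> [primary key, ...]
op_log = []        # stand-in for the append-only operation log file on HDFS

def insert_increment(pk, ts, check_value, columns):
    """Apply steps (1) to (4) for one incremental data record."""
    record_table[pk] = {"cf": dict(columns)}            # (1) original record table
    check_index.setdefault(check_value, {})[pk] = ""    # (2) check-field query index
    ts_index.setdefault(ts, []).append(pk)              # (3) timestamp index
    op_log.append(json.dumps(                           # (4) operation log entry
        {"op": "add", "value": check_value, "pk": pk}))

insert_increment("r1", 1000, "M-01", {"voltage": 220})
insert_increment("r2", 1001, "M-01", {"voltage": 221})
```

Logging the index change instead of rewriting the HDFS index file keeps each insert cheap; the file is brought up to date later when the log is merged.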
The flow for merging the operation log into the index file comprises the following steps:
(1) The index file on HDFS is read into memory;
(2) The operation log file on HDFS is read, and its operations are applied one by one to the in-memory index;
(3) The in-memory index is written back to the index file on HDFS;
(4) The operation log file on HDFS is deleted.
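The merge steps above can be sketched with local files standing in for HDFS. The JSON index layout, the one-JSON-object-per-line log format, the `add`/`delete` operation names, and the function name are assumptions for illustration; a modification is modeled here as a delete followed by an add.

```python
import json
import os
import tempfile

def merge_log_into_index(index_path, log_path):
    """Replay the operation log onto the index file, then remove the log."""
    with open(index_path) as f:                 # (1) read the index file into memory
        index = json.load(f)
    with open(log_path) as f:                   # (2) apply the operations one by one
        for line in f:
            op = json.loads(line)
            keys = index.setdefault(op["value"], [])
            if op["op"] == "add" and op["pk"] not in keys:
                keys.append(op["pk"])
            elif op["op"] == "delete" and op["pk"] in keys:
                keys.remove(op["pk"])
            if not keys:                        # drop index entries left empty
                del index[op["value"]]
    with open(index_path, "w") as f:            # (3) write the index back
        json.dump(index, f)
    os.remove(log_path)                         # (4) delete the operation log file
    return index

# Small demonstration in a temporary directory.
workdir = tempfile.mkdtemp()
index_path = os.path.join(workdir, "index.json")
log_path = os.path.join(workdir, "oplog.jsonl")
with open(index_path, "w") as f:
    json.dump({"M-01": ["r1"], "M-02": ["r2"]}, f)
with open(log_path, "w") as f:
    f.write(json.dumps({"op": "add", "value": "M-01", "pk": "r3"}) + "\n")
    f.write(json.dumps({"op": "delete", "value": "M-02", "pk": "r2"}) + "\n")
merged = merge_log_into_index(index_path, log_path)
```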
The parallelized verification rule processing flow is:
(1) The verification type and the verification timestamp range are written to the configuration file;
(2) A MapReduce job is started to begin data quality verification;
(3) Each Map task reads one instruction file and obtains parameters such as the rule name, the rule execution logic identifier, the input data table, and the output data table; it also reads the verification type and the verification timestamp range from the configuration file;
(4) For batch verification, verification follows the batch data single-rule verification flow;
(5) For verification based on a time window, verification follows the incremental data single-rule verification flow using the timestamp range.
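The flow above can be sketched with a thread pool standing in for the Map tasks of a MapReduce job. This is only a local simulation under stated assumptions: the configuration keys `check_type` and `ts_range`, the two rule names, the sample records, and the use of `ThreadPoolExecutor` are all illustrative and are not how Hadoop schedules tasks.

```python
from concurrent.futures import ThreadPoolExecutor

# (1) Configuration shared by all tasks: verification type and timestamp range.
config = {"check_type": "window", "ts_range": (100, 110)}

records = [
    {"pk": "r1", "ts": 95, "value": -1},
    {"pk": "r2", "ts": 100, "value": 220},
    {"pk": "r3", "ts": 105, "value": 600},
    {"pk": "r4", "ts": 110, "value": -5},
]

# One instruction per rule; each is handled by one simulated Map task.
instruction_files = [
    {"rule_name": "non_negative", "logic_id": "non_negative"},
    {"rule_name": "below_limit", "logic_id": "below_limit"},
]

LOGIC = {
    "non_negative": lambda v: v >= 0,
    "below_limit": lambda v: v < 500,
}

def select_rows(cfg):
    """Steps (4)/(5): all rows for batch verification, a time window otherwise."""
    if cfg["check_type"] == "batch":
        return records
    start, end = cfg["ts_range"]
    return [r for r in records if start <= r["ts"] <= end]

def map_task(instruction):
    """Step (3): read the rule parameters, then verify the selected rows."""
    check = LOGIC[instruction["logic_id"]]
    failed = [r["pk"] for r in select_rows(config) if not check(r["value"])]
    return instruction["rule_name"], failed

# (2) Start the job: the rules run concurrently, one task per instruction file.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(map_task, instruction_files))
```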
The batch data single-rule verification flow is:
(1) The query index table on HDFS is read into memory, the operation log is read and applied to the in-memory query index table, and the operation log file is deleted;
(2) The in-memory query index table is traversed and rule verification is performed.
The incremental data single-rule verification flow is:
(1) According to the start timestamp and the end timestamp, the timestamp index table is queried to obtain all record IDs within the incremental time window; the original data record table is then queried to obtain the corresponding set of check fields;
(2) According to the field values in the check-field set, the secondary index table is queried to obtain the comparison field values, and verification is performed.
The incremental data single-rule verification flow above is also applicable to the verification of original data.
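The two steps of the incremental single-rule flow can be sketched end to end as follows. The tables are plain dictionaries, and the field names `meter_id` and `rated_voltage`, the sample values, and the equality rule are assumptions chosen for illustration; the patent does not fix a particular rule or schema.

```python
# Timestamp index table: timestamp -> record IDs created at that time.
ts_index = {100: ["r1"], 105: ["r2"], 130: ["r3"]}

# Original data record table holding the comparison data to be verified.
compare_table = {
    "r1": {"meter_id": "M-01", "rated_voltage": 220},
    "r2": {"meter_id": "M-02", "rated_voltage": 110},
    "r3": {"meter_id": "M-01", "rated_voltage": 220},
}

# Secondary index over the reference data: check-field value -> comparison-field value.
reference_index = {"M-01": 220, "M-02": 380}

def verify_window(start, end):
    """Verify every record created in [start, end] against the reference data."""
    # Step (1): timestamp range -> record IDs -> check-field values.
    pks = [pk for ts, keys in sorted(ts_index.items())
           if start <= ts <= end for pk in keys]
    results = {}
    for pk in pks:
        row = compare_table[pk]
        # Step (2): check-field value -> reference value, then compare.
        expected = reference_index.get(row["meter_id"])
        results[pk] = row["rated_voltage"] == expected
    return results

window_results = verify_window(100, 110)
```

Only records inside the window are touched, so the cost of an incremental check grows with the size of the window and not with the size of the full table.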
As shown in Fig. 7, the embodiment of the distributed storage and indexing method of the present invention is as follows. To process large numbers of data records and verification rules quickly, in addition to storing the original data tables in HBase, dedicated fast data index tables must be designed for the fields involved in the verification rules and stored in HBase. For example, in original data tables 1 and 2, the primary key (the rowkey field) is the ID of each record. If field A of original data table 1 and field B of original data table 2 need to be verified, index tables for field A and field B must be established respectively so that they can be looked up quickly during verification. To support incremental data quality verification and fine-time-granularity data quality verification based on a time window, a timestamp query index is established for the original data record table so that the range of data to undergo quality verification can be defined by a timestamp range. As shown in Fig. 8, to improve quality verification performance for the full volume of historical data, auxiliary HDFS index files and operation logs are established for the data record tables so that verification data can be loaded quickly into memory for verification during full-volume data verification.
The embodiment of parallelized verification rule processing in the present invention is as follows. To process large numbers of data records and verification rules quickly, a MapReduce-based parallel execution mechanism is adopted. As shown in Fig. 9, the ID, parameters, and so on of each verification rule are first written into separate HDFS files (called instruction files), and the MapReduce job contains implementations of the processing modules for all of these verification rules. Under the default operating mechanism of Hadoop MapReduce, each Map task reads only one instruction file for processing, and the choice of the specific processing module is determined by the instruction file that the task reads.
In this way, all Map nodes in the cluster can execute different verification rules concurrently. If a failure occurs during execution, Hadoop MapReduce automatically starts new Map tasks on other nodes to retry these verification rules. Load balancing, fault tolerance, and similar concerns of the whole parallel process are all handled by the Hadoop MapReduce framework.
The present invention implements a prototype system based on existing open-source software. Distributed storage and indexing use HBase, and parallelized verification rule processing uses HDFS and MapReduce; these three pieces of software are not themselves part of the present invention. The prototype system was compared in testing with an existing relational data management system using real power grid business data and verification rules. The prototype system outperforms the conventional relational data management system in response performance and scalability, demonstrating the effectiveness of the power grid data quality detection method of the present invention, which is based on distributed storage and parallel processing.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments is described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a readable storage medium and, when executed, performs the steps of the above method. The storage medium includes ROM/RAM, magnetic disks, optical discs, and the like.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be determined by the appended claims.
Claims (10)
1. A method for verifying power grid data, characterized by comprising the following steps:
Power grid original data records are obtained and stored in a distributed storage system; wherein the power grid original data records include comparison data records to be verified and reference data records used for verification;
Multiple parallel tasks are created, and in each parallel task the following operations are performed: the check field of a target verification rule is obtained; a lookup is performed, according to the check field, in the first query index table of the power grid original data records to obtain the first power grid original data record corresponding to the check field; the comparison data and the reference data in the first power grid original data record are extracted; and the extracted comparison data are verified according to the extracted reference data; wherein the first query index table is stored in the distributed storage system;
The verification results of the multiple parallel tasks are output.
2. The method for verifying power grid data quality according to claim 1, characterized by further comprising the following steps:
The first query index table is established in the distributed storage system; the primary key of the first query index table is the check field of each verification rule, and the column value of the first query index table is the primary key of the power grid original data record.
3. The method for verifying power grid data quality according to claim 1, characterized by further comprising the following steps:
In each parallel task, the timestamp range of the target verification rule is obtained; a lookup is performed, according to the timestamp range, in the second query index table of the power grid original data records to obtain the second power grid original data record corresponding to the timestamp range; the comparison data and the reference data in the second power grid original data record are extracted; and the extracted comparison data are verified according to the extracted reference data; wherein the second query index table is stored in the distributed storage system.
4. The method for verifying power grid data quality according to claim 3, characterized by further comprising the following steps:
The second query index table is established in the distributed storage system; the primary key of the second query index table is the timestamp of each verification rule, and the column value of the second query index table is the primary key of the power grid original data record.
5. The method for verifying power grid data quality according to claim 3, characterized by further comprising the following steps:
An index file is established in a distributed file system according to the first query index table and the second query index table; the index file is read into memory; the operation log file of the distributed file system for the index file is read; the operation records in the operation log file are applied to the in-memory index; the in-memory index is written into the index file; batches of power grid original data records are loaded according to the index file into which the in-memory index has been written; and the comparison data are verified according to the reference data in each batch of power grid original data records.
6. The method for verifying power grid data quality according to claim 3, characterized by further comprising the following steps:
When a power grid incremental data record is detected, a first query index entry of the power grid incremental data record, based on the check field, is generated and added to the first query index table, and a second query index entry of the power grid incremental data record, based on the timestamp, is generated and added to the second query index table.
7. The method for verifying power grid data quality according to claim 3, characterized in that the step of creating multiple parallel tasks comprises the following steps:
Multiple parallel tasks are created in the MapReduce parallel computing framework; instruction files are established in the distributed file system for all verification rules; the corresponding instruction file is read for each parallel task; and the parameters and processing logic for executing the verification rule are configured for each parallel task according to the corresponding instruction file.
8. The method for verifying power grid data quality according to claim 7, characterized in that each instruction file corresponds to one or more verification rules.
9. The method for verifying power grid data quality according to claim 7, characterized in that each instruction file corresponds to one parallel task.
10. A system for verifying power grid data, characterized by comprising:
A data storage unit, configured to obtain power grid original data records and store them in a distributed storage system; wherein the power grid original data records include comparison data records to be verified and reference data records used for verification;
A task creation unit, configured to create multiple parallel tasks;
A query index unit, configured to obtain, in each parallel task, the check field of a target verification rule, and to perform a lookup, according to the check field, in the first query index table of the power grid original data records to obtain the first power grid original data record corresponding to the check field;
An extraction and comparison unit, configured to extract the comparison data and the reference data in the first power grid original data record and to verify the extracted comparison data according to the extracted reference data; wherein the first query index table is stored in the distributed storage system;
A result output unit, configured to output the verification results of the multiple parallel tasks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710876201.1A CN107679146A (en) | 2017-09-25 | 2017-09-25 | The method of calibration and system of electric network data quality |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710876201.1A CN107679146A (en) | 2017-09-25 | 2017-09-25 | The method of calibration and system of electric network data quality |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107679146A true CN107679146A (en) | 2018-02-09 |
Family
ID=61138126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710876201.1A Pending CN107679146A (en) | 2017-09-25 | 2017-09-25 | The method of calibration and system of electric network data quality |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679146A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595664A (en) * | 2018-04-28 | 2018-09-28 | 尚谷科技(天津)有限公司 | A kind of agricultural data monitoring method under hadoop environment |
CN108762933A (en) * | 2018-05-31 | 2018-11-06 | 成都四方伟业软件股份有限公司 | Quality of data method of calibration and device |
CN109462517A (en) * | 2018-10-24 | 2019-03-12 | 云南电网有限责任公司信息中心 | A kind of method, system and the equipment of the data monitoring towards digital electric network business |
CN109460995A (en) * | 2018-09-26 | 2019-03-12 | 平安国际融资租赁有限公司 | Financial accreditation method, apparatus, computer equipment and storage medium |
CN109635300A (en) * | 2018-12-14 | 2019-04-16 | 泰康保险集团股份有限公司 | Data verification method and device |
CN110704404A (en) * | 2019-08-29 | 2020-01-17 | 苏宁云计算有限公司 | Data quality checking method, device and system |
CN111209597A (en) * | 2018-11-22 | 2020-05-29 | 迈普通信技术股份有限公司 | Data verification method and application system |
CN112540987A (en) * | 2020-12-08 | 2021-03-23 | 湖州中朔信息技术有限公司 | Big data management system of distribution and utilization electricity based on data mart |
CN112667618A (en) * | 2020-12-30 | 2021-04-16 | 湖南长城医疗科技有限公司 | Public area sanitation platform quality control system and method |
CN112799945A (en) * | 2021-01-29 | 2021-05-14 | 中国工商银行股份有限公司 | Batch file verification method and device |
CN112860769A (en) * | 2021-03-10 | 2021-05-28 | 广东电网有限责任公司 | Energy planning data management system |
CN112910086A (en) * | 2021-01-18 | 2021-06-04 | 国网山东省电力公司青岛供电公司 | Intelligent substation data verification method and system |
CN115099713A (en) * | 2022-08-01 | 2022-09-23 | 武汉胜天地消防工程有限公司 | Smart power grid operation log collection and analysis management system based on big data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102024046A (en) * | 2010-12-14 | 2011-04-20 | 成都市华为赛门铁克科技有限公司 | Data repeatability checking method and device as well as system |
CN102799746A (en) * | 2012-05-07 | 2012-11-28 | 山东电力集团公司青岛供电公司 | Power grid information checking method and system, and power grid planning auxiliary system |
CN104391903A (en) * | 2014-11-14 | 2015-03-04 | 广州科腾信息技术有限公司 | Distributed storage and parallel calculation-based power grid data quality detection method |
US20160314026A1 (en) * | 2015-04-27 | 2016-10-27 | Microsoft Technology Licensing, Llc | Establishing causality order of computer trace records |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595664B (en) * | 2018-04-28 | 2022-05-31 | 上海左岸芯慧电子科技有限公司 | Agricultural data monitoring method in hadoop environment |
CN108595664A (en) * | 2018-04-28 | 2018-09-28 | 尚谷科技(天津)有限公司 | A kind of agricultural data monitoring method under hadoop environment |
CN108762933A (en) * | 2018-05-31 | 2018-11-06 | 成都四方伟业软件股份有限公司 | Quality of data method of calibration and device |
CN109460995A (en) * | 2018-09-26 | 2019-03-12 | 平安国际融资租赁有限公司 | Financial accreditation method, apparatus, computer equipment and storage medium |
CN109460995B (en) * | 2018-09-26 | 2024-02-06 | 平安国际融资租赁有限公司 | Financial certification method, device, computer equipment and storage medium |
CN109462517A (en) * | 2018-10-24 | 2019-03-12 | 云南电网有限责任公司信息中心 | A kind of method, system and the equipment of the data monitoring towards digital electric network business |
CN111209597A (en) * | 2018-11-22 | 2020-05-29 | 迈普通信技术股份有限公司 | Data verification method and application system |
CN109635300A (en) * | 2018-12-14 | 2019-04-16 | 泰康保险集团股份有限公司 | Data verification method and device |
CN109635300B (en) * | 2018-12-14 | 2023-12-19 | 泰康保险集团股份有限公司 | Data verification method and device |
CN110704404B (en) * | 2019-08-29 | 2023-04-28 | 苏宁云计算有限公司 | Data quality verification method, device and system |
CN110704404A (en) * | 2019-08-29 | 2020-01-17 | 苏宁云计算有限公司 | Data quality checking method, device and system |
CN112540987A (en) * | 2020-12-08 | 2021-03-23 | 湖州中朔信息技术有限公司 | Big data management system of distribution and utilization electricity based on data mart |
CN112667618A (en) * | 2020-12-30 | 2021-04-16 | 湖南长城医疗科技有限公司 | Public area sanitation platform quality control system and method |
CN112667618B (en) * | 2020-12-30 | 2023-06-06 | 湖南长城医疗科技有限公司 | Public area sanitary platform quality control system and method |
CN112910086A (en) * | 2021-01-18 | 2021-06-04 | 国网山东省电力公司青岛供电公司 | Intelligent substation data verification method and system |
CN112799945A (en) * | 2021-01-29 | 2021-05-14 | 中国工商银行股份有限公司 | Batch file verification method and device |
CN112799945B (en) * | 2021-01-29 | 2024-03-15 | 中国工商银行股份有限公司 | Batch file verification method and device |
CN112860769A (en) * | 2021-03-10 | 2021-05-28 | 广东电网有限责任公司 | Energy planning data management system |
CN115099713A (en) * | 2022-08-01 | 2022-09-23 | 武汉胜天地消防工程有限公司 | Smart power grid operation log collection and analysis management system based on big data |
CN115099713B (en) * | 2022-08-01 | 2023-04-07 | 河南蓝通信息技术有限公司 | Smart power grid operation log acquisition and analysis management system based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107679146A (en) | The method of calibration and system of electric network data quality | |
CN108255712B (en) | Test system and test method of data system | |
CN102968374B (en) | A kind of data warehouse method of testing | |
CN104866426A (en) | Software test integrated control method and system | |
CN104866580A (en) | Method for quickly detecting impact caused by database modification to current service | |
CN104391903A (en) | Distributed storage and parallel calculation-based power grid data quality detection method | |
CN104036029B (en) | Large data consistency control methods and system | |
US10331657B1 (en) | Contention analysis for journal-based databases | |
CN104252481A (en) | Dynamic check method and device for consistency of main and salve databases | |
CN104239377A (en) | Platform-crossing data retrieval method and device | |
CN107491487A (en) | A kind of full-text database framework and bitmap index establishment, data query method, server and medium | |
CN106682036A (en) | Data exchange system and exchange method thereof | |
CN102236672A (en) | Method and device for importing data | |
CN110891000B (en) | GPU bandwidth performance detection method, system and related device | |
CN111240968A (en) | Automatic test management method and system | |
CN108664388A (en) | Dynamic field data return to test system, method, electronic equipment and the readable storage medium storing program for executing of interface | |
CN114600094A (en) | Generating hash trees for database architectures | |
CN104778179A (en) | Data migration test method and system | |
CN108519856A (en) | Based on the data block copy laying method under isomery Hadoop cluster environment | |
CN107122238A (en) | Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame | |
CN111026709B (en) | Data processing method and device based on cluster access | |
CN105868956A (en) | Data processing method and device | |
CN112948473A (en) | Data processing method, device and system of data warehouse and storage medium | |
CN105335459B (en) | Consolidated accounts data pick-up method based on XBRL intelligence reporting platform | |
CN115329011A (en) | Data model construction method, data query method, data model construction device and data query device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180209 |