CN104820720A - Data quality detecting method and device - Google Patents
Data quality detecting method and device
- Publication number
- CN104820720A (application CN201510272664.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- quality
- rule
- quality testing
- testing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Testing And Monitoring For Control Systems (AREA)
Abstract
The invention discloses a data quality detection method and device. The method first extracts data from a source database, then performs quality detection on the extracted data according to preset quality rules, and finally integrates the data that passes the quality detection. The method and device improve data extraction efficiency and data quality, and make the quality detection rules easy to extend.
Description
Technical field
The present invention relates to data warehouse technology, and in particular to a data quality detection method and device.
Background
A data warehouse (DW or DWH) is a strategic collection of all types of data that supports decision-making at every level of an enterprise. It is a single data store created for analytical reporting and decision support. It serves enterprises that need business intelligence by guiding business process improvement and by monitoring time, cost, quality, and control.
Extract-Transform-Load (ETL) is a data processing procedure in which data is extracted from a source, transformed, and loaded into a destination. ETL is commonly used in data warehouse technology. Extraction pulls the source data into the data warehouse; transformation means that developers convert the extracted data into the target data structure according to business needs and aggregate it; loading writes the transformed and aggregated data into the target data warehouse.
With the widespread use of big data, data has become one of an organization's most valuable assets. The quality of an enterprise's data is directly linked to its business performance: high-quality data helps a company stay competitive and remain resilient during periods of economic turmoil. With pervasive data quality, an enterprise can trust at any time that all of its data meets all of its needs.
At present, quality detection in a data warehouse is tightly coupled to the extraction stage of the ETL process: data quality is checked while the data is being extracted, that is, the quality detection is mixed into the script code that performs the extraction. Because existing quality detection schemes are so tightly coupled to extraction, they suffer from the following problems.
1. The success or failure of the data quality check strongly affects the speed of data extraction. If the check fails, it must be rerun; because the quality check is bundled with the extraction, the extraction step must be rerun as well. A failed quality check therefore slows data extraction as a whole.
2. Data quality is low. Because the quality check is bundled with the extraction, the corresponding SQL statements consider fewer fields in order to keep extraction efficient. The constraints in those statements are therefore very loose, so the extracted data may not satisfy strict technical and business logic rules, which lowers the quality of the extracted data.
3. Data quality detection rules are hard to extend. Because the quality check is bundled with the extraction, updating a rule also requires changing the script code of the extraction stage. A rule update therefore involves a large amount of script revision, which hinders maintenance and extension of the rules.
In short, existing data quality detection methods reduce data extraction efficiency, lower data quality, and make the quality detection rules hard to maintain.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a data quality detection method and device that improve data extraction efficiency and data quality and make the data quality detection rules easy to extend.
To achieve this purpose, the technical solution proposed by the present invention is as follows.
A data quality detection method, comprising:
a. extracting data from a source database;
b. performing quality detection on the extracted data according to preset quality rules;
c. integrating the data that passes the quality detection.
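The three steps can be sketched as three independent functions. This is only a minimal illustration, assuming an in-memory SQLite database; the table names `src`, `staging`, and `integrated`, the function names, and the single non-null rule are hypothetical and do not come from the patent text.

```python
import sqlite3

def extract(conn):
    # Step a: copy the source rows unchanged into a staging table.
    conn.execute("CREATE TABLE staging AS SELECT * FROM src")

def detect(conn, rules):
    # Step b: each rule is a query that counts violating rows (an assumption).
    # Returns the names of the rules that found at least one violation.
    return [name for name, sql in rules if conn.execute(sql).fetchone()[0] > 0]

def integrate(conn):
    # Step c: integrate the staged data; called only when detection passed.
    conn.execute("CREATE TABLE integrated AS SELECT * FROM staging")

def run(conn, rules):
    extract(conn)
    failed = detect(conn, rules)
    if not failed:
        integrate(conn)
    return failed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(5, 1), (7, 2)])

rules = [("a_not_null", "SELECT COUNT(1) FROM staging WHERE a IS NULL")]
failed_rules = run(conn, rules)
integrated_rows = conn.execute("SELECT a, b FROM integrated ORDER BY a").fetchall()
```

Because `extract` contains no quality conditions, a failed rule leaves the staged data in place and only `detect` would need to be rerun.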
A data quality detection device, comprising:
a first extraction unit, configured to extract data from a source database;
a quality inspection unit, configured to perform quality detection on the extracted data according to preset quality rules;
a second extraction unit, configured to integrate the data that passes the quality detection.
In summary, the data quality detection method and device proposed by the present invention first extract the data on its own and then perform quality detection on the extraction result. The script code for quality detection is therefore independent of the script code for extraction, which avoids the problems of existing quality detection schemes, improves data extraction efficiency and data quality, and makes the data quality detection rules easy to extend.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method of an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the device of an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and specific embodiments.
The core idea of the present invention is to separate data extraction from data integration: the data is first extracted on its own, and quality detection is performed on the extraction result before the data is integrated. The extraction script code is then independent of the quality detection script code, so the problems that arise when the two are bundled no longer occur. The success or failure of quality detection no longer affects the overall progress of data extraction, the accuracy of quality detection improves, data quality is ensured, and the data quality detection rules become easy to extend.
Fig. 1 is a schematic flowchart of the data quality detection method of an embodiment of the present invention. As shown in Fig. 1, the embodiment mainly comprises the following steps.
Step 101: extract data from the source database.
This step extracts data from the source database on its own, so the extraction script code is independent of the quality detection script code, which avoids the problems caused by bundling the two.
In practice, any existing extraction method may be used to extract data from the source database. Preferably, to keep the extracted data consistent with the source database data, the extraction is performed one-to-one. Specific one-to-one extraction methods are known to those skilled in the art and are not described here.
Step 102: perform quality detection on the extracted data according to preset quality rules.
This step performs quality detection on the extracted data. Because the quality detection script code is independent of the extraction script code of step 101, quality detection does not have to take extraction efficiency into account. The constraints in the quality detection SQL statements can therefore cover more fields and match the actual quality rules, which ensures data quality.
In practice, the quality rules may be set by those skilled in the art according to the needs of the application, and may include technical rules and business logic rules. To make the quality detection script code more maintainable, that is, to make the quality rules easy to maintain and extend, the rules are preferably given a logical order: quality detection applies the technical rules first and the business logic rules second.
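The ordering principle can be sketched as follows. This is an illustrative assumption rather than the patent's implementation: the `staging` table, the two rule names, the choice to express each rule as a count of violating rows, and the decision to stop at the first failing rule are all hypothetical.

```python
import sqlite3

# Hypothetical rule list: (category, name, SQL that counts violating rows).
# The list is deliberately written out of order.
RULES = [
    ("business", "a_greater_than_b", "SELECT COUNT(1) FROM staging WHERE a <= b"),
    ("technical", "a_not_null", "SELECT COUNT(1) FROM staging WHERE a IS NULL"),
]
CATEGORY_ORDER = {"technical": 0, "business": 1}

def run_rules(conn, rules):
    """Run technical rules before business rules; stop at the first failure."""
    executed = []
    for category, name, sql in sorted(rules, key=lambda r: CATEGORY_ORDER[r[0]]):
        executed.append(name)
        if conn.execute(sql).fetchone()[0] > 0:
            return executed, name
    return executed, None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(5, 1), (None, 2)])

executed, failed = run_rules(conn, RULES)
```

Here the technical rule fails on the row with a missing `a`, so the business rule is never evaluated against data that is already known to be technically invalid.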
In addition, to make the quality rules easy to maintain and extend, when a new quality rule is needed, its script code is preferably added in a way that keeps the existing quality rules independent.
Further, to improve the completeness of the stored data, data that does not pass the detection in this step can be kept in the data warehouse as backup data of the source database.
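One way to keep the failing data is to copy it into a separate backup table inside the warehouse. The sketch below assumes SQLite and hypothetical names (`staging`, `staging_rejected`, and a rule that `b` must not be null); the patent text does not specify how the backup is stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(5, 1), (6, None), (7, 3)])

# Hypothetical violation predicate for a single rule: b must not be null.
VIOLATION = "b IS NULL"

# Copy the rows that fail the rule into a backup table instead of discarding them.
conn.execute(
    f"CREATE TABLE staging_rejected AS SELECT * FROM staging WHERE {VIOLATION}"
)

rejected = conn.execute("SELECT a, b FROM staging_rejected").fetchall()
staged_count = conn.execute("SELECT COUNT(1) FROM staging").fetchone()[0]
```

The staged rows are left untouched in this sketch, so the backup is purely additive.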
Preferably, so that data warehouse maintainers learn of detected data quality problems promptly and can handle them, this step triggers the corresponding data detection exception handling flow when it detects data that does not pass the quality detection. Specific exception handling methods are known to those skilled in the art and are not described here.
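A minimal sketch of such a trigger, assuming a callback that stands in for notifying the maintainers. The exception class, the `check` function, and the `on_failure` callback are hypothetical names introduced for illustration only.

```python
import sqlite3

class QualityCheckError(Exception):
    """Hypothetical exception raised when a rule finds violating rows."""

def check(conn, name, sql, on_failure):
    # `sql` is assumed to count the rows that violate the rule.
    count = conn.execute(sql).fetchone()[0]
    if count > 0:
        on_failure(name, count)  # e.g. notify the warehouse maintainers
        raise QualityCheckError(f"{name}: {count} violating row(s)")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (a INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?)", [(5,), (None,)])

notifications = []
try:
    check(
        conn,
        "a_not_null",
        "SELECT COUNT(1) FROM staging WHERE a IS NULL",
        lambda name, count: notifications.append((name, count)),
    )
    outcome = "passed"
except QualityCheckError as exc:
    outcome = str(exc)
```

Raising an exception after the notification keeps the later integration step from running on data that failed detection.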
Step 103: integrate the data that passes the quality detection.
This step integrates the data that passed the quality detection, which ensures the accuracy of the transformations that the ETL process later performs on it. Specific integration methods are known to those skilled in the art and are not described here.
A specific implementation of the present invention is further illustrated below with concrete SQL script code.
In the following example, the source table is A, the data warehouse table is B, and both tables have the fields a, b, and c.
Step x1: execute insert into B select * from A
This step extracts data from table A (that is, extracts the source database data) and stores it in data warehouse table B.
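Step x1 can be tried out with SQLite as below. The sketch assumes, for simplicity, that A and B live in the same in-memory database and that every field is an integer; the consistency check at the end is an added assumption and is not part of step x1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE TABLE B (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [(5, 1, 2), (7, 2, 3)])

# Step x1: one-to-one extraction with no quality conditions in the query.
conn.execute("INSERT INTO B SELECT * FROM A")

# Optional consistency check (an assumption): rows in A that are missing from B.
missing = conn.execute("SELECT * FROM A EXCEPT SELECT * FROM B").fetchall()
source_count = conn.execute("SELECT COUNT(1) FROM A").fetchone()[0]
target_count = conn.execute("SELECT COUNT(1) FROM B").fetchone()[0]
```

An empty `missing` list together with equal row counts suggests that the extraction copied the source table faithfully.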
Step x2: select count(1) from B where a>9999
This step performs the technical logic check (assuming that the maximum value of a is 9999). If the result is greater than or equal to 1, field a exceeds the maximum in at least one row, so technically abnormal data exists and the corresponding exception handling flow is triggered: if the source database data is wrong, the person responsible for the source database is notified to correct it; if the extraction itself went wrong, the extraction is performed again. If the result is less than 1, there is no technically abnormal data and the process moves to the next step.
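The branch taken after step x2 can be sketched as follows, assuming SQLite. The function name and the strings that label the next step are hypothetical; the query and the maximum of 9999 come from the example above.

```python
import sqlite3

MAX_A = 9999  # maximum value assumed in the example

def technical_check(conn):
    """Step x2: count the rows of B whose field a exceeds the maximum."""
    return conn.execute(
        "SELECT COUNT(1) FROM B WHERE a > ?", (MAX_A,)
    ).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE B (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [(5, 1, 2), (10000, 2, 3)])

violations = technical_check(conn)
# A count of 1 or more means abnormal data exists, so exception handling is triggered.
next_step = "exception handling" if violations >= 1 else "step x3"
```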
Step x3: select count(1) from B where a>b
This step performs the business logic check (field a is greater than field b). If the result is greater than or equal to 1, the business logic is satisfied and the check succeeds, so the data integration of the next step can proceed. Otherwise, the business logic check has failed and the subsequent data integration cannot be carried out.
Step x4: select d=a+b from B
This step performs the data integration, here by computing a new field d as the sum of a and b.
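The integration query can be run in SQLite as shown below. The `d=a+b` form is the T-SQL column alias syntax; SQLite would read it as a comparison, so this sketch uses the standard `a + b AS d` form instead. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE B (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [(5, 1, 2), (7, 2, 3)])

# Step x4: derive the integrated field d from the checked data in B.
integrated = conn.execute("SELECT a + b AS d FROM B ORDER BY d").fetchall()
```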
In addition, to add a check that field a is greater than field c, only the corresponding SQL statement needs to be added to the rule script code: select count(1) from B where a>c. Quality detection executes the rules in order (the rule base has an order-priority field that controls the order in which the detection logic runs), so select count(1) from B where a>c is added after step 3, and the SQL for the original steps 2 and 3 does not need to be modified.
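A rule base with an order-priority field might look like the sketch below. The table name `rule_base`, its columns, and the rule names are hypothetical; only the three SQL statements come from the example. The function returns the raw count of each rule and leaves the interpretation of the counts to the caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE B (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [(5, 1, 2), (7, 2, 3)])

# Hypothetical rule base: one row per rule; `priority` controls execution order.
conn.execute("CREATE TABLE rule_base (priority INTEGER, name TEXT, check_sql TEXT)")
conn.executemany(
    "INSERT INTO rule_base VALUES (?, ?, ?)",
    [
        (2, "a_max", "SELECT COUNT(1) FROM B WHERE a > 9999"),
        (3, "a_gt_b", "SELECT COUNT(1) FROM B WHERE a > b"),
    ],
)

def run_rule_base(conn):
    """Execute every rule in priority order and return (name, count) pairs."""
    rules = conn.execute(
        "SELECT name, check_sql FROM rule_base ORDER BY priority"
    ).fetchall()
    return [(name, conn.execute(sql).fetchone()[0]) for name, sql in rules]

before = run_rule_base(conn)

# Adding the new rule is a single insert; the rows for steps 2 and 3 are untouched.
conn.execute(
    "INSERT INTO rule_base VALUES (4, 'a_gt_c', 'SELECT COUNT(1) FROM B WHERE a > c')"
)
after = run_rule_base(conn)
```

Storing each rule as its own row is what keeps the existing rules independent: extending the checks never edits a statement that is already in use.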
As the above technical solution shows, the present invention makes the script code of the extraction stage independent of the script code of quality detection. This avoids the problems caused by bundling the two, improves data extraction efficiency and data quality, and makes the data quality detection rules easy to extend.
Fig. 2 is a schematic structural diagram of the data quality detection device corresponding to the above method. As shown in the figure, the device comprises:
a first extraction unit, configured to extract data from a source database;
a quality inspection unit, configured to perform quality detection on the extracted data according to preset quality rules;
a second extraction unit, configured to integrate the data that passes the quality detection.
Preferably, the first extraction unit performs the extraction one-to-one.
Preferably, the quality rules include technical rules and business logic rules, and the quality inspection unit performs the quality detection by applying the technical rules first and the business logic rules second.
Preferably, the quality inspection unit is further configured to keep the data that does not pass the quality detection in the data warehouse as backup data of the source database.
Preferably, the quality inspection unit is further configured to trigger the corresponding data detection exception handling flow when it detects data that does not pass the quality detection.
Preferably, the quality inspection unit is further configured to add the script code of a new quality rule, when a new quality rule is needed, in a way that keeps the existing quality rules independent.
In summary, the above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (12)
1. A data quality detection method, characterized by comprising:
a. extracting data from a source database;
b. performing quality detection on the extracted data according to preset quality rules;
c. integrating the data that passes the quality detection.
2. The method according to claim 1, characterized in that the extraction in step a is performed one-to-one.
3. The method according to claim 1, characterized in that the quality rules include technical rules and business logic rules, and the quality detection is performed by applying the technical rules first and the business logic rules second.
4. The method according to claim 1, characterized in that step b further comprises: keeping the data that does not pass the quality detection in the data warehouse as backup data of the source database.
5. The method according to claim 1, characterized in that step b further comprises: triggering the corresponding data detection exception handling flow when data that does not pass the quality detection is detected.
6. The method according to claim 1, characterized in that the method further comprises: when a new quality rule is needed, adding the script code of the new quality rule in a way that keeps the existing quality rules independent.
7. A data quality detection device, characterized by comprising:
a first extraction unit, configured to extract data from a source database;
a quality inspection unit, configured to perform quality detection on the extracted data according to preset quality rules;
a second extraction unit, configured to integrate the data that passes the quality detection.
8. The device according to claim 7, characterized in that the first extraction unit performs the extraction one-to-one.
9. The device according to claim 7, characterized in that the quality rules include technical rules and business logic rules, and the quality inspection unit performs the quality detection by applying the technical rules first and the business logic rules second.
10. The device according to claim 7, characterized in that the quality inspection unit is further configured to keep the data that does not pass the quality detection in the data warehouse as backup data of the source database.
11. The device according to claim 7, characterized in that the quality inspection unit is further configured to trigger the corresponding data detection exception handling flow when data that does not pass the quality detection is detected.
12. The device according to claim 7, characterized in that the quality inspection unit is further configured to add the script code of a new quality rule, when a new quality rule is needed, in a way that keeps the existing quality rules independent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510272664.8A CN104820720A (en) | 2015-05-26 | 2015-05-26 | Data quality detecting method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104820720A true CN104820720A (en) | 2015-08-05 |
Family
ID=53731015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510272664.8A Pending CN104820720A (en) | 2015-05-26 | 2015-05-26 | Data quality detecting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104820720A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107333014A (en) * | 2017-06-29 | 2017-11-07 | 上海澄美信息服务有限公司 | A kind of intelligence recording quality inspection system |
CN107895003A (en) * | 2017-10-31 | 2018-04-10 | 山东浪潮云服务信息科技有限公司 | A kind of data quality checking method and apparatus |
CN108875056A (en) * | 2018-06-28 | 2018-11-23 | 中国建设银行股份有限公司 | Data validation method, apparatus, electronic equipment and readable storage medium storing program for executing |
CN109491990A (en) * | 2018-09-17 | 2019-03-19 | 武汉达梦数据库有限公司 | A kind of method of detection data quality and the device of detection data quality |
CN109656812A (en) * | 2018-11-19 | 2019-04-19 | 平安科技(深圳)有限公司 | Data quality checking method, apparatus and storage medium |
CN111241073A (en) * | 2018-11-29 | 2020-06-05 | 阿里巴巴集团控股有限公司 | Data quality inspection method and device |
CN112734281A (en) * | 2021-01-21 | 2021-04-30 | 山东健康医疗大数据有限公司 | Decoupling processing method for quality control and task scheduling in medical data processing |
CN113128943A (en) * | 2019-12-30 | 2021-07-16 | 北京懿医云科技有限公司 | Data quality monitoring method and device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080195430A1 (en) * | 2007-02-12 | 2008-08-14 | Yahoo! Inc. | Data quality measurement for etl processes |
CN101515290A (en) * | 2009-03-25 | 2009-08-26 | 中国工商银行股份有限公司 | Metadata management system with bidirectional interactive characteristics and implementation method thereof |
CN101533407A (en) * | 2009-04-10 | 2009-09-16 | 中国科学院软件研究所 | Method for detecting exceptional data in ETL flow |
CN101576893A (en) * | 2008-05-09 | 2009-11-11 | 北京世纪拓远软件科技发展有限公司 | Method and system for analyzing data quality |
CN102117306A (en) * | 2010-01-04 | 2011-07-06 | 阿里巴巴集团控股有限公司 | Method and system for monitoring ETL (extract-transform-load) data processing process |
CN102609537A (en) * | 2012-02-17 | 2012-07-25 | 广东电网公司电力科学研究院 | Data quality audit method based on database schema |
US20120197887A1 (en) * | 2011-01-28 | 2012-08-02 | Ab Initio Technology Llc | Generating data pattern information |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107333014A (en) * | 2017-06-29 | 2017-11-07 | 上海澄美信息服务有限公司 | A kind of intelligence recording quality inspection system |
CN107895003A (en) * | 2017-10-31 | 2018-04-10 | 山东浪潮云服务信息科技有限公司 | A kind of data quality checking method and apparatus |
CN108875056A (en) * | 2018-06-28 | 2018-11-23 | 中国建设银行股份有限公司 | Data validation method, apparatus, electronic equipment and readable storage medium storing program for executing |
CN108875056B (en) * | 2018-06-28 | 2021-08-13 | 中国建设银行股份有限公司 | Data checking method and device, electronic equipment and readable storage medium |
CN109491990A (en) * | 2018-09-17 | 2019-03-19 | 武汉达梦数据库有限公司 | A kind of method of detection data quality and the device of detection data quality |
CN109656812A (en) * | 2018-11-19 | 2019-04-19 | 平安科技(深圳)有限公司 | Data quality checking method, apparatus and storage medium |
CN111241073A (en) * | 2018-11-29 | 2020-06-05 | 阿里巴巴集团控股有限公司 | Data quality inspection method and device |
CN111241073B (en) * | 2018-11-29 | 2023-06-20 | 阿里巴巴集团控股有限公司 | Data quality inspection method and device |
CN113128943A (en) * | 2019-12-30 | 2021-07-16 | 北京懿医云科技有限公司 | Data quality monitoring method and device, electronic equipment and storage medium |
CN113128943B (en) * | 2019-12-30 | 2023-12-05 | 北京懿医云科技有限公司 | Data quality monitoring method, device, electronic equipment and storage medium |
CN112734281A (en) * | 2021-01-21 | 2021-04-30 | 山东健康医疗大数据有限公司 | Decoupling processing method for quality control and task scheduling in medical data processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104820720A (en) | Data quality detecting method and device | |
CN104115154B (en) | Secure data is maintained to be isolated with dangerous access when switching between domain | |
CN101601013B (en) | Controlling instruction execution in a processing environment | |
US20170147469A1 (en) | Correlation of source code with system dump information | |
CN103782573A (en) | Masking server outages from clients and applications | |
US20200304366A1 (en) | Routing configuration method of view files, storage medium, terminal device and apparatus | |
US10536380B2 (en) | Method and system for intelligent link load balancing | |
US9396060B2 (en) | Information processing method, information processing device and recording medium | |
KR20150077474A (en) | Rule distribution server, as well as event processing system, method, and program | |
CN107463492A (en) | Application failure localization method and device | |
US8290916B2 (en) | Rule-based record profiles to automate record declaration of electronic documents | |
CN105426128A (en) | Index maintenance method and device | |
CN104915593A (en) | Binding removing processing method and system for software | |
CN107741891B (en) | Object reconstruction method, medium, device and computing equipment | |
US11556497B2 (en) | Real-time archiving method and system based on hybrid cloud | |
CN103218298B (en) | Test case screening, correlation strategy method of testing and the device of search engine | |
CN110362416A (en) | Page assembly loading method and device, electronic equipment, storage medium | |
CN112579330B (en) | Processing method, device and equipment for abnormal data of operating system | |
US20170286440A1 (en) | Method, business processing server and data processing server for storing and searching transaction history data | |
CN103167545B (en) | Be correlated with the method for IP cutover and device in a kind of base station | |
CN107705089B (en) | Service processing method, device and equipment | |
CN112306371A (en) | Method, apparatus and computer program product for storage management | |
CN109933459A (en) | A kind of execution method and apparatus of multitask | |
CN107992992A (en) | Unionpay's IC card transaction data analysis system and method | |
CN103684859B (en) | Method and system for upgrading network cell equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20150805 |