CN104820720A - Data quality detecting method and device - Google Patents

Info

Publication number
CN104820720A
Authority
CN
China
Prior art keywords
data
quality
rule
quality testing
testing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510272664.8A
Other languages
Chinese (zh)
Inventor
白贤锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201510272664.8A
Publication of CN104820720A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention discloses a data quality checking method and device. The method first extracts data from a source database, then performs a quality check on the extracted data according to preset quality rules, and finally integrates the data that pass the quality check. The method and device improve data extraction efficiency and data quality and make the quality rules easier to extend.

Description

Data quality checking method and apparatus
Technical field
The present invention relates to data warehouse technology, and in particular to a data quality checking method and apparatus.
Background
A data warehouse (Data Warehouse, DW or DWH) is a strategic collection of all types of data that supports decision-making at every level of an enterprise. It is a single data store created for analytical reporting and decision support. It gives enterprises that need business intelligence guidance for improving business processes and for monitoring time, cost, quality, and control.
Extract-Transform-Load (ETL) is a data processing procedure in which data are extracted from a source, transformed, and loaded into a destination. ETL is commonly used in data warehouse technology. Here, extraction means pulling the source data into the data warehouse; transformation means that developers convert the extracted data into the target data structure according to business needs and aggregate them; loading means writing the transformed and aggregated data into the target data warehouse.
With the widespread use of big data, data have become one of an organization's most valuable assets. The quality of an enterprise's data is directly linked to its business performance: high-quality data help a company stay competitive and remain resilient during periods of economic turmoil. With pervasive data quality, an enterprise can at any time trust all of its data to meet all of its needs.
At present, data quality checking in a data warehouse is closely tied to the extraction stage of the ETL process: the quality check is performed while the data are being extracted, that is, the quality checks are mixed into the script code for data extraction. The existing quality checking scheme is therefore too tightly coupled to the extraction stage, which causes the following problems.
1. Whether the data quality check succeeds largely determines the speed of data extraction. If the check fails, it has to be run again; because the quality check is bundled with the extraction, the extraction step has to be re-executed as well. A failed quality check therefore slows down data extraction as a whole.
2. Data quality is low. Because the quality check is bundled with the extraction, the corresponding SQL statement considers fewer fields in order to keep extraction efficient, so its constraints are loose. The extracted data may therefore fail to satisfy strict technical and business logic rules, which lowers the quality of the extracted data.
3. The data quality rules are hard to extend. Because the quality check is bundled with the extraction, updating a quality rule also requires changing the script code of the extraction stage. The amount of script code to revise on each rule update is therefore large, which hinders maintaining and extending the rules.
In short, existing data quality checking methods reduce data extraction efficiency, lower data quality, and make the quality rules hard to maintain.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a data quality checking method and apparatus that improve data extraction efficiency and data quality and make the data quality rules easy to extend.
To achieve this purpose, the technical solution proposed by the present invention is as follows:
A data quality checking method, comprising:
a. extracting data from a source database;
b. performing a quality check on the extracted data according to preset quality rules;
c. integrating the data that pass the quality check.
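The three steps can be sketched as follows. This is a minimal illustration using Python's standard `sqlite3` module, not the patent's implementation; the table names `source`, `staging`, and `integrated`, the column `amount`, and the single rule "amount must not be negative" are all assumptions made for the example.

```python
import sqlite3

def run_pipeline(conn):
    """Extract, check, then integrate; each step is its own statement."""
    # Step a: extract the source table as-is into a staging table.
    conn.execute("CREATE TABLE staging AS SELECT * FROM source")
    # Step b: the quality check runs on the staging copy, not on the source.
    bad = conn.execute(
        "SELECT COUNT(*) FROM staging WHERE amount < 0"  # hypothetical rule
    ).fetchone()[0]
    if bad > 0:
        return False  # the extracted copy is kept; only the check failed
    # Step c: integrate only after the check has passed.
    conn.execute(
        "CREATE TABLE integrated AS SELECT id, amount * 2 AS doubled FROM staging"
    )
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)", [(1, 10), (2, 20)])
ok = run_pipeline(conn)
rows = conn.execute("SELECT id, doubled FROM integrated ORDER BY id").fetchall()
```

Because the check reads only the staging copy, a failed check can be repeated without running the extraction again.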
A data quality checking device, comprising:
a first extraction unit, configured to extract data from a source database;
a quality check unit, configured to perform a quality check on the extracted data according to preset quality rules;
a second extraction unit, configured to integrate the data that pass the quality check.
In summary, the data quality checking method and apparatus proposed by the present invention first perform data extraction on its own and then quality-check the extraction result. The script code for the quality check is thus independent of the script code for extraction, which avoids the problems of the existing quality checking scheme, improves data extraction efficiency and data quality, and makes the data quality rules easy to extend.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the apparatus according to an embodiment of the present invention.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The core idea of the present invention is to separate data extraction from data integration within the extraction process: data extraction is performed on its own first, and the extraction result is quality-checked before data integration. The script code for extraction is thus independent of the script code for the quality check, so the problems that arise when the two are bound together no longer occur. The success or failure of the quality check no longer affects the overall progress of data extraction, the accuracy of the quality check improves so that data quality is safeguarded, and the data quality rules become easy to extend.
Fig. 1 is a schematic flowchart of the data quality checking method according to an embodiment of the present invention. As shown in Fig. 1, the embodiment mainly comprises:
Step 101: extract data from the source database.
This step extracts data from the source database on its own, so that the script code for data extraction is independent of the script code for the quality check, which avoids the problems that arise when the two are bundled together.
In practice, any existing extraction method can be used to extract data from the source database. Preferably, to ensure that the extracted data are consistent with the data in the source database, one-to-one extraction is used. The specific one-to-one extraction method is known to those skilled in the art and is not described here.
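The text leaves the one-to-one extraction method to the reader. One possible reading, sketched below with Python's `sqlite3`, copies every row and column unchanged and then confirms that the copy matches the source. The helper name `extract_one_to_one` and the consistency checks (row counts plus an `EXCEPT` comparison) are assumptions of this sketch, not part of the patent.

```python
import sqlite3

def extract_one_to_one(conn, source, target):
    """Copy every row and column of `source` into a new table `target`."""
    # Table names are interpolated directly, so they must be trusted identifiers.
    conn.execute(f"CREATE TABLE {target} AS SELECT * FROM {source}")
    # Rows present in the source but absent from the copy (expected: none).
    missing = conn.execute(
        f"SELECT * FROM {source} EXCEPT SELECT * FROM {target}"
    ).fetchall()
    src_count = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
    dst_count = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
    return src_count, dst_count, missing

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [(1, 2, 3), (4, 5, 6), (7, 8, 9)])
src_count, dst_count, missing = extract_one_to_one(conn, "A", "B")
```

No filter or transformation is applied during the copy, so the extraction statement never needs to change when a quality rule changes.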
Step 102: perform a quality check on the extracted data according to preset quality rules.
This step performs the quality check on the extracted data. Because the script code for the quality check is independent of the script code for data extraction in step 101, the quality check does not need to take extraction efficiency into account. The constraints in the quality check SQL statements can therefore cover more fields and match the actual quality rules, which safeguards data quality.
In practice, those skilled in the art can set the quality rules according to the needs of the application; the rules may include technical rules and business logic rules. In that case, to make the quality check script code more maintainable, that is, to make the quality rules easy to maintain and extend, the rules preferably follow a logical order: the quality check applies the technical rules first and the business logic rules afterwards.
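A minimal sketch of the "technical rules first, business logic rules afterwards" ordering is shown below. The `Rule` class, the `KIND_ORDER` mapping, and the convention that each rule's SQL counts violating rows are assumptions made for illustration; the patent does not prescribe a data structure.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    kind: str           # "technical" or "business"
    violation_sql: str  # returns the number of rows that break the rule

# Technical rules sort ahead of business logic rules.
KIND_ORDER = {"technical": 0, "business": 1}

def run_rules(conn, rules):
    """Run rules in kind order; stop at the first rule with violations."""
    executed = []
    for rule in sorted(rules, key=lambda r: KIND_ORDER[r.kind]):
        executed.append(rule.name)
        violations = conn.execute(rule.violation_sql).fetchone()[0]
        if violations > 0:
            return executed, rule.name
    return executed, None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE B (a INTEGER, b INTEGER)")
conn.execute("INSERT INTO B VALUES (1, 5)")
rules = [
    # Registered out of order on purpose; run_rules restores the order.
    Rule("a_gt_b", "business", "SELECT COUNT(*) FROM B WHERE a <= b"),
    Rule("a_max", "technical", "SELECT COUNT(*) FROM B WHERE a > 9999"),
]
executed, failed = run_rules(conn, rules)
```

Running the cheaper structural checks first means a business rule is never evaluated against data that are already known to be technically invalid.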
In addition, to make the quality rules easier to maintain and extend, when a new quality rule needs to be added, its script code is preferably added in a way that keeps the existing quality rules independent.
Further, to improve the completeness of the stored data, data that fail the check in this step can be kept in the data warehouse as backup data of the source database.
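How the failing data are kept is not specified. One possible interpretation, sketched below, moves the failing rows into a separate backup table inside the warehouse so that they are retained but excluded from integration. The table name `B_backup`, the row-level granularity, and the rule `a > 9999` are assumptions of this sketch.

```python
import sqlite3

def set_aside_failures(conn):
    """Move rows that break the (assumed) rule a <= 9999 into a backup table."""
    # Keep the failing rows so that nothing extracted from the source is lost.
    conn.execute("CREATE TABLE B_backup AS SELECT * FROM B WHERE a > 9999")
    # Remove them from the table that feeds integration.
    conn.execute("DELETE FROM B WHERE a > 9999")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE B (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO B VALUES (?, ?)", [(5, 1), (10000, 1)])
set_aside_failures(conn)
kept = conn.execute("SELECT a, b FROM B").fetchall()
backed_up = conn.execute("SELECT a, b FROM B_backup").fetchall()
```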
Preferably, so that the data warehouse maintainers learn of detected data quality anomalies promptly and can handle them, the corresponding data anomaly handling procedure is triggered in this step when data that fail the quality check are detected. The specific anomaly handling method is known to those skilled in the art and is not described here.
Step 103: integrate the data that pass the quality check.
This step integrates the data that passed the quality check, which ensures the accuracy of the transformation that the ETL process performs on them. The specific integration method is known to those skilled in the art and is not described here.
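Triggering an anomaly handling procedure can be sketched with a callback. The function `check_with_handler` and its `on_anomaly` parameter are hypothetical names; a real handler might send a notification to the maintainers, whereas this example only records the violation count in a list.

```python
import sqlite3
from typing import Callable

def check_with_handler(conn, violation_sql: str,
                       on_anomaly: Callable[[int], None]) -> bool:
    """Run one check; call the handler when at least one row fails."""
    count = conn.execute(violation_sql).fetchone()[0]
    if count >= 1:
        on_anomaly(count)  # e.g. alert the warehouse maintainers
        return False
    return True

alerts = []  # stands in for a notification channel
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE B (a INTEGER)")
conn.executemany("INSERT INTO B VALUES (?)", [(10000,), (20000,), (5,)])
passed = check_with_handler(
    conn, "SELECT COUNT(1) FROM B WHERE a > 9999", alerts.append
)
```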
A specific implementation of the present invention is described in more detail below using concrete SQL script code.
In the example below, the source table is A, the data warehouse table is B, and both tables have the fields a, b, and c.
Step x1: execute INSERT INTO B SELECT * FROM A
This step extracts the data from table A (that is, the source database data) and stores them in table B of the data warehouse.
Step x2: SELECT COUNT(1) FROM B WHERE a > 9999
This step performs the technical logic check (assuming that the maximum value of a is 9999). If the result is greater than or equal to 1, field a exceeds the maximum value in at least one row, so technically anomalous data exist and the corresponding exception handling procedure is triggered (if the source data are wrong, the owner of the source database is notified to correct them; if the extraction was wrong, the extraction is run again). If the result is less than 1, there are no technically anomalous data and the process moves to the next step.
Step x3: SELECT COUNT(1) FROM B WHERE a > b
This step performs the business logic check (field a is greater than field b). If the result is greater than or equal to 1, the business logic is satisfied and the check succeeds, so the process can move on to data integration. Otherwise the business logic check has failed and the subsequent data integration cannot be carried out.
Step x4: SELECT a + b AS d FROM B
This step performs the data integration.
In addition, to also check that field a is greater than field c, it is only necessary to add the corresponding SQL string to the rule script code: SELECT COUNT(1) FROM B WHERE a > c. The quality check executes the rules in order (the rule base has an order/priority field that controls the order in which the checks run), so this statement is added after step x3, and the SQL of the original steps x2 and x3 does not need to be modified.
As the above technical solution shows, the present invention makes the script code of the extraction stage independent of the script code of the quality check, which avoids the problems that arise when the two are bundled together, improves data extraction efficiency and data quality, and makes the data quality rules easy to extend.
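The walk-through above can be run end to end with Python's `sqlite3` module. This is a sketch under stated assumptions: the sample rows are invented, the rule base is modeled as a list of `(priority, sql, pass_condition)` tuples, and the pass conditions simply mirror the text as written (the technical check passes when the count is below 1, the business checks pass when the count is at least 1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE TABLE B (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [(10, 3, 2), (8, 5, 1)])

# Step x1: one-to-one extraction from the source table into the warehouse.
conn.execute("INSERT INTO B SELECT * FROM A")

# Rule base: (priority, SQL, pass condition applied to the returned count).
rules = [
    (1, "SELECT COUNT(1) FROM B WHERE a > 9999", lambda n: n < 1),   # step x2
    (2, "SELECT COUNT(1) FROM B WHERE a > b", lambda n: n >= 1),     # step x3
]
# The new rule is appended; the entries for x2 and x3 are left untouched.
rules.append((3, "SELECT COUNT(1) FROM B WHERE a > c", lambda n: n >= 1))

def run_checks(conn, rules):
    """Run rules in priority order; return the priority of the first failure."""
    for priority, sql, passes in sorted(rules, key=lambda r: r[0]):
        count = conn.execute(sql).fetchone()[0]
        if not passes(count):
            return priority
    return None

failed = run_checks(conn, rules)

# Step x4: integrate only when every check has passed.
d_values = []
if failed is None:
    d_values = [row[0] for row in
                conn.execute("SELECT a + b AS d FROM B ORDER BY rowid")]
```

Adding the third rule is a single `append`; neither the extraction statement nor the earlier rules change, which is the extensibility property the text describes.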
Fig. 2 is a schematic structural diagram of the data quality checking device corresponding to the above method. As shown in the figure, the device comprises:
a first extraction unit, configured to extract data from a source database;
a quality check unit, configured to perform a quality check on the extracted data according to preset quality rules;
a second extraction unit, configured to integrate the data that pass the quality check.
Preferably, the first extraction unit performs the extraction in a one-to-one manner.
Preferably, the quality rules include technical rules and business logic rules, and the quality check unit performs the quality check by applying the technical rules first and the business logic rules afterwards.
Preferably, the quality check unit is further configured to keep data that do not pass the quality check in the data warehouse as backup data of the source database.
Preferably, the quality check unit is further configured to trigger the corresponding data anomaly handling procedure when data that do not pass the quality check are detected.
Preferably, the quality check unit is further configured to, when a new quality rule needs to be added, add the script code of the new quality rule in a way that keeps the existing quality rules independent.
In summary, the above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
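The three units could be modeled as three small classes, one per responsibility. The class names follow the unit names in the text; their methods, the `add_rule` helper, and the convention that each rule query counts violating rows are assumptions of this sketch.

```python
import sqlite3

class FirstExtractionUnit:
    def run(self, conn):
        # One-to-one copy of source table A into warehouse table B.
        conn.execute("CREATE TABLE B AS SELECT * FROM A")

class QualityCheckUnit:
    def __init__(self, violation_queries):
        self.violation_queries = list(violation_queries)

    def add_rule(self, sql):
        # A new rule is appended; existing rules are not modified.
        self.violation_queries.append(sql)

    def run(self, conn):
        # Passes only when every rule reports zero violating rows.
        return all(conn.execute(q).fetchone()[0] == 0
                   for q in self.violation_queries)

class SecondExtractionUnit:
    def run(self, conn):
        # Integration step: derive d = a + b from the checked data.
        return conn.execute("SELECT a + b FROM B ORDER BY rowid").fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(2, 1), (4, 3)])

checker = QualityCheckUnit(["SELECT COUNT(*) FROM B WHERE a > 9999"])
checker.add_rule("SELECT COUNT(*) FROM B WHERE a <= b")

FirstExtractionUnit().run(conn)
passed = checker.run(conn)
result = SecondExtractionUnit().run(conn) if passed else []
```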

Claims (12)

1. A data quality checking method, characterized by comprising:
a. extracting data from a source database;
b. performing a quality check on the extracted data according to preset quality rules;
c. integrating the data that pass the quality check.
2. The method according to claim 1, characterized in that the extraction in step a is performed in a one-to-one manner.
3. The method according to claim 1, characterized in that the quality rules include technical rules and business logic rules, and the quality check is performed by applying the technical rules first and the business logic rules afterwards.
4. The method according to claim 1, characterized in that step b further comprises: keeping data that do not pass the quality check in the data warehouse as backup data of the source database.
5. The method according to claim 1, characterized in that step b further comprises: triggering the corresponding data anomaly handling procedure when data that do not pass the quality check are detected.
6. The method according to claim 1, characterized in that the method further comprises: when a new quality rule needs to be added, adding the script code of the new quality rule in a way that keeps the existing quality rules independent.
7. A data quality checking device, characterized by comprising:
a first extraction unit, configured to extract data from a source database;
a quality check unit, configured to perform a quality check on the extracted data according to preset quality rules;
a second extraction unit, configured to integrate the data that pass the quality check.
8. The device according to claim 7, characterized in that the first extraction unit performs the extraction in a one-to-one manner.
9. The device according to claim 7, characterized in that the quality rules include technical rules and business logic rules, and the quality check unit performs the quality check by applying the technical rules first and the business logic rules afterwards.
10. The device according to claim 7, characterized in that the quality check unit is further configured to keep data that do not pass the quality check in the data warehouse as backup data of the source database.
11. The device according to claim 7, characterized in that the quality check unit is further configured to trigger the corresponding data anomaly handling procedure when data that do not pass the quality check are detected.
12. The device according to claim 7, characterized in that the quality check unit is further configured to, when a new quality rule needs to be added, add the script code of the new quality rule in a way that keeps the existing quality rules independent.
CN201510272664.8A 2015-05-26 2015-05-26 Data quality detecting method and device Pending CN104820720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510272664.8A CN104820720A (en) 2015-05-26 2015-05-26 Data quality detecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510272664.8A CN104820720A (en) 2015-05-26 2015-05-26 Data quality detecting method and device

Publications (1)

Publication Number Publication Date
CN104820720A true CN104820720A (en) 2015-08-05

Family

ID=53731015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510272664.8A Pending CN104820720A (en) 2015-05-26 2015-05-26 Data quality detecting method and device

Country Status (1)

Country Link
CN (1) CN104820720A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195430A1 (en) * 2007-02-12 2008-08-14 Yahoo! Inc. Data quality measurement for etl processes
CN101576893A (en) * 2008-05-09 2009-11-11 北京世纪拓远软件科技发展有限公司 Method and system for analyzing data quality
CN101515290A (en) * 2009-03-25 2009-08-26 中国工商银行股份有限公司 Metadata management system with bidirectional interactive characteristics and implementation method thereof
CN101533407A (en) * 2009-04-10 2009-09-16 中国科学院软件研究所 Method for detecting exceptional data in ETL flow
CN102117306A (en) * 2010-01-04 2011-07-06 阿里巴巴集团控股有限公司 Method and system for monitoring ETL (extract-transform-load) data processing process
US20120197887A1 (en) * 2011-01-28 2012-08-02 Ab Initio Technology Llc Generating data pattern information
CN102609537A (en) * 2012-02-17 2012-07-25 广东电网公司电力科学研究院 Data quality audit method based on database schema

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107333014A (en) * 2017-06-29 2017-11-07 上海澄美信息服务有限公司 A kind of intelligence recording quality inspection system
CN107895003A (en) * 2017-10-31 2018-04-10 山东浪潮云服务信息科技有限公司 A kind of data quality checking method and apparatus
CN108875056A (en) * 2018-06-28 2018-11-23 中国建设银行股份有限公司 Data validation method, apparatus, electronic equipment and readable storage medium storing program for executing
CN108875056B (en) * 2018-06-28 2021-08-13 中国建设银行股份有限公司 Data checking method and device, electronic equipment and readable storage medium
CN109491990A (en) * 2018-09-17 2019-03-19 武汉达梦数据库有限公司 A kind of method of detection data quality and the device of detection data quality
CN109656812A (en) * 2018-11-19 2019-04-19 平安科技(深圳)有限公司 Data quality checking method, apparatus and storage medium
CN111241073A (en) * 2018-11-29 2020-06-05 阿里巴巴集团控股有限公司 Data quality inspection method and device
CN111241073B (en) * 2018-11-29 2023-06-20 阿里巴巴集团控股有限公司 Data quality inspection method and device
CN113128943A (en) * 2019-12-30 2021-07-16 北京懿医云科技有限公司 Data quality monitoring method and device, electronic equipment and storage medium
CN113128943B (en) * 2019-12-30 2023-12-05 北京懿医云科技有限公司 Data quality monitoring method, device, electronic equipment and storage medium
CN112734281A (en) * 2021-01-21 2021-04-30 山东健康医疗大数据有限公司 Decoupling processing method for quality control and task scheduling in medical data processing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150805