CN101576893A - Method and system for analyzing data quality - Google Patents


Info

Publication number
CN101576893A
Authority
CN
China
Prior art keywords
data
quality
loading
interface
situation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008101061878A
Other languages
Chinese (zh)
Inventor
廖子懿
Current Assignee
Beijing Shiji Tuoyuan Software Technology Development Co Ltd
Original Assignee
Beijing Shiji Tuoyuan Software Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Shiji Tuoyuan Software Technology Development Co Ltd filed Critical Beijing Shiji Tuoyuan Software Technology Development Co Ltd
Priority to CNA2008101061878A priority Critical patent/CN101576893A/en
Priority to SG200804013-1A priority patent/SG157252A1/en
Publication of CN101576893A publication Critical patent/CN101576893A/en
Legal status: Pending

Abstract

The invention provides a method and a system for analyzing the data quality of a data warehouse system. According to one aspect of the invention, the method comprises the following steps: step 1, analyzing the process of loading data at the interfaces of the data warehouse system to obtain interface-quality analysis indexes; step 2, analyzing the extract-transform-load (ETL) process applied to the data sets corresponding to the data in the data warehouse system to obtain data-quality analysis indexes; and step 3, grading the data quality according to the interface-quality analysis indexes and the data-quality analysis indexes to obtain a grading result. The invention aims to guarantee data quality effectively, essentially without disturbing the normal operation of the production systems and the data warehouse system, while keeping the measures valid and affecting system execution efficiency as little as possible.

Description

Method and system for analyzing data quality
Technical field
The present invention relates to data quality analysis, and in particular to a data quality analysis method and a data quality analysis system for data warehouses.
Background technology
A data warehouse is an important tool that helps users carry out market analysis and decision-making. To a great extent, data quality determines whether the data warehouse is recognized and accepted by its users, and thus directly influences its promotion and application.
Data quality problems arise for many reasons. Guaranteeing the data quality of a data warehouse requires not only improvements to the warehouse itself but also improvements to the production systems. This section summarizes, from a technical standpoint, the causes of data quality problems in data warehouse systems and the requirements and measures for guaranteeing data quality.
When choosing concrete quality assurance measures, the validity of each measure must be balanced against its impact on system execution efficiency: the goal is to guarantee data quality effectively while essentially not disturbing the normal operation of the production systems and the data warehouse.
The causes of data quality problems fall into three categories: external causes, internal causes, and differences of understanding.
External cause
These mainly cover two aspects: data quality problems in the interaction between the data warehouse and peripheral systems, and problems in the technical implementation and management support processes, including:
An interface does not provide data on the agreed cycle, delivers late, or misses a delivery;
The table structure or interface specification of a data source system changes without synchronization, so the interface cannot be accessed;
The interface data content is non-standard, causing transcription errors, interrupting the ETL (extract-transform-load) process, or causing data to be rejected. Because the data source lacks consistency checks such as null-value checks and foreign-key constraints, or because of manually entered data, many failures occur during data transformation:
A null value appears in a non-null field;
A foreign key lacks a corresponding value in the respective code table, or the value is non-standard and can be referenced only after transformation (for example, trim handling);
A primary key is duplicated: the source system defines no physical primary key and lacks a uniqueness check, but the target system does define one, so the insert fails;
A data type is wrong, for example numeric data containing characters, or date-type data that are not valid dates;
The data format does not fit the business logic, for example an illegal passport number or an illegal phone number;
The data logic is invalid, for example two field values should add up to a third field value, but the real data are unequal.
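The consistency failures listed above (null values in non-null fields, missing foreign-key references, duplicate primary keys, and type errors) can be sketched as source-side checks. A minimal illustration in Python; all field names, the code table, and the concrete rules are hypothetical, not taken from the patent:

```python
# Hypothetical record-level checks mirroring the failure modes the
# background section lists. Field names and rules are illustrative only.

def validate_record(record, code_table, seen_keys):
    """Return a list of problems found in one interface record."""
    problems = []
    # Null value in a non-null field
    if record.get("account_id") is None:
        problems.append("null in non-null field: account_id")
    # Foreign key missing from the reference code table (trim before lookup)
    branch = (record.get("branch_code") or "").strip()
    if branch and branch not in code_table:
        problems.append("foreign key not in code table: " + branch)
    # Primary-key uniqueness (the source system may lack a physical key)
    key = record.get("txn_id")
    if key in seen_keys:
        problems.append("duplicate primary key: " + str(key))
    seen_keys.add(key)
    # Data-type error: a numeric field containing characters
    if not str(record.get("amount", "")).replace(".", "", 1).isdigit():
        problems.append("non-numeric amount")
    return problems
```

A clean record yields an empty list; a record exhibiting all four failure modes yields four problems, which a loader could count toward the rates described later.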
Internal causes
These are mainly data quality problems arising inside the data warehouse during ETL processing, business-logic processing, and so on. They are caused by deviations in business understanding and deviations in technical implementation at each data processing node inside the warehouse, and specifically include:
ETL rule errors. The mapping relationship from the original data source to the target data is misunderstood or mis-stated, causing correctness problems in the loaded data.
ETL implementation errors. Under a given ETL rule, the specific implementation deviates from the rule design or overlooks details, so the finally loaded data have correctness problems.
ETL dropping data. Because the source data were considered insufficiently, or processing steps were simplified for performance, data are dropped in an uncontrolled way.
ETL rejecting data. For invalid source data within the range of control, illegal records are rejected.
Indicator understanding problems
Because indicator definitions in the data warehouse are imprecise, understandings diverge and differ from the statistical standards of the production systems, so no reasonable and effective checking methods and means can be established between the data warehouse and the production systems.
According to modern data quality assurance thinking, data quality is a process rather than a result. Data warehouse quality should be embodied and realized throughout the planning, design, construction, and maintenance of the warehouse. In view of the present state of data warehouse construction, method support should be provided with emphasis on three aspects: the data quality organizational structure, data quality assurance, and the data quality verification mechanism.
The significance of data quality assurance is:
Defining responsibility: using the technical means of data analysis to apportion responsibility for data quality between the data interface providers and the data warehouse.
Process improvement: providing a basis for improving the data warehouse ETL process through means such as data auditing and data grading.
Quality examination: examining the data warehouse system through quality assessment reports.
To improve the quality assurance process continuously, guarantee that no known error recurs in the system, and establish a complete data quality documentation system, every data quality activity in the whole warehouse must be fully recorded, and a quality examination system must eventually be established or improved step by step. The key quality assurance points are the following. Completeness: where the data source can support it, the available data should satisfy the required business analysis, metadata should be complete, and dirty data should be handled carefully. Consistency: the available data should have consistent values when describing the same attribute of the same thing, the inherent consistency of interrelated data should be guaranteed, and data formats should be consistent. Accuracy: divided into reports, subject analysis, and data mining, with the accuracy requirement highest for reports, lower for subject analysis, and weakest for data mining. Time validity: the time at which data are produced should fit the business logic of the relevant field; for example, a new transaction record cannot be matched to an account that has already been closed. Timeliness: while guaranteeing completeness and accuracy, data must be processed promptly.
Summary of the invention
In view of one or more of the above problems, the present invention proposes a data quality analysis method and system. It is intended to guarantee data quality effectively, essentially without disturbing the normal operation of the production systems and the data warehouse, while balancing the validity of the measures against their impact on system execution efficiency.
The data quality analysis method for a data warehouse according to one aspect of the invention comprises the following steps: step 1, analyzing the process of loading data at the interfaces of the data warehouse to obtain interface-quality analysis indexes; step 2, analyzing the extract-transform-load (ETL) process applied to the data sets corresponding to the data in the data warehouse to obtain data-quality analysis indexes; and step 3, grading the data quality according to the interface-quality analysis indexes and the data-quality analysis indexes to obtain a grading result.
Here, the interface-quality analysis indexes comprise one or more of the following: an interface timeliness rate, representing whether the data arrive at the interface on time during loading; an interface completeness rate, representing whether the data conform to the predetermined interface specification during loading; and an interface validity rate, representing whether the data conform to the business logic during loading.
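A minimal sketch of how the three interface-quality indexes might be computed from per-file load statistics. The counter names (`records`, `on_time`, `spec_ok`, `logic_ok`) are assumptions for illustration; the patent only defines what each rate represents:

```python
# Illustrative computation of the interface-quality indexes of step 1.
# Each entry in `files` describes one interface file of a load cycle.

def interface_quality(files):
    """Return the three interface-quality rates for a load cycle."""
    total = sum(f["records"] for f in files)
    on_time = sum(f["records"] for f in files if f["on_time"])
    spec_ok = sum(f["spec_ok"] for f in files)    # records meeting the spec
    logic_ok = sum(f["logic_ok"] for f in files)  # records passing business logic
    return {
        "timeliness_rate": on_time / total,
        "completeness_rate": spec_ok / total,
        "validity_rate": logic_ok / total,
    }
```

For example, if one of two equal-sized files arrives late, the timeliness rate is 0.5 while the other two rates reflect per-record counts.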
Here, the data-quality analysis indexes comprise: a contemporaneous error rate, representing the error between the several metric values that correspond to several data sets in the same period of the ETL process; a contemporaneous fluctuation rate, representing the ratio between the several metric values that correspond to several data sets in the same period of the ETL process; a historical fluctuation rate, representing the change of the several metric values corresponding to several data sets across different periods of the ETL process; a standard error rate, representing the error between the several metric values and their respective predetermined values in the ETL process; a logical error rate, representing whether the several metric values in the ETL process satisfy the business-logic relations; a record reject rate, representing the extent to which a data set is refused loading in the ETL process; and a record loss rate, representing the extent to which a data set is dropped in the ETL process.
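Two of these indexes lend themselves to a short worked sketch: the contemporaneous error rate (a metric measured on different data sets in the same period should be identical) and the contemporaneous fluctuation rate (two metrics should keep a fixed ratio between data sets). The formulas below are one plausible reading of the definitions, not formulas given by the patent:

```python
# Plausible formulas for two data-quality indexes; the patent defines
# the indexes only in words, so these exact expressions are assumptions.

def contemporaneous_error_rate(values):
    """Largest relative deviation among metric values that should be identical."""
    base = values[0]
    return max(abs(v - base) / base for v in values)

def period_fluctuation(metric_a, metric_b, expected_ratio):
    """Relative deviation of the observed a/b ratio from its expected value."""
    return abs(metric_a / metric_b - expected_ratio) / expected_ratio
```

If three data sets report 100, 101, and 99 for the same metric, the contemporaneous error rate is 0.01; if two metrics expected at a 2:1 ratio come in at 210 and 100, the fluctuation is 5 percent.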
Here, the grading result is data accurate, data suspect, or data erroneous.
In step 3, whether the data-quality analysis indexes meet their predetermined values determines whether the grading result is data accurate, data suspect, or data erroneous.
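Step 3 can be illustrated as a threshold comparison. The two-tier scheme below, a soft limit marking data suspect and a hard limit marking data erroneous, is an assumption; the patent only says the grading result depends on whether the indexes meet predetermined values:

```python
# Illustrative grading of step 3 under an assumed two-threshold scheme.

def grade(indexes, warn, fail):
    """Return 'accurate', 'suspect', or 'error' for a batch of data."""
    if any(indexes[k] > fail[k] for k in indexes):
        return "error"      # reload the batch
    if any(indexes[k] > warn[k] for k in indexes):
        return "suspect"    # origin unclear, investigate
    return "accurate"       # safe to deliver
```

With a warn threshold of 0.01 and a fail threshold of 0.05 on an error-rate index, a value of 0.02 grades the batch as suspect and 0.10 as erroneous.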
The data quality analysis system for a data warehouse according to another aspect of the invention comprises: an interface-quality analysis module, which analyzes the process of loading data at the interfaces of the data warehouse to obtain interface-quality analysis indexes; a data-quality analysis module, which analyzes the extract-transform-load process applied to the data sets corresponding to the data in the data warehouse to obtain data-quality analysis indexes; and a grading module, which grades the data quality according to the interface-quality analysis indexes and the data-quality analysis indexes to obtain a grading result.
Here, the interface-quality analysis indexes comprise one or more of the following: an interface timeliness rate, representing whether the data arrive at the interface on time during loading; an interface completeness rate, representing whether the data conform to the predetermined interface specification during loading; and an interface validity rate, representing whether the data conform to the business logic during loading.
Here, the data-quality analysis indexes comprise: a contemporaneous error rate, representing the error between the several metric values that correspond to several data sets in the same period of the ETL process; a contemporaneous fluctuation rate, representing the ratio between the several metric values that correspond to several data sets in the same period of the ETL process; a historical fluctuation rate, representing the change of the several metric values corresponding to several data sets across different periods of the ETL process; a standard error rate, representing the error between the several metric values and their respective predetermined values in the ETL process; a logical error rate, representing whether the several metric values in the ETL process satisfy the business-logic relations; a record reject rate, representing the extent to which a data set is refused loading in the ETL process; and a record loss rate, representing the extent to which a data set is dropped in the ETL process.
Here, the grading result is data accurate, data suspect, or data erroneous.
The grading module determines whether the grading result is data accurate, data suspect, or data erroneous according to whether the data-quality analysis indexes meet their predetermined values.
Through the present invention, data quality can be guaranteed effectively, essentially without disturbing the normal operation of the production systems and the data warehouse, while the validity of the measures is balanced against their impact on system execution efficiency.
Description of drawings
The accompanying drawings described herein provide a further understanding of the invention and constitute a part of this application. The illustrative embodiments of the invention and their descriptions explain the invention and do not limit it improperly. In the drawings:
Fig. 1 is a flow chart of the data quality analysis method according to an embodiment of the invention; and
Fig. 2 is a block diagram of the data quality analysis system according to an embodiment of the invention.
Embodiment
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of the data quality analysis method according to the embodiment of the invention. As shown in Fig. 1, the data quality analysis method for a data warehouse according to the embodiment of the invention comprises the following steps:
Step S102: analyze the process of loading data at the interfaces of the data warehouse to obtain interface-quality analysis indexes;
Step S104: analyze the extract-transform-load process applied to the data sets corresponding to the data in the data warehouse to obtain data-quality analysis indexes; and
Step S106: grade the data quality according to the interface-quality analysis indexes and the data-quality analysis indexes to obtain a grading result.
Here, the interface-quality analysis indexes comprise: an interface timeliness rate, representing whether the data arrive at the interface on time during loading; an interface completeness rate, representing whether the data conform to the predetermined interface specification during loading; and an interface validity rate, representing whether the data conform to the business logic during loading.
Here, the data-quality analysis indexes comprise one or more of the following: a contemporaneous error rate, representing the error between the several metric values that correspond to several data sets in the same period of the ETL process; a contemporaneous fluctuation rate, representing the ratio between the several metric values that correspond to several data sets in the same period of the ETL process; a historical fluctuation rate, representing the change of the several metric values corresponding to several data sets across different periods of the ETL process; a standard error rate, representing the error between the several metric values and their respective predetermined values in the ETL process; a logical error rate, representing whether the several metric values in the ETL process satisfy the business-logic relations; a record reject rate, representing the extent to which a data set is refused loading in the ETL process; and a record loss rate, representing the extent to which a data set is dropped in the ETL process.
Here, the grading result is data accurate, data suspect, or data erroneous.
In step S106, whether the data-quality analysis indexes meet their predetermined values determines whether the grading result is data accurate, data suspect, or data erroneous.
Fig. 2 is a block diagram of the data quality analysis system according to the embodiment of the invention. As shown in Fig. 2, the data quality analysis system for a data warehouse according to the embodiment of the invention comprises: an interface-quality analysis module 202, which analyzes the process of loading data at the interfaces of the data warehouse to obtain interface-quality analysis indexes; a data-quality analysis module 204, which analyzes the extract-transform-load process applied to the data sets corresponding to the data in the data warehouse to obtain data-quality analysis indexes; and a grading module 206, which grades the data quality according to the interface-quality analysis indexes and the data-quality analysis indexes to obtain a grading result.
Here, the interface-quality analysis indexes comprise one or more of the following: an interface timeliness rate, representing whether the data arrive at the interface on time during loading; an interface completeness rate, representing whether the data conform to the predetermined interface specification during loading; and an interface validity rate, representing whether the data conform to the business logic during loading.
Here, the data-quality analysis indexes comprise: a contemporaneous error rate, representing the error between the several metric values that correspond to several data sets in the same period of the ETL process; a contemporaneous fluctuation rate, representing the ratio between the several metric values that correspond to several data sets in the same period of the ETL process; a historical fluctuation rate, representing the change of the several metric values corresponding to several data sets across different periods of the ETL process; a standard error rate, representing the error between the several metric values and their respective predetermined values in the ETL process; a logical error rate, representing whether the several metric values in the ETL process satisfy the business-logic relations; a record reject rate, representing the extent to which a data set is refused loading in the ETL process; and a record loss rate, representing the extent to which a data set is dropped in the ETL process.
Here, the grading result is data accurate, data suspect, or data erroneous.
The grading module 206 determines whether the grading result is data accurate, data suspect, or data erroneous according to whether the data-quality analysis indexes meet their predetermined values.
Data quality assurance is a long-term, repeated job in which three links promote one another: quality verification, quality correction, and quality monitoring.
To guarantee the normal operation of the data quality assurance system, it is suggested that a data quality analysis organization be established within the data warehouse project organizational structure. It mainly consists of three kinds of modules: a data quality assurance module, responsible for the overall data quality of the data warehouse system; an interface data quality assurance module, responsible for the completeness and legality of the interfaces; and an ETL data quality assurance module, responsible for the quality of the loaded data and of the loading process in the data warehouse system.
Quality verification
Quality verification exposes the problems existing in the data quality of the data warehouse. The whole process is divided into three steps: data profiling, then data auditing, then data grading.
In the first step, data profiling generates the interface check report, which reflects interface quality problems; it is submitted to the interface data quality assurance staff, and the quality correction flow is entered. In the second step, data auditing generates the data audit report, which reflects the quality problems in the data warehouse ETL process; it is submitted to the data warehouse data quality assurance staff and the ETL data quality assurance staff, and the quality correction flow is entered. In the third step, quality grading reflects the aggregate quality of the data processed by the current ETL run and is submitted to the data quality assurance staff.
This process produces a series of quality reports, comprising:
1. The "interface quality check report"
This report reflects the quality problems existing in the data source interfaces after each ETL run. Interface quality is expressed by the following indexes: the interface timeliness rate, measuring whether interface files arrive on time; the interface completeness rate, measuring whether interface files fully conform to the predefined interface specification; the interface validity rate, measuring whether interface data records satisfy the business logic; and the record reject rate, measuring how many records are rejected under controlled conditions during each interface load.
2. The "data quality check report"
This report reflects the quality problems of the data processing links inside the data warehouse after each ETL run, expressed by the following indexes: the record loss rate, measuring how many records each ETL load drops in an uncontrolled way; the contemporaneous error rate, measuring, across the whole ETL flow, the error between identical metric values on different data sets (their values should remain strictly identical); the contemporaneous fluctuation rate, measuring, across the whole ETL flow, the fluctuation of the ratio between identical metric values on different data sets (their values should keep a fixed ratio between data sets); the historical fluctuation rate, measuring the fluctuation of the same metric value across different measurement periods in the historical ETL flow; the standard error rate, measuring, across the whole ETL flow, the error between a metric and its corresponding standard quantity; and the logical error rate, measuring, across the whole ETL flow, whether the logical operation relations among the metrics are satisfied.
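The two record-level indexes, the record reject rate and the record loss rate, follow directly from row counts at each stage of a load. A sketch under the assumption that extracted, loaded, and rejected counts are available per ETL run:

```python
# Illustrative record-level rates; the counter names are assumptions.

def load_rates(extracted, loaded, rejected):
    """Rates for one ETL load, given row counts at each stage."""
    lost = extracted - loaded - rejected  # dropped without being rejected
    return {
        "reject_rate": rejected / extracted,  # controlled refusals
        "loss_rate": lost / extracted,        # uncontrolled drops
    }
```

For a load that extracts 1000 rows, loads 980, and rejects 15 under controlled rules, the reject rate is 1.5 percent and the remaining 5 rows count as uncontrolled loss.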
3. The "data quality grading report"
This report reflects, after each ETL run, how acceptable the loaded batch of data is as it moves from the interfaces to the ODS, to the DW, and to final presentation or upload. The result falls into three grades:
Grade 1: accurate, deliverable;
Grade 2: suspect, of unclear origin; and
Grade 3: erroneous, to be reloaded.
Quality correction
The quality verification flow exposes the data quality problems existing in the data warehouse; the quality correction flow then improves on these problems. According to whether the cause of a quality problem is external, internal, or a difference of understanding, it comprises:
Interface quality correction flow
When the "interface quality check report" reflects that an interface is untimely, incomplete, or invalid, the interface data quality assurance staff must coordinate with the relevant data source provider to investigate the cause, revise the interface program or supply the interface data again, and fill in an "interface quality feedback report" for the data warehouse data quality assurance staff.
Internal quality correction flow
When the "data quality check report" reflects that the contemporaneous error rate, contemporaneous fluctuation rate, historical fluctuation rate, or logical error rate in the data warehouse ETL process exceeds a certain threshold, the internal quality correction flow is entered. The data quality assurance staff must coordinate with the relevant modules to confirm the cause: a rule deviation, an implementation deviation, or rejected or dropped records. For a rule deviation, the change correction flow is entered; for an implementation deviation, the ETL module revises the program and the data are loaded again; for rejected or dropped records, the erroneous-data correction flow is entered. Finally the data quality assurance staff completes the "data quality correction report".
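The dispatch described here, with each confirmed cause routed to its own flow, can be sketched as a simple routing table. The cause labels come from the text; the mapping to description strings is illustrative:

```python
# Illustrative routing of a confirmed cause to the correction flow the
# description names. Labels and strings are for illustration only.

def route_correction(cause):
    """Map a confirmed cause to the correction flow to enter."""
    flows = {
        "rule_deviation": "change-correction flow",
        "implementation_deviation": "fix ETL program and reload",
        "rejected_or_dropped": "error-data correction flow",
    }
    return flows.get(cause, "confirm cause with responsible module")
```

An unrecognized cause falls through to further investigation, matching the requirement that the data quality assurance staff first confirm the cause with the relevant modules.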
Erroneous-data correction sub-flow
This flow is entered when records are rejected or dropped. The responsible ETL module investigates the cause of the error, fills in an "erroneous-data correction report", and delivers it to the data quality assurance staff.
Change correction sub-flow
Three types of causes lead into this flow:
1. Abnormal data inside the data warehouse: a data quality problem occurs in the warehouse and can be confirmed to be due to a processing mistake inside the warehouse. This flow standardizes the handling of data anomaly events from both the management and the technical side, so that any data anomaly in the whole flow is handled correctly and promptly and similar quality problems do not recur.
2. To respond to changes in the market environment, the data warehouse adds functional modules, upgrades versions, and so on, which change the warehouse's processing flows.
3. The business systems are also in continuous development and change, and their changes may also trigger this flow. A change to a business system must produce a business system change notice; the notice must be timely and its content sufficient.
The data quality assurance staff fills in a "system change report", whose essential elements include the business system initiating the change, the change time, the change reason, the data structures affected by the change, the data interface files affected by the change, the change content (newly added data fields, newly added business codes, changed business codes, and so on), and a detailed description of the business change.
Statistical standard difference control flow
Statistical standards are of great importance to a data warehouse, and differences in statistical standards between systems can greatly affect the implementation of the data warehouse system.
This flow is entered when a statistical standard difference problem occurs, mainly in the following situations:
1. A statistical result for some indicator in a data warehouse report differs from the corresponding result in a business system report. Through self-inspection the data warehouse considers its own calculation error-free, and coordination through the data quality system determines that the data interface files the business system provides to the warehouse are also free of problems. The data warehouse then suspects a statistical standard difference and submits a statistical standard difference problem report to the data quality system.
2. The users of the data warehouse and the unit constructing or maintaining it diverge in their understanding of a statistical standard; either party may submit a statistical standard difference problem report to the data quality assurance staff.
3. The data quality assurance staff accepts a statistical standard difference problem report and judges whether the statistical standard has already been determined, that is, clearly stipulated in an explicit document within the enterprise. If not, coordination through the data quality system is needed. The data quality system includes representatives of all departments; the data quality assurance staff asks the relevant system or user representatives to provide their department's definition of the statistical standard, to explain the calculation means and formula, and to determine which departments pay attention to and use the standard. The statistical standard should follow the definition and explanation of the department that uses it. If several departments give different definitions of the same statistical standard and have all used it, the data quality assurance staff should apply for higher-level coordination.
The data quality assurance staff submits a "statistical standard difference report", whose essential elements include the reporting department and time, the reporter, the name of the statistical standard, the computing formula describing the standard, the difference as the reporter understands it, the departments paying attention to the standard, and the department defining it.
Quality monitoring
Quality monitoring is the monitoring of the execution of the above flows and the assessment of overall quality, with the aim of improving data quality and optimizing the ETL process. It is mainly embodied in the following reports:
1. "Workflow management report"
Reflects the execution status of the quality correction flows, including the number and cycle of executions of each flow and the change in the quality indexes after corrections.
2. "Monthly (quarterly) interface quality report"
Summarizes the interface quality indexes of each ETL flow by month (quarter) and shows how these indexes changed relative to the previous period.
3. "Monthly (quarterly) data quality report"
Summarizes the data quality indexes of each ETL flow by month (quarter) and shows how these indexes changed relative to the previous period.
4. "Monthly (quarterly) system change report"
Summarizes by month (quarter) the system changes caused by various reasons, recording the number and cycle of changes and the business modules changed.
Data quality assurance divides mainly into three techniques. The first, exposing quality problems at the data source interface, is called data profiling; the second, checking the ETL processing rules, the ETL process, and the data in the data warehouse, is called data auditing. Finally, based on these two techniques, a set of quality indexes is established to improve quality and optimize the process from a management viewpoint.
Data profiling
Data profiling is the quality assurance means for interface files: it checks whether an interface file conforms to the predefined specification and whether it contains invalid records, generating three quality indexes: interface timeliness rate, interface completeness rate, and interface validity rate.
According to the predefined interface specification (the transmission time window, the transmission directory, the interface field list, the field types, the separators, and so on) and the header record of the interface data file, the following verifications are performed:
Existence check: whether the interface file is present in the specified directory within the specified time window;
Integrity check: whether the data content of the interface file matches the row count, the interface identifier, and other items stipulated in the header record, and whether, according to the predefined interface specification, the file has the expected field list and uses the specified separators and quotation marks;
Validity check: each data record of the interface file is checked against the following rules, and a record that fails any rule is invalid: whether a null value appears in a non-null field; whether a foreign key reference is missing; whether primary key uniqueness is violated; whether a field type mismatches the specification; whether a field value violates business logic; and whether a logical relation among field values is false.
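As an illustration, the validity check just described can be sketched as a per-record rule function; the field names, rules, and reference sets below are hypothetical assumptions, not prescribed by the patent:

```python
# Sketch of the per-record validity check: a record is invalid if any rule fails.
# Field names and business rules below are illustrative assumptions.

def is_valid_record(record, known_customer_ids, seen_primary_keys):
    # Non-null field must not contain a null value.
    if record.get("account_id") is None:
        return False
    # Foreign key must reference an existing customer (no missing reference).
    if record.get("customer_id") not in known_customer_ids:
        return False
    # Primary key uniqueness must not be violated.
    if record["account_id"] in seen_primary_keys:
        return False
    seen_primary_keys.add(record["account_id"])
    # Field type must match the specification (balance must be numeric).
    if not isinstance(record.get("balance"), (int, float)):
        return False
    # Field value must satisfy business logic (no negative balance here).
    if record["balance"] < 0:
        return False
    # Logical relation among field values must hold (debit - credit == balance).
    if record.get("debit", 0) - record.get("credit", 0) != record["balance"]:
        return False
    return True
```

A record failing any single rule counts as invalid, which is what feeds the interface validity rate defined below.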
From the above three kinds of checks, the following three interface quality indexes are obtained.
1. Interface timeliness rate: the percentage of files delivered to the data warehouse within the file delivery time window stipulated in the interface specification. Formula: (number of files transmitted within the specified window / number of files that should be transmitted each day) × 100%.
2. Interface completeness rate: the percentage of interface files in each ETL flow that fully conform to the predefined specification. Formula: (number of correct interface files / total number of interface files loaded) × 100%.
3. Interface validity rate: the percentage of valid records among the records of an interface file. Formula: (number of valid records / total number of records) × 100%.
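The three interface quality index formulas translate directly into a small function; the argument names are assumptions about what the loading statistics would provide:

```python
# Computes the three interface quality indexes from daily load counts.
# Argument names are illustrative assumptions about the available statistics.

def interface_quality_indexes(files_on_time, files_expected,
                              files_correct, files_loaded,
                              records_valid, records_total):
    timeliness = files_on_time / files_expected * 100    # files delivered within the window
    completeness = files_correct / files_loaded * 100    # files fully matching the spec
    validity = records_valid / records_total * 100       # valid records in the files
    return timeliness, completeness, validity
```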
Data auditing
Data auditing is the quality assurance means for the data ETL process. By analyzing the loading logs and by computing and cross-checking measure statistics at each ETL stage, it realizes seven quality indexes: contemporaneous error rate, contemporaneous fluctuation rate, historical fluctuation rate, standard error rate, logical error rate, record loss rate, and record rejection rate.
Cross-checking regards the whole ETL flow as a composition of ETL stages, an ETL stage being the process from one data set to another. An ETL flow therefore contains several data sets, which may be files or data tables; for example, the interface files, the ODS user table, and the indicator base are all data sets in the ETL flow. From some of these data sets, measures with a common business meaning can be computed: for example, the "number of on-network users" can be computed from the ODS user table and also from the DW user summary table; since the business meaning of these two measures is the same, their values should also be consistent.
Therefore, several measures are defined for each data set, measures with a common business meaning are identified by a unified name, and data quality indexes are generated by cross-checking these measure values.
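A minimal sketch of the unified-name idea, with in-memory dictionaries standing in for data sets; the data set and measure names (`ods_users`, `dw_user_summary`, `on_network_users`) are hypothetical:

```python
# Measures with the same business meaning carry the same unified name,
# so values computed from different data sets can be cross-checked.

measure_values = {}  # (dataset, unified_measure_name) -> value

def register_measure(dataset, name, value):
    measure_values[(dataset, name)] = value

def cross_check(name, dataset_a, dataset_b):
    # True when the two data sets agree on the unified measure.
    return measure_values[(dataset_a, name)] == measure_values[(dataset_b, name)]

register_measure("ods_users", "on_network_users", 12345)
register_measure("dw_user_summary", "on_network_users", 12345)
```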
In addition, to improve accuracy, cross-checking is divided into four modes: total checks, component checks, standard-quantity checks, and logic checks.
A total check computes measure statistics over a data set without grouping; each data set yields one measure value.
A component check computes measure statistics over a data set with grouping; each data set yields several measure values.
A standard-quantity check defines a standard quantity, whose value comes from report statistics (manually input) or from account tables, and checks the corresponding measure value against this standard quantity.
A logic check performs logical operations on a group of measure values and judges whether they satisfy a predefined operation relation.
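The four check modes can be sketched over plain record lists; the field names and the sample relation in the logic check are illustrative assumptions:

```python
# Sketch of the four cross-check modes over a data set of records.
# Field names are hypothetical; each mode yields measure values to compare.

def total_check(records, field):
    # Total check: one measure value per data set, no grouping.
    return sum(r[field] for r in records)

def component_check(records, field, group_field):
    # Component check: one measure value per group.
    totals = {}
    for r in records:
        totals[r[group_field]] = totals.get(r[group_field], 0) + r[field]
    return totals

def standard_check(measure_value, standard_value, tolerance=0.0):
    # Standard-quantity check: compare a measure against an authoritative value.
    return abs(measure_value - standard_value) <= tolerance

def logic_check(component_values, total_value):
    # Logic check: a predefined relation, here "sum of parts equals whole".
    return sum(component_values) == total_value
```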
The results of the above checks generate the following indexes:
1. Contemporaneous error rate: in an ETL flow, the error percentage between identical measure values (the measure values of different data sets under the same business theme, which should remain identical) in different data sets for the same measurement period. Formula: ((measure value of data set b − measure value of data set a) / measure value of data set a) × 100%, where the data flows from a to b.
2. Contemporaneous fluctuation rate: in an ETL flow, the ratio between identical measure values (the measure values of different data sets under the same business theme) in different data sets for the same measurement period. Formula: (measure value of data set b / measure value of data set a) × 100%.
For some measures the ratio between the data-set measure values follows a fixed proportion; if the fluctuation is obvious, the data can be considered suspect.
3. Historical fluctuation rate: in an ETL flow over historical data, the rate of change across different measurement periods of the value of the same measure in the same data set. Formula: ((current-period measure value − previous-period measure value) / previous-period measure value) × 100%.
A large historical fluctuation rate indicates that the quality of this data set is suspect. The cause may be a problem in the ETL processing, or the data itself may already have had quality problems before entering ETL; whatever the reason, the batch of data may be unacceptable. To locate the cause further, compare against the historical fluctuation rate of the same measure before ETL: if the two are basically identical, the ETL processing can be ruled out and the problem located in the data before it entered ETL; if the difference is large and the pre-ETL fluctuation rate is within the tolerance interval, the problem can be located in the ETL processing.
4. Standard error rate: measures, over the whole ETL flow, the error percentage between a measure and its corresponding standard quantity. Formula: ((measure value − standard quantity) / standard quantity) × 100%.
The standard quantity defined here is an authoritative statistical value recognized by all parties. It is itself a measure, and its value is determined by the measure name and the measurement period. A standard quantity is produced either by statistics over the data or by manual input.
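The four cross-check index formulas above translate directly into code; the inputs are measure values of the same unified measure, and the example values in the usage are illustrative:

```python
# The four cross-check indexes as formulas; inputs are measure values
# for the same unified measure (argument names are assumptions).

def contemporaneous_error_rate(value_a, value_b):
    # Data flows from data set a to data set b.
    return (value_b - value_a) / value_a * 100

def contemporaneous_fluctuation_rate(value_a, value_b):
    return value_b / value_a * 100

def historical_fluctuation_rate(previous, current):
    return (current - previous) / previous * 100

def standard_error_rate(measure, standard):
    return (measure - standard) / standard * 100
```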
5. Logical error rate: in the whole ETL flow, the percentage of logical operation relations among measures that are not satisfied. Formula: (number of unsatisfied logical operation checks / total number of logical operation checks) × 100%.
This index reflects the business logic the data must follow. If it is violated, the cause may lie in the ETL processing or in the data source interface, and further localization requires combining it with indexes such as the contemporaneous error rate and the fluctuation rates.
Another function of data auditing is to analyze the ETL log to obtain the numbers of records dropped or rejected while loading a data set. Dropping is a result outside the control of the ETL program, whereas rejection is behavior within the control range; a rejected record may or may not end up loaded. This check yields two quality indexes:
6. Record rejection rate: in each interface load, the percentage of records rejected under controlled conditions out of the total. Formula: (number of rejected records / total number of records) × 100%.
7. Record loss rate: in each ETL load, the percentage of records dropped outside control out of the total. Formula: (number of dropped records / total number of records) × 100%.
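These two log-derived indexes can be sketched from summary counts; the idea of a pre-aggregated log summary is an assumption, since the patent does not fix a log format:

```python
# Record rejection and loss rates derived from ETL log counts.
# The summarized counts are assumed to come from prior log analysis.

def rejection_and_loss_rates(total_records, rejected_records, dropped_records):
    rejection_rate = rejected_records / total_records * 100  # controlled rejections
    loss_rate = dropped_records / total_records * 100        # uncontrolled drops
    return rejection_rate, loss_rate
```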
Data auditing examples
1. Contemporaneous error rate
Application examples:
Classified totals of account balances (classified by the account's owning institution or by product category), classified totals of account counts (likewise), daily ledger balances (classified by account), monthly ledger balances (classified by account), yearly ledger balances (classified by account), and so on.
For measures of this monetary class, strict consistency before and after ETL is required. As soon as the values are unequal, that is, the contemporaneous error rate is greater than 0, the quality of the checked batch is considered suspect and the quality correction flow must be entered. The size of the error rate can be used to weigh the severity of the data quality problem: the larger the error rate, the more problematic records the data contains. If the error rate is very low, for example below 0.01%, the cause may be noise introduced by a hardware or software system fault during the ETL run rather than a problem in the ETL business rules themselves; log analysis can then help determine whether ETL is at fault.
Other examples: classified totals of product status, asset status, account status, contract status, customer gender, customer education level, customer industry type, and so on. Contemporaneous errors in such measures often arise because conversions such as standard-code conversion, duplicate removal, or null-value replacement change the classified totals. Changes within a certain range can be regarded as the intended effect of the processing rules, but if the range of change is too large there is reason to consider the batch suspect, and the quality correction flow must be entered.
2. Contemporaneous fluctuation rate
Application example: when auditing customer data, the total number of customers can be used as a measure (the customer data can also be grouped by gender and totaled as measures). By the contemporaneous fluctuation check and business experience, the total number of customers should fluctuate only within a certain range. If the contemporaneous fluctuation rate is large, the customer data after ETL processing is suspect, and the cause can be located in the ETL processing rules.
3. Historical fluctuation rate
Application example: with the total number of customers as a measure, each month's total should not differ much from the previous month's total; a large fluctuation rate may indicate suspect data quality.
4. Standard error rate
Application example: a return-on-investment target value.
5. Logical error rate
Application examples:
Total-versus-detail check: the total of each subject in the general ledger should equal the aggregated value of that subject in the detailed ledger;
Debit-credit balance: the debit amount in the ledger should equal the credit amount;
Historical check: the monthly ledger should equal the aggregate of the daily ledgers, and the quarterly ledger should equal the aggregate of the monthly ledgers;
Classification check: a parent-class aggregate should equal the aggregate of its subclasses; for example, in the product table the product count of a parent product type should equal the aggregate of the subclass product counts, and an institution's total account balance should equal the aggregate of the account balances of its subordinate branches.
The above examples demand correctness from the data, allowing no error: the values on the two sides of the logical relation should be exactly equal. If an error rate greater than 0 appears during data auditing, the data quality has a problem and the quality correction flow must be entered. The size of the error rate in the check results weighs the severity of the problem: the larger the error rate, the more problematic records the data contains.
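The ledger checks above, together with the logical error rate they feed, can be sketched as follows; the account structures are illustrative assumptions rather than the patent's data model:

```python
# Sketch of the ledger logic checks; account structures are illustrative.

def total_vs_detail(general_ledger_total, detail_amounts):
    # Subject total in the general ledger should equal the detailed aggregate.
    return general_ledger_total == sum(detail_amounts)

def debit_credit_balanced(debit_total, credit_total):
    return debit_total == credit_total

def parent_vs_children(parent_total, child_totals):
    # Classification check: parent aggregate equals the sum of its subclasses.
    return parent_total == sum(child_totals)

def logical_error_rate(check_results):
    # Percentage of failed logic checks among all logic checks performed.
    failed = sum(1 for passed in check_results if not passed)
    return failed / len(check_results) * 100
```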
Quality grading
After an ETL run completes, whether the data is accurate cannot be expressed by a single index; a comprehensive judgment is needed. Quality grading therefore ranks this "accuracy", dividing the finally loaded data into three accuracy grades. The first grade is accurate: the scale of the errors is acceptable, the cause need not be investigated, and the data can be delivered to the front end for presentation and published in reports. The second grade is suspect: the data can be delivered to the front end, but the data quality cause must still be investigated and explained or corrected. The third grade is erroneous: the data cannot be delivered to the front end; the cause must be found, the data corrected, and the load repeated.
Quality grading relies on a comprehensive combination of the quality indexes, mentioned above, produced by data profiling and data auditing; a grading standard is given below. The standard plays a reference role, and its parameters can be fine-tuned according to each province's data situation.
Accuracy grade
When the quality indexes simultaneously satisfy the following conditions, the data loaded into the data warehouse is judged accurate:
Interface completeness rate = 100%;
Average interface validity rate > 99.99%, and no interface validity rate < 99%;
Average record loss rate < 0.01%, and no record loss rate > 0.1%;
Average record rejection rate < 2%, and no record rejection rate > 5%;
Average standard error rate < 1%, and no standard error rate > 5%;
Average contemporaneous error rate < 0.01%, and no contemporaneous error rate > 1%; and
Logical error rate < 1%.
Suspect grade
When the quality indexes simultaneously satisfy the following conditions but do not satisfy the accuracy-grade conditions, the loaded data is judged suspect:
Interface completeness rate > 90%;
Average interface validity rate > 99%, and no interface validity rate < 95%;
Average record loss rate < 0.1%, and no record loss rate > 1%;
Average record rejection rate < 5%, and no record rejection rate > 10%;
Average standard error rate < 1%, and no standard error rate > 10%;
Average contemporaneous error rate < 1%, and no contemporaneous error rate > 5%;
Logical error rate < 5%; and
Average historical fluctuation rate < 5%, and no historical fluctuation rate > 100%.
Error grade
When the quality indexes fail to fully satisfy the suspect-grade conditions, that is, when any of the following conditions holds, the loaded data is judged erroneous:
Interface completeness rate ≤ 90%;
Some interface validity rate ≤ 95%, or the average interface validity rate ≤ 99%;
Some record loss rate ≥ 1%, or the average record loss rate ≥ 0.1%;
Some record rejection rate ≥ 10%, or the average record rejection rate ≥ 5%;
Some standard error rate ≥ 10%, or the average standard error rate ≥ 1%;
Some contemporaneous error rate ≥ 5%, or the average contemporaneous error rate ≥ 1%;
Logical error rate ≥ 5%; or
Some historical fluctuation rate ≥ 100%, or the average historical fluctuation rate ≥ 5%.
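The grading rules can be sketched as a function over per-run index values; to keep the sketch short only the interface completeness, interface validity, record loss, and logical error thresholds are encoded, and the remaining conditions would follow the same pattern:

```python
# Sketch of the three-level grading using a subset of the thresholds above.
# Inputs are percentages; validity_rates and loss_rates are per-run lists.

def average(values):
    return sum(values) / len(values)

def grade(completeness, validity_rates, loss_rates, logical_error):
    accurate = (completeness == 100
                and average(validity_rates) > 99.99 and min(validity_rates) >= 99
                and average(loss_rates) < 0.01 and max(loss_rates) <= 0.1
                and logical_error < 1)
    suspect = (completeness > 90
               and average(validity_rates) > 99 and min(validity_rates) >= 95
               and average(loss_rates) < 0.1 and max(loss_rates) <= 1
               and logical_error < 5)
    if accurate:
        return "accurate"
    if suspect:
        return "suspect"
    return "erroneous"
```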
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various changes and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A data quality analysis method for a data warehouse, characterized by comprising the following steps:
step 1: analyzing the process of loading data at an interface of the data warehouse to obtain interface quality analysis indexes;
step 2: analyzing the process of extracting, transforming, and loading the data sets corresponding to the data in the data warehouse to obtain data quality analysis indexes; and
step 3: grading the quality of the data according to the interface quality analysis indexes and the data quality analysis indexes to obtain a grading result.
2. The data quality analysis method according to claim 1, characterized in that the interface quality analysis indexes comprise:
an interface timeliness rate representing the situation of the data arriving at the interface on time in the process of loading the data;
an interface completeness rate representing the situation of the data conforming to a predetermined interface specification in the process of loading the data; and
an interface validity rate representing the situation of the data conforming to business logic in the process of loading the data.
3. The data quality analysis method according to claim 2, characterized in that the data quality analysis indexes comprise one or more of the following indexes:
a contemporaneous error rate representing the error between a plurality of measure values respectively corresponding to a plurality of the data sets at the same time in the extraction-transformation-loading process;
a contemporaneous fluctuation rate representing the ratio between the plurality of measure values respectively corresponding to a plurality of the data sets at the same time in the extraction-transformation-loading process;
a historical fluctuation rate representing the change across different periods of the plurality of measure values respectively corresponding to a plurality of the data sets in the extraction-transformation-loading process;
a standard error rate representing the errors between the plurality of measure values and their respective predetermined values in the extraction-transformation-loading process;
a logical error rate representing the situation of the plurality of measure values conforming to business logic relations in the extraction-transformation-loading process;
a record rejection rate representing the situation of records of the data set being rejected from loading in the extraction-transformation-loading process; and
a record loss rate representing the situation of records of the data set being dropped in the extraction-transformation-loading process.
4. The data quality analysis method according to claim 3, characterized in that the grading result is data accurate, data suspect, or data erroneous.
5. The data quality analysis method according to claim 4, characterized in that, in step 3, the grading result is determined to be data accurate, data suspect, or data erroneous according to whether the data quality analysis indexes meet predetermined values.
6. A data quality analysis system for a data warehouse, characterized by comprising:
an interface quality analysis module for analyzing the process of loading data at an interface of the data warehouse to obtain interface quality analysis indexes;
a data quality analysis module for analyzing the process of extracting, transforming, and loading the data sets corresponding to the data in the data warehouse to obtain data quality analysis indexes; and
a grading module for grading the quality of the data according to the interface quality analysis indexes and the data quality analysis indexes to obtain a grading result.
7. The data quality analysis system according to claim 6, characterized in that the interface quality analysis indexes comprise one or more of the following indexes:
an interface timeliness rate representing the situation of the data arriving at the interface on time in the process of loading the data;
an interface completeness rate representing the situation of the data conforming to a predetermined interface specification in the process of loading the data; and
an interface validity rate representing the situation of the data conforming to business logic in the process of loading the data.
8. The data quality analysis system according to claim 7, characterized in that the data quality analysis indexes comprise:
a contemporaneous error rate representing the error between a plurality of measure values respectively corresponding to a plurality of the data sets at the same time in the extraction-transformation-loading process;
a contemporaneous fluctuation rate representing the ratio between the plurality of measure values respectively corresponding to a plurality of the data sets at the same time in the extraction-transformation-loading process;
a historical fluctuation rate representing the change across different periods of the plurality of measure values respectively corresponding to a plurality of the data sets in the extraction-transformation-loading process;
a standard error rate representing the errors between the plurality of measure values and their respective predetermined values in the extraction-transformation-loading process;
a logical error rate representing the situation of the plurality of measure values conforming to business logic relations in the extraction-transformation-loading process;
a record rejection rate representing the situation of records of the data set being rejected from loading in the extraction-transformation-loading process; and
a record loss rate representing the situation of records of the data set being dropped in the extraction-transformation-loading process.
9. The data quality analysis system according to claim 8, characterized in that the grading result is data accurate, data suspect, or data erroneous.
10. The data quality analysis system according to claim 9, characterized in that the grading module determines the grading result to be data accurate, data suspect, or data erroneous according to whether the data quality analysis indexes meet predetermined values.