CN112529677A - Automatic data quality evaluation method and readable storage medium - Google Patents

Automatic data quality evaluation method and readable storage medium

Info

Publication number
CN112529677A
Authority
CN
China
Prior art keywords
data
data quality
quality assessment
analysis
quality evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011531178.0A
Other languages
Chinese (zh)
Inventor
徐顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-22
Filing date
2020-12-22
Publication date
2021-03-19
Application filed by Sichuan XW Bank Co Ltd
Priority to CN202011531178.0A
Publication of CN112529677A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • General Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Technology Law (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an automated data quality evaluation method comprising the following steps. S1: a data sorting step, in which the sample quality labels extracted locally by the financial institution and the data returned by the data supplier are organized. S2: a data quality evaluation step, in which data descriptive analysis, coverage analysis, discriminative power analysis and correlation analysis are performed on each index provided by the data supplier, and a data quality evaluation report displaying each analysis result is generated. The disclosed method fully automates the data quality evaluation process and automatically generates the data quality evaluation report, which saves considerable time and labor cost; the stated time saving is at least 90%.

Description

Automatic data quality evaluation method and readable storage medium
Technical Field
The invention belongs to the field of credit risk in financial technology, and particularly relates to an automated data quality evaluation method and a readable storage medium.
Background
With the continuous development of big data technology, more and more data are applied in the credit risk field. Before applying a given type of data, a financial institution usually needs to evaluate extensively how well the data offered by each data supplier predicts credit risk, so that it can select the best third-party data source.
When faced with data from many data suppliers, the institution's staff must evaluate the data quality manually, one source at a time, checking the coverage rate, discriminative power and so on of each data source. This is laborious and time-consuming.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an automated data quality evaluation method and a readable storage medium. The method fully automates the data quality evaluation process, automatically generates a data quality evaluation report, and saves considerable time and labor cost.
The purpose of the invention is realized by the following technical scheme:
An automated data quality assessment method, comprising: S1: a data sorting step, in which the sample quality labels extracted locally by the financial institution and the data returned by the data supplier are organized; S2: a data quality evaluation step, in which data descriptive analysis, coverage analysis, discriminative power analysis and correlation analysis are performed on each index provided by the data supplier, and a data quality evaluation report displaying each analysis result is generated.
According to a preferred embodiment, the data descriptive analysis in the data quality evaluation step comprises calculating the value range, mean, median and distribution histogram of each index.
According to a preferred embodiment, the data descriptive analysis uses the calculated statistics to check whether the distribution of each index is abnormal and whether extreme values exist.
According to a preferred embodiment, the coverage analysis in the data quality evaluation step includes analyzing the coverage rate of each index across different customer groups and different time periods.
According to a preferred embodiment, the discriminative power analysis in the data quality evaluation step includes calculating the information value (IV) of each index across different customer groups and different time periods, and evaluating how well each field of each data source separates good customers from bad ones.
According to a preferred embodiment, the correlation analysis in the data quality evaluation step includes calculating the correlation among the indexes of the data source and their correlation with the institution's own data, and evaluating how much the data source adds to the institution's existing data.
According to a preferred embodiment, the data quality evaluation step further includes automatically establishing a LightGBM model, performing variable screening and model parameter adjustment, establishing an optimal model, calculating AUC and KS of the model in different customer groups and different time periods, and evaluating the effect of modeling by using the data source.
According to a preferred embodiment, Bayesian Optimization and/or Early Stopping is used in the LightGBM model for variable screening and model parameter adjustment.
According to a preferred embodiment, in step S2, an Rmarkdown tool is used, and two programming languages, i.e., R and Python, are combined to write an automated data testing code, so as to automatically implement the data quality evaluation step.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the aforementioned automated data quality assessment method.
The main scheme and the further selection schemes can be freely combined to form a plurality of schemes which are all adopted and claimed by the invention; in the invention, the selection (each non-conflict selection) and other selections can be freely combined. The skilled person in the art can understand that there are many combinations, which are all the technical solutions to be protected by the present invention, according to the prior art and the common general knowledge after understanding the scheme of the present invention, and the technical solutions are not exhaustive herein.
The beneficial effects of the invention are as follows: the disclosed automated data quality evaluation method fully automates the data quality evaluation process and automatically generates a data quality evaluation report, which saves considerable time and labor cost, with a stated time saving of at least 90%; and by using the Rmarkdown tool to combine the R and Python programming languages, the strengths of both languages can be applied flexibly.
Drawings
FIG. 1 is a flow chart of an automated data quality assessment method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that, in order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments.
Thus, the following detailed description of the embodiments of the present invention is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, it should be noted that, in the present invention, if the specific structures, connection relationships, position relationships, power source relationships, and the like are not written in particular, the structures, connection relationships, position relationships, power source relationships, and the like related to the present invention can be known by those skilled in the art without creative work on the basis of the prior art.
Example 1:
Referring to FIG. 1, the invention discloses an automated data quality assessment method for the credit risk field, comprising the following steps:
S1: a data sorting step, in which the sample quality labels extracted locally by the financial institution and the data returned by the data supplier are organized.
S2: a data quality evaluation step, in which data descriptive analysis, coverage analysis, discriminative power analysis and correlation analysis are performed on each index provided by the data supplier, and a data quality evaluation report displaying each analysis result is generated.
Preferably, the automated data quality assessment comprises the following:
1) Data descriptive analysis, which covers the value range, mean, median and distribution histogram of each index.
Preferably, the maximum, minimum and similar statistics of each index are calculated to check whether its distribution is abnormal and whether extreme values exist.
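The descriptive statistics named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `describe_index`, the equal-width binning and the default of five bins are hypothetical choices, not details given in the patent.

```python
import statistics


def describe_index(values, bins=5):
    """Summarise one index: range, mean, median and an equal-width histogram.

    Missing values are assumed to be encoded as None and are ignored.
    """
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in present:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "histogram": counts,
    }


summary = describe_index([1, 2, 3, 4, 100, None])
```

In this toy input the mean (22) sits far above the median (3) and the histogram is concentrated in the first bin, which is the kind of signal a reviewer would read as a possible extreme value.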
2) Coverage analysis, which covers the coverage rate of each index across different customer groups and different time periods.
For example, the check looks from multiple dimensions at how many of the samples provided by the institution the data source can match: if the institution provides 100,000 samples for testing and the data supplier returns data for 80,000 of them, the coverage rate is 80%.
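A minimal sketch of the coverage calculation, split by customer group and time period, might look as follows. The function name `coverage_by_segment`, the tuple layout of `samples` and the use of a set of returned sample IDs are assumptions made for illustration.

```python
from collections import defaultdict


def coverage_by_segment(samples, returned_ids):
    """Coverage rate per (customer_group, period).

    samples: iterable of (sample_id, customer_group, period) sent to the supplier.
    returned_ids: set of sample IDs for which the supplier returned data.
    """
    sent = defaultdict(int)
    matched = defaultdict(int)
    for sample_id, group, period in samples:
        key = (group, period)
        sent[key] += 1
        if sample_id in returned_ids:
            matched[key] += 1
    # Coverage is the share of submitted samples that were matched.
    return {key: matched[key] / sent[key] for key in sent}


samples = [
    (1, "new", "2020-Q1"), (2, "new", "2020-Q1"), (3, "new", "2020-Q1"),
    (4, "new", "2020-Q1"), (5, "new", "2020-Q1"), (6, "existing", "2020-Q2"),
]
rates = coverage_by_segment(samples, returned_ids={1, 2, 3, 4})
```

Here four of the five "new" customers in 2020-Q1 are matched, giving a coverage rate of 0.8 for that segment, while the unmatched "existing" segment has a rate of 0.0.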
3) Discriminative power analysis, which calculates the IV (information value) of each index across different customer groups and different time periods and evaluates how well each field of each data source separates good customers from bad ones.
For example, if the IV of an index is greater than 0.2 in every customer group, the index has good discriminative power.
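The patent does not spell out how IV is computed. The sketch below uses the common definition, summing (share of good − share of bad) × ln(share of good / share of bad) over the bins of an index. The function name, the label coding (1 = bad, 0 = good) and the `eps` substitute for empty cells are assumptions for illustration.

```python
import math


def information_value(bins, labels, eps=0.5):
    """Information value of a binned index.

    bins: bin identifier per sample; labels: 1 = bad customer, 0 = good customer.
    eps replaces a zero count so that the logarithm stays defined (an assumption).
    """
    good, bad = {}, {}
    for b, y in zip(bins, labels):
        target = bad if y == 1 else good
        target[b] = target.get(b, 0) + 1
    total_good = sum(good.values())
    total_bad = sum(bad.values())
    if total_good == 0 or total_bad == 0:
        raise ValueError("both good and bad samples are required")
    iv = 0.0
    for b in set(good) | set(bad):
        share_good = (good.get(b, 0) or eps) / total_good
        share_bad = (bad.get(b, 0) or eps) / total_bad
        iv += (share_good - share_bad) * math.log(share_good / share_bad)
    return iv


# Two bins: "low" holds 80 good / 20 bad, "high" holds 20 good / 80 bad.
bins = ["low"] * 100 + ["high"] * 100
labels = [0] * 80 + [1] * 20 + [0] * 20 + [1] * 80
iv = information_value(bins, labels)
```

Running the same function separately on each customer group and time period would give the per-segment IV values that the example threshold of 0.2 is applied to.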
4) A LightGBM model is established automatically; variable screening and model parameter adjustment are performed automatically with methods such as Bayesian Optimization and Early Stopping to obtain an optimal model; and the AUC (Area Under the Curve), KS (Kolmogorov-Smirnov) and similar metrics of the model are calculated across different customer groups and different time periods to evaluate the approximate effect of modeling with the data source.
For example, if the AUC is greater than 0.7 and the KS is greater than 0.3, the discriminative power is good.
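To make the two model metrics concrete, the following sketch computes AUC and KS directly from model scores and labels. It deliberately leaves out the LightGBM training, Bayesian Optimization and Early Stopping steps, which the patent names but does not detail. The function name and the conventions (higher score = riskier, label 1 = bad) are assumptions.

```python
def auc_and_ks(scores, labels):
    """Return (AUC, KS) for binary labels, where 1 = bad and 0 = good.

    Samples are walked from highest to lowest score; tied scores are
    processed together so that ties contribute half credit to the AUC.
    """
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both good and bad samples are required")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    cum_bad = cum_good = 0
    area = 0.0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        tie_bad = tie_good = 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tie_bad += pairs[j][1]
            tie_good += 1 - pairs[j][1]
            j += 1
        # Trapezoid under the ROC curve for this score step.
        area += tie_good * (cum_bad + tie_bad / 2)
        cum_bad += tie_bad
        cum_good += tie_good
        # KS is the largest gap between the two cumulative distributions.
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return area / (n_bad * n_good), ks


auc, ks = auc_and_ks([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

For this small input the function yields an AUC of 0.75 and a KS of 0.5. In an evaluation along the lines described above, the scores would come from the fitted model's predictions on each customer group and time period.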
5) Correlation analysis, which calculates the correlation among the indexes of the data source and their correlation with the institution's own data, and evaluates how much the data source adds to the institution's existing data.
For example, if the correlation coefficient between an index of the data source and the institution's existing data is smaller than 0.5, the correlation is weak and the information overlap is low, so the data source may be a good supplement to the institution's existing data.
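The patent does not say which correlation coefficient is used. As a sketch, the code below assumes the Pearson coefficient and compares its absolute value against the example threshold of 0.5; the function names and the dictionary layout of the in-house variables are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def is_good_supplement(new_index, existing, threshold=0.5):
    """True when the new index is only weakly correlated with every in-house variable.

    existing: dict mapping an in-house variable name to its list of values.
    Using the absolute value of the coefficient is an assumption.
    """
    strongest = max(abs(pearson(new_index, col)) for col in existing.values())
    return strongest < threshold


existing = {"income": [1, 2, 3, 4], "age": [4, 1, 3, 2]}
redundant = is_good_supplement([2, 4, 6, 8], existing)
```

The new index in this example is a multiple of the in-house `income` variable, so it is flagged as redundant rather than as a useful supplement.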
Based on the above analysis results, an assessment summary of the data source is automatically generated, giving a main conclusion.
For example, the conclusions may read as follows: 1. the overall distribution of the data source is normal, with no abnormal values; 2. the coverage rate of the data source is high, reaching 80%; 3. the data source has good discriminative power, with a univariate IV of 0.2 and an AUC of 0.8 for the model built on the data source; 4. the data source has low correlation with the institution's existing data and can be a good supplement to it.
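One way such a summary could be assembled from the computed metrics is sketched below. The function name `summarize`, the wording of each line and the default thresholds (taken from the example figures in this description) are illustrative assumptions; the patent does not specify how the summary text is produced.

```python
def summarize(coverage, iv, auc, max_corr,
              min_coverage=0.8, min_iv=0.2, min_auc=0.7, max_ok_corr=0.5):
    """Turn headline metrics into numbered conclusion lines.

    All thresholds are illustrative defaults, not values fixed by the method.
    """
    findings = [
        f"coverage is {'high' if coverage >= min_coverage else 'low'} at {coverage:.0%}",
        f"univariate IV is {iv:.2f}, "
        f"{'good' if iv >= min_iv else 'weak'} discriminative power",
        f"model AUC is {auc:.2f}, "
        f"{'acceptable' if auc >= min_auc else 'below target'}",
        f"correlation with existing data is "
        f"{'low' if max_corr < max_ok_corr else 'high'} ({max_corr:.2f})",
    ]
    return [f"{n}. {text}" for n, text in enumerate(findings, start=1)]


for line in summarize(coverage=0.8, iv=0.2, auc=0.8, max_corr=0.3):
    print(line)
```

Keeping the thresholds as parameters lets an institution tune what counts as "high" coverage or "good" discriminative power without changing the report logic.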
Therefore, the disclosed automated data quality evaluation method fully automates the data quality evaluation process and automatically generates a data quality evaluation report, which saves considerable time and labor cost, with a stated time saving of at least 90%; and by using the Rmarkdown tool to combine the R and Python programming languages, the strengths of both languages can be applied flexibly.
Example 2:
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the automated data quality assessment method described in Example 1 above.
The foregoing basic embodiments of the invention and their various further alternatives can be freely combined to form multiple embodiments, all of which are contemplated and claimed herein. In the scheme of the invention, each selection example can be combined with any other basic example and selection example at will. Numerous combinations will be known to those skilled in the art.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. An automated data quality assessment method, characterized in that the data quality assessment method comprises:
S1: a data sorting step, in which the sample quality labels extracted locally by the financial institution and the data returned by the data supplier are organized;
S2: a data quality evaluation step, in which data descriptive analysis, coverage analysis, discriminative power analysis and correlation analysis are performed on each index provided by the data supplier, and a data quality evaluation report displaying each analysis result is generated.
2. The automated data quality assessment method according to claim 1, wherein the data descriptive analysis in the data quality assessment step comprises calculating the value range, mean, median and distribution histogram of each index.
3. The automated data quality assessment method according to claim 2, wherein the data descriptive analysis obtains parameter indexes through calculation, and completes the check whether the distribution of each index is abnormal or not and whether extreme values exist or not.
4. The automated data quality assessment method according to claim 1, wherein the coverage analysis in the data quality assessment step comprises an analysis of the coverage of each index over different customer groups, different time periods.
5. The automated data quality assessment method according to claim 1, wherein the discriminative power analysis in the data quality assessment step comprises calculating IV of each index for different customer groups and different time periods, and assessing discriminative power of each field of each data source for good or bad customers.
6. The automated data quality assessment method according to claim 1, wherein the correlation analysis in the data quality assessment step includes calculating the correlation among the indexes of the data source and their correlation with the institution's own data, and assessing how much the data source adds to the institution's existing data.
7. The automated data quality assessment method of claim 1, wherein the data quality assessment step further comprises automatically building a LightGBM model, performing variable screening and model parameter adjustment, building an optimal model, calculating AUC and KS of the model in different customer groups and different time periods, and evaluating the effect of modeling by using the data source.
8. The automated data quality assessment method of claim 7, wherein in the LightGBM model, variable screening and model parameter adjustment are performed using Bayesian Optimization and/or Early Stopping methods.
9. The automated data quality assessment method according to claim 1, wherein in step S2, the data quality assessment step is automatically implemented by writing automated data testing codes using Rmarkdown tool in combination with two programming languages, R and Python.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202011531178.0A 2020-12-22 2020-12-22 Automatic data quality evaluation method and readable storage medium Withdrawn CN112529677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011531178.0A CN112529677A (en) 2020-12-22 2020-12-22 Automatic data quality evaluation method and readable storage medium

Publications (1)

Publication Number Publication Date
CN112529677A true CN112529677A (en) 2021-03-19

Family

ID=75002365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011531178.0A Withdrawn CN112529677A (en) 2020-12-22 2020-12-22 Automatic data quality evaluation method and readable storage medium

Country Status (1)

Country Link
CN (1) CN112529677A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130144749A1 (en) * 2011-12-05 2013-06-06 Martina Rothley Supplier rating and reporting
CN105976120A (en) * 2016-05-17 2016-09-28 全球能源互联网研究院 Electric power operation monitoring data quality assessment system and method
CN109961308A (en) * 2017-12-25 2019-07-02 北京京东尚科信息技术有限公司 The method and apparatus of assessment tag data
CN111339215A (en) * 2019-05-31 2020-06-26 北京东方融信达软件技术有限公司 Structured data set quality evaluation model generation method, evaluation method and device
CN111553550A (en) * 2019-12-10 2020-08-18 北京理工大学 Power big data quality assessment method aiming at user behavior analysis
CN111861734A (en) * 2020-07-31 2020-10-30 重庆富民银行股份有限公司 Test evaluation system and method for three-party data source
CN111967717A (en) * 2020-07-20 2020-11-20 格创东智(深圳)科技有限公司 Data quality evaluation method based on information entropy
CN112101447A (en) * 2020-09-10 2020-12-18 北京百度网讯科技有限公司 Data set quality evaluation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210319