CN111723086A - Data quality checking method, device and equipment and readable storage medium - Google Patents

Info

Publication number
CN111723086A
Authority
CN
China
Prior art keywords
data
checking
generate
result
script
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010696381.7A
Other languages
Chinese (zh)
Inventor
蒋晟
万文兵
王宗敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Suning Bank Co Ltd
Original Assignee
Jiangsu Suning Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Suning Bank Co Ltd
Priority to CN202010696381.7A
Publication of CN111723086A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance

Abstract

The invention provides a data quality checking method, apparatus, device and readable storage medium. The method comprises the following steps: 1) performing association modeling on multi-service system data according to a quality checking requirement to generate an association modeling result; 2) configuring data quality check rules according to the association modeling result to obtain a data configuration result; 3) importing the data configuration result into a rule parser to generate a checking script; 4) importing the checking script into a script executor to generate a checking detail table; 5) performing summary statistics on the checking detail table to generate a checking result report. Because the multi-service system data are temporarily associated and aggregated according to each checking requirement, the data to be checked are pre-screened and the data range is limited, which can greatly improve the accuracy and validity of the quality checking result and reduce usage and maintenance costs.

Description

Data quality checking method, device and equipment and readable storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a data quality checking method, a data quality checking device, data quality checking equipment and a readable storage medium.
Background
The informatization of China's banking industry has been under way for some twenty years. A relatively complete set of information systems is now in place, and a large volume of data has accumulated across a wide range of business types and financial products. In 2018, the China Banking and Insurance Regulatory Commission (CBIRC) issued the Guidelines on Data Governance for Banking Financial Institutions (Yin Bao Jian Fa [2018] No. 22), which aim to guide banking financial institutions to strengthen data governance, improve data quality, realize the value of data and improve operation and management capability. From the perspective of the data life cycle, data quality problems are most likely to arise in the following stages:
In the system construction stage, for example: information elements are not registered, definitions are ambiguous, content is duplicated, the data dictionary is incomplete or does not match the actual system, and standards are inconsistent within the system.
In the production operation stage, data is actually generated, and this is the stage where data quality problems arise most easily, for example: problems caused by tellers not following input norms, null values in fields that must not be empty, input that does not match reality, and entered information that is duplicated, inconsistent or incomplete.
In the data application stage, the quality of business system data is put to the test, and new data problems are also generated, for example: figures reported by different departments disagree because index names, business definitions and technical definitions are not standardized and consistent; index data quality checks lack rules and have poor tool support; and problems in report data are hard to locate and slow to resolve.
To solve data quality problems, a large bank can set up a full-time data governance team and purchase dedicated commercial data management and control tools to govern data across the whole bank, which usually requires dozens or even hundreds of people and funding on the order of hundreds of millions. Small and medium-sized banks, lacking such manpower and resources, tend to turn to their own system suppliers, who provide quality management tools with limited functions for specific regulatory fields. However, these tools are often not configurable enough: quality check rules cannot be customized as requirements change, and the data of the bank's business systems cannot be checked comprehensively.
Disclosure of Invention
In view of the above problems, the present invention provides a data quality checking method, apparatus, device and readable storage medium, which can greatly improve the accuracy and validity of the quality checking result and reduce usage and maintenance costs.
To solve the above technical problems, the embodiments of the present invention provide the following technical solutions:
In a first aspect, a data quality checking method is provided, which includes the following steps: 1) performing association modeling on multi-service system data according to a quality checking requirement to generate an association modeling result; 2) configuring data quality check rules according to the association modeling result to obtain a data configuration result; 3) importing the data configuration result into a rule parser to generate a checking script; 4) importing the checking script into a script executor to generate a checking detail table; 5) performing summary statistics on the checking detail table to generate a checking result report.
With reference to the first aspect, in a first possible implementation manner, performing association modeling on the multi-service system data in step 1) specifically includes the following steps: extracting the multi-service system data into a source layer of a big data platform to generate source layer data tables; associating the source layer data tables by writing HiveQL to generate a data set; and storing the data set in a model layer of the big data platform to obtain the association modeling result.
With reference to the first aspect, in a second possible implementation manner, generating the checking script in step 3) specifically means: the rule parser performs matching and parsing according to the rule type, the field to be checked and the checking logic of the data configuration result, and generates a series of SQL script statements.
With reference to the first aspect, in a third possible implementation manner, the script executor in step 4) includes a configurable process pool, and an operator can configure the maximum size of the available process pool according to resource conditions.
With reference to the first aspect, in a fourth possible implementation manner, performing summary statistics on the checking detail table in step 5) to generate the checking result report specifically means: performing summary statistics on the abnormal data contained in the checking detail table to obtain a summary result, sending the summary result by e-mail, displaying it as report data, and visualizing quality trends.
In a second aspect, a data quality checking apparatus is provided, comprising: an association modeling module, used for performing association modeling on multi-service system data according to a quality checking requirement to generate an association modeling result; a rule configuration module, used for configuring data quality check rules according to the association modeling result to obtain a data configuration result; a parser module, used for importing the data configuration result into a rule parser to generate a checking script; a script executor module, used for importing the checking script into a script executor to generate a checking detail table; and a result display module, used for performing summary statistics on the checking detail table to generate a checking result report.
With reference to the second aspect, in a first possible implementation manner, the association modeling module includes: a data extraction module, used for extracting the multi-service system data into a source layer of a big data platform to generate source layer data tables; a data association module, used for associating the source layer data tables by writing HiveQL to generate a data set; and a data storage module, used for storing the data set in a model layer of the big data platform to obtain the association modeling result.
With reference to the second aspect, in a second possible implementation manner, the script executor module further includes a process pool configuration module, so that an operator can configure the maximum size of the script executor's process pool according to resource conditions.
In a third aspect, a data processing device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the data quality checking method when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program for executing the data quality checking method.
Compared with the prior art, the invention has the following beneficial effects. The multi-service system data are temporarily associated and aggregated according to each checking requirement, so the data to be checked are pre-screened and the data range is limited, which can greatly improve the accuracy and validity of the quality checking result. With configurable quality check rules, checking personnel can modify a specific check rule at any time, which makes the quality checking requirements flexible and configurable. The rule parser reduces the error rate and debugging time of writing checking scripts by hand. The script executor improves the running efficiency of the checking scripts and therefore the efficiency of quality checking. The checking result report allows the presentation of checking results to be customized for different users. In addition, the invention requires neither the purchase of a new application system nor a large investment of manpower, so it reduces usage and maintenance costs and is convenient for small and medium-sized banks to adopt.
Drawings
The disclosure of the present invention is illustrated with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. In the drawings, like reference numerals are used to refer to like parts. Wherein:
FIG. 1 is a schematic flow chart of a data quality checking method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of the correlation modeling step of an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data quality checking apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an association modeling module according to an embodiment of the invention.
Detailed Description
It is easily understood that, according to the technical solution of the present invention, a person skilled in the art can propose various alternative structures and implementations without departing from the spirit of the present invention. Therefore, the following detailed description and the accompanying drawings merely illustrate the technical solution of the present invention, and should not be construed as the whole of the present invention or as limiting its technical solution.
FIG. 1 shows a flow chart of a data quality checking method according to an embodiment of the invention. As shown in FIG. 1, the method comprises the following steps:
S100: Perform association modeling on the multi-service system data according to the quality checking requirement to generate an association modeling result.
Specifically, quality checking requirements include regulatory reporting requirements, anti-money laundering requirements, operation analysis and reporting requirements, and the like, and different requirements call for different fields and data standards. For example, anti-money laundering requires quality checks on nine elements of an individual customer: name, certificate number, certificate validity period, nationality, gender, occupation, mobile phone number, mailing address and name of employer. Anti-money laundering requires checking whether these fields are empty, whether the certificate validity period conforms to the date format and range standards, whether the nationality, gender and occupation fields conform to the standard code values, and so on.
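The three kinds of check named above (non-empty, date format and range, and standard code values) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the field names are borrowed from the rule configuration table in this description, while the gender code set, the date bounds and the function names are hypothetical.

```python
import re
from datetime import datetime

# Hypothetical code values; a real deployment would load the bank's standard code tables.
VALID_GENDER_CODES = {"M", "F"}


def is_blank(value):
    """Non-empty check: True when the field is missing or only whitespace."""
    return value is None or str(value).strip() == ""


def is_valid_date(value, earliest="1900-01-01", latest="2099-12-31"):
    """Format and range check for a yyyy-mm-dd string (the bounds are assumptions)."""
    if is_blank(value):
        return False
    text = str(value)
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:  # e.g. month 13
        return False
    low = datetime.strptime(earliest, "%Y-%m-%d")
    high = datetime.strptime(latest, "%Y-%m-%d")
    return low <= parsed <= high


def check_customer(record):
    """Return a list of (field, problem) pairs for one customer record."""
    problems = []
    for field in ("cus_name", "id_num"):
        if is_blank(record.get(field)):
            problems.append((field, "empty"))
    if not is_valid_date(record.get("expiry_dt")):
        problems.append(("expiry_dt", "bad date"))
    if record.get("gender_tp_cd") not in VALID_GENDER_CODES:
        problems.append(("gender_tp_cd", "bad code"))
    return problems
```

The method described here generates SQL for such checks rather than checking records one by one in application code; the sketch only makes the three rule types concrete.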
A checking requirement can be raised by the risk control department, for example for regulatory reporting or anti-money laundering, or by business departments, for example for operation analysis and reporting. The technology department can also take the lead in uniformly checking important basic data across the bank to establish the overall quality of the bank's data.
In the embodiment of the invention, the business systems include the core system and various peripheral business systems such as a user center system, a corporate loan system, a personal loan system and a supply chain system. Each business system contains at least dozens of tables; because the check is targeted, only the required tables and fields are extracted for association modeling according to the actual situation.
As shown in FIG. 2, the association modeling of the multi-service system data specifically includes the following steps:
S101: Extract the multi-service system data into the source layer of the big data platform to generate source layer data tables.
S102: Associate the source layer data tables by writing HiveQL to generate a data set.
S103: Store the data set in the model layer of the big data platform to obtain the association modeling result.
For example, association modeling is performed, according to the anti-money laundering requirement, on the nine elements of individual customers who open Class II accounts at our bank. These fields are spread across different systems such as the core system, the counter system and the customer management system, and the data tables involved include the account information table, the customer basic information table, the customer extended information table, the certificate information table, the contact information table and so on. All data of each business system are extracted into the source layer of the big data platform to generate source layer data tables; the source layer data tables are then associated by writing HiveQL (Hive's SQL-like query language) to generate a data set; finally, the data set is stored in the model layer of the big data platform to obtain the association modeling result.
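The association step can be sketched as a single CREATE TABLE ... AS SELECT statement that joins several source layer tables into one wide table. The sketch below is an assumption-laden illustration, not the patent's actual script: the `src_` and `model_` table names and the `cust_id` join key are hypothetical, and the statement is run against an in-memory SQLite database purely so that the example is self-contained. It uses only constructs (CTAS, LEFT JOIN) that HiveQL also offers, but it has not been run on Hive.

```python
import sqlite3

# Hypothetical wide-table statement; table names and the cust_id key are assumptions.
WIDE_TABLE_SQL = """
CREATE TABLE model_aml_person AS
SELECT p.cust_id,
       p.cus_name,
       p.id_num,
       d.expiry_dt,
       c.ctcmth_num
FROM src_intperson p
LEFT JOIN src_perident d ON d.cust_id = p.cust_id
LEFT JOIN src_contactmethod c ON c.cust_id = p.cust_id
"""


def build_wide_table(conn):
    """Create the model-layer wide table and return its rows ordered by customer."""
    conn.execute(WIDE_TABLE_SQL)
    return conn.execute(
        "SELECT * FROM model_aml_person ORDER BY cust_id"
    ).fetchall()


def demo():
    """Load a tiny made-up source layer and build the wide table from it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_intperson (cust_id TEXT, cus_name TEXT, id_num TEXT);
        CREATE TABLE src_perident (cust_id TEXT, expiry_dt TEXT);
        CREATE TABLE src_contactmethod (cust_id TEXT, ctcmth_num TEXT);
        INSERT INTO src_intperson VALUES ('C1', 'Li Lei', 'ID001'), ('C2', 'Han Mei', NULL);
        INSERT INTO src_perident VALUES ('C1', '2030-01-01');
        INSERT INTO src_contactmethod VALUES ('C2', '13800000000');
    """)
    try:
        return build_wide_table(conn)
    finally:
        conn.close()
```

LEFT JOIN is used so that a customer with a missing certificate or contact row still appears in the wide table with NULL fields, which is exactly what a later non-empty check needs to see.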
S200: Configure data quality check rules according to the association modeling result to obtain a data configuration result.
It should be understood that the association modeling result here is a temporary wide table built from multiple business systems through ETL (Extract, Transform, Load) and stored in the big data platform. The table contains the fields and business data that need to be checked. Rules are configured for each field of the temporary wide table to obtain the data configuration result. The rule configuration metadata includes the rule major class, rule subclass, check source table, source field, Chinese field name, check logic and so on, as shown in the following table:
Rule major class | Rule subclass | Check source table | Source field | Chinese field name | Check logic
2 Missing data | 1 Non-empty check | intperson | cus_name | Customer name | Not applicable
2 Missing data | 1 Non-empty check | intperson | id_num | Certificate number | Not applicable
2 Missing data | 1 Non-empty check | contactmethod | ctcmth_num | Mobile phone number | Not applicable
3 Coding specification violation | 2 Time/date format check | perident | start_dt | Certificate start date | yyyy-mm-dd
3 Coding specification violation | 2 Time/date format check | perident | expiry_dt | Certificate expiry date | yyyy-mm-dd
4 Technical specification violation | 1 Range/code value check | person | country_tp_cd | Nationality | nationality
4 Technical specification violation | 1 Range/code value check | person | gender_tp_cd | Gender | gender
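One way to hold this rule configuration metadata in a program is a small record type loaded from a delimited file. The sketch below is illustrative only: the patent does not specify a file format, so the CSV layout, the column names and the `CheckRule` class are assumptions, and only three of the table's rows are reproduced.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckRule:
    """One row of rule configuration metadata (attribute names are assumptions)."""
    major_class: str   # e.g. "2" = missing data
    sub_class: str     # e.g. "1" = non-empty check
    source_table: str
    source_field: str
    field_name: str    # human-readable field name
    check_logic: str   # empty when the rule needs no extra logic


# Hypothetical CSV rendering of three rows of the table above.
RULES_CSV = """major_class,sub_class,source_table,source_field,field_name,check_logic
2,1,intperson,cus_name,Customer name,
3,2,perident,start_dt,Certificate start date,yyyy-mm-dd
4,1,person,gender_tp_cd,Gender,gender
"""


def load_rules(text):
    """Parse CSV text into a list of CheckRule objects."""
    return [CheckRule(**row) for row in csv.DictReader(io.StringIO(text))]
```

Keeping the rules as data rather than code is what lets checking personnel change a rule without touching the parser or executor.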
S300: Import the data configuration result into the rule parser to generate a checking script.
Specifically, the rule parser is an executable program on a Linux system and can be implemented in different programming languages such as Shell, Java, C or Python. The data configuration result file is passed to the rule parser as an input parameter. The parser reads the configuration entries one by one, performs matching and parsing according to the rule type, the field to be checked and the checking logic, and finally generates a series of SQL script statements, namely the checking script.
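A rule parser of this kind can be sketched in Python as a function that maps each configuration entry to one SQL statement. This is a minimal sketch under assumptions, not the patent's parser: the dictionary keys, the output column names, the `code_<logic>` lookup-table convention and the LIKE-based date shape test are all hypothetical choices made for illustration.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(name):
    """Reject anything that is not a plain SQL identifier before interpolating it."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_check_sql(rule):
    """Translate one rule entry (a dict) into a SELECT that returns abnormal rows."""
    table = _ident(rule["source_table"])
    field = _ident(rule["source_field"])
    kind = (rule["major_class"], rule["sub_class"])
    if kind == ("2", "1"):      # missing data / non-empty check
        condition = f"{field} IS NULL OR TRIM({field}) = ''"
    elif kind == ("3", "2"):    # coding specification / date format check
        # Turn a layout such as yyyy-mm-dd into the LIKE pattern ____-__-__ (shape only).
        pattern = re.sub(r"[A-Za-z]", "_", rule["check_logic"])
        if not re.fullmatch(r"[_\-/.: ]+", pattern):
            raise ValueError(f"unsupported date layout: {rule['check_logic']!r}")
        condition = f"{field} NOT LIKE '{pattern}'"
    elif kind == ("4", "1"):    # technical specification / code value check
        # Assumption: valid codes live in a lookup table named code_<check_logic>.
        code_table = _ident("code_" + rule["check_logic"])
        condition = f"{field} NOT IN (SELECT code FROM {code_table})"
    else:
        raise ValueError(f"unsupported rule type: {kind}")
    return (
        f"SELECT '{table}.{field}' AS checked_field, {field} AS bad_value "
        f"FROM {table} WHERE {condition}"
    )


def parse_rules(rules):
    """Generate the checking script: one SQL statement per configured rule."""
    return [build_check_sql(rule) for rule in rules]
```

Generating the statements from configuration, instead of writing them by hand, is where the reduction in scripting errors claimed for the parser would come from; the identifier validation is an extra safeguard added in this sketch because configuration values are interpolated into SQL text.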
S400: Import the checking script into the script executor to generate a checking detail table.
The script executor contains a configurable process pool, and the operator can configure the maximum size of the available process pool according to resource conditions. The checking script file is passed to the script executor as an input parameter. The script executor reads it line by line, assigns each statement to a background process, and runs the processes concurrently to generate the checking detail table.
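The executor's behaviour (read statements, hand them to a bounded pool of workers, collect the abnormal rows) can be sketched as follows. Two substitutions are made to keep the example self-contained and should be read as assumptions: a thread pool from `concurrent.futures` stands in for the process pool described above, and a temporary SQLite file stands in for the big data platform. `max_workers` plays the role of the configurable maximum pool size; all function and table names are hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_one(db_path, sql):
    """Worker: open its own connection, run one check statement, return the rows."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def run_check_scripts(db_path, statements, max_workers=4):
    """Run all statements concurrently with at most max_workers workers."""
    detail = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, db_path, sql) for sql in statements]
        for future in futures:  # collect in submission order
            detail.extend(future.result())
    return detail


def demo():
    """Build a tiny made-up table, run two non-empty checks, return the detail rows."""
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(db_path)
        conn.executescript("""
            CREATE TABLE intperson (cust_id TEXT, cus_name TEXT, id_num TEXT);
            INSERT INTO intperson VALUES
                ('C1', 'Li Lei', NULL), ('C2', NULL, 'ID002'), ('C3', '', 'ID003');
        """)
        conn.commit()
        conn.close()
        statements = [
            "SELECT 'cus_name', cust_id FROM intperson "
            "WHERE cus_name IS NULL OR TRIM(cus_name) = ''",
            "SELECT 'id_num', cust_id FROM intperson "
            "WHERE id_num IS NULL OR TRIM(id_num) = ''",
        ]
        return run_check_scripts(db_path, statements, max_workers=2)
    finally:
        os.remove(db_path)
```

`ProcessPoolExecutor` exposes the same `submit` interface, so swapping it in would bring the sketch closer to the process pool of the description, at the cost of requiring the usual main-module guard when run as a script.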
S500: Perform summary statistics on the checking detail table to generate a checking result report.
The checking detail table contains all abnormal data found during checking. These data are counted and aggregated to obtain a summary result, which can be sent by e-mail, displayed as report data, used for visual analysis of quality trends, and so on.
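The summary step can be sketched as a count of abnormal rows per checked field, turned into a plain-text report. The shape of the detail rows (a field name paired with a record key), the rate calculation and the report layout are assumptions made for illustration; the e-mail delivery and trend visualization mentioned above are outside this sketch.

```python
from collections import Counter


def summarize(detail_rows, total_rows_checked):
    """Aggregate (field, record_key) detail rows into per-field abnormal counts and rates."""
    if total_rows_checked <= 0:
        raise ValueError("total_rows_checked must be positive")
    counts = Counter(field for field, _ in detail_rows)
    return [
        {
            "field": field,
            "abnormal": count,
            "abnormal_rate": round(count / total_rows_checked, 4),
        }
        for field, count in sorted(counts.items())
    ]


def format_report(summary):
    """Render the summary as plain text, e.g. for the body of an e-mail."""
    lines = ["field      abnormal  rate"]
    for item in summary:
        lines.append(
            f"{item['field']:<10} {item['abnormal']:>8}  {item['abnormal_rate']:.2%}"
        )
    return "\n".join(lines)
```

Storing one such summary per run would give the time series needed for the quality trend analysis that the description mentions.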
As shown in FIG. 3, the present invention also discloses a data quality checking apparatus, which includes:
The association modeling module 100, used for performing association modeling on the multi-service system data according to the quality checking requirement to generate an association modeling result.
As shown in FIG. 4, the association modeling module 100 includes: the data extraction module 101, used for extracting the multi-service system data into the source layer of the big data platform to generate source layer data tables; the data association module 102, used for associating the source layer data tables by writing HiveQL to generate a data set; and the data storage module 103, used for storing the data set in the model layer of the big data platform to obtain the association modeling result.
The rule configuration module 200, used for configuring data quality check rules according to the association modeling result to obtain a data configuration result.
The parser module 300, used for importing the data configuration result into the rule parser to generate a checking script.
The script executor module 400, used for importing the checking script into the script executor to generate a checking detail table. The script executor module also includes a process pool configuration module, with which the operator configures the maximum size of the script executor's process pool according to resource conditions.
The result display module 500, used for performing summary statistics on the checking detail table to generate a checking result report.
Correspondingly, the invention also discloses a data processing device, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the above data quality checking methods when executing the computer program.
Correspondingly, the invention also discloses a computer-readable storage medium storing a computer program for executing any of the above data quality checking methods.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
It should be appreciated that the integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The technical scope of the present invention is not limited to the above description, and those skilled in the art can make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and such changes and modifications should fall within the protective scope of the present invention.

Claims (10)

1. A data quality checking method, characterized by comprising the following steps:
1) performing association modeling on multi-service system data according to a quality checking requirement to generate an association modeling result;
2) configuring data quality check rules according to the association modeling result to obtain a data configuration result;
3) importing the data configuration result into a rule parser to generate a checking script;
4) importing the checking script into a script executor to generate a checking detail table;
5) performing summary statistics on the checking detail table to generate a checking result report.
2. The data quality checking method according to claim 1, wherein performing association modeling on the multi-service system data in step 1) specifically includes the following steps:
extracting the multi-service system data into a source layer of a big data platform to generate source layer data tables;
associating the source layer data tables by writing HiveQL to generate a data set;
and storing the data set in a model layer of the big data platform to obtain the association modeling result.
3. The data quality checking method according to claim 1, wherein generating the checking script in step 3) specifically means: the rule parser performs matching and parsing according to the rule type, the field to be checked and the checking logic of the data configuration result, and generates a series of SQL script statements.
4. The data quality checking method according to claim 1, wherein the script executor in step 4) includes a configurable process pool, and an operator can configure the maximum size of the available process pool according to resource conditions.
5. The data quality checking method according to claim 1, wherein performing summary statistics on the checking detail table in step 5) to generate the checking result report specifically means: performing summary statistics on the abnormal data contained in the checking detail table to obtain a summary result, sending the summary result by e-mail, displaying it as report data, and visualizing quality trends.
6. A data quality checking apparatus, comprising:
an association modeling module, used for performing association modeling on multi-service system data according to a quality checking requirement to generate an association modeling result;
a rule configuration module, used for configuring data quality check rules according to the association modeling result to obtain a data configuration result;
a parser module, used for importing the data configuration result into a rule parser to generate a checking script;
a script executor module, used for importing the checking script into a script executor to generate a checking detail table;
and a result display module, used for performing summary statistics on the checking detail table to generate a checking result report.
7. The data quality checking apparatus according to claim 6, wherein the association modeling module comprises:
a data extraction module, used for extracting the multi-service system data into a source layer of a big data platform to generate source layer data tables;
a data association module, used for associating the source layer data tables by writing HiveQL to generate a data set;
and a data storage module, used for storing the data set in a model layer of the big data platform to obtain the association modeling result.
8. The data quality checking apparatus according to claim 6, wherein the script executor module further comprises a process pool configuration module, so that an operator can configure the maximum size of the script executor's process pool according to resource conditions.
9. A data processing device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data quality checking method according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the data quality checking method according to any one of claims 1 to 5.
CN202010696381.7A 2020-07-20 2020-07-20 Data quality checking method, device and equipment and readable storage medium Withdrawn CN111723086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010696381.7A CN111723086A (en) 2020-07-20 2020-07-20 Data quality checking method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010696381.7A CN111723086A (en) 2020-07-20 2020-07-20 Data quality checking method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111723086A (en) 2020-09-29

Family

ID=72572867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010696381.7A Withdrawn CN111723086A (en) 2020-07-20 2020-07-20 Data quality checking method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111723086A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347095A (en) * 2020-11-16 2021-02-09 建信金融科技有限责任公司 Data table processing method and device and server
CN112416727A (en) * 2020-11-23 2021-02-26 中国建设银行股份有限公司 Batch processing operation checking method, device, equipment and medium
CN112685401A (en) * 2021-01-22 2021-04-20 浪潮云信息技术股份公司 Data quality detection system and method
CN112988736A (en) * 2021-05-20 2021-06-18 睿至科技集团有限公司 Mass data quality checking method and system
CN115328948A (en) * 2022-02-22 2022-11-11 杭州美创科技有限公司 Master data quality management method, master data quality management device, computer equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347095A (en) * 2020-11-16 2021-02-09 建信金融科技有限责任公司 Data table processing method and device and server
CN112347095B (en) * 2020-11-16 2023-04-21 建信金融科技有限责任公司 Data table processing method, device and server
CN112416727A (en) * 2020-11-23 2021-02-26 中国建设银行股份有限公司 Batch processing operation checking method, device, equipment and medium
CN112685401A (en) * 2021-01-22 2021-04-20 浪潮云信息技术股份公司 Data quality detection system and method
CN112988736A (en) * 2021-05-20 2021-06-18 睿至科技集团有限公司 Mass data quality checking method and system
CN112988736B (en) * 2021-05-20 2021-08-03 睿至科技集团有限公司 Mass data quality checking method and system
CN115328948A (en) * 2022-02-22 2022-11-11 杭州美创科技有限公司 Master data quality management method, master data quality management device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200929