CN113760681A - Unified SQL (Structured Query Language)-based multi-source heterogeneous data quality verification method and system


Info

Publication number
CN113760681A
CN113760681A CN202110260430.7A CN202110260430A CN113760681A CN 113760681 A CN113760681 A CN 113760681A CN 202110260430 A CN202110260430 A CN 202110260430A CN 113760681 A CN113760681 A CN 113760681A
Authority
CN
China
Prior art keywords
quality
data quality
sql
verification
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110260430.7A
Other languages
Chinese (zh)
Inventor
苑洪涛
冯凯
余智华
孙庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Golaxy Data Technology Co ltd
Original Assignee
Golaxy Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Golaxy Data Technology Co ltd
Priority to CN202110260430.7A
Publication of CN113760681A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-source heterogeneous data quality verification method and system based on unified SQL. According to one aspect of the invention, the method comprises the following steps: S1, creating a quality check rule; S2, creating a quality check task; S3, creating a quality check job; S4, creating a quality check scheduling plan; S5, submitting the scheduling plan to execute the verification logic; S6, generating a quality report. According to another aspect of the invention, the system comprises a task job configuration module, a job scheduling module, a verification execution module, and a data quality report generation and data quality analysis module. The invention has the beneficial effects that data in different storage forms can be managed for quality in an intuitive, flexible and unified way, and complex business requirements can be met.

Description

Unified SQL (structured query language) -based multi-source heterogeneous data quality verification method and system
Technical Field
The invention relates to the field of data quality management, in particular to a multisource heterogeneous data quality verification method and system based on unified SQL.
Background
With the rapid development of big data technology in recent years, the value of network data of all kinds, which underpins big data, keeps increasing. Massive amounts of data are generated on the network every day, and data quality problems can arise at every stage of the data life cycle: planning, acquisition, storage, sharing, maintenance, application and retirement. Once such problems occur, they can have disastrous consequences for subsequent business processes, so data that may have quality problems needs to be identified, measured, monitored and alerted on. In addition, data takes different storage forms over its life cycle: it may reside in a relational database such as MySQL, PostgreSQL or Oracle, or in a non-relational database such as Elasticsearch or MongoDB, which further increases the difficulty of data quality management.
Therefore, how to effectively manage the data quality among heterogeneous databases is a problem to be solved at present.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a multisource heterogeneous data quality verification method and system based on unified SQL (structured query language) so as to overcome the technical problems in the prior related art.
The technical scheme of the invention is realized as follows:
According to one aspect of the invention, a multi-source heterogeneous data quality verification method based on unified SQL is provided, which comprises the following steps:
S1, creating a quality check rule;
S2, creating a quality check task;
S3, creating a quality check job;
S4, creating a quality check scheduling plan;
S5, submitting the scheduling plan to execute the verification logic;
S6, generating a quality report.
Further, step S1 of creating the quality check rule includes filling in basic information of the quality rule, such as the rule name and the rule dimension.
Further, step S2 of creating the quality check task includes filling in basic information of the quality check task, selecting the database to be checked, selecting and configuring quality check rules, and configuring an appropriate alarm threshold.
Further, step S3 of creating the quality check job includes filling in basic information of the quality check job and selecting the tasks to be checked.
Further, step S4 of creating the quality check scheduling plan includes specifying an execution plan for the job created in step S3; the execution plan is one of single execution, periodic serial and periodic parallel.
Further, step S5 of submitting the scheduling plan to execute the verification logic works as follows. For single execution, the execution unit runs immediately after submission and runs only once. For periodic serial, the execution unit is scheduled at the configured period after submission; if the previous run has not finished, the next run is not started, and the execution unit is scheduled again only after the previous run has finished. For periodic parallel, the execution unit is scheduled at the configured period after submission, and the next run is started regardless of whether the previous run has finished. The execution unit is mainly responsible for parsing the task configuration and performing the data quality verification.
Further, step S6 of generating the quality report includes generating a data quality report from the output of the execution unit and performing data quality analysis.
Further, step S5 of submitting the scheduling plan to execute the verification logic specifically includes the following steps:
S501, generating configuration information for data quality verification;
S502, creating an external table according to the configured data source information, table name and other information, and adding the external table name to the configuration information for data quality verification;
S503, parsing the submitted service configuration and determining the data audit conditions;
S504, according to the submitted configuration information, replacing the table name variables, column name variables and parameter variables of the pseudo sql with the configured values to assemble a standard sql statement, and from it assembling a standard sql statement that obtains the number of problem rows and a standard sql statement that obtains the total number of rows;
S505, submitting the standard sql of step S504 to postgres for execution through the Java JDBC API;
S506, the unified sql engine obtaining the heterogeneous database, table and fields associated with the external table;
S507, the unified sql engine optimizing and parsing the submitted sql statements;
S508, converting the statements into query statements executable by the database associated with the external table;
S509, the heterogeneous database executing the query and returning the query result;
S510, calculating an alarm value from the result returned by the sql in step S509;
S511, comparing the value calculated in step S510 with the configured threshold; if the value exceeds the threshold, the alarm condition is met;
S512, when the alarm condition is met, marking the rule as alarming, and sending the alarm information after all rules have been executed;
S513, collecting the indexes of the check result and storing whether the alarm condition was met.
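Step S502 can be illustrated with a small Python sketch that builds the DDL for registering an external table. It assumes the unified sql engine accepts PostgreSQL's standard CREATE FOREIGN TABLE syntax; the source only states that the engine is postgres-based and uses external tables, so the server name, option keys, column types and the helper name build_external_table_ddl are hypothetical.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier (double quotes, embedded quotes doubled)."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote an SQL string literal (single quotes, embedded quotes doubled)."""
    return "'" + value.replace("'", "''") + "'"


def build_external_table_ddl(ext_table, columns, server, options):
    """Assemble a CREATE FOREIGN TABLE statement (assumed DDL form, not from the source)."""
    cols = ", ".join(f"{quote_ident(name)} {sql_type}" for name, sql_type in columns)
    opts = ", ".join(f"{key} {quote_literal(val)}" for key, val in options.items())
    return (
        f"CREATE FOREIGN TABLE {quote_ident(ext_table)} ({cols}) "
        f"SERVER {quote_ident(server)} OPTIONS ({opts})"
    )


# Hypothetical verification configuration produced in step S501.
config = {"data_source": "mysql_orders", "table": "orders"}

ddl = build_external_table_ddl(
    ext_table="ext_mysql_orders_orders",
    columns=[("id", "bigint"), ("amount", "numeric")],
    server=config["data_source"],
    options={"dbname": "shop", "table_name": config["table"]},
)

# Step S502 also records the external table name in the configuration.
config["external_table"] = "ext_mysql_orders_orders"
print(ddl)
```

Recording the external table name in the configuration matters because the standard sql assembled later is written against the external table rather than against the table in the source database.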
According to another aspect of the invention, a multi-source heterogeneous data quality verification system based on unified SQL is provided, which comprises a task job configuration module, a job scheduling module, a verification execution module, and a data quality report generation and data quality analysis module;
the task job configuration module is mainly responsible for collecting the configuration information of the verification rules, assembling the rule information into task jobs and handing the task jobs to the scheduling module for scheduling;
the job scheduling module is mainly responsible for scheduling jobs and submits the task configuration to the verification execution module, once or periodically, to perform data quality verification; there are three scheduling modes: single scheduling, periodic serial and periodic parallel;
the verification execution module mainly obtains indexes such as the problem data and the total number of data rows through the unified sql engine according to the configured task information, and compares them with the threshold to determine whether the alarm condition is met;
the data quality report generation and data quality analysis module generates a data quality report from the output of the verification execution module and analyzes the data quality.
Further, the task job configuration module comprises a rule configuration unit, a task configuration unit and a job configuration unit.
The rule configuration unit is mainly used for creating quality rules according to business requirements, creating the pseudo sql statement of each quality rule, and standardizing how pseudo sql is written.
The task configuration unit is mainly used for selecting suitable quality rules according to business requirements and configuring them, including selecting the data source, audit table, audit fields and audit parameter values, and configuring alarm conditions such as the number of problem rows and the problem ratio.
The job configuration unit is mainly used for configuring a plurality of tasks into one job, so that the data quality verification of all those tasks is completed in one scheduling run after the job is submitted to the scheduling module.
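The relationship between rules, tasks and jobs described for the three configuration units can be sketched as a small data model. This is a minimal Python sketch under stated assumptions: the class names, field names and the RuleBinding helper are hypothetical and do not come from the source, which specifies only that a task configures one or more rules and that a job groups several tasks.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QualityRule:
    """Output of the rule configuration unit: a named pseudo sql template."""
    name: str
    dimension: str
    pseudo_sql: str


@dataclass
class RuleBinding:
    """A rule configured inside a task: variable values plus alarm conditions."""
    rule: QualityRule
    variables: Dict[str, str]
    max_problem_rows: int
    max_problem_ratio: float


@dataclass
class CheckTask:
    """Output of the task configuration unit: a data source and its configured rules."""
    name: str
    data_source: str
    bindings: List[RuleBinding] = field(default_factory=list)


@dataclass
class CheckJob:
    """Output of the job configuration unit: several tasks scheduled together."""
    name: str
    tasks: List[CheckTask] = field(default_factory=list)

    def rule_count(self) -> int:
        """Number of configured rules the job verifies in one scheduling run."""
        return sum(len(task.bindings) for task in self.tasks)


rule = QualityRule(
    name="negative amount check",
    dimension="accuracy",
    pseudo_sql="select * from ${SCHEMA_TABLE_1} where ${COLUMN_1} < ${parm1}",
)
orders = CheckTask("orders check", "mysql_orders", [
    RuleBinding(rule, {"SCHEMA_TABLE_1": "orders", "COLUMN_1": "amount", "parm1": "0"}, 0, 0.0),
])
refunds = CheckTask("refunds check", "mongo_refunds", [
    RuleBinding(rule, {"SCHEMA_TABLE_1": "refunds", "COLUMN_1": "amount", "parm1": "0"}, 10, 0.05),
])
job = CheckJob("nightly checks", [orders, refunds])
print(job.rule_count())
```

Keeping the variable values and alarm conditions on the binding rather than on the rule lets one rule template be reused across tasks that target different data sources.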
The invention has the beneficial effects that data in different storage forms can be managed for quality in an intuitive, flexible and unified way, and complex business requirements can be met. By combining the unified SQL-based multi-source heterogeneous data quality verification method and system, problem data across heterogeneous databases can be identified simply and flexibly by means of unified SQL, and the quality of the data can be monitored, alerted on and analyzed by setting thresholds.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a unified SQL-based multi-source heterogeneous database data quality verification method according to an embodiment of the present invention;
FIG. 2 is the first flowchart of step S5 in the unified SQL-based multi-source heterogeneous database data quality verification method according to an embodiment of the present invention;
FIG. 3 is the second flowchart of step S5 in the unified SQL-based multi-source heterogeneous database data quality verification method according to an embodiment of the present invention;
FIG. 4 is a block flow diagram of step S5 in a unified SQL-based multi-source heterogeneous database data quality checking method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data quality management system architecture according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the correspondence of a pseudo sql to a standard sql in accordance with an embodiment of the invention;
FIG. 7 is a diagram illustrating data quality verification results according to an embodiment of the present invention;
FIG. 8 is a block diagram of a unified SQL-based multi-source heterogeneous database data quality verification system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the above description of the present invention, it should be noted that the terms "one side", "the other side" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings or orientations or positional relationships that the products of the present invention are conventionally placed in use, and are only used for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the device or the element to which the present invention is directed must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
In addition, before describing the implementation of the unified SQL-based multi-source heterogeneous database data quality verification method and system provided by the present invention, the terms mentioned in the present invention are briefly explained:
1. Pseudo sql
The pseudo sql is the template of a data quality check rule. It resembles normal sql, but the table name, the field name or the query condition may be written as variables according to the conventions below. For example, to select the rows of a table in which a certain field is smaller than a certain value, the pseudo sql may be written as: select * from ${SCHEMA_TABLE_1} where ${COLUMN_1} < ${parm1}.
2. Table variable
A variable in the pseudo sql that represents a table name, such as ${SCHEMA_TABLE_1}.
3. Column variable
A variable in the pseudo sql that represents a column name, such as ${COLUMN_1}.
4. Parameter variable
A variable in the pseudo sql that represents a parameter value, such as ${parm1}.
5. Standard sql
Structured Query Language, the standard language for accessing relational databases.
6. Unified sql engine
A unified SQL query engine for big data that provides uniform data access and cross-source association processing across multiple heterogeneous data sources. The engine is developed on the basis of postgres and is associated with heterogeneous databases through external tables. When an external table is queried, the unified SQL engine optimizes and parses the query statement into a query statement of the target database, sends it to the target database for execution, and returns the query result.
7. Heterogeneous databases
Databases with different storage forms, such as MySQL, Oracle, PostgreSQL, MongoDB and Elasticsearch.
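The relationship between pseudo sql and standard sql defined above can be sketched in Python as a plain variable substitution. The function name render_pseudo_sql, the external table name ext_orders and the configured values are hypothetical, and the pseudo sql is written with its spacing normalized; the sketch also assumes the configured values are trusted, since it performs no escaping.

```python
import re

# Matches ${NAME} placeholders such as ${SCHEMA_TABLE_1}, ${COLUMN_1} or ${parm1}.
VARIABLE = re.compile(r"\$\{(\w+)\}")


def render_pseudo_sql(pseudo_sql: str, values: dict) -> str:
    """Replace every ${NAME} variable with its configured value."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value configured for variable {key}")
        return str(values[key])

    return VARIABLE.sub(substitute, pseudo_sql)


pseudo = "select * from ${SCHEMA_TABLE_1} where ${COLUMN_1} < ${parm1}"
standard = render_pseudo_sql(
    pseudo,
    {"SCHEMA_TABLE_1": "ext_orders", "COLUMN_1": "amount", "parm1": 0},
)
print(standard)  # select * from ext_orders where amount < 0
```

Raising an error for an unconfigured variable, rather than leaving the placeholder in place, keeps a half-rendered statement from being submitted to the engine.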
As shown in fig. 1 to 4, according to an embodiment of the present invention, a multi-source heterogeneous data quality verification method based on unified SQL is provided, which includes the following steps:
S1, creating a quality check rule;
Basic information of the quality rule is filled in, such as the rule name, the rule dimension and the pseudo sql statement. For example, a rule is created whose name is "custom max check", whose dimension is accuracy, and whose pseudo sql is select * from ${SCHEMA_TABLE_1} where max(${COLUMN_1}) < 100.
S2, creating a quality check task;
Basic information of the quality check task, such as the task name, is filled in; the database to be checked is selected; quality check rules are selected and configured, which mainly means determining the table name, column name and parameter values of each rule; and an appropriate alarm threshold is configured. One check task can select and configure a plurality of data quality check rules according to business requirements.
S3, creating a quality check job;
Basic information of the quality check job, such as the job name, is filled in, and the tasks to be checked are selected; one job can select a plurality of check tasks.
S4, creating a quality check scheduling plan;
An execution plan is specified for the job created in step S3; the execution plan is one of single execution, periodic serial and periodic parallel.
S5, submitting the scheduling plan to execute the verification logic;
For single execution, the execution unit runs immediately after submission and runs only once. For periodic serial, the execution unit is scheduled at the configured period after submission; if the previous run has not finished, the next run is not started, and the execution unit is scheduled again only after the previous run has finished. For periodic parallel, the execution unit is scheduled at the configured period after submission, and the next run is started regardless of whether the previous run has finished. The execution unit is mainly responsible for parsing the task configuration and performing the data quality verification; its steps are shown in FIG. 2 and FIG. 3:
Step S501, generating configuration information for data quality verification;
Step S502, creating an external table according to the data source information, table name and other information configured in step S501, and adding the external table name to the configuration information for data quality verification;
Step S503, parsing the submitted service configuration to determine which data source, which table and which field need which quality check, and what condition triggers an alarm;
Step S504, according to the submitted configuration information, replacing the table name variables, column name variables and parameter variables of the pseudo sql with the configured values to assemble a standard sql statement, and from it assembling a standard sql statement that obtains the number of problem rows and a standard sql statement that obtains the total number of rows; the table name used here is the external table name;
Step S505, the method being implemented in Java, submitting the standard sql of step S504 to postgres for execution through the Java JDBC API;
Step S506, the unified sql engine obtaining information such as the heterogeneous database, table and fields associated with the external table;
Step S507, the unified sql engine optimizing and parsing the submitted sql statements;
Step S508, converting the statements into query statements executable by the database associated with the external table;
Step S509, the heterogeneous database executing the query and returning the query result;
Step S510, calculating an alarm value from the result returned by the sql in step S509;
Step S511, comparing the value calculated in step S510 with the configured threshold; if the value exceeds the threshold, the alarm condition is met;
Step S512, if the alarm condition is met, marking the rule as alarming, and sending the alarm information after all rules have been executed;
Step S513, collecting the indexes of the check result and storing whether the alarm condition was met.
S6, generating a quality report;
A data quality report is generated from the output of the execution unit, and the data quality is analyzed.
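The counting and threshold comparison of steps S504, S510 and S511 can be sketched as follows. This is a minimal Python sketch, not the Java/JDBC implementation the embodiment describes: an in-memory SQLite database stands in for the unified sql engine, and the function names, the sample table and the two thresholds (maximum problem rows and maximum problem ratio) are hypothetical.

```python
import sqlite3


def count_queries(problem_sql: str, table: str):
    """Derive the problem-row count query and the total-row count query (step S504)."""
    problem_count_sql = f"select count(*) from ({problem_sql}) as problem_rows"
    total_count_sql = f"select count(*) from {table}"
    return problem_count_sql, total_count_sql


def evaluate(conn, problem_sql, table, max_rows, max_ratio):
    """Run both count queries and compare the alarm values with the thresholds."""
    problem_count_sql, total_count_sql = count_queries(problem_sql, table)
    problems = conn.execute(problem_count_sql).fetchone()[0]
    total = conn.execute(total_count_sql).fetchone()[0]
    ratio = problems / total if total else 0.0
    # The alarm condition is met when either configured threshold is exceeded.
    alarm = problems > max_rows or ratio > max_ratio
    return {"problem_rows": problems, "total_rows": total,
            "problem_ratio": ratio, "alarm": alarm}


# SQLite is only a stand-in for the external table exposed by the engine.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, amount integer)")
conn.executemany("insert into orders values (?, ?)",
                 [(1, 10), (2, -5), (3, 7), (4, -1)])

result = evaluate(conn, "select * from orders where amount < 0", "orders",
                  max_rows=5, max_ratio=0.25)
print(result)
```

In this run two of the four rows are problem rows, so the row-count threshold of 5 is not exceeded but the ratio of 0.5 exceeds 0.25 and the rule is marked as alarming.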
As shown in fig. 8, according to an embodiment of the present invention, there is also provided a multi-source heterogeneous data quality verification system based on unified SQL, including a task job configuration module, a job scheduling module, a verification execution module, and a data quality report generation and data quality analysis module;
wherein:
First, the task job configuration module;
the module is mainly responsible for collecting the configuration information of the verification rules, assembling the rule information into task jobs and handing the task jobs to the scheduling module for scheduling. It mainly comprises the following units:
1. Rule configuration unit;
the unit is mainly used for creating quality rules according to business requirements, creating the pseudo sql statement of each quality rule, and standardizing how pseudo sql is written:
(1) only the SQL that queries the problem data needs to be written, and the SQL must follow PostgreSQL syntax;
(2) ${SCHEMA_TABLE_} and ${COLUMN_} are reserved keywords that represent table names and field names respectively; multiple tables and multiple fields are supported and are distinguished by sequence numbers, for example: SCHEMA_TABLE_1, SCHEMA_TABLE_2, COLUMN_1, COLUMN_2;
(3) when the custom SQL contains a plurality of tables, each ${COLUMN_} field must be preceded by its table name, for example: select * from ${SCHEMA_TABLE_1} where ${SCHEMA_TABLE_1}.${COLUMN_1} > 30 and ${SCHEMA_TABLE_1}.id in (select id from ${SCHEMA_TABLE_2} where ${SCHEMA_TABLE_2}.${COLUMN_2} = 'hill');
(4) table names and field names cannot be renamed with "as".
2. Task configuration unit;
the unit is mainly used for selecting suitable quality rules according to business requirements and configuring them, including selecting the data source, audit table, audit fields and audit parameter values, and configuring alarm conditions such as the number of problem rows and the problem ratio.
3. Job configuration unit;
the unit is mainly used for configuring a plurality of tasks into one job, so that the data quality verification of all those tasks is completed in one scheduling run after the job is submitted to the scheduling module.
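The pseudo sql writing conventions of the rule configuration unit lend themselves to an automated check. The following Python sketch validates two of them: table-qualified column variables when several tables are used, and no "as" renaming. The function name validate_pseudo_sql and the regular expressions are hypothetical, the "=" operator in the sample statement is an assumption, and the "as" detection is deliberately naive (it would also flag the word inside a string literal).

```python
import re

TABLE_VAR = re.compile(r"\$\{SCHEMA_TABLE_(\d+)\}")
COLUMN_VAR = re.compile(r"\$\{COLUMN_\d+\}")
QUALIFIED_COLUMN = re.compile(r"\$\{SCHEMA_TABLE_\d+\}\.\$\{COLUMN_\d+\}")
AS_KEYWORD = re.compile(r"\bas\b", re.IGNORECASE)


def validate_pseudo_sql(pseudo_sql: str) -> list:
    """Return a list of convention violations; an empty list means the rule passes."""
    errors = []
    tables = set(TABLE_VAR.findall(pseudo_sql))
    if not tables:
        errors.append("no SCHEMA_TABLE variable found")
    if len(tables) > 1:
        # With several tables, every column variable must be table-qualified.
        unqualified = (len(COLUMN_VAR.findall(pseudo_sql))
                       - len(QUALIFIED_COLUMN.findall(pseudo_sql)))
        if unqualified > 0:
            errors.append("column variable without table name in multi-table sql")
    if AS_KEYWORD.search(pseudo_sql):
        errors.append("'as' renaming is not allowed")
    return errors


good = ("select * from ${SCHEMA_TABLE_1} "
        "where ${SCHEMA_TABLE_1}.${COLUMN_1} > 30 "
        "and ${SCHEMA_TABLE_1}.id in (select id from ${SCHEMA_TABLE_2} "
        "where ${SCHEMA_TABLE_2}.${COLUMN_2} = 'hill')")
bad = "select * from ${SCHEMA_TABLE_1} as t, ${SCHEMA_TABLE_2} where ${COLUMN_1} > 30"

good_errors = validate_pseudo_sql(good)
bad_errors = validate_pseudo_sql(bad)
print(good_errors)
print(bad_errors)
```

Running the check when a rule is saved would surface convention violations before a task or job ever references the rule.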
Second, the job scheduling module;
the module is mainly responsible for scheduling jobs and submits the task configuration to the verification execution module, once or periodically, to perform data quality verification; there are three scheduling modes:
1. Single scheduling: after the job is manually submitted for scheduling, all tasks in the job are executed only once;
2. Periodic serial: after the job is manually submitted for scheduling, data quality verification is performed at the configured period according to the submitted task configuration; if the data quality verification of the previous scheduling run has not finished, the next scheduling run does not submit the task configuration to the verification execution module until the previous run has finished;
3. Periodic parallel: after the job is manually submitted for scheduling, data quality verification is performed at the configured period according to the submitted task configuration; the next scheduling run submits the task configuration to the verification execution module regardless of whether the data quality verification of the previous run has finished.
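The difference between the three scheduling modes can be made concrete with a small deterministic simulation. This Python sketch is an illustration only: the function name, the mode strings and the integer time units are hypothetical, and it interprets periodic serial as skipping any trigger that fires while the previous run is still executing, which is one reading of the description above.

```python
def simulate_schedule(mode: str, period: int, run_time: int, horizon: int) -> list:
    """Return the times in [0, horizon) at which a verification run starts."""
    if mode not in ("single", "periodic_serial", "periodic_parallel"):
        raise ValueError(f"unknown scheduling mode: {mode}")
    if mode == "single":
        return [0]  # executed once, immediately after submission
    starts = []
    busy_until = 0
    for tick in range(0, horizon, period):
        if mode == "periodic_serial" and tick < busy_until:
            continue  # previous run not finished: this trigger is not submitted
        starts.append(tick)
        busy_until = tick + run_time
    return starts


# A run that takes 25 time units, triggered every 10 units for 60 units.
single = simulate_schedule("single", period=10, run_time=25, horizon=60)
serial = simulate_schedule("periodic_serial", period=10, run_time=25, horizon=60)
parallel = simulate_schedule("periodic_parallel", period=10, run_time=25, horizon=60)
print(single)    # [0]
print(serial)    # [0, 30]
print(parallel)  # [0, 10, 20, 30, 40, 50]
```

Periodic serial trades freshness for safety when a run can outlast the period, while periodic parallel keeps the cadence at the cost of overlapping runs against the same data sources.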
Third, the verification execution module;
the module mainly obtains indexes such as the problem data and the total number of data rows through the unified sql engine according to the configured task information, and compares them with the threshold to determine whether the alarm condition is met.
Fourth, the data quality report generation and data quality analysis module;
the module generates a data quality report from the output of the verification execution module and analyzes the data quality:
1. Quality report details: generated for each verification task and comprising:
(1) Task information: basic information of the task, such as the task name and task id;
(2) Check object: which databases and tables are checked;
(3) Scheduling information: basic scheduling information, such as the name of the scheduling plan, the name of the executed job, the id of the executed job, the scheduling start time, the scheduling end time, the scheduling mode (single/periodic) and the cron expression of the schedule;
(4) Problem statistics: the number of problems (the number of alarming rules), the amount of problem data and the check result.
2. Data quality analysis: a systematic analysis of the verification results of all verification tasks, comprising quality task statistics, quality problem distribution and data quality scoring;
(1) Quality task statistics: a list of quality tasks is displayed, including the basic information and verification result of each task;
(2) Quality problem distribution: the number of problems and the amount of problem data are each analyzed statistically along six dimensions: completeness, accuracy, validity, uniqueness, consistency and timeliness;
(3) Data quality scoring: the six dimensions (completeness, accuracy, validity, uniqueness, consistency, timeliness) are combined, and 1 point is deducted for each problem that currently exists.
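The quality problem distribution and the data quality scoring can be sketched in a few lines of Python. The source specifies only the six dimensions and the deduction of 1 point per existing problem; the starting score of 100, the floor at 0, the record layout and the function names are assumptions made for illustration.

```python
DIMENSIONS = ("completeness", "accuracy", "validity",
              "uniqueness", "consistency", "timeliness")


def problem_distribution(problems):
    """Count problems and problem rows per quality dimension."""
    dist = {dim: {"problems": 0, "problem_rows": 0} for dim in DIMENSIONS}
    for problem in problems:
        bucket = dist[problem["dimension"]]
        bucket["problems"] += 1
        bucket["problem_rows"] += problem["problem_rows"]
    return dist


def quality_score(problems, base_score=100):
    """Deduct 1 point per existing problem (base score and floor are assumptions)."""
    return max(0, base_score - len(problems))


# Hypothetical alarming rules collected from the verification results.
problems = [
    {"rule": "custom max check", "dimension": "accuracy", "problem_rows": 12},
    {"rule": "not null check", "dimension": "completeness", "problem_rows": 3},
    {"rule": "range check", "dimension": "accuracy", "problem_rows": 5},
]

distribution = problem_distribution(problems)
score = quality_score(problems)
print(distribution["accuracy"])  # {'problems': 2, 'problem_rows': 17}
print(score)                     # 97
```

Because the deduction is per problem rather than per problem row, a rule that flags many rows lowers the score by the same amount as a rule that flags one.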
In summary, by means of the technical scheme of the invention, data in different storage forms can be managed for quality in an intuitive, flexible and unified way, and complex business requirements can be met. By combining the unified SQL-based multi-source heterogeneous data quality verification method and system, problem data across heterogeneous databases can be identified simply and flexibly by means of unified SQL, and the quality of the data can be monitored, alerted on and analyzed by setting thresholds.
Finally, the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all of them should be covered in the claims of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A unified SQL-based multi-source heterogeneous data quality verification method, characterized by comprising the following steps:
S1, creating a quality verification rule;
S2, creating a quality verification task;
S3, creating a quality verification job;
S4, creating a quality verification scheduling plan;
S5, submitting the scheduling plan to execute the verification logic; and
S6, generating a quality report.
2. The unified SQL-based multi-source heterogeneous data quality verification method according to claim 1, wherein creating the quality verification rule in S1 includes filling in basic information of the quality rule, such as a rule name and a rule dimension.
3. The unified SQL-based multi-source heterogeneous data quality verification method according to claim 1, wherein creating the quality verification task in S2 includes filling in basic information of the quality verification task, selecting a database to be verified, selecting and configuring a quality verification rule, and configuring an appropriate alarm threshold.
4. The unified SQL-based multi-source heterogeneous data quality verification method according to claim 1, wherein creating the quality verification job in S3 includes filling in basic information of the quality verification job and selecting the tasks to be verified.
5. The unified SQL-based multi-source heterogeneous data quality verification method according to claim 1, wherein creating the quality verification scheduling plan in S4 includes specifying an execution plan for the job created in step S3, the execution plan being one of single execution, periodic serial and periodic parallel.
6. The unified SQL-based multi-source heterogeneous data quality verification method according to claim 1, wherein, in submitting the scheduling plan to execute the verification logic in S5: if the plan is single execution, the execution unit runs immediately after submission and runs only once; if the plan is periodic serial, the execution unit is scheduled periodically according to the configuration after submission, and if the previous run has not completed, the next run is not started, the execution unit being scheduled again only after the previous run completes; if the plan is periodic parallel, the execution unit is scheduled according to the configured period after submission and is scheduled again at the next period regardless of whether the previous run has completed. The execution unit is mainly responsible for parsing the task configuration and performing the data quality verification.
7. The unified SQL-based multi-source heterogeneous data quality verification method according to claim 1, wherein generating the quality report in S6 includes generating a data quality report and analyzing data quality according to the output of the execution unit.
8. The unified SQL-based multi-source heterogeneous data quality verification method according to claim 6, wherein submitting the scheduling plan to execute the verification logic in step S5 specifically includes the following steps:
S501, generating configuration information for data quality verification;
S502, creating an external table according to the configured data source information, table name and other information, and adding the external table name to the configuration information for data quality verification;
S503, parsing the submitted business configuration and determining the data audit conditions;
S504, according to the submitted configuration information, replacing the table name variables, column name variables and parameter variables of the pseudo SQL with the configured values, and assembling a standard SQL statement that obtains the number of problem rows and a standard SQL statement that obtains the total number of rows;
S505, submitting the standard SQL statements of step S504 to postgres for execution through the Java JDBC API;
S506, the unified SQL engine obtaining the heterogeneous databases, tables and fields associated with the external tables;
S507, the unified SQL engine optimizing and parsing the submitted SQL statements;
S508, converting the SQL statements into query statements executable by the databases associated with the external tables;
S509, the heterogeneous databases executing the queries and returning the query results;
S510, calculating an alarm value from the results returned by the SQL statements in step S509;
S511, comparing the value calculated in step S510 with the set threshold, the alarm condition being met if the value exceeds the threshold;
S512, when the alarm condition is met, marking the rule with an alarm, and sending the alarm information after all rules have been executed; and
S513, collecting the indicators of the verification results and storing whether the alarm condition was met.
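Steps S504 and S510 to S511 can be illustrated with a minimal sketch. It is not the claimed implementation: the `${...}` placeholder syntax, the function names, the choice of the problem ratio as the alarm value, and the sample table are all assumptions, and an in-memory SQLite database stands in for the postgres-based unified SQL engine and JDBC submission of steps S505 to S509.

```python
import re
import sqlite3

# Only plain identifiers may be spliced into SQL text in this sketch.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def render_condition(pseudo_sql, table, column, params):
    """S504: substitute table, column and parameter variables.

    The ${name} placeholder syntax is hypothetical.
    """
    for name in (table, column):
        if not IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = pseudo_sql.replace("${table}", table).replace("${column}", column)
    for key, value in params.items():
        if not isinstance(value, (int, float)):
            raise ValueError(f"only numeric parameters are supported: {key}")
        sql = sql.replace("${" + key + "}", str(value))
    return sql


def check_rule(conn, pseudo_condition, table, column, params, max_ratio):
    """Assemble both statements, run them, and compare with the threshold."""
    condition = render_condition(pseudo_condition, table, column, params)
    problem_sql = f"SELECT COUNT(*) FROM {table} WHERE {condition}"
    total_sql = f"SELECT COUNT(*) FROM {table}"
    problems = conn.execute(problem_sql).fetchone()[0]
    total = conn.execute(total_sql).fetchone()[0]
    # S510: the alarm value is assumed here to be the problem ratio.
    ratio = problems / total if total else 0.0
    # S511: the alarm condition is met when the value exceeds the threshold.
    return {"problems": problems, "total": total,
            "ratio": ratio, "alarm": ratio > max_ratio}


# Hypothetical sample data: two of four rows violate the rule.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("a", 30), ("b", None), ("c", 200), ("d", 41)],
)
result = check_rule(
    conn,
    "${column} IS NULL OR ${column} > ${max_age}",
    table="person", column="age",
    params={"max_age": 150}, max_ratio=0.25,
)
```

Because identifiers are spliced into the statement text, the sketch restricts them to plain names and accepts only numeric parameters; a production system would need a stricter substitution scheme.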
9. A unified SQL-based multi-source heterogeneous data quality verification system, characterized in that the system is used for the unified SQL-based multi-source heterogeneous data quality verification method according to any one of claims 1 to 8, and comprises a task job configuration module, a job scheduling module, a verification execution module, and a data quality report generation and data quality analysis module;
the task job configuration module is mainly responsible for collecting the configuration information of the verification rules, assembling the rule information into task jobs, and handing the task jobs to the job scheduling module for scheduling;
the job scheduling module is mainly responsible for scheduling jobs, and submits the task configuration to the verification execution module once or periodically for data quality verification; three scheduling modes are supported: single scheduling, periodic serial and periodic parallel;
the verification execution module obtains indicators such as the problem data and the total number of data rows through the unified SQL engine according to the configured task information, and compares them with the threshold to determine whether the alarm condition is met;
the data quality report generation and data quality analysis module generates a data quality report according to the output of the verification execution module and analyzes the data quality.
10. The unified SQL-based multi-source heterogeneous data quality verification system according to claim 9, wherein the task job configuration module comprises a rule configuration unit, a task configuration unit and a job configuration unit;
the rule configuration unit is mainly used for creating a quality rule according to business requirements, creating a pseudo SQL statement in the quality rule, and standardizing how pseudo SQL is written;
the task configuration unit is mainly used for selecting appropriate quality rules according to business requirements, configuring the quality rules, selecting data sources, audit tables, audit fields and audit parameter values, and configuring alarm conditions, such as the number of problem rows and the problem ratio;
the job configuration unit is mainly used for combining a plurality of tasks into one job, so that after the job is submitted to the scheduling module the data quality verification of the plurality of tasks is completed in a single scheduling run.
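The three scheduling modes of claims 6 and 9 can be compared with a small deterministic simulation. This is a sketch under stated assumptions, not the claimed scheduler: the function name `start_times`, the mode strings, and the integer time units are hypothetical, and the periodic serial mode is assumed to skip a trigger point that falls while the previous run is still in progress (the claims do not say whether such a trigger is skipped or deferred).

```python
def start_times(mode, period, run_duration, horizon):
    """Return the times at which the execution unit would be started.

    mode: "single", "periodic_serial" or "periodic_parallel" (hypothetical names)
    period: interval between trigger points
    run_duration: how long one verification run takes
    horizon: simulate trigger points in the range [0, horizon)
    """
    if mode == "single":
        # Runs immediately after submission, and only once.
        return [0]
    if mode not in ("periodic_serial", "periodic_parallel"):
        raise ValueError(f"unknown scheduling mode: {mode}")

    starts = []
    busy_until = 0  # time at which the most recent run finishes
    for tick in range(0, horizon, period):
        # Parallel mode starts at every trigger point; serial mode starts
        # only if the previous run has already finished.
        if mode == "periodic_parallel" or tick >= busy_until:
            starts.append(tick)
            busy_until = tick + run_duration
    return starts


# A run that takes 25 time units, triggered every 10 units for 60 units.
serial = start_times("periodic_serial", period=10, run_duration=25, horizon=60)
parallel = start_times("periodic_parallel", period=10, run_duration=25, horizon=60)
```

With a run that outlasts the period, the serial mode starts fewer runs than there are trigger points, while the parallel mode starts one run per trigger point and lets runs overlap.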
CN202110260430.7A 2021-03-10 2021-03-10 Unified SQL (structured query language) -based multi-source heterogeneous data quality verification method and system Pending CN113760681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110260430.7A CN113760681A (en) 2021-03-10 2021-03-10 Unified SQL (structured query language) -based multi-source heterogeneous data quality verification method and system

Publications (1)

Publication Number Publication Date
CN113760681A true CN113760681A (en) 2021-12-07

Family

ID=78786809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110260430.7A Pending CN113760681A (en) 2021-03-10 2021-03-10 Unified SQL (structured query language) -based multi-source heterogeneous data quality verification method and system

Country Status (1)

Country Link
CN (1) CN113760681A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114400082A (en) * 2021-12-30 2022-04-26 杭州火树科技有限公司 Medical data quality monitoring platform

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968793A (en) * 2010-08-25 2011-02-09 大唐软件技术股份有限公司 Method and system for checking on basis of disparate data source data
CN102053961A (en) * 2009-10-27 2011-05-11 中兴通讯股份有限公司 Method and device for SQL statements and system for improving database reliability
US8826084B1 (en) * 2011-09-07 2014-09-02 Innovative Defense Technologies, LLC Method and system for implementing automated test and retest procedures
CN107682373A (en) * 2017-11-21 2018-02-09 中国电子科技集团公司第五十四研究所 A kind of SQL injection defence method based on SQL isomerization
CN107992519A (en) * 2017-10-31 2018-05-04 中国电力科学研究院有限公司 The multi-source heterogeneous data verification system and method for a kind of smart grid-oriented big data
CN108363746A (en) * 2018-01-26 2018-08-03 福建星瑞格软件有限公司 A kind of unified SQL query system for supporting multi-source heterogeneous data
US20180307857A1 (en) * 2015-06-02 2018-10-25 ALTR Solution, Inc. Replacing distinct data in a relational database with a distinct reference to that data and distinct de-referencing of database data
CN110837496A (en) * 2019-11-08 2020-02-25 浪潮云信息技术有限公司 Data quality management method and system based on dynamic sql
CN111400365A (en) * 2020-02-26 2020-07-10 杭州美创科技有限公司 Business system data quality detection method based on standard SQL
CN112256682A (en) * 2020-10-22 2021-01-22 佳都新太科技股份有限公司 Data quality detection method and device for multi-dimensional heterogeneous data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211207