CN115757626A - Data quality detection method and device, electronic equipment and storage medium


Info

Publication number
CN115757626A
Authority
CN
China
Prior art keywords
data
synchronization
quality detection
task
target
Prior art date
Legal status
Pending
Application number
CN202211469462.9A
Other languages
Chinese (zh)
Inventor
梁福坤
Current Assignee
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd filed Critical Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202211469462.9A
Publication of CN115757626A
Legal status: Pending


Abstract

The embodiment of the invention discloses a data quality detection method and device, electronic equipment and a storage medium. The method comprises the following steps: responding to a data synchronization instruction, synchronizing the data before synchronization stored on a source region in a source storage engine to a target region in a target storage engine by running a data synchronization task to obtain synchronized data; acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule; and performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region by running a data quality detection task based on a data quality detection rule to obtain the data quality. The technical scheme of the embodiment of the invention can realize automatic detection of data quality.

Description

Data quality detection method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computer application, in particular to a data quality detection method and device, electronic equipment and a storage medium.
Background
In scenarios involving massive e-commerce, logistics, or city data, business systems handle transactional business operations, that is, common On-Line Transaction Processing (OLTP).
When such massive data is analyzed, it is generally aggregated or synchronized into a data warehouse, which means the data may undergo a change of data storage engine during synchronization and transmission, and that change may affect data quality. At present, data quality is mainly detected manually.
In the process of implementing the present invention, the inventor found the following technical problem in the prior art: the degree of automation is low, resulting in low detection efficiency, high cost, difficulty in guaranteeing quality, and the like.
Disclosure of Invention
The embodiment of the invention provides a data quality detection method and device, electronic equipment and a storage medium, so as to realize automatic detection of data quality.
According to an aspect of the present invention, there is provided a data quality detection method, which may include:
responding to a data synchronization instruction, and synchronizing the data before synchronization stored on a source region in a source storage engine to a target region in a target storage engine by running a data synchronization task to obtain synchronized data;
acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule;
and performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on a data quality detection rule by operating a data quality detection task to obtain the data quality.
According to another aspect of the present invention, there is provided a data quality detecting apparatus, which may include:
a synchronized data obtaining module, configured to synchronize, in response to a data synchronization instruction, pre-synchronization data stored in a source region in a source storage engine to a target region in a target storage engine by running a data synchronization task, to obtain synchronized data;
the data quality detection task acquisition module is used for acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule;
and the data quality obtaining module is used for performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region on the basis of a data quality detection rule by running a data quality detection task to obtain the data quality.
According to another aspect of the present invention, there is provided an electronic device, which may include:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the at least one processor to perform the data quality detection method provided by any embodiment of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium having stored thereon computer instructions for causing a processor to execute a method for data quality detection provided by any of the embodiments of the present invention.
According to the technical scheme of the embodiment of the invention, in response to a data synchronization instruction, data before synchronization stored on a source region in a source storage engine is synchronized to a target region in a target storage engine by running a data synchronization task to obtain synchronized data; acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule; and performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on a data quality detection rule by operating a data quality detection task to obtain the data quality. According to the technical scheme, the data quality detection task is automatically generated based on the source region and the target region with the mapping relation in the data synchronization task and the data quality detection rule, so that the quality detection of the data before synchronization and the data after synchronization related to the data synchronization task is carried out based on the data quality detection task, and the effect of automatic detection of the data quality is achieved.
It should be understood that the statements in this section do not necessarily identify key or critical features of any embodiment of the present invention, nor do they necessarily limit the scope of the present invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a data quality detection method provided according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for data quality detection provided in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of another data quality detection method provided in accordance with an embodiment of the present invention;
FIG. 4a is an architecture diagram of a data quality detection system in an embodiment of the present invention;
FIG. 4b is a flow chart of the operation of the data quality detection system in an embodiment of the present invention;
FIG. 5 is a flow diagram of a metadata collection task in an embodiment of the invention;
FIG. 6 is a flow diagram of manual metadata collection in an embodiment of the present invention;
FIG. 7a is a first diagram of field mapping rule definitions in an embodiment of the present invention;
FIG. 7b is a second diagram of field mapping rule definitions in an embodiment of the present invention;
FIG. 8 is a schematic diagram of a scheduling rule configuration in an embodiment of the present invention;
FIG. 9 is a flow chart of SQL parsing in an embodiment of the invention;
FIG. 10 is a block diagram of an abstract syntax tree in an embodiment of the present invention;
FIG. 11 is a flow chart of registration of data quality detection tasks in an embodiment of the invention;
fig. 12 is a block diagram of a data quality detection apparatus according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of an electronic device implementing the data quality detection method according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. The cases of "target", "original", etc. are similar and will not be described in detail herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flowchart of a data quality detection method provided in an embodiment of the present invention. The embodiment can be suitable for the situation of carrying out automatic detection on the data quality. The method can be executed by the data quality detection device provided by the embodiment of the invention, the device can be realized by software and/or hardware, the device can be integrated on electronic equipment, and the electronic equipment can be various user terminals or servers.
Referring to fig. 1, the method of the embodiment of the present invention specifically includes the following steps:
s110, responding to a data synchronization instruction, synchronizing the data before synchronization stored on a source region in the source storage engine to a target region in the target storage engine by running a data synchronization task, and obtaining the synchronized data.
In practical applications, optionally, the data storage engine may take the form of a database engine, for example, MySQL, PostgreSQL, Hive, ClickHouse, Redis, or ElasticSearch; the source region may be a region within the source storage engine for storing pre-synchronization data, and when the pre-synchronization data is data under a source field within a source table within the source storage engine, the source region may be considered the region storing the data under that source field.
The target storage engine may be a data storage engine for storing synchronized data obtained by synchronizing pre-synchronization data, and in practical applications, the source storage engine and the target storage engine may be homogeneous or heterogeneous data storage engines, which is not specifically limited herein; the target area may be an area for storing synchronized data in the target storage engine, and when the synchronized data includes data in a target field in a target table in the target storage engine, the target area may be considered as an area for storing data in the target field, so that a mapping relationship exists between the source field and the target field in the data synchronization process of this time.
The data synchronization task may be a pre-generated task that can be used to implement data synchronization, and different data synchronization tasks may synchronize pre-synchronization data on a corresponding source region to a corresponding target region, so as to obtain post-synchronization data.
And S120, acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule.
Each data synchronization task may correspond to a data quality detection task for performing quality detection on the pre-synchronization data and the post-synchronization data (hereinafter, the data before and after synchronization), where the quality detection may cover multiple aspects, such as content detection of the consistency, integrity, and accuracy of the data before and after synchronization, timeliness detection, and the like. Of course, relatively simple data synchronization tasks may have no corresponding data quality detection task, which is not specifically limited herein. The data quality detection tasks corresponding to different data synchronization tasks may be the same or different, depending on actual requirements, and this is likewise not specifically limited herein.
The data quality detection task can be automatically generated in advance based on the acquired source region, target region and data quality detection rule or automatically generated in real time during application, wherein the data quality detection rule defines how to perform quality detection on data before and after synchronization, and can be a rule related to interval calculation, data column summation, sampling row, null value rate, enumeration interval and the like; the source region and the target region can be obtained directly, for example, directly from the pre-configuration, or obtained by analyzing the data synchronization task, that is, the corresponding source region and target region can be obtained by automatically mining the task relationship between the pre-synchronization data and the post-synchronization data in the data synchronization task. It should be noted that, in the case where the data quality detection task is generated in real time during application, although the data before synchronization on the source region has been synchronized onto the target region by running the data synchronization task, what storage regions under what data storage engines are respectively the source region and the target region for the data quality detection task is still unknown and needs to be acquired.
S130, performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on a data quality detection rule by running a data quality detection task to obtain data quality.
The data quality detection task is generated based on the obtained source region, the target region and the data quality detection rule, so that quality detection can be performed on the data before synchronization stored on the source region and the data after synchronization stored on the target region by operating the data quality detection task and based on the data quality detection rule, the data quality of the data before and after synchronization is obtained, and the effect of automatically detecting the data quality is realized.
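For illustration only, the following minimal sketch (not part of the claimed scheme) shows what the core of such a detection task could reduce to under a simple comparison in the spirit of the data-column-summation rule mentioned above, assuming both engines are reachable over JDBC; all class and method names here are hypothetical:

    import java.sql.*;

    public class QualityCheckSketch {
        // Runs a scalar aggregate query and returns its single value.
        static long scalar(Connection c, String sql) throws SQLException {
            try (Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                rs.next();
                return rs.getLong(1);
            }
        }

        // Detection rule sketch: after synchronization, the row count and the
        // sum of a numeric column must match between source and target regions.
        public static boolean runCheck(Connection source, Connection target,
                                       String sourceTable, String targetTable,
                                       String column) throws SQLException {
            long srcRows = scalar(source, "SELECT COUNT(*) FROM " + sourceTable);
            long dstRows = scalar(target, "SELECT COUNT(*) FROM " + targetTable);
            long srcSum  = scalar(source, "SELECT COALESCE(SUM(" + column + "), 0) FROM " + sourceTable);
            long dstSum  = scalar(target, "SELECT COALESCE(SUM(" + column + "), 0) FROM " + targetTable);
            return srcRows == dstRows && srcSum == dstSum;
        }
    }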
According to the technical scheme, the mapping relation between the data before and after synchronization can be directly obtained without manual mining, and the mapping relation can be embodied in the aspects of a data storage engine, a data table, a field or a storage area and the like; the data quality detection task can be automatically generated by combining the data quality detection rule based on the mapping relation, so that the data quality detection is finished based on the data quality detection task without manual detection; in addition, the data quality detection scheme can be suitable for data synchronization among any data storage engines, and has good universality.
According to the technical scheme of the embodiment of the invention, in response to a data synchronization instruction, data before synchronization stored on a source region in a source storage engine is synchronized to a target region in a target storage engine by running a data synchronization task to obtain synchronized data; acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule; and performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on a data quality detection rule by operating a data quality detection task to obtain the data quality. According to the technical scheme, the data quality detection task is automatically generated based on the source region and the target region with the mapping relation in the data synchronization task and the data quality detection rule, so that the quality detection of the data before synchronization and the data after synchronization related to the data synchronization task is carried out based on the data quality detection task, and the effect of automatic detection of the data quality is achieved.
An optional technical solution, where a data conversion rule is stored in a data synchronization task, and data before synchronization stored on a source region in a source storage engine is synchronized to a target region in a target storage engine by running the data synchronization task, may include: extracting data before synchronization from a source region in a source storage engine by running a data synchronization task; converting the data before synchronization based on a data conversion rule to obtain converted data; and loading the converted data onto a target area in a target storage engine.
When data conversion is involved in the data synchronization process, the data synchronization task may also be referred to as an Extraction-Transformation-Loading (ETL) task, where a data conversion rule that can be used to implement a data conversion function is stored in the data synchronization task (i.e., the ETL task), and the data conversion rule may be obtained when the ETL task is configured, such as a string interception rule, a string completion rule, a string replacement rule, or an empty operation rule. On the basis, by operating the data synchronization task, after the pre-synchronization data is extracted from the source region, the pre-synchronization data can be converted based on the data conversion rule to obtain converted data, and then the converted data is loaded on the target region, so that the obtained loaded data can be better applied by the target storage engine.
On the basis, optionally, the data quality detection task is generated based on a data conversion rule obtained by analyzing the data synchronization task; performing quality detection on the pre-synchronization data stored on the source region and the post-synchronization data stored on the target region based on a data quality detection rule by executing a data quality detection task, which may include: converting the pre-synchronization data extracted from the source region based on a data conversion rule by operating a data quality detection task; and performing quality detection on the aspect of data content on the converted pre-synchronization data and the post-synchronization data extracted from the target area based on a data quality detection rule. In order to ensure the accuracy of data quality detection, when generating a data quality detection task, the data conversion rule obtained by analyzing the data synchronization task can be referred to in addition to the source region, the target region and the data quality detection rule. On the basis, when a data quality detection task is executed, the data before synchronization extracted from the source region can be converted based on the data conversion rule, and then the quality detection in the aspect of data content is performed on the converted data before synchronization and the data after synchronization extracted from the target region based on the data quality detection rule, for example, whether the numerical values of the data before synchronization and the data after synchronization are the same or not is detected, so that the accuracy of the data quality obtained by the method is ensured.
Another optional technical scheme is that the source region and the target region for generating the data quality detection task are obtained by the following steps: and analyzing the data synchronization task, and acquiring a source region and a target region for generating a data quality detection task according to an analysis result. When a source region and a target region associated with a data synchronization task are written into the data synchronization task in a Structured Query Language (SQL) or Application Programming Interface (API) manner instead of being directly stored in a configuration manner, the data synchronization task needs to be analyzed, so as to obtain the source region and the target region for generating a data quality detection task.
On the basis, optionally, the source region and the target region are written into the data synchronization task in a structured query language mode; analyzing the data synchronization task, and acquiring a source region and a target region for generating the data quality detection task according to an analysis result, may include: acquiring a grammar file, and generating a lexical analyzer and a grammar analyzer based on the grammar file; analyzing the structured query language in the data synchronization task based on a lexical analyzer and a syntactic analyzer to obtain an abstract syntax tree of the structured query language; and analyzing the abstract syntax tree to obtain tree nodes and expression nodes, and obtaining a source region and a target region for generating a data quality detection task according to the tree nodes and the expression nodes. The syntax file can be a file written by a user in combination with the data synchronization task, after the syntax file is obtained, a lexical parser and a syntax parser can be generated based on the syntax file, and then SQL in the data synchronization task is parsed based on the lexical parser and the syntax parser, so that an abstract syntax tree of the SQL, namely a tree representation of an abstract syntax structure of the SQL, is obtained. Each tree node on the abstract syntax tree can represent a structure appearing in SQL, and each expression node can represent an expression appearing in SQL, so that a source region and a target region for generating a data quality detection task can be obtained according to the tree nodes and the expression nodes, which is the key point for automatically generating the data quality detection task in the follow-up process.
In another optional technical solution, the data quality detection method may further include: responding to a task generation instruction aiming at a data synchronization task, screening out a data quality detection rule from all candidate quality detection rules in a data quality detection rule set, or taking a preset quality detection rule in the data quality detection rule set as the data quality detection rule; and acquiring a source region and a target region corresponding to the data synchronization task, and generating a data quality detection task corresponding to the data synchronization task based on the source region, the target region and a data quality detection rule. When a data quality detection task is generated, a preset quality detection rule in the data quality detection rule set can be used as a data quality detection rule, namely, fixed data quality detection rules are used as data quality detection rules which are finally applied, so that a large amount of configuration cost is saved, and the data quality detection task is generated at a lower configuration cost; or based on user selection, the data quality detection rule screened from the data quality detection rule set is used as the data quality detection rule which is finally applied, so that a data quality detection task which is more matched with the actual requirement is obtained.
Fig. 2 is a flow chart of another data quality detection method provided in the embodiment of the present invention. The present embodiment is optimized based on the above technical solutions. In this embodiment, optionally, the source storage engine and the target storage engine are heterogeneous, and the synchronizing of the pre-synchronization data stored on the source region in the source storage engine to the target region in the target storage engine by running the data synchronization task to obtain the synchronized data may include: extracting data before synchronization from a source region in a source storage engine by running a data synchronization task; obtaining an engine mapping rule, and determining a second data type matched with a first data type of data before synchronization in a target storage engine based on the engine mapping rule; mapping the data before synchronization to a second data type to obtain mapped data; and loading the mapped data to a target area in a target storage engine to obtain the synchronized data. The same or corresponding terms as those in the above embodiments are not explained in detail herein.
Referring to fig. 2, the method of the present embodiment may specifically include the following steps:
s210, responding to a data synchronization instruction, extracting data before synchronization from a source region in a source storage engine by running a data synchronization task and running the data synchronization task, wherein the source storage engine and a target storage engine to be synchronized are heterogeneous.
S220, obtaining an engine mapping rule, and determining a second data type matched with the first data type of the data before synchronization in the target storage engine based on the engine mapping rule.
Because different data storage engines may have differences in table and field settings, if data is synchronized between heterogeneous data storage engines, mapping of data types before and after synchronization is required to adapt synchronized data loaded into a target storage engine to the target storage engine.
The engine mapping rules may be preconfigured to indicate what data types under a data storage engine match under another data storage engine, for example, table 1 is some examples of mapping relationships between data types of MySQL- > PostgreSQL, and table 2 is some examples of mapping relationships between data types of MySQL- > CSS/ES. Thus, after obtaining the first data type of the pre-synchronization data, a second data type that the first data type matches in the target storage engine may be determined based on the engine mapping rules.
TABLE 1 Mapping relationships between data types, MySQL -> PostgreSQL
MySQL        PostgreSQL        Mapping supported
BIGINT       NUMERIC|BIGINT    Supported
BINARY       BYTEA             Supported
BIT          BIT               Supported
BLOB         BYTEA             Supported
TABLE 2 Mapping relationships between data types, MySQL -> CSS/ES
MySQL        CSS/ES      Mapping supported
TIME[(fsp)]  DATETIME    Supported; format: hh:mm:ss
YEAR[(4)]    DATETIME    Supported; format: yyyy
CHAR[(M)]    TEXT        Supported
VARCHAR(M)   TEXT        Supported
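For illustration, a minimal sketch of how the engine mapping rules of Tables 1 and 2 might be stored and queried in memory; the class layout and method names are assumptions, not part of the claimed scheme:

    import java.util.Map;

    public class EngineMappingRules {
        // (sourceEngine->targetEngine) -> (first data type -> second data type),
        // mirroring Tables 1 and 2 above.
        private static final Map<String, Map<String, String>> RULES = Map.of(
            "MySQL->PostgreSQL", Map.of(
                "BIGINT", "NUMERIC|BIGINT",
                "BINARY", "BYTEA",
                "BIT", "BIT",
                "BLOB", "BYTEA"),
            "MySQL->CSS/ES", Map.of(
                "TIME", "DATETIME",     // format: hh:mm:ss
                "YEAR", "DATETIME",     // format: yyyy
                "CHAR", "TEXT",
                "VARCHAR", "TEXT"));

        // Returns the matching second data type, or null when unsupported.
        public static String map(String srcEngine, String dstEngine, String firstType) {
            Map<String, String> m = RULES.get(srcEngine + "->" + dstEngine);
            return m == null ? null : m.get(firstType);
        }
    }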
And S230, mapping the data before synchronization to a second data type to obtain mapped data, and loading the mapped data to a target area in a target storage engine to obtain the data after synchronization.
And mapping the data type of the data before synchronization, and mapping the data before synchronization to a second data type to obtain the mapped data. And then, loading the mapped data to the target area to obtain the synchronized data which can be directly applied by the target storage engine.
S240, acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule.
And S250, performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on a data quality detection rule by operating a data quality detection task to obtain data quality.
According to the technical scheme, the data before synchronization is extracted from the source region in the source storage engine by running the data synchronization task, then the second data type matched with the first data type of the data before synchronization in the target storage engine is determined based on the obtained engine mapping rule, the data before synchronization is mapped to the second data type, the data after mapping is obtained, and the data after mapping is loaded to the target region to obtain the data after synchronization, so that the effective data synchronization among heterogeneous data storage engines is ensured.
According to an optional technical scheme, the data quality detection task is generated based on a target mapping rule, and the target mapping rule is obtained by screening from engine mapping rules based on the obtained source region and the obtained target region; the quality detection of the data before synchronization stored on the source region and the data after synchronization stored on the target region is performed by running a data quality detection task based on a data quality detection rule, and the quality detection may include: extracting pre-synchronization data from the source region by operating a data quality detection task, and mapping the pre-synchronization data to a second data type based on a target mapping rule; and performing quality detection on the aspect of data types on the mapped data before synchronization and the data after synchronization extracted from the target area based on a data quality detection rule.
In order to ensure the accuracy of data quality detection, when generating a data quality detection task, the factor of a target mapping rule which is selected from engine mapping rules and is matched with the acquired source region and target region can be referred to in addition to the source region, target region and data quality detection rules. On this basis, when the data quality detection task is executed, the pre-synchronization data extracted from the source region can be mapped to the second data type based on the target mapping rule, and then the quality detection in the aspect of data type is performed on the mapped pre-synchronization data and the synchronized data extracted from the target region based on the data quality detection rule, for example, whether the data types of the mapped pre-synchronization data and the synchronized data are consistent or not is checked, so that the accuracy of the data quality detected by the method can be ensured.
Another optional technical solution, after determining that the first data type of the pre-synchronization data matches the second data type in the target storage engine, the data quality detection method may further include: responsive to the target region not being present within the target storage engine, opening up within the target storage engine a target region available for storing data having the second data type; loading the mapped data onto a target region in a target storage engine may include: and loading the mapped data to the opened target area in the target storage engine. When the target storage engine does not have a target area, or a data table and a field corresponding to the synchronized data are not created in the target storage engine, a target area which can be used for storing the data with the second data type can be found in the target storage engine, and then the mapped data are loaded to the found target area, so that the effective loading of the mapped data is ensured.
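As an illustration of this branch, a sketch under the assumption that opening up a target region amounts to creating the missing table column with the second data type (DDL syntax varies by engine, and all names are illustrative):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class TargetRegionSketch {
        // Creates the target table with the mapped (second) data type when it
        // does not yet exist, so the mapped data can then be loaded onto it.
        public static void ensureTargetRegion(Connection target, String table,
                                              String column, String secondType)
                throws SQLException {
            String ddl = "CREATE TABLE IF NOT EXISTS " + table
                       + " (" + column + " " + secondType + ")";
            try (Statement st = target.createStatement()) {
                st.execute(ddl);
            }
        }
    }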
In yet another optional technical solution, the source region includes a storage region where data under a source field in the source storage engine is located, and the target region includes a storage region where data under a target field in the target storage engine is located; extracting pre-synchronization data from a source region in a source storage engine may include: extracting data under a source field in a source storage engine to obtain data before synchronization; the data quality control method may further include: metadata describing the source fields is determined from the metadata in the metadata set, and a first data type of the pre-synchronization data is determined based on the metadata describing the source fields. After the data before synchronization is extracted from the source field, the field type of the source field can be determined based on the metadata for describing the source field, so that the first data type of the data before synchronization is obtained, and therefore the effect of quickly and accurately acquiring the first data type is achieved.
On this basis, optionally, the metadata is acquired through the following steps: acquiring metadata acquisition configuration, wherein the metadata acquisition configuration comprises a data source type, a data source link, a database to be acquired and a data table; generating a metadata collection task based on the metadata collection configuration; and establishing a link with the database under the data source type based on the data source link by running a metadata acquisition task, and acquiring metadata from a data table in the database with the established link. In order to realize automatic collection of metadata, a metadata collection configuration can be obtained, which can indicate that metadata is collected from what data table under what database under what data source type, and besides, how to link to the database to be collected. On the basis, the metadata collection task is generated based on the metadata collection configuration, so that the metadata collection task can be operated, the database is linked based on the data source link, and then the metadata can be collected from the data table in the database with the link established, so that the effect of automatically collecting the metadata is realized.
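For illustration, a sketch of the automated collection path, assuming a JDBC-style data source link; java.sql.DatabaseMetaData is a standard API, while the configuration record and the task wrapper are assumptions:

    import java.sql.*;

    public class MetadataCollectionTask {
        // Metadata collection configuration: data source type, data source link,
        // database to be collected, and data table (field names illustrative).
        record CollectConfig(String sourceType, String jdbcUrl,
                             String database, String table) {}

        public static void run(CollectConfig cfg, String user, String pass)
                throws SQLException {
            // Establish a link with the database based on the data source link.
            try (Connection conn = DriverManager.getConnection(cfg.jdbcUrl(), user, pass)) {
                DatabaseMetaData md = conn.getMetaData();
                // Collect content metadata (column name, type, length) from the table.
                try (ResultSet cols = md.getColumns(cfg.database(), null, cfg.table(), "%")) {
                    while (cols.next()) {
                        System.out.printf("column=%s type=%s size=%d%n",
                                cols.getString("COLUMN_NAME"),
                                cols.getString("TYPE_NAME"),
                                cols.getInt("COLUMN_SIZE"));
                    }
                }
            }
        }
    }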
Fig. 3 is a flow chart of another data quality detection method provided in an embodiment of the invention. The present embodiment is optimized based on the above technical solutions. In this embodiment, optionally, after responding to the data synchronization instruction, the data quality detection method may further include: acquiring a scheduling task, and sequentially scheduling a data synchronization task and a data quality detection task by running the scheduling task; synchronizing tasks by running data, including: synchronizing tasks by running the scheduled data; by running data quality detection tasks, including: the task is detected by running the scheduled data quality. The same or corresponding terms as those in the above embodiments are not explained in detail herein.
Referring to fig. 3, the method of this embodiment may specifically include the following steps:
s310, responding to the data synchronization instruction, acquiring a scheduling task, and sequentially scheduling the data synchronization task and the data quality detection task by running the scheduling task.
In order to avoid the data quality detection process from affecting the data synchronization process, or the added/deleted data quality detection process from affecting the original data quality detection process, scheduling tasks may be registered in advance, and then the data synchronization task and at least one data quality detection task are sequentially scheduled based on the scheduling tasks. The scheduling is added in a task mode, the original service data stream cannot be damaged, so that data quality inspection can be performed along with a data synchronization task in a non-invasive manner, and convenience of a data quality detection mode in changing is guaranteed.
And S320, synchronizing the data before synchronization stored on the source region in the source storage engine to the target region in the target storage engine by running the scheduled data synchronization task to obtain the synchronized data.
S330, acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule.
S340, performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on a data quality detection rule by operating the scheduled data quality detection task to obtain data quality.
According to the technical scheme of the embodiment of the invention, the data synchronization task and the data quality detection task are sequentially scheduled by running the scheduling task, so that the data quality inspection can be carried out along with the data synchronization task based on a non-intrusive mode, and the convenience of the data quality detection mode in the changing process is ensured.
An optional technical solution, which sequentially schedules a data synchronization task and a data quality detection task, may include: acquiring a synchronous increment range, and sequentially scheduling a data synchronization task and a data quality detection task based on the synchronous increment range; synchronizing pre-synchronization data stored on a source region within a source storage engine to a target region within a target storage engine by executing a scheduled data synchronization task may include: synchronizing pre-synchronization data which is stored on a source region in a source storage engine and corresponds to a synchronization increment range to a target region in a target storage engine by running a scheduled data synchronization task; performing quality detection on the pre-synchronization data stored on the source region and the post-synchronization data stored on the target region based on a data quality detection rule by running the scheduled data quality detection task may include: and performing quality detection on the data before synchronization corresponding to the synchronous increment range stored on the source region and the data after synchronization corresponding to the synchronous increment range stored on the target region by operating the scheduled data quality detection task based on a data quality detection rule.
The data synchronization process may be a full-volume synchronization process or an incremental synchronization process. In the case of incremental synchronization, a synchronization incremental range may be obtained first, where the synchronization incremental range may represent an incremental range of the current data synchronization, and in practical application, the synchronization incremental range may be selectable, and may be a time range or an ID start-stop range; then, when the data synchronization task is scheduled based on the scheduling task, the synchronization increment range can be used as a variable to be transmitted to the data synchronization task, so that when the scheduled data synchronization task is operated, data which is stored on a source region in the source storage engine and corresponds to the synchronization increment range can be used as data before synchronization; further, when the data quality detection task is scheduled based on the scheduling task, the synchronization increment range may be again transferred to the data quality detection task as a variable so that, when the scheduled data quality detection task is executed, the quality detection may be performed on the pre-synchronization data corresponding to the synchronization increment range stored on the source region and the post-synchronization data corresponding to the synchronization increment range stored on the target region based on the data quality detection rule. By the technical scheme, the accuracy of data synchronization and the accuracy of data quality detection are ensured.
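For illustration, the scheduling order described above can be sketched as follows, assuming the synchronization increment range is handed to both tasks as a variable (the task interface is an assumption):

    public class SchedulingSketch {
        interface RangedTask { void run(long rangeStart, long rangeEnd); }

        // Schedules the data synchronization task first, then the data quality
        // detection task, passing both the same synchronization increment range.
        public static void schedule(RangedTask syncTask, RangedTask qualityTask,
                                    long rangeStart, long rangeEnd) {
            syncTask.run(rangeStart, rangeEnd);     // synchronize the increment
            qualityTask.run(rangeStart, rangeEnd);  // then detect its quality
        }
    }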
In order to better understand the above technical solutions as a whole, the following description is given by way of example with reference to specific examples. The operation of the data quality detection system shown in fig. 4b will be described with reference to the flowchart of fig. 4 a. In particular, the method comprises the following steps of,
the method comprises the following steps: data source configuration
The main content of this step is to specify the data source subject and the target subject of the synchronized data, which is conventionally a database engine. Although most database/warehouse engines support reading and writing data in an SQL manner (such as MySQL and Hive), engines such as Redis and ElasticSearch read and write data quite differently; therefore, for better universality, corresponding ETL processing can be performed during data synchronization.
Registering a data source mainly means establishing a link to the database and authenticating, so that the source and destination libraries can be referenced when data synchronization tasks are subsequently run. Whether a given library is the source or the target is determined by the data flow direction configured in the data synchronization task.
The main contents of the data source configuration include:
1) Database type: the type of the database, such as Hive, MySQL, PostgreSQL, Redis, HBase, or Impala. Distinguishing database types matters because different database links correspond to different data engine drivers, and because the data flow configured in an ETL task (i.e., a data synchronization task) is thereby distinguished as homogeneous or heterogeneous; for the latter, conversion compatibility processing for field types not supported across the databases must be performed in reliance on the engine mapping rules;
2) Data link address and port number: the link address may be an IP, a domain name, or a machine name, and is mainly used for addressing. Generally, parameters such as time zone settings and character encoding rules can additionally be appended when linking to the address and port;
3) Authentication mode: no authentication, CA certificate, account name and password, and the like;
4) Authentication elements: the specific authentication elements required by the mode selected in 3), such as collecting a user name and password (a configuration sketch follows this list).
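For illustration, the four configuration items above might be carried by a structure like the following sketch; all field names are assumptions:

    public class DataSourceRegistration {
        enum AuthMode { NONE, CA_CERTIFICATE, USERNAME_PASSWORD }

        // 1) database type; 2) link address, port, and extra link parameters
        // (time zone, character encoding); 3) authentication mode;
        // 4) authentication elements.
        record DataSourceConfig(String databaseType, String host, int port,
                                String extraParams, AuthMode authMode,
                                String user, String password) {}

        public static void main(String[] args) {
            DataSourceConfig cfg = new DataSourceConfig(
                    "MySQL", "10.0.0.1", 3306,
                    "serverTimezone=UTC&characterEncoding=utf8",
                    AuthMode.USERNAME_PASSWORD, "etl_user", "secret");
            System.out.println(cfg);
        }
    }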
Step two: metadata collection
Metadata is data used to describe data; its greatest benefit is that it formats the description and classification of information, providing a basis for the data quality detection and field rule mapping of the subsequent steps. Metadata for a database may generally include content metadata and operation metadata, where the content metadata may include, for example, databases (names, types), tables/partitions/buckets (names, types, remarks, etc.), columns (names, types, lengths, precisions, etc.), indexes (names, types, fields, etc.), and constraints (types, fields, default values, required entries, enumerated types, etc.), and the operation metadata may include data generation (generation time, job information, etc.), table access (queries, associations, aggregations, etc.), and field access (queries, associations, aggregations, filters, etc.).
On this basis, a flowchart of a metadata collection task (hereinafter, may be simply referred to as a collection task) may be as shown in fig. 5, specifically:
(1) Metadata collection definition: this mainly completes the configuration of metadata collection and comprises automatic collection definitions and manual collection definitions. An automatic collection definition completes automatic collection of metadata by configuring different data source definitions (such as data source type, data source link, and the database and data table to be collected) in combination with the subsequent collection task operation mechanism. A manual collection definition is achieved by manually submitting metadata that meets the metadata standard format and storing it in the metadata database, mainly for content metadata that is inconvenient to obtain, or for operation metadata. The manual collection definition content may include metadata name, description, project, data source type, data source link, index/Topic, data format, and field configuration, among others. The flow of manual metadata collection is shown in fig. 6 and includes page-by-page manual addition and maintenance by batch Excel upload. Specifically, the user clicks [new metadata] on the manual collection page, fills in the project contents on the newly created manual metadata page, and uploads field information in batch; batch import can be realized by uploading an Excel file, whose contents are checked in the background, while non-batch import adds metadata fields one by one. The foreground then checks the integrity of the metadata; after the check passes, the metadata details can be viewed on the manual metadata detail page, and if the check fails, the form is returned for the items to be filled in again. It should be noted that manually collected metadata does not require collection task scheduling.
(2) And (3) scheduling collection tasks: the collection task scheduling supports the automated collection of metadata, and metadata can be collected in a timed manner or through notification after metadata change.
(3) Collecting task examples: this is an example when automatically collecting metadata, which may reveal the contents of collecting task lists, sorting ways, task running states, and task logs, etc.
(4) And (3) metadata query: the query for the operation metadata and the content metadata is provided by a metadata query service, and of course, other management metadata and the like are also provided, which are not specifically limited herein. In practical applications, the metadata may represent different data source types, library lists, table information, and the like.
Step three: engine mapping rule storage
There may be differences in the settings of tables and fields between different data storage engines, and if data is synchronized between heterogeneous data storage engines, it is necessary to map the types of data before and after synchronization, so that the synchronized data loaded into the target storage engine is more adaptive to the target storage engine. The specific engine mapping rules have been illustrated above, and are not described herein again. It should be noted that the meaning of the engine mapping rule setting is as follows:
1) The data type requires switching to a corresponding quality-assurance function: for example, the BIT[(M)] type in MySQL corresponds to BOOLEAN|LONG|TEXT, and an error would be reported if sum were used where the finite-set technique applies.
2) Quality assurance for data within a limited range: for example, MySQL has the enum type, a collection of a limited number of values (for instance gender: male, female, and unknown), which corresponds to the VARCHAR type in PostgreSQL; since VARCHAR accepts content relatively freely, quality cannot be guaranteed at the database level, so the original table type can be referred to during quality assurance.
This step is an optional step, and is usually performed when a new database type is added. The engine mapping rules may be stored in a rule configuration library for subsequent real-time synchronization and subsequent application upon change.
Step four: ETL configuration
The ETL task is a main operation for performing data synchronization, and the main task is to obtain data from the source database a, perform ETL processing, and then write the data into the target database B. Wherein:
1) And maintaining basic information of the ETL configuration, such as task names, task descriptions, data labels and the like.
2) Configuring a data source (i.e., the source storage engine) and a data target (i.e., the target storage engine): taking the data source as an example, the data source type, data source name, and database name can be defined, and on that basis the table name to read can be defined. In practical applications, optionally, a corresponding prefix may be added to the synchronized table in the data target; for example, for a table named dim_caipin, a prefix t_# (where # represents the previous table name) may be added uniformly, so that the target table name becomes t_dim_caipin. If no corresponding table exists in the target storage engine, the table may be created the first time according to a default syntax.
3) Defining a field mapping rule so as to complete the mapping from the source field in the source storage engine to the target field in the target storage engine. Taking the mapping relationship of the structured data as an example, the source field is preceded and the target field is followed, and there are two defining ways:
the method I comprises the following steps: explicitly specifying the mapping relationship between the source field and the target field in a graphical manner, as shown in fig. 7a, which respectively shows the name, type and description of the source field and the target field;
the second method comprises the following steps: the method is defined by SQL or other API modes, is based on SQL language, HBase Get and other modes during query, and uses insert grammar or HBase Put grammar during insertion, so that the flexibility is stronger, and the professional degree is higher. In short, data reading and writing are performed according to the sql or non-sql mode provided by the data storage engine. Illustratively, see FIG. 7b, which gives a sample of field mappings defined by SQL, the SQL input here is 1select 1as,2as b, 'sd' as c.
4) Configuring data conversion rules, and optional steps, such as:
string interception rule: converting data into character string before synchronization, and intercepting partial content
String completion rules: the data before synchronization is converted into a character string, and the character string can be filled to a specified length in front of, behind, or in the middle of the character string.
String replacement rules: and replacing the hit character strings through a regular matching rule.
Null operation rule: other specified content may be filled for null values.
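A minimal sketch of these four optional rules as plain string functions, assuming the pre-synchronization value has already been converted to a string (method names are illustrative):

    public class ConversionRules {
        // String interception rule: intercept part of the content.
        static String intercept(String s, int begin, int end) {
            return s.substring(begin, Math.min(end, s.length()));
        }

        // String completion rule: pad in front of the string to a given length
        // (padding behind or in the middle would be analogous).
        static String padFront(String s, int length, char fill) {
            StringBuilder b = new StringBuilder();
            while (b.length() + s.length() < length) b.append(fill);
            return b.append(s).toString();
        }

        // String replacement rule: replace substrings hit by a regular rule.
        static String replaceByRegex(String s, String regex, String replacement) {
            return s.replaceAll(regex, replacement);
        }

        // Null operation rule: fill other specified content for null values.
        static String fillNull(String s, String specifiedContent) {
            return s == null ? specifiedContent : s;
        }
    }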
Step five: scheduling task registration
Scheduling rules are set for the configured ETL tasks. For example, referring to fig. 8, the scheduled task may use full synchronization, that is, emptying the target database and synchronizing fresh data source data each time, or incremental synchronization, in which an increment field must be determined and each scheduled run updates the target data rows according to that field; the pull start date and increment field illustrated in fig. 8 are a specific example of a synchronization increment range. In practical applications, optionally, for incremental synchronization the increment field generally supports the following types. Primary key ID: a numeric value such as a database auto-increment ID is used as the primary key; the ID synchronized last time is recorded each time, and the next task keeps iterating from ID+1 until the last record. Date, datetime, and timestamp types: synchronization proceeds according to time; the time agreed last time is recorded, and the time is recorded again after the task executes.
In addition, the task execution configuration can be realized in one of the following two ways:
1) Timed tasks
This approach is applicable when there is no dependency on upstream data, for example independent events between the various tables (similar to the collection of pel data) that have few dependencies.
2) Dependency triggering
Besides the acquisition tasks executed at regular time, most of the data synchronization tasks have dependency relationships, and downstream needs to initiate synchronization content immediately after the synchronization of the dependent source data, so that the data synchronization method has the capability of quick response and timely synchronization. In practical applications, the upstream and downstream dependent triggers may be defined by way of a workflow, and the triggering of downstream ETL tasks may be accomplished by configuration of an upstream workflow or the like.
Step six: ETL task resolution
And the ETL task analysis is to automatically discover the configuration of data synchronization, ETL data processing and circulation aiming at the ETL configuration in the step four, so that the registration of a data quality detection task is facilitated.
For the graphical mode of method one in step four, the mapping relation between the source field and the target field is explicitly specified and the content comparison relation is determined: the mapping from the source data to the database, table, and field of the target-end data has a direct mapping relationship with the ETL content and can be obtained directly.
In method two of step four, however, the source is defined through SQL or another API and written into the target library, and deriving the comparison relationship between the two sides requires parsing the ETL configuration. Taking the SQL mode as an example, different data storage engines differ in their SQL parsing rules; in practical applications these may be parsed with solutions commonly used in open-source projects, such as Antlr, or with an engine's own official parser (for example, the drivers the mysql, oracle, and impala engines themselves use for parsing, checking, and optimizing).
The general-purpose tool Antlr is taken as the example here; the SQL parsing process is shown in fig. 9. Antlr4 is an open-source project implemented in Java: the user writes a grammar file with the .g4 suffix, and Antlr4 automatically generates a lexical parser and a syntax parser, exposing to the user the parsed abstract syntax tree plus easy-to-use Listener and Visitor base classes. With an existing grammar file and the open-source library, the user only needs to pass in SQL to obtain its abstract syntax tree. For example, for the input SQL: SELECT c1, c2 FROM t1 AS t2 WHERE c3 > c4, the abstract syntax tree produced by Antlr parsing is shown in fig. 10.
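Illustratively, with Antlr4's Python target the same flow looks as follows (the generated class names SqlLexer/SqlParser and the entry rule name depend entirely on the chosen .g4 grammar, so they are assumptions here):

from antlr4 import InputStream, CommonTokenStream
from SqlLexer import SqlLexer    # generated by: antlr4 -Dlanguage=Python3 Sql.g4
from SqlParser import SqlParser  # hypothetical grammar name

sql = "SELECT c1, c2 FROM t1 AS t2 WHERE c3 > c4"
lexer = SqlLexer(InputStream(sql))
parser = SqlParser(CommonTokenStream(lexer))
tree = parser.statement()  # entry rule name is grammar-specific
print(tree.toStringTree(recog=parser))  # textual view of the abstract syntax tree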
During data quality analysis, several aspects of the abstract syntax tree deserve attention:
1) Data results: attend to the content of the select elements; with a jar driver, the returned data results are wrapped into an SQLStatement target object. This part corresponds to c1 and c2 in fig. 10, whose query results are to be inserted into the target engine. Data quality analysis mainly compares whether the source data and the target database data are accurate, whether the query interval ranges (such as the primary key id, the business primary key, or the update time) are consistent, and whether the field values and types of the data query are consistent with the stored fields (the correspondence).
2) SQL incremental filtering condition: for full synchronization, no filtering condition is needed; for increments, the change interval must be specified in the SQL. For example, an id condition can be defined: SELECT c1, c2 FROM t1 AS t2 WHERE c3 > c4 AND c1 > $ID$, where $ID$ is an ID auto-increment placeholder; if no such auto-increment value can be set, other ranges such as an update timestamp, e.g. $UPDATE_TIME$, can be defined and replaced with a specific value when the data quality detection task runs.
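Illustratively, replacing such placeholders with concrete values at run time can be as simple as a textual substitution (a sketch; the placeholder convention follows the $ID$ and $UPDATE_TIME$ examples above):

def bind_placeholders(sql_template, **values):
    # Substitute $NAME$ placeholders with the values known at scheduling time
    for name, value in values.items():
        sql_template = sql_template.replace(f"${name}$", str(value))
    return sql_template

sql = bind_placeholders(
    "SELECT c1, c2 FROM t1 AS t2 WHERE c3 > c4 AND c1 > $ID$",
    ID=20000,
)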
Analysis of the abstract syntax tree makes the result columns and the changing increment field explicit, and the same comparison mapping as method one of step four can then be realized according to the storage rule settings.
It should be noted that other API data sources may likewise be defined by RestFul API requests following certain rules: JSON Schema analysis is performed on the data format returned by the interface, and the corresponding data source definition is then obtained; this is not detailed here.
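Illustratively, deriving a rough data source definition from one sample record of an interface response might look like this (a simplification; real JSON Schema analysis also handles nesting, formats, and required fields):

def infer_fields(sample_record):
    # Map JSON value types to coarse column types for a data source definition
    type_map = {bool: "boolean", int: "bigint", float: "double", str: "varchar"}
    return {key: type_map.get(type(value), "varchar")
            for key, value in sample_record.items()}

print(infer_fields({"id": 1, "price": 9.9, "name": "sd"}))
# {'id': 'bigint', 'price': 'double', 'name': 'varchar'}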
Step seven: field mapping rule storage
Whether through the direct mapping relation of method one or through results obtained from SQL or API data source definitions, parsing associates the source fields with the target fields together with the configured ETL processing, and the obtained result is persisted in the rule configuration library.
Step eight: data quality detection task registration
Based on the configuration of the above steps, the acquired settings at the synchronization, ETL, and incremental-configuration levels enter the following data quality detection task setup, as shown in fig. 11. Specifically:
1) Read the increment/full field parameters, i.e., the increment field parameters stored in the database for the increment field of step four;
2) Read the field mapping rules, receiving the field mapping rules stored in the rule configuration library in step seven;
3) Read the ETL configuration to obtain the data conversion, padding, and default-value settings;
4) Read the engine mapping rules, which depend on the content of step three.
All of the above steps prepare the read statement for data quality detection at the source and the read statement against the target library; finally the two result sets are compared to see whether they are equal. If so, the quality check passes; otherwise, the check fails.
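Illustratively, the core comparison then reduces to executing the two prepared read statements and testing the results for equality (a sketch over the standard Python DB-API; it assumes both statements return rows in a deterministic order, e.g., ordered by ID):

def check_quality(source_conn, target_conn, source_sql, target_sql):
    # Run the source-side and target-side read statements and compare
    src_cur = source_conn.cursor()
    src_cur.execute(source_sql)
    tgt_cur = target_conn.cursor()
    tgt_cur.execute(target_sql)
    # Equal result sets: quality check passed; otherwise: check failed
    return src_cur.fetchall() == tgt_cur.fetchall()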
To illustrate how the four read steps above unfold in sequence, we simulate two data synchronization scenarios for presentation.
Scenario one: the field mapping rules are obtained by way of interface configuration, and no ETL task parsing is needed. Data are obtained from database DB_A, table T1, of the data source mysql;
the synchronized fields are ID (bigint, primary key, auto-increment), C1 (varchar), SEX (ENUM), C3 (double);
the increment field rule is: ID auto-increment;
the ETL rule is: if the SEX field has no value, set it to the default value 0 (unknown).
The target library is then the Postgresql database DB_B, the insertion table is ods_T_T1, and the fields are: ID (bigint, no primary key information), C1 (varchar), SEX (varchar), C3 (FLOAT8). The table name ods_T_T1 is obtained by the unified table-name prefix rule (adding the ods_T_ prefix), and the marked parts of the field types (varchar for SEX, FLOAT8 for C3) are the corresponding changes made by the engine mapping rules in 4).
Obtaining the source data interface configuration and converting it into SQL gives: Select ID, C1, SEX, C3 from DB_A.T1 where ID between $ID_START$ and $ID_END$; the SQL statement that reads the target's inserted data is then: Select ID, C1, SEX, C3 from DB_B.ods_T_T1 where ID between $ID_START$ and $ID_END$.
Scenario two: for acquisition by means of SQL or API; here a scenario combining an aggregate query and a multi-table join query is taken as the example.
Data are retrieved from database DB_A, tables T1/T2, of the data source mysql:
Select t1.t_day, t2.c_shop_name, sum(t1.price) as total_price, count(t2.online) as online_count from DB_A.T1 as t1 join DB_A.T2 as t2 on t1.shop_id = t2.shop_id where t1.order_time between $DATE_START$ and $DATE_END$ group by t1.t_day, t2.c_shop_name
The target library is then the Postgresql database DB_B and the insertion table is ods_T_T1; the data of the target table are obtained as follows:
Select t_day, shop_name, total_price, online_count from DB_B.ods_T_T1 where update_time between $DATE_START$ and $DATE_END$.
Further, to complete automatic data quality detection, a corresponding data quality detection rule base needs to be available for taking effect, and corresponding task discovery needs to be performed. Fig. 11 gives an example of some of the data quality detection rules; rules can be read from the rule base, so newly generated data quality detection rules can take effect quickly.
Continuing with the two scenarios above, the data quality detection rules are illustrated. Interval counting counts the row records of the source and the target and checks whether the two sides agree; otherwise there is data duplication or data loss. The general rule is to obtain the corresponding data on each side, apply the relevant part of the ETL task, and compare with the target data rows. Note that even an equal count does not prove that no data were lost; multiple rules must be validated jointly. Interval counting can be applied to all kinds of data synchronization tasks.
For scenario one, the detection rules for the target and the source are respectively as follows, wrapping an outer statement around the original query SQL:
Select count(1) num from (
Select ID, C1, SEX, C3 from DB_A.T1 where ID between $ID_START$ and $ID_END$
) t
Select count(1) num from (
Select ID, C1, SEX, C3 from DB_B.ods_T_T1 where ID between $ID_START$ and $ID_END$
) t
Finally, compare whether the two num values are consistent.
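Illustratively, wrapping an arbitrary synchronization query into such a count check is mechanical (a sketch; the alias t mirrors the statements above):

def wrap_count(original_sql):
    # Package an outer counting statement around the original query SQL
    return f"Select count(1) num from ( {original_sql} ) t"

source_check = wrap_count(
    "Select ID, C1, SEX, C3 from DB_A.T1 "
    "where ID between $ID_START$ and $ID_END$"
)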
For data column summation, sum operations can be performed on source columns of numeric or boolean type; this rule can be used for ID and C3 described above in scenario one, and for the total_price and online_count fields of scenario two.
For scenario two, the detection rules for the target and the source likewise wrap an outer statement around the original query SQL:
Select sum(total_price) num from (
Select t1.t_day, t2.c_shop_name, sum(t1.price) as total_price, count(t2.online) as online_count from DB_A.T1 as t1 join DB_A.T2 as t2 on t1.shop_id = t2.shop_id where t1.order_time between $DATE_START$ and $DATE_END$ group by t1.t_day, t2.c_shop_name
) t
Select sum(total_price) num from (
Select t_day, shop_name, total_price, online_count from DB_B.ods_T_T1 where update_time between $DATE_START$ and $DATE_END$
) t
Finally, compare whether the two num values are consistent.
For sampling rows, all columns of a subset of rows are selected from the data and compared with the target database to check whether every column remains fully consistent. This per-record comparison is effective, but different database engine rules define data types differently, so conversion, encoding, and decoding operations may be needed to keep values comparable. For example, scenario one can sample by extracting ID in (20001, 20002, 20003). The sampling logic mostly obtains part of the row data from the source and then checks whether the remaining columns are equal: through the auto-increment primary key in scenario one, or through a joint unique key, for example shop_name plus t_day in scenario two, which is obtained by parsing the ETL abstract syntax tree; all columns can also be used as the where condition to check whether the corresponding result record is found at the target. The reason an auto-advancing time column such as update_time is not used as the comparison condition is that time is not unique: different orders can land in the same millisecond.
An example for scenario one:
Select * from (
Select ID, C1, SEX, C3 from DB_A.T1 where ID between $ID_START$ and $ID_END$
) t
Where ID = random($ID_START$, $ID_END$)
It should be noted that random($ID_START$, $ID_END$) can randomly select a specific ID within the interval range when the service-side task runs; the outer wrapping does not lower SQL query efficiency, because the predicate is pushed down when the parsed SQL is optimized.
Select * from (
Select ID, C1, SEX, C3 from DB_B.ods_T_T1 where ID between $ID_START$ and $ID_END$
) t
Where ID = random($ID_START$, $ID_END$)
the above results are compared field by field to see if they are consistent.
It should be noted that the ID used is consistent with the source; for example, if the sampled ID is 20001, the same value is used when querying the target library.
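Illustratively, the field-by-field comparison of one sampled row might be sketched as follows (it assumes rows are fetched as dictionaries keyed by column name; type normalization per the engine mapping rules is elided):

def compare_row(source_row, target_row, columns=("ID", "C1", "SEX", "C3")):
    # Compare every column of a sampled row between source and target
    mismatches = [c for c in columns if source_row[c] != target_row[c]]
    return mismatches  # an empty list means the row is fully consistent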
For the enumeration interval, as in scenario one: SEX is an enum type in mysql (the gender field is enumerable) but a varchar type in postgresql, so it is also necessary to check whether values that satisfy the enumerated type in mysql are satisfied in the target engine. One check is whether the counts of the two sides are equal under count(distinct enum); another is whether the target has any record with sex not in ('0', '1', '2'), and if so, that is also an anomaly. Of course, there are many other data quality detection rules, which are not listed here.
This step automatically generates the data quality detection task; the results are then compared during actual operation, with results that meet expectations passing and those that do not triggering an alarm.
In addition, during scheduling-task registration, automated data quality detection needs to be injected automatically after the data synchronization task, for two main reasons. 1. Timeliness: data quality detection follows the ETL task, so the two must be ordered within the schedule, and detection should run only after the ETL task completes. 2. Context awareness: the synchronization increment range of the current run is determined when the task is scheduled and synchronized, and can be transmitted as a variable when the scheduling task calls the data quality detection task, as described above. In practical applications, the dependency content of step five is registered, a monitored dependency can then be added, and the detection task is started by the scheduler once the synchronization task completes; this part can be handled by a conventional scheduling system.
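Illustratively, the injection amounts to chaining the two tasks and forwarding the synchronization increment range as a variable (a sketch; the task objects and their run methods are hypothetical):

def run_pipeline(sync_task, quality_task):
    # Timeliness: quality detection runs only after the ETL task completes;
    # context awareness: the increment range is passed through as a variable
    increment_range = sync_task.run()  # e.g. returns (20001, 21000)
    if increment_range is not None:
        quality_task.run(increment_range=increment_range)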
Step nine: manual configuration of data quality detection tasks (optional step)
In addition to automatically generating data quality detection tasks, the automatically generated detection items may also be manually pruned, because the more data quality detection tasks run, the greater the pressure on the data storage engine. Conversely, extra tasks may be specified for execution by manually supplementing new data quality detection tasks.
As with the data quality detection rules output in step eight, the corresponding source and target detection SQL can be written directly by hand, after which the scheduling task is registered for execution.
Step ten: data quality detection task execution
According to the scheduling-task registration of step five, the data synchronization task and the data quality detection task are scheduled and run in turn. In practical applications, optionally, the scheduling system may rely on linux crontab or windows timed tasks, or specific scheduling operations may be performed through open-source schedulers such as Azkaban, Hera, or Easy Scheduler, which is not specifically limited here.
The data quality detection task can be executed on a timer or triggered into execution, and can provide execution state, logging, result checking, and similar functions, making it convenient to track scheduled tasks, troubleshoot problems, or inspect runs.
Step eleven: data quality alarm
A data quality alarm is the judgment of the data quality check's result during task execution or upon its completion. Alarms can be classified by check success or failure, or follow modes such as alarming within an interval range or upon exceeding an interval range. In practical applications, optionally, alarm levels may be classified as error, warn, or success, and the alarm content may be written to a persistent library to facilitate statistics, analysis, and display in subsequent steps. Optionally, the alarm result may be bound to the executed data quality detection task and the target task of data verification, so as to record the execution time, feedback result details, result comparison, and other content; meanwhile, notification may be performed by an alarm service, such as a short message, an email, or an in-station message, so that the relevant personnel obtain the alarm notification in real time.
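Illustratively, folding a check result into the alarm levels above might look like this (a sketch; the tolerance threshold is a business-specific assumption):

def classify_alarm(source_value, target_value, tolerance=0):
    # Equal results succeed; deviations within the interval range warn;
    # deviations beyond the interval range are errors
    diff = abs(source_value - target_value)
    if diff == 0:
        return "success"
    return "warn" if diff <= tolerance else "error"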
Step twelve: data quality visualization/data quality reporting
For the data quality alarm content fed back in step eleven, visualization can be performed on a large screen; the large-screen visualization can be configured across layers such as tasks, task volume, data topics, time dimensions, and business dimensions. With the output of step eleven, template content can be filled in according to a data quality template, generating a matching data quality report.
At this point, the data quality detection system automatically generates data quality detection tasks by automatically discovering the task relationship between source data and target data in the data synchronization task, and supports performing data quality detection after the data synchronization task in a non-invasive manner; besides automatic quality comparison, it also supports manually confirmed additions to quickly define quality detection. In particular:
At the cost level, relations between data can be mined from the data synchronization task, and certain fixed quality detection items serve as automatic task items, saving a large amount of configuration cost.
At the data quality level, content detection can cover data consistency, data integrity, and accuracy; data timeliness can also be detected; and configuration of business-specific thresholds is supported, so the coverage is relatively complete.
At the flexibility level, synchronous detection of data across heterogeneous and homogeneous systems can be achieved, and both automatic and manually confirmed configuration are possible, giving wider universality.
Fig. 12 is a block diagram of a data quality detection apparatus according to an embodiment of the present invention, which is configured to execute the data quality detection method of any of the embodiments. The apparatus and the data quality detection methods of the embodiments belong to the same inventive concept; for details not described in the apparatus embodiments, refer to the method embodiments. Referring to fig. 12, the apparatus may specifically include: a synchronized data obtaining module 410, a data quality detection task obtaining module 420, and a data quality obtaining module 430. Wherein:
a synchronized data obtaining module 410, configured to synchronize, in response to a data synchronization instruction, pre-synchronization data stored in a source region in the source storage engine to a target region in the target storage engine by running a data synchronization task, so as to obtain synchronized data;
a data quality detection task obtaining module 420, configured to obtain a data quality detection task corresponding to the data synchronization task, where the data quality detection task is generated based on the obtained source region, target region, and data quality detection rule;
the data quality obtaining module 430 is configured to perform quality detection on the pre-synchronization data stored in the source region and the post-synchronization data stored in the target region based on a data quality detection rule by running a data quality detection task, so as to obtain data quality.
Optionally, the source storage engine and the target storage engine are heterogeneous, and the synchronized data obtaining module 410 may include:
the first extraction unit of the data before synchronization is used for extracting the data before synchronization from a source region in a source storage engine by running a data synchronization task;
the second data type determining unit is used for acquiring the engine mapping rule and determining a second data type matched with the first data type of the data before synchronization in the target storage engine based on the engine mapping rule;
a mapped data obtaining unit, configured to map the pre-synchronization data to a second data type to obtain mapped data;
and the synchronized data obtaining unit is used for loading the mapped data to a target area in the target storage engine to obtain the synchronized data.
On the basis, optionally, the data quality detection task can be generated based on a target mapping rule, and the target mapping rule is obtained by screening from engine mapping rules based on the obtained source region and target region;
the data quality obtaining module 430 may include:
the data mapping unit before synchronization is used for extracting data before synchronization from the source region by running a data quality detection task and mapping the data before synchronization to a second data type based on a target mapping rule;
and the data type quality detection unit is used for carrying out quality detection on the aspect of data types on the mapped data before synchronization and the data after synchronization extracted from the target area based on a data quality detection rule.
Optionally, the data quality detection apparatus may further include:
a target region opening module, configured to open up, within the target storage engine, a target region available for storing data having the second data type, in response to the target region not being present in the target storage engine after it is determined that the first data type of the pre-synchronization data matches the second data type in the target storage engine;
the synchronized data obtaining unit may include:
and the mapped data loading subunit is used for loading the mapped data to the opened target area in the target storage engine.
Optionally, the source region includes a storage region in which data in the source field in the source storage engine is located, and the target region includes a storage region in which data in the target field in the target storage engine is located;
the pre-synchronization data first extraction unit may include:
the pre-synchronization data obtaining subunit is used for extracting data under a source field in the source storage engine to obtain pre-synchronization data;
the synchronized data obtaining module 410 may further include:
and the first data type determining unit is used for determining metadata for describing the source field from the metadata in the metadata set and determining the first data type of the data before synchronization based on the metadata for describing the source field.
On this basis, optionally, the metadata is acquired through the following modules:
the metadata acquisition configuration acquisition module is used for acquiring metadata acquisition configuration, wherein the metadata acquisition configuration comprises a data source type, a data source link, a database to be acquired and a data table;
the metadata acquisition task generating module is used for generating a metadata acquisition task based on metadata acquisition configuration;
and the metadata acquisition module is used for establishing a link with the database under the data source type based on the data source link by operating a metadata acquisition task and acquiring metadata from a data table in the linked database.
Optionally, the data synchronization task stores a data conversion rule, and the synchronized data obtaining module 410 may include:
the pre-synchronization data second extraction unit is used for extracting pre-synchronization data from a source region in the source storage engine by running a data synchronization task;
a converted data obtaining unit, configured to convert the pre-synchronization data based on a data conversion rule to obtain converted data;
and the converted data loading unit is used for loading the converted data to a target area in the target storage engine.
On this basis, optionally, the data quality detection task is generated based on a data conversion rule obtained by analyzing the data synchronization task;
the data quality obtaining module 430 may include:
the pre-synchronization data conversion unit is used for converting pre-synchronization data extracted from the source region based on a data conversion rule by running a data quality detection task;
and the data content quality detection unit is used for carrying out quality detection on the aspect of data content on the converted data before synchronization and the data after synchronization extracted from the target area based on a data quality detection rule.
Optionally, the data quality detection apparatus may further include:
the task scheduling module is used for acquiring a scheduling task after responding to the data synchronization instruction, and sequentially scheduling the data synchronization task and the data quality detection task by operating the scheduling task;
the synchronized data obtaining module 410 may include:
a data synchronization task execution unit for synchronizing tasks by executing the scheduled data;
the data quality obtaining module 430 may include:
and the data quality detection task running unit is used for running the scheduled data quality detection task.
On this basis, optionally, the task scheduling module may include:
the task scheduling unit is used for acquiring a synchronous increment range and sequentially scheduling the data synchronous task and the data quality detection task based on the synchronous increment range;
the synchronized data obtaining module 410 may further include:
the data synchronization unit before synchronization is used for synchronizing the data before synchronization which is stored on a source region in the source storage engine and corresponds to the synchronization increment range to a target region in the target storage engine;
the data quality obtaining module 430 may further include:
and the data quality detection unit is used for performing quality detection on the data before synchronization corresponding to the synchronous increment range stored on the source region and the data after synchronization corresponding to the synchronous increment range stored on the target region based on a data quality detection rule.
Optionally, the source region and the target region for generating the data quality detection task may be obtained through the following modules:
and the region acquisition module is used for analyzing the data synchronization task and acquiring a source region and a target region for generating a data quality detection task according to an analysis result.
On the basis, optionally, the source region and the target region are written into the data synchronization task in a structured query language mode; the region acquisition module may be specifically configured to:
acquiring a grammar file, and generating a lexical analyzer and a grammar analyzer based on the grammar file;
analyzing the structured query language in the data synchronization task based on a lexical analyzer and a syntactic analyzer to obtain an abstract syntax tree of the structured query language;
and analyzing the abstract syntax tree to obtain tree nodes and expression nodes, and acquiring a source region and a target region for generating a data quality detection task according to the tree nodes and the expression nodes.
Optionally, the data quality detection apparatus may further include:
the data quality detection rule determining module is used for responding to a task generation instruction aiming at a data synchronization task, and screening each candidate quality detection rule in the data quality detection rule set to obtain a data quality detection rule, or taking a preset quality detection rule in the data quality detection rule set as the data quality detection rule;
and the data quality detection task generation module is used for acquiring a source region and a target region corresponding to the data synchronization task and generating a data quality detection task corresponding to the data synchronization task based on the source region, the target region and a data quality detection rule.
According to the data quality detection device provided by the embodiment of the invention, the synchronized data obtaining module responds to the data synchronization instruction, and the data before synchronization stored on the source region in the source storage engine is synchronized to the target region in the target storage engine by running the data synchronization task to obtain the synchronized data; acquiring a data quality detection task corresponding to the data synchronization task through a data quality detection task acquisition module, wherein the data quality detection task is generated based on the acquired source region, target region and data quality detection rule; and running a data quality detection task through a data quality obtaining module, and performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on a data quality detection rule to obtain the data quality. The device automatically generates the data quality detection task based on the source region and the target region with the mapping relation in the data synchronization task and the data quality detection rule, so that the quality detection of the data before synchronization and the data after synchronization related to the data synchronization task is carried out based on the data quality detection task, and the effect of automatic detection of the data quality is achieved.
The data quality detection device provided by the embodiment of the invention can execute the data quality detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the data quality detection apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
FIG. 13 illustrates a schematic structural diagram of an electronic device 10 that may be used to implement an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 13, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as the data quality detection method.
In some embodiments, the data quality detection method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the data quality detection method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the data quality detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired result of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A data quality detection method, comprising:
responding to a data synchronization instruction, and synchronizing the data before synchronization stored on a source region in a source storage engine to a target region in a target storage engine by running a data synchronization task to obtain synchronized data;
acquiring a data quality detection task corresponding to the data synchronization task, wherein the data quality detection task is generated based on the acquired source region, the acquired target region and a data quality detection rule;
and performing quality detection on the data before synchronization stored on the source region and the data after synchronization stored on the target region based on the data quality detection rule by operating the data quality detection task to obtain the data quality.
2. The method of claim 1, wherein the source storage engine and the target storage engine are heterogeneous, and the synchronizing pre-synchronization data stored on a source region in the source storage engine to a target region in the target storage engine by running a data synchronization task to obtain synchronized data comprises:
extracting data before synchronization from a source region in a source storage engine by running a data synchronization task;
obtaining an engine mapping rule, and determining a second data type matched with the first data type of the data before synchronization in a target storage engine based on the engine mapping rule;
mapping the data before synchronization to the second data type to obtain mapped data;
and loading the mapped data to a target area in the target storage engine to obtain synchronized data.
3. The method according to claim 2, wherein the data quality detection task is further generated based on a target mapping rule, and the target mapping rule is obtained by screening from the engine mapping rule based on the obtained source area and the target area;
the performing quality detection on the pre-synchronization data stored on the source region and the post-synchronization data stored on the target region based on the data quality detection rule by running the data quality detection task includes:
extracting the data before synchronization from the source region by operating the data quality detection task, and mapping the data before synchronization to the second data type based on the target mapping rule;
and performing data type quality detection on the mapped data before synchronization and the data after synchronization extracted from the target area based on the data quality detection rule.
4. The method of claim 2, further comprising, after the determining that the first data type of the pre-synchronization data matches the second data type in the target storage engine:
responsive to the target region not being present within the target storage engine, then opening up within the target storage engine the target region available for storing data having the second data type;
the loading the mapped data onto a target region in the target storage engine includes:
and loading the mapped data to the target area opened up in the target storage engine.
5. The method of claim 2, wherein the source region comprises a storage region under a source field within the source storage engine and the target region comprises a storage region under a target field within the target storage engine;
the extracting of the pre-synchronization data from the source region in the source storage engine includes:
extracting data under the source field in a source storage engine to obtain data before synchronization;
the method further comprises the following steps:
metadata describing the source field is determined from the metadata in the metadata collection, and a first data type of the pre-synchronization data is determined based on the metadata describing the source field.
6. The method of claim 5, wherein the metadata is collected by:
acquiring metadata acquisition configuration, wherein the metadata acquisition configuration comprises a data source type, a data source link, a database to be acquired and a data table;
generating a metadata collection task based on the metadata collection configuration;
and establishing a link with the database under the data source type based on the data source link by operating the metadata acquisition task, and acquiring the metadata from the data table in the database with the established link.
7. The method of claim 1, wherein the data synchronization task has data transformation rules stored therein, and wherein the synchronizing pre-synchronization data stored on a source region in a source storage engine to a target region in a target storage engine by running the data synchronization task comprises:
extracting data before synchronization from a source region in a source storage engine by running a data synchronization task;
converting the data before synchronization based on the data conversion rule to obtain converted data;
and loading the converted data to a target area in a target storage engine.
8. The method of claim 7, wherein the data quality detection task is further generated based on the data transformation rules obtained by analyzing the data synchronization task;
the performing, by running the data quality detection task, quality detection on the pre-synchronization data stored on the source region and the post-synchronization data stored on the target region based on the data quality detection rule includes:
converting the pre-synchronization data extracted from the source region based on the data conversion rule by operating the data quality detection task;
and performing quality detection on the data content of the converted data before synchronization and the data after synchronization extracted from the target area based on the data quality detection rule.
9. The method of claim 1, further comprising, after said responding to a data synchronization instruction:
acquiring a scheduling task, and sequentially scheduling the data synchronization task and the data quality detection task by running the scheduling task;
the synchronizing by running a data synchronization task comprises: running the data synchronization task that is scheduled;
the performing quality detection by running the data quality detection task comprises: running the data quality detection task that is scheduled.
10. The method of claim 9, wherein the sequentially scheduling the data synchronization task and the data quality detection task comprises:
acquiring a synchronous increment range, and sequentially scheduling the data synchronization task and the data quality detection task based on the synchronous increment range;
the synchronizing pre-synchronization data stored on a source region within a source storage engine to a target region within a target storage engine by running the scheduled data synchronization task, comprising:
synchronizing pre-synchronization data which is stored on a source region in a source storage engine and corresponds to the synchronization increment range to a target region in a target storage engine by running the scheduled data synchronization task;
performing quality detection on the pre-synchronization data stored on the source region and the post-synchronization data stored on the target region based on the data quality detection rule by running the scheduled data quality detection task, including:
and performing quality detection on the pre-synchronization data corresponding to the synchronization increment range stored on the source region and the post-synchronization data corresponding to the synchronization increment range stored on the target region based on the data quality detection rule by running the scheduled data quality detection task.
11. The method of claim 1, wherein the source region and the target region used to generate the data quality detection task are obtained by:
and analyzing the data synchronization task, and acquiring the source region and the target region for generating the data quality detection task according to an analysis result.
12. The method of claim 11, wherein the source region and the target region are written into the data synchronization task by way of a structured query language;
the analyzing the data synchronization task and acquiring the source region and the target region for generating the data quality detection task according to an analysis result includes:
acquiring a grammar file, and generating a lexical analyzer and a grammar analyzer based on the grammar file;
analyzing the structured query language in the data synchronization task based on the lexical analyzer and the syntactic analyzer to obtain an abstract syntax tree of the structured query language;
and analyzing the abstract syntax tree to obtain tree nodes and expression nodes, and obtaining the source region and the target region for generating the data quality detection task according to the tree nodes and the expression nodes.
13. The method of claim 1, further comprising:
responding to a task generation instruction aiming at the data synchronization task, screening the data quality detection rule from all candidate quality detection rules in a data quality detection rule set, or taking a preset quality detection rule in the data quality detection rule set as the data quality detection rule;
and acquiring the source region and the target region corresponding to the data synchronization task, and generating the data quality detection task corresponding to the data synchronization task based on the source region, the target region and the data quality detection rule.
14. A data quality detection apparatus, comprising:
a synchronized data obtaining module, configured to synchronize, in response to a data synchronization instruction, pre-synchronization data stored in a source region in a source storage engine to a target region in a target storage engine by running a data synchronization task, so as to obtain synchronized data;
a data quality detection task acquisition module, configured to acquire a data quality detection task corresponding to the data synchronization task, where the data quality detection task is generated based on the acquired source region, the target region, and a data quality detection rule;
and the data quality obtaining module is used for performing quality detection on the pre-synchronization data stored on the source region and the post-synchronization data stored on the target region based on the data quality detection rule by operating the data quality detection task to obtain data quality.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the data quality detection method of any one of claims 1-13.
16. A computer-readable storage medium storing computer instructions for causing a processor to perform the data quality detection method of any one of claims 1-13 when executed.
CN202211469462.9A 2022-11-22 2022-11-22 Data quality detection method and device, electronic equipment and storage medium Pending CN115757626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211469462.9A CN115757626A (en) 2022-11-22 2022-11-22 Data quality detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211469462.9A CN115757626A (en) 2022-11-22 2022-11-22 Data quality detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115757626A true CN115757626A (en) 2023-03-07

Family

ID=85336865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211469462.9A Pending CN115757626A (en) 2022-11-22 2022-11-22 Data quality detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115757626A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775771A (en) * 2023-08-23 2023-09-19 北京逐风科技有限公司 Data synchronization method, device, system and medium
CN116775771B (en) * 2023-08-23 2024-01-26 北京逐风科技有限公司 Data synchronization method, device, system and medium
CN117524388A (en) * 2023-11-27 2024-02-06 浙江省卫生健康信息中心 Real-time acquisition and quality control method for health medical data
CN117524388B (en) * 2023-11-27 2024-04-16 浙江省卫生健康信息中心 Real-time acquisition and quality control method for health medical data
CN117992441A (en) * 2024-02-07 2024-05-07 广州翌拓软件开发有限公司 Data processing method and system for synchronous auditing

Similar Documents

Publication Publication Date Title
US11409764B2 (en) System for data management in a large scale data repository
US11461294B2 (en) System for importing data into a data repository
US11055302B2 (en) Method and system for implementing target model configuration metadata for a log analytics system
US11360950B2 (en) System for analysing data relationships to support data query execution
US10339038B1 (en) Method and system for generating production data pattern driven test data
CN115757626A (en) Data quality detection method and device, electronic equipment and storage medium
CN110647579A (en) Data synchronization method and device, computer equipment and readable medium
US10452625B2 (en) Data lineage analysis
CN105723335A (en) Data flow exploration
CN108647357B (en) Data query method and device
CN114880405A (en) Data lake-based data processing method and system
CN114218218A (en) Data processing method, device and equipment based on data warehouse and storage medium
CN112416991A (en) Data processing method and device and storage medium
US20220029887A1 (en) Configuration item determination based on information technology discovery data items from multiple sources
CN112699183A (en) Data processing method, system, readable storage medium and computer equipment
CN116701355A (en) Data view processing method, device, computer equipment and readable storage medium
CN114297211A (en) Data online analysis system, method, equipment and storage medium
CN112241406B (en) Big data display and agile development deployment method
Diván et al. Articulating heterogeneous data streams with the attribute-relation file format
CN115687282A (en) File synchronization method and device, electronic equipment and storage medium
CN117785019A (en) Method for extracting data, electronic equipment and storage medium
CN116662448A (en) Automatic data synchronization method and device, electronic equipment and storage medium
CN117555912A (en) Table structure consistency analysis and repair method for database
CN115756575A (en) Submission record acquisition method, device, equipment and storage medium
CN114168566A (en) Data processing method, device, equipment, medium and program product for item data synchronization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination