CN111400299A

CN111400299A - Method and system for testing fusion quality of multiple data

Info

Publication number: CN111400299A
Application number: CN202010497131.0A
Authority: CN
Inventors: 张艳清; 查文宇; 庞攀; 王怡君; 金日海; 赵神州
Original assignee: Chengdu Sefon Software Co Ltd
Current assignee: Chengdu Sefon Software Co Ltd
Priority date: 2020-06-04
Filing date: 2020-06-04
Publication date: 2020-07-10

Abstract

The invention discloses a method and a system for testing the fusion quality of multiple kinds of data. Spark-SQL is used to read a data table from a data source, and a Spark cluster performs a quality test on the read data table and outputs a test report. The scheme provides a unified interface for editing Spark-SQL and checking its syntax. Because Spark-SQL can read data from many kinds of data sources, the scheme supports testing the fusion quality of data tables drawn from multiple data sources. The Spark-SQL is then converted into jobs for the Spark distributed computing framework, so that TB-level and PB-level data can be processed. This solves the problems that traditional data quality testing relies on hand-written programs and Python scripts, that each scenario needs its own fixed logic, that there is no unified SQL input, and that different databases need different testing tools.

Description

Method and system for testing fusion quality of multiple data

Technical Field

The invention relates to the field of big data, in particular to a method and a system for testing fusion quality of multiple data.

Background

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general parallel framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMPLab, and it shares the advantages of Hadoop MapReduce. Unlike MapReduce, however, the intermediate output of a job can be kept in memory, so HDFS reads and writes are not needed between steps, which makes Spark better suited to iterative algorithms such as those used in data mining and machine learning. Spark is an open-source cluster computing environment similar to Hadoop, but differences between the two make Spark superior for some workloads: Spark supports in-memory distributed datasets, which can speed up iterative workloads in addition to supporting interactive queries. Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala can manipulate distributed datasets as easily as local collection objects.

Traditional data quality inspection is carried out with hand-written programs and Python scripts. Each scenario needs its own fixed logic, there is no unified way to enter SQL, and different databases require different inspection tools.

Disclosure of Invention

The invention aims to provide a method and a system for multi-data fusion quality inspection, which solve the problems that traditional data quality inspection relies on hand-written programs and Python scripts, that each scenario needs its own fixed logic, that there is no unified SQL input, and that different databases require different inspection tools.

The technical scheme adopted by the invention is as follows:

A method for multi-data fusion quality inspection comprises reading a data table from a data source using Spark-SQL, performing quality inspection on the read data table through a Spark cluster, and outputting an inspection report.

The existing scheme for inspection using SQL comprises three steps: first, connecting to a database such as MySQL with the Navicat database connection tool and entering SQL in MySQL syntax to run quality checks on a given table; second, writing the problem data to a table; and third, manually compiling a quality report.

The scheme provides a unified, simple and easy-to-use interface for editing Spark-SQL and checking its syntax. Because Spark-SQL can read data from many kinds of data sources, the scheme supports fusion quality inspection of data tables from multiple data sources. The Spark-SQL is then converted into jobs for the Spark distributed computing framework, so that data at the TB to PB level can be processed.
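The syntax check performed before a statement is submitted can be illustrated with a minimal sketch. It is not the patented implementation: Python's built-in `sqlite3` module is used purely as a stand-in SQL engine so that the sketch runs without a cluster, and the helper `check_sql_syntax`, the `SCHEMA` string and the `customers` table are hypothetical names introduced for illustration. SQLite's dialect differs from Spark-SQL, so a real deployment would have to rely on Spark's own parser.

```python
import sqlite3

# Hypothetical schema used only so that table and column names can be resolved.
SCHEMA = "CREATE TABLE customers (id INTEGER, email TEXT);"


def check_sql_syntax(sql, schema_ddl):
    """Return (True, "") if `sql` compiles against the schema, else (False, message).

    Assumption: SQLite stands in for the real SQL engine; the query is
    compiled with EXPLAIN and never run against real data.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # create empty tables so names resolve
        conn.execute("EXPLAIN " + sql)   # compiles the statement; syntax errors raise
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()
```

An editing interface could call such a helper each time the user saves a statement and show the returned message next to the editor when the check fails.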

The Spark-SQL is edited through a simple and easy-to-use web interface. Quality inspection requires no programming: only the Spark-SQL logic needs to be written, and the inspection then runs quickly and efficiently as a distributed computation. The scheme also supports quality inspection of the fusion of multiple kinds of data and generates data quality inspection reports that are easy to understand.

The relational databases include Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL and the like, and the non-relational databases include HBase, Redis, MongoDB and the like.

Further, the Spark cluster performs quality inspection on the read data table through at least one executor.

Further, the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:

S101, the Spark cluster divides the data table into as many data sub-tables as there are executors and issues one data sub-table to each executor;

S102, each executor checks the received data sub-table against the indexes to be checked and generates a check result for that data sub-table;

S103, the Spark cluster summarizes the check results generated by the executors and generates a final inspection report.
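Steps S101 to S103 can be sketched in plain Python as a small simulation. This is only an illustration under stated assumptions: a thread pool stands in for the Spark executors, the table is a list of dictionaries, the only index checked is the completeness of a hypothetical `email` column, and the function names `split_table`, `check_sub_table` and `run_inspection` are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def split_table(rows, n_executors):
    """S101: divide the table into as many sub-tables as there are executors."""
    return [rows[i::n_executors] for i in range(n_executors)]


def check_sub_table(sub_table):
    """S102: check one sub-table; here the only index is completeness of 'email'."""
    missing = sum(1 for row in sub_table if row.get("email") in (None, ""))
    return {"rows": len(sub_table), "missing_email": missing}


def run_inspection(rows, n_executors):
    """S103: collect the per-executor results and merge them into one report."""
    sub_tables = split_table(rows, n_executors)
    # Assumption: threads stand in for Spark executors.
    with ThreadPoolExecutor(max_workers=n_executors) as pool:
        partials = list(pool.map(check_sub_table, sub_tables))
    total = sum(p["rows"] for p in partials)
    missing = sum(p["missing_email"] for p in partials)
    completeness = 1.0 if total == 0 else (total - missing) / total
    return {"rows": total, "missing_email": missing, "completeness": completeness}
```

Because each partial result carries raw counts rather than ratios, the final merge stays exact even when the sub-tables have different sizes.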

Alternatively, the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:

S201, the Spark cluster assigns a check index to each executor and sends the data table to each executor;

S202, each executor checks the received data table against its assigned index and generates a check result for that index;

S203, the Spark cluster collects the check results generated by the executors and generates a final inspection report.
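Steps S201 to S203 differ from the previous variant in that the work is split by index rather than by rows. A minimal simulation, again assuming a thread pool in place of Spark executors and hypothetical check functions over hypothetical `id` and `email` columns, might look like this:

```python
from concurrent.futures import ThreadPoolExecutor


def completeness(rows):
    """Share of rows whose 'email' field is present (assumes a non-empty table)."""
    return sum(1 for r in rows if r.get("email")) / len(rows)


def uniqueness(rows):
    """Ratio of distinct 'id' values to rows (assumes a non-empty table)."""
    return len({r["id"] for r in rows}) / len(rows)


# S201: one check index per executor; the names and functions are illustrative.
INDEX_CHECKS = {"completeness": completeness, "uniqueness": uniqueness}


def run_inspection_by_index(rows, checks=INDEX_CHECKS):
    """S202 and S203: every worker sees the whole table but checks one index."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in checks.items()}
        # Collect one result per index into the final report.
        return {name: f.result() for name, f in futures.items()}
```

Splitting by index suits checks such as uniqueness that need to see the whole table, whereas splitting by rows suits checks that can be evaluated row by row.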

Further, the data fusion quality comprises at least one index of completeness, normalization, consistency, accuracy, uniqueness and relevance.

A system for multiple data fusion quality verification, comprising:

a memory for storing executable instructions and storing a data table as a data source;

a processor for executing the executable instructions stored in said memory to read a data table from a data source using Spark-SQL as described above;

and a Spark cluster for performing quality inspection on the read data table and outputting an inspection report.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

1. the invention solves the problems that traditional data quality inspection relies on hand-written programs and Python scripts, that each scenario needs its own fixed logic, that there is no unified way to enter SQL, and that different databases require different inspection tools;

2. the invention allows quality inspection to be carried out in an easy-to-understand manner and improves the efficiency of data quality inspection;

3. the invention uses a distributed computing framework to handle quality inspection of massive data, inspects the quality of an enterprise's data center efficiently and quickly, and detects the quality condition of the data in real time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts, wherein:

FIG. 1 is a schematic diagram of the system architecture of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to FIG. 1. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.

Before the embodiments of the present invention are described in further detail, the terms and expressions used in them are explained; the following explanations apply to those terms and expressions.

Spark-SQL: a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark-SQL give Spark more information about the structure of the data and the computation being performed.

Data quality:

completeness: measures which data is missing or unavailable;

normalization (conformity): measures which data is not stored in a uniform format;

consistency: measures which data values conflict in informational meaning;

accuracy: measures which data or information is incorrect or out of date;

uniqueness: measures which data is duplicated or which attributes of the data are duplicated;

relevance (association): measures which associated data is missing or not indexed.
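Several of these indexes can be expressed directly as SQL queries, which is what makes a single SQL entry point attractive. The sketch below is an illustration only: Python's built-in `sqlite3` stands in for Spark-SQL, the `orders` and `customers` tables and their contents are invented, and the date-format test uses SQLite's `GLOB` operator, which would need to be replaced by an equivalent pattern match in Spark-SQL. Consistency and accuracy are left out because they depend on business rules the source does not specify.

```python
import sqlite3

# Each query returns the number of offending rows for one index (illustrative).
CHECKS = {
    "completeness": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "normalization": (
        "SELECT COUNT(*) FROM orders WHERE order_date NOT GLOB "
        "'[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'"
    ),
    "uniqueness": (
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)"
    ),
    "relevance": (
        "SELECT COUNT(*) FROM orders o LEFT JOIN customers c "
        "ON o.customer_id = c.id "
        "WHERE o.customer_id IS NOT NULL AND c.id IS NULL"
    ),
}


def build_demo_db():
    """Create a small in-memory database with deliberately flawed sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
        INSERT INTO customers VALUES (1), (2);
        INSERT INTO orders VALUES
            (100, 1,    '2020-06-04'),
            (101, NULL, '2020-06-05'),
            (101, 2,    '06/07/2020'),
            (102, 9,    '2020-06-08'),
            (103, NULL, '20200609');
        """
    )
    return conn


def run_checks(conn, checks=CHECKS):
    """Run every check query and return a mapping of index name to problem count."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}
```

Here uniqueness counts duplicated `order_id` values rather than duplicated rows; either convention works as long as the report states which one is used.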

Example 1

As shown in FIG. 1, in the method for multi-data fusion quality inspection, Spark-SQL is used to read a data table from a data source, and a Spark cluster performs quality inspection on the read data table and outputs an inspection report.

The scheme provides a unified, simple and easy-to-use interface for editing Spark-SQL and checking its syntax. Because Spark-SQL can read data from many kinds of data sources, the scheme supports fusion quality inspection of data tables from multiple data sources. The Spark-SQL is then converted into jobs for the Spark distributed computing framework, so that data at the TB and PB levels can be processed.

Example 2

This embodiment is further based on embodiment 1. The Spark-SQL is edited through a simple and easy-to-use web interface. Quality inspection requires no programming: only the Spark-SQL logic needs to be written, and the inspection then runs quickly and efficiently as a distributed computation. The scheme also supports quality inspection of the fusion of multiple kinds of data and generates data quality inspection reports that are easy to understand.

Example 3

This embodiment is further based on embodiment 1. The data source comprises at least two databases among relational databases and non-relational databases. The relational databases include Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL and the like, and the non-relational databases include HBase, Redis, MongoDB and the like.
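A fusion check across a relational and a non-relational source can be sketched as follows. Everything here is an assumption made for illustration: an in-memory SQLite table stands in for the relational database, a plain dictionary with `user:<id>` keys stands in for a key-value store, and the function names and sample records are invented. In the described scheme both sources would instead be read through Spark-SQL and compared inside the cluster.

```python
import sqlite3

# Stand-in for a key-value store such as Redis or HBase (illustrative data).
KV_STORE = {
    "user:1": {"city": "Chengdu"},
    "user:2": {"city": "Shenzhen"},
    "user:4": {"city": "Wuhan"},
}


def load_relational():
    """Read id -> city from a stand-in relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "Chengdu"), (2, "Beijing"), (3, "Shanghai")],
    )
    rows = conn.execute("SELECT id, city FROM users").fetchall()
    conn.close()
    return dict(rows)


def load_key_value(store):
    """Normalize the key-value records to the same id -> city shape."""
    return {int(key.split(":")[1]): value["city"] for key, value in store.items()}


def fusion_report(rel, kv):
    """Compare the two sources: conflicting values and records missing on one side."""
    shared = rel.keys() & kv.keys()
    return {
        "conflicting_ids": sorted(i for i in shared if rel[i] != kv[i]),
        "only_in_relational": sorted(rel.keys() - kv.keys()),
        "only_in_key_value": sorted(kv.keys() - rel.keys()),
    }
```

Conflicting values correspond to the consistency index, and records present on only one side correspond to the relevance index.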

Example 4

This embodiment is further based on embodiment 1: the Spark cluster performs quality inspection on the read data table through at least one executor.

Example 5

The difference between this embodiment and embodiment 4 is that the method by which the Spark cluster performs quality inspection on the read data table through at least one executor includes the following steps:

Example 6

This embodiment is further based on embodiment 1: the data fusion quality includes at least one index of completeness, normalization, consistency, accuracy, uniqueness and relevance.

Example 7

A system for multiple data fusion quality verification, comprising:

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for multi-data fusion quality inspection, characterized in that Spark-SQL is used to read a data table from a data source, quality inspection is carried out on the read data table through a Spark cluster, and an inspection report is output.

2. The method for multi-data fusion quality inspection according to claim 1, wherein the Spark-SQL is edited via a web interface.

3. The method for multi-data fusion quality inspection according to claim 1, wherein: the data source includes at least 2 of a relational database and a non-relational database.

4. The method for multi-data fusion quality inspection according to claim 1, wherein: the Spark cluster performs quality inspection on the read data table through at least one executor.

5. The method for multi-data fusion quality inspection according to claim 4, wherein: the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:

6. The method for multi-data fusion quality inspection according to claim 4, wherein: the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:

7. The method for multi-data fusion quality inspection according to claim 1, wherein: the data fusion quality comprises at least one index of completeness, normalization, consistency, accuracy, uniqueness and relevance.

8. A system for multi-data fusion quality inspection, characterized in that the system comprises:

a processor for executing executable instructions stored in said memory to read a data table from a data source using Spark-SQL as claimed in claim 1;