CN111400299A - Method and system for testing fusion quality of multiple data - Google Patents

Publication number: CN111400299A
Authority: CN (China)
Prior art keywords: data, spark, data table, quality inspection, inspection
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202010497131.0A
Other languages: Chinese (zh)
Inventors: 张艳清, 查文宇, 庞攀, 王怡君, 金日海, 赵神州
Current Assignee: Chengdu Sefon Software Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Chengdu Sefon Software Co Ltd
Priority date: 2020-06-04 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-06-04
Publication date: 2020-07-10
Application filed by Chengdu Sefon Software Co Ltd
Priority to CN202010497131.0A
Publication of CN111400299A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2428 Query predicate definition using graphical user interfaces, including menus and forms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for multi-data fusion quality inspection. Spark SQL is used to read a data table from a data source, the quality of the read data table is inspected by a Spark cluster, and an inspection report is output. The scheme provides a unified interface for editing Spark SQL and checking its syntax. Because Spark SQL can read data from various data sources, the scheme supports fusion quality inspection of data tables from multiple data sources. The Spark SQL statements are then orchestrated and converted into computations on the Spark distributed computing framework, so that TB- and PB-level data can be processed. This solves the problems of traditional data quality inspection, which relies on custom-written programs and Python scripts: each scenario needs its own fixed logic, there is no unified SQL input, and different databases require different inspection tools.

Description

Method and system for testing fusion quality of multiple data
Technical Field
The invention relates to the field of big data, in particular to a method and a system for testing fusion quality of multiple data.
Background
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general parallel framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab, and it retains the advantages of Hadoop MapReduce. Unlike MapReduce, however, the intermediate output of a job can be kept in memory, so it does not need to be written to and read back from HDFS. Spark is therefore better suited to iterative algorithms, such as those used in data mining and machine learning. Spark is an open-source cluster computing environment similar to Hadoop, but the differences between the two make Spark perform better on certain workloads: Spark supports in-memory distributed datasets, which allow it to optimize iterative workloads in addition to serving interactive queries. Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala can manipulate distributed datasets as easily as local collection objects.
Traditional data quality inspection relies on custom-written programs and Python scripts. Each scenario requires its own fixed logic, there is no unified way to enter SQL, and different databases require different inspection tools.
Disclosure of Invention
The invention aims to provide a method and a system for multi-data fusion quality inspection that solve the problems of traditional data quality inspection, which relies on custom-written programs and Python scripts: each scenario requires its own fixed logic, there is no unified SQL input, and different databases require different inspection tools.
The technical scheme adopted by the invention is as follows:
A method for multi-data fusion quality inspection comprises: reading a data table from a data source using Spark SQL, performing quality inspection on the read data table through a Spark cluster, and outputting an inspection report.
The existing scheme for inspection using SQL has three steps. First, a database connection tool such as Navicat is used to connect to a database, such as MySQL, and SQL in MySQL's syntax is entered to perform quality checks on a given table. Second, the problem data are output to a table. Third, a quality report is calculated manually.
This scheme provides a unified, simple and easy-to-use interface for editing Spark SQL and checking its syntax. Because Spark SQL can read data from various data sources, the scheme supports fusion quality inspection of data tables from multiple data sources. The Spark SQL statements are then orchestrated and converted into computations on the Spark distributed computing framework, so that TB- to PB-level data can be processed.
Further, the Spark SQL is edited through a web interface. A simple, easy-to-use web interface is provided, so quality inspection requires no programming; only the Spark SQL logic needs to be written. This gives a fast and efficient data inspection method based on distributed computing, supports quality inspection of the fusion of multiple kinds of data, and generates data quality inspection reports that are easy to understand.
Further, the data source comprises at least two databases drawn from relational and non-relational databases. Relational databases include Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL and the like; non-relational databases include HBase, Redis, MongoDB and the like.
Further, the Spark cluster performs quality inspection on the read data table through at least one executor.
Further, the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:
S101: the Spark cluster partitions the data table into as many data sub-tables as there are executors and distributes one sub-table to each executor;
S102: each executor inspects the data sub-table it received against the quality indicators to be checked and generates an inspection result for that sub-table;
S103: the Spark cluster aggregates the inspection results generated by the executors and generates the final inspection report.
Alternatively, the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:
S201: the Spark cluster assigns a quality indicator to each executor and sends the data table to every executor;
S202: each executor inspects the received data table against its assigned indicator and generates the inspection result for that indicator;
S203: the Spark cluster collects the inspection results generated by the executors and generates the final inspection report.
Further, the data fusion quality comprises at least one of the following indicators: completeness, conformity, consistency, accuracy, uniqueness and association.
A system for multiple data fusion quality verification, comprising:
a memory for storing executable instructions and storing a data table as a data source;
a processor for executing the executable instructions stored in said memory to read a data table from the data source using Spark SQL, as described above;
and a Spark cluster for performing quality inspection on the read data table and outputting an inspection report.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. The invention provides a method and a system for multi-data fusion quality inspection that solve the problems of traditional data quality inspection, which relies on custom-written programs and Python scripts: each scenario needs its own fixed logic, there is no unified way to enter SQL, and different databases require different inspection tools;
2. The invention provides a method and a system for multi-data fusion quality inspection that allow quality inspection to be carried out in an easy-to-understand way and improve the efficiency of data quality inspection;
3. The invention provides a method and a system for multi-data fusion quality inspection that use a distributed computing framework to handle quality inspection of massive data, inspecting the quality of an enterprise's data center efficiently and quickly and detecting the state of data quality in real time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts, wherein:
FIG. 1 is a schematic diagram of the system architecture of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to FIG. 1. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in them are explained; the following explanations apply to those terms and expressions.
Spark SQL: a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of the data and the computation being performed.
Data quality:
completeness: measures which data are missing or unavailable;
conformity (compliance): measures which data are not stored in a uniform format;
consistency: measures which data values conflict in the information they convey;
accuracy: measures which data and information are incorrect or out of date;
uniqueness: measures which data are duplicated, or which attributes of the data are duplicated;
association (integration): measures which associated data are missing or not indexed.
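As a rough illustration of how three of the indicators above might be turned into numbers, the following sketch computes completeness, conformity and uniqueness ratios over a small in-memory table. It is plain Python rather than Spark, and the sample rows, column names, phone-number pattern and the exact ratio definitions are all assumptions made for this example; the patent does not specify formulas.

```python
import re
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"id": 1, "phone": "13800000000"},
    {"id": 2, "phone": None},
    {"id": 2, "phone": "138-0000"},
]

def completeness(rows, column):
    """Share of rows whose value in `column` is present (not None or empty)."""
    present = sum(1 for r in rows if r.get(column) not in (None, ""))
    return present / len(rows)

def conformity(rows, column, pattern):
    """Share of present values in `column` that fully match the regex `pattern`."""
    values = [r[column] for r in rows if r.get(column) not in (None, "")]
    if not values:
        return 1.0
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

def uniqueness(rows, column):
    """Share of rows whose value in `column` occurs exactly once in the table."""
    counts = Counter(r.get(column) for r in rows)
    return sum(1 for r in rows if counts[r.get(column)] == 1) / len(rows)

report = {
    "completeness(phone)": completeness(rows, "phone"),
    "conformity(phone)": conformity(rows, "phone", r"\d{11}"),
    "uniqueness(id)": uniqueness(rows, "id"),
}
print(report)
```

Consistency, accuracy and association usually need a reference rule or a second table to compare against, so they are left out of this sketch.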
Example 1
As shown in FIG. 1, in the method for multi-data fusion quality inspection, Spark SQL is used to read a data table from a data source, quality inspection is performed on the read data table through a Spark cluster, and an inspection report is output.
The existing scheme for inspection using SQL has three steps. First, a database connection tool such as Navicat is used to connect to a database, such as MySQL, and SQL in MySQL's syntax is entered to perform quality checks on a given table. Second, the problem data are output to a table. Third, a quality report is calculated manually.
This scheme provides a unified, simple and easy-to-use interface for editing Spark SQL and checking its syntax. Because Spark SQL can read data from various data sources, the scheme supports fusion quality inspection of data tables from multiple data sources. The Spark SQL statements are then orchestrated and converted into computations on the Spark distributed computing framework, so that TB- and PB-level data can be processed.
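The idea of expressing quality checks purely as SQL over tables that originate from different sources can be sketched as follows. This example uses Python's built-in sqlite3 module only as a local stand-in for a SQL engine; it is not Spark SQL, and the table names, columns and check names are hypothetical. In a real Spark deployment the two tables would instead be loaded from their respective sources and queried with Spark SQL.

```python
import sqlite3

# sqlite3 is only a local stand-in for a SQL engine here; it is NOT Spark SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Imagine `orders` came from one source and `customers` from another.
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 99.5), (2, 11, NULL), (3, 12, 20.0);
    INSERT INTO customers VALUES (10, 'Li'), (11, 'Wang');
""")

# Each quality check is just a named SQL statement that returns one count.
checks = {
    # Completeness: orders whose amount is missing.
    "orders_missing_amount":
        "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    # Association across sources: orders that reference no known customer.
    "orders_without_customer": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
}

# Run every check and collect the counts into a simple report.
report = {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}
print(report)
```

Keeping each check as a named SQL string means new checks can be added without touching the surrounding program, which is the property the scheme relies on.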
Example 2
This embodiment builds on embodiment 1. The Spark SQL is edited through a web interface. A simple, easy-to-use web interface is provided, so quality inspection requires no programming; only the Spark SQL logic needs to be written. This gives a fast and efficient data inspection method based on distributed computing, supports quality inspection of the fusion of multiple kinds of data, and generates data quality inspection reports that are easy to understand.
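A web editor of this kind would typically validate a statement before submitting it to the cluster. The sketch below shows one way such a pre-submission check could look, again using sqlite3 as a stand-in: the statement is compiled with `EXPLAIN` but never run. The function name `check_syntax` and the sample schema are hypothetical, and SQLite's grammar differs from Spark SQL's, so this only illustrates the validate-then-submit flow, not Spark's own parser.

```python
import sqlite3

def check_syntax(conn, sql):
    """Compile `sql` without running it and return (ok, message).

    EXPLAIN makes SQLite prepare the statement, so syntax errors and
    references to unknown tables are reported before any data is read.
    """
    try:
        conn.execute("EXPLAIN " + sql)
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)

# Hypothetical schema that the statements are validated against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

good_result = check_syntax(conn, "SELECT COUNT(*) FROM orders WHERE amount IS NULL")
bad_result = check_syntax(conn, "SELECT COUNT(*) FORM orders")  # typo: FORM

print(good_result)
print(bad_result)
```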
Example 3
This embodiment builds on embodiment 1. The data source comprises at least two databases drawn from relational and non-relational databases. Relational databases include Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL and the like; non-relational databases include HBase, Redis, MongoDB and the like.
Example 4
This embodiment builds on embodiment 1. The Spark cluster performs quality inspection on the read data table through at least one executor.
Further, the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:
S101: the Spark cluster partitions the data table into as many data sub-tables as there are executors and distributes one sub-table to each executor;
S102: each executor inspects the data sub-table it received against the quality indicators to be checked and generates an inspection result for that sub-table;
S103: the Spark cluster aggregates the inspection results generated by the executors and generates the final inspection report.
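Steps S101 to S103 follow a split, check and merge pattern, which can be sketched in plain Python. Here a thread pool stands in for the Spark executors, and the function names, the sample table and the single completeness check are assumptions made for illustration; Spark's actual partitioning and scheduling are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

def split_table(rows, n):
    """S101: split the table into n sub-tables of near-equal size."""
    return [rows[i::n] for i in range(n)]

def check_sub_table(sub_rows):
    """S102: count rows and missing 'amount' values in one sub-table."""
    missing = sum(1 for r in sub_rows if r["amount"] is None)
    return {"rows": len(sub_rows), "missing_amount": missing}

def merge_results(partials):
    """S103: sum the per-sub-table counts into the final report."""
    total = sum(p["rows"] for p in partials)
    missing = sum(p["missing_amount"] for p in partials)
    completeness = (total - missing) / total if total else 1.0
    return {"rows": total, "missing_amount": missing, "completeness": completeness}

# Hypothetical table: every fourth row has no amount.
rows = [{"id": i, "amount": None if i % 4 == 0 else float(i)} for i in range(10)]

num_executors = 3  # stand-in for the number of Spark executors
with ThreadPoolExecutor(max_workers=num_executors) as pool:
    partials = list(pool.map(check_sub_table, split_table(rows, num_executors)))

report = merge_results(partials)
print(report)
```

Counts such as missing values merge by simple addition. Indicators such as uniqueness are harder, because duplicates may sit in different sub-tables, so the merge step would need to combine per-value counts rather than final ratios.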
Example 5
This embodiment differs from embodiment 4 in that the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:
S201: the Spark cluster assigns a quality indicator to each executor and sends the data table to every executor;
S202: each executor inspects the received data table against its assigned indicator and generates the inspection result for that indicator;
S203: the Spark cluster collects the inspection results generated by the executors and generates the final inspection report.
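In contrast to splitting the table, steps S201 to S203 split the work by indicator: every worker sees the whole table but computes only one indicator. A minimal sketch, with a thread pool again standing in for the executors; the three indicator functions, their definitions and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table with one missing amount, one duplicate id, one negative amount.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 2, "amount": 5.0},
    {"id": 3, "amount": -1.0},
]

def completeness(rows):
    """Share of rows whose amount is present."""
    return sum(r["amount"] is not None for r in rows) / len(rows)

def uniqueness(rows):
    """Ratio of distinct ids to rows."""
    ids = [r["id"] for r in rows]
    return len(set(ids)) / len(ids)

def accuracy(rows):
    """Share of present amounts that are non-negative (an assumed business rule)."""
    present = [r["amount"] for r in rows if r["amount"] is not None]
    return sum(a >= 0 for a in present) / len(present) if present else 1.0

# S201: one indicator per worker; every worker receives the full table.
indicators = {"completeness": completeness, "uniqueness": uniqueness, "accuracy": accuracy}

with ThreadPoolExecutor(max_workers=len(indicators)) as pool:
    # S202: each worker evaluates its own indicator on the whole table.
    futures = {name: pool.submit(check, rows) for name, check in indicators.items()}
    # S203: collect the per-indicator results into the final report.
    report = {name: future.result() for name, future in futures.items()}

print(report)
```

This layout avoids the cross-partition merge problem, since each indicator sees all rows, at the cost of sending the full table to every worker.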
Example 6
This embodiment builds on embodiment 1. The data fusion quality comprises at least one of the following indicators: completeness, conformity, consistency, accuracy, uniqueness and association.
Example 7
A system for multiple data fusion quality verification, comprising:
a memory for storing executable instructions and storing a data table as a data source;
a processor for executing the executable instructions stored in said memory to read a data table from the data source using Spark SQL, as described above;
and a Spark cluster for performing quality inspection on the read data table and outputting an inspection report.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method for multi-data fusion quality inspection, characterized in that Spark SQL is used to read a data table from a data source, quality inspection is performed on the read data table through a Spark cluster, and an inspection report is output.
2. The method for multi-data fusion quality inspection according to claim 1, wherein the Spark SQL is edited via a web interface.
3. The method for multi-data fusion quality inspection according to claim 1, wherein: the data source comprises at least two databases drawn from relational and non-relational databases.
4. The method for multi-data fusion quality inspection according to claim 1, wherein: the Spark cluster performs quality inspection on the read data table through at least one executor.
5. The method for multi-data fusion quality inspection according to claim 4, wherein: the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:
S101: the Spark cluster partitions the data table into as many data sub-tables as there are executors and distributes one sub-table to each executor;
S102: each executor inspects the data sub-table it received against the quality indicators to be checked and generates an inspection result for that sub-table;
S103: the Spark cluster aggregates the inspection results generated by the executors and generates the final inspection report.
6. The method for multi-data fusion quality inspection according to claim 4, wherein: the method by which the Spark cluster performs quality inspection on the read data table through at least one executor comprises the following steps:
S201: the Spark cluster assigns a quality indicator to each executor and sends the data table to every executor;
S202: each executor inspects the received data table against its assigned indicator and generates the inspection result for that indicator;
S203: the Spark cluster collects the inspection results generated by the executors and generates the final inspection report.
7. The method for multi-data fusion quality inspection according to claim 1, wherein: the data fusion quality comprises at least one of the following indicators: completeness, conformity, consistency, accuracy, uniqueness and association.
8. A system for multi-data fusion quality inspection, characterized by comprising:
a memory for storing executable instructions and storing a data table as a data source;
a processor for executing the executable instructions stored in said memory to read a data table from the data source using Spark SQL, as claimed in claim 1;
and a Spark cluster for performing quality inspection on the read data table and outputting an inspection report.
CN202010497131.0A 2020-06-04 2020-06-04 Method and system for testing fusion quality of multiple data Pending CN111400299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010497131.0A CN111400299A (en) 2020-06-04 2020-06-04 Method and system for testing fusion quality of multiple data

Publications (1)

Publication Number Publication Date
CN111400299A true CN111400299A (en) 2020-07-10

Family

ID=71437620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010497131.0A Pending CN111400299A (en) 2020-06-04 2020-06-04 Method and system for testing fusion quality of multiple data

Country Status (1)

Country Link
CN (1) CN111400299A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101718119B1 (en) * 2016-04-22 2017-03-21 숭실대학교산학협력단 System and Method for processing SPARQL queries based on Spark SQL
CN106777167A (en) * 2016-12-21 2017-05-31 中国科学院上海高等研究院 Magnanimity Face Image Retrieval System and search method based on Spark frameworks
CN106874483A (en) * 2017-02-20 2017-06-20 山东鲁能软件技术有限公司 A kind of device and method of the patterned quality of data evaluation and test based on big data technology
CN107247799A (en) * 2017-06-27 2017-10-13 北京天机数测数据科技有限公司 Data processing method, system and its modeling method of compatible a variety of big data storages
CN107368501A (en) * 2016-05-13 2017-11-21 北京京东尚科信息技术有限公司 The processing method and processing device of data
CN108255619A (en) * 2017-12-28 2018-07-06 新华三大数据技术有限公司 A kind of data processing method and device
CN108647360A (en) * 2018-05-18 2018-10-12 南通大学 A kind of method of the access of taxi big data and the processing of multithreading
CN109213751A (en) * 2018-08-06 2019-01-15 北京所问数据科技有限公司 Oracle database parallel migration technology based on Spark platform
CN109992576A (en) * 2019-03-01 2019-07-09 苏州龙石信息科技有限公司 A kind of government data quality evaluation and abnormal data recovery technique based on big data technology
WO2019223598A1 (en) * 2018-05-25 2019-11-28 杭州海康威视数字技术股份有限公司 Method and device for fusing data table

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
昝松亭: "Research on a data quality assessment model for mobile medical big data", China Master's Theses Full-text Database (Medicine and Health Sciences) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710