CN111400299A - Method and system for testing fusion quality of multiple data
- Publication number
- CN111400299A
- Authority
- CN
- China
- Prior art keywords
- data
- spark
- data table
- quality inspection
- inspection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2428—Query predicate definition using graphical user interfaces, including menus and forms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a system for inspecting the fusion quality of multiple kinds of data. Spark-SQL is used to read a data table from a data source, a Spark cluster inspects the quality of the table that was read, and an inspection report is output. The scheme provides a unified interface for editing Spark-SQL and checking its syntax. Because Spark-SQL can read data from many kinds of data sources, the scheme can inspect the fusion quality of data tables drawn from multiple sources. The Spark-SQL is then orchestrated and converted into computations on the Spark distributed computing framework, so that TB-level and PB-level data can be processed. This solves the problems of traditional data quality inspection, which relies on hand-written programs and Python scripts, needs fixed logic for each scenario, has no unified SQL input, and requires a different tool for each database.
Description
Technical Field
The invention relates to the field of big data, in particular to a method and a system for testing fusion quality of multiple data.
Background
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general-purpose parallel framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab, and it shares the advantages of Hadoop MapReduce. Unlike MapReduce, however, the intermediate output of a job can be kept in memory, so it does not need to be written to and read back from HDFS. This makes Spark better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning. Spark is an open-source cluster computing environment similar to Hadoop, but the differences between the two make Spark perform better on certain workloads: Spark supports in-memory distributed datasets, which allow it to optimize iterative workloads in addition to serving interactive queries. Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala can manipulate distributed datasets as easily as local collection objects.
Traditional data quality inspection relies on hand-written programs and Python scripts. Each scenario needs its own fixed logic, there is no unified way to enter SQL, and a different tool is needed to inspect each kind of database.
Disclosure of Invention
The invention aims to provide a method and a system for multi-data fusion quality inspection that solve the problems of traditional data quality inspection: it relies on hand-written programs and Python scripts, each scenario needs its own fixed logic, there is no unified SQL input, and a different tool is needed to inspect each kind of database.
The technical scheme adopted by the invention is as follows:
A method for multi-data fusion quality inspection comprises reading a data table from a data source by using Spark-SQL, performing quality inspection on the read data table through a Spark cluster, and outputting an inspection report.
The existing scheme for inspection with SQL has three steps. First, a database connection tool such as Navicat is used to connect to a database, such as MySQL, and SQL in MySQL's dialect is entered to perform a quality check on a given table. Second, the problem data is output to a table. Third, the quality report is calculated manually.
The present scheme provides a unified, simple, and easy-to-use interface for editing Spark-SQL and checking its syntax. Because Spark-SQL can read data from many kinds of data sources, the scheme can inspect the fusion quality of data tables drawn from multiple data sources. The Spark-SQL is then orchestrated and converted into computations on the Spark distributed computing framework, so that TB-level to PB-level data can be processed.
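The two ideas in the paragraph above, checking the syntax of a statement before it runs and running one statement across tables from different sources, can be sketched as follows. This is a minimal illustration and not the patented implementation: it uses Python's built-in sqlite3 module as a stand-in for Spark-SQL so that it runs without a cluster. The table names (orders, customers), the attached database alias src_b, and the helper check_syntax are all hypothetical.

```python
import sqlite3


def check_syntax(conn, sql):
    """Return None if the statement compiles, otherwise the error message."""
    try:
        # EXPLAIN asks SQLite to compile the statement without running the query.
        conn.execute("EXPLAIN " + sql)
        return None
    except sqlite3.Error as exc:
        return str(exc)


# Two separate in-memory databases stand in for two different data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS src_b")
conn.execute("CREATE TABLE main.orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE src_b.customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO main.orders VALUES (?, ?)", [(1, 10), (2, 11), (3, 99)])
conn.executemany("INSERT INTO src_b.customers VALUES (?, ?)", [(10, "Ann"), (11, "Bo")])

# A fusion check: orders whose customer is missing from the other source.
FUSION_CHECK = """
SELECT o.order_id
FROM main.orders AS o
LEFT JOIN src_b.customers AS c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
"""

syntax_error = check_syntax(conn, FUSION_CHECK)
orphan_orders = [row[0] for row in conn.execute(FUSION_CHECK)]
typo_error = check_syntax(conn, "SELEC * FROM main.orders")
```

Here syntax_error is None because the statement compiles, orphan_orders holds the one order that has no matching customer, and typo_error carries the message for the misspelled keyword. In a Spark deployment the same pattern would apply to tables registered from different sources, with the engine's own parser doing the syntax check.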
The Spark-SQL is edited through a web interface. The simple, easy-to-use web interface allows quality inspection without programming: only the Spark-SQL logic needs to be written, after which the inspection runs quickly and efficiently as a distributed computation. Quality inspection of fused data from multiple sources is supported, and an easy-to-understand data quality inspection report is generated.
The relational database comprises Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL and the like, and the non-relational database comprises HBase, Redis, MongoDB and the like.
Further, the Spark cluster performs quality inspection on the read data table through at least one executor.
Further, the quality inspection performed by the Spark cluster on the read data table through at least one executor comprises the following steps:
S101, the Spark cluster divides the data table into as many data sub-tables as there are executors and issues one data sub-table to each executor;
S102, each executor checks the data sub-table it received against the indicators to be inspected and generates an inspection result for that sub-table;
S103, the Spark cluster summarizes the inspection results generated by the executors and generates a final inspection report.
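Steps S101 to S103 follow a split, check, and summarize pattern. The sketch below imitates that pattern in plain Python, with a thread pool standing in for Spark executors. It is an illustration under stated assumptions only: the function names, the sample rows, and the choice of a completeness count as the inspected indicator are hypothetical and do not come from the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def split_table(rows, n_executors):
    """S101: divide the table into as many sub-tables as there are executors."""
    return [rows[i::n_executors] for i in range(n_executors)]


def check_sub_table(sub_table, required_columns):
    """S102: count the rows and the rows missing a required value."""
    incomplete = sum(
        1
        for row in sub_table
        if any(row.get(col) in (None, "") for col in required_columns)
    )
    return {"rows": len(sub_table), "incomplete": incomplete}


def inspect_by_partition(rows, required_columns, n_executors):
    sub_tables = split_table(rows, n_executors)
    with ThreadPoolExecutor(max_workers=n_executors) as pool:
        partials = list(
            pool.map(lambda t: check_sub_table(t, required_columns), sub_tables)
        )
    # S103: merge the per-sub-table counts into one report.
    total = sum(p["rows"] for p in partials)
    incomplete = sum(p["incomplete"] for p in partials)
    completeness = 1.0 if total == 0 else (total - incomplete) / total
    return {"rows": total, "incomplete": incomplete, "completeness": completeness}


rows = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": None, "email": "bo@example.com"},
    {"name": "Cy", "email": ""},
    {"name": "Di", "email": "di@example.com"},
]
report = inspect_by_partition(rows, ["name", "email"], n_executors=2)
```

Counts such as these merge by simple addition, which is why they suit this scheme. An indicator like uniqueness is harder to split this way, because duplicates can land in different sub-tables; it would need the rows grouped by key before the per-sub-table check.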
Further, the quality inspection performed by the Spark cluster on the read data table through at least one executor comprises the following steps:
S201, the Spark cluster assigns an inspection indicator to each executor and sends the data table to each executor;
S202, each executor inspects the data table it received against its assigned indicator and generates the inspection result for that indicator;
S203, the Spark cluster collects the inspection results generated by the executors and generates a final inspection report.
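In steps S201 to S203 the work is divided by indicator rather than by rows: every executor sees the whole table but computes only one indicator. A minimal plain-Python sketch of that arrangement follows, again with a thread pool standing in for Spark executors. The two indicator functions, the column name id, and the sample rows are hypothetical choices made for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def completeness(rows):
    """Share of rows in which every field has a value."""
    filled = sum(1 for r in rows if all(v not in (None, "") for v in r.values()))
    return filled / len(rows)


def uniqueness(rows):
    """Share of distinct values in the (hypothetical) id column."""
    ids = [r["id"] for r in rows]
    return len(set(ids)) / len(ids)


def inspect_by_indicator(rows, checks):
    # S201: one worker per indicator, each given the full table.
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in checks.items()}
        # S202 and S203: each worker returns its indicator; collect them all.
        return {name: future.result() for name, future in futures.items()}


rows = [
    {"id": 1, "phone": "555"},
    {"id": 1, "phone": None},
    {"id": 2, "phone": "556"},
    {"id": 3, "phone": "557"},
]
report = inspect_by_indicator(
    rows, {"completeness": completeness, "uniqueness": uniqueness}
)
```

Because each worker holds the full table, indicators that need a global view, such as uniqueness, are straightforward here. The cost is that the table is sent once per indicator, so this arrangement suits tables that fit comfortably on each executor.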
Further, the data fusion quality comprises at least one of the following indicators: completeness, normativity, consistency, accuracy, uniqueness, and association.
A system for multi-data fusion quality inspection, comprising:
a memory for storing executable instructions and for storing a data table serving as a data source;
a processor for executing the executable instructions stored in said memory to read a data table from the data source using Spark-SQL as described above;
and a Spark cluster for performing quality inspection on the read data table and outputting an inspection report.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. The method and system for multi-data fusion quality inspection solve the problems of traditional data quality inspection, which relies on hand-written programs and Python scripts, needs fixed logic for each scenario, has no unified way to enter SQL, and requires a different tool for each database;
2. The method and system for multi-data fusion quality inspection allow quality inspection to be carried out in an easy-to-understand way and improve the efficiency of data quality inspection;
3. The method and system for multi-data fusion quality inspection use a distributed computing framework to handle quality inspection of massive data, so that the quality of an enterprise's data center can be inspected quickly and efficiently and the quality of the data can be monitored in real time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts, wherein:
FIG. 1 is a schematic diagram of the system architecture of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to FIG. 1. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in them are explained. The following explanations apply to those terms and expressions.
Spark-SQL: a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark-SQL give Spark more information about the structure of the data and the computations being performed.
Data quality:
Completeness: measures which data is missing or unavailable;
Normativity (compliance): measures which data is not stored in a uniform format;
Consistency: measures which data values conflict in their meaning;
Accuracy: measures which data or information is incorrect or out of date;
Uniqueness: measures which data is duplicated, or which attributes of the data are duplicated;
Association (integration): measures which associated data is missing or not indexed.
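To make some of these definitions concrete, the sketch below expresses three of them (normativity, accuracy, and association) as small checks that return the ids of the offending records. It is plain Python for illustration only. The record layout, the ISO date format taken as the uniform format, the 0 to 120 age range, and the set of known patient ids are all assumptions and are not specified by the patent.

```python
import re

# Assumed uniform format for dates: YYYY-MM-DD.
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")


def normativity_violations(rows):
    """Records whose date is not stored in the assumed uniform format."""
    return [r["id"] for r in rows if not DATE_FORMAT.fullmatch(r["visit_date"])]


def accuracy_violations(rows, min_age=0, max_age=120):
    """Records whose age falls outside an assumed plausible range."""
    return [r["id"] for r in rows if not min_age <= r["age"] <= max_age]


def association_violations(rows, known_patient_ids):
    """Records that refer to a patient missing from the associated table."""
    return [r["id"] for r in rows if r["patient_id"] not in known_patient_ids]


rows = [
    {"id": 1, "visit_date": "2020-06-04", "age": 34, "patient_id": "P1"},
    {"id": 2, "visit_date": "04/06/2020", "age": 51, "patient_id": "P2"},
    {"id": 3, "visit_date": "2020-06-05", "age": 230, "patient_id": "P9"},
]
known_patient_ids = {"P1", "P2"}

bad_format = normativity_violations(rows)
bad_age = accuracy_violations(rows)
bad_link = association_violations(rows, known_patient_ids)
```

Returning record ids rather than a single score keeps the problem data available for output, while a ratio for the report can still be derived from the length of each list.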
Example 1
As shown in FIG. 1, in the method for multi-data fusion quality inspection, Spark-SQL is used to read a data table from a data source, quality inspection is performed on the read data table through a Spark cluster, and an inspection report is output.
The existing scheme for inspection with SQL has three steps. First, a database connection tool such as Navicat is used to connect to a database, such as MySQL, and SQL in MySQL's dialect is entered to perform a quality check on a given table. Second, the problem data is output to a table. Third, the quality report is calculated manually.
The present scheme provides a unified, simple, and easy-to-use interface for editing Spark-SQL and checking its syntax. Because Spark-SQL can read data from many kinds of data sources, the scheme can inspect the fusion quality of data tables drawn from multiple data sources. The Spark-SQL is then orchestrated and converted into computations on the Spark distributed computing framework, so that TB-level and PB-level data can be processed.
Example 2
This embodiment is further based on Embodiment 1. The Spark-SQL is edited through a web interface. The simple, easy-to-use web interface allows quality inspection without programming: only the Spark-SQL logic needs to be written, after which the inspection runs quickly and efficiently as a distributed computation. Quality inspection of fused data from multiple sources is supported, and an easy-to-understand data quality inspection report is generated.
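The patent does not describe the layout of the inspection report. As one possible reading of an easy-to-understand report, the sketch below renders indicator scores as aligned text lines with a pass or fail mark. The function name render_report, the 0.95 pass threshold, and the sample scores are hypothetical.

```python
def render_report(table_name, scores, threshold=0.95):
    """Format indicator scores (0.0 to 1.0) as a plain-text report."""
    lines = [f"Quality inspection report for {table_name}"]
    for indicator, score in sorted(scores.items()):
        status = "PASS" if score >= threshold else "FAIL"
        # Left-align the indicator name, right-align the percentage.
        lines.append(f"{indicator:<14}{score:>7.1%}  {status}")
    return "\n".join(lines)


report_text = render_report("orders", {"uniqueness": 0.75, "completeness": 0.99})
print(report_text)
```

Sorting the indicators by name keeps the report layout stable between runs, which makes successive reports easier to compare.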
Example 3
This embodiment is further based on Embodiment 1, wherein the data source comprises at least two databases selected from relational databases and non-relational databases. The relational databases include Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL and the like, and the non-relational databases include HBase, Redis, MongoDB and the like.
Example 4
This embodiment is based on Embodiment 1, wherein the Spark cluster performs quality inspection on the read data table through at least one executor.
Further, the quality inspection performed by the Spark cluster on the read data table through at least one executor comprises the following steps:
S101, the Spark cluster divides the data table into as many data sub-tables as there are executors and issues one data sub-table to each executor;
S102, each executor checks the data sub-table it received against the indicators to be inspected and generates an inspection result for that sub-table;
S103, the Spark cluster summarizes the inspection results generated by the executors and generates a final inspection report.
Example 5
This embodiment differs from Embodiment 4 in that the quality inspection performed by the Spark cluster on the read data table through at least one executor comprises the following steps:
S201, the Spark cluster assigns an inspection indicator to each executor and sends the data table to each executor;
S202, each executor inspects the data table it received against its assigned indicator and generates the inspection result for that indicator;
S203, the Spark cluster collects the inspection results generated by the executors and generates a final inspection report.
Example 6
This embodiment is further based on Embodiment 1, wherein the data fusion quality comprises at least one of the following indicators: completeness, normativity, consistency, accuracy, uniqueness, and association.
Example 7
A system for multi-data fusion quality inspection, comprising:
a memory for storing executable instructions and for storing a data table serving as a data source;
a processor for executing the executable instructions stored in said memory to read a data table from the data source using Spark-SQL as described above;
and a Spark cluster for performing quality inspection on the read data table and outputting an inspection report.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (8)
1. A method for multi-data fusion quality inspection, characterized in that Spark-SQL is used to read a data table from a data source, quality inspection is performed on the read data table through a Spark cluster, and an inspection report is output.
2. The method for multi-data fusion quality inspection according to claim 1, wherein the Spark-SQL is edited via a web interface.
3. The method for multi-data fusion quality inspection according to claim 1, wherein the data source includes at least two of a relational database and a non-relational database.
4. The method for multi-data fusion quality inspection according to claim 1, wherein the Spark cluster performs quality inspection on the read data table through at least one executor.
5. The method for multi-data fusion quality inspection according to claim 4, wherein the quality inspection performed by the Spark cluster on the read data table through at least one executor comprises the following steps:
S101, the Spark cluster divides the data table into as many data sub-tables as there are executors and issues one data sub-table to each executor;
S102, each executor checks the data sub-table it received against the indicators to be inspected and generates an inspection result for that sub-table;
S103, the Spark cluster summarizes the inspection results generated by the executors and generates a final inspection report.
6. The method for multi-data fusion quality inspection according to claim 4, wherein the quality inspection performed by the Spark cluster on the read data table through at least one executor comprises the following steps:
S201, the Spark cluster assigns an inspection indicator to each executor and sends the data table to each executor;
S202, each executor inspects the data table it received against its assigned indicator and generates the inspection result for that indicator;
S203, the Spark cluster collects the inspection results generated by the executors and generates a final inspection report.
7. The method for multi-data fusion quality inspection according to claim 1, wherein the data fusion quality comprises at least one of the following indicators: completeness, normativity, consistency, accuracy, uniqueness, and association.
8. A system for multi-data fusion quality inspection, characterized in that the system comprises:
a memory for storing executable instructions and for storing a data table serving as a data source;
a processor for executing the executable instructions stored in said memory to read a data table from the data source using Spark-SQL as claimed in claim 1;
and a Spark cluster for performing quality inspection on the read data table and outputting an inspection report.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010497131.0A CN111400299A (en) | 2020-06-04 | 2020-06-04 | Method and system for testing fusion quality of multiple data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010497131.0A CN111400299A (en) | 2020-06-04 | 2020-06-04 | Method and system for testing fusion quality of multiple data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111400299A (en) | 2020-07-10 |
Family
ID=71437620
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010497131.0A (Pending), CN111400299A (en) | Method and system for testing fusion quality of multiple data | 2020-06-04 | 2020-06-04 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111400299A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101718119B1 (en) * | 2016-04-22 | 2017-03-21 | 숭실대학교산학협력단 | System and Method for processing SPARQL queries based on Spark SQL |
CN106777167A (en) * | 2016-12-21 | 2017-05-31 | 中国科学院上海高等研究院 | Magnanimity Face Image Retrieval System and search method based on Spark frameworks |
CN106874483A (en) * | 2017-02-20 | 2017-06-20 | 山东鲁能软件技术有限公司 | A kind of device and method of the patterned quality of data evaluation and test based on big data technology |
CN107247799A (en) * | 2017-06-27 | 2017-10-13 | 北京天机数测数据科技有限公司 | Data processing method, system and its modeling method of compatible a variety of big data storages |
CN107368501A (en) * | 2016-05-13 | 2017-11-21 | 北京京东尚科信息技术有限公司 | The processing method and processing device of data |
CN108255619A (en) * | 2017-12-28 | 2018-07-06 | 新华三大数据技术有限公司 | A kind of data processing method and device |
CN108647360A (en) * | 2018-05-18 | 2018-10-12 | 南通大学 | A kind of method of the access of taxi big data and the processing of multithreading |
CN109213751A (en) * | 2018-08-06 | 2019-01-15 | 北京所问数据科技有限公司 | Oracle database parallel migration technology based on Spark platform |
CN109992576A (en) * | 2019-03-01 | 2019-07-09 | 苏州龙石信息科技有限公司 | A kind of government data quality evaluation and abnormal data recovery technique based on big data technology |
WO2019223598A1 (en) * | 2018-05-25 | 2019-11-28 | 杭州海康威视数字技术股份有限公司 | Method and device for fusing data table |
Non-Patent Citations (1)
Title |
---|
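Zan Songting (昝松亭): "Research on a Data Quality Assessment Model for Mobile Medical Big Data" (移动医疗大数据的数据质量评估模型研究), China Master's Theses Full-text Database, Medicine and Health Sciences * |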
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200710 |