CN113641572B

CN113641572B - Debugging method for massive big data computing development based on SQL

Info

Publication number: CN113641572B
Application number: CN202110750626.4A
Authority: CN
Inventors: 徐长明
Original assignee: Duodian Life Chengdu Technology Co ltd
Current assignee: Duodian Life Chengdu Technology Co ltd
Priority date: 2021-07-02
Filing date: 2021-07-02
Publication date: 2023-06-13
Anticipated expiration: 2041-07-02
Also published as: CN113641572A

Abstract

The invention discloses a debugging method for massive big data computing development based on SQL. The method comprises: acquiring an original SQL statement for calculating index data together with debugging information, where the debugging information indicates either that debugging is required or that it is not; judging from the debugging information whether the original SQL statement needs to be debugged; transforming the original SQL statement to generate a debug SQL statement; and further steps. The invention allows the formal SQL statement and the debug SQL statement to run index calculations at the same time, with the two kinds of index data written to different landing libraries so that they do not interfere with each other. This keeps debugging clean, makes SQL writing faster and more convenient for SQL developers, reduces cost, makes it easier to transition and upgrade SQL index calculations smoothly, keeps dirty data produced during an upgrade away from data users, and effectively reduces hardware cost.

Description

Debugging method for massive big data computing development based on SQL

Technical Field

The invention relates to the technical field of big data processing, and in particular to a debugging method for massive big data computing development based on SQL.

Background

In offline index calculation for big data, and in the more mature integrated stream-batch index calculation, data computing engines that support the SQL standard, such as Hive SQL, Spark SQL and Flink SQL, are often used to calculate and store indexes. The resulting index data sets are queried by the various data display systems, giving enterprises a broader and more accurate basis for data-driven decisions and business enablement. SQL calculation in a big data cluster works as follows: data is read from an input source, index data is generated by the SQL that defines the index and is stored in an output source, and the data is then read from the output source for reports, charts and data analysis. In current SQL development, however, the source data produced in different data environments differs, and for reasons of security and data isolation, developers cannot develop SQL statements directly in the production environment when building calculation indexes. If an error in an SQL statement is not found in time in the development environment, it is discovered only after execution fails in production or a log alarm fires. In some cases the error is never detected and unreasonable, erroneous data is displayed, which misleads analysts in their data decisions and causes economic loss. Ensuring the accuracy of SQL calculations more reliably therefore requires additional environment cost and more development time and personnel.

Disclosure of Invention

The invention aims to solve the above problems by providing a debugging method for massive big data computing development based on SQL (Structured Query Language). The method verifies the validity of SQL statements and the correctness of index data against real source data in the production environment, protects the data in the formal library from pollution, fundamentally improves development efficiency, and reduces personnel cost.

This aim is achieved by the following technical scheme: a debugging method for massive big data computing development based on SQL, comprising the following steps:

S1: acquiring an original SQL statement for calculating index data, together with debugging information; the debugging information indicates either that debugging is required or that debugging is not required;

S2: judging, according to the debugging information, whether the original SQL statement needs to be debugged; if so, transforming the original SQL statement to generate a debug SQL statement and then executing step S3; if not, executing step S3 directly;

S3: submitting the original SQL statement, or the debug SQL generated in step S2, to the SQL calculation engine of the cluster for SQL index calculation; calculating the original SQL statement yields SQL calculation index data, and calculating the debug SQL yields debug SQL calculation index data;

S4: outputting the calculation result; the SQL calculation index data is written into the formal landing library, and the debug SQL calculation index data is written into the debugging landing library;

S5: providing the data in the formal landing library and the data in the debugging landing library to different users, respectively.
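The control flow of steps S1 to S5 can be sketched roughly as follows. This is a minimal Python illustration, not the patented implementation: `LandingLibraries`, `run_index_sql`, and the `transform` and `engine` callables are hypothetical names, and in-memory lists stand in for the formal and debugging landing libraries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class LandingLibraries:
    """In-memory stand-ins for the two landing libraries (assumption)."""
    formal: List[Row] = field(default_factory=list)
    debug: List[Row] = field(default_factory=list)


def run_index_sql(
    original_sql: str,
    needs_debug: bool,
    transform: Callable[[str], str],
    engine: Callable[[str], List[Row]],
    libraries: LandingLibraries,
) -> str:
    """Run one index calculation and return the SQL that was executed."""
    # S2: transform only when the debugging information asks for it.
    sql = transform(original_sql) if needs_debug else original_sql
    # S3: submit the chosen statement to the (hypothetical) engine.
    rows = engine(sql)
    # S4: debug results never reach the formal landing library.
    target = libraries.debug if needs_debug else libraries.formal
    target.extend(rows)
    return sql
```

Because the landing library is chosen from the same flag that triggers the transformation, a debug run cannot write into the formal library even when its results are wrong.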

The original SQL statement is transformed by the following steps:

Step 1: parsing the original SQL statement to obtain the input source library table, the output source library table and the target field list of the index calculation; at the same time, using the library table names of the input and output source library tables to obtain and complete, from the metadata data warehouse, their metadata structures and data source service address information;

Step 2: judging whether the field information and type list of the output source library table can be obtained from the output source; if so, executing step 4; if not, executing step 3;

Step 3: judging whether the target field list in the index SQL of the input source contains an aggregation calculation field; if so, extracting the aggregation calculation field as the field information, obtaining the type list from the service address information of the input source library table, and executing step 4; if not, obtaining the field information and type list directly from the service address information of the input source library table, and executing step 4;

Step 4: constructing the meta information structure and storage form of the debugging output source library table according to the field information and type list;

Step 5: replacing the output source library table of the original SQL statement with the debugging output source library table, which completes the transformation and yields the debug SQL.
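The parsing in step 1 and the replacement in step 5 can be illustrated with a small sketch. It assumes a single `INSERT INTO ... SELECT ...` statement and uses a regular expression as a stand-in for a real SQL parser; the function names and the table names in the usage note are hypothetical.

```python
import re

# Matches the leading INSERT clause and captures the output table name.
# A regular expression is only a stand-in for a real SQL parser (assumption).
INSERT_TARGET = re.compile(
    r"^(\s*insert\s+(?:into|overwrite)\s+(?:table\s+)?)([\w.]+)",
    re.IGNORECASE,
)


def parse_output_table(sql: str) -> str:
    """Step 1 (partial): return the output source library table."""
    match = INSERT_TARGET.match(sql)
    if match is None:
        raise ValueError("expected an INSERT ... SELECT statement")
    return match.group(2)


def to_debug_sql(sql: str, debug_table: str) -> str:
    """Step 5: replace the output table with the debugging output table."""
    match = INSERT_TARGET.match(sql)
    if match is None:
        raise ValueError("expected an INSERT ... SELECT statement")
    return sql[: match.start(2)] + debug_table + sql[match.end(2):]
```

For example, `to_debug_sql("INSERT INTO ads.order_info SELECT ...", "hbase_default.order_info_debug_1")` leaves the `SELECT` part untouched and changes only the target table, so the debug run exercises the same calculation logic as the formal statement.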

Compared with the prior art, the invention has the following advantages:

(1) The invention allows the formal SQL statement and the debug SQL statement to run index calculations at the same time, with the two kinds of index data written to different landing libraries so that they do not interfere with each other. This keeps debugging clean, makes SQL writing faster and more convenient for SQL developers, reduces cost, makes it easier to transition and upgrade SQL index calculations smoothly, keeps dirty data produced during an upgrade away from data users, and effectively reduces hardware cost.

(2) The invention provides a unified back-end service scheme, so developers only need to focus on writing the SQL for index calculation. The scheme is suitable for any data production environment and supports the SQL job types of SQL calculation engines and integrated stream-batch processing in a big data cluster, saving the hardware, process, and operation and maintenance costs of building a separate environment.

(3) With the unified back-end service scheme, developers do not need to write execution packages for each index in various programming languages and upload them to the cluster system to run. They only need to attend to the SQL logic of the index, which saves development cost.

(4) The invention provides unified SQL parsing, so developers do not have to work out how to debug the SQL, how to organize and pull the debugging source data, or how to design the overall debugging scheme and the landing and storage scheme for the debugging result data. The scheme completely avoids writes to the formal library, effectively isolates runs of the same SQL, guarantees the correctness of the formal library, and saves developers' time and data construction cost.

(5) Because the invention provides SQL parsing, developers can find bugs in their SQL by inspecting the debugging results and correct the errors. This simplifies SQL development, provides one-stop business index calculation, improves index development efficiency, and keeps cost under control.

Drawings

FIG. 1 is a flowchart illustrating the overall steps of the present invention.

FIG. 2 is a flowchart illustrating the steps for transforming an original SQL statement according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples, but embodiments of the present invention are not limited thereto.

Examples

The invention is a debugging method for massive big data computing development based on SQL. Some key terms are defined first:

Index data: the data of each statistical index calculated by technical means during big data computation in order to analyze user behavior or events concerning an object.

Input source: a storage system for source data that provides an accessible URL address or socket port service.

Output source: a storage system that receives data and provides an accessible URL address or socket port service.

SQL calculation: distributed SQL engine calculation in a big data cluster environment. Unlike stand-alone SQL calculation, which is synchronous and returns its result immediately, it is an asynchronous index calculation process whose result data is usually stored in a database system, from which users query and read the data.

As shown in FIG. 1 and FIG. 2, the debugging method for massive big data computing development based on SQL of the invention comprises the following steps:

S1: acquiring the original SQL statement for calculating index data together with debugging information. An SQL development user writes the original SQL statement in a front-end editor or text tool and attaches debugging information stating whether to debug; the original SQL statement and the debugging information are then submitted together to the back-end service interface executor. The debugging information takes one of two values: debugging required or debugging not required. In actual development, SQL written for a new data index should be debugged before it is put into formal production calculation. In addition, when an index calculation that is already running needs a new version in which data or fields are deleted, the new version of the SQL index calculation usually needs to be debugged so that the running SQL index calculation is not interrupted.

S2: after receiving the SQL and the debugging flag, the back-end service interface executor judges from the debugging information whether the original SQL statement needs to be debugged. If debugging is required, the original SQL statement is transformed to generate a debug SQL statement, and step S3 is executed. If debugging is not required, step S3 is executed directly.

Specifically, the original SQL statement is transformed by the following steps:

Step 1: the original SQL statement is parsed by an SQL parser to obtain the input source library table and output source library table of the index calculation and the target field list determined at run time. At the same time, the library table names of the input and output source library tables are used to obtain and complete, from the meta information data warehouse, the meta information structures and data source service address information of both tables.

Step 2: judging whether the field information and type list of the output source library table can be obtained from the output source; if so, executing step 4; if not, executing step 3.

Step 3: judging whether the target field list in the index SQL of the input source contains an aggregation calculation field; if so, extracting the aggregation calculation field as the field information, obtaining the type list from the service address information of the input source library table, and executing step 4; if not, obtaining the field information and type list directly from the service address information of the input source library table, and executing step 4.

Some data source systems used in SQL computing, such as MySQL, Hive, Impala and ClickHouse, do not allow library tables to be created directly during index calculation, so their library table information must be created through a data warehouse management system. Other data storage systems, such as HBase, ES, MongoDB, Redis and Kafka, allow structured data to be built during calculation, so the data warehouse of such an output source may hold no specific field names or type information. In this embodiment, the missing field type list is therefore completed from the data warehouse that holds the library table information of the input source.
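The decision in steps 2 and 3 can be sketched as a small function. This is an illustration under stated assumptions: `TargetField` and `resolve_debug_schema` are hypothetical names, schemas are modeled as plain dictionaries from field name to type, and the type of an aggregated field is copied unchanged from its input column, which a real system would likely refine (for example, `count` normally yields an integer type).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TargetField:
    name: str            # output field name (alias) in the target field list
    source_column: str   # input column the field is computed from
    is_aggregate: bool   # True for count(...), sum(...), and similar


def resolve_debug_schema(
    target_fields: List[TargetField],
    output_schema: Optional[Dict[str, str]],
    input_schema: Dict[str, str],
) -> Dict[str, str]:
    """Return the field-to-type mapping used to build the debug table."""
    # Step 2: the output source already knows its fields and types.
    if output_schema:
        return dict(output_schema)
    # Step 3, aggregation present: the aggregated fields become the field
    # information, and their types are looked up in the input source table.
    aggregates = [f for f in target_fields if f.is_aggregate]
    if aggregates:
        return {f.name: input_schema[f.source_column] for f in aggregates}
    # Step 3, no aggregation: take fields and types from the input source.
    return dict(input_schema)
```

The result is the input to step 4, which builds the debugging output source library table from it.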

Step 4: constructing the meta information structure and storage form of the debugging output source library table according to the field information and type list. Once the field information and type list have been obtained through the steps above, the landing library table and field structure information of the debugging output source are constructed.

Specifically, in view of the characteristics of index calculation, a database system with caching (such as ES, HBase, MongoDB, Redis or Kafka) is uniformly chosen as the debugging landing library table that receives the debugging index data. An expiration mechanism is set on the debugging library table: data expires when a time limit or an upper limit on the number of records is reached, and expired data is identified and deleted. If ES is used to store the landing library table, each debugging run can create its own index; the type data of the index contains the field types and structure information, and an expiration time is set on the index. If HBase is chosen, a debug table is generated for each debugging run, and the rowkey (a row key can be any character string and uniquely identifies a row) is set to the table name of the output source plus the time, in a format such as test_debug_path_finish_20210420113332. The stored data is a JSON string containing the data for the fields in the field structure information of the landing library table. The library table information of the debugging landing data source is injected into the logic executed by the big data SQL execution engine, and the connection information of the debugging output source is registered in advance.
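The expiration mechanism described above, where data expires on a time limit or on an upper limit of records, can be illustrated with an in-memory sketch. `ExpiringDebugTable` is a hypothetical class used only for illustration; a real deployment would more likely rely on the expiration features of the chosen store rather than on application code like this.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class ExpiringDebugTable:
    """Debug landing table whose rows expire by age or by row count."""

    def __init__(self, ttl_seconds: float, max_rows: int,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._max_rows = max_rows
        self._clock = clock
        self._rows: "OrderedDict[str, Tuple[float, str]]" = OrderedDict()

    def _evict_expired(self) -> None:
        # Delete every row that has reached the time limit.
        now = self._clock()
        expired = [k for k, (ts, _) in self._rows.items() if now - ts >= self._ttl]
        for key in expired:
            del self._rows[key]

    def put(self, rowkey: str, payload: str) -> None:
        self._evict_expired()
        self._rows.pop(rowkey, None)  # re-inserting moves the row to the end
        self._rows[rowkey] = (self._clock(), payload)
        # Enforce the upper limit on the number of rows, oldest first.
        while len(self._rows) > self._max_rows:
            self._rows.popitem(last=False)

    def get(self, rowkey: str) -> Optional[str]:
        self._evict_expired()
        entry = self._rows.get(rowkey)
        return None if entry is None else entry[1]

    def __len__(self) -> int:
        self._evict_expired()
        return len(self._rows)
```

Passing the clock as a parameter keeps the sketch testable without waiting for real time to elapse.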

In this method, an SQL parser parses and decomposes the original SQL statement to obtain the service connection information of its input source and output source. The meta information structure data of the input source library table and the table meta information structure data obtained by parsing the output source are read to construct the library table structure of the debugging output source. If the current SQL statement does not specify field meta information, the table structure is obtained by querying the library table in the output source, which supplies the field list used to complete the debug SQL for subsequent debugging output. The debug SQL statement is then output. The debug SQL and the formal SQL thus write to different output source landing libraries, so the data produced by the two SQL index calculations does not affect each other, and developers and non-developers can use the system at the same time. If the SQL uses an in-memory cache table, the corresponding database table cache or in-memory cache table is also replaced according to how the SQL writes to that cache table.

The transformed SQL statement is delivered to the big data SQL calculation engine for execution, which generates debugging data used to verify the correctness of the developer's SQL statement. When the developer's scenario is batch calculation, the input source data is read directly, and the debugging calculation outputs the data result. For stream data calculation, the user may specify input data in advance, or select data from the formal input source; this input data is written into an intermediate data table or data storage unit and supplied to the SQL logic to calculate the index data. The steps are as follows:

S3: the original SQL statement submitted by the user, or the debug SQL generated in step S2, is submitted to the SQL calculation engine of the big data cluster, which executes it and generates the index result. The SQL calculation engine may be Hive, Impala, Spark SQL, Flink SQL, ES or another SQL calculation engine. Calculating the original SQL statement yields SQL calculation index data, and calculating the debug SQL yields debug SQL calculation index data.

S4: outputting the calculation result; the SQL calculation index data is written into the formal landing library, and the debug SQL calculation index data is written into the debugging landing library.

Throughout the calculation, the debug SQL writes to a different output landing library table. Whether or not its data is valid, the index data calculated by the debug SQL is never written to the landing library table of the non-debug SQL, which guarantees the correctness of the production data.

During debugging index calculation, an upper limit on the running time of the debugging task or on the number of records it processes can be set as the termination condition, and developers can also terminate a debugging run manually. Kafka consumption by groupId is one example of a further precaution: if the index calculation of the debug SQL used the same groupId as that of the formal SQL, Kafka data consumed by the debug run would not be consumed again by the formal calculation, causing data loss or undercounting. Therefore, the groupId is modified incrementally during debugging so that each debugging run uses a different groupId; alternatively, a dedicated debugging Kafka is created to cache the input source data for each index data debugging run. As another example, to prevent repeated writes, a cache intermediate table is generated to replace the dimension intermediate table during debugging, and the synchronous calculation checks whether a write has already occurred, which ensures the correctness of the data.
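Two of the safeguards above, a distinct groupId for every debugging run and a termination condition based on running time or record count, can be sketched as follows. The naming scheme `<formal groupId>_debug_<id>` and the `DebugTermination` class are assumptions made for illustration; the text only requires that each debugging run use a groupId different from the formal one.

```python
import time
import uuid
from typing import Callable, Optional


def debug_group_id(formal_group_id: str, debug_id: Optional[str] = None) -> str:
    """Derive a consumer group id that never equals the formal one."""
    suffix = debug_id if debug_id is not None else uuid.uuid4().hex
    return f"{formal_group_id}_debug_{suffix}"


class DebugTermination:
    """Stops a debug run on a time limit, a record limit, or manual stop."""

    def __init__(self, max_seconds: float, max_records: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._deadline = clock() + max_seconds
        self._max_records = max_records
        self._records = 0
        self._stopped = False

    def record(self, count: int = 1) -> None:
        """Note that `count` more records have been processed."""
        self._records += count

    def stop(self) -> None:
        """Manual termination requested by the developer."""
        self._stopped = True

    def should_terminate(self) -> bool:
        return (
            self._stopped
            or self._records >= self._max_records
            or self._clock() >= self._deadline
        )
```

A debug task would call `record()` as it processes input and check `should_terminate()` between batches, while the formal job keeps consuming under its own, unchanged groupId.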

Each original SQL statement designates its own output source to receive index data, whereas a unified output source is designed for debug SQL to store its index data. This output source caches data and expires it based on a set time length or an upper limit on the number of records; expired data is deleted to relieve storage pressure on the unified debugging output source, while ensuring that index data in the formal production environment is not affected by erroneous data produced by debug SQL calculations.

S5: providing the data in the formal landing library and the data in the debugging landing library to different users, respectively. Development users analyze the data in the debugging landing library, and data analysts use the data in the formal landing library; each group is given an interface or service for reading and checking index data according to its purpose. SQL development users can obtain debugging data through the debugging index data interface or service to demonstrate, analyze and verify the correctness of the SQL calculation.

The debugging method of this embodiment is illustrated by the following case:

A developer develops and debugs the following SQL:

First step: the SQL parser parses the statement to obtain the output source library table unidata_stream_data_rd.ads_mysql_test_order_info, the input source library table (order refined ads kafka test order refined), and the aggregate calculation functions count and sum together with group by.

Second step: using the output source library table unidata_stream_data_rd.ads_mysql_test_order_info, the meta information structure data is obtained from the data warehouse management system: orderNum is a big type; orderPrice, promotionPrice, totalPrice and waretalprice are a big type; and dt is the result of a substring function and is therefore a string.

Third step: HBase is selected as the landing library table scheme, and the table ads_mysql_test_order_info_debug ID is generated; the rowkey of the data is: ads_mysql_test_order_info + generated debug id + time + index number.

The data content is assembled as a JSON string: { orderNum: calculated, promotionPrice: calculated, totalPrice: calculated, waseTotal price: calculated, dt: time calculated }.

Fourth step: the service address of the HBase table and the table name ads_mysql_test_order_info_debug ID are injected into the SQL execution engine, and the data is written into HBase in the manner described in the third step.

Fifth step: finally, the debugging SQL is transformed into the following SQL statement:

insert into hbase_default.ads_mysql_test_order_info_debug ID

Sixth step: the debug SQL is submitted to the big data SQL execution engine for index calculation, and the resulting data is written into hbase_default.ads_mysql_test_order_info_debug ID, so the library table unidata_stream_data_rd.ads_mysql_test_order_info is not affected.

Finally, a debugging data interface or service is provided so that the developer can obtain the index calculation data of the SQL currently being debugged through ads_mysql_test_order_info_debug ID.
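The rowkey and JSON payload described in the third step of this case can be sketched as follows. The underscore separator, the `yyyyMMddHHmmss` time format (taken from the rowkey example given earlier in the description), the function name `build_debug_row`, and the sample values in the usage note are assumptions for illustration only.

```python
import json
from datetime import datetime
from typing import Dict, Tuple


def build_debug_row(table: str, debug_id: str, when: datetime, seq: int,
                    values: Dict[str, object]) -> Tuple[str, str]:
    """Return (rowkey, JSON payload) for one debug result row."""
    # rowkey = output table name + debug id + time + sequence number
    rowkey = f"{table}_{debug_id}_{when:%Y%m%d%H%M%S}_{seq}"
    # The payload holds the calculated value of every field in the schema.
    return rowkey, json.dumps(values)
```

For instance, `build_debug_row("ads_mysql_test_order_info", "d1", datetime(2021, 4, 20, 11, 33, 32), 0, {"orderNum": 3})` produces a rowkey that starts with the output table name, so rows from different debugging runs of the same table remain distinguishable.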

As described above, the present invention can be well implemented.

Claims

1. A debugging method for massive big data computing development based on SQL, characterized by comprising the following steps:

S5: providing the data in the formal landing library and the data in the debugging landing library to different users, respectively;