CN116795859A - Data analysis method, device, computer equipment and storage medium - Google Patents

Data analysis method, device, computer equipment and storage medium

Info

Publication number
CN116795859A
Authority
CN
China
Prior art keywords
query statement
data
data analysis
query
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210246053.6A
Other languages
Chinese (zh)
Inventor
张韶全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210246053.6A
Publication of CN116795859A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24535Query rewriting; Transformation of sub-queries or views
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24554Unary operations; Data partitioning operations
    • G06F16/24556Aggregation; Duplicate elimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • G06F16/2456Join operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a data analysis method, apparatus, computer device, storage medium and computer program product. The method implements query analysis of data and comprises: acquiring a data analysis query statement directed at different data sources; performing operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, the push-down query statements comprising a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources; and registering each target query statement with a computing engine and executing the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement. Because the data from multiple data sources is analyzed through the computing engine, the computing engine can be chosen freely according to the data sources, which effectively improves the extensibility of joint analysis across multiple data sources.

Description

Data analysis method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data analysis method, apparatus, computer device, and storage medium.
Background
With the development of computer technology, cloud technology has emerged. Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software and networks within a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Database technology plays an important role in cloud technology. A database can be regarded simply as an electronic filing cabinet, that is, a place for storing electronic files, in which a user can add, query, update and delete data. A "database" is a collection of data that is stored together in a certain manner, can be shared by multiple users, has as little redundancy as possible, and is independent of applications. In data analysis, joint analysis of data stored in different data sources is vital to realizing the full value of the data.
At present, joint analysis of data stored in different data sources generally relies on a computing-engine coupling technique: one computing engine is selected, and a different connector is customized for each data source. When joint analysis is required, the computing engine pulls the full data set through the connectors and performs the analysis inside the engine. However, this approach is bound to a single computing engine, which hinders extension and cannot exploit the characteristics of different computing engines.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data analysis method, apparatus, computer device, computer-readable storage medium and computer program product that can effectively improve the extensibility of joint analysis.
In a first aspect, the present application provides a data analysis method. The method comprises the following steps:
acquiring a data analysis query statement directed at different data sources;
performing operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources;
registering each target query statement with a computing engine, and executing the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
In a second aspect, the present application further provides a data analysis apparatus. The apparatus comprises:
a data acquisition module, configured to acquire a data analysis query statement directed at different data sources;
a statement analysis module, configured to perform operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources;
a data analysis module, configured to register each target query statement with a computing engine, and to execute the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
In a third aspect, the present application further provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the following steps:
acquiring a data analysis query statement directed at different data sources;
performing operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources;
registering each target query statement with a computing engine, and executing the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the following steps:
acquiring a data analysis query statement directed at different data sources;
performing operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources;
registering each target query statement with a computing engine, and executing the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, performs the following steps:
acquiring a data analysis query statement directed at different data sources;
performing operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources;
registering each target query statement with a computing engine, and executing the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
With the above data analysis method, apparatus, computer device, storage medium and computer program product, a data analysis query statement directed at different data sources is acquired; operator pushdown is performed on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, the push-down query statements comprising a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources; and each target query statement is registered with a computing engine, which executes the summary query statement to obtain a data analysis result corresponding to the data analysis query statement. After the data analysis query statement is acquired, it is resolved into a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources, and the data from the multiple data sources is then analyzed through the computing engine. The computing engine can therefore be chosen freely according to the data sources, which effectively improves the extensibility of joint analysis across multiple data sources.
Drawings
FIG. 1 is a diagram of an application environment for a data analysis method in one embodiment;
FIG. 2 is a flow chart of a method of data analysis in one embodiment;
FIG. 3 is a flowchart illustrating a step of obtaining a push-down query statement matching a data analysis query statement by operator push-down in one embodiment;
FIG. 4 is a schematic diagram of a logical plan tree before and after an operator pushdown rule is applied in one embodiment;
FIG. 5 is a flow diagram that illustrates the steps of generating a push-down query statement via a visitor pattern in one embodiment;
FIG. 6 is a flowchart illustrating steps for recursively constructing target query statements corresponding to the same data source in one embodiment;
FIG. 7 is a schematic diagram of a step of generating a corresponding target query statement for Join in one embodiment;
FIG. 8 is a flowchart illustrating steps for obtaining data analysis results in one embodiment;
FIG. 9 is a schematic diagram of a connection of a computing engine to a data source in one embodiment;
FIG. 10 is a flowchart illustrating steps for registering a target query statement and calculating a final data analysis result in one embodiment;
FIG. 11 is an interaction diagram that creates MySQL data sources in one embodiment;
FIG. 12 is a flow chart of a method of data analysis in another embodiment;
FIG. 13 is a block diagram showing the structure of a data analysis device in one embodiment;
FIG. 14 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software and networks within a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Cloud technology is the general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied under the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites and portal websites, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, every item may in the future carry its own identifier, which must be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong back-end system support, which can only be realized through cloud computing. The present application relates to database technology within cloud technology.
A database can be regarded simply as an electronic filing cabinet, that is, a place for storing electronic files, in which a user can add, query, update and delete data. A "database" is a collection of data that is stored together in a certain manner, can be shared by multiple users, has as little redundancy as possible, and is independent of applications. A database management system (DBMS) is a computer software system designed for managing databases, and generally has basic functions such as storage, retrieval, security and backup. Database management systems may be classified according to the database model they support, e.g., relational or XML (Extensible Markup Language); according to the type of computer they support, e.g., server cluster or mobile phone; according to the query language used, e.g., SQL (Structured Query Language) or XQuery; according to their performance emphasis, e.g., maximum scale or maximum speed; or in other ways. Regardless of the classification used, some DBMSs span categories, for example by supporting multiple query languages simultaneously.
Technical terms used in the present application also include:
Data source: a data storage system, such as the traditional relational databases Oracle and MySQL and the big data storage systems Hive and HBase.
Joint analysis: data analysis performed on data held in different data sources.
JDBC (Java Database Connectivity): an application programming interface that specifies how a client program accesses a database, and provides methods for querying and updating data.
Hive: a big data storage system and a big data distributed computing engine.
Spark: a big data distributed computing engine that has become the industry standard for offline big data processing.
Presto: a memory-based MPP (massively parallel processing) distributed computing engine.
RBO (Rule-Based Optimization): transforms a relational expression according to optimization rules, pruning the original expression, and generates the final execution plan after a series of transformations.
The data analysis method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 via a network, and a computing engine is integrated on the server 104. The first data source 106, the second data source 108 and the third data source 110 store data that the server 104 needs to process; they may be integrated on the server 104 or located on a cloud or on other servers. When a worker at the terminal 102 needs to combine data from several different data sources for data analysis, the terminal can send a data analysis query statement directed at the different data sources to the server 104. The server 104 acquires the data analysis query statement; performs operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, the push-down query statements comprising a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources; registers each target query statement with a computing engine and executes the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement; and then feeds the data analysis result back to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet of Things device or portable wearable device. The Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart in-vehicle device or the like, and the portable wearable device may be a smart watch, smart bracelet, headset or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a data analysis method is provided. The method is described as applied to the server 104 in FIG. 1 and includes the following steps:
Step 201, acquire a data analysis query statement directed at different data sources.
A data source is the origin of the data analyzed by the data analysis method of the present application, specifically a data storage system such as the traditional relational databases Oracle and MySQL or the big data storage systems Hive and HBase. Before a data source is used, its relevant information must be registered. The data analysis query statement is a statement that operates on a database; through it, the data in the database can be queried and analyzed. In one embodiment, the data analysis query statement may be an SQL (Structured Query Language) statement, and it may specify which of the data sources connected to the server 104 are to be used in the data analysis.
Specifically, when a worker at the terminal 102 needs to combine data from several data sources for data analysis, the worker can write the corresponding data analysis query statement and send it to the server 104 to start the data analysis process. The server 104 receives the data analysis query statement and begins the query accordingly. During the query, the data analysis query statement, which queries different data sources, is split into target query statements for the individual data sources, whose results are then aggregated and analyzed. This removes the dependence on a particular computing engine and its connectors and ensures the extensibility of the data analysis process.
Step 203, perform operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources.
The logical plan tree is a syntax tree, that is, the tree formed when a statement is derived according to grammar rules. During data analysis, a data analysis query statement can be converted into a syntax tree, and analysis of the syntax tree yields an execution plan. In big data analysis, for better extensibility, fault tolerance and availability, the execution plan is divided into a logical execution plan and a physical execution plan: the logical plan is split according to the characteristics of the query, the resulting fragments can be distributed to parallel nodes, and the logical execution plan is finally converted into a physical execution plan to run the query. Executing a data analysis query statement resembles a factory processing pipeline: the statement is processed layer by layer until the expected result is obtained, and each operator is like one process step. For SQL statements, for example, operators implement functions such as selection, projection and join. The operators may specifically include: data source (DataSource), selection (Selection), projection (Project), join (Join), sort (Sort), grouping (Aggregation), sub-query (sub) and the like. Pushdown is a way of optimizing database statements, typically used in the query phase, by which operators in the logical plan tree can be rearranged. The push-down query statements comprise a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources. Each target query statement performs query analysis on its single matching data source; by performing operator pushdown on the logical plan tree, the data analysis query statement directed at different data sources can be effectively split into multiple target query statements, so that the query can be carried out. The summary query statement combines the query analysis results of the target query statements.
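To make the notion of a logical plan tree concrete, the following is a minimal sketch in Python of how operators might be arranged in a tree for a query that joins two data sources. The `PlanNode` class, the `render` helper, and the table names `mysql.users` and `hive.orders` are hypothetical names chosen for illustration; they do not come from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """One operator in a logical plan tree (illustrative only)."""
    op: str                       # operator kind, e.g. "DataSource", "Selection", "Join"
    detail: str = ""              # operator argument, e.g. a predicate or a table name
    children: List["PlanNode"] = field(default_factory=list)


def render(node: PlanNode, depth: int = 0) -> str:
    """Return the tree as text, one operator per line, children indented."""
    line = "  " * depth + node.op + (f"({node.detail})" if node.detail else "")
    return "\n".join([line] + [render(child, depth + 1) for child in node.children])


# Hypothetical query spanning two data sources:
#   SELECT u.name, o.amount
#   FROM mysql.users u JOIN hive.orders o ON u.id = o.uid
#   WHERE o.amount > 100
plan = PlanNode("Projection", "u.name, o.amount", [
    PlanNode("Selection", "o.amount > 100", [
        PlanNode("Join", "u.id = o.uid", [
            PlanNode("DataSource", "mysql.users"),
            PlanNode("DataSource", "hive.orders"),
        ]),
    ]),
])

print(render(plan))
```

In this tree the Selection sits above the Join even though its predicate only touches `hive.orders`; moving it down next to that DataSource leaf is the kind of rearrangement that pushdown performs.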
Specifically, after acquiring the data analysis query statement, the server 104 converts it into a logical plan tree and then performs operator pushdown, thereby splitting the data analysis query statement directed at different data sources into a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources. The data can then be queried through the corresponding target query statements without any custom connector, and the query results of the target query statements are finally combined by the summary query statement to obtain the final data analysis result.
Step 205, register each target query statement with a computing engine, and execute the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
A computing engine is a program dedicated to processing data. For example, behind SQL is the computing engine of a database; since computation and storage in a database are usually integrated, they are collectively called the database engine. Computing engines are classified into stream processing engines, such as Storm, Samza, Flink and Spark, and batch processing engines, such as Spark, Hive, Pig and Flink. Apache Flink and Apache Spark support both stream processing and batch processing.
Specifically, after the push-down query statements are obtained, the corresponding data query analysis can be performed based on them. To perform the analysis, each target query statement is registered with a computing engine, which may be any computing engine; different connectors need not be customized for different data sources. Registering each target query statement with the computing engine allows the query analysis for a data source to be carried out in the data source corresponding to that target query statement, and executing the summary query statement through the computing engine then combines the analysis results of all the data sources into the data analysis result corresponding to the data analysis query statement.
With the above data analysis method, a data analysis query statement directed at different data sources is acquired; operator pushdown is performed on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matching the data analysis query statement, the push-down query statements comprising a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources; and each target query statement is registered with a computing engine, which executes the summary query statement to obtain a data analysis result corresponding to the data analysis query statement. After the data analysis query statement is acquired, it is resolved into a summary query statement and a plurality of target query statements in one-to-one correspondence with the data sources, and the data from the multiple data sources is then analyzed through the computing engine. The computing engine can therefore be chosen freely according to the data sources, which effectively improves the extensibility of joint analysis across multiple data sources.
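The flow of steps 201 to 205 can be sketched end to end with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for two different data sources, a third in-memory database stands in for the computing engine, and "registration" is simplified to copying each target query's result into a table of the engine. The table names, the queries and the `run` function are hypothetical, and the target and summary queries are written by hand rather than derived by operator pushdown.

```python
import sqlite3

# Two independent in-memory databases stand in for two data sources.
src_a = sqlite3.connect(":memory:")
src_a.execute("CREATE TABLE orders (uid INTEGER, amount INTEGER)")
src_a.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50), (1, 70), (2, 30)])

src_b = sqlite3.connect(":memory:")
src_b.execute("CREATE TABLE users (id INTEGER, name TEXT)")
src_b.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

# Assumed push-down result: one target query per data source, plus one
# summary query that refers to the registered names t_orders and t_users.
target_queries = {
    "t_orders": (src_a, "SELECT uid, SUM(amount) AS total FROM orders GROUP BY uid"),
    "t_users": (src_b, "SELECT id, name FROM users"),
}
summary_query = (
    "SELECT u.name, o.total FROM t_users u "
    "JOIN t_orders o ON u.id = o.uid ORDER BY u.name"
)


def run(targets, summary):
    """Register each target query's result with the engine, then run the summary."""
    engine = sqlite3.connect(":memory:")      # stand-in for the computing engine
    for name, (source, sql) in targets.items():
        cursor = source.execute(sql)          # the aggregation runs inside the source
        columns = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
        engine.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        engine.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
    return engine.execute(summary).fetchall()


result = run(target_queries, summary_query)
print(result)  # [('ann', 120), ('bob', 30)]
```

Only the grouped totals leave the first source, not the raw order rows, which is the practical benefit of pushing the aggregation down. Swapping the stand-in engine for another one would leave the target queries unchanged.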
In one embodiment, as shown in FIG. 3, step 203 comprises:
Step 302, parse the data analysis query statement to generate the logical plan tree.
Step 304, transform the logical plan tree according to the operator pushdown rule corresponding to the logical plan tree so as to group together operators of the same data source.
Step 306, generate, from the transformed logical plan tree, the summary query statement and the plurality of target query statements in one-to-one correspondence with the data sources.
Parsing the data analysis query statement falls within the scope of compilers and includes lexical analysis, syntax and semantic analysis, optimization, execution code generation and so on. Generating the logical plan tree is the logical analysis part of semantic analysis, which determines what the input data analysis query statement actually does, that is, which operations it performs. In general, a data analysis query statement has one input and one output: the input data is processed by the data analysis to produce the output data. The tree generated at the compilation stage can therefore be traversed in the execution order of the data analysis query statement; each operation encountered generates the corresponding operator, and each expression encountered invokes the corresponding expression analysis. The operator pushdown rules specifically include AggregationPushDownRule, JoinPushDownRule, UnionPushDownRule, LimitPushDownRule, FilterPushDownRule, ProjectPushDownRule, OrderbyPushDownRule and the like. They are used to group together operators that belong to the same data source, and different operator pushdown rules are applied to different logical plan tree scenarios.
Specifically, in generating the push-down query statements, splitting a data analysis query statement directed at different data sources into target query statements in one-to-one correspondence with the data sources requires decomposing the statement and then grouping together the operators that belong to the same data source. The logical plan tree is therefore generated by parsing and then transformed according to the operator pushdown rule that matches it, so that operators of the same data source in the tree are grouped together. The summary query statement and the target query statements in one-to-one correspondence with the data sources are then generated by push-down processing. In this embodiment, transforming the logical plan tree by operator pushdown rules groups together the operators of the same data source, which effectively ensures the validity of the operator pushdown process.
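One way to picture "grouping together operators of the same data source" is to look for the largest subtrees of the plan that touch exactly one data source: each such subtree could become a target query statement, and whatever remains above them would belong to the summary query statement. The sketch below illustrates that idea in Python; the `Node` class and the `sources` and `split` functions are hypothetical helpers, not the application's implementation, and the sketch assumes the pushdown rules have already moved single-source operators as far down as possible.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    op: str
    source: str = ""                       # set only on DataSource leaves
    children: List["Node"] = field(default_factory=list)


def sources(node: Node) -> Set[str]:
    """Return the set of data sources referenced beneath a node."""
    if node.op == "DataSource":
        return {node.source}
    found: Set[str] = set()
    for child in node.children:
        found |= sources(child)
    return found


def split(node: Node, targets: List[Node]) -> None:
    """Collect the maximal subtrees that touch exactly one data source."""
    if len(sources(node)) == 1:
        targets.append(node)               # whole subtree can run inside one source
        return
    for child in node.children:
        split(child, targets)              # operator spans sources: stays in the summary


plan = Node("Aggregation", children=[
    Node("Join", children=[
        Node("Selection", children=[Node("DataSource", "mysql")]),
        Node("Projection", children=[Node("DataSource", "hive")]),
    ]),
])

targets: List[Node] = []
split(plan, targets)
print([(t.op, sorted(sources(t))) for t in targets])
# [('Selection', ['mysql']), ('Projection', ['hive'])]
```

Here the Join and the Aggregation span both sources, so they remain for the summary query, while the Selection and Projection subtrees each map to one source.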
In one embodiment, the method further comprises: matching the logical plan tree through a rule-based optimizer to obtain the operator pushdown rule corresponding to the logical plan tree.
The rule-based optimizer, i.e., RBO (Rule-Based Optimization), includes different optimization rules such as AggregationPushDownRule, JoinPushDownRule, UnionPushDownRule, LimitPushDownRule, FilterPushDownRule, ProjectPushDownRule, and OrderbyPushDownRule. The logical plan tree may be transformed according to these optimization rules to generate the final execution plan.
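To make the idea of rule lookup concrete, here is a minimal sketch, assuming that each rule is keyed by a (parent operator, child operator) template. Only the Aggregation-Union pairing appears in the application; the other two templates, the tuple-based plan encoding, and the `match_rule` helper are illustrative assumptions.

```python
from typing import Optional, Tuple

# A plan is modelled as nested tuples: (operator_name, child, child, ...).
Plan = Tuple

# Template (parent operator, child operator) -> rule that handles it.
# Only the first entry is described in the text; the rest are assumed.
RULE_TEMPLATES = {
    ("Aggregation", "Union"): "AggregationPushDownRule",
    ("Filter", "Join"): "FilterPushDownRule",
    ("Limit", "Union"): "LimitPushDownRule",
}


def match_rule(plan: Plan) -> Optional[str]:
    """Return the first rule whose template matches a parent/child pair."""
    parent, *children = plan
    for child in children:
        rule = RULE_TEMPLATES.get((parent, child[0]))
        if rule is not None:
            return rule
    for child in children:                   # no match here: search deeper
        rule = match_rule(child)
        if rule is not None:
            return rule
    return None


plan = ("Project", ("Aggregation", ("Union", ("Scan",), ("Scan",))))
print(match_rule(plan))
```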
Specifically, after the logical plan tree is obtained, rule matching can be performed by the rule-based optimizer. Different logical plan trees correspond to different optimization rules, and once the logical plan tree has been transformed by the corresponding optimization rule, the operators under the same data source are effectively grouped together, thereby realizing operator pushdown. As shown in FIG. 4, taking the pushdown of an Aggregation to a Union as an example, the logical plan tree is first matched against a template. When the logical plan tree conforms to the Aggregation-Union template, the corresponding operator pushdown rule is AggregationPushDownRule. Through AggregationPushDownRule, the Aggregation can then be pushed down to the Union operator, and a new Aggregation operator is generated at the same time so that the result remains correct. In this embodiment, the rule-based optimizer performs the rule matching for the logical plan tree, so that the corresponding operator pushdown rule can be found effectively, operator pushdown is realized, and the corresponding push-down query statements are obtained.
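The Aggregation-Union rewrite can be sketched as follows. This is a simplified model under stated assumptions: the `Op` class and helper names are hypothetical, the new Aggregation is assumed to sit above the Union, and the final check uses SUM because it can be recombined from partial results (an aggregate such as AVG would need extra handling).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """Toy plan operator; `name` is the operator kind (hypothetical model)."""
    name: str
    children: List["Op"] = field(default_factory=list)


def matches_aggregation_union(node: Op) -> bool:
    """Template check: an Aggregation directly above a Union."""
    return (node.name == "Aggregation"
            and len(node.children) == 1
            and node.children[0].name == "Union")


def push_aggregation_below_union(node: Op) -> Op:
    """Copy the aggregation under each Union branch and keep a new one on top."""
    if not matches_aggregation_union(node):
        return node                          # rule does not apply
    union = node.children[0]
    partials = [Op("Aggregation", [branch]) for branch in union.children]
    return Op("Aggregation", [Op("Union", partials)])


def render(node: Op) -> str:
    """Render the tree as nested calls, e.g. Aggregation(Union(...))."""
    if not node.children:
        return node.name
    return f"{node.name}({', '.join(render(c) for c in node.children)})"


before = Op("Aggregation", [Op("Union", [Op("ScanA"), Op("ScanB")])])
after = push_aggregation_below_union(before)
print(render(after))

# Why the extra top-level aggregation keeps the result correct, shown with SUM:
rows_a, rows_b = [1, 2, 3], [10, 20]
assert sum(rows_a + rows_b) == sum([sum(rows_a), sum(rows_b)])
```

After the rewrite, each branch under the Union carries its own Aggregation, so each branch can be sent to its data source as a single statement.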
In one embodiment, as shown in FIG. 5, step 306 includes:
Step 502, constructing an abstract syntax tree layer by layer through the visitor pattern based on the transformed logical plan tree.
Step 504, recursively generating, based on the abstract syntax tree, multiple target query statements that correspond one-to-one with the data sources.
Step 506, generating a summary query statement according to the target query statements.
The visitor pattern (Visitor pattern) represents an operation to be performed on the elements of an object structure; it allows new operations on the elements to be defined without changing their classes. For example, when you visit a friend's home you are the visitor: the friend accepts your visit, you listen to what the friend describes, and you then make a judgment based on that description. This is the visitor pattern. The visitor pattern encapsulates operations that act on the elements of a data structure, so that new operations on those elements can be defined without changing the data structure. An abstract syntax tree can be understood as an organized form of a database statement. For example, for an SQL statement, the corresponding abstract syntax tree is SqlNode. SqlNode is the form in which dynamic SQL configuration is organized in a program, and each XML node is parsed into a corresponding SqlNode object. As for recursion, the programming technique in which a program calls itself is called recursion. Recursion is widely used as an algorithm in programming languages. A procedure or function that directly or indirectly calls itself in its definition or description typically converts a large, complex problem, layer by layer, into a smaller problem similar to the original one and solves that instead. A recursive strategy can describe the repeated computation needed to solve a problem with only a small amount of code, greatly reducing the size of the program. The power of recursion lies in defining an infinite set of objects with a finite statement. In general, recursion requires a boundary condition, a recursive advance phase, and a recursive return phase. While the boundary condition is not satisfied, the recursion advances; when the boundary condition is satisfied, the recursion returns. The present application mainly builds the final target query statement recursively.
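A minimal sketch of the visitor pattern combined with recursion is given below. The `Scan` and `Filter` node classes and the two visitors are hypothetical names chosen for illustration. The sketch shows two independent operations defined over the same node classes without modifying them, with the leaf node acting as the boundary condition of the recursion.

```python
class Scan:
    def __init__(self, table):
        self.table = table

    def accept(self, visitor):
        return visitor.visit_scan(self)


class Filter:
    def __init__(self, child, condition):
        self.child = child
        self.condition = condition

    def accept(self, visitor):
        return visitor.visit_filter(self)


class DepthVisitor:
    """One operation, added without touching Scan or Filter."""

    def visit_scan(self, node):
        return 1                             # boundary condition: leaf node

    def visit_filter(self, node):
        return 1 + node.child.accept(self)   # recursive advance, then return


class TableVisitor:
    """A second operation over the same, unchanged node classes."""

    def visit_scan(self, node):
        return [node.table]

    def visit_filter(self, node):
        return node.child.accept(self)


tree = Filter(Filter(Scan("orders"), "amount > 10"), "region = 'EU'")
print(tree.accept(DepthVisitor()), tree.accept(TableVisitor()))
```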
Specifically, generating a target query statement can be regarded as reversely deriving one database query statement from the operators of the same data source. Transforming the logical plan tree according to the operator pushdown rules groups the operators of the same data source together, and because they are already grouped, they can be converted back into a single database query statement. Through the visitor pattern, a complete abstract syntax tree can be built layer by layer from the bottom up based on the operators of the same data source, and a complete target query statement for a single data source is then generated recursively. After the target query statements corresponding to all of the data sources have been obtained, the summary query statement is generated from them; the summary query statement is used to combine the query results of the target query statements into the final data analysis result. In this embodiment, the abstract syntax tree is built layer by layer through the visitor pattern, so that the multiple target query statements that correspond one-to-one with the data sources are constructed effectively, which helps ensure the validity of the target query statements and the accuracy of the data query results.
In one embodiment, as shown in FIG. 6, step 502 includes:
Step 601, generating, by visiting the abstract syntax tree, the child node query statements corresponding to a data source.
Step 603, recursively constructing the target query statement corresponding to the same data source from the child node query statements corresponding to that data source.
The abstract syntax tree of a single data source, built layer by layer through the visitor pattern, may comprise a plurality of child nodes, and a child node query statement is the query statement corresponding to a single child node.
Specifically, when constructing a target query statement, the abstract syntax tree corresponding to a single data source is first determined, and the child node query statements corresponding to that data source are generated by visiting the abstract syntax tree. The target query statement corresponding to the data source is then constructed recursively from those child node query statements. As shown in FIG. 7, in one embodiment, the data analysis query statement is an SQL statement. Taking Join as an example, when the target query statement corresponding to a Join needs to be generated, the SQL statements for the left child node and the right child node of the Join are generated separately by visiting each of them, and the final Join SQL is then constructed from the two. The pseudocode for generating the corresponding push-down SQL is as follows:
visit(JdbcJoin joinNode):
    SqlNode leftSql = visitChild(joinNode.leftNode)    // SQL for the left child node
    SqlNode rightSql = visitChild(joinNode.rightNode)  // SQL for the right child node
    return JoinSqlNode(leftSql, rightSql, joinNode.joinType)
In this embodiment, the target query statement corresponding to a data source is constructed recursively from the child node query statements corresponding to that data source, so that the target query statement can be constructed effectively and the accuracy of the data analysis processing is ensured.
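The Join pseudocode above can be turned into a small runnable Python sketch. This is not the implementation of the application: the node classes, the string-based SQL output (instead of SqlNode objects), the aliases `l`, `r`, and `t1`, and the extra `condition` fields are assumptions made so that the example is self-contained.

```python
from dataclasses import dataclass


@dataclass
class JdbcScan:
    table: str


@dataclass
class JdbcFilter:
    child: object
    condition: str


@dataclass
class JdbcJoin:
    left: object
    right: object
    join_type: str
    condition: str


class SqlGenerator:
    """Builds SQL text bottom-up; the visit method is chosen per node class."""

    def __init__(self):
        self._alias_count = 0

    def _alias(self) -> str:
        self._alias_count += 1
        return f"t{self._alias_count}"

    def visit(self, node) -> str:
        method = getattr(self, "visit_" + type(node).__name__)
        return method(node)

    def visit_JdbcScan(self, node: JdbcScan) -> str:
        return f"SELECT * FROM {node.table}"           # recursion bottoms out here

    def visit_JdbcFilter(self, node: JdbcFilter) -> str:
        child_sql = self.visit(node.child)
        return f"SELECT * FROM ({child_sql}) AS {self._alias()} WHERE {node.condition}"

    def visit_JdbcJoin(self, node: JdbcJoin) -> str:
        left_sql = self.visit(node.left)               # SQL for the left child
        right_sql = self.visit(node.right)             # SQL for the right child
        return (f"SELECT * FROM ({left_sql}) AS l "
                f"{node.join_type} JOIN ({right_sql}) AS r ON {node.condition}")


plan = JdbcJoin(
    left=JdbcFilter(JdbcScan("orders"), "amount > 10"),
    right=JdbcScan("customers"),
    join_type="INNER",
    condition="l.customer_id = r.id",
)
sql = SqlGenerator().visit(plan)
print(sql)
```

Each `visit_*` method only combines the SQL of its children, so supporting another operator means adding one method rather than changing the node classes.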
In one embodiment, as shown in fig. 8, step 205 includes:
Step 801, registering the push-down query statements with a computing engine in the form of views to obtain the data query result corresponding to each data source.
Step 803, executing the summary query statement through the computing engine to perform data analysis on the data query results corresponding to the data sources, thereby obtaining the data analysis result corresponding to the data analysis query statement.
A view is a virtual table in a database whose contents are defined by a query. Like a real table, a view contains a series of named columns and rows of data. However, a view does not exist in the database as a stored set of data values. Its rows and columns come from the tables referenced by the query that defines the view and are generated dynamically when the view is referenced.
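The dynamic nature of a view can be seen in a short example using Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in; the table and view names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("EU", 10), ("US", 20)])

# The view stores only its defining query, not a copy of the rows.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)
first = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()

# A row added to the base table shows up the next time the view is read.
conn.execute("INSERT INTO sales VALUES ('EU', 5)")
second = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
print(first, second)
```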
Specifically, the pushed-down target query statements for the different data sources can be registered as views in the computing engine of the server, so that the computing engine obtains the query result corresponding to each target query statement. The computing engine then executes the summary query statement to combine the query results of the target query statements into the final data analysis result. In one embodiment, as shown in FIG. 9, the data analysis query statement is an SQL statement. Each data source connected to the server 104 provides a standard JDBC interface. The computation that belongs to a data source is pushed down to that data source through JDBC, and after the computation completes, the result is pulled back through JDBC. Based on the results pulled from the different data sources through JDBC, the computing engine completes the final joint computation. As shown in FIG. 10, assuming that N JDBC data sources are involved in one SQL query, at least N+1 SQL statements are eventually generated (3 SQL statements in the figure, corresponding to 3 views). N of these SQL statements are pushed down to the data sources (one SQL statement per data source, each mapped to one view in the computing engine), and the last one directs the computing engine to combine the results of the views (i.e., the results of the pushed-down computation) into the returned result. In this embodiment, by registering the push-down query statements with the computing engine in the form of views, the computation of the push-down query statements can be completed effectively in the computing engine, which helps ensure the validity of the final data analysis.
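The N+1 statement flow can be sketched with the standard `sqlite3` module. Everything here is a stand-in: in-memory SQLite databases play the role of the JDBC data sources and of the computing engine, plain tables in the engine play the role of the registered views, and the table names, view names, and SQL text are invented for the example.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory database standing in for one remote data source."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db


sources = {
    "datasource1": make_source([("EU", 10), ("US", 20)]),
    "datasource2": make_source([("EU", 5), ("EU", 7)]),
}

# N target statements, one per data source (here the same text for both).
target_sql = {
    name: "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    for name in sources
}

# The "+1" statement: combines the per-source views inside the engine.
summary_sql = (
    "SELECT region, SUM(total) AS total FROM ("
    "SELECT * FROM view_datasource1 UNION ALL SELECT * FROM view_datasource2"
    ") GROUP BY region ORDER BY region"
)

engine = sqlite3.connect(":memory:")
for name, db in sources.items():
    rows = db.execute(target_sql[name]).fetchall()     # pushed-down computation
    # A plain table stands in for the view registered in the engine.
    engine.execute(f"CREATE TABLE view_{name} (region TEXT, total INTEGER)")
    engine.executemany(f"INSERT INTO view_{name} VALUES (?, ?)", rows)

result = engine.execute(summary_sql).fetchall()
print(len(target_sql) + 1, result)
```

With two sources the sketch issues three statements in total: each source only returns its pre-aggregated rows, and the engine sums the partial totals.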
The present application also provides an application scenario to which the above data analysis method is applied. Specifically, the data analysis method is applied in this scenario as follows:
A user may store, query, and analyze merchandise sales data through a database service, and may keep such data in a plurality of different data sources (databases). Before a data source can be used, information about it must be registered. FIG. 11 shows an interaction diagram for creating a MySQL data source. After creation is complete, the data in the data source can be queried by the name of the data source (datasource1). When the user needs to analyze the data of multiple data sources in a single pass, for example to analyze sales based on sales data stored in multiple databases, the user can write, on the terminal, one SQL statement that queries and analyzes the data of the multiple data sources, and then send the SQL statement to the server so that the server performs the corresponding query and analysis. The process by which the server handles the SQL statement and performs the data analysis is shown in FIG. 12. The server first parses the received SQL statement to generate a logical plan tree and then applies the operator pushdown rules to push the SQL statement down, thereby reversely generating N+1 push-down SQL statements, where N is the number of data sources; that is, the push-down SQL statements include multiple SQL statements that correspond one-to-one with the data sources plus one summary SQL statement. The server then creates N JDBC views in the computing engine according to the push-down SQL statements and submits the summary SQL statement to the computing engine, so that the computing engine performs the final data analysis and obtains the data analysis result corresponding to the SQL statement.
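Registering data sources and then locating them by name in a query might look like the sketch below. It assumes, purely for illustration, that tables are referenced as `source_name.table_name`; the registry structure, the helper names, and the JDBC URLs (with placeholder hosts) are hypothetical and are not specified by the application.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSourceInfo:
    """Connection details recorded at registration time (illustrative fields)."""
    name: str
    jdbc_url: str


registry = {}


def register(name: str, jdbc_url: str) -> None:
    """Record a data source under a unique name."""
    if name in registry:
        raise ValueError(f"data source {name!r} is already registered")
    registry[name] = DataSourceInfo(name, jdbc_url)


def referenced_sources(sql: str) -> list:
    """Return registered sources that appear as the prefix of `source.table`."""
    prefixes = re.findall(r"\b(\w+)\.\w+", sql)
    seen = []
    for prefix in prefixes:
        if prefix in registry and prefix not in seen:
            seen.append(prefix)
    return seen


register("datasource1", "jdbc:mysql://host1:3306/shop")   # placeholder host
register("datasource2", "jdbc:mysql://host2:3306/shop")   # placeholder host
query = ("SELECT o.region, SUM(o.amount) FROM datasource1.orders o "
         "JOIN datasource2.customers c ON o.customer_id = c.id GROUP BY o.region")
print(referenced_sources(query))
```

Here the number of referenced sources is N = 2, so the scenario above would yield three push-down SQL statements.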
Parsing the received SQL statement to generate a logical plan tree and then applying the operator pushdown rules to reversely generate the N+1 push-down SQL statements specifically comprises: parsing the SQL statement to generate a logical plan tree; transforming the logical plan tree according to the operator pushdown rule corresponding to the logical plan tree so as to group together the operators of the same data source; and generating the push-down SQL statements according to the transformed logical plan tree. The operator pushdown rule is obtained by matching the logical plan tree through a rule-based optimizer. Generating the push-down SQL statements specifically comprises: constructing an abstract syntax tree layer by layer through the visitor pattern based on the transformed logical plan tree; generating, by visiting the abstract syntax tree, the child node query statements corresponding to a data source; recursively constructing the target query statement corresponding to the same data source from the child node query statements corresponding to that data source; and generating the summary query statement according to the target query statements. Finally, obtaining the analysis result comprises: registering the push-down query statements with the computing engine in the form of views to obtain the data query result corresponding to each data source; and executing the summary query statement through the computing engine to perform data analysis on the data query results corresponding to the data sources and obtain the data analysis result corresponding to the data analysis query statement.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a data analysis device for realizing the data analysis method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the data analysis device provided below may refer to the limitation of the data analysis method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 13, there is provided a data analysis apparatus including:
a data acquisition module 1302, configured to acquire a data analysis query statement for different data sources;
a statement analysis module 1304, configured to perform operator pushdown on a logical plan tree that represents the data analysis query statement, so as to obtain push-down query statements that match the data analysis query statement, where the push-down query statements include a summary query statement and multiple target query statements that correspond one-to-one with the data sources;
a data analysis module 1306, configured to register each target query statement with the computing engine, and execute the summary query statement through the computing engine to obtain the data analysis result corresponding to the data analysis query statement.
In one embodiment, the statement analysis module 1304 is specifically configured to: parse the data analysis query statement to generate a logical plan tree; transform the logical plan tree according to the operator pushdown rule corresponding to the logical plan tree so as to group together the operators of the same data source; and generate, according to the transformed logical plan tree, the summary query statement and the multiple target query statements that correspond one-to-one with the data sources.
In one embodiment, the statement analysis module 1304 is further configured to: match the logical plan tree through a rule-based optimizer to obtain the operator pushdown rule corresponding to the logical plan tree.
In one embodiment, the statement analysis module 1304 is further configured to: construct an abstract syntax tree layer by layer through the visitor pattern based on the transformed logical plan tree; recursively generate, based on the abstract syntax tree, the multiple target query statements that correspond one-to-one with the data sources; and generate the summary query statement according to the target query statements.
In one embodiment, the statement analysis module 1304 is further configured to: generate, by visiting the abstract syntax tree, the child node query statements corresponding to a data source; and recursively construct the target query statement corresponding to the same data source from the child node query statements corresponding to that data source.
In one embodiment, the data analysis module 1306 is specifically configured to: register the push-down query statements with the computing engine in the form of views to obtain the data query result corresponding to each data source; and execute the summary query statement through the computing engine to perform data analysis on the data query results corresponding to the data sources and obtain the data analysis result corresponding to the data analysis query statement.
The modules in the above data analysis apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or be independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 14. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing data analysis related data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data analysis method.
It will be appreciated by those skilled in the art that the structure shown in fig. 14 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application. They are described in relative detail but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims (10)

1. A method of data analysis, the method comprising:
acquiring a data analysis query statement for different data sources;
performing operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matched with the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements that correspond one-to-one with the data sources;
registering each target query statement with a computing engine, and executing the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
2. The method of claim 1, wherein performing an operator pushdown on a logical plan tree used to characterize the data analysis query statement to obtain a pushdown query statement that matches the data analysis query statement comprises:
parsing the data analysis query statement to generate a logical plan tree;
transforming the logical plan tree according to an operator pushdown rule corresponding to the logical plan tree so as to group together operators of the same data source;
and generating, according to the transformed logical plan tree, the summary query statement and the plurality of target query statements that correspond one-to-one with the data sources.
3. The method according to claim 2, wherein the method further comprises:
matching the logical plan tree through a rule-based optimizer to obtain the operator pushdown rule corresponding to the logical plan tree.
4. The method of claim 2, wherein generating, according to the transformed logical plan tree, the summary query statement and the plurality of target query statements that correspond one-to-one with the data sources comprises:
constructing an abstract syntax tree layer by layer through a visitor pattern based on the transformed logical plan tree;
recursively generating, based on the abstract syntax tree, the plurality of target query statements that correspond one-to-one with the data sources;
and generating the summary query statement according to the target query statements.
5. The method of claim 4, wherein recursively generating, based on the abstract syntax tree, the plurality of target query statements that correspond one-to-one with the data sources comprises:
generating, by visiting the abstract syntax tree, child node query statements corresponding to a data source;
and recursively constructing the target query statement corresponding to the same data source from the child node query statements corresponding to that data source.
6. The method of claim 1, wherein registering each target query statement with a computing engine, and executing the summary query statement through the computing engine to obtain the data analysis result corresponding to the data analysis query statement comprises:
registering the push-down query statements with the computing engine in the form of views to obtain a data query result corresponding to each data source;
and executing the summary query statement through the computing engine to perform data analysis on the data query results corresponding to the data sources and obtain the data analysis result corresponding to the data analysis query statement.
7. A data analysis device, the device comprising:
a data acquisition module, configured to acquire a data analysis query statement for different data sources;
a statement analysis module, configured to perform operator pushdown on a logical plan tree that represents the data analysis query statement to obtain push-down query statements matched with the data analysis query statement, wherein the push-down query statements comprise a summary query statement and a plurality of target query statements that correspond one-to-one with the data sources;
a data analysis module, configured to register each target query statement with a computing engine, and execute the summary query statement through the computing engine to obtain a data analysis result corresponding to the data analysis query statement.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202210246053.6A 2022-03-14 2022-03-14 Data analysis method, device, computer equipment and storage medium Pending CN116795859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210246053.6A CN116795859A (en) 2022-03-14 2022-03-14 Data analysis method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116795859A true CN116795859A (en) 2023-09-22

Family

ID=88038186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210246053.6A Pending CN116795859A (en) 2022-03-14 2022-03-14 Data analysis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116795859A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination