CN117668024A

CN117668024A - SQL statement analysis-based data processing and synchronizing method and system

Info

Publication number: CN117668024A
Application number: CN202311675866.8A
Authority: CN
Inventors: 司马铭骏; 谢恒�; 马克; 李凡; 王嘉华
Original assignee: Shanghai Financial Futures Information Technology Co ltd
Current assignee: Shanghai Financial Futures Information Technology Co ltd
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-03-08

Abstract

The invention discloses a data processing and synchronizing method and system based on SQL statement analysis, which solve the problems that a developer needs to realize the associated logic in the SQL statement once again from a code layer, a large amount of redundant development requirements are generated and the development efficiency is low, shorten the data processing path and greatly reduce the reading and processing time required by data processing. The technical proposal is as follows: acquiring input data, the input data comprising: SQL query statements, primary keys, and elastic search index to be written; parsing the SQL query statement to generate an SQL query relationship tree; judging whether the synchronization is the first synchronization, if so, acquiring the snapshot data of the database in a full quantity, and if not, acquiring the change data of the database through real-time synchronization increment; processing data according to the SQL tree, and updating node data stored in the memory or the cache; output nodes in the SQL tree are written to the elastic search.

Description

SQL statement analysis-based data processing and synchronizing method and system

Technical Field

The invention relates to database technology, in particular to a method and a system for data processing and synchronization based on SQL statement analysis.

Background

For a certain business scenario, there often occurs a situation where it is necessary to obtain associated data from multiple tables, for which MySQL has the following drawbacks: first, the essence of JOIN (JOIN operation is a method of correlating data in two or more tables according to specified conditions) in MySQL is cartesian product, with high time complexity and low efficiency; second, mySQL is not suitable for massive data query, and adding an index greatly increases storage cost; third, mySQL does not have the word segmentation capability in ES, and the capability of fuzzy query and full text query is weak. In order to solve the problems, in actual project development, mySQL is often used as a service database, ES is used as a query database, mySQL data JOIN is formed into a service wide table, and data of the service wide table are synchronized to ES in real time, so that read-write separation is realized, query pressure of the MySQL database is relieved, and complex query of mass data is simultaneously dealt with.

In the process of realizing business broad-table data processing and synchronization, a developer often writes corresponding SQL sentences on a database layer first, and then converts the SQL sentences into logic on a code layer. The following problems exist in this flow: firstly, a developer needs to realize the associated logic in the SQL sentence from the code layer once again, so that a great amount of redundant development requirements are generated, and the development efficiency is low; secondly, in the process of data processing, multiple JOIN queries need to be performed on the database to obtain the latest wide-table data, so that the time consumption of processing and synchronization is increased.

Disclosure of Invention

The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.

The invention aims to solve the problems, and provides a data processing and synchronizing method and system based on SQL statement analysis, which solves the problems that a developer needs to realize the associated logic in the SQL statement again from a code layer, a large amount of redundant development requirements are generated, the development efficiency is low, the data processing path is shortened, and the reading and processing time required by data processing is greatly reduced.

The technical scheme of the invention is as follows: the invention discloses a data processing and synchronizing method based on SQL statement analysis, which comprises the following steps:

step 1: acquiring input data, the input data comprising: SQL query statements, primary keys, and elastic search index to be written;

step 2: parsing the SQL query statement to generate an SQL query relationship tree;

step 3: judging whether the synchronization is the first synchronization, if so, acquiring the snapshot data of the database in a full quantity, and if not, acquiring the change data of the database through real-time synchronization increment;

step 4: processing data according to the SQL tree, and updating node data stored in the memory or the cache;

step 5: output nodes in the SQL tree are written to the elastic search.

According to an embodiment of the method for data processing and synchronization based on SQL statement parsing of the present invention, step 2 further comprises:

step 2-1: analyzing the single-section SQL sentence to obtain an AST grammar analysis tree;

step 2-2: traversing an AST grammar analysis tree, and extracting tables, fields and relations;

step 2-3: and constructing an SQL query relation tree according to the extracted tables, fields and relations.

According to an embodiment of the data processing and synchronization method based on SQL statement parsing of the present invention, in step 2-1, a single-segment SQL statement is parsed by an ANTLR parser and a JavacC parser.

According to an embodiment of the method for data processing and synchronization based on SQL statement parsing of the present invention, step 3 further comprises:

the operation of obtaining the snapshot data of the database in full quantity is to splice SQL sentences according to the table names extracted from the SQL sentences obtained in the step 2, and obtain all the data of the tables in the database at the current moment;

for the operation of incrementally acquiring the database change data, the change is sent to kafka by monitoring the change of the MySQL binlog, the change information of MySQL is read from kafka, and only the data in the table extracted from the SQL statement acquired in the step 2 is reserved.

According to an embodiment of the method for data processing and synchronization based on SQL statement parsing of the present invention, step 4 further comprises:

step 4-1: adding or updating certain root node data in the SQL query relation tree;

step 4-2: updating node data stored in a memory or a cache;

step 4-3: searching all child nodes under the node, wherein each intermediate node is a result generated by JOIN, and for each intermediate node, there are only two father nodes;

step 4-4: searching for each child node to remove another father node of the node, and acquiring the data of the node stored in the memory or the cache;

step 4-5: performing internal connection, left connection or right connection on the data according to the relation between the father nodes stored in the SQL query relation tree to obtain the latest data of the child node, and then jumping to the step 4-2;

step 4-6: and (3) ending the circulation flow of the step (4) until the node does not exist any child node.

The invention also discloses a data processing and synchronizing system based on SQL statement analysis, which comprises:

the data input module is configured to acquire input data, and the input data comprises: SQL query statements, primary keys, and elastic search index to be written;

the query relation tree generation module is configured to analyze the SQL query statement to generate an SQL query relation tree;

the data synchronization module is configured to judge whether the synchronization is the first synchronization, if the synchronization is the first synchronization, the snapshot data of the database is obtained in a full amount, and if the synchronization is not the first synchronization, the change data of the database is obtained through real-time synchronization increment;

the data updating module is configured to process data according to the SQL tree and update node data stored in the memory or the cache;

and the data writing module is configured to write the output nodes in the SQL tree into the elastic search.

According to an embodiment of the system for data processing and synchronization based on SQL statement parsing of the invention, the query relationship tree generation module is further configured to perform the following:

According to an embodiment of the data processing and synchronization system based on SQL statement parsing, in step 2-1, a single-segment SQL statement is parsed by an ANTLR parser and a JavacC parser.

According to an embodiment of the system for SQL statement parsing based data processing and synchronization of the invention, the data synchronization module is further configured to perform the following:

for the operation of obtaining the snapshot data of the database in full quantity, the table names extracted from the SQL sentences obtained in the query relation tree generation module are spliced to obtain all the data of the tables in the database at the current moment;

According to an embodiment of the SQL statement parsing based data processing and synchronization system of the invention, the data update module is further configured to perform the following:

step 4-2: updating node data stored in a memory or a cache;

step 4-6: and ending the circulation flow in the data updating module until the node does not exist in the child nodes.

Compared with the prior art, the invention has the following beneficial effects: the invention starts from the thought of SQL query statement analysis, analyzes SQL query statement into SQL query relation tree, each node in the tree represents data of one table or intermediate result data of two (more) tables join, and node data is stored in memory or buffer. When the change of the table data related to the SQL query statement is monitored, data processing is carried out according to the structure of the query tree, and finally, the processed data is automatically synchronized to the ES. The developer only needs to pay attention to the logic of the SQL query statement, does not need to pay attention to how the code is implemented, and greatly reduces the development cost in data processing and synchronization.

According to the invention, based on the ANTLR4 analyzer, the SQL query relation tree is generated, a developer only needs to verify the correctness of SQL query sentences at a database level, so that a data synchronization program can be automatically generated, low codes and automation are realized in the whole process, and the problems that the developer needs to realize the association logic in the SQL query sentences once again at a code level, a large amount of redundant development requirements are generated, and the development efficiency is low can be solved. In addition, in the data processing stage, the invention utilizes the SQL query relation tree provided in the method to store the source data and the intermediate data in the memory or the cache. Meanwhile, SQL query relation tree is utilized for data processing, so that a data processing path is shortened, and reading and processing time required by data processing is greatly reduced.

Drawings

The above features and advantages of the present invention will be better understood after reading the detailed description of embodiments of the present disclosure in conjunction with the following drawings. In the drawings, the components are not necessarily to scale and components having similar related features or characteristics may have the same or similar reference numerals.

FIG. 1 illustrates a flow chart of one embodiment of a method of SQL statement parsing based data processing and synchronization of the present invention.

FIG. 2 illustrates a detailed flow chart of the generation of an SQL query relational tree by parsing an SQL query statement in the method steps illustrated in FIG. 1.

FIG. 3 shows an example diagram of a SQL query relationship tree in the method steps shown in FIG. 1.

FIG. 4 shows a detailed flow chart of processing data according to the SQL query relational tree in the method steps shown in FIG. 1.

FIG. 5 illustrates a schematic diagram of one embodiment of a system for SQL statement parsing based data processing and synchronization of the present invention.

Detailed Description

The invention is described in detail below with reference to the drawings and the specific embodiments. It is noted that the aspects described below in connection with the drawings and the specific embodiments are merely exemplary and should not be construed as limiting the scope of the invention in any way.

FIG. 1 illustrates a flow of one embodiment of a method of SQL statement parsing based data processing and synchronization of the present invention. Referring to fig. 1, the implementation steps of the method of the present embodiment are described in detail below.

Step 1: acquiring input data, the input data comprising: SQL query statements, primary keys, and an elastic search index that needs to be written. The details are as follows.

The details are as follows:

the first part inputs a single-segment SQL sentence, and does not support multiple segments of SQL sentences or SQL scripts.

The second part is a primary key, and if the primary key is not input, the additional synchronization is performed only, and the update synchronization is not performed.

The third part inputs the elastiscearch index that eventually needs to be written.

Step 2: the SQL query statement is parsed to generate an SQL query relationship tree.

The refinement step of step 2 is shown in fig. 2 and described in detail below.

Step 2-1: based on the ANTLR parser (e.g., the ANTLR4 parser, i.e., the fourth version of ANTLR), the single-segment SQL statement is parsed to obtain an AST syntax parse tree.

ANTLR (collectively: ANother Tool for Language Recognition) is a language recognition tool written in Java language, and uses a top-down recursive descent analysis method based on LL (x) analysis. The grammar analyzer, the tree analyzer and other modules automatically construct the custom language by inputting grammar description files. ANTLR uses context-free grammar description language, and grammar definition uses EBNF-like approaches.

The abstract syntax tree (Abstract Syntax Tree, AST for short) is a representation of the abstract syntax structure of the source code in the computer memory. Each node in an AST represents a syntactic construct in the source code, such as a variable declaration, expression, function call, control structure, etc. The root node of the tree typically represents the entire source code file, while the child nodes represent specific syntax elements and their relationships. For example, an AST node of a function declaration may contain multiple child nodes, such as function names, parameter lists, and function volumes.

In this step, the ANTLR4 parser can be replaced with a framework in which JavaCC or other underlying layers are implemented through ANTLR4 or JavaCC.

Step 2-2: traversing AST grammar analysis tree, extracting table, field and relation.

In this step, using a Visitor mode in the ANTLR, inheriting the SQL basebaselocator class in the ANTLR (the SQL baselocator class is a basic class in the ANTLR for accessing the SQL parse tree), extracting information of nodes in the AST tree, including: all table names are acquired by rewriting the visitTableidenfier method (the visitTableidenfier method is a visitor mode method for accessing and processing the tableideffier node in the SQL syntax analysis tree), all column names are acquired by rewriting the visitColumnReference method (the visitColumnReference method is a visitor mode method for accessing and processing the faunReference node in the SQL syntax analysis tree), all column names are acquired by rewriting the visitJoinReaction (the visitJoinReaction method is a visitor mode method for accessing and processing the JOIN node in the SQL syntax analysis tree), the visitJoinType (the visitJoinType method is a visitor mode method for accessing and processing the JOIN node in the SQL syntax analysis tree, and the visitJoinCritice method is a JOIN node in the SQL syntax analysis tree and is a JOIN node method for accessing and processing the syntax tree and the relation between the table types in the SQL syntax analysis tree and the node access condition.

The concrete principle of constructing the SQL query relation tree is as follows: all tables are root nodes of the SQL query relation tree, and information of corresponding tables in the database is stored; for each JOIN relationship, an intermediate node is added in the tree to store the intermediate table information generated by the JOIN, wherein the intermediate node contains JOIN conditions for use in subsequent data processing.

In connection with fig. 3, the following are illustrative:

the SQL query relationship tree example graph shows one SQL statement as:

SELECT*FROM A

JOIN B ON A.id＝B.aid

LEFT JOIN C ON B.id＝C.bid

the parsed SQL query relationship tree, A, B, C is the parsed table name, which is the root node of the query relationship tree. The SQL statement resolves a relationship of AJOIN B and (AJOIN B) LEFT JOIN C, respectively, so there are two intermediate nodes in the tree.

Step 3: judging whether the synchronization is the first synchronization, if so, obtaining the snapshot data of the database in full quantity, and if not, obtaining the change data of the database through real-time synchronization increment.

And 2, for the operation of obtaining the snapshot data of the database in a full quantity, splicing the SQL sentences according to the table names extracted from the SQL sentences obtained in the step 2, and obtaining all the data of the tables in the database at the current moment.

For the incremental database change data acquisition operation, the change of MySQL is sent to Kafka (Kafka is an open source stream processing platform developed by Apache software foundation and written by Scala and Java, which is a high throughput distributed publish-subscribe messaging system) by monitoring MySQL binlog (binary log, which is a log file specific to MySQL database and is used for recording all update operations performed on MySQL database), and the change is read from Kafka, so that only the data in the table extracted from the SQL statement acquired in the previous step is kept.

Step 4: and processing the data according to the SQL tree, and updating the node data stored in the memory or the cache.

The refinement step of step 4 is shown in fig. 4 and described in detail below.

Step 4-1: and adding or updating certain root node data in the SQL query relation tree. For full volume updates, multiple additions to the root node may be considered.

Step 4-2: updating the node data stored in the memory or the cache.

For example, using JsonNode as the structure of all data, map < String tableName > List < JsonNode > tableDatas > is used for storage.

Memory or cache storage media include, but are not limited to, file systems, rediss, lockdb, memory.

Step 4-3: all child nodes under the node are found, where each intermediate node is a JOIN-generated result, and for each intermediate node there are and only two parent nodes.

Step 4-4: searching for another father node of each child node except the node, and obtaining the data of the other father node stored in the memory or the cache.

Step 4-5: and 4-2, performing INNER JOIN (INNER connection), LFET JOIN (left connection) or RIGHT JOIN (RIGHT connection) on the data according to the relation between the father nodes stored in the SQL query relation tree to obtain the latest data of the child node, and then jumping to a step 4-2 for updating the memory or caching the stored node data.

For INNER JOIN, searching another table according to parameters of the change table in the condition, if data meeting the condition exists, writing all field in the two tables into Jsonnode, wherein key is field name, and value is field value.

If the change table is LEFT table of LEFT JOIN or RIGHT table of RIGHT JOIN, searching the other table according to parameters of the change table in the condition, if the condition is met, writing all field in the two tables into Jsonnode, wherein key is field name, and value is field value. If not, writing all field in the change table into Jsonnode, wherein key is field name, and value is field value. The field of another table is written in Jsonnode, key is field name, and value is null.

If the change table is the RIGHT table of LEFT JOIN or the LEFT table of RIGHT JOIN, the processing mode is consistent with that of INNER JOIN.

Step 5: output nodes in the SQL tree are written to the elastic search.

The elastiscearch is a distributed and scalable real-time search and analysis engine, and a search engine based on the full-text search engine Apache Lucene (TM). An elastic search document is a basic unit of information that can be indexed, and the document is expressed in JSON (JavaScript Object Notation) format, which is a very popular internet data exchange format. The data structure of the output node is Jsonnode, jsonnode is converted into JsonString, jsonString which is JSON format, and ElasticSearch, elasticSearch can be directly written to map the document to Mapping of the index automatically.

FIG. 5 illustrates the principles of one embodiment of a system of data processing and synchronization based on SQL statement parsing of the present invention. Referring to fig. 5, the system of the present embodiment includes: the system comprises a data input module, a query relation tree generation module, a data synchronization module, a data updating module and a data writing module.

The data input module is configured to acquire input data, and the input data comprises: SQL query statements, primary keys, and an elastic search index that needs to be written.

The specific processing of the data input module is the same as that of step 1 of the above method embodiment, and will not be described herein.

And the query relation tree generating module is configured to parse the SQL query statement to generate an SQL query relation tree.

The query relationship tree generation module is further configured to perform the following:

step 2-1: analyzing the single-segment SQL sentence to obtain an AST grammar analysis tree, wherein the single-segment SQL sentence is analyzed by an ANTLR analyzer and a JavacC analyzer;

The specific processing of the query relation tree generating module is the same as the step 2 of the method embodiment, and is not described herein.

And the data synchronization module is configured to judge whether the synchronization is the first synchronization, if the synchronization is the first synchronization, the database snapshot data is obtained in a full quantity, and if the synchronization is not the first synchronization, the database change data is obtained through real-time synchronization increment.

The data synchronization module is further configured to perform the following:

The specific processing of the data synchronization module is the same as that of step 3 of the above method embodiment, and will not be described herein.

And the data updating module is configured to process data according to the SQL tree and update node data stored in the memory or the cache.

The data update module is further configured to perform the following:

step 4-2: updating node data stored in a memory or a cache;

The specific processing of the data updating module is the same as that of step 4 of the above method embodiment, and will not be described herein.

The specific processing of the data writing module is the same as that of step 5 of the above method embodiment, and will not be described herein.

While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood and appreciated by those skilled in the art.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disk) as used herein include Compact Disc (CD), laser disc, optical disc, digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disk) usually reproduce data magnetically, while discs (disk) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for data processing and synchronization based on SQL statement parsing, the method comprising:

step 5: output nodes in the SQL tree are written to the elastic search.

2. The method of data processing and synchronization based on SQL statement parsing of claim 1, wherein step 2 further comprises:

3. The method for processing and synchronizing data based on SQL statement parsing according to claim 2, wherein in step 2-1, the single-segment SQL statement is parsed by an ANTLR parser or a JavacC parser.

4. The method for data processing and synchronization based on SQL statement parsing according to claim 1, wherein step 3 further comprises:

5. The method of data processing and synchronization based on SQL statement parsing of claim 1, wherein step 4 further comprises:

step 4-2: updating node data stored in a memory or a cache;

6. A system for data processing and synchronization based on SQL statement parsing, the system comprising:

7. The SQL statement parsing based data processing and synchronization system of claim 6, wherein the query relationship tree generation module is further configured to perform the following:

8. The system for processing and synchronizing data based on SQL statement parsing according to claim 7, wherein in step 2-1, the single-segment SQL statement is parsed by an ANTLR parser or a JavacC parser.

9. The SQL statement parsing based data processing and synchronization system of claim 6, wherein the data synchronization module is further configured to perform the following:

10. The SQL statement parsing based data processing and synchronization system of claim 6, wherein the data update module is further configured to perform the following:

step 4-2: updating node data stored in a memory or a cache;