CN110909016B

CN110909016B - Repeated association detection method, device, equipment and storage medium based on database

Info

Publication number: CN110909016B
Application number: CN201910967690.0A
Authority: CN
Inventors: 林堋鑫
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2023-06-16
Anticipated expiration: 2039-10-12
Also published as: CN110909016A

Abstract

The invention relates to the technical field of big data, and discloses a repeated association detection method, device, equipment and storage medium based on a database, which are used for counting repeated association source tables in SQL (structured query language) by splitting sentences in SQL scripts, so that the detection efficiency is improved. The method comprises the following steps: searching a code base of a database, wherein the code base comprises all structured query language SQL scripts; analyzing each SQL script to obtain a plurality of SQL paragraphs; preprocessing a plurality of SQL paragraphs to generate a plurality of processed SQL sentences; inputting a plurality of processed SQL sentences into preset software for analysis, and generating at least one pair of source table names which are related in pairs; labeling script catalogues of the database according to at least one pair of source table names associated with each other, and generating labeled script catalogues; and repeatedly detecting the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result.

Description

Repeated association detection method, device, equipment and storage medium based on database

Technical Field

The present invention relates to the field of big data technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting repeated association based on a database.

Background

With the rapid development of the internet, more and more data are processed by computer equipment, and databases are widely used. In database systems, multidimensional data analysis is often required, and the data sources are often multiple tables in relational databases, so multiple table associations are often required, and multiple table join (join) in structured query language (structured query language, SQL) is a resource-consuming operation because it involves network Input/Output (Output) and full table scanning.

Because in large database systems, different data processing scripts are often developed by different developers, there may be several repeatedly performed multi-table related SQL statements in the large number of scripts developed. Therefore, for some tables in common use, it is necessary to perform association in advance, and then let the subsequent SQL script use the associated intermediate table, so as to reduce repeated consumption of resources.

The creation of the middle broad table in the traditional data warehouse generally requires a data analysis personnel to manually perform statistical detection according to personal experience or on a large number of existing SQL scripts to find out which tables can be repeatedly associated, so that the manual analysis statistics has low detection efficiency and long time under the condition of more scripts, and partial scripts can be omitted.

Disclosure of Invention

The invention provides a repeated association detection method, device, equipment and storage medium based on a database, which are used for counting repeated association source tables in SQL (structured query language) by splitting SQL sentences in SQL scripts of the structured query language, so that the detection efficiency is improved.

A first aspect of an embodiment of the present invention provides a method for detecting duplicate association based on a database, including: searching a code base of a database, wherein the code base comprises all structured query language SQL scripts; analyzing each SQL script to obtain a plurality of SQL paragraphs, wherein each SQL script comprises at least one SQL paragraph; preprocessing the SQL paragraphs to generate a plurality of processed SQL sentences, wherein each SQL paragraph contains at least one SQL sentence; inputting the plurality of processed SQL sentences into preset software for analysis, and generating at least one pair of source table names which are related in pairs; labeling script catalogues of the database according to the at least one pair of source table names associated with each other, and generating labeled script catalogues; and repeatedly detecting the source table names related to each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result.

Optionally, in a first implementation manner of the first aspect of the embodiment of the present invention, the parsing each SQL script to obtain a plurality of SQL paragraphs, where each SQL script includes at least one SQL paragraph includes: determining a target SQL script; inquiring a plurality of preset SQL keywords in the target SQL script according to preset line numbers, wherein the preset line numbers are used for indicating the line numbers of codes needing to be searched in the target SQL script; when the preset line number contains the preset SQL keywords, generating a plurality of SQL paragraphs, wherein each SQL paragraph at least comprises one preset SQL keyword.

Optionally, in a second implementation manner of the first aspect of the embodiment of the present invention, preprocessing the plurality of SQL paragraphs to generate a plurality of processed SQL statements, where each SQL paragraph includes at least one SQL statement includes: determining an SQL keyword part and a non-SQL keyword part of each SQL paragraph in the plurality of SQL paragraphs; determining a plurality of corresponding to-be-selected semicolons in each SQL paragraph; removing illegal semicolons from the corresponding multiple semicolons to be selected, wherein the illegal semicolons are semicolons in a non-SQL keyword part in the corresponding SQL script, and multiple target semicolons corresponding to each SQL paragraph are obtained; dividing the corresponding SQL paragraphs according to the target marks corresponding to each SQL paragraph to generate a plurality of processed SQL sentences, wherein the plurality of processed SQL sentences comprise the processed SQL sentences of each SQL paragraph.

Optionally, in a third implementation manner of the first aspect of the embodiment of the present invention, inputting the plurality of processed SQL statements into preset software for analysis, generating at least one pair of associated source table names includes: inputting a plurality of processed SQL sentences into preset software to generate an abstract syntax tree; traversing the abstract syntax tree to generate a plurality of query blocks; traversing the query blocks, and judging whether each query block contains a connection node or not; if the plurality of query blocks contain the connection nodes, extracting source table names used in the connection nodes; and generating an analysis result according to the source table names, wherein the analysis result comprises at least one pair of associated source table names.

Optionally, in a fourth implementation manner of the first aspect of the embodiment of the present invention, the inputting the plurality of processed SQL statements into the preset software, generating the abstract syntax tree includes: uniformly setting keywords in the plurality of processed SQL sentences into a uppercase format or a lowercase format; carrying out grammar analysis on SQL sentences with uniform formats through preset software to obtain analyzed codes; adding special words into the parsed codes based on preset rules, wherein the special words are used for indicating specific grammar; an abstract syntax tree is generated.

Optionally, in a fifth implementation manner of the first aspect of the embodiment of the present invention, the performing repeated association detection on the source table names associated with each other in all preset SQL scripts according to the noted script directory, and obtaining a detection result includes: determining a script detection sequence according to the annotated script catalogue; determining source table names associated with each other in a first target script based on the script detection sequence, and determining the logarithm of the source table names associated with each other in the first target script, wherein the first target script is a script sequenced first in the script detection sequence; traversing other scripts based on the script detection sequence to obtain detection results, wherein the detection results comprise source table names associated with each other in pairs and logarithms of the source table names associated with each other in all preset SQL scripts.

Optionally, in a sixth implementation manner of the first aspect of the embodiment of the present invention, after performing repeated association detection on the source table names associated with each other in all preset SQL scripts according to the noted script directory, and obtaining a detection result, the method further includes: determining the association times of each associated source table name according to the detection result; sorting according to the association times of each associated source table name from big to small to obtain an association sequence table; and calling a preset intermediate table to replace a target source table name pair, wherein the target source table name is a preset number of source table names which are ranked at the front in the association sequence table.

A second aspect of an embodiment of the present invention provides a repeated association detection device based on a database, including: the searching unit is used for searching a code base of the database, wherein the code base comprises all structured query language SQL scripts; the analyzing unit is used for analyzing each SQL script to obtain a plurality of SQL paragraphs, and each SQL script comprises at least one SQL paragraph; the preprocessing unit is used for preprocessing the SQL paragraphs to generate a plurality of processed SQL sentences, and each SQL paragraph contains at least one SQL sentence; the analysis unit is used for inputting the plurality of processed SQL sentences into preset software for analysis and generating at least one pair of source table names which are related in pairs; the marking unit is used for marking the script catalogue of the database according to the at least one pair of source table names which are related in pairs, and generating the marked script catalogue; and the detection unit is used for carrying out repeated association detection on the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result.

Optionally, in a first implementation manner of the second aspect of the embodiment of the present invention, the parsing unit is specifically configured to: determining a target SQL script; inquiring a plurality of preset SQL keywords in the target SQL script according to preset line numbers, wherein the preset line numbers are used for indicating the line numbers of codes needing to be searched in the target SQL script; when the preset line number contains the preset SQL keywords, generating a plurality of SQL paragraphs, wherein each SQL paragraph at least comprises one preset SQL keyword.

Optionally, in a second implementation manner of the second aspect of the embodiment of the present invention, the preprocessing unit is specifically configured to: determining an SQL keyword part and a non-SQL keyword part of each SQL paragraph in the plurality of SQL paragraphs; determining a plurality of corresponding to-be-selected semicolons in each SQL paragraph; removing illegal semicolons from the corresponding multiple semicolons to be selected, wherein the illegal semicolons are semicolons in a non-SQL keyword part in the corresponding SQL script, and multiple target semicolons corresponding to each SQL paragraph are obtained; dividing the corresponding SQL paragraphs according to the target marks corresponding to each SQL paragraph to generate a plurality of processed SQL sentences, wherein the plurality of processed SQL sentences comprise the processed SQL sentences of each SQL paragraph.

Optionally, in a third implementation manner of the second aspect of the embodiment of the present invention, the analysis unit includes: the first generation module is used for inputting a plurality of processed SQL sentences into preset software to generate an abstract syntax tree; the second generation module is used for traversing the abstract syntax tree and generating a plurality of query blocks; the traversal judging module is used for traversing the query blocks and judging whether each query block contains a connection node or not; the extraction module is used for extracting source table names used in the connection nodes if the plurality of query blocks contain the connection nodes; and the third generation module is used for generating an analysis result according to the source table names, wherein the analysis result comprises at least one pair of associated source table names.

Optionally, in a fourth implementation manner of the second aspect of the embodiment of the present invention, the first generating module is specifically configured to: uniformly setting keywords in the plurality of processed SQL sentences into a uppercase format or a lowercase format; carrying out grammar analysis on SQL sentences with uniform formats through preset software to obtain analyzed codes; adding special words into the parsed codes based on preset rules, wherein the special words are used for indicating specific grammar; an abstract syntax tree is generated.

Optionally, in a fifth implementation manner of the second aspect of the embodiment of the present invention, the detection unit is specifically configured to: determining a script detection sequence according to the annotated script catalogue; determining source table names associated with each other in a first target script based on the script detection sequence, and determining the logarithm of the source table names associated with each other in the first target script, wherein the first target script is a script sequenced first in the script detection sequence; traversing other scripts based on the script detection sequence to obtain detection results, wherein the detection results comprise source table names associated with each other in pairs and logarithms of the source table names associated with each other in all preset SQL scripts.

Optionally, in a sixth implementation manner of the second aspect of the embodiment of the present invention, the database-based duplicate association detection apparatus further includes: a determining unit, configured to determine the association times of each associated source table name according to the detection result; the sorting unit is used for sorting according to the association times of each associated source table name from big to small to obtain an association sequence table; and the replacing unit is used for calling a preset intermediate table to replace the target source table name pairs, wherein the target source table name is a preset number of source table names which are ranked in the association sequence table and are in front.

A third aspect of the embodiments of the present invention provides a database-based duplicate association detection apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the database-based duplicate association detection method according to any one of the foregoing embodiments when executing the computer program.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, including instructions which, when executed on a computer, cause the computer to perform the steps of the database-based duplicate association detection method according to any of the embodiments described above.

In the technical scheme provided by the embodiment of the invention, a code base of a database is searched, wherein the code base comprises all structured query language SQL scripts; analyzing each SQL script to obtain a plurality of SQL paragraphs, wherein each SQL script comprises at least one SQL paragraph; preprocessing a plurality of SQL paragraphs to generate a plurality of processed SQL sentences, wherein each SQL paragraph comprises at least one SQL sentence; inputting a plurality of processed SQL sentences into preset software for analysis, and generating at least one pair of source table names which are related in pairs; labeling script catalogues of the database according to at least one pair of source table names associated with each other, and generating labeled script catalogues; and repeatedly detecting the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result. According to the embodiment of the invention, the SQL sentences in the structured query language SQL script are split, and the repeatedly associated source tables in the SQL are counted, so that the detection efficiency is improved.

Drawings

FIG. 1 is a schematic diagram of an embodiment of a database-based duplicate association detection method according to an embodiment of the invention;

FIG. 2 is a schematic diagram of another embodiment of a database-based duplicate association detection method according to an embodiment of the invention;

FIG. 3 is a schematic diagram of an embodiment of a database-based duplicate association detection apparatus according to an embodiment of the invention;

FIG. 4 is a schematic diagram of another embodiment of a database-based duplicate association detection apparatus according to an embodiment of the invention;

fig. 5 is a schematic diagram of an embodiment of a database-based duplicate association detection apparatus according to an embodiment of the invention.

Detailed Description

In order to enable those skilled in the art to better understand the present invention, embodiments of the present invention will be described below with reference to the accompanying drawings.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

Referring to fig. 1, a flowchart of a database-based duplicate association detection method according to an embodiment of the present invention specifically includes:

101. searching a code base of the database, wherein the code base comprises all structured query language SQL scripts.

The server searches a code base of the database, wherein the code base comprises all structured query language SQL scripts.

When there are multiple databases, the server needs to specify the database name first, for example, database query generally needs to specify the database name, for example, select from my_database. If a default database, e.g., use my_database, is specified prior to executing the statement; then it can be written directly as select from my_table; the mydatabase will be used by default; it is necessary to determine which of the current default databases is before parsing the statement. When a default database exists, the server searches the default database.

It can be understood that the execution subject of the present invention may be a repeated association detection device based on a database, and may also be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a server as an execution main body as an example.

102. And analyzing each SQL script to obtain a plurality of SQL paragraphs, wherein each SQL script comprises at least one SQL paragraph.

The server analyzes each SQL script to obtain a plurality of SQL paragraphs, and each SQL script comprises at least one SQL paragraph. Specifically, the server determines a target SQL script; the server queries a plurality of preset SQL keywords in the target SQL script according to the preset line numbers, wherein the preset line numbers are used for indicating the number of code lines needing to be searched in the target SQL script; the server generates a plurality of SQL paragraphs, each SQL paragraph including at least one preset SQL keyword.

103. Preprocessing a plurality of SQL paragraphs to generate a plurality of processed SQL sentences, wherein each SQL paragraph contains at least one SQL sentence.

The server preprocesses a plurality of SQL paragraphs, and generates a plurality of processed SQL sentences, wherein each SQL paragraph contains at least one SQL sentence. Specifically, the server determines an SQL keyword portion and a non-SQL keyword portion of each SQL paragraph of the plurality of SQL paragraphs; the server determines a plurality of corresponding to-be-selected semicolons in each SQL paragraph; the server eliminates illegal semicolons in a plurality of corresponding semicolons to be selected, wherein the illegal semicolons are semicolons in a non-SQL keyword part in a corresponding SQL script, and a plurality of target semicolons corresponding to each SQL paragraph are obtained; the server divides the corresponding SQL paragraphs according to a plurality of target semicolons corresponding to each SQL paragraph, and generates a plurality of processed SQL sentences, wherein the plurality of processed SQL sentences comprise the processed SQL sentences of each SQL paragraph.

For example, the server first reads the text content of the whole SQL paragraph, and uses' first because there may be multiple SQL paragraphs in each SQL script file; ' dividing the whole sentence to obtain each section of SQL, and eliminating the semicolons of the representative character strings which are bracketed by quotation marks; if it is a use database statement, it needs to be identified as the current default database. Here, the marks of the keywords of the non-SQL sentences are removed, so that wrong segmentation sentences, such as select', are avoided; "from my_table; select "," from my_table; only two sentences will be split.

104. And inputting the plurality of processed SQL sentences into preset software for analysis, and generating at least one pair of source table names which are related in pairs.

The server inputs a plurality of processed SQL sentences into preset software for analysis, and at least one pair of source table names which are related pairwise are generated. Specifically, the server inputs a plurality of processed SQL sentences into preset software to generate an abstract syntax tree (abstract syntax tree, AST); traversing the abstract syntax tree by the server to generate a plurality of query blocks (QueryBlocks); traversing the query blocks by the server, and judging whether each query block contains a connection (join) node or not; if the plurality of query blocks contain the connection node, the server extracts the source table name used in the connection node; the server generates an analysis result according to the source table names, the analysis result including at least one pair of associated source table names.

It should be noted that the preset software needs to be determined according to the implementation of the database, and if different database software is used, the usage may be different. Hive is, for example, "Antlr" as used. Antlr is a tool for language recognition that can be used to construct domain languages. The Antlr is not described in detail, only one grammar file is required to be written for constructing a specific language by using the Antlr, and the lexical analysis, the grammar analysis, the semantic analysis and the intermediate code generation processes are completed by the Antlr.

The server inputs a plurality of processed SQL sentences into preset software, and the process of generating the abstract syntax tree comprises the following steps: the server sets the keywords in the plurality of processed SQL sentences into a uppercase format or a lowercase format in a unified way; the server parses the SQL statement with the unified format through preset software to obtain parsed codes; the server adds special words into the parsed codes based on preset rules, wherein the special words are used for indicating specific grammar; the server generates an abstract syntax tree. After the from clause of the memory sub-query is expanded, each table generates a TOK_TABREF node, and the Join condition generates a "=" node. Other SQL parts are similar and will not be described in detail. For example, when a section of SQL statement is: select State selectClause fromClause whereClausegroupByClausehavingClauseorderByClause? clusteribute byclausesortbyclasselimit clause?

For the grammar rules of SelectStatement in Hive SQL, selectStatement contains clauses of select, from, where, group, have, orderby, etc. The code is rewritten according to a preset rule, and a plurality of special words are added to mark a specific grammar after the rewriting, for example, TOK_QUERY marks a QUERY block, and the specific code is as follows:

^(TOK_QUERY fromClause^(TOK_INSERT^(TOK_DESTINATION^(TOK_DIR TOK_TMP_FILE))selectClause whereClausegroupByClausehavingClauseorderByClause？clusterByClausedistributeByClausesortByClauselimitClause？))。

it can be understood that, because the structure of the abstract syntax tree is complex, it is inconvenient to count the source tables used by Join, so that the abstract syntax tree is generated into the query block QueryBlock, queryBlock which is a most basic component unit of the SQL, and the query block includes three parts: input source, calculation process and output. In short, a query block is a sub-query block, the process of generating the sub-query block by using an abstract syntax tree is a recursive process, traversing AST in advance, encountering different tree nodes (Token nodes), storing the tree nodes into corresponding attributes, and using a source table in a TOK_FROM node (the part of the work is realized by open source codes in hive).

105. And marking the script catalogs of the database according to at least one pair of source table names which are related in pairs, and generating the marked script catalogs.

And the server marks the script catalogs of the database according to at least one pair of the source table names which are related in pairs, and generates the marked script catalogs. It is necessary to mark out the script path of each source table name and output the path of each script. For example, the association may be: source table a join source table B, to count the use of both tables together, then source table a is output. If the source list A is also a source list B, a source list A is output, and a source list B is output; a source table A, a source table C; source table B, source table C.

Note that the Join type may include various types, for example, inner Join, left Outer Join, right Outer Join, full Outer Join, semi Join, anti Join, and the specific examples are not limited herein.

It will be appreciated that because of the large number of scripts involved, e.g., there may be many scripts associated with using both tables a, B at the same time, it may be necessary to mark by different script paths, e.g., mark out script paths, output paths for each script, and then differentiate and reform by the developer based on the paths.

106. And repeatedly detecting the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result.

And the server repeatedly performs association detection on the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and a detection result is obtained. Specifically, the server determines a script detection sequence according to the annotated script catalogue; the server determines the source table names associated with each other in the first target script based on the script detection sequence, determines the logarithm of the source table names associated with each other in the first target script, and the first target script is the first script sequenced in the script detection sequence; the server traverses other scripts based on the script detection sequence to obtain detection results, wherein the detection results comprise source table names associated with each other in pairs and logarithms of the source table names associated with each other in all preset SQL scripts.

According to the embodiment of the invention, the SQL sentences in the structured query language SQL script are split, and the repeatedly associated source tables in the SQL are counted, so that the detection efficiency is improved.

Referring to fig. 2, another flowchart of a database-based duplicate association detection method according to an embodiment of the present invention specifically includes:

201. searching a code base of the database, wherein the code base comprises all structured query language SQL scripts.

202. And analyzing each SQL script to obtain a plurality of SQL paragraphs, wherein each SQL script comprises at least one SQL paragraph.

203. Preprocessing a plurality of SQL paragraphs to generate a plurality of processed SQL sentences, wherein each SQL paragraph contains at least one SQL sentence.

204. And inputting the plurality of processed SQL sentences into preset software for analysis, and generating at least one pair of source table names which are related in pairs.

205. And marking the script catalogs of the database according to at least one pair of source table names which are related in pairs, and generating the marked script catalogs.

206. And repeatedly detecting the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result.

207. And determining the association times of each associated source table name according to the detection result.

And the server determines the association times of each associated source table name according to the detection result. For example, if a pair of associated source table names A and B occurs, then the number of associations of source table names A and B is increased by 1.

208. And sorting according to the association times of each associated source table name from big to small to obtain an association sequence table.

And the server sorts the table names according to the association times of each associated source table according to the order from big to small to obtain an association sequence table. The source table names with the highest association times are ranked in the first position, and the source table names are ranked by the same way.

209. And calling a preset intermediate table to replace the table names of the target source tables, wherein the table names of the target source tables are the preset number of table names of the source tables which are ranked at the front in the association sequence table.

The server calls a preset intermediate table to replace the target source table name pairs, wherein the target source table name is a preset number of source table names which are ranked in the association sequence table. The preset intermediate table is designed by database analysts according to the detection result, so that the database analysts can use time for analyzing the merging scheme of the intermediate table, a great amount of manpower is reduced to check whether repeated association exists in the script manually, and the working efficiency is improved.

According to the embodiment of the invention, the SQL sentences in the structured query language SQL script are split, the repeated associated source tables in the SQL are counted, and the repeated associated source table names are replaced by the intermediate table, so that the detection efficiency and the association efficiency are improved, and the running efficiency of the database is further improved.

The method for detecting repeated association based on the database in the embodiment of the present invention is described above, and the device for detecting repeated association based on the database in the embodiment of the present invention is described below, referring to fig. 3, one embodiment of the device for detecting repeated association based on the database in the embodiment of the present invention includes:

a searching unit 301, configured to search a code base of a database, where the code base includes all structured query language SQL scripts;

the parsing unit 302 is configured to parse each SQL script to obtain a plurality of SQL paragraphs, where each SQL script includes at least one SQL paragraph;

a preprocessing unit 303, configured to preprocess the plurality of SQL paragraphs, and generate a plurality of processed SQL statements, where each SQL paragraph includes at least one SQL statement;

the analysis unit 304 is configured to input the plurality of processed SQL statements into preset software for analysis, and generate at least one pair of source table names associated with each other;

A labeling unit 305, configured to label the script directory of the database according to the at least one pair of source table names associated with each other, and generate a labeled script directory;

and the detection unit 306 is configured to perform repeated association detection on the source table names associated with each other in all preset SQL scripts according to the annotated script directory, and obtain a detection result.

Referring to fig. 4, another embodiment of the database-based duplicate association detection apparatus according to the present invention includes:

Optionally, the parsing unit 302 is specifically configured to:

determining a target SQL script; inquiring a plurality of preset SQL keywords in the target SQL script according to preset line numbers, wherein the preset line numbers are used for indicating the line numbers of codes needing to be searched in the target SQL script; when the preset line number contains the preset SQL keywords, generating a plurality of SQL paragraphs, wherein each SQL paragraph at least comprises one preset SQL keyword.

Optionally, the preprocessing unit 303 is specifically configured to:

determining an SQL keyword part and a non-SQL keyword part of each SQL paragraph in the plurality of SQL paragraphs; determining a plurality of corresponding to-be-selected semicolons in each SQL paragraph; removing illegal semicolons from the corresponding multiple semicolons to be selected, wherein the illegal semicolons are semicolons in a non-SQL keyword part in the corresponding SQL script, and multiple target semicolons corresponding to each SQL paragraph are obtained; dividing the corresponding SQL paragraphs according to the target marks corresponding to each SQL paragraph to generate a plurality of processed SQL sentences, wherein the plurality of processed SQL sentences comprise the processed SQL sentences of each SQL paragraph.

Optionally, the analysis unit 304 includes:

the first generation module 3041 is configured to input a plurality of processed SQL statements into preset software to generate an abstract syntax tree;

a second generating module 3042, configured to traverse the abstract syntax tree to generate a plurality of query blocks;

the traversal judging module 3043 is configured to traverse the query blocks and judge whether each query block includes a connection node;

the extracting module 3044 is configured to extract a source table name used in the connection node if the plurality of query blocks include the connection node;

and a third generating module 3045, configured to generate an analysis result according to the source table names, where the analysis result includes at least one pair of associated source table names.

Optionally, the first generating module 3041 is specifically configured to:

uniformly setting keywords in the plurality of processed SQL sentences into a uppercase format or a lowercase format; carrying out grammar analysis on SQL sentences with uniform formats through preset software to obtain analyzed codes; adding special words into the parsed codes based on preset rules, wherein the special words are used for indicating specific grammar; an abstract syntax tree is generated.

Optionally, the detecting unit 306 is specifically configured to:

determining a script detection sequence according to the annotated script catalogue; determining source table names associated with each other in a first target script based on the script detection sequence, and determining the logarithm of the source table names associated with each other in the first target script, wherein the first target script is a script sequenced first in the script detection sequence; traversing other scripts based on the script detection sequence to obtain detection results, wherein the detection results comprise source table names associated with each other in pairs and logarithms of the source table names associated with each other in all preset SQL scripts.

Optionally, the repeated association detection device based on the database further comprises:

a determining unit 307, configured to determine, according to the detection result, the association number of each associated source table name;

a sorting unit 308, configured to sort according to the number of association times of each associated source table name from large to small, to obtain an association sequence table;

and the replacing unit 309 is configured to call a preset intermediate table to replace a target source table name pair, where the target source table name is a preset number of source table names in the association sequence table, where the preset number of source table names are ranked at the front.

The database-based duplicate association detection apparatus in the embodiment of the present invention is described in detail from the point of view of the modularized functional entity in fig. 3 to 4 above, and the database-based duplicate association detection device in the embodiment of the present invention is described in detail from the point of view of hardware processing below.

Fig. 5 is a schematic structural diagram of a database-based duplicate association detection apparatus according to an embodiment of the invention, where the database-based duplicate association detection apparatus 500 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 501 (e.g., one or more processors) and a memory 509, and one or more storage media 508 (e.g., one or more mass storage devices) storing application programs 507 or data 506. Wherein the memory 509 and storage medium 508 may be transitory or persistent storage. The program stored on the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations in a database-based duplicate association detection device. Still further, the processor 501 may be configured to communicate with the storage medium 508 and execute a series of instruction operations in the storage medium 508 on the database-based duplicate association detection device 500.

The database-based duplicate association detection apparatus 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems 505, such as Windows Serve, mac OSX, unix, linux, freeBSD, and the like. It will be appreciated by those skilled in the art that the database-based duplicate association detection apparatus structure illustrated in fig. 5 does not constitute a limitation of the database-based duplicate association detection apparatus, and may include more or fewer components than illustrated, or may combine certain components, or may be a different arrangement of components. The processor 501 may perform the functions of the searching unit 301, the parsing unit 302, the preprocessing unit 303, the analyzing unit 304, the labeling unit 305, the detecting unit 306, the determining unit 307, the sorting unit 308, and the replacing unit 309 in the above-described embodiments.

The following describes the respective constituent elements of the database-based duplicate association detection apparatus in detail with reference to fig. 5:

the processor 501 is a control center of the database-based duplicate association detection apparatus, and may perform processing according to a set database-based duplicate association detection method. The processor 501 connects various parts of the whole database-based repetitive association detection device by various interfaces and lines, executes various functions of the database-based repetitive association detection device by running or executing software programs and/or modules stored in the memory 509 and calling data stored in the memory 509, and improves detection efficiency by splitting SQL statements in a structured query language SQL script, counting repetitive association source tables in SQL. The storage medium 508 and the memory 509 are both carriers for storing data, and in the embodiment of the present invention, the storage medium 508 may refer to an internal memory with a small storage capacity but a fast speed, and the memory 509 may be an external memory with a large storage capacity but a slow storage speed.

The memory 509 may be used to store software programs and modules, and the processor 501 performs various functional applications and data processing of the database-based duplicate association detection device 500 by running the software programs and modules stored in the memory 509. The memory 509 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function (such as labeling a script directory of a database according to at least one pair of associated source table names), and the like; the storage data area may store data created from the use of the database-based duplicate association detection device (such as a annotated script directory), and the like. In addition, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The database-based duplicate association detection method program and received data streams provided in embodiments of the invention are stored in memory and when needed, the processor 501 recalls from the memory 509.

When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, twisted pair), or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., an optical disk), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiment of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A database-based duplicate association detection method, comprising:

searching a code base of a database, wherein the code base comprises all structured query language SQL scripts;

analyzing each SQL script to obtain a plurality of SQL paragraphs, wherein each SQL script comprises at least one SQL paragraph;

preprocessing the SQL paragraphs to generate a plurality of processed SQL sentences, wherein each SQL paragraph contains at least one SQL sentence;

inputting the plurality of processed SQL sentences into preset software for analysis, and generating at least one pair of source table names which are related in pairs;

labeling script catalogues of the database according to the at least one pair of source table names associated with each other, and generating labeled script catalogues;

Repeating association detection on the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result;

analyzing each SQL script to obtain a plurality of SQL paragraphs, wherein each SQL script comprises at least one SQL paragraph comprising:

determining a target SQL script;

inquiring a plurality of preset SQL keywords in the target SQL script according to preset line numbers, wherein the preset line numbers are used for indicating the line numbers of codes needing to be searched in the target SQL script;

when the preset line number contains the preset SQL keywords, generating a plurality of SQL paragraphs, wherein each SQL paragraph at least comprises one preset SQL keyword;

preprocessing the SQL paragraphs to generate a plurality of processed SQL sentences, wherein each SQL paragraph contains at least one SQL sentence and comprises:

determining an SQL keyword part and a non-SQL keyword part of each SQL paragraph in the plurality of SQL paragraphs;

determining a plurality of corresponding to-be-selected semicolons in each SQL paragraph;

removing illegal semicolons from the corresponding multiple semicolons to be selected, wherein the illegal semicolons are semicolons in a non-SQL keyword part in the corresponding SQL script, and multiple target semicolons corresponding to each SQL paragraph are obtained;

Dividing the corresponding SQL paragraphs according to a plurality of target semicolons corresponding to each SQL paragraph to generate a plurality of processed SQL sentences, wherein the plurality of processed SQL sentences comprise the processed SQL sentences of each SQL paragraph;

inputting the plurality of processed SQL sentences into preset software for analysis, and generating at least one pair of associated source table names comprises the following steps:

inputting a plurality of processed SQL sentences into preset software to generate an abstract syntax tree;

traversing the abstract syntax tree to generate a plurality of query blocks;

traversing the query blocks, and judging whether each query block contains a connection node or not;

if the plurality of query blocks contain the connection nodes, extracting source table names used in the connection nodes;

and generating an analysis result according to the source table names, wherein the analysis result comprises at least one pair of associated source table names.

2. The database-based duplicate association detection method of claim 1, wherein the inputting the plurality of processed SQL statements into the preset software to generate the abstract syntax tree comprises:

uniformly setting keywords in the plurality of processed SQL sentences into a uppercase format or a lowercase format;

Carrying out grammar analysis on SQL sentences with uniform formats through preset software to obtain analyzed codes;

adding special words into the parsed codes based on preset rules, wherein the special words are used for indicating specific grammar;

an abstract syntax tree is generated.

3. The method for detecting repeated association based on database according to claim 1, wherein the step of performing repeated association detection on the source table names associated with each other in all preset SQL scripts according to the annotated script directory, and obtaining a detection result includes:

determining a script detection sequence according to the annotated script catalogue;

determining source table names associated with each other in a first target script based on the script detection sequence, and determining the logarithm of the source table names associated with each other in the first target script, wherein the first target script is a script sequenced first in the script detection sequence;

traversing other scripts based on the script detection sequence to obtain detection results, wherein the detection results comprise source table names associated with each other in pairs and logarithms of the source table names associated with each other in all preset SQL scripts.

4. The method for detecting repeated association based on database according to any one of claims 1-3, wherein the method further comprises, after performing repeated association detection on the source table names associated with each other in all preset SQL scripts according to the annotated script directory, and obtaining a detection result:

Determining the association times of each associated source table name according to the detection result;

sorting according to the association times of each associated source table name from big to small to obtain an association sequence table;

and calling a preset intermediate table to replace a target source table name pair, wherein the target source table name is a preset number of source table names which are ranked at the front in the association sequence table.

5. A database-based duplicate association detection apparatus, comprising:

the searching unit is used for searching a code base of the database, wherein the code base comprises all structured query language SQL scripts;

the analyzing unit is used for analyzing each SQL script to obtain a plurality of SQL paragraphs, and each SQL script comprises at least one SQL paragraph;

the preprocessing unit is used for preprocessing the SQL paragraphs to generate a plurality of processed SQL sentences, and each SQL paragraph contains at least one SQL sentence;

the analysis unit is used for inputting the plurality of processed SQL sentences into preset software for analysis and generating at least one pair of source table names which are related in pairs;

the marking unit is used for marking the script catalogue of the database according to the at least one pair of source table names which are related in pairs, and generating the marked script catalogue;

The detection unit is used for carrying out repeated association detection on the source table names associated with each other in all preset SQL scripts according to the marked script catalogue, and obtaining a detection result;

the analysis unit is specifically configured to: determining a target SQL script; inquiring a plurality of preset SQL keywords in the target SQL script according to preset line numbers, wherein the preset line numbers are used for indicating the line numbers of codes needing to be searched in the target SQL script; when the preset line number contains the preset SQL keywords, generating a plurality of SQL paragraphs, wherein each SQL paragraph at least comprises one preset SQL keyword;

the preprocessing unit is specifically used for: determining an SQL keyword part and a non-SQL keyword part of each SQL paragraph in the plurality of SQL paragraphs; determining a plurality of corresponding to-be-selected semicolons in each SQL paragraph; removing illegal semicolons from the corresponding multiple semicolons to be selected, wherein the illegal semicolons are semicolons in a non-SQL keyword part in the corresponding SQL script, and multiple target semicolons corresponding to each SQL paragraph are obtained; dividing the corresponding SQL paragraphs according to a plurality of target semicolons corresponding to each SQL paragraph to generate a plurality of processed SQL sentences, wherein the plurality of processed SQL sentences comprise the processed SQL sentences of each SQL paragraph;

The analysis unit includes: the first generation module is used for inputting a plurality of processed SQL sentences into preset software to generate an abstract syntax tree; the second generation module is used for traversing the abstract syntax tree and generating a plurality of query blocks; the traversal judging module is used for traversing the query blocks and judging whether each query block contains a connection node or not; the extraction module is used for extracting source table names used in the connection nodes if the plurality of query blocks contain the connection nodes; and the third generation module is used for generating an analysis result according to the source table names, wherein the analysis result comprises at least one pair of associated source table names.

6. The database-based duplicate association detection device of claim 5, wherein the first generation module is specifically configured to:

an abstract syntax tree is generated.

7. The repeated association detection device based on the database according to claim 5, wherein the detection unit is specifically configured to:

8. The database-based duplicate association detection apparatus according to any one of claims 5-7, further comprising:

a determining unit, configured to determine the association times of each associated source table name according to the detection result;

the sorting unit is used for sorting according to the association times of each associated source table name from big to small to obtain an association sequence table;

and the replacing unit is used for calling a preset intermediate table to replace the target source table name pairs, wherein the target source table name is a preset number of source table names which are ranked in the association sequence table and are in front.

9. A database-based duplicate association detection device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the database-based duplicate association detection method of any one of claims 1-4 when the computer program is executed.

10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the database-based duplicate association detection method of any one of claims 1-4.