CN114003229B

CN114003229B - SQL code similarity analysis method and system

Info

Publication number: CN114003229B
Application number: CN202111144502.8A
Authority: CN
Inventors: 郭晨皓; 李龙权; 汲浩; 蒋天榕; 胡涛; 唐雁南
Original assignee: Xiamen International Bank Co ltd
Current assignee: Xiamen International Bank Co ltd
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2022-10-18
Anticipated expiration: 2041-09-28
Also published as: CN114003229A

Abstract

The invention discloses a method and a system for analyzing SQL code similarity, wherein the system comprises the following steps: the extraction module is used for extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes; the sorting module is used for sorting the Chinese names in the table from top to bottom and sorting the Chinese names in the fields; the space vector calculation module is used for calculating the vector of each important Chinese name and then calculating the space vector of the SQL code; and the analysis module is used for calculating the cosine distance of the two space vectors. By analyzing the semantic similarity between the SQL codes, the problem of repeated labor in the development process can be solved, and the development efficiency and the development quality are improved.

Description

SQL code similarity analysis method and system

Technical Field

The invention relates to the technical field of big data intellectualization, in particular to a method and a system for analyzing similarity of SQL (structured query language) codes.

Background

SQL is a standard computer structured query language for ANSI that is used to access and operate a variety of database systems. The SQL code contains abundant service information, namely the processing caliber of the data. Since the presentation form of SQL is usually complex and has no obvious features, developers engaged in SQL development have a lot of repeated labor. SQL codes needed by current development may exist in own previously written, coworked and system-ready codes, but developers lack a way to simply retrieve the codes and only rewrite the codes.

Disclosure of Invention

The invention provides a method and a system for analyzing the similarity of SQL codes, which can solve the problem of repeated labor in the development process and improve the development efficiency and the development quality by analyzing the semantic similarity between the SQL codes.

In order to achieve the purpose, the invention adopts the technical scheme that:

an SQL code similarity analysis system comprises: the extraction module is used for extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes; the sorting module is used for sorting the table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes and sorting the field Chinese names corresponding to the table codes from top to bottom according to the field codes; a space vector calculation module for calculating a vector of each important Chinese name, and then calculating a space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name, wherein the important Chinese name comprises CM _ N table Chinese names which are ranked at the top and CM _ N field Chinese names which are ranked at the top; and the analysis module is used for calculating the cosine distance between the two space vectors.

Preferably, the system further comprises: the word segmentation module is used for segmenting each important Chinese name to obtain words; and the space vector calculation module calculates the vector of each important Chinese name according to the word vector of the word.

Preferably, the word vector of the word is a V _ N-dimensional vector obtained by pre-training a word2vec model, and the word2vec model is obtained by training a corpus established based on a historical SQL script.

Preferably, the system further comprises an optimization module for completing the SQL code supplementation.

Preferably, the space vector is obtained by performing weighted calculation on the vector of the structural code and the vector of the important Chinese name, and the weight of the important Chinese name is set according to the code type and the ranking corresponding to the important Chinese name.

Preferably, the system further comprises a storage module, configured to store the historical SQL script and the spatial vector of the historical SQL script.

Preferably, the system further comprises a code recommending module, configured to recommend similar codes of the SQL codes in the historical SQL script according to the spatial vector.

Preferably, the system further comprises a code positioning module, configured to position the SQL code in the historical SQL script according to the spatial vector.

Preferably, the system further comprises a code examination module for judging the grammar performance used by the SQL code.

Based on the same inventive concept, the invention also provides an SQL code similarity analysis method, which comprises the following steps: extracting structural codes in SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes; sorting table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes, and sorting field Chinese names corresponding to the table codes from top to bottom according to the field codes; calculating a vector of each important Chinese name, and then calculating a space vector of the SQL code according to the vector of the structural codes and the vector of the important Chinese names, wherein the important Chinese names comprise CM _ N top table Chinese names and CM _ N top field Chinese names; the cosine distance of the two space vectors is calculated.

The invention has the beneficial effects that:

1. extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes, calculating space vectors of the SQL codes according to the vectors, and then calculating cosine distances of the two space vectors to serve as the similarity of the two sections of the SQL codes between semantics;

2. the optimization module supplements the SQL codes completely and packages the SQL codes according to the XML format, so that information extraction operation is convenient to perform;

3. the storage module stores the historical SQL script and the space vector of the historical SQL script, so that subsequent application is facilitated.

Drawings

FIG. 1 is an exemplary diagram of a complementary complete SQL construct.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects of the present invention more clear and obvious, the present invention is further described in detail with reference to specific embodiments below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

How to identify the similarity between different SQL is a complex problem. At present, comprehensive judgment is generally carried out according to structures, tables and fields used in SQL, but on the premise that semantics are completely the same, the positions of the fields, conditions and associations in the SQL can be transformed, and the associations can also adopt a plurality of structural expression modes, so that the method has low accuracy, can embody the effect only when the code difference between the SQL is very small, and has no practical application significance.

Example one

The embodiment discloses an SQL code similarity analysis system, which comprises:

and the optimization module is used for completely supplementing the SQL structure.

SQL code includes single-segment SQL statements, multiple-segment SQL statements, complete SQL scripts (containing multiple-segment SQL statements and some control syntax), and so on.

The optimization module traverses the SQL, matches with a metadata database of the database, and completely supplements the information of the database, the table and the field used by each level in the SQL.

Referring to fig. 1, the SQL code of the present embodiment as an example is as follows:

SELECT A.A1,B1

from AA A

INNER JOIN BB

ON A.A1＝B1。

the operation of the optimization module specifically comprises: and expanding a first child node of the node aiming at the node with the node attribute as the table, wherein the node attribute of the first child node is the name of the Chinese in the table, and the value is the name of the Chinese corresponding to the table.

And expanding a second child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the second child node is the name of the Chinese character in the field, and the value is the name of the Chinese character corresponding to the field.

And expanding a third child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the third child node is the field type, and the value is the field type corresponding to the field.

And aiming at the node with the node attribute as the table, if no node attribute is the brother node of the table alias, expanding the first brother node of the node, wherein the node attribute of the first brother node is the table alias, and the value is the alias corresponding to the table. If the sibling node exists, the expansion is not carried out, and the multiplexing is carried out directly.

And for the node with the node attribute as the field, if no node attribute is the brother node of the alias of the field, expanding the second brother node of the node, wherein the node attribute of the second brother node is the alias of the field, and the value is the alias corresponding to the field. If the sibling node exists, the expansion is not carried out, and the multiplexing is carried out directly.

And for the node with the node attribute as the field, if no node attribute is the brother node of the attribution table alias, expanding a third brother node of the node, wherein the node attribute of the third brother node is the attribution table alias, and the value is the table corresponding alias of the attribution of the field. If the sibling node exists, the expansion is not carried out, and the multiplexing is carried out directly.

Furthermore, in order to facilitate the information extraction operation of the extraction module described in this embodiment, an XPATH rule is formulated, and the supplemented and complete SQL is packaged according to an XML format and stored in the two-dimensional table of the database.

And the extraction module is used for extracting structural codes in the SQL codes, namely the keyword codes, the table Chinese names corresponding to the table codes and the field Chinese names corresponding to the field codes.

The extraction module described in this embodiment may access the two-dimensional table of the database using an XPATH rule to obtain service information.

The English names of the fields (e.g., A1 and B1) are discarded because the English name of the field is arbitrarily named in the database, such as the system often uses cust _ no/cus _ no/custorer _ no, which all represent "customer number". If the english name is used for subsequent analysis, a large deviation will occur.

The information extracted by the extraction module according to this embodiment is shown in table 1.

Table 1 SQL construct extraction information

The core of the invention lies in the textualization of SQL codes, and then the application is realized by utilizing the similarity of texts.

The sequencing module is used for sequencing table names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes and sequencing field names corresponding to the table codes from top to bottom according to the field codes;

in this embodiment, the importance of the table code or the field code in the SQL code is the degree of representation of the chinese name corresponding to the code on the semantic level of the segment of SQL, and is expressed by a TF-IDF (term frequency-inverse file frequency) value.

The core idea of TF-IDF is: in an article, the importance of a word is positively related to the number of times the word appears in the article, and is negatively related to the number of articles in the corpus.

TF (term frequency): indicating the frequency of occurrence of a word in an article and indicating the relevance of a word to the article. This number is a normalization of the number of words (term count) to prevent it from biasing towards long documents.

IDF (inverse document frequency): indicating how common a word appears. The IDF of a word can be obtained by dividing the total number of articles by the number of articles in which the word appears, and then taking the obtained quotient to be a base-10 logarithm.

The importance of a word in one article can be expressed as the product of word frequency and inverse document frequency, i.e., TF-IDF = TF IDF.

The premise of adopting the TF-IDF algorithm is as follows: the word importance of an article is not related to where the word appears in the article.

In this embodiment, the data set of the reverse file frequency can be collected and constructed from historical SQL scripts, i.e., the full amount of SQL scripts.

And the sequencing module respectively obtains the ranking of the TF-IDF of the Chinese name in the table and the ranking of the TF-IDF of the Chinese name in the field.

And the word segmentation module is used for segmenting each important Chinese name to obtain words. Important Chinese names include top-ranked CM _ N table Chinese names and top-ranked CM _ N field Chinese names. That is, the word segmentation module firstly selects the table Chinese names and the field Chinese names with the top CM _ N according to the size of SQL, as shown in tables 2 and 3, and can filter some Chinese names which may not be related to the SQL operation significance, such as "time to put in storage", "number", and the like.

TABLE 2 Chinese name ranking

Watch (A)	TF-IDF	Ranking rank
			Bank	15	1
Loan	15	2

Table 3 field chinese name ranking

Field(s)	TF-IDF	Ranking rank
			Bank, bank number	25	1
Loan bank number	25	2

And the space vector calculation module is used for calculating the vector of each important Chinese name and then calculating the space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name.

And the space vector calculation module calculates the vector of each important Chinese name according to the word vector of the word obtained by word segmentation. Such as important chinese names: the bank number is obtained by word segmentation, and the word vectors corresponding to the words bank and the number are obtained by respectively searching the word segmentation list. And the word vectors of the words in the word segmentation list are V _ N-dimensional vectors obtained through word2vec model pre-training. The larger the dimension of the vector, the higher its accuracy, diversity, but the greater the computational complexity. In the present embodiment, V _ N is set to 255. The word2vec model is obtained by training a corpus built based on historical SQL scripts.

In this embodiment, the average value of the word vectors of the words obtained by word segmentation is taken as the vector of each important Chinese name. The above calculation is performed for each important Chinese name to obtain a set, and the elements in the set include the vector and the type of the important Chinese name, as shown in table 4.

The vector of the structural code is a fixed value, the number of the structured code commonly used in SQL is only 20, and the structured code is directly specified according to the key word (the 255-dimensional vector is only arranged with 1 bit in every 5 dimensions, for example, the SELECT is the first bit 1, and the FROM is the 6 th bit 1).

Table 4 vectors of SQL constructs

SQL structure extraction	Word segmentation result	Corresponding word vector	Weighting vector	Marking as	Type (B)
						SELECT		V000 (constant value)	V000	CV1	Key word
Bank, bank number	Bank/number	V1/V2	(V1+V2)/2	CV2	Field(s)
						Loan bank number	Bank/number	V3/V4	(V3+V4)/2	CV3	Field(s)
FROM		V001 (constant value)	V001	CV4	Key word
						Bank	Bank	V5	V5	CV5	Watch (A)
INNER JOIN		V002 (constant value)	V002	CV6	Key word
						Loan	Loan	V6	V6	CV7	Watch (A)
ON		V003 (constant)	V003	CV8	Key word
						Bank, bank number	Bank/number	V1/V2	(V1+V2)/2	CV9	Field(s)
Loan bank number	Bank/number	V3/V4	(V3+V4)/2	CV10	Field(s)

In this embodiment, the space vector is obtained by performing weighted calculation on the vector of the structural code and the vector of the important chinese name, and corresponding weights are designed according to the characteristics of the SQL structure, such as the type and the rank, as shown in table 5, where maxrank is the TF-IDF rank number, that is, CM _ N, and rank is the TF-IDF corresponding rank.

TABLE 5 weight rules

The vector of the structural code and the vector of the significant Chinese character name are multiplied by the corresponding weights, and the added values are averaged to obtain the space vector of the SQL code, namely, the space vector of the SQL code (= (CV 1 × W3+ CV4 × W3+ CV6 × W3+ CV8 × W3+ CV2 × W2+ CV9 + CV10 × W2+ CV5 × W1+ CV7 × W1)/10.

And the analysis module is used for calculating the cosine distance of the two space vectors.

In this embodiment, the cosine distance is calculated to measure the distance between the space vectors of two sections of SQL codes, thereby representing the similarity between the semantics of the two sections of SQL codes.

Cosine similarity is used for measuring the difference between two individuals by using a cosine value of an included angle between two vectors in a vector space. Cosine similarity is more focused on the difference of two vectors in direction than on distance measure, rather than distance or length.

x represents SQL1 and y represents SQL2.x is the number of _i This refers to the ith dimension of the space vector of the SQL1 code. In the present embodiment, i =255.

And the storage module is used for storing the historical SQL script and the spatial vector of the historical SQL script. An operator can input all available SQL statements into the system described in this embodiment, calculate and store their spatial vectors, which can be used for the above-mentioned inverse file frequency data set construction and model training, and the following applications.

And the code recommending module is used for recommending the similar codes of the SQL codes in the historical SQL script according to the space vectors of the SQL codes.

When all SQL developers write codes under a system framework (a WEB SQL editor), uncompleted written SQL codes can be selected through frames (for example, the processing rule of a certain field is unclear, and a certain condition does not know how to limit), a right key is triggered, a similar recommending function is selected, and a code recommending module can automatically calculate the SQL space vector of the current SQL code and recommend 5 sections of the most similar codes for the reference of the developers.

And the code positioning module is used for positioning the SQL code in the historical SQL script according to the space vector of the SQL code.

When the production finds that a certain section of SQL code runs still or audits find that a certain section of SQL code has a problem, it is unclear which script the section of code comes from, and the code positioning module can quickly position the origin of the problem SQL through similarity calculation.

And the code examination module is used for judging the grammar performance used by the SQL code.

In the code examination stage, in order to detect whether the SQL script contains some low-performance grammars (e.g., notin (select XX, XX FROM BB)), the weight corresponding to the corresponding keyword is raised, in this embodiment, modified to 1000, and then the space vector of the code of the script is calculated, and the distance between the space vector of the code of the script and the space vector of the reference SQL code (containing only the above-mentioned keyword code) is compared, so as to determine whether the script uses the corresponding low-performance grammar.

The system can be applied to aperture arrangement and data management of data assets, and is beneficial to operation and maintenance and auditors to quickly locate the problem SQL attribution script and eliminate the problem through the actually captured SQL code.

Example two

The embodiment discloses a method for analyzing similarity of SQL codes, which comprises the following steps:

s1, completely supplementing the SQL structure.

And traversing the SQL, matching with a metadata base of the database, and completely supplementing the information of the base, the table and the field used by each level in the SQL.

And (5) formulating an XPATH rule, packaging the complete supplemented SQL according to an XML format and storing the SQL in a two-dimensional table of a database.

And S2, extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes.

And accessing the two-dimensional table of the database by using an XPATH rule to obtain service information.

S3, sorting the table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes, and sorting the field Chinese names corresponding to the table codes from top to bottom according to the field codes.

Wherein, the importance of the table code or the field code in the SQL code is represented by TF-IDF value. The ranking of TF-IDF for the Chinese names in the table and the ranking of TF-IDF for the Chinese names in the field are obtained, respectively.

And S4, according to the size of the SQL, respectively selecting CM _ N table Chinese names and field Chinese names which are ranked at the top as important Chinese names. And segmenting each important Chinese name to obtain words.

And S5, calculating the vector of each important Chinese name, and then calculating the space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name.

And taking the average value of the word vectors of the words obtained by word segmentation as the vector of each important Chinese name.

The word vector of the word is a V _ N-dimensional vector obtained through word2vec model pre-training.

The vector of the structural code is a constant value.

The space vector is obtained by vector weighting calculation of the vector of the structural code and the vector of the important Chinese name.

The corresponding weight is designed according to the characteristics of the SQL structure body such as type, ranking and the like, the vector of the structural code and the vector of the important Chinese name are multiplied by the corresponding weight respectively, and the average value is added and taken as the space vector of the SQL code.

And S6, calculating the cosine distance of the two space vectors as the similarity of the two sections of SQL codes between semantics.

And S7, storing the historical SQL script and the space vector of the historical SQL script.

And S8, performing application including SQL similar recommendation, SQL tracing search, special structure check and the like through similarity calculation.

While the above description shows and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. An SQL code similarity analysis system, comprising:

the extraction module is used for extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes;

the sorting module is used for sorting the table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes and sorting the field Chinese names corresponding to the table codes from top to bottom according to the field codes;

the space vector calculation module is used for calculating a vector of each important Chinese name, and then calculating the space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name, wherein the important Chinese name comprises CM _ N table Chinese names with top ranking and CM _ N field Chinese names with top ranking;

the analysis module is used for calculating the cosine distance between the two space vectors;

the system also comprises an optimization module used for completely supplementing the SQL code and packaging according to the established XPATH rule;

the operation of the optimization module specifically comprises: aiming at a node with a node attribute as a table, a first child node of the node is expanded, the node attribute of the first child node is a table Chinese name, and the value is the corresponding Chinese name of the table;

aiming at the node with the node attribute as the field, a second child node of the node is expanded, the node attribute of the second child node is a field Chinese name, and the value is the Chinese name corresponding to the field;

expanding a third child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the third child node is a field type, and the value is the field type corresponding to the field;

aiming at a node with a node attribute as a table, if no node attribute is a brother node of the table alias, expanding a first brother node of the node, wherein the node attribute of the first brother node is the table alias, the value is the corresponding alias of the table, and if the brother node exists, the node attribute is directly reused;

aiming at a node with a node attribute as a field, if no node attribute is a brother node with a field alias, expanding a second brother node of the node, wherein the node attribute of the second brother node is the field alias, the value is the alias corresponding to the field, and if the brother node exists, the node attribute is directly reused;

and aiming at the node with the node attribute as the field, if no node attribute is the brother node of the alias of the home table, expanding the third brother node of the node, wherein the node attribute of the third brother node is the alias of the home table, the value of the alias corresponds to the table to which the field belongs, and if the brother node exists, directly multiplexing.

2. The SQL code similarity analysis system according to claim 1, further comprising:

the word segmentation module is used for segmenting each important Chinese name to obtain words;

and the space vector calculation module calculates the vector of each important Chinese name according to the word vector of the word.

3. The SQL code similarity analysis system according to claim 2, wherein the word vectors of the words are V _ N dimensional vectors obtained by pre-training of a word2vec model, and the word2vec model is obtained by training of a corpus established based on historical SQL scripts.

4. The SQL code similarity analysis system of claim 1, wherein the space vector is obtained by performing weighted calculation on the vector of the structural code and the vector of the important Chinese name, and the weight of the important Chinese name is set according to the code type and the rank corresponding to the important Chinese name.

5. The SQL code similarity analysis system of claim 1, further comprising a storage module configured to store historical SQL scripts and the spatial vectors of the historical SQL scripts.

6. The SQL code similarity analysis system according to claim 5, further comprising a code recommendation module for recommending similar codes of the SQL codes in the historical SQL script according to the space vector.

7. The SQL code similarity analysis system according to claim 5, further comprising a code location module for locating the SQL codes in the historical SQL script according to the spatial vector.

8. The SQL code similarity analysis system according to claim 5, further comprising a code review module for determining the syntax performance used by the SQL code.

9. A SQL code similarity analysis method is characterized by comprising the following steps:

extracting structural codes in SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes;

sorting table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes, and sorting field Chinese names corresponding to the table codes from top to bottom according to the field codes;

calculating a vector of each important Chinese name, and then calculating a space vector of the SQL code according to the vector of the structural codes and the vector of the important Chinese names, wherein the important Chinese names comprise CM _ N top table Chinese names and CM _ N top field Chinese names;

calculating the cosine distance of the two space vectors;

supplementing the SQL codes completely, and packaging according to a formulated XPATH rule; the operation of the optimization module specifically comprises:

expanding a first child node of the node aiming at the node with the node attribute as the table, wherein the node attribute of the first child node is a table Chinese name, and the value is a Chinese name corresponding to the table;

expanding a third child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the third child node is the field type, and the value is the field type corresponding to the field;