CN114003229B - SQL code similarity analysis method and system - Google Patents

SQL code similarity analysis method and system Download PDF

Info

Publication number
CN114003229B
CN114003229B CN202111144502.8A CN202111144502A CN114003229B CN 114003229 B CN114003229 B CN 114003229B CN 202111144502 A CN202111144502 A CN 202111144502A CN 114003229 B CN114003229 B CN 114003229B
Authority
CN
China
Prior art keywords
node
field
sql
codes
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111144502.8A
Other languages
Chinese (zh)
Other versions
CN114003229A (en
Inventor
郭晨皓
李龙权
汲浩
蒋天榕
胡涛
唐雁南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen International Bank Co ltd
Original Assignee
Xiamen International Bank Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen International Bank Co ltd filed Critical Xiamen International Bank Co ltd
Priority to CN202111144502.8A priority Critical patent/CN114003229B/en
Publication of CN114003229A publication Critical patent/CN114003229A/en
Application granted granted Critical
Publication of CN114003229B publication Critical patent/CN114003229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/425Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for analyzing SQL code similarity, wherein the system comprises the following steps: the extraction module is used for extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes; the sorting module is used for sorting the Chinese names in the table from top to bottom and sorting the Chinese names in the fields; the space vector calculation module is used for calculating the vector of each important Chinese name and then calculating the space vector of the SQL code; and the analysis module is used for calculating the cosine distance of the two space vectors. By analyzing the semantic similarity between the SQL codes, the problem of repeated labor in the development process can be solved, and the development efficiency and the development quality are improved.

Description

SQL code similarity analysis method and system
Technical Field
The invention relates to the technical field of big data intellectualization, in particular to a method and a system for analyzing similarity of SQL (structured query language) codes.
Background
SQL is a standard computer structured query language for ANSI that is used to access and operate a variety of database systems. The SQL code contains abundant service information, namely the processing caliber of the data. Since the presentation form of SQL is usually complex and has no obvious features, developers engaged in SQL development have a lot of repeated labor. SQL codes needed by current development may exist in own previously written, coworked and system-ready codes, but developers lack a way to simply retrieve the codes and only rewrite the codes.
Disclosure of Invention
The invention provides a method and a system for analyzing the similarity of SQL codes, which can solve the problem of repeated labor in the development process and improve the development efficiency and the development quality by analyzing the semantic similarity between the SQL codes.
In order to achieve the purpose, the invention adopts the technical scheme that:
an SQL code similarity analysis system comprises: the extraction module is used for extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes; the sorting module is used for sorting the table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes and sorting the field Chinese names corresponding to the table codes from top to bottom according to the field codes; a space vector calculation module for calculating a vector of each important Chinese name, and then calculating a space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name, wherein the important Chinese name comprises CM _ N table Chinese names which are ranked at the top and CM _ N field Chinese names which are ranked at the top; and the analysis module is used for calculating the cosine distance between the two space vectors.
Preferably, the system further comprises: the word segmentation module is used for segmenting each important Chinese name to obtain words; and the space vector calculation module calculates the vector of each important Chinese name according to the word vector of the word.
Preferably, the word vector of the word is a V _ N-dimensional vector obtained by pre-training a word2vec model, and the word2vec model is obtained by training a corpus established based on a historical SQL script.
Preferably, the system further comprises an optimization module for completing the SQL code supplementation.
Preferably, the space vector is obtained by performing weighted calculation on the vector of the structural code and the vector of the important Chinese name, and the weight of the important Chinese name is set according to the code type and the ranking corresponding to the important Chinese name.
Preferably, the system further comprises a storage module, configured to store the historical SQL script and the spatial vector of the historical SQL script.
Preferably, the system further comprises a code recommending module, configured to recommend similar codes of the SQL codes in the historical SQL script according to the spatial vector.
Preferably, the system further comprises a code positioning module, configured to position the SQL code in the historical SQL script according to the spatial vector.
Preferably, the system further comprises a code examination module for judging the grammar performance used by the SQL code.
Based on the same inventive concept, the invention also provides an SQL code similarity analysis method, which comprises the following steps: extracting structural codes in SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes; sorting table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes, and sorting field Chinese names corresponding to the table codes from top to bottom according to the field codes; calculating a vector of each important Chinese name, and then calculating a space vector of the SQL code according to the vector of the structural codes and the vector of the important Chinese names, wherein the important Chinese names comprise CM _ N top table Chinese names and CM _ N top field Chinese names; the cosine distance of the two space vectors is calculated.
The invention has the beneficial effects that:
1. extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes, calculating space vectors of the SQL codes according to the vectors, and then calculating cosine distances of the two space vectors to serve as the similarity of the two sections of the SQL codes between semantics;
2. the optimization module supplements the SQL codes completely and packages the SQL codes according to the XML format, so that information extraction operation is convenient to perform;
3. the storage module stores the historical SQL script and the space vector of the historical SQL script, so that subsequent application is facilitated.
Drawings
FIG. 1 is an exemplary diagram of a complementary complete SQL construct.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more clear and obvious, the present invention is further described in detail with reference to specific embodiments below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
How to identify the similarity between different SQL is a complex problem. At present, comprehensive judgment is generally carried out according to structures, tables and fields used in SQL, but on the premise that semantics are completely the same, the positions of the fields, conditions and associations in the SQL can be transformed, and the associations can also adopt a plurality of structural expression modes, so that the method has low accuracy, can embody the effect only when the code difference between the SQL is very small, and has no practical application significance.
Example one
The embodiment discloses an SQL code similarity analysis system, which comprises:
and the optimization module is used for completely supplementing the SQL structure.
SQL code includes single-segment SQL statements, multiple-segment SQL statements, complete SQL scripts (containing multiple-segment SQL statements and some control syntax), and so on.
The optimization module traverses the SQL, matches with a metadata database of the database, and completely supplements the information of the database, the table and the field used by each level in the SQL.
Referring to fig. 1, the SQL code of the present embodiment as an example is as follows:
SELECT A.A1,B1
from AA A
INNER JOIN BB
ON A.A1=B1。
the operation of the optimization module specifically comprises: and expanding a first child node of the node aiming at the node with the node attribute as the table, wherein the node attribute of the first child node is the name of the Chinese in the table, and the value is the name of the Chinese corresponding to the table.
And expanding a second child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the second child node is the name of the Chinese character in the field, and the value is the name of the Chinese character corresponding to the field.
And expanding a third child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the third child node is the field type, and the value is the field type corresponding to the field.
And aiming at the node with the node attribute as the table, if no node attribute is the brother node of the table alias, expanding the first brother node of the node, wherein the node attribute of the first brother node is the table alias, and the value is the alias corresponding to the table. If the sibling node exists, the expansion is not carried out, and the multiplexing is carried out directly.
And for the node with the node attribute as the field, if no node attribute is the brother node of the alias of the field, expanding the second brother node of the node, wherein the node attribute of the second brother node is the alias of the field, and the value is the alias corresponding to the field. If the sibling node exists, the expansion is not carried out, and the multiplexing is carried out directly.
And for the node with the node attribute as the field, if no node attribute is the brother node of the attribution table alias, expanding a third brother node of the node, wherein the node attribute of the third brother node is the attribution table alias, and the value is the table corresponding alias of the attribution of the field. If the sibling node exists, the expansion is not carried out, and the multiplexing is carried out directly.
Furthermore, in order to facilitate the information extraction operation of the extraction module described in this embodiment, an XPATH rule is formulated, and the supplemented and complete SQL is packaged according to an XML format and stored in the two-dimensional table of the database.
And the extraction module is used for extracting structural codes in the SQL codes, namely the keyword codes, the table Chinese names corresponding to the table codes and the field Chinese names corresponding to the field codes.
The extraction module described in this embodiment may access the two-dimensional table of the database using an XPATH rule to obtain service information.
The English names of the fields (e.g., A1 and B1) are discarded because the English name of the field is arbitrarily named in the database, such as the system often uses cust _ no/cus _ no/custorer _ no, which all represent "customer number". If the english name is used for subsequent analysis, a large deviation will occur.
The information extracted by the extraction module according to this embodiment is shown in table 1.
Table 1 SQL construct extraction information
Figure BDA0003284880030000051
Figure BDA0003284880030000061
The core of the invention lies in the textualization of SQL codes, and then the application is realized by utilizing the similarity of texts.
The sequencing module is used for sequencing table names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes and sequencing field names corresponding to the table codes from top to bottom according to the field codes;
in this embodiment, the importance of the table code or the field code in the SQL code is the degree of representation of the chinese name corresponding to the code on the semantic level of the segment of SQL, and is expressed by a TF-IDF (term frequency-inverse file frequency) value.
The core idea of TF-IDF is: in an article, the importance of a word is positively related to the number of times the word appears in the article, and is negatively related to the number of articles in the corpus.
TF (term frequency): indicating the frequency of occurrence of a word in an article and indicating the relevance of a word to the article. This number is a normalization of the number of words (term count) to prevent it from biasing towards long documents.
IDF (inverse document frequency): indicating how common a word appears. The IDF of a word can be obtained by dividing the total number of articles by the number of articles in which the word appears, and then taking the obtained quotient to be a base-10 logarithm.
The importance of a word in one article can be expressed as the product of word frequency and inverse document frequency, i.e., TF-IDF = TF IDF.
The premise of adopting the TF-IDF algorithm is as follows: the word importance of an article is not related to where the word appears in the article.
In this embodiment, the data set of the reverse file frequency can be collected and constructed from historical SQL scripts, i.e., the full amount of SQL scripts.
And the sequencing module respectively obtains the ranking of the TF-IDF of the Chinese name in the table and the ranking of the TF-IDF of the Chinese name in the field.
And the word segmentation module is used for segmenting each important Chinese name to obtain words. Important Chinese names include top-ranked CM _ N table Chinese names and top-ranked CM _ N field Chinese names. That is, the word segmentation module firstly selects the table Chinese names and the field Chinese names with the top CM _ N according to the size of SQL, as shown in tables 2 and 3, and can filter some Chinese names which may not be related to the SQL operation significance, such as "time to put in storage", "number", and the like.
TABLE 2 Chinese name ranking
Watch (A) TF-IDF Ranking rank
Bank 15 1
Loan 15 2
Table 3 field chinese name ranking
Field(s) TF-IDF Ranking rank
Bank, bank number 25 1
Loan bank number 25 2
And the space vector calculation module is used for calculating the vector of each important Chinese name and then calculating the space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name.
And the space vector calculation module calculates the vector of each important Chinese name according to the word vector of the word obtained by word segmentation. Such as important chinese names: the bank number is obtained by word segmentation, and the word vectors corresponding to the words bank and the number are obtained by respectively searching the word segmentation list. And the word vectors of the words in the word segmentation list are V _ N-dimensional vectors obtained through word2vec model pre-training. The larger the dimension of the vector, the higher its accuracy, diversity, but the greater the computational complexity. In the present embodiment, V _ N is set to 255. The word2vec model is obtained by training a corpus built based on historical SQL scripts.
In this embodiment, the average value of the word vectors of the words obtained by word segmentation is taken as the vector of each important Chinese name. The above calculation is performed for each important Chinese name to obtain a set, and the elements in the set include the vector and the type of the important Chinese name, as shown in table 4.
The vector of the structural code is a fixed value, the number of the structured code commonly used in SQL is only 20, and the structured code is directly specified according to the key word (the 255-dimensional vector is only arranged with 1 bit in every 5 dimensions, for example, the SELECT is the first bit 1, and the FROM is the 6 th bit 1).
Table 4 vectors of SQL constructs
SQL structure extraction Word segmentation result Corresponding word vector Weighting vector Marking as Type (B)
SELECT V000 (constant value) V000 CV1 Key word
Bank, bank number Bank/number V1/V2 (V1+V2)/2 CV2 Field(s)
Loan bank number Bank/number V3/V4 (V3+V4)/2 CV3 Field(s)
FROM V001 (constant value) V001 CV4 Key word
Bank Bank V5 V5 CV5 Watch (A)
INNER JOIN V002 (constant value) V002 CV6 Key word
Loan Loan V6 V6 CV7 Watch (A)
ON V003 (constant) V003 CV8 Key word
Bank, bank number Bank/number V1/V2 (V1+V2)/2 CV9 Field(s)
Loan bank number Bank/number V3/V4 (V3+V4)/2 CV10 Field(s)
In this embodiment, the space vector is obtained by performing weighted calculation on the vector of the structural code and the vector of the important chinese name, and corresponding weights are designed according to the characteristics of the SQL structure, such as the type and the rank, as shown in table 5, where maxrank is the TF-IDF rank number, that is, CM _ N, and rank is the TF-IDF corresponding rank.
TABLE 5 weight rules
Figure BDA0003284880030000081
Figure BDA0003284880030000091
The vector of the structural code and the vector of the significant Chinese character name are multiplied by the corresponding weights, and the added values are averaged to obtain the space vector of the SQL code, namely, the space vector of the SQL code (= (CV 1 × W3+ CV4 × W3+ CV6 × W3+ CV8 × W3+ CV2 × W2+ CV9 + CV10 × W2+ CV5 × W1+ CV7 × W1)/10.
And the analysis module is used for calculating the cosine distance of the two space vectors.
In this embodiment, the cosine distance is calculated to measure the distance between the space vectors of two sections of SQL codes, thereby representing the similarity between the semantics of the two sections of SQL codes.
Cosine similarity is used for measuring the difference between two individuals by using a cosine value of an included angle between two vectors in a vector space. Cosine similarity is more focused on the difference of two vectors in direction than on distance measure, rather than distance or length.
Figure BDA0003284880030000092
x represents SQL1 and y represents SQL2.x is the number of i This refers to the ith dimension of the space vector of the SQL1 code. In the present embodiment, i =255.
And the storage module is used for storing the historical SQL script and the spatial vector of the historical SQL script. An operator can input all available SQL statements into the system described in this embodiment, calculate and store their spatial vectors, which can be used for the above-mentioned inverse file frequency data set construction and model training, and the following applications.
And the code recommending module is used for recommending the similar codes of the SQL codes in the historical SQL script according to the space vectors of the SQL codes.
When all SQL developers write codes under a system framework (a WEB SQL editor), uncompleted written SQL codes can be selected through frames (for example, the processing rule of a certain field is unclear, and a certain condition does not know how to limit), a right key is triggered, a similar recommending function is selected, and a code recommending module can automatically calculate the SQL space vector of the current SQL code and recommend 5 sections of the most similar codes for the reference of the developers.
And the code positioning module is used for positioning the SQL code in the historical SQL script according to the space vector of the SQL code.
When the production finds that a certain section of SQL code runs still or audits find that a certain section of SQL code has a problem, it is unclear which script the section of code comes from, and the code positioning module can quickly position the origin of the problem SQL through similarity calculation.
And the code examination module is used for judging the grammar performance used by the SQL code.
In the code examination stage, in order to detect whether the SQL script contains some low-performance grammars (e.g., notin (select XX, XX FROM BB)), the weight corresponding to the corresponding keyword is raised, in this embodiment, modified to 1000, and then the space vector of the code of the script is calculated, and the distance between the space vector of the code of the script and the space vector of the reference SQL code (containing only the above-mentioned keyword code) is compared, so as to determine whether the script uses the corresponding low-performance grammar.
The system can be applied to aperture arrangement and data management of data assets, and is beneficial to operation and maintenance and auditors to quickly locate the problem SQL attribution script and eliminate the problem through the actually captured SQL code.
Example two
The embodiment discloses a method for analyzing similarity of SQL codes, which comprises the following steps:
s1, completely supplementing the SQL structure.
And traversing the SQL, matching with a metadata base of the database, and completely supplementing the information of the base, the table and the field used by each level in the SQL.
And (5) formulating an XPATH rule, packaging the complete supplemented SQL according to an XML format and storing the SQL in a two-dimensional table of a database.
And S2, extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes.
And accessing the two-dimensional table of the database by using an XPATH rule to obtain service information.
S3, sorting the table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes, and sorting the field Chinese names corresponding to the table codes from top to bottom according to the field codes.
Wherein, the importance of the table code or the field code in the SQL code is represented by TF-IDF value. The ranking of TF-IDF for the Chinese names in the table and the ranking of TF-IDF for the Chinese names in the field are obtained, respectively.
And S4, according to the size of the SQL, respectively selecting CM _ N table Chinese names and field Chinese names which are ranked at the top as important Chinese names. And segmenting each important Chinese name to obtain words.
And S5, calculating the vector of each important Chinese name, and then calculating the space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name.
And taking the average value of the word vectors of the words obtained by word segmentation as the vector of each important Chinese name.
The word vector of the word is a V _ N-dimensional vector obtained through word2vec model pre-training.
The vector of the structural code is a constant value.
The space vector is obtained by vector weighting calculation of the vector of the structural code and the vector of the important Chinese name.
The corresponding weight is designed according to the characteristics of the SQL structure body such as type, ranking and the like, the vector of the structural code and the vector of the important Chinese name are multiplied by the corresponding weight respectively, and the average value is added and taken as the space vector of the SQL code.
And S6, calculating the cosine distance of the two space vectors as the similarity of the two sections of SQL codes between semantics.
And S7, storing the historical SQL script and the space vector of the historical SQL script.
And S8, performing application including SQL similar recommendation, SQL tracing search, special structure check and the like through similarity calculation.
While the above description shows and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. An SQL code similarity analysis system, comprising:
the extraction module is used for extracting structural codes in the SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes;
the sorting module is used for sorting the table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes and sorting the field Chinese names corresponding to the table codes from top to bottom according to the field codes;
the space vector calculation module is used for calculating a vector of each important Chinese name, and then calculating the space vector of the SQL code according to the vector of the structural code and the vector of the important Chinese name, wherein the important Chinese name comprises CM _ N table Chinese names with top ranking and CM _ N field Chinese names with top ranking;
the analysis module is used for calculating the cosine distance between the two space vectors;
the system also comprises an optimization module used for completely supplementing the SQL code and packaging according to the established XPATH rule;
the operation of the optimization module specifically comprises: aiming at a node with a node attribute as a table, a first child node of the node is expanded, the node attribute of the first child node is a table Chinese name, and the value is the corresponding Chinese name of the table;
aiming at the node with the node attribute as the field, a second child node of the node is expanded, the node attribute of the second child node is a field Chinese name, and the value is the Chinese name corresponding to the field;
expanding a third child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the third child node is a field type, and the value is the field type corresponding to the field;
aiming at a node with a node attribute as a table, if no node attribute is a brother node of the table alias, expanding a first brother node of the node, wherein the node attribute of the first brother node is the table alias, the value is the corresponding alias of the table, and if the brother node exists, the node attribute is directly reused;
aiming at a node with a node attribute as a field, if no node attribute is a brother node with a field alias, expanding a second brother node of the node, wherein the node attribute of the second brother node is the field alias, the value is the alias corresponding to the field, and if the brother node exists, the node attribute is directly reused;
and aiming at the node with the node attribute as the field, if no node attribute is the brother node of the alias of the home table, expanding the third brother node of the node, wherein the node attribute of the third brother node is the alias of the home table, the value of the alias corresponds to the table to which the field belongs, and if the brother node exists, directly multiplexing.
2. The SQL code similarity analysis system according to claim 1, further comprising:
the word segmentation module is used for segmenting each important Chinese name to obtain words;
and the space vector calculation module calculates the vector of each important Chinese name according to the word vector of the word.
3. The SQL code similarity analysis system according to claim 2, wherein the word vectors of the words are V _ N dimensional vectors obtained by pre-training of a word2vec model, and the word2vec model is obtained by training of a corpus established based on historical SQL scripts.
4. The SQL code similarity analysis system of claim 1, wherein the space vector is obtained by performing weighted calculation on the vector of the structural code and the vector of the important Chinese name, and the weight of the important Chinese name is set according to the code type and the rank corresponding to the important Chinese name.
5. The SQL code similarity analysis system of claim 1, further comprising a storage module configured to store historical SQL scripts and the spatial vectors of the historical SQL scripts.
6. The SQL code similarity analysis system according to claim 5, further comprising a code recommendation module for recommending similar codes of the SQL codes in the historical SQL script according to the space vector.
7. The SQL code similarity analysis system according to claim 5, further comprising a code location module for locating the SQL codes in the historical SQL script according to the spatial vector.
8. The SQL code similarity analysis system according to claim 5, further comprising a code review module for determining the syntax performance used by the SQL code.
9. A SQL code similarity analysis method is characterized by comprising the following steps:
extracting structural codes in SQL codes, table Chinese names corresponding to the table codes and field Chinese names corresponding to the field codes;
sorting table Chinese names corresponding to the table codes from top to bottom according to the importance of the table codes in the SQL codes, and sorting field Chinese names corresponding to the table codes from top to bottom according to the field codes;
calculating a vector of each important Chinese name, and then calculating a space vector of the SQL code according to the vector of the structural codes and the vector of the important Chinese names, wherein the important Chinese names comprise CM _ N top table Chinese names and CM _ N top field Chinese names;
calculating the cosine distance of the two space vectors;
supplementing the SQL codes completely, and packaging according to a formulated XPATH rule; the operation of the optimization module specifically comprises:
expanding a first child node of the node aiming at the node with the node attribute as the table, wherein the node attribute of the first child node is a table Chinese name, and the value is a Chinese name corresponding to the table;
aiming at the node with the node attribute as the field, a second child node of the node is expanded, the node attribute of the second child node is a field Chinese name, and the value is the Chinese name corresponding to the field;
expanding a third child node of the node aiming at the node with the node attribute as the field, wherein the node attribute of the third child node is the field type, and the value is the field type corresponding to the field;
aiming at a node with a node attribute as a table, if no node attribute is a brother node of the table alias, expanding a first brother node of the node, wherein the node attribute of the first brother node is the table alias, the value is the corresponding alias of the table, and if the brother node exists, the node attribute is directly reused;
aiming at a node with a node attribute as a field, if no node attribute is a brother node with a field alias, expanding a second brother node of the node, wherein the node attribute of the second brother node is the field alias, the value is the alias corresponding to the field, and if the brother node exists, the node attribute is directly reused;
and aiming at the node with the node attribute as the field, if no node attribute is the brother node of the alias of the home table, expanding the third brother node of the node, wherein the node attribute of the third brother node is the alias of the home table, the value of the alias corresponds to the table to which the field belongs, and if the brother node exists, directly multiplexing.
CN202111144502.8A 2021-09-28 2021-09-28 SQL code similarity analysis method and system Active CN114003229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111144502.8A CN114003229B (en) 2021-09-28 2021-09-28 SQL code similarity analysis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111144502.8A CN114003229B (en) 2021-09-28 2021-09-28 SQL code similarity analysis method and system

Publications (2)

Publication Number Publication Date
CN114003229A CN114003229A (en) 2022-02-01
CN114003229B true CN114003229B (en) 2022-10-18

Family

ID=79921955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111144502.8A Active CN114003229B (en) 2021-09-28 2021-09-28 SQL code similarity analysis method and system

Country Status (1)

Country Link
CN (1) CN114003229B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN111984666A (en) * 2019-05-23 2020-11-24 北京数聚鑫云信息技术有限公司 Database access method and device, computer readable storage medium and computer equipment
CN112527971A (en) * 2020-12-25 2021-03-19 华戎信息产业有限公司 Method and system for searching similar articles
WO2021066231A1 (en) * 2019-10-01 2021-04-08 (주) 더존비즈온 Sql query recommendation method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10599664B2 (en) * 2016-04-22 2020-03-24 Cloudera, Inc. Interactive identification of similar SQL queries
US11269874B2 (en) * 2020-03-19 2022-03-08 Oracle International Corporation Two-valued logic primitives for SQL query processing
CN112764809B (en) * 2021-01-25 2022-07-05 广西大学 SQL code plagiarism detection method and system based on coding characteristics
CN112765201A (en) * 2021-02-01 2021-05-07 武汉思普崚技术有限公司 Method and device for analyzing SQL (structured query language) statement into specific field query statement

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984666A (en) * 2019-05-23 2020-11-24 北京数聚鑫云信息技术有限公司 Database access method and device, computer readable storage medium and computer equipment
WO2021066231A1 (en) * 2019-10-01 2021-04-08 (주) 더존비즈온 Sql query recommendation method and system
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN112527971A (en) * 2020-12-25 2021-03-19 华戎信息产业有限公司 Method and system for searching similar articles

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
sql语句相似度统计的说明;fjssharpsword;《https://blog.csdn.net/fjssharpsword/article/details/8979307?utm_medium=distribute.pc_aggpage_search_result.none-task-blog-2~aggregatepage~first_rank_ecpm_v1~rank_v31_ecpm-1-8979307-null-null.pc_agg_new_rank&utm_term》;20130527;第1-4页 *

Also Published As

Publication number Publication date
CN114003229A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
US11222055B2 (en) System, computer-implemented method and computer program product for information retrieval
WO2022116537A1 (en) News recommendation method and apparatus, and electronic device and storage medium
CN105989040B (en) Intelligent question and answer method, device and system
JP5721818B2 (en) Use of model information group in search
US20160217144A1 (en) Method and device for obtaining web page category standards, and method and device for categorizing web page categories
CN111125086B (en) Method, device, storage medium and processor for acquiring data resources
CN110737756B (en) Method, apparatus, device and medium for determining answer to user input data
CN110019669B (en) Text retrieval method and device
Misuraca et al. BMS: An improved Dunn index for Document Clustering validation
CN112347352A (en) Course recommendation method and device and storage medium
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN112989813A (en) Scientific and technological resource relation extraction method and device based on pre-training language model
CN110795613A (en) Commodity searching method, device and system and electronic equipment
CN113656540B (en) BI query method, device, equipment and medium based on NL2SQL
Wei et al. Online education recommendation model based on user behavior data analysis
CN105159898A (en) Searching method and searching device
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN113591476A (en) Data label recommendation method based on machine learning
CN114003229B (en) SQL code similarity analysis method and system
US20200110769A1 (en) Machine learning (ml) based expansion of a data set
CN115936805A (en) Commodity recommendation method, commodity recommendation device, commodity recommendation equipment and commodity recommendation medium
CN114328844A (en) Text data set management method, device, equipment and storage medium
CN112100323B (en) Hidden association mining method based on representation learning
Li et al. A service mode of expert finding in social network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant