CN116501764B - Automatic SQL optimization method based on generated pre-training model - Google Patents

Automatic SQL optimization method based on generated pre-training model Download PDF

Info

Publication number
CN116501764B
CN116501764B CN202310760270.1A CN202310760270A CN116501764B CN 116501764 B CN116501764 B CN 116501764B CN 202310760270 A CN202310760270 A CN 202310760270A CN 116501764 B CN116501764 B CN 116501764B
Authority
CN
China
Prior art keywords
sql
optimized
score
instance
training model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310760270.1A
Other languages
Chinese (zh)
Other versions
CN116501764A (en
Inventor
宋云生
刘雯
陈桂武
赵新阳
李庆安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lan You Technology Co Ltd
Original Assignee
Shenzhen Lan You Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Lan You Technology Co Ltd filed Critical Shenzhen Lan You Technology Co Ltd
Priority to CN202310760270.1A priority Critical patent/CN116501764B/en
Publication of CN116501764A publication Critical patent/CN116501764A/en
Application granted granted Critical
Publication of CN116501764B publication Critical patent/CN116501764B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides an automatic SQL optimization method based on a generated pre-training model, which comprises the following steps: s1, storing a large number of SQL which can be executed in a simulation test environment into a historical SQL instance library to form an SQL instance set to be optimized, transmitting the SQL instance set to be optimized to a generated pre-training model in batches to generate a plurality of SQL instance sets to be selected, and combining the generated SQL instance set to be optimized and the SQL instance set to be optimized into an SQL data set to be selected; s2, constructing an SQL evaluation model through a deep neural network algorithm, and combining the generated pre-training model and the deep learning SQL relative efficiency evaluation model to improve SQL execution efficiency and reduce SQL optimization thresholds. Meanwhile, the diversity of the SQL statement generation to be selected is improved, and the SQL performance and accuracy are improved by constructing a special evaluation function and combining fine adjustment with a generation model, so that the requirements of enterprise-level application programs are better met, and more accurate, special and convenient SQL optimization service is provided for users.

Description

Automatic SQL optimization method based on generated pre-training model
Technical Field
The application relates to the technical field of big data processing, in particular to an automatic SQL optimization method based on a generated pre-training model.
Background
Databases are one of the most commonly used data storage and processing systems in enterprise-level applications. SQL statements in a database are important languages for querying and concatenating data. However, due to the complexity and diversity of SQL sentences, the efficiency and the resource occupation condition of SQL written by a developer at a non-communication level are quite different, such as slow execution, occupation of a large amount of time, or occupation of CPU and excessive memory, a large amount of resource waste is brought to enterprises, and database crashes are caused to cause squeeze faults or downtime when serious. The inertia of the developer itself or some developers do not have the technical ability to optimize SQL. In addition, because the execution result of the SQL statement is affected by various factors, such as parameters of query, table structure, selection of index, etc., manual optimization of the SQL statement often requires a lot of manpower, and it is difficult to balance hardware resource saving and manpower resource waste.
In the prior art, an SQL optimization rule engine is generally constructed from the viewpoints of SQL grammar, table structure and machine learning to generate SQL with the same execution result. Or analyzing the source SQL sentence based on the grammar structure tree and the SQL analyzer to obtain grammar components such as keywords, table relations, conditions and the like, reconstructing the SQL sentence by combining the rules by using a machine learning and deep learning means, then evaluating the effect of the SQL sentence, and screening the superior SQL sentence, such as an automatic method and a system for realizing the optimized execution of the database SQL based on artificial intelligence, which are disclosed in the prior patent number CN 113505141A.
Generally, the existing SQL optimization scheme based on rules and machine learning and deep learning engine has a certain application value, however, the existing SQL optimization tool still has some challenges and limitations. Because the problems of analysis and acquisition of rules, the number of components analyzed and the like limit SQL optimization alternatives, the intelligent degree is very small, the traditional scheme is incomplete in consideration of SQL advantage evaluation, only the SQL time consumption is considered, and comprehensive evaluation cannot be performed from the aspects of resource use such as CPU core number, memory occupation and the like.
The prior art has the defects that:
(1) SQL reconstruction generation capability is limited: the SQL reconstruction system based on rule and grammar tree analysis is based on reconstruction of an already analyzed component, is not generated in nature, can not automatically generate the SQL to be selected based on semantic level, is less likely to learn and optimize knowledge such as enterprise database and table structure, and can not reconstruct the optimal SQL to be selected even the SQL to be recombined if the analyzed component is not added to training data in a model learning training stage.
(2) The effect evaluation of SQL optimization is not comprehensive: when the SQL alternative scheme is screened, only time consumption or execution speed is targeted, the resource occupation condition, such as CPU core occupation and memory occupation, is ignored, and the larger the resources are theoretically occupied, the faster the SQL execution speed is possible, but the advantage brought by SQL reorganization is not necessarily needed.
(3) The SQL optimized effect evaluation is not linked to machine learning: the model data after evaluation and verification need to be re-input into SQL to generate model training so as to adjust the generation direction of the model, but the current scheme does not form iterative linkage, is limited by simple targets, is reconstructed but not generated, even if the linkage effect is limited, the machine does not have a method for reorganizing more effective SQL to be selected.
Disclosure of Invention
Aiming at the defects of the technical scheme, the application provides the automatic SQL optimization method based on the generated pre-training model, which is high in efficiency, accurate and easy to use.
The application provides an automatic SQL optimization method based on a generated pre-training model, which comprises the following steps:
s1, storing a large number of SQL which can be executed in a simulation test environment into a historical SQL instance library to form an SQL instance set to be optimized, transmitting the SQL instance set to be optimized to a generated pre-training model in batches to generate a plurality of SQL instance sets to be selected, and combining the generated SQL instance set to be optimized and the SQL instance set to be optimized into an SQL data set to be selected;
s2, constructing an SQL evaluation model through a deep neural network algorithm, performing prediction evaluation on a to-be-selected SQL data set generated based on the generated pre-training model according to the SQL evaluation model, and screening the to-be-optimized SQL;
s3, constructing calling logic of the user terminal based on the generated pre-training model, the SQL evaluation model and the simulation test environment to perform automatic SQL optimization.
In the automatic SQL optimization method based on the generated pre-training model, provided by the application; the step S1 includes the steps of:
s11, SQL is transmitted to an SQL cleaning module, irregular content and multiple sentences are cleaned, and an SQL instance set to be optimized is formed;
s12, placing the SQL data set to be selected into a simulation test environment for execution, recording each SQL execution result to be selected and the corresponding SQL execution result to be optimized, judging whether the SQL execution result set to be selected is consistent with the SQL execution result set to be optimized or not, if the SQL execution result set to be selected contains the SQL execution result set to be optimized, the SQL data set to be selected is expressed as 1, and if the SQL data set to be selected does not contain the SQL execution result set to be optimized, the SQL data set to be selected is expressed as 0, wherein the execution duration of each SQL isTOccupy CPU core number asCThe occupied memory size isM
In the automatic SQL optimization method based on the generated pre-training model, provided by the application; the step S1 further includes the steps of:
s13, performing score calculation on the execution result of the SQL data set to be selected according to the SQL evaluation algorithm module to obtain an SQL scoring functionScore o The calculation formula of the SQL scoring function is as follows:
Score o consistency× (wc×c+wm×m) ×t, where Scoreo is an absolute dominance score of SQL, wc, wm represent weights of the total score for CPU occupancy and memory occupancy, respectively, and wc+wm=1.
In the automatic SQL optimization method based on the generated pre-training model, provided by the application; the step S1 further includes the steps of:
s14, collecting the SQL data set to be selected after grading by the SQL algorithm grading module into an SQL instance grading data set, screening SQL with grading larger than 0 as an iterative data set, sending the SQL to be optimized and the corresponding SQL to be selected into an SQL diversification module, analyzing and replacing library names, table names and field names in the SQL to be optimized and the SQL to be selected, merging the replaced SQL and the iterative data set into a diversified SQL instance grading data set, and finally, recording the diversified SQL instance grading data set into a generated pre-training model for fine adjustment, and selecting paired SQL to be optimized and SQL to be selected for supervised fine adjustment; after finishing the fine tuning iteration, saving the generated pre-training model for calling; when the historical SQL instance library is increased to a predetermined level, the steps S11-S14 are continuously executed to fine tune the iterative generation type pre-training model again.
In the automatic SQL optimization method based on the generated pre-training model, provided by the application; the step S2 includes the steps of:
s21, after the simulation test environment is executed by the to-be-selected SQL instance set, recording each to-be-selected SQL execution result and the corresponding to-be-optimized SQL execution result, judging whether the to-be-selected SQL execution result set is consistent with the corresponding to-be-optimized SQL execution result set, if the to-be-selected SQL execution result set contains the to-be-optimized SQL execution result set, the to-be-selected SQL execution result set is expressed as 1, and if the to-be-selected SQL execution result set does not contain the to-be-optimized SQL execution result set, the to-be-selected SQL execution result set is expressed as 0, wherein the execution duration of each SQL isTOccupy CPU core number asCThe occupied memory size isM
S22, performing score calculation on the execution result of the SQL data set to be selected according to the SQL evaluation algorithm module to obtain an SQL relative advantage score functionScore p The relative dominance score functionScore p The calculation formula of (2) is as follows:
Score p =consistency x%W c ×P C +W m ×P MP T
P C = C SQL to be selected / C SQL to be optimized
P M = M SQL to be selected / M SQL to be optimized
P T = T SQL to be selected / T SQL to be optimized
Wherein the saidW cW m Respectively representing the weights of the total scores of the CPU occupation and the memory occupation, and calculating the relative advantage score of each SQL to be selected according to the adjusted scoring algorithmScore p RemovingScore p The SQL to be selected is 0, the remained SQL to be selected corresponding to the SQL to be optimized and the score thereof are reserved, and an SQL instance relative score data set is obtained, wherein the SQL instance relative score data set is obtained by the methodP CP M AndP T respectively representing the relative advantages of CPU, memory and duration, whereinC SQL to be selected AndM SQL to be selected AndT SQL to be selected Respectively representing the occupied CPU core number, occupied memory size and execution duration of the SQL to be selected, wherein the CPU core number, the occupied memory size and the execution duration are respectivelyC SQL to be optimizedM SQL to be optimized AndT SQL to be optimized The occupied CPU core number and occupied memory size of the SQL to be optimized are respectively represented.
In the automatic SQL optimization method based on the generated pre-training model, provided by the application; the step S2 further includes the steps of:
s23, the SQL to be optimized and the corresponding SQL to be selected in the SQL instance relative scoring data set are encoded by using a generating pre-training module, matrixes encoded by the SQL instance relative scoring data set and the SQL to be selected are spliced into an integral matrix according to columns, the integral matrix is used as an independent variable sample input of a deep neural network algorithm, and the SQL to be selected relative advantage score is obtainedScoreAs a dependent variable sample input, the SQL relative dominance score is calculated due to the candidateScoreFor the continuity variable, the loss function score is set to beMSEThe saidMSEThe calculation method is as follows:
wherein , and />Respectively the firstiActual values of scores and model predictions for the pieces of sample data,nis the number of samples; and after training is completed, the model is stored as an SQL relative efficiency evaluation model to be called.
In the automatic SQL optimization method based on the generated pre-training model, provided by the application; the step S3 includes the steps of:
s31, after the training iteration of the generated pre-training model and the SQL relative efficiency evaluation model is completed, the user side develops a sentence of SQL to be optimized, transmits the SQL to be optimized to a simulation test environment, feeds back error reporting to the user side if the SQL cannot be executed, writes the SQL into a historical SQL instance library if the SQL cannot be executed, and simultaneously transmits the SQL to the SQL cleaning module;
s32, the cleaned SQL is transmitted to a generating type pre-training model to generate one or more pieces of SQL to be selected.
In the automatic SQL optimization method based on the generated pre-training model, provided by the application; the step S3 further includes the steps of:
s33, transmitting the obtained SQL to be selected and the SQL to be optimized corresponding to the SQL relative efficiency evaluation model simultaneously to obtain a relative advantage scoreScore p
S34, if the relative dominance scoreScore p If the relative dominance score is larger than 1 or the SQL cannot be normally executed in the simulation environment, the SQL is considered to be generated as an incorrect SQL or the SQL with lower efficiency, the step S1 of the generation type pre-training model and the step S3 of the SQL relative efficiency evaluation model are required to be repeatedly executed, and the SQL is continuously generated and screened until a screening result or the user side is forcedly stopped.
The automatic SQL optimization method based on the generated pre-training model is used for improving SQL execution efficiency and reducing SQL optimization thresholds by combining the generated pre-training model and the deep learning SQL relative efficiency evaluation model. Meanwhile, the diversity of the SQL statement generation to be selected is improved, and the SQL performance and accuracy are improved by constructing a special evaluation function and combining fine adjustment with a generation model, so that the requirements of enterprise-level application programs are better met, and more accurate, special and convenient SQL optimization service is provided for users.
Drawings
FIG. 1 is a flow diagram of an embodiment of an automated SQL optimization method of the application based on a generative pre-training model.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
FIG. 1 is a flow chart of an embodiment of an automated SQL optimization method of the application based on a generative pre-training model. The automatic SQL optimization method based on the generated pre-training model comprises the following steps:
in step S1, storing a large number of SQL which can be executed in a simulation test environment into a historical SQL instance library to form an SQL instance set to be optimized, transmitting the SQL instance set to be optimized to a generated pre-training model in batches to generate a plurality of SQL instance sets to be selected, and merging the generated SQL instance set to be optimized and the SQL instance set to be optimized into a SQL data set to be selected;
the method aims at completing iteration of the generated pre-training model, can determine whether iteration is needed according to the increase of data quantity in the historical SQL instance library, and aims at fine tuning the generated pre-training model to ensure the executable performance of the generated SQL and the consistency of the generated SQL and the SQL to be optimized.
In step S2, constructing an SQL evaluation model through a deep neural network algorithm, performing prediction evaluation on a to-be-selected SQL data set generated based on the generated pre-training model according to the SQL evaluation model, and screening the to-be-optimized SQL;
the method has the advantages that the generated pre-training model is used for generating the SQL to be selected, but the SQL to be selected needs to be evaluated, and a relatively comprehensive relative evaluation model needs to be established, namely, the performance of the SQL to be selected is improved compared with that of the SQL to be optimized, so that resource waste caused by executing the SQL to be selected on a simulation or production environment each time can be avoided.
S3, constructing calling logic of the user terminal based on the generated pre-training model, the SQL evaluation model and the simulation test environment to perform automatic SQL optimization.
Specifically, after the training iteration of the generated pre-training model and the SQL relative efficiency evaluation model is completed, an automatic SQL optimization tool which can be called by the user side is formed.
In one embodiment, step S1 includes the steps of:
in step S11, SQL is transmitted to an SQL cleaning module to clean irregular content and a plurality of sentences to form an SQL instance set to be optimized;
in step S12, the to-be-selected SQL dataset is put into the simulation test environment for execution, each to-be-selected SQL execution result is recorded and the corresponding to-be-optimized SQL execution result is recorded, whether the to-be-selected SQL execution result set is consistent with the corresponding to-be-optimized SQL execution result set is judged, if the to-be-selected SQL execution result set contains the to-be-optimized SQL execution result set, the to-be-selected SQL execution result set is expressed as 1, if not, the to-be-selected SQL execution result set is expressed as 0, wherein the execution duration of each SQL isTOccupy CPU core number asCThe occupied memory size isM
In an embodiment, step S1 further includes the steps of:
in step S13, scoring the execution result of the selected SQL data set according to the SQL evaluation algorithm moduleCalculating to obtain SQL scoring functionScore o The calculation formula of the SQL scoring function is as follows:
Score o consistency× (wc×c+wm×m) ×t, where Scoreo is an absolute dominance score of SQL, wc, wm represent weights of the total score for CPU occupancy and memory occupancy, respectively, and wc+wm=1. In particularW cW m And setting according to the tension degree of two types of resources, namely the CPU and the memory.
In an embodiment, step S1 further includes the steps of:
in step S14, collecting the SQL data set to be selected after grading by the SQL algorithm grading module into an SQL instance grading data set, screening the SQL with grading larger than 0 as an iterative data set, sending the SQL to be optimized and the corresponding SQL to be selected into an SQL diversification module, analyzing and replacing library names, table names and field names in the SQL to be optimized and the SQL to be selected, merging the replaced SQL and the iterative data set into a diversified SQL instance grading data set, and finally, recording the diversified SQL instance grading data set into a generated pre-training model for fine adjustment, and selecting the paired SQL to be optimized and the SQL to be selected for supervised fine adjustment; after finishing the fine tuning iteration, saving the generated pre-training model for calling; when the historical SQL instance library is increased to a predetermined level, the steps S11-S14 are continuously executed to fine tune the iterative generation type pre-training model again.
In one embodiment, step S2 includes the steps of:
in step S21, after the execution of the to-be-selected SQL instance set in the simulation test environment, recording each to-be-selected SQL execution result and the corresponding to-be-optimized SQL execution result, judging whether the to-be-selected SQL execution result set is consistent with the corresponding to-be-optimized SQL execution result set, if the to-be-selected SQL execution result set contains the to-be-optimized SQL execution result set, the to-be-selected SQL execution result set is expressed as 1, and if not, the to-be-selected SQL execution result set is expressed as 0, wherein the execution duration of each SQL isTOccupy CPU core number asCThe occupied memory size isM
In step S22, score calculation is performed on the execution result of the selected SQL data set according to the SQL evaluation algorithm module to obtain an SQL relative advantage scoreFunction ofScore p Relative dominance score functionScore p The calculation formula of (2) is as follows:
Score p =consistency x%W c ×P C +W m ×P MP T
P C = C SQL to be selected / C SQL to be optimized
P M = M SQL to be selected / M SQL to be optimized
P T = T SQL to be selected / T SQL to be optimized
wherein ,W cW m respectively representing the weights of the total scores of the CPU occupation and the memory occupation, and calculating the relative advantage score of each SQL to be selected according to the adjusted scoring algorithmScore p RemovingScore p The SQL to be selected is 0, the remained SQL to be selected corresponding to the SQL to be optimized and the score thereof are reserved, the SQL instance relative score data set is obtained,P CP M andP T respectively represents the relative advantages of the CPU, the relative advantages of the memory and the duration,C SQL to be selected AndM SQL to be selected AndT SQL to be selected Respectively representing the occupied CPU core number, occupied memory size and execution duration of the SQL to be selected,C SQL to be optimizedM SQL to be optimized AndT SQL to be optimized The occupied CPU core number and occupied memory size of the SQL to be optimized are respectively represented.
In an embodiment, step S2 further includes the steps of:
in step S23, the to-be-optimized SQL and the corresponding to-be-selected SQL in the SQL instance relative scoring dataset are encoded by using the generating pre-training module ebadd, and then the matrices encoded by the two are spliced into an overall matrix according to columns, and the overall matrix is used as a depthIndependent variable sample input of neural network algorithm, and SQL relative advantage score to be selectedScoreAs a dependent variable sample input, the SQL relative dominance score is calculated due to the candidateScoreFor the continuity variable, the loss function score is set to beMSEMSEThe calculation method is as follows:
wherein , and />Respectively the firstiActual values of scores and model predictions for the pieces of sample data,nis the number of samples; and after training is completed, the model is stored as an SQL relative efficiency evaluation model to be called.
In one embodiment, step S3 includes the steps of:
in step S31, after the training iteration of the generated pre-training model and the SQL relative efficiency evaluation model is completed, the user develops a sentence of SQL to be optimized, and transmits the SQL to be optimized to the simulation test environment, if the SQL cannot be executed, the error reporting feedback is performed to the user, if the SQL cannot be executed, the SQL is written into the historical SQL instance library, and meanwhile, the SQL cleaning module is transmitted;
in step S32, the cleaned SQL is transmitted to a generated pre-training model, and one or more candidate SQLs are generated.
In an embodiment, step S3 further includes the steps of:
in step S33, the obtained candidate SQL and the corresponding to-be-optimized SQL are simultaneously transmitted to the SQL relative efficiency evaluation model to obtain a relative dominance scoreScore p
In step S34, if the relative dominance score isScore p If the SQL is smaller than 1, the SQL to be selected is indicated to have a lifting space in efficiency, then the SQL to be selected is executed on a simulation environment, and if the SQL to be selected can be executed, the SQL to be selected is indicated to be executed correctly and the SQL to be executed with relative efficiency is indicated to be executed through the SQL to be selectedIf the relative dominance score is larger than 1 or the SQL statement cannot be normally executed in the simulation environment, the higher SQL statement is considered to be generated into an incorrect SQL or a SQL with lower efficiency, the step S1 of generating the pre-training model and the step S3 of evaluating the SQL relative efficiency are required to be repeatedly executed, and the SQL generation and screening are continued until the screening result or the user side is forcedly stopped.
It should be noted that the selection of the neural network type includes, but is not limited to, deep learning neural network structures such as LSTM, CNN, RNN, etc. After training, the model is saved as an SQL relative efficiency evaluation model for calling. The encoding method is an open source encoding method, character strings with semantic relations are encoded into a two-dimensional matrix which can be used for mathematical computation based on a generated pre-training model, and SQL with logical relations is encoded into the two-dimensional matrix to facilitate subsequent computation.
Before cleaning irregular content and a plurality of sentences to form an SQL instance set to be optimized, firstly, a simulation test environment is required to be constructed, the requirements of the simulation test environment are the same as the database and the data table structure of the production environment, and only the difference of the cluster size exists. A number of SQLs that can be executed in a simulation test environment are stored in a historical SQL instance library. And then cleaning SQL in the instance library through an SQL cleaning module, specifically cleaning blank, annotating and other non-execution characters, deleting unsuitable line feed and the like to form the SQL to be optimized.
And collecting the instance data scored by the SQL algorithm scoring module into an SQL instance scoring data set, and then screening SQL with score greater than 0 as an iteration data set. And then, the SQL to be optimized and the corresponding SQL to be selected are sent to an SQL diversification module, library names, table names and field names in the SQL to be optimized and the SQL to be selected are analyzed and replaced, wherein the replaced names have no practical meaning and even can be random, but the replaced names of the SQL to be optimized and the SQL to be selected are consistent with the content, the replaced SQL and the iterative data set are combined into a diversified SQL instance scoring data set, the operations can be repeated for a plurality of times, so that the diversification of the SQL is enriched, and the tendency of the generated pre-training model to replace the library names, the table names and the field names is reduced. And finally, recording the grading data set of the diversified SQL instance into a generated pre-training model, performing model fine adjustment, and selecting the SQL to be optimized and the SQL to be selected to pair and record for supervised fine adjustment. And after the fine tuning iteration is completed, the generated pre-training model is saved for calling. When the historical SQL instance library is increased to a certain level, the steps S1-S4 are continuously executed, and the iterative generation type pre-training model is fine-tuned again.
The application is further illustrated by the following examples:
50 ten thousand SQL (structured query language) which can be successfully executed in a simulation test environment are extracted from a historical SQL instance library, the instance is in information protection, the used SQL table, library, column names and execution efficiency data are all desensitized examples, only a first SQL is used as a step description instance, and the SQL instance is shown in the following table 1:
TABLE 1
The method comprises the steps of sending SQL into an SQL cleaning module, cleaning irregular content and conditions of a plurality of sentences to form a standard SQL instance set, sending cleaned sentences into a generating type pre-training model, generating a plurality of alternatives for each sentence of SQL, and taking the SQL to be optimized as the alternatives to form the alternative SQL instance set, wherein the alternatives are shown in the following table 2:
TABLE 2
Executing the SQL to be selected in the simulation test environment respectively, and calculating absolute advantage scores according to the functions in Step1Score o Where CPU occupies weightW c And memory occupancy weightW m We calculated 0.5 each and the results are shown in table 3 below:
TABLE 3 Table 3
Screening the executable SQL with the score greater than 0 as a diversification template, diversifying the library names, the table names and the field names, merging the diversified executable SQL with the original data after diversification (Torons diversification can be carried out according to the data quantity), and obtaining the results shown in the following table 4:
TABLE 4 Table 4
And sending the diversified SQL to be optimized and the corresponding SQL to be selected into the generated pre-training model for supervised fine adjustment, so as to obtain the readiness of the fine-adjusted generated pre-training model.
Training the relative efficiency assessment model may begin with a set of selected SQL instances as shown in Table 5 below:
TABLE 5
Therein, whereinScore p The relative dominance score needs to be recalculated according to the calculation mode of Step2, firstly, the SQL to be selected with the consistency of the execution result of 0 is screened out, then the relative score is calculated, and the result after calculation is shown in the following table 6:
TABLE 6
Then the SQL to be optimized and the SQL to be selected are respectively encoded by a generating pre-training model emmbedding, the SQL to be optimized and the SQL to be selected are spliced into a whole by columns to be used as independent variables,Score p and inputting a deep learning neural network model as a dependent variable, and obtaining an SQL relative efficiency evaluation model for standby call after training is completed.
When a user submits any sentence of SQL, for example, 'select a.id, sum (drl _mill) drl _mill from msp_dteach. Dteach_ dmh _d_ ve a left join ods _tstudent. Tstudent_bdp_d_ve_xe b a.id=b.id where a.bus_brand_code=' 1 'and b.state_date > =' 20230501 'and b.state_date < =' 20230505 'and a.id=' DGMT38R68GH331100 'group by a.id', according to the application procedure, respectively, the simulation environment test, SQL cleansing is performed, and a generated pre-training model is used to generate the SQL to be selected, where the generated pre-training model is shown in table 7 below:
TABLE 7
And (3) passing the SQL to be optimized and the SQL to be selected through a generation type pre-training model enabling the SQL to be optimized to be subjected to emplacement, then calling an SQL relative efficiency evaluation model to score the SQL to be selected, screening out the SQL with the score being more than 0 and less than 1 as dominant SQL, respectively putting the SQL to be subjected to simulation test environment execution, returning the executable optimal-potential SQL, namely the SQL with the lowest score, to a user side, and if the SQL with the lowest score is not dominant, repeating the steps of generating, scoring and screening. Compared with the conventional regular or simple machine learning and deep learning reconstruction SQL scheme, the method provided by the application has the advantages of greatly innovating the aspects of applicability, accuracy and specialty, completely providing an automatic SQL optimization tool based on a generated pre-training model, improving SQL execution efficiency and reducing SQL development thresholds. Meanwhile, the execution efficiency of SQL is improved by 0-3000 times, and the average efficiency is improved by more than 30%; the efficiency of developers is improved by more than 20%, the original SQL with high writing efficiency is changed into the SQL which only needs to be written to execute the expected result, and the optimization of the SQL is submitted to the tool of the application. If one company needs 100 servers as calculation engines and 60 developers to synthesize the two calculation, a lot of expenses can be saved in one year, and the cost is greatly reduced.
The automatic SQL optimization method based on the generated pre-training model has the advantages that:
1. the more generalized SQL generating capability is that a large amount of network public data set codes are generated based on the SQL of the generating pre-training model for training, compared with the prior art, SQL grammar tree analysis is not needed, SQL sentences with similar execution results and rich contents can be generated without analyzing SQL, and meanwhile, the directional fine adjustment can be performed based on own data, and the method is not limited by the type of a database or a calculation engine.
2. The application constructs a more comprehensive evaluation function, which covers the factors of the SQL merits such as the consistency of results, the occupation of CPU, the occupation of memory, the execution time length and the like, and carries out comprehensive evaluation scoring on the SQL, and the traditional scheme only takes the execution time length as the factor of the SQL evaluation, which is incomplete or even wrong; besides, the application also provides a relative performance evaluation scheme so as to facilitate SQL optimization screening.
3. The method has the advantages that the relative scores of the SQL to be optimized and the SQL to be selected are predicted by constructing the deep neural network, the SQL to be selected is screened, the relation between the SQL to be optimized and the SQL to be selected in the resource occupation is more accurately and effectively reflected, the waste of resources in execution is reduced, the reconstructed SQL performance cannot be estimated by the existing scheme, the SQL performance is required to be directly executed in a simulation test environment or a production environment, and therefore a large amount of resources are wasted and the risk of crushing a server cluster exists.
4. If the table structure or the query content of the same SQL sentence is not changed, the execution efficiency of the SQL sentence in the same environment is similar, and the application provides the capability of effectively regulating the generated pre-training model by replacing the SQL diversity of library, table and column names based on the SQL.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
Therefore, the above description is only a preferred embodiment of the present application, and the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application, which is defined by the claims.

Claims (6)

1. An automated SQL optimization method based on a generated pre-training model is characterized by comprising the following steps:
s1, storing a large number of SQL which can be executed in a simulation test environment into a historical SQL instance library to form an SQL instance set to be optimized, transmitting the SQL instance set to be optimized to a generated pre-training model in batches to generate a plurality of SQL instance sets to be selected, and combining the generated SQL instance set to be optimized and the SQL instance set to be optimized into an SQL data set to be selected;
s2, constructing an SQL evaluation model through a deep neural network algorithm, performing prediction evaluation on a to-be-selected SQL data set generated based on the generated pre-training model according to the SQL evaluation model, and screening the to-be-optimized SQL;
the step S2 includes the steps of:
s21, after the simulation test environment is executed by the to-be-selected SQL instance set, recording each to-be-selected SQL execution result and the corresponding to-be-optimized SQL execution result, judging whether the to-be-selected SQL execution result set is consistent with the corresponding to-be-optimized SQL execution result set, if the to-be-selected SQL execution result set contains the to-be-optimized SQL execution result set, the to-be-selected SQL execution result set is expressed as 1, and if the to-be-selected SQL execution result set does not contain the to-be-optimized SQL execution result set, the to-be-selected SQL execution result set is expressed as 0, wherein the execution duration of each SQL isTOccupy CPU core number asCThe occupied memory size isM
S22, performing score calculation on the execution result of the SQL data set to be selected according to the SQL evaluation algorithm module to obtain an SQL relative advantage score functionScore p The relative dominance score functionScore p The calculation formula of (2) is as follows:
Score p =consistency x%W c ×P C +W m ×P MP T
P C = C SQL to be selected / C SQL to be optimized
P M = M SQL to be selected / M SQL to be optimized
P T = T SQL to be selected / T SQL to be optimized
Wherein the saidW cW m Respectively representing the weights of the total scores of the CPU occupation and the memory occupation, and calculating the relative advantage score of each SQL to be selected according to the adjusted scoring algorithmScore p RemovingScore p The SQL to be selected is 0, the remained SQL to be selected corresponding to the SQL to be optimized and the score thereof are reserved, and an SQL instance relative score data set is obtained, wherein the SQL instance relative score data set is obtained by the methodP CP M AndP T respectively representing the relative advantages of CPU, memory and duration, whereinC SQL to be selected AndM SQL to be selected AndT SQL to be selected Respectively representing the occupied CPU core number, occupied memory size and execution duration of the SQL to be selected, wherein the CPU core number, the occupied memory size and the execution duration are respectivelyC SQL to be optimizedM SQL to be optimized AndT SQL to be optimized Respectively representing the number of occupied CPU cores and the occupied memory size of SQL to be optimized;
the step S2 further includes the steps of:
s23, the SQL to be optimized and the corresponding SQL to be selected in the SQL instance relative scoring data set are encoded by using a generating pre-training module, matrixes encoded by the SQL instance relative scoring data set and the SQL to be selected are spliced into an integral matrix according to columns, the integral matrix is used as an independent variable sample input of a deep neural network algorithm, and the SQL to be selected relative advantage score is obtainedScoreAs a dependent variable sample input, the SQL relative dominance score is calculated due to the candidateScoreFor the continuity variable, the loss function score is set to beMSEThe saidMSEThe calculation method is as follows:
wherein , and />Respectively the firstiActual values of scores and model predictions for the pieces of sample data,nis the number of samples; after training is completed, the model is stored as an SQL relative efficiency evaluation model to be called;
s3, constructing calling logic of the user terminal based on the generated pre-training model, the SQL evaluation model and the simulation test environment to perform automatic SQL optimization.
2. The automated SQL optimization method based on the generated pre-training model according to claim 1, wherein the step S1 comprises the steps of:
s11, SQL is transmitted to an SQL cleaning module, irregular content and multiple sentences are cleaned, and an SQL instance set to be optimized is formed;
s12, placing the SQL data set to be selected into a simulation test environment for execution, recording each SQL execution result to be selected and the corresponding SQL execution result to be optimized, judging whether the SQL execution result set to be selected is consistent with the SQL execution result set to be optimized or not, and if the SQL execution result set to be selected contains the SQL execution to be optimizedThe result set of (1) is represented as 1, and the result set of (0) is represented as not including, wherein the execution duration of each SQL isTOccupy CPU core number asCThe occupied memory size isM
3. The automated SQL optimization method based on the generated pre-training model according to claim 2, wherein the step S1 further comprises the steps of:
s13, performing score calculation on the execution result of the SQL data set to be selected according to the SQL evaluation algorithm module to obtain an SQL scoring functionScore o The calculation formula of the SQL scoring function is as follows:
Score o consistency x (Wc x C + Wm x M) x T, wherein,Score o for the absolute dominance score of SQL Wc, wm represent the weights of the total score for CPU occupancy and memory occupancy, respectively, and wc+wm=1.
4. The automated SQL optimization method based on the generated pre-training model according to claim 3, wherein the step S1 further comprises the steps of:
s14, collecting the SQL data set to be selected after grading by the SQL algorithm grading module into an SQL instance grading data set, screening SQL with grading larger than 0 as an iterative data set, sending the SQL to be optimized and the corresponding SQL to be selected into an SQL diversification module, analyzing and replacing library names, table names and field names in the SQL to be optimized and the SQL to be selected, merging the replaced SQL and the iterative data set into a diversified SQL instance grading data set, and finally, recording the diversified SQL instance grading data set into a generated pre-training model for fine adjustment, and selecting paired SQL to be optimized and SQL to be selected for supervised fine adjustment; after finishing the fine tuning iteration, saving the generated pre-training model for calling; when the historical SQL instance library is increased to a predetermined level, the steps S11-S14 are continuously executed to fine tune the iterative generation type pre-training model again.
5. The automated SQL optimization method based on the generated pre-training model according to claim 4, wherein the step S3 comprises the steps of:
s31, after the training iteration of the generated pre-training model and the SQL relative efficiency evaluation model is completed, the user side develops a sentence of SQL to be optimized, transmits the SQL to be optimized to a simulation test environment, feeds back error reporting to the user side if the SQL cannot be executed, writes the SQL into a historical SQL instance library if the SQL cannot be executed, and simultaneously transmits the SQL to the SQL cleaning module;
s32, the cleaned SQL is transmitted to a generating type pre-training model to generate one or more pieces of SQL to be selected.
6. The automated SQL optimization method based on the generated pre-training model according to claim 5, wherein the step S3 further comprises the steps of:
s33, transmitting the obtained SQL to be selected and the SQL to be optimized corresponding to the SQL relative efficiency evaluation model simultaneously to obtain a relative advantage scoreScore p
S34, if the relative dominance scoreScore p If the relative dominance score is larger than 1 or the SQL cannot be normally executed in the simulation environment, the SQL is considered to be generated as an incorrect SQL or the SQL with lower efficiency, the step S1 of the generation type pre-training model and the step S3 of the SQL relative efficiency evaluation model are required to be repeatedly executed, and the SQL is continuously generated and screened until a screening result or the user side is forcedly stopped.
CN202310760270.1A 2023-06-27 2023-06-27 Automatic SQL optimization method based on generated pre-training model Active CN116501764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310760270.1A CN116501764B (en) 2023-06-27 2023-06-27 Automatic SQL optimization method based on generated pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310760270.1A CN116501764B (en) 2023-06-27 2023-06-27 Automatic SQL optimization method based on generated pre-training model

Publications (2)

Publication Number Publication Date
CN116501764A CN116501764A (en) 2023-07-28
CN116501764B true CN116501764B (en) 2023-08-25

Family

ID=87320559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310760270.1A Active CN116501764B (en) 2023-06-27 2023-06-27 Automatic SQL optimization method based on generated pre-training model

Country Status (1)

Country Link
CN (1) CN116501764B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598524A (en) * 2014-12-23 2015-05-06 苏州博远容天信息科技有限公司 SQL and SERVER database cluster multiple-instance internal storage management and distribution method
CN113590647A (en) * 2021-07-29 2021-11-02 中国联合网络通信集团有限公司 SQL statement optimization method, device, equipment, storage medium and product

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477731B2 (en) * 2013-10-01 2016-10-25 Cloudera, Inc. Background format optimization for enhanced SQL-like queries in Hadoop
US10810229B2 (en) * 2018-10-19 2020-10-20 Oracle International Corporation Database replication based on data access scores

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598524A (en) * 2014-12-23 2015-05-06 苏州博远容天信息科技有限公司 SQL and SERVER database cluster multiple-instance internal storage management and distribution method
CN113590647A (en) * 2021-07-29 2021-11-02 中国联合网络通信集团有限公司 SQL statement optimization method, device, equipment, storage medium and product

Also Published As

Publication number Publication date
CN116501764A (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN110502361B (en) Fine granularity defect positioning method for bug report
KR102302609B1 (en) Neural Network Architecture Optimization
EP3846034A1 (en) Systems and methods for automated testing using artificial intelligence techniques
CN111368096A (en) Knowledge graph-based information analysis method, device, equipment and storage medium
CN106951231A (en) A kind of computer software development approach and device
US8255423B2 (en) Adaptive random trees integer non-linear programming
CN116522912B (en) Training method, device, medium and equipment for package design language model
CN113763031A (en) Commodity recommendation method and device, electronic equipment and storage medium
CN116501764B (en) Automatic SQL optimization method based on generated pre-training model
CN115830419A (en) Data-driven artificial intelligence technology evaluation system and method
CN114564968A (en) Intention recognition method, system, terminal and medium based on man-machine conversation
CN112181951B (en) Heterogeneous database data migration method, device and equipment
CN113448860A (en) Test case analysis method and device
CN113297206A (en) Multi-table connection processing method, device, equipment and machine-readable storage medium
CN116976294B (en) Method and system for realizing automatic filling of complex electronic forms
CN111160662A (en) Risk prediction method, electronic equipment and storage medium
CN114493379B (en) Enterprise evaluation model automatic generation method, device and system based on government affair data
CN117556264B (en) Training method and device for evaluation model and electronic equipment
US20240086768A1 (en) Learning device, inference device, non-transitory computer-readable medium, learning method, and inference method
Twigg et al. Predicting Formal Verification Resource Needs (Computation Time and Memory) through Machine Learning
CN117763099A (en) Interaction method and device of intelligent customer service system
CN117371950A (en) Robot flow automation method, device, all-in-one machine and storage medium
Koromyslova FEATURES SELECTION OF EVOLUTIONARY ALGORITHMS IN TEXT CLASSIFICATION PROBLEMS
CN117331535A (en) Optimized intelligent customer service quality inspection robot development method
CN115908006A (en) Financial product recommendation method, system, equipment and medium based on decision tree

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant