CN112667666A

CN112667666A - SQL operation time prediction method and system based on N-gram

Info

Publication number: CN112667666A
Application number: CN202011643906.7A
Authority: CN
Inventors: 李振; 张刚; 鲍东岳; 宋璞; 尹正; 李千惠; 张晨星; 吕亚波; 苑云飞; 马圣楠; 邸璇
Original assignee: Minsheng Science And Technology Co ltd
Current assignee: Minsheng Science And Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-16

Abstract

An N-gram-based SQL operation time prediction method and system, relating to the technical field of databases. The method comprises the following steps: S1: acquiring SQL statements together with the execution plan and operation time corresponding to each statement to form a data set; S2: preprocessing the data set to obtain a sample set; S3: constructing a corpus from the sample set based on the N-gram; S4: extracting features from the corpus and reducing the dimension of the extracted features; S5: based on the dimension-reduced features, completing the SQL operation time prediction with a Bayesian prediction model. The invention embeds an SQL operation time prediction function on top of the basic functions of organizing, storing and managing data in a database and of SQL review, and provides reference and decision support for database administrators and database users.

Description

SQL operation time prediction method and system based on N-gram

Technical Field

The invention relates to the technical field of databases, in particular to a method and a system for predicting SQL operation time based on N-gram.

Background

With the rapid development and wide application of big data and artificial intelligence, the volume of data generated across industries is growing exponentially. Big data is characterized by large volume, high velocity, great variety, high value and veracity, so the security, stability, integrity and convenience of data storage face higher requirements and challenges.

In database operation and maintenance, the quality of SQL execution performance directly affects the operation of the whole system. SQL with latent performance problems not only runs slowly itself but also occupies a large amount of computing resources and degrades the efficiency of the whole data system. Database performance can be improved by 60% or even more; since every operation an application performs on the database is ultimately expressed as SQL statements, operations that are unaudited, or audited under an imperfect mechanism, become a risk point once the code goes online. Manual review suffers from bottlenecks in developer and DBA manpower, and it is difficult to enforce standards and regulatory constraints. Even where a complete manual review mechanism exists and can give effective alarms and optimization suggestions for SQL, the repeated cycle of review, optimization and re-review consumes a large amount of manpower and affects the timeliness of production projects.

Therefore, it is very important to use AI technology to make the SQL auditing process standardized, tool-based, automated and intelligent. An intelligent SQL auditing system can standardize SQL syntax and the go-live process, provide database administrators with a standardized auditing workflow and reduce DBA labor cost; it can also ensure that DQL, DDL and DML run accurately and efficiently, improving the efficiency of daily data management, maintenance and operation.

Disclosure of Invention

In view of the above, the invention provides an SQL operation time prediction method and system based on an N-gram, and the system embeds an SQL operation efficiency prediction function on the basis of related basic functions of database organization, data storage and management and SQL examination, and provides reference and decision support for database administrators and database users.

The SQL operation efficiency prediction in the system comprehensively considers factors such as execution plans, binding variables, historical data operation efficiency, code habits, audit opinions, server and database configuration and the like by applying technologies such as machine learning, natural language processing and the like, predicts the SQL operation time and provides optimization opinions and suggestions.

In order to achieve the purpose, the invention adopts the following technical scheme:

According to a first aspect of the present invention, there is provided an SQL runtime prediction method based on N-gram, the method comprising the following steps:

S1: acquiring an SQL statement and the execution plan and operation time corresponding to the SQL statement to form a data set;

S2: preprocessing the data set to obtain a sample set;

S3: constructing a corpus from the sample set based on the N-gram;

S4: extracting features from the corpus and reducing the dimension of the extracted features;

S5: based on the dimension-reduced features, completing the SQL operation time prediction with a Bayesian prediction model.

Further, the S2 specifically includes:

S21: deleting or filling missing fields in the data set, and removing outliers in the data set to obtain a complete data set;

S22: performing word segmentation on the SQL statements in the complete data set to obtain a preliminary sample set;

S23: based on the underlying execution logic of SQL, adding to the preliminary sample set other fields that influence the running time of the SQL statement, to complete the construction of the sample set.

Further, the S22 includes:

adding a space before and after each symbol in the SQL statement, and segmenting the SQL statement with the space as the separator.

Further, the S3 specifically includes:

S31: restoring the tables and fields that have aliases in the sample set;

S32: de-duplicating the restored sample set to obtain a preliminary corpus;

S33: expanding the preliminary corpus based on the N-gram to complete the construction of the corpus.

Further, the step S33 of expanding the preliminary corpus based on the N-gram specifically includes:

performing word-granularity N-gram processing on the corpus: sliding a window of 2 words over the SQL statements in the preliminary corpus to form a plurality of byte fragment sequences, and adding the byte fragment sequences to the corpus;

performing character-level N-gram processing on the corpus: splitting the words of the SQL statements in the preliminary corpus on the character '_' to form a plurality of character fragment sequences, and adding the character fragment sequences to the corpus.

Further, the S4 specifically includes:

S41: vectorizing the corpus with the TF-IDF algorithm and the word2vec algorithm respectively to obtain preliminary features;

S42: feeding the preliminary features into a textCNN model for training to obtain deep features;

S43: feeding the deep features into a variational autoencoder to perform the dimensionality reduction.

Further, the method can complete the optimization of the prediction result by updating the data set.

According to a second aspect of the present invention, there is provided an N-gram based SQL runtime prediction system, comprising:

the data acquisition module is used for acquiring SQL sentences and operation information in the system to form a data set;

the data processing module is used for preprocessing the data set to obtain a sample set;

a corpus construction module, which is used for constructing a corpus by utilizing the sample set based on the N-gram;

the characteristic extraction module is used for extracting characteristics from the corpus and reducing the dimension of the extracted characteristics;

and the efficiency prediction module is used for completing SQL operation time prediction by utilizing a Bayesian prediction model based on the characteristics after dimension reduction.

According to a third aspect of the present invention, a computer-readable storage medium is provided, having a computer program stored thereon, characterized in that the computer program, when being executed by a processor, is adapted to carry out the steps of the above-mentioned method.

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the program.

Compared with the prior art, the SQL runtime prediction method and the SQL runtime prediction system based on the N-gram have the following advantages:

on the premise of guaranteeing the specification of the database and the SQL sentences, the SQL performance is quantitatively analyzed, factors such as execution plans, binding variables, historical data operation efficiency, code habits, audit opinions, server and database configuration and the like are comprehensively considered, the SQL operation time is predicted, and reference and decision basis is provided for developers and database managers. Meanwhile, the invention provides a self-learning and tracking mechanism, and the model and related parameters are updated according to a certain period so as to ensure the timeliness and accuracy of the model.

Meanwhile, the invention also has the following innovation:

(1) on the basis of traditional expert experience-based auditing and evaluation, the SQL runtime prediction system constructed by the invention adopts technologies such as machine learning, TF-IDF and textCNN to predict the SQL runtime, comprehensively considers factors such as execution plans, binding variables, historical data operation efficiency, code habits, auditing opinions and server and database configuration, and brings the factors into a model training process as derivative features, thereby continuously improving the model performance.

(2) The invention applies the N-gram to the SQL examination processing, considers the particularity of the SQL language, solves the problem of inaccurate word segmentation of the traditional SQL, improves the similarity between similar SQL, reduces the similarity between different SQL, and identifies the conditions of different tables and columns in the same library and the same table by special processing of library names, table names and column names, thereby improving the accuracy of model prediction.

(3) The invention introduces a self-learning and tracking mechanism. On the one hand, the training data can be updated regularly so that the model and its parameters are optimized. On the other hand, the method supports developers in manually optimizing SQL, compares the SQL before and after optimization, and learns the optimization method, so that SQL can be optimized automatically in the future.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of the SQL runtime prediction method of the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

"A plurality" means two or more.

"And/or": the term "and/or" as used in this disclosure merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent: A exists alone, A and B exist simultaneously, or B exists alone.

S1: Data preparation. The data used by the invention comes from stock data in the system, from production test data and online data generated according to the requirements of different modules and business promotion, and also covers historical data from version iterations. All SQL statements, historical running times, execution plans and other information are collated to form a complete data set.

S2: Data preprocessing. The data quality of the data set is explored, and data cleaning, data integration, word segmentation, feature derivation and other processing are performed so that the data can subsequently be used in model construction.

S2.1: Handling missing values and outliers. According to the type and attributes of each field, fields with a high missing rate are deleted, and missing values in the other fields are filled with the mode, the mean or 0. Outliers in the data are culled. SQL operation time is affected by many conditions, such as software and hardware, and fluctuates to some degree; to keep external factors from interfering with the model, samples are limited by quantile and those with abnormally long actual running times are removed.
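The cleaning rules of S2.1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the field name `runtime`, the 50% missing-rate limit and the 0.95 quantile cut-off are assumptions chosen for the example, since the text does not state concrete values.

```python
import math
from statistics import mean, mode


def clean_dataset(rows, max_missing_rate=0.5, upper_quantile=0.95,
                  runtime_field="runtime"):
    """rows: list of dicts; None marks a missing value (hypothetical layout)."""
    fields = sorted({key for row in rows for key in row})
    # Delete fields whose missing rate is too high.
    kept = [f for f in fields
            if sum(row.get(f) is None for row in rows) / len(rows)
            <= max_missing_rate]
    cleaned = [{f: row.get(f) for f in kept} for row in rows]
    # Fill the remaining gaps: mean for numeric fields, mode otherwise, 0 as fallback.
    for f in kept:
        present = [row[f] for row in cleaned if row[f] is not None]
        if not present:
            fill = 0
        elif all(isinstance(v, (int, float)) for v in present):
            fill = mean(present)
        else:
            fill = mode(present)
        for row in cleaned:
            if row[f] is None:
                row[f] = fill
    # Cull samples whose running time lies above the chosen quantile.
    runtimes = sorted(row[runtime_field] for row in cleaned)
    cutoff = runtimes[max(0, math.ceil(upper_quantile * len(runtimes)) - 1)]
    return [row for row in cleaned if row[runtime_field] <= cutoff]
```

The sketch assumes the running-time field itself is never missing; in practice rows without a measured running time would have to be dropped before this step.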

S2.2: Word segmentation. Owing to the particular, natural-language-like character of SQL, a statement can be segmented directly with the space as the separator. In addition, separators are added around special symbols, for example replacing "(" with " ( " and "=" with " = ", which improves the accuracy and consistency of word segmentation.
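A minimal sketch of this segmentation step is given below. The exact set of symbols that receive surrounding spaces is an assumption (the text names only "(" and "="), and replacing "." with a space follows the example given later in S3.2.1.

```python
import re

# Hypothetical symbol set; the text only names "(" and "=" explicitly.
SYMBOLS = re.compile(r"([(),=<>;*+\-/])")


def tokenize_sql(sql):
    """Pad symbols with spaces, turn '.' into a space, then split on whitespace."""
    spaced = SYMBOLS.sub(r" \1 ", sql)
    spaced = spaced.replace(".", " ")
    return spaced.split()
```

For example, `tokenize_sql("select count(*) from t where a=1")` yields the tokens `select`, `count`, `(`, `*`, `)`, `from`, `t`, `where`, `a`, `=`, `1`. Note that this simple rule also splits decimal literals and two-character operators such as `>=`, which a production tokenizer would need to handle.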

S2.3: Feature derivation. Key factors influencing the execution of an SQL statement are analyzed according to the underlying logic of SQL and added to the model as features. The SQL execution plan can be used to analyze the performance bottleneck of a query statement or a table structure, and core elements of the SQL such as execution order, primary key and index are analyzed from the ID, SELECT_TYPE, TABLE, TYPE, POSSIBLE_KEYS, KEY_LEN, REF, ROWS and Extra columns. In this embodiment the SQL statement serves as the basic data, and related underlying data such as the execution plan and statistical information are also exported and brought into scope, which increases the dimensionality of the data and improves the comprehensiveness of SQL review and the accuracy of model prediction.

S3: constructing language models

S3.1 constructing a primary corpus.

S3.1.1: Restoring tables that have aliases. First, the alias of a table is located through the table name; there are two cases: (a) table alias; (b) table as alias (where the alias is not one of left/join/inner/on/order/...). After the alias is located, every occurrence of the alias in the SQL is replaced with the original table name, eliminating the interference caused by table aliases.

S3.1.2: Restoring fields that have aliases. A column alias is located by the combination "as" + alias + "," (it is difficult to locate without an "as"), and the alias is then deleted from the SQL.

S3.1.3: After all words are pooled and de-duplicated, the preliminary corpus is constructed.
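The alias handling of S3.1.1 and S3.1.2 can be approximated with regular expressions, as in the sketch below. It is an illustration under simplifying assumptions, not the patented implementation: it works on the raw SQL string, the `RESERVED` word list is hypothetical and incomplete, and treating an alias followed by `from` like one followed by "," is an addition so that the last column of a select list is covered.

```python
import re

# Hypothetical keyword list: words that can follow a table name but are not aliases.
RESERVED = {"as", "cross", "full", "group", "having", "inner", "join", "left",
            "limit", "natural", "on", "order", "outer", "right", "union",
            "using", "where"}

ALIAS_DECL = re.compile(
    r"\b(?:from|join)\s+([\w.]+)(?:\s+as\b)?\s+(?!(?:%s)\b)(\w+)"
    % "|".join(sorted(RESERVED)),
    re.IGNORECASE,
)


def restore_table_aliases(sql):
    """Replace 'alias.' with 'table.' and drop the 'table [as] alias' declaration."""
    aliases = {alias: table for table, alias in ALIAS_DECL.findall(sql)}
    for alias, table in aliases.items():
        decl = rf"\b{re.escape(table)}(?:\s+as\b)?\s+{re.escape(alias)}\b"
        sql = re.sub(decl, table, sql, flags=re.IGNORECASE)
        sql = re.sub(rf"\b{re.escape(alias)}\.", table + ".", sql)
    return sql


def drop_column_aliases(sql):
    """Remove 'as alias' when it is followed by ',' (or by FROM, an assumption)."""
    return re.sub(r"\s+as\s+\w+(?=\s*,|\s+from\b)", "", sql, flags=re.IGNORECASE)
```

Table aliases should be restored before column aliases are dropped; otherwise a comma-separated table list such as `from a as x, b as y` would lose its alias declarations before they can be resolved.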

S3.2: updating corpus based on N-gram

N-Gram is an algorithm based on a statistical language model. The basic idea is to perform a sliding window operation with the size of N on the content in the text according to bytes, and form a byte fragment sequence with the length of N.

Each byte segment is called as a gram, the occurrence frequency of all the grams is counted, and filtering is performed according to a preset threshold value to form a key gram list, namely a vector feature space of the text, wherein each gram in the list is a feature vector dimension.

S3.2.1: Word-granularity N-gram. Based on the N-gram principle described above, a word-granularity sliding window with N = 2 is applied to the segmented SQL. For example, after S2.2 segments the SQL with the space as the separator, "group" and "by" are originally 2 words and now form the phrase "group by". Likewise, the "." in r360__user_info.userid was replaced with a space during segmentation, so r360__user_info and userid are 2 words after segmentation and now form the phrase "r360__user_info userid". As these examples show, meaningful phrases in SQL are basically composed of 2 basic words, so N is set to 2.
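The word-granularity window can be sketched in a few lines; the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, n=2):
    """Slide a window of n words over the token list and join each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

With the tokens `select a from t group by a`, the result contains the phrase `group by` alongside the other adjacent pairs, which is the effect described above.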

S3.2.2: Character-level N-gram. Character-level sliding based on the N-gram is applied to the basic words, that is, each basic word is split using "_" as the separator. A basic word in SQL that contains "_" can only be a table name, a column name or an alias. Since all aliases were restored in S3.1, the basic words containing "_" can only be table names and column names. This processing increases the degree of association between tables. For example, r360__user_info and r360__bank_detail were originally two completely unrelated words; now r360__user_info is decomposed into r360/user/info and r360__bank_detail into r360/bank/detail. They share the word r360, and indeed both hold data generated from the r360 database.

S3.2.3: The data processed by the N-gram is rearranged and the corpus is updated.
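Putting S3.2.1 to S3.2.3 together, the corpus update might look like the sketch below. The function names are hypothetical, and sorting the de-duplicated entries is just one possible reading of "rearranged".

```python
def split_identifier(word):
    """Split a table or column name on '_' and drop the empty pieces."""
    return [part for part in word.split("_") if part]


def expand_corpus(statements):
    """statements: list of token lists. Returns words, word bigrams and '_' fragments."""
    corpus = set()
    for tokens in statements:
        corpus.update(tokens)
        corpus.update(" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1))
        for token in tokens:
            if "_" in token:
                corpus.update(split_identifier(token))
    return sorted(corpus)
```

After expansion, `r360__user_info` and `r360__bank_detail` both contribute the shared entry `r360`, which is what links the two tables in the feature space.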

S4: feature scaling and feature encoding.

In order to extract the characteristics of the SQL statement, the invention adopts TF-IDF, word2vec and textCNN to extract the characteristics from different angles so as to obtain more comprehensive and deeper strong correlation characteristics.

S4.1: TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. TF-IDF assesses how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. Here, TF-IDF is computed separately for the SQL statements and the execution plans to vectorize them and obtain feature vectors.

TF-IDF = TF × IDF
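As an illustration, TF-IDF vectors can be computed as in the sketch below. It assumes the common textbook definitions, TF as the term count divided by the document length and IDF as log(N / df) without smoothing; the exact variant used by the invention is not recoverable from the text.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of non-empty token lists. Returns (vocabulary, vectors)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([(counts[t] / len(doc)) * math.log(n_docs / df[t])
                        for t in vocab])
    return vocab, vectors
```

Under this definition a token that appears in every statement, such as `select`, receives weight 0, while table and column names that distinguish statements receive positive weights.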

S4.2: word2vec is a technique for converting words into fixed-dimension vectors. It is well suited to data with strong associations between parts of a sequence, and therefore to SQL statements, whose keywords have strong logical relationships, so word2vec can yield more comprehensive feature vectors. The processing method of word2vec is given below:

1) According to the corpus in S3.2.3, each keyword in the SQL is converted into one-hot format. Assuming the vocabulary size of the corpus is V, the one-hot vector is written $x = (x_1, x_2, \dots, x_V)$, in which only one node $x_k$ is non-zero.

2) The weights between the input layer and the hidden layer can be represented by a $V \times N$ matrix $W$.

Each row of $W$ is the N-dimensional vector representation $v_w$ of the corresponding input-layer word. Since the input data is in one-hot format, the hidden-layer output $h$ is determined entirely by the k-th row of $W$:

$$h = W^{T} x = v_{w_I}$$

where $v_{w_I}$ is the k-th row of $W$ written as a column vector, i.e. the vector of the input word $w_I$.

3) From the hidden layer to the output layer there is a new $V \times N$ weight matrix $W' = \{w'_{ij}\}$, from which the score $u_j$ of each word in the vocabulary can be calculated:

$$u_j = {v'_{w_j}}^{T} h$$

where $v'_{w_j}$ is the j-th row of $W'$ written as a column vector.

A softmax, a log-linear classification model satisfying a multinomial distribution, is used to obtain the posterior probability of the words:

$$p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

where $y_j$ is the output of node j in the output layer. Substituting the hidden-layer output and the word score into the above formula gives the complete formula:

$$p(w_j \mid w_I) = \frac{\exp\left({v'_{w_j}}^{T} v_{w_I}\right)}{\sum_{j'=1}^{V} \exp\left({v'_{w_{j'}}}^{T} v_{w_I}\right)}$$

4) The loss function is defined as $E = -\log p(w_O \mid w_I)$, and the weights are updated by gradient descent.
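The forward pass and one gradient-descent update of steps 1) to 4) can be sketched in plain Python as follows. This is a simplified illustration assuming a single input word and a single target word; both weight matrices are stored as V rows of N values, the loss is taken as the negative log-probability of the target word, and the learning rate is arbitrary.

```python
import math


def forward(W, W_out, k):
    """Return the softmax output y for the input word with index k."""
    h = W[k]  # a one-hot input simply selects row k of W
    u = [sum(a * b for a, b in zip(row, h)) for row in W_out]  # word scores
    top = max(u)  # subtract the maximum for numerical stability
    exp_u = [math.exp(score - top) for score in u]
    total = sum(exp_u)
    return [e / total for e in exp_u]


def sgd_step(W, W_out, k, target, lr=0.1):
    """One gradient-descent update in place; returns the loss before the update."""
    y = forward(W, W_out, k)
    h = list(W[k])
    err = [y_j - (1.0 if j == target else 0.0) for j, y_j in enumerate(y)]
    # The gradient for the hidden layer uses the output weights before they change.
    grad_h = [sum(err[j] * W_out[j][i] for j in range(len(W_out)))
              for i in range(len(h))]
    for j, row in enumerate(W_out):
        for i in range(len(row)):
            row[i] -= lr * err[j] * h[i]
    for i in range(len(h)):
        W[k][i] -= lr * grad_h[i]
    return -math.log(y[target])
```

Repeating `sgd_step` on the same (input, target) pair should lower the loss step by step, which is a quick sanity check that the gradients have the right sign.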

S4.3: compared with other deep learning models in text processing, the textCNN has the advantages of simple network structure, high training speed and high precision. The algorithm flow is described in detail below.

1) Embedding layer. The feature vectors output by TF-IDF and word2vec are used together as the vector representation of the text; the length n of the embedding layer is defined according to statistics of SQL length, and the matrix representation of the embedding layer is M.

2) Convolution and pooling. Adjacent keywords in SQL are highly correlated, and some filter conditions strongly affect SQL execution performance, so the invention uses several convolution kernels {d₁, d₂, d₃} to extract the relations between keywords at different positions. Since convolution kernels of different sizes yield feature maps of different sizes, a pooling function is applied to each feature map so that they have the same dimension; the 1-max pooling method is used in the present invention.

The key point of this part is to extract the association information between keywords in the SQL statement, so once the error has been minimized, the output of the last pooling layer is taken as the feature of the SQL statement.
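The convolution and 1-max pooling step can be illustrated without a deep learning framework. The sketch below assumes one filter per kernel height, a ReLU activation and no bias term, none of which is specified in the text; a real textCNN would learn many filters per height.

```python
def conv_max_pool(M, kernel):
    """M: n x d embedding matrix; kernel: h x d filter (h <= n).
    Slides the filter over the rows of M and keeps the largest activation."""
    height, width = len(kernel), len(kernel[0])
    activations = []
    for start in range(len(M) - height + 1):
        score = sum(M[start + r][c] * kernel[r][c]
                    for r in range(height) for c in range(width))
        activations.append(max(0.0, score))  # ReLU (an assumption)
    return max(activations)  # 1-max pooling


def textcnn_features(M, kernels):
    """One pooled value per kernel, so the output size is independent of n."""
    return [conv_max_pool(M, kernel) for kernel in kernels]
```

Because each feature map is reduced to a single value, kernels of heights d₁, d₂ and d₃ all contribute features of the same dimension, which is the purpose of the pooling described above.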

S4.4: The invention uses an autoencoder to compress the sparse matrix and reduce its dimension, extracting the effective key information for model construction. The autoencoder projects the input data into a latent space to achieve dimension reduction; it also removes collinearity, which improves the accuracy of the model. The variational autoencoder (VAE), an important generative model, is widely used in image recognition and also performs well and conveniently on text data. The VAE is a probabilistic autoencoder that is fast to sample from and easy to train: the encoder produces a mean code μ and a standard deviation σ, the actual code is sampled at random from the Gaussian distribution with parameters μ and σ, and the sampled code is then decoded to produce an output resembling the training example.
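The sampling step of the VAE can be sketched as below. Only the sampling of the code and the usual KL regularization term are shown; the encoder and decoder networks are omitted, and the KL term is the standard VAE choice rather than something stated in the text.

```python
import math
import random


def sample_latent(mu, sigma, rng=random):
    """Draw the actual code z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions."""
    return -0.5 * sum(1 + math.log(s * s) - m * m - s * s
                      for m, s in zip(mu, sigma))
```

After training, the mean code μ (which has far fewer dimensions than the input) can serve as the reduced feature vector passed on to the prediction model.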

S5: Constructing the SQL operation efficiency prediction model.

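S5.1: Constructing the Bayesian regression tree model. The Bayesian regression tree is constructed with a top-down recursive approach based on the divide-and-conquer principle. Assume that the Bayesian network nodes contain the attributes $\{X_1, X_2, \dots, X_n, Y\}$, where $Y$ is a continuous random variable and the $X_i$ are conditionally independent given $Y$, i.e.:

$$p(X_1, X_2, \dots, X_n \mid Y) = \prod_{i=1}^{n} p(X_i \mid Y)$$

Combining Bayes' theorem:

$$p(Y \mid X_1, \dots, X_n) = \frac{p(X_1, \dots, X_n \mid Y)\, p(Y)}{p(X_1, \dots, X_n)}$$

the following can be obtained:

$$p(Y \mid X_1, \dots, X_n) = \alpha\, p(Y) \prod_{i=1}^{n} p(X_i \mid Y)$$

where $\alpha = 1 / p(X_1, \dots, X_n)$ is the normalization constant. Bayesian regression outputs the regression value with the maximum conditional probability density as the target value:

$$\hat{y} = \arg\max_{y}\; p(y) \prod_{i=1}^{n} p(x_i \mid y)$$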
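The maximum-posterior rule above can be illustrated with a small sketch. It is not the Bayesian regression tree of the invention: it assumes the running time has been grouped into a few candidate values, estimates p(y) by frequency, and models each p(x_i | y) as a Gaussian, all of which are simplifying assumptions.

```python
import math
from collections import defaultdict


def gaussian_pdf(x, mu, sd):
    sd = max(sd, 1e-6)  # guard against zero variance
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit(X, y_values):
    """X: list of feature vectors; y_values: one candidate target per sample."""
    groups = defaultdict(list)
    for features, y in zip(X, y_values):
        groups[y].append(features)
    model = {}
    for y, rows in groups.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column)
            stats.append((mu, math.sqrt(var)))
        model[y] = (len(rows) / len(X), stats)  # (prior p(y), per-feature stats)
    return model


def predict(model, features):
    """Return the candidate y maximizing p(y) * prod_i p(x_i | y)."""
    def log_posterior(y):
        prior, stats = model[y]
        return math.log(prior) + sum(
            math.log(gaussian_pdf(v, mu, sd) + 1e-300)
            for v, (mu, sd) in zip(features, stats))
    return max(model, key=log_posterior)
```

Working with log-probabilities avoids numerical underflow when many features are multiplied together; the normalization constant α is the same for every candidate and can be ignored in the arg max.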

S5.2: Error measurement and model optimization. Suppose feature $X_i$ takes k values in the regression analysis. The degree of error attributable to $X_i$ is measured through the standard deviations of the k sample subsets, where $T_j$ denotes the subset of samples with $X_i = x_{ij}$, $|T|$ denotes the number of samples, and $\mathrm{sd}(T_j)$ denotes the standard deviation when $X_i = x_{ij}$. To reduce the influence of noise points on the model, an estimate based on the probability density is used in place of the mean when calculating the standard deviation.
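One natural reading of these definitions is a size-weighted sum of the per-subset standard deviations, as used when choosing splits in regression trees. The sketch below assumes that form and uses the ordinary mean-based population standard deviation, not the density-based estimate mentioned in the text.

```python
from collections import defaultdict
from statistics import pstdev


def split_error(feature_values, targets):
    """Sum over the k values of the feature of |T_j| / |T| * sd(T_j)."""
    subsets = defaultdict(list)
    for value, target in zip(feature_values, targets):
        subsets[value].append(target)
    total = len(targets)
    return sum(len(subset) / total * pstdev(subset)
               for subset in subsets.values())
```

A feature whose values separate short-running from long-running statements gives a small error, so it is a good candidate for splitting a node.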

S6: Self-learning and tracking mechanism.

S6.1: Machine learning models are usually applied in an offline-training, online-prediction mode: the model is trained offline on offline data, the model and its parameters are saved and wrapped in an interface, and the model is deployed online for other programs to call. Because the model parameters are then fixed and are not updated for a long time, the accuracy of the original model gradually declines as the database tables and data volume grow, which can even cause auditing errors and seriously affect system operation. To overcome this defect and strengthen the monitoring and tracking of the online model, the invention introduces a self-learning and tracking mechanism.

S6.2: Self-learning updates the model parameters periodically to ensure the accuracy of model prediction. Because the data volume in the database system grows relatively slowly and SQL audit data has no strict timeliness requirement, and taking the difficulty of system implementation into account, the invention adopts a triggered model update, which is triggered in the following two ways:

S6.2.1: Statistics are collected on the SQL passed and rejected by the auditing platform, the corresponding optimizations, and the SQL execution efficiency after go-live. When the execution efficiency of the online SQL falls below a certain threshold, model training is triggered, and the model is trained and its parameters updated using all data before the current time point.

S6.2.2: Because the statistics reflect the average SQL condition over a week, some individual SQL may still deviate greatly from its predicted value after go-live even when the weekly statistics satisfy the threshold. A second trigger condition is therefore designed along the time dimension: the model is updated with all real data every 2 months.

S6.3: Tracking mechanism. The execution efficiency of SQL that has passed review and gone online is tracked and compared with the predicted value. If the difference from the predicted value is large, the SQL and its execution plan are added to a manual review library; after the DBA has reviewed and optimized it, it goes online again, the optimized SQL execution efficiency is recorded, and the SQL before and after optimization is recorded in the manual optimization library.
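The two retraining triggers and the tracking check can be sketched as simple predicates. The parameter names, the 60-day approximation of "every 2 months" and the 50% relative tolerance are assumptions made for illustration; the text does not give concrete thresholds.

```python
from datetime import datetime, timedelta


def should_retrain(weekly_efficiency, efficiency_threshold, last_trained, now,
                   period=timedelta(days=60)):
    """Trigger 1: weekly efficiency below the threshold. Trigger 2: period elapsed."""
    if weekly_efficiency < efficiency_threshold:
        return True
    return now - last_trained >= period


def needs_manual_review(predicted_seconds, actual_seconds, tolerance=0.5):
    """Flag SQL whose measured running time deviates strongly from the prediction."""
    deviation = abs(actual_seconds - predicted_seconds)
    return deviation > tolerance * max(predicted_seconds, 1e-9)
```

Statements flagged by `needs_manual_review` would be sent to the manual review library, and their before and after versions kept for later learning.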

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. An SQL runtime prediction method based on N-gram is characterized by comprising the following steps:

S2: preprocessing a data set to obtain a sample set;

S3: constructing a corpus by utilizing the sample set based on the N-gram;

2. The method according to claim 1, wherein the S2 specifically includes:

3. The N-gram-based SQL runtime prediction method according to claim 2, wherein the S22 includes:

4. The method according to claim 1, wherein the S3 specifically includes:

S31: restoring tables and fields with aliases in the sample set;

5. The method according to claim 4, wherein the S33 expanding the preliminary corpus based on the N-gram specifically includes:

6. The method according to claim 1, wherein the S4 specifically includes:

7. The method of claim 1, wherein the method can optimize the prediction result by updating the data set.

8. An N-gram based SQL runtime prediction system, comprising:

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method according to any of claims 1 to 7 are carried out when the program is executed by the processor.