CN112667666A - SQL operation time prediction method and system based on N-gram - Google Patents

Publication number
CN112667666A
Authority
CN
China
Prior art keywords
sql
corpus
gram
data set
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011643906.7A
Other languages
Chinese (zh)
Inventor
李振
张刚
鲍东岳
宋璞
尹正
李千惠
张晨星
吕亚波
苑云飞
马圣楠
邸璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minsheng Science And Technology Co ltd
Original Assignee
Minsheng Science And Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minsheng Science And Technology Co ltd filed Critical Minsheng Science And Technology Co ltd
Priority to CN202011643906.7A priority Critical patent/CN112667666A/en
Publication of CN112667666A publication Critical patent/CN112667666A/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An N-gram-based SQL running-time prediction method and system, relating to the technical field of databases. The method comprises the following steps: S1: acquiring SQL statements together with the execution plan and running time corresponding to each statement to form a data set; S2: preprocessing the data set to obtain a sample set; S3: constructing a corpus from the sample set based on N-grams; S4: extracting features from the corpus and reducing the dimensionality of the extracted features; S5: predicting the SQL running time with a Bayesian prediction model based on the dimension-reduced features. The invention embeds an SQL running-time prediction function on top of the basic functions of database organization, data storage and management, and SQL review, and provides reference and decision support for database administrators and database users.

Description

SQL operation time prediction method and system based on N-gram
Technical Field
The invention relates to the technical field of databases, in particular to a method and a system for predicting SQL operation time based on N-gram.
Background
With the rapid development and wide application of big data and artificial intelligence, the volume of data generated across industries is growing exponentially. Big data is characterized by large volume, high velocity, high variety, high value, and veracity, and data storage faces higher requirements and challenges in terms of security, stability, integrity, and convenience.
In database operation and maintenance, SQL execution performance directly affects the operation of the whole system. SQL with hidden performance problems not only runs slowly itself but also occupies a large amount of computing resources and degrades the efficiency of the whole data system. Database performance can be improved by 60% or even more, and because every operation an application performs on the database is ultimately expressed as SQL statements, statements that go live without review, or under an incomplete review mechanism, are a risk point after the code is released. Manual review is constrained by the limited availability of developers and DBAs, and it is difficult to enforce standards and regulatory constraints consistently. Even with a complete manual review mechanism that gives effective alerts and optimization suggestions for SQL, the repeated cycle of review, optimization, and re-review consumes a large amount of human resources and delays production projects.
Therefore, it is very important to use AI technology to make the SQL review process standardized, tool-based, automated, and intelligent. An intelligent SQL review system can standardize SQL syntax and the release process, provide database administrators with a standardized review workflow, and reduce DBA labor costs; it can also ensure that DQL, DDL, and DML run accurately and efficiently, improving the efficiency of daily data management, maintenance, and operation.
Disclosure of Invention
In view of the above, the invention provides an N-gram-based SQL running-time prediction method and system. The system embeds an SQL running-efficiency prediction function on top of the basic functions of database organization, data storage and management, and SQL review, and provides reference and decision support for database administrators and database users.
The SQL running-efficiency prediction in the system applies technologies such as machine learning and natural language processing, comprehensively considers factors such as execution plans, bind variables, historical running efficiency, coding habits, review opinions, and server and database configuration, predicts the SQL running time, and provides optimization opinions and suggestions.
To achieve the above purpose, the invention adopts the following technical solution:
According to a first aspect of the present invention, there is provided an N-gram-based SQL running-time prediction method, the method comprising the following steps:
S1: acquiring SQL statements together with the execution plan and running time corresponding to each SQL statement to form a data set;
S2: preprocessing the data set to obtain a sample set;
S3: constructing a corpus from the sample set based on N-grams;
S4: extracting features from the corpus and reducing the dimensionality of the extracted features;
S5: predicting the SQL running time with a Bayesian prediction model based on the dimension-reduced features.
Further, S2 specifically includes:
S21: deleting or filling fields with missing values in the data set, and removing outliers from the data set to obtain a complete data set;
S22: performing word segmentation on the SQL statements in the complete data set to obtain a preliminary sample set;
S23: based on the underlying execution logic of SQL, adding to the preliminary sample set other fields that influence the running time of the SQL statements, to complete the construction of the sample set.
Further, S22 includes:
adding a space before and after each symbol in the SQL statement, and segmenting the SQL statement using the space as the separator.
Further, S3 specifically includes:
S31: restoring the tables and fields that have aliases in the sample set to their original names;
S32: deduplicating the restored sample set to obtain a preliminary corpus;
S33: expanding the preliminary corpus based on N-grams to complete the construction of the corpus.
Further, expanding the preliminary corpus based on N-grams in S33 specifically includes:
performing word-granularity N-gram processing on the corpus: sliding a window of size 2 at word granularity over the SQL statements in the preliminary corpus to form a plurality of byte fragment sequences, and adding these byte fragment sequences to the corpus;
performing character-level N-gram processing on the corpus: splitting the words of the SQL statements in the preliminary corpus at character level using "_" as the separator to form a plurality of character fragment sequences, and adding these character fragment sequences to the corpus.
Further, S4 specifically includes:
S41: vectorizing the corpus with the TF-IDF algorithm and the word2vec algorithm respectively to obtain preliminary features;
S42: inputting the preliminary features into a textCNN model for training to obtain deep features;
S43: feeding the deep features into a variational autoencoder to perform dimensionality reduction.
Further, the method can optimize the prediction result by updating the data set.
According to a second aspect of the present invention, there is provided an N-gram-based SQL running-time prediction system, comprising:
a data acquisition module, configured to acquire the SQL statements and running information in the system to form a data set;
a data processing module, configured to preprocess the data set to obtain a sample set;
a corpus construction module, configured to construct a corpus from the sample set based on N-grams;
a feature extraction module, configured to extract features from the corpus and reduce the dimensionality of the extracted features;
an efficiency prediction module, configured to predict the SQL running time with a Bayesian prediction model based on the dimension-reduced features.
According to a third aspect of the present invention, a computer-readable storage medium is provided, having a computer program stored thereon, characterized in that the computer program, when being executed by a processor, is adapted to carry out the steps of the above-mentioned method.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the program.
Compared with the prior art, the N-gram-based SQL running-time prediction method and system have the following advantages:
On the premise that the database and the SQL statements conform to the specifications, the SQL performance is analyzed quantitatively; factors such as execution plans, bind variables, historical running efficiency, coding habits, review opinions, and server and database configuration are considered comprehensively; the SQL running time is predicted; and a basis for reference and decision-making is provided to developers and database administrators. In addition, the invention provides a self-learning and tracking mechanism that updates the model and its parameters at a certain period to keep the model timely and accurate.
The invention also has the following innovations:
(1) Building on traditional review and evaluation based on expert experience, the SQL running-time prediction system constructed by the invention uses technologies such as machine learning, TF-IDF, and textCNN to predict the SQL running time, comprehensively considers factors such as execution plans, bind variables, historical running efficiency, coding habits, review opinions, and server and database configuration, and brings these factors into the model training process as derived features, thereby continuously improving the model performance.
(2) The invention applies N-grams to SQL review processing and takes the particularity of the SQL language into account. This solves the problem of inaccurate word segmentation in traditional SQL processing, increases the similarity between similar SQL statements, and reduces the similarity between different SQL statements. Through special processing of database names, table names, and column names, it distinguishes different tables in the same database and different columns in the same table, thereby improving the accuracy of model prediction.
(3) The invention introduces a self-learning and tracking mechanism. On the one hand, the training data can be updated regularly, so that the model and its parameters are optimized. On the other hand, developers can optimize SQL manually; the SQL before and after optimization is compared and the optimization method is learned, so that SQL can be optimized automatically in the future.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of the SQL runtime prediction method of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
"A plurality" means two or more.
"And/or": as used in this disclosure, the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent: A exists alone, both A and B exist, or B exists alone.
S1: Data preparation. The data of the invention come from the stock data in the system, from production test data and online data generated according to the needs of different modules and business development, and also cover the historical data of version iterations. All SQL statements, historical running times, execution plans, and other information are collated to form a complete data set.
S2: Data preprocessing. The data quality of the data set is examined, and data cleaning, data integration, word segmentation, feature derivation, and other processing are performed to facilitate the subsequent use of the data in model construction.
S2.1: Handling missing values and outliers. Depending on the type and attributes of each field, fields with a high missing rate are deleted, and the missing values of the remaining fields are filled with the mode, the mean, or 0. Outliers in the data are removed. Because the SQL running time is affected by software, hardware, and other conditions and fluctuates to some degree, samples are limited by quantile to keep external factors from interfering with the model, and samples with abnormally long actual running times are removed.
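The cleaning described in S2.1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the field name `runtime_ms`, the 95th-percentile cutoff, and the helper names are assumptions introduced here, since the text does not state which quantile is used.

```python
from statistics import mean, mode, quantiles

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, the mode, or 0 of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        fill = 0
    return [fill if v is None else v for v in values]

def drop_slow_samples(samples, upper_pct=95):
    """Keep only samples whose running time is at or below the given percentile."""
    times = [s["runtime_ms"] for s in samples]  # hypothetical field name
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cutoff = quantiles(times, n=100, method="inclusive")[upper_pct - 1]
    return [s for s in samples if s["runtime_ms"] <= cutoff]
```

Removing by percentile rather than by a fixed time limit keeps the rule independent of the absolute speed of a particular server.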
S2.2: Word segmentation. Because SQL is both a specialized language and close to natural language, an SQL statement can be segmented directly using the space as the separator. In addition, separators are added around special symbols, for example replacing "(" with " ( " and "=" with " = ", which improves the accuracy and consistency of the segmentation.
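A minimal sketch of this segmentation step is given below. The symbol set, the lower-casing, and the function name `tokenize_sql` are assumptions made for illustration; the replacement of the "." in a qualified name with a space follows the example given later for word-granularity N-grams.

```python
import re

# Illustrative symbol set; multi-character operators come first so they stay intact.
_SYMBOLS = re.compile(r"(<=|>=|<>|!=|[(),=<>;])")
# A dot between a qualifier and a name (table.column) becomes a space.
_QUALIFIER_DOT = re.compile(r"(?<=\w)\.(?=[A-Za-z_])")

def tokenize_sql(sql: str) -> list[str]:
    """Pad symbols with spaces, then split the statement on whitespace."""
    padded = _SYMBOLS.sub(r" \1 ", sql)
    padded = _QUALIFIER_DOT.sub(" ", padded)
    return padded.lower().split()
```

For example, `tokenize_sql("a=1.5")` keeps the decimal literal `1.5` as one token, because only dots followed by a letter or underscore are treated as qualifier dots.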
S2.3: Feature derivation. According to the underlying logic of SQL, the key factors that influence the running of an SQL statement are analyzed and added to the model as features. The SQL execution plan can be used to analyze the performance bottlenecks of a query statement or a table structure, and core elements of the SQL such as execution order, primary key, and index are analyzed from the ID, SELECT_TYPE, TABLE, TYPE, POSSIBLE_KEYS, KEY_LEN, REF, ROWS, and Extra columns. In this embodiment, the SQL statement serves as the basic data, and related underlying data such as the execution plan and statistical information are also taken as input and included in the scope of study, which increases the dimensionality of the data and improves the comprehensiveness of the SQL review and the accuracy of the model prediction.
S3: Constructing the language model.
S3.1: Constructing the preliminary corpus.
S3.1.1: Restoring tables that have aliases. First, the alias of a table is located through the table name. A table alias occurs in two forms: (a) table alias; (b) table as alias (where the alias does not include left/join/inner/on/order/……). After the alias is located, every occurrence of the alias in the SQL is replaced with the original table name, which eliminates the interference caused by table aliases.
S3.1.2: Restoring fields that have aliases. A column alias is located by the combination "as" + "alias" + "," (a column alias without "as" is difficult to locate), and the alias is then deleted from the SQL.
S3.1.3: After all words are gathered and deduplicated, the corpus is preliminarily constructed.
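The alias restoration of S3.1.1 and S3.1.2 can be sketched on an already segmented, lower-cased token list. The keyword set `NOT_ALIAS` and both function names are assumptions made for illustration; comma-separated table lists and subqueries are deliberately left unhandled to keep the sketch short.

```python
# Words that may follow a table name without being an alias (illustrative, not exhaustive).
NOT_ALIAS = {"left", "right", "inner", "outer", "join", "on", "where",
             "group", "order", "having", "limit", "union"}

def restore_table_aliases(tokens):
    """Replace table aliases with the original table name and drop the declarations."""
    alias_of = {}     # alias -> original table name
    declared = set()  # token positions that only declare an alias
    for i, tok in enumerate(tokens):
        if tok not in ("from", "join") or i + 2 >= len(tokens):
            continue
        if not tokens[i + 1].isidentifier():   # e.g. a subquery in parentheses
            continue
        j = i + 2
        if tokens[j] == "as":                  # form (b): table as alias
            j += 1
        if j < len(tokens) and tokens[j].isidentifier() and tokens[j] not in NOT_ALIAS:
            alias_of[tokens[j]] = tokens[i + 1]
            declared.update(range(i + 2, j + 1))
    return [alias_of.get(t, t) for k, t in enumerate(tokens) if k not in declared]

def drop_column_aliases(tokens):
    """Delete column aliases that match the pattern: as, alias, comma."""
    out, k = [], 0
    while k < len(tokens):
        if tokens[k] == "as" and k + 2 < len(tokens) and tokens[k + 2] == ",":
            k += 2                             # skip "as" and the alias, keep the comma
        else:
            out.append(tokens[k])
            k += 1
    return out
```

Like the rule in the text, `drop_column_aliases` only finds aliases that are followed by a comma, so the last column alias before FROM is not removed.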
S3.2: Updating the corpus based on N-grams.
N-gram is an algorithm based on a statistical language model. The basic idea is to slide a window of size N over the text byte by byte, forming a sequence of byte fragments of length N.
Each byte fragment is called a gram. The occurrence frequencies of all grams are counted and filtered against a preset threshold to form a list of key grams, which is the vector feature space of the text; each gram in the list is one dimension of the feature vector.
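The generic byte-level procedure just described can be illustrated as follows. The function names and the choice of UTF-8 encoding are assumptions; the threshold is passed in as `min_count` because the text only says that it is preset.

```python
from collections import Counter

def byte_ngrams(text, n):
    """Slide a window of n bytes over the UTF-8 encoding of the text."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def key_gram_list(texts, n, min_count):
    """Count all grams and keep those that occur at least min_count times."""
    counts = Counter(g for t in texts for g in byte_ngrams(t, n))
    return sorted(g for g, c in counts.items() if c >= min_count)
```

Each entry of the returned list would correspond to one dimension of the feature vector.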
S3.2.1: Word-granularity N-gram: based on the N-gram principle, a sliding window with N = 2 is applied at word granularity to the segmented SQL. For example, in S2.2 the SQL is segmented using the space as the separator, so group/by were originally 2 words and now become the phrase "group by". In r360__user_info.userid, the "." was replaced with a space during the original segmentation, so after segmentation r360__user_info/userid were 2 words, and they now become the phrase "r360__user_info userid". As these examples show, meaningful phrases in SQL are basically composed of 2 basic words, so N is set to 2.
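A minimal sketch of the word-granularity window with N = 2, assuming the statement has already been segmented into a token list (the function name is hypothetical):

```python
def word_bigrams(tokens):
    """Join each pair of adjacent tokens into one phrase (N = 2)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
```

Applied to the tokens `select`, `x`, `group`, `by`, `y`, this yields the phrase `group by` alongside the other adjacent pairs.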
S3.2.2: Character-level N-gram: based on the N-gram principle, the basic words are split at character level, that is, each basic word is split using "_" as the separator. In SQL, the only basic words that can contain "_" are table names, column names, and aliases. Since all aliases have been restored in S3.1, the basic words containing "_" can only be table names and column names. This processing increases the degree of association between tables. For example, r360__user_info and r360__bank_detail were originally two completely unrelated words; now r360__user_info is decomposed into r360/user/info and r360__bank_detail is decomposed into r360/bank/detail. They share the word r360, and indeed both are data generated from the r360 database.
S3.2.3: The data processed by the N-gram steps are rearranged and the corpus is updated.
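The character-level split and the corpus update can be sketched together. Both function names are assumptions, and the corpus is modeled here simply as a set of strings.

```python
def split_identifier(word):
    """Split a table or column name on "_" and drop empty pieces."""
    return [part for part in word.split("_") if part]

def expand_corpus(corpus, tokenized_statements):
    """Return the corpus extended with word bigrams and "_"-separated pieces."""
    expanded = set(corpus)
    for tokens in tokenized_statements:
        expanded.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
        for tok in tokens:
            if "_" in tok:
                expanded.update(split_identifier(tok))
    return expanded
```

With this split, `r360__user_info` and `r360__bank_detail` share the piece `r360`, which is the association between tables that the text describes.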
S4: Feature scaling and feature encoding.
To extract the features of SQL statements, the invention uses TF-IDF, word2vec, and textCNN to extract features from different angles, so as to obtain more comprehensive and deeper strongly correlated features.
S4.1: TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. TF-IDF evaluates how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. In this embodiment, TF-IDF is computed separately for the SQL statement and for the execution plan to vectorize them and obtain feature vectors.
TF-IDF=TF*IDF
Figure BDA0002877011790000061
Namely:
Figure BDA0002877011790000062
Figure BDA0002877011790000063
namely:
Figure BDA0002877011790000064
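As an illustration of the weighting, the sketch below assumes the common unsmoothed definitions, TF as the count of a term divided by the number of terms in the document and IDF as the logarithm of the number of documents divided by the number of documents containing the term. These definitions are an assumption and may differ in detail (for example in smoothing) from the formulas of this embodiment.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document, such as `select`, receives weight 0 under this definition, while terms specific to one statement receive positive weight.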
S4.2: word2vec is a technique for converting words into fixed-dimension vectors. It is well suited to data in which the parts of a sequence are strongly associated, and therefore to SQL statements, whose keywords have strong logical relationships with one another, so more comprehensive feature vectors can be obtained with word2vec. The processing of word2vec is as follows:
1) Each keyword in the SQL is converted into one-hot format according to the corpus of S3.2.3. Assuming the vocabulary size of the corpus is $V$, the one-hot vector of a keyword is written $\{x_1, x_2, \ldots, x_V\}$, in which exactly one node $x_k$ is nonzero.
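2) The weights between the input layer and the hidden layer can be represented by a $V \times N$ matrix $W$:
$$W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{V1} & w_{V2} & \cdots & w_{VN} \end{pmatrix}$$
Each row of $W$ is the $N$-dimensional vector representation $v_w$ of the word associated with the input layer. Since the input is in one-hot format, the output $h$ of the hidden layer is determined entirely by the $k$-th row of $W$, written as the column vector $v_{w_I}$ of the input word $w_I$:
$$h = W^{T} x = W_{(k,\cdot)}^{T} = v_{w_I}$$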
3) From the hidden layer to the output layer there is a new $V \times N$ weight matrix $W' = \{w'_{ij}\}$, with which the score $u_j$ of each word in the vocabulary can be calculated:
$$u_j = {v'_{w_j}}^{T} h$$
where $v'_{w_j}$ is the $j$-th row of $W'$.
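Softmax, a log-linear classification model, is used to obtain the posterior distribution of the words, which is a multinomial distribution:
$$p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$$
where $y_j$ is the output of the $j$-th node of the output layer. Substituting the hidden layer output and the word scores into the above formula gives the complete formula:
$$p(w_j \mid w_I) = \frac{\exp\left({v'_{w_j}}^{T} v_{w_I}\right)}{\sum_{j'=1}^{V} \exp\left({v'_{w_{j'}}}^{T} v_{w_I}\right)}$$
where $v_{w_I}$ is the input vector of the input word $w_I$ and $v'_{w_j}$ is the output vector of the word $w_j$.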
4) Define the loss function $E = -\log p(w_O \mid w_I)$, where $w_O$ is the actual output word, and update the weights by gradient descent.
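The forward pass of steps 1) to 4) can be sketched in plain Python. The sketch assumes that both weight matrices are stored with one row per vocabulary word and that the loss is the negative log-probability of the target word; the function names are hypothetical, and the gradient update itself is omitted.

```python
import math

def forward(W, W_out, k):
    """Return the softmax output y for the one-hot input whose k-th entry is 1.

    W     : V x N input weights (row k is the vector of input word k)
    W_out : V x N output weights (row j is the output vector of word j)
    """
    h = W[k]                                                # hidden layer = k-th row of W
    u = [sum(a * b for a, b in zip(v, h)) for v in W_out]   # scores u_j
    m = max(u)                                              # subtract max for numerical stability
    exp_u = [math.exp(x - m) for x in u]
    z = sum(exp_u)
    return [e / z for e in exp_u]

def loss(y, target):
    """Negative log-probability of the target word (assumed form of E)."""
    return -math.log(y[target])
```

Because the input is one-hot, no matrix multiplication is needed for the hidden layer; selecting row `k` of `W` is enough.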
S4.3: Compared with other deep learning models for text processing, textCNN has a simple network structure, trains quickly, and achieves high accuracy. The algorithm flow is described in detail below.
1) Embedding layer. The feature vectors output by TF-IDF and word2vec are used together as the vector representation of the text. The length n of the embedding layer is defined according to the statistics of SQL lengths, and the matrix representation of the embedding layer is M.
2) Convolution and pooling. Adjacent keywords in SQL are highly correlated, and some filter conditions have a great influence on SQL execution performance, so the invention uses multiple convolution kernels $\{d_1, d_2, d_3\}$ to extract the relationships between keywords at different positions. Because convolution kernels of different sizes produce feature maps of different sizes, a pooling function is applied to each feature map so that they have the same dimension; the invention uses 1-max pooling.
The key point of this part is to extract the association information between the keywords of the SQL statement, so after the error has been minimized, the output of the last pooling layer is taken as the features of the SQL statement.
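The convolution and 1-max pooling step can be illustrated without a deep learning framework. In this sketch the embedding matrix M is a list of rows, each kernel spans d consecutive rows and the full embedding width, and the bias and activation function are left out; these simplifications and the function names are assumptions.

```python
def conv_1max(M, kernel):
    """Slide a d-row kernel down the n x dim matrix M and keep the largest response."""
    d = len(kernel)
    activations = [
        sum(M[i + r][c] * kernel[r][c] for r in range(d) for c in range(len(kernel[r])))
        for i in range(len(M) - d + 1)
    ]
    return max(activations)  # 1-max pooling over the feature map

def textcnn_features(M, kernels):
    """One pooled value per kernel, so kernels of any height yield equal-sized output."""
    return [conv_1max(M, k) for k in kernels]
```

Each kernel contributes exactly one number regardless of its height, which is how pooling brings feature maps of different sizes to the same dimension.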
S4.4: The invention uses an autoencoder to compress the sparse matrix and reduce its dimensionality, extracting the effective key information for model construction. The autoencoder projects the input data into a latent space, which reduces the dimensionality and also removes collinearity, improving the accuracy of the model. The variational autoencoder (VAE) is an important generative model that is widely used in image recognition, and it also performs well on text data. The VAE is a probabilistic autoencoder that is fast to sample from and easy to train: the encoder produces a mean coding μ and a standard deviation σ, the actual coding is sampled randomly from the Gaussian distribution with mean μ and standard deviation σ, the sampled coding is then decoded, and the decoder finally outputs a reconstruction of the training example.
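The sampling step between encoder and decoder can be sketched as below. Only the draw of the coding from the Gaussian defined by μ and σ is shown; the encoder and decoder networks are omitted, and the function name is hypothetical.

```python
import random

def sample_coding(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

For dimensionality reduction, the coding (or simply the mean μ) serves as the low-dimensional representation of the input features.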
S5: Constructing the SQL running-efficiency prediction model.
S5.1: Constructing the Bayesian regression tree model. The Bayesian regression tree is constructed by a top-down recursive method based on the divide-and-conquer principle. Assume that the nodes of the Bayesian network comprise the attributes $\{X_1, X_2, \ldots, X_n, Y\}$, where $Y$ is a continuous random variable and the $X_i$ are conditionally independent given $Y$, i.e.:
$$p(X_1, X_2, \ldots, X_n \mid Y) = \prod_{i=1}^{n} p(X_i \mid Y)$$
Combining this with Bayes' theorem:
$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \quad X = (X_1, X_2, \ldots, X_n)$$
gives:
$$p(Y \mid X) = \alpha\, p(Y) \prod_{i=1}^{n} p(X_i \mid Y)$$
where $\alpha = 1/p(X)$ is the normalization constant. Bayesian regression outputs the regression value with the maximum conditional probability density as the target value:
$$\hat{y} = \arg\max_{y}\; p(y) \prod_{i=1}^{n} p(x_i \mid y)$$
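The maximum-posterior rule can be illustrated with a simplified stand-in: the continuous target is restricted to the distinct (for example, binned) running times seen in training, and each conditional density is modeled as a Gaussian. Both choices, and the function names, are assumptions made for this sketch and are not stated in the text.

```python
import math
from collections import defaultdict

def fit(X, y, min_var=1e-6):
    """Estimate the prior p(y) and per-feature Gaussian parameters for each target value."""
    groups = defaultdict(list)
    for row, target in zip(X, y):
        groups[target].append(row)
    model = {}
    for target, rows in groups.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column) + min_var
            stats.append((mu, var))
        model[target] = (len(rows) / len(X), stats)
    return model

def log_gaussian(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, x):
    """Return the target value maximizing log p(y) + sum_i log p(x_i | y)."""
    def log_posterior(target):
        prior, stats = model[target]
        return math.log(prior) + sum(
            log_gaussian(v, mu, var) for v, (mu, var) in zip(x, stats)
        )
    return max(model, key=log_posterior)
```

Working with log-probabilities avoids numerical underflow when many features are multiplied, and the normalization constant can be ignored because it does not depend on the target value.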
S5.2: Error measurement and model optimization. Suppose feature $X_i$ takes $k$ values in the regression analysis; the degree of error measured by the standard deviation for $X_i$ is:
Figure BDA0002877011790000085
where $T_j$ denotes the subset of samples for which $X_i = x_{ij}$, $|T|$ denotes the number of samples, and $sd(T_j)$ denotes the standard deviation of the samples for which $X_i = x_{ij}$. To reduce the influence of noise points on the model, an estimate based on the probability density is used in place of the mean when calculating the standard deviation:
Figure BDA0002877011790000091
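One plausible reading of this error measure, given the symbols defined above, is the standard deviation of each subset weighted by its share of the samples. The sketch below implements that assumed form with the ordinary mean-based population standard deviation; it does not implement the probability-density-based estimate, whose exact form is not recoverable from the text.

```python
from collections import defaultdict
from statistics import pstdev

def weighted_sd_error(feature_values, targets):
    """Sum over feature values of (|T_j| / |T|) * sd(T_j), an assumed form of the measure."""
    subsets = defaultdict(list)
    for x, t in zip(feature_values, targets):
        subsets[x].append(t)
    total = len(targets)
    return sum(len(ts) / total * pstdev(ts) for ts in subsets.values())
```

A feature whose values split the samples into subsets with similar running times gives a small error and is therefore a good candidate for splitting.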
S6: Self-learning and tracking mechanism.
S6.1: Machine learning models are usually applied in an offline-training, online-prediction mode: the model is trained offline on offline data, the model and its parameters are saved and wrapped in an interface, and the model is deployed online for other programs to call. Because the model parameters are fixed in this mode and are not updated for a long time, the accuracy of the original model gradually declines as the database tables and the data volume grow, which may even cause review errors and seriously affect system operation. To overcome these defects and strengthen the monitoring and tracking of the online model, the invention introduces a self-learning and tracking mechanism.
S6.2: Self-learning means updating the model parameters periodically to keep the model's predictions accurate. Because the data volume in the database system grows relatively slowly, the timeliness requirement on SQL review data is not high, and considering the difficulty of system implementation, the invention adopts a triggered model update, which is triggered in the following two ways:
S6.2.1: Statistics are collected on the SQL passed and rejected by the review platform, the corresponding optimizations, and the SQL execution efficiency after release. When the execution efficiency of the released SQL falls below a certain threshold, model training is triggered, and the model is trained and its parameters updated using all data before the current time point.
S6.2.2: Because the statistics reflect the average SQL condition over one week, by the time the statistics reach the threshold some released SQL may already differ greatly from its predicted value. Another trigger condition is therefore designed along the time dimension: the model is updated with all real data every 2 months.
S6.3: Tracking mechanism. The execution efficiency of SQL that has been reviewed and released is tracked and compared with the predicted value. If the difference from the predicted value is large, the SQL and its execution plan are added to the manual review library; after the DBA has reviewed and optimized it, it is released again, the execution efficiency of the optimized SQL is recorded, and the SQL before and after optimization is recorded in the manual optimization library.
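The two update triggers and the tracking comparison can be sketched as simple predicates. The 60-day interpretation of "every 2 months", the relative tolerance of 0.5, and all names are assumptions, since the text gives no concrete threshold values.

```python
from datetime import date, timedelta

def should_retrain(weekly_efficiency, threshold, last_trained, today, max_age_days=60):
    """Trigger retraining on low weekly efficiency or when the model is too old."""
    if weekly_efficiency < threshold:                            # efficiency trigger
        return True
    return today - last_trained >= timedelta(days=max_age_days)  # time trigger

def needs_manual_review(predicted_ms, actual_ms, rel_tol=0.5):
    """Flag SQL whose actual running time deviates strongly from the prediction."""
    return abs(actual_ms - predicted_ms) > rel_tol * predicted_ms
```

Statements flagged by `needs_manual_review` would be the ones routed to the manual review library together with their execution plans.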
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. An SQL runtime prediction method based on N-gram is characterized by comprising the following steps:
s1: acquiring an SQL statement and an execution plan and an operation time corresponding to the SQL statement to form a data set;
s2: preprocessing a data set to obtain a sample set;
s3: constructing a corpus by utilizing the sample set based on the N-gram;
s4: extracting features from the corpus and reducing the dimension of the extracted features;
s5: based on the characteristics after dimension reduction, the SQL operation time prediction is completed by utilizing a Bayesian prediction model.
2. The method according to claim 1, wherein S2 specifically includes:
S21: deleting or filling missing fields in the data set, and removing outliers from the data set, to obtain a complete data set;
S22: performing word segmentation on the SQL statements in the complete data set to obtain a preliminary sample set;
S23: adding, to the preliminary sample set, other fields that influence the running time of the SQL statements, based on the underlying execution logic of SQL, to complete the construction of the sample set.
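The cleaning step S21 of claim 2 can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the record layout (`sql`, `plan`, `runtime_ms`), the helper name `clean_dataset`, the choice to fill a missing plan with an empty string, and the use of the interquartile-range rule for outliers are all assumptions made for the example; the claim does not name a specific filling or outlier rule.

```python
from statistics import quantiles


def clean_dataset(records, k=1.5):
    """Delete or fill missing fields, then drop runtime outliers.

    Hypothetical record layout: {"sql": str, "plan": str, "runtime_ms": number}.
    The IQR rule with factor k is an assumed outlier criterion.
    """
    complete = []
    for rec in records:
        # Delete records that lack the statement or the runtime label.
        if not rec.get("sql") or rec.get("runtime_ms") is None:
            continue
        # Fill a missing execution plan with an empty placeholder.
        filled = dict(rec)
        filled["plan"] = rec.get("plan") or ""
        complete.append(filled)

    if len(complete) < 4:
        return complete  # too few samples to estimate quartiles

    runtimes = [r["runtime_ms"] for r in complete]
    q1, _, q3 = quantiles(runtimes, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    # Keep only records whose runtime lies inside the IQR fences.
    return [r for r in complete if low <= r["runtime_ms"] <= high]
```

Step S23 (adding further fields that influence running time) depends on the database's execution logic and is not modeled in this sketch.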
3. The N-gram-based SQL runtime prediction method according to claim 2, wherein S22 includes:
adding a space before and after each symbol in the SQL statement, and segmenting the SQL statement into words using the space as the separator.
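The segmentation described in claim 3 can be illustrated with a short sketch. The function name `tokenize_sql` and the exact symbol set are assumptions: here any character that is not a word character, whitespace, or `.` counts as a symbol, a few two-character operators are kept together, and `.` is left inside qualified names such as `a.id`. The claim itself does not enumerate the symbols.

```python
import re

# Assumed symbol set: two-character comparison operators first, then any
# single character that is not a word character, whitespace, or a dot.
_SYMBOL = re.compile(r"(<=|>=|<>|!=|[^\w\s.])")


def tokenize_sql(sql):
    """Pad each symbol with spaces, then split the statement on whitespace."""
    spaced = _SYMBOL.sub(r" \1 ", sql)
    return spaced.split()


if __name__ == "__main__":
    print(tokenize_sql("SELECT a,b FROM t WHERE x>=10"))
```

Padding before splitting keeps punctuation such as commas and parentheses as separate tokens, so `count(*)` yields `count`, `(`, `*`, `)` rather than one opaque token.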
4. The method according to claim 1, wherein S3 specifically includes:
S31: restoring the tables and fields that are referred to by aliases in the sample set;
S32: de-duplicating the restored sample set to obtain a preliminary corpus;
S33: expanding the preliminary corpus based on the N-gram to complete the construction of the corpus.
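Steps S31 and S32 of claim 4 might look roughly like the sketch below, which works on already tokenized statements. It is a simplified illustration under stated assumptions: only table aliases declared directly after `FROM` or `JOIN` (with or without `AS`) are handled, subqueries and column aliases are ignored, and the names `restore_aliases` and `build_preliminary_corpus` are hypothetical. A production system would more likely rely on a real SQL parser.

```python
# Tokens that can follow a table name without being an alias (assumed list).
_RESERVED = {"WHERE", "JOIN", "ON", "GROUP", "ORDER", "HAVING", "LIMIT",
             "INNER", "LEFT", "RIGHT", "FULL", "CROSS", "UNION"}


def restore_aliases(tokens):
    """Replace table aliases with table names and drop the alias declarations."""
    aliases = {}
    kept = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        kept.append(tok)
        # A plain table reference follows FROM/JOIN; "(" would start a subquery.
        if tok.upper() in ("FROM", "JOIN") and i + 1 < len(tokens) and tokens[i + 1] != "(":
            table = tokens[i + 1]
            kept.append(table)
            j = i + 2
            if j < len(tokens) and tokens[j].upper() == "AS":
                j += 1  # skip the optional AS keyword
            if j < len(tokens) and tokens[j].upper() not in _RESERVED and tokens[j].isidentifier():
                aliases[tokens[j]] = table
                j += 1  # drop the alias declaration itself
            i = j
            continue
        i += 1
    # Second pass, so aliases used in the SELECT list before FROM are restored too.
    restored = []
    for tok in kept:
        prefix, dot, column = tok.partition(".")
        restored.append(aliases[prefix] + dot + column if dot and prefix in aliases else tok)
    return restored


def build_preliminary_corpus(token_lists):
    """Restore aliases in every sample, then de-duplicate while keeping order."""
    restored = (tuple(restore_aliases(tokens)) for tokens in token_lists)
    return [list(tokens) for tokens in dict.fromkeys(restored)]
```

Restoring aliases before de-duplication means that two statements differing only in alias names collapse into a single corpus entry.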
5. The method according to claim 4, wherein expanding the preliminary corpus based on the N-gram in S33 specifically includes:
performing word-granularity N-gram processing on the corpus: sliding a window of size 2 at word granularity over the SQL statements in the preliminary corpus to form a plurality of byte fragment sequences, and adding the byte fragment sequences to the corpus;
performing character-level N-gram processing on the corpus: performing character sliding based on '_' over the SQL statements in the preliminary corpus to form a plurality of character fragment sequences, and adding the character fragment sequences to the corpus.
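The corpus expansion of claim 5 can be sketched as a pair of sliding-window helpers. The word-level window of 2 follows the claim. The character-level part is shown as plain fixed-length character n-grams with an assumed length of 3, because the exact role of '_' in the claim's character sliding is not clear from the text and is not modeled here. All function names are hypothetical.

```python
def word_ngrams(tokens, n=2):
    """Slide a window of n words over a tokenized statement."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Slide a window of n characters over a single token (n is an assumption)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def expand_corpus(token_lists, char_n=3):
    """Add word-level and character-level fragments to the preliminary corpus."""
    entries = []
    for tokens in token_lists:
        entries.append(" ".join(tokens))            # the original statement
        entries.extend(word_ngrams(tokens, 2))      # word granularity, window 2
        for tok in tokens:
            entries.extend(char_ngrams(tok, char_n))  # character granularity
    # De-duplicate while preserving first-seen order.
    return list(dict.fromkeys(entries))
```

Character fragments let identifiers that share a stem, for example `user_id` and `user_name`, contribute overlapping entries even when the full tokens never repeat.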
6. The method according to claim 1, wherein S4 specifically includes:
S41: vectorizing the corpus with the TF-IDF algorithm and the word2vec algorithm respectively to obtain preliminary features;
S42: feeding the preliminary features into a textCNN model for training to obtain deep features;
S43: performing the dimension reduction by feeding the deep features into a variational autoencoder.
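The TF-IDF half of step S41 in claim 6 can be illustrated with a small standard-library sketch. It assumes the unsmoothed textbook weighting, term frequency times log(N / document frequency); library implementations often use smoothed variants, and the claim does not say which form is used. The function name `tfidf_vectors` is hypothetical, and the word2vec, textCNN, and variational autoencoder stages are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    vocab = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, vectors
```

With this weighting, a term that occurs in every document (such as `SELECT` in a corpus of queries) receives weight zero, so the vectors emphasize the tables, columns, and operators that distinguish one statement from another.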
7. The method of claim 1, wherein the prediction result can be optimized by updating the data set.
8. An N-gram-based SQL runtime prediction system, comprising:
a data acquisition module, used for acquiring SQL statements and operation information in the system to form a data set;
a data processing module, used for preprocessing the data set to obtain a sample set;
a corpus construction module, used for constructing a corpus from the sample set based on the N-gram;
a feature extraction module, used for extracting features from the corpus and reducing the dimension of the extracted features;
and an efficiency prediction module, used for completing the SQL operation time prediction with a Bayesian prediction model based on the dimension-reduced features.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method according to any of claims 1 to 7 are carried out when the program is executed by the processor.
CN202011643906.7A 2020-12-31 2020-12-31 SQL operation time prediction method and system based on N-gram Pending CN112667666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011643906.7A CN112667666A (en) 2020-12-31 2020-12-31 SQL operation time prediction method and system based on N-gram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011643906.7A CN112667666A (en) 2020-12-31 2020-12-31 SQL operation time prediction method and system based on N-gram

Publications (1)

Publication Number Publication Date
CN112667666A true CN112667666A (en) 2021-04-16

Family

ID=75412331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011643906.7A Pending CN112667666A (en) 2020-12-31 2020-12-31 SQL operation time prediction method and system based on N-gram

Country Status (1)

Country Link
CN (1) CN112667666A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052394A (en) * 2017-12-27 2018-05-18 福建星瑞格软件有限公司 The method and computer equipment of resource allocation based on SQL statement run time
CN108875366A (en) * 2018-05-23 2018-11-23 四川大学 A kind of SQL injection behavioral value system towards PHP program
CN109299248A (en) * 2018-12-12 2019-02-01 成都航天科工大数据研究院有限公司 A kind of business intelligence collection method based on natural language processing
CN109508441A (en) * 2018-08-21 2019-03-22 江苏赛睿信息科技股份有限公司 Data analysing method, device and electronic equipment
CN109902159A (en) * 2019-01-29 2019-06-18 华融融通(北京)科技有限公司 A kind of intelligent O&M statement similarity matching process based on natural language processing
CN111400338A (en) * 2020-03-04 2020-07-10 平安医疗健康管理股份有限公司 SQL optimization method, device, storage medium and computer equipment
CN111600919A (en) * 2019-02-21 2020-08-28 北京金睛云华科技有限公司 Web detection method and device based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
US11170179B2 (en) Systems and methods for natural language processing of structured documents
US9606984B2 (en) Unsupervised clustering of dialogs extracted from released application logs
CN112307741B (en) Insurance industry document intelligent analysis method and device
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
CN111274817A (en) Intelligent software cost measurement method based on natural language processing technology
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
US20220358379A1 (en) System, apparatus and method of managing knowledge generated from technical data
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN114818643A (en) Log template extraction method for reserving specific service information
CN113886562A (en) AI resume screening method, system, equipment and storage medium
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN114722198A (en) Method, system and related device for determining product classification code
CN114282513A (en) Text semantic similarity matching method and system, intelligent terminal and storage medium
CN117271558A (en) Language query model construction method, query language acquisition method and related devices
CN117271701A (en) Method and system for extracting system operation abnormal event relation based on TGGAT and CNN
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN116578700A (en) Log classification method, log classification device, equipment and medium
CN115934936A (en) Intelligent traffic text analysis method based on natural language processing
Zhai et al. TRIZ technical contradiction extraction method based on patent semantic space mapping
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination