CN112784273A

CN112784273A - SQL risk identification method, device and equipment

Info

Publication number: CN112784273A
Application number: CN202110183503.7A
Authority: CN
Inventors: 李海波; 夏雯君; 廖云; 卫欣音
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-02-10
Filing date: 2021-02-10
Publication date: 2021-05-11

Abstract

The embodiment of the specification provides a method, a device and equipment for SQL risk identification in the technical field of big data. The method comprises the steps of obtaining submitted SQL codes and judging whether risk SQL exists in the SQL codes or not based on preset code check rules; when the SQL code does not exist, the SQL code is issued to a test environment; calling a first risk identification model in the test environment to carry out risk identification on the SQL code; the first risk identification model is obtained based on SQL feature data training in a test environment; when risk SQL does not exist in the identification result, the SQL code is issued to the production environment; calling a second risk identification model in the production environment to carry out risk identification on slow SQL in the SQL codes to obtain a risk identification result; and the second risk identification model is obtained based on SQL characteristic data training in the production environment. By utilizing the embodiment of the specification, the low-efficiency SQL can be recognized accurately in a full-flow and all-round way.

Description

SQL risk identification method, device and equipment

Technical Field

The application relates to the field of big data, in particular to a method, a device and equipment for SQL risk identification.

Background

With the wide rise of big data technology, more and more banks establish enterprise-level big data technology middle platforms, the business of data lakes and data warehouses is increasingly complex, the data volume is exponentially growing, meanwhile, the applications of the data middle platforms are increasingly incorporated, and the performance and the stability of the data middle platforms are increasingly challenged as important indexes of the quality of the big data technology platforms. In the operation process of the big data technology platform, performance problems caused by low-efficiency SQL (Structured Query Language) are increasingly prominent, once performance problems occur in the production environment, the whole big data technology platform resources reach a bottleneck in a short time, and the performance and the stability of the platform are seriously influenced.

At present, the performance problem caused by low-efficiency SQL can only be analyzed, checked and solved after the fact, and the problems that the problem is not found timely, the problem solving period is long and the like exist, so that the normal operation of the service is influenced.

Therefore, there is a need for a solution to the above technical problems.

Disclosure of Invention

The embodiment of the specification provides an SQL risk identification method, an SQL risk identification device and SQL risk identification equipment, which can identify low-efficiency SQL in a full-flow, all-around and accurate mode and optimize performance, so that the problems of untimely problem discovery, long problem solving period and the like are effectively solved, and the service is normally operated.

The SQL risk identification method, the SQL risk identification device and the SQL risk identification equipment are realized in the following modes.

An SQL risk identification method comprises the following steps: acquiring submitted SQL codes, and judging whether risk SQL exists in the SQL codes or not based on a preset code check rule; when the SQL code does not exist, the SQL code is issued to a test environment; calling a first risk identification model in the test environment to carry out risk identification on the SQL code; the first risk identification model is obtained based on SQL characteristic data training in a test environment; when risk SQL does not exist in the identification result, the SQL code is issued to a production environment; calling a second risk identification model in the production environment to carry out risk identification on slow SQL in the SQL codes to obtain a risk identification result; wherein the second risk identification model is obtained based on SQL feature data training in a production environment; the slow SQL is the SQL exceeding a preset threshold value.

An SQL risk identification apparatus comprising: the acquisition module is used for acquiring the submitted SQL codes and judging whether the risk SQL exists in the SQL codes or not based on a preset code check rule; the first release module is used for releasing the SQL code to a test environment when the SQL code does not exist; the first identification module is used for calling a first risk identification model in the test environment to carry out risk identification on the SQL code; the first risk identification model is obtained based on SQL characteristic data training in a test environment; the second issuing module is used for issuing the SQL code to the production environment when no risk SQL exists in the identification result; the second identification module is used for calling a second risk identification model in the production environment to carry out risk identification on the slow SQL in the SQL codes to obtain a risk identification result; wherein the second risk identification model is obtained based on SQL feature data training in a production environment; the slow SQL is the SQL exceeding a preset threshold value.

An SQL risk identification device comprising at least one processor and a memory storing computer executable instructions that when executed by the processor implement the steps of the method of any one of the method embodiments of the present specification.

A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of any one of the method embodiments in the present specification.

The specification provides a SQL risk identification method, a SQL risk identification device and equipment. In some embodiments, the submitted SQL code may be acquired, whether risky SQL exists in the SQL code is determined based on a preset code check rule, and when it is determined that risky SQL does not exist, the SQL code may be issued to a test environment, and then a first risk identification model in the test environment is invoked to perform risk identification on the SQL code. And when the risk SQL does not exist in the identification result, the SQL code is issued to the production environment, and a second risk identification model in the production environment is called to carry out risk identification on the slow SQL in the SQL code, so that a risk identification result is obtained. By adopting the implementation scheme provided by the specification, the low-efficiency SQL can be recognized in a full-flow, all-around and accurate manner and the performance can be optimized, so that the problems of untimely problem discovery, long problem solving period and the like are effectively solved, and the service is normally operated.

Drawings

The accompanying drawings, which are included to provide a further understanding of the specification, are incorporated in and constitute a part of this specification, and are not intended to limit the specification. In the drawings:

FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a SQL risk identification method provided by the present specification;

FIG. 2 is a flowchart illustrating an embodiment of an SQL risk identification method provided by the present specification;

FIG. 3 is a block diagram of an embodiment of an SQL risk identification device provided in the present specification;

fig. 4 is a block diagram of a hardware structure of an embodiment of an SQL risk identification server provided by the present specification.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments in the present specification, and not all of the embodiments. All other embodiments that can be obtained by a person skilled in the art on the basis of one or more embodiments of the present description without inventive step shall fall within the scope of protection of the embodiments of the present description.

The following describes an embodiment of the present disclosure with a specific application scenario as an example. Specifically, fig. 1 is a schematic flow chart of an embodiment of an SQL risk identification method provided in this specification. Although the present specification provides the method steps or apparatus structures as shown in the following examples or figures, more or less steps or modules may be included in the method or apparatus structures based on conventional or non-inventive efforts.

One embodiment provided by the present specification can be applied to a client, a server, and the like. The client may include a terminal device, such as a smart phone, a tablet computer, and the like. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed system, and the like.

It should be noted that the following description of the embodiments does not limit the technical solutions in other extensible application scenarios based on the present specification. In a specific embodiment, as shown in fig. 1, in an embodiment of an SQL risk identification method provided in the present specification, the method may include the following steps.

S0: and acquiring the submitted SQL codes, and judging whether the risk SQL exists in the SQL codes or not based on a preset code check rule.

In some embodiments, after the developer finishes writing the SQL code, the developer may submit the SQL code through the SQL code submitter. The SQL code submitter can be used to submit SQL code, for example, a developer can submit SQL code through Git. Git is an open-source distributed version control system that can handle very small to very large project version management efficiently and at high speed.

In some embodiments, the server may obtain the submitted SQL code when the developer submits the SQL code, and determine whether risky SQL exists in the SQL code based on a preset code check rule. The preset code checking rules may also be referred to as SQL related performance risk rules, which may be used to provide SQL code rule checking services. The SQL code rule checking service may check whether risky SQL exists in the SQL code using preset code checking rules.

In some embodiments, the preset code inspection rules can be created by analyzing production performance problems in conjunction with industry big data performance cases. For example, in some implementations, when the preset code checking rule is created, SQL may be parsed with metadata. The metadata may include a data dictionary (e.g., database, data table, table field, primary key definition), data capacity (e.g., database capacity size, data table partition capacity size), statistical information (e.g., data table record number, data table partition and distribution key, data distribution situation), and the like.

In some implementation scenarios, after the preset code check rule is created, the preset code check rule can be stored in the code rule base, so that the preset code check rule can be directly called from the code rule base when needed subsequently, and thus the data processing efficiency can be effectively improved. The code rule base can be used for storing preset code checking rules, and can be an Oracle database, a MySQL database and the like.

In some embodiments, the preset code checking rules may include at least one of: the method comprises the steps that a Cartesian product is used for a large-capacity table, full-table scanning risk exists for the large-capacity table, no clipping exists in a query column of the large-capacity table, a partition limiting condition is not added to the partition table, a count (distint) function is used for the large-capacity table, a row _ number () function is used for the large-capacity table, an order by function is used for the large-capacity table, a collect _ set function is used for the large-capacity table, a collect _ list function is used for the large-capacity table, a unixtimemap function is used for the large-capacity table, a regular expression is used for the large-capacity table, a join field and a group by field are consistent but a performance optimization switch is not opened, non-primary key association risk exists in the large-capacity table, the number of parts of unions exceeds a standard threshold.

In some embodiments, when the SQL code is submitted by a developer, the SQL code rule checking service may be automatically triggered to determine whether risky SQL exists in the SQL code.

In some embodiments, determining whether risky SQL exists in the SQL code based on the preset code checking rule may include: and analyzing the SQL code by using a rule matching algorithm, then matching the analysis result with a preset code check rule, and when the matching is successful, indicating that the risk SQL exists in the SQL code, and if the risk SQL is positive or negative, the risk SQL does not exist. In some implementation scenarios, a rule matching algorithm corresponding to the code rule base may be constructed based on the code rule base and the metadata. Rule matching algorithms can be used to identify SQL where there is a performance risk.

In some embodiments, when it is determined that risky SQL exists in the SQL code, risky SQL may be intercepted, submitted rejected, and risk information returned for performance optimization by developers. The risk information may include a risk number, risk description information, a risk name, a risk row number, a risk level, an optimization suggestion, and the like.

In some implementation scenarios, after the developer optimizes the performance of the risky SQL, the optimized SQL may be submitted again through the SQL code submitter.

In the embodiment of the specification, risk identification is carried out when the SQL codes are pre-submitted through the preset code check rule, so that the low-efficiency SQL with performance risk can be found in time before release production, and the risk is avoided.

S2: and when the SQL code does not exist, releasing the SQL code to a test environment.

In this embodiment of the present specification, when it is determined that no risky SQL exists in the submitted SQL code, the SQL code may be issued to a test environment for risk identification. The test environment can be understood as an environment for performing performance test before the SQL code is on-line. The test environment is typically a configuration that replicates a production environment, and a program that is not working properly in the test environment will not be distributed to the production machines.

In some implementations, a monitoring program may be included in the testing environment, and the monitoring program may be configured to invoke the first risk identification model.

S4: calling a first risk identification model in the test environment to carry out risk identification on the SQL code; wherein the first risk identification model is obtained based on SQL feature data training in a test environment.

The first risk identification model can perform SQL risk prediction identification in a test environment. The first risk identification model may also be referred to as a test environment SQL risk identification prediction model. The first risk identification model may provide a test environment SQL risk identification prediction service.

In some embodiments, the first risk identification model may be obtained by: acquiring characteristic data associated with SQL in a test environment; distributing a category label to the SQL according to the operation and maintenance diagnosis information of the test environment; the category labels comprise inefficient SQL and non-inefficient SQL; training the XGboost model by using the characteristic data and the class labels of the SQL to obtain a first risk identification model; and in the training process, parameters of the model are continuously optimized based on the characteristic engineering.

In some implementations, each piece of feature data associated with SQL may include base feature data, risk feature data, SQL execution plan feature data, and/or the like. The basic feature data can be obtained by analyzing SQL statements in a test environment/production environment based on metadata. The risk characteristic data can be obtained by checking the SQL statement based on preset code checking rules. The SQL execution plan feature data may be obtained based on SQL execution plan parsing of the test environment/production environment.

In some implementation scenarios, the SQL statements in the test environment may be obtained, and then the SQL statements may be parsed in combination with the metadata and the statistical information of the corresponding environment to generate the characteristic base data related to the SQL performance. For example, the number of SQL statement nesting times, the number of SQL statement partition access times, the size of SQL statement partition access capacity, the number of SQL statement table access times, the number of SQL statement partition access times, the number of SQL statement table access times, the size of SQL statement table capacity, and the like.

In some implementation scenarios, the SQL code rule checker may be invoked to send an SQL statement as a parameter to the SQL code rule checker, and obtain risk characteristic data checked by the SQL code rule. For example, whether the large-capacity table in SQL uses cartesian products, whether the large-capacity table in SQL queries columns without clipping, whether the large-capacity table in SQL has full-table scan risk, high risk number, medium risk number, low risk number, etc. The SQL code rule checker can be obtained by packaging preset code check rules.

In some implementation scenarios, an SQL execution plan of the test environment may be obtained, and then the SQL execution plan may be parsed to generate SQL execution plan feature data. Such as SQL execution plan cost values, etc. Of course, the above description is only exemplary, the SQL-related feature data is not limited to the above examples, and other modifications may be made by those skilled in the art in light of the technical spirit of the present application, but the present application is intended to cover the scope of the present application as long as the achieved functions and effects are the same or similar to the present application.

In some implementations, the SQL may be labeled after the characteristic data associated with the SQL in the test environment is obtained. Wherein, labeling can be understood as assigning a category label, such as 1 or 0, to each SQL. In some implementation scenarios, it may be determined whether the related SQL is an inefficient SQL according to the operation and maintenance diagnostic information in the test environment, for example, whether the SQL running time exceeds a preset value. The preset value may be set according to an actual scene, which is not limited in this specification. For example, in some implementations, SQL categories that have run times that exceed a preset value may be labeled as 1, and SQL categories that have run times that do not exceed the preset value may be labeled as 0. Wherein 1 can represent that SQL is inefficient SQL, and 0 can represent that SQL is non-inefficient SQL. In some implementation scenarios, after the SQL is labeled, the category label may be stored as a feature field in the corresponding feature data.

In some implementation scenarios, after the feature data associated with SQL in the test environment is acquired and a category label is assigned to each SQL, the feature data associated with SQL may be structured to obtain a feature data set for subsequent model training, and then stored in a database. Each piece of feature data associated with SQL may include a plurality of feature fields, and each feature field may correspond to a feature factor importance. Wherein, the characteristic field can be understood as an index describing the SQL performance. The feature factor importance may identify how much the feature field affects SQL performance.

In some implementation scenarios, after obtaining the feature data set for model training, the feature data set may be divided into a training set and a test set, and then the XGBoost model is trained by using the training set, and the model obtained by training is verified by using the test set. For example, in some implementations, after obtaining the feature data set for model training, 80% of the data in the feature training set may be used as the training set, and the rest of the data may be used as the test set. The XGBoost is an ensemble learning algorithm, and the base model is generally a tree model, but other types of models such as logistic regression may be selected. Of course, the above description is only an exemplary illustration, and in the embodiments of the present disclosure, other machine learning models may be trained by using a training set to obtain the first risk identification model, and other modifications may be made by those skilled in the art within the spirit of the present disclosure, but all that is needed is to cover the scope of the present disclosure as long as the achieved functions and effects are the same as or similar to those of the present disclosure.

In some implementation scenarios, in the process of training the XGBoost model by using the training set, parameters of the training model may be continuously optimized through feature engineering, so that the risk identification accuracy of the obtained first risk identification model is higher.

In some implementation scenarios, the feature engineering optimization method may include: feature selection, feature transformation, feature combination, and the like. Wherein the selected feature field in the feature selection may be a field that results in inefficient SQL. For example, in some implementations, the selected feature field in the feature selection includes at least one of: SQL runtime, SQL execution start time, SQL execution end time, SQL length, SQL nesting times, SQL partition access times, SQL access table partition capacity, SQL access table partition number, SQL access table times, SQL access table number, SQL access table capacity, non-primary key association times, SQL execution plan cost value, SQL hit high risk times, SQL hit low risk times, risk category of SQL hit, SQL hit large capacity table times using Cartesian product, SQL hit large capacity table times with full table scanning risk, SQL hit table times with query column no clipping, SQL hit partition table times without partition condition, SQL hit large capacity table times using count (cancel) function, SQL hit large capacity table times using row _ number () function, SQL hit large capacity table times using order by ordering operation times, The number of times that the SQL hit large-capacity table uses a collect _ set function, the number of times that the SQL hit large-capacity table uses a collect _ list function, the number of times that the SQL hit large-capacity table uses a unixtimetamp function, the number of times that the SQL hit large-capacity table uses a regular expression, the number of times that the SQL hit join field is consistent with the group by field but the performance optimization switch is not opened, the number of times that the SQL hit large-capacity table has a non-primary key correlation risk, the number of times that the SQL hit join part exceeds a standard threshold, and the number of times that the SQL hit nested part exceeds the standard threshold. Of course, the above description is only exemplary, the feature fields selected in the training process are not limited to the above examples, and other modifications are possible for those skilled in the art in light of the technical spirit of the present application, and all that can be achieved is intended to be covered by the scope of the present application as long as the functions and effects achieved by the present application are the same as or similar to those of the present application. It should be noted that the feature fields selected in each training may be the same or different.

The feature transformation may be understood as transforming the feature field to derive a new feature field, for example, transforming the feature field to "SQL execution start time", converting the date into weeks, and generating a new feature field "SQL execution start time (weeks)", which takes values as examples: wednesday and saturday. The feature combination can be understood as that two or more feature fields are combined and calculated to derive a new feature field, for example, a new feature field "SQL hit risk total" is combined from "SQL hit high risk number", "SQL hit low risk number" as a new feature field.

In the embodiment of the description, the XGboost model is subjected to parameter adjustment through continuous feature selection, feature transformation and feature combination, and the model is continuously optimized, so that the accuracy of the obtained model is higher.

In some embodiments, the first identified risk model may be stored after it is obtained, and thus may be directly invoked when needed, thereby increasing processing efficiency.

In the embodiment of the present specification, after the SQL code is issued to the test environment, the first risk identification model in the test environment may be called to perform risk identification on the SQL code.

In some embodiments, when the submitted SQL code passes the check of the preset code check rule and does not have risk SQL, the SQL code may be issued to a test environment, a monitor in the test environment may automatically detect SQL with changed versions, call a first risk identification model in the test environment to perform risk identification on the SQL code, and return an identification result. The identification result may include whether the risk SQL exists, the importance of the corresponding feature factor when the risk SQL exists, and the like.

In some embodiments, the invoking a first risk identification model in the test environment to perform risk identification on the SQL code may include: analyzing the SQL code to obtain the SQL code comprising an SQL statement; analyzing the SQL statement based on metadata to obtain basic feature data corresponding to the SQL code; checking the SQL statement based on a preset code checking rule to obtain risk characteristic data corresponding to the SQL code; analyzing the SQL execution plan of the SQL statement to obtain the SQL execution plan characteristic data corresponding to the SQL code; and inputting basic characteristic data, risk characteristic data and SQL execution plan characteristic data corresponding to the SQL code into the first risk identification model for risk identification.

In some embodiments, after obtaining a result of the first risk identification model for performing risk identification on the SQL code, the identification result may be pushed to the research and development personnel, so that the research and development personnel perform targeted performance tuning according to the identification result.

In some implementation scenarios, after the research and development personnel optimize the performance of the risky SQL according to the recognition result, the optimized SQL can be submitted again through the SQL code submitter.

In the embodiment of the specification, risk identification is performed on the SQL code by using the first risk identification model in the test stage, so that low-efficiency SQL with performance risk can be found in time before release production, and the risk is avoided.

S6: and when the risk SQL does not exist in the identification result, the SQL code is issued to the production environment.

In the embodiment of the present specification, after the first risk identification model in the test environment is called to perform risk identification on the SQL code, the SQL code may be issued to the production environment when no risk SQL exists in the obtained identification result. The production environment can be understood as an environment in which SQL codes are released online. The production environment can formally provide external services, and generally turns off error reports and turns on error logs. A production environment is understood to be an environment that contains all the functions.

In some implementation scenarios, a monitoring and early warning program may be included in the test environment, and the monitoring and early warning program may be configured to invoke the second risk identification model.

In some implementation scenarios, after risk identification is performed on the SQL code by calling the SQL risk identification prediction service of the test environment, if no performance risk is predicted, the SQL code may be released on-line for production.

S8: calling a second risk identification model in the production environment to carry out risk identification on slow SQL in the SQL codes to obtain a risk identification result; wherein the second risk identification model is obtained based on SQL feature data training in a production environment; the slow SQL is the SQL exceeding a preset threshold value.

Wherein the second risk identification model can perform SQL risk prediction identification in the production environment. The second risk identification model may also be referred to as a production environment SQL risk identification prediction model. The second risk identification model may provide a production environment SQL risk identification prediction service.

In some embodiments, the second risk identification model may be obtained by: acquiring characteristic data associated with SQL in a production environment; allocating a category label to the SQL according to operation and maintenance diagnosis information of a production environment; the category labels comprise inefficient SQL and non-inefficient SQL; training the XGboost model by using the characteristic data and the class labels of the SQL to obtain a second risk identification model; and in the training process, parameters of the model are continuously optimized based on the characteristic engineering.

It should be noted that the training process of the second risk identification model is similar to the training process of the first risk identification model, and specific reference may be made to the embodiment of the training process of the first risk identification model in step S4, which is not described in detail herein.

In this embodiment of the present specification, after the SQL code is released to the production environment, the second risk identification model in the production environment may be called to perform risk identification on slow SQL in the SQL code.

In some embodiments, after the SQL code is released to the production environment, the monitoring and early warning program in the production environment may automatically detect the version-changed SQL, call the second risk identification model in the production environment to perform risk identification on the slow SQL in the SQL code, and return an identification result. The identification result may include whether the risk SQL exists, the importance of the corresponding feature factor when the risk SQL exists, and the like.

In some embodiments, said invoking a second risk identification model in the production environment to risk identify slow SQL in the SQL code may include: analyzing the SQL code to obtain the SQL code comprising an SQL statement; selecting SQL sentences exceeding a preset threshold value from the SQL sentences to obtain slow SQL; analyzing the slow SQL based on metadata to obtain basic characteristic data corresponding to the SQL code; checking the slow SQL based on a preset code checking rule to obtain risk characteristic data corresponding to the SQL code; analyzing the SQL execution plan of the slow SQL to obtain SQL execution plan characteristic data corresponding to the SQL code; and inputting the basic characteristic data, the risk characteristic data and the SQL execution plan characteristic data corresponding to the SQL code into the second risk identification model for risk identification. The slow SQL can be understood as SQL exceeding a preset threshold. The preset threshold may be set according to an actual scenario, for example, SQL for different application different service scenarios may set a suitable threshold, and for SQL exceeding the threshold, slow SQL may be defined.

In some embodiments, after obtaining a result of the second risk identification model for performing risk identification on slow SQL in the SQL codes, the identification result may be pushed to the research and development personnel, so that the research and development personnel perform targeted performance tuning according to the identification result.

In the embodiment of the specification, the risk identification is carried out on the slow SQL in the SQL codes by using the second risk identification model in the production stage, so that the low-efficiency SQL with the performance risk can be found in time in the production operation, and the risk is avoided.

In the embodiment of the specification, when the SQL code is identified to have the risk SQL based on a preset code check rule or a first risk identification model or a second risk identification model, the risk SQL and the corresponding characteristic field can be acquired; and optimizing the risk SQL based on the importance of the characteristic factor corresponding to the characteristic field to obtain the optimized SQL.

In some embodiments, after obtaining the optimized SQL, the optimized SQL may be submitted to a simulation environment; calling a preset code check rule, a first risk identification model and a second risk identification model in the simulation environment to carry out risk identification on the optimized SQL; when confirming that no risk exists, scheduling risk SQL and optimizing SQL; comparing the running time of the risk SQL and the optimization SQL, and determining whether the optimization SQL reaches an expected target or not based on the comparison result; and when the expected target is confirmed to be reached, releasing the optimized SQL to a production environment. Wherein the simulation environment may be used to simulate a real environment, such as a test environment, a production environment, and the like.

In some implementation scenarios, after the risk SQL is tuned, in order to verify whether risks still exist, the obtained optimized SQL may be submitted to a simulation environment, and after the simulation environment is identified and the SQL version is changed, a preset code check rule, a first risk identification model, and a second risk identification model may be automatically invoked to perform risk identification.

In some implementation scenarios, after risk identification is performed on the optimized SQL, if the identification result shows no risk, the SQL runtime of the SQL before optimization (risk SQL) and the SQL runtime of the SQL after optimization may be obtained, and then whether the performance of the SQL after optimization is improved is determined by comparing the corresponding SQL runtime. Of course, the above description is only exemplary, and the parameters representing the SQL performance are not limited to the SQL runtime, and those skilled in the art may make other modifications within the spirit of the present application, but all that can achieve the same or similar functions and effects as the present application shall be covered by the protection scope of the present application.

In some implementation scenarios, the SQL comparison verification scheduler may be used to verify whether the performance of the tuned SQL performs better than the performance of the SQL before optimization. The SQL comparison verification scheduler can automatically adjust cluster environment parameters and allocate queue resources, automatically synchronizes production data to a test environment through a data synchronization tool, keeps consistent in the process of operating the low-efficiency SQL and the optimized SQL, and respectively schedules the operating low-efficiency SQL and the optimized SQL. The low-efficiency SQL can be understood as the risk SQL.

In some implementation scenarios, after determining that the optimized performance of the SQL is improved, it may be further determined whether the corresponding performance of the SQL meets an expected target. In some implementations, after determining that the desired goal is achieved, a release version thereof may be brought online in the production environment. In some implementation scenarios, when it is determined that the expected target is not reached, performance optimization can be continued and performances before and after optimization are compared until the optimized performance reaches the expected target, and release production is carried out.

In the embodiment of the specification, in the training process of the machine learning model by using the feature data, the feature engineering including feature selection, feature transformation, feature combination and the like is developed, the parameters of the training model are continuously subjected to iterative optimization, and the risk identification models of the test environment and the production environment are respectively constructed, so that not only can the SQL risk prediction identification service of the test environment and the production environment be guaranteed, but also the accuracy of risk identification can be improved.

The embodiment of the specification not only can highly-automatically fall to the ground for practical application and realize accurate identification of low-efficiency SQL, but also supports efficient and quick comparison, test and verification of SQL before and after optimization, thereby effectively improving the efficiency of risk SQL test.

According to the embodiment of the specification, risk identification is carried out on the SQL code based on the preset code check rule and the risk identification model in the test environment, and the risk identification can be carried out on the SQL in the research, development and test stage, so that release production is avoided, and the risk is avoided in advance. After SQL is released and produced on line, early warning is timely carried out on slow SQL through production monitoring early warning measures, and therefore risk identification in the production environment operation stage is achieved. Therefore, the embodiment of the specification can identify the low-efficiency SQL with the performance risk in advance and in all directions in a full flow and all-round manner, and avoids the hidden danger of production performance, so that the performance and the stability of the big data application are protected.

In the embodiment of the specification, the low-efficiency SQL is identified based on the preset code inspection rule and the risk identification model, so that not only can the performance code risk be pushed, but also the performance tuning suggestion can be pushed, research personnel can accurately position the tuning direction of the problem SQL conveniently, and a corresponding tuning strategy can be formulated.

The embodiment of the specification provides a simulation environment for automatically testing the SQL performance, so that the performance problem caused by low-efficiency SQL in a production environment can be avoided to a great extent, and the performance and the stability of a large data technology platform are effectively protected.

In the embodiment of the specification, research and development testers only need to submit SQL to the scheduling server, the scheduling framework can automatically adjust cluster parameter configuration and allocate queue resources, synchronize data from a production environment to a test environment, automatically schedule SQL to run, record running index results and resource monitoring information, and store the running index results and the resource monitoring information in the database.

It is to be understood that the foregoing is only exemplary, and the embodiments of the present disclosure are not limited to the above examples, and other modifications may be made by those skilled in the art within the spirit of the present disclosure, and the scope of the present disclosure is intended to be covered by the claims as long as the functions and effects achieved by the embodiments are the same as or similar to the present disclosure.

The above method is described below with reference to a specific example, however, it should be noted that the specific example is only for better describing the present application and is not to be construed as limiting the present application. As shown in fig. 2, fig. 2 is a schematic flowchart of a specific embodiment of an SQL risk identification method provided in this specification. Specifically, in this embodiment, a feature optimization project may be performed on feature data by a feature selection method, a feature conversion method and a feature combination method based on the SQL feature data sets of the test environment and the production environment in advance, and the feature data sets are continuously trained by using a machine learning algorithm to obtain a qualified SQL risk recognition prediction model. The SQL risk identification and prediction model comprises a test environment SQL risk identification and prediction model and a production environment SQL risk identification and prediction model.

In this embodiment, when the user submits the SQL codes, the SQL code rule checking service may be automatically triggered to determine whether the submitted SQL codes have the performance risk SQL. The SQL with the performance risk is intercepted, submission is refused, the performance risk and performance tuning suggestion are pushed, and the SQL can be resubmitted after code optimization.

Further, when the SQL code rule checking service determines that the performance risk SQL does not exist in the submitted SQL codes, the SQL codes are issued to the test environment, an SQL version change identification tool in the test environment can automatically identify the newly issued SQL codes and generate SQL feature data, and then the test environment SQL risk identification prediction service is called to judge whether the performance risk SQL exists in the SQL codes. And if the performance risk exists, pushing the risk result and the importance of the characteristic factor which are predicted and identified to research and development personnel for performance tuning. And if the performance risk does not exist, releasing the production online.

Furthermore, after the production is released to be on line, a production monitoring early warning tool in the production environment can monitor the operation of the SQL, and the SQL risk identification and prediction service in the production environment is automatically called to judge whether the performance risk SQL exists in the slow SQL in the SQL code. And if the performance risk exists, pushing the risk result and the importance of the characteristic factor which are predicted and identified to research and development personnel for performance tuning. When monitoring the operation of the SQL, a proper threshold value can be set according to the SQL which applies different service scenes, and the SQL which exceeds the threshold value is diagnosed as slow SQL.

In this embodiment, after performing performance tuning on risky SQL identified by the SQL risk prediction and identification model, SQL performance comparison test verification may be performed in a simulation environment, and whether performance optimization achieves an expected effect is determined; and if the performance of the SQL after tuning is improved compared with the SQL before tuning and reaches the performance expectation, releasing the production on line.

From the above description, it can be seen that the SQL code submitted in the embodiment of the present application may be acquired, whether risky SQL exists in the SQL code is determined based on the preset code check rule, and when it is determined that risky SQL does not exist in the SQL code, the SQL code may be issued to the test environment, and then the first risk identification model in the test environment is invoked to perform risk identification on the SQL code. And when the risk SQL does not exist in the identification result, the SQL code is issued to the production environment, and a second risk identification model in the production environment is called to carry out risk identification on the slow SQL in the SQL code, so that a risk identification result is obtained. Compared with the prior art that the performance problem caused by the low-efficiency SQL can be only analyzed and checked afterwards, the embodiment of the specification can identify the low-efficiency SQL in a full flow, all-around and accurate mode and perform performance optimization, so that the problems are not found timely, the problem solving period is long and the like are effectively solved, and the service is normally operated.

In the present specification, each embodiment of the method is described in a progressive manner, and the same and similar parts in each embodiment may be joined together, and each embodiment focuses on the differences from the other embodiments. Reference is made to the description of the method embodiments.

Based on the SQL risk identification method, one or more embodiments of the present specification further provide an SQL risk identification device. The apparatus may include systems (including distributed systems), software (applications), modules, components, servers, clients, etc. that use the methods described in the embodiments of the present specification in conjunction with any necessary apparatus to implement the hardware. Based on the same innovative conception, embodiments of the present specification provide an apparatus as described in the following embodiments. Since the implementation scheme of the apparatus for solving the problem is similar to that of the method, the specific implementation of the apparatus in the embodiment of the present specification may refer to the implementation of the foregoing method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Specifically, fig. 3 is a schematic block diagram of an embodiment of an SQL risk recognition device provided in this specification, and as shown in fig. 3, the SQL risk recognition device provided in this specification may include: the system comprises an acquisition module 120, a first publishing module 122, a first identification module 124, a second publishing module 126, and a second identification module 128.

The obtaining module 120 may be configured to obtain the submitted SQL codes, and determine whether risky SQL exists in the SQL codes based on a preset code check rule;

a first issuing module 122, configured to issue the SQL code to a test environment when it is determined that the SQL code does not exist;

a first identification module 124, which can be used to invoke a first risk identification model in the test environment to perform risk identification on the SQL code; the first risk identification model is obtained based on SQL characteristic data training in a test environment;

the second publishing module 126 may be configured to publish the SQL code to the production environment when no risky SQL exists in the recognition result;

a second identification module 128, configured to invoke a second risk identification model in the production environment to perform risk identification on slow SQL in the SQL code, so as to obtain a risk identification result; wherein the second risk identification model is obtained based on SQL feature data training in a production environment; the slow SQL is the SQL exceeding a preset threshold value.

It should be noted that the above-mentioned description of the apparatus according to the method embodiment may also include other embodiments, and specific implementation manners may refer to the description of the related method embodiment, which is not described herein again.

The present specification also provides an embodiment of an SQL risk identification device, comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor, implement steps comprising: acquiring submitted SQL codes, and judging whether risk SQL exists in the SQL codes or not based on a preset code check rule; when the SQL code does not exist, the SQL code is issued to a test environment; calling a first risk identification model in the test environment to carry out risk identification on the SQL code; the first risk identification model is obtained based on SQL characteristic data training in a test environment; when risk SQL does not exist in the identification result, the SQL code is issued to a production environment; calling a second risk identification model in the production environment to carry out risk identification on slow SQL in the SQL codes to obtain a risk identification result; wherein the second risk identification model is obtained based on SQL feature data training in a production environment; the slow SQL is the SQL exceeding a preset threshold value.

It should be noted that the above-mentioned apparatuses may also include other embodiments according to the description of the method or apparatus embodiments. The specific implementation manner may refer to the description of the related method embodiment, and is not described in detail herein.

The method embodiments provided in the present specification may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Taking an example of the SQL risk identification device running on a server, fig. 4 is a hardware structure block diagram of an embodiment of an SQL risk identification server provided in this specification, where the server may be an SQL risk identification device or an SQL risk identification device in the foregoing embodiment. As shown in fig. 4, the server 10 may include one or more (only one shown) processors 100 (the processors 100 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA, etc.), a memory 200 for storing data, and a transmission module 300 for communication functions. It will be understood by those skilled in the art that the structure shown in fig. 4 is only an illustration and is not intended to limit the structure of the electronic device. For example, the server 10 may also include more or fewer components than shown in FIG. 4, and may also include other processing hardware, such as a database or multi-level cache, a GPU, or have a different configuration than shown in FIG. 4, for example.

The memory 200 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the SQL risk identification method in the embodiment of the present specification, and the processor 100 executes various functional applications and data processing by executing the software programs and modules stored in the memory 200. Memory 200 may include high speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 200 may further include memory located remotely from processor 100, which may be connected to a computer terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission module 300 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission module 300 includes a Network adapter (NIC) that can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission module 300 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The method or apparatus provided by the present specification and described in the foregoing embodiments may implement service logic through a computer program and record the service logic on a storage medium, where the storage medium may be read and executed by a computer, so as to implement the effect of the solution described in the embodiments of the present specification. The storage medium may include a physical device for storing information, and typically, the information is digitized and then stored using an electrical, magnetic, or optical media. The storage medium may include: devices that store information using electrical energy, such as various types of memory, e.g., RAM, ROM, etc.; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, bubble memories, and usb disks; devices that store information optically, such as CDs or DVDs. Of course, there are other ways of storing media that can be read, such as quantum memory, graphene memory, and so forth.

The embodiments of the SQL risk identification method and apparatus provided in this specification may be implemented in a computer by a processor executing corresponding program instructions, for example, implemented in a PC end using a c + + language of a windows operating system, implemented in a linux system, or implemented in an intelligent terminal using android, iOS system programming languages, implemented in processing logic based on a quantum computer, and the like.

It should be noted that descriptions of the apparatus, the device, and the system described above according to the related method embodiments may also include other embodiments, and specific implementations may refer to descriptions of corresponding method embodiments, which are not described in detail herein.

The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the hardware + program class embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment.

For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, when implementing one or more of the present description, the functions of some modules may be implemented in one or more software and/or hardware, or the modules implementing the same functions may be implemented by a plurality of sub-modules or sub-units, etc.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices, systems according to embodiments of the invention. It will be understood that the implementation can be by computer program instructions which can be provided to a processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.

The above description is merely exemplary of one or more embodiments of the present disclosure and is not intended to limit the scope of one or more embodiments of the present disclosure. Various modifications and alterations to one or more embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims.

Claims

1. A SQL risk identification method is characterized by comprising the following steps:

acquiring submitted SQL codes, and judging whether risk SQL exists in the SQL codes or not based on a preset code check rule;

when the SQL code does not exist, the SQL code is issued to a test environment;

calling a first risk identification model in the test environment to carry out risk identification on the SQL code; the first risk identification model is obtained based on SQL characteristic data training in a test environment;

when risk SQL does not exist in the identification result, the SQL code is issued to a production environment;

calling a second risk identification model in the production environment to carry out risk identification on slow SQL in the SQL codes to obtain a risk identification result; wherein the second risk identification model is obtained based on SQL feature data training in a production environment; the slow SQL is the SQL exceeding a preset threshold value.

2. The method of claim 1, wherein the preset code checking rules comprise at least one of: the method comprises the steps that a Cartesian product is used for a large-capacity table, full-table scanning risk exists for the large-capacity table, no clipping exists in a query column of the large-capacity table, a partition limiting condition is not added to the partition table, a count (distint) function is used for the large-capacity table, a row _ number () function is used for the large-capacity table, an order by function is used for the large-capacity table, a collect _ set function is used for the large-capacity table, a collect _ list function is used for the large-capacity table, a unixtimemap function is used for the large-capacity table, a regular expression is used for the large-capacity table, a join field and a group by field are consistent but a performance optimization switch is not opened, non-primary key association risk exists in the large-capacity table, the number of parts of unions exceeds a standard threshold.

3. The method of claim 1, wherein the first/second risk identification model is obtained by:

acquiring characteristic data associated with SQL in a test environment/production environment;

allocating a category label to the SQL according to the operation and maintenance diagnosis information of the test environment/the production environment; the category labels comprise inefficient SQL and non-inefficient SQL;

training the XGboost model by using the characteristic data and the class labels of the SQL to obtain a first risk identification model and a second risk identification model; and in the training process, parameters of the model are continuously optimized based on the characteristic engineering.

4. The method of claim 3, wherein the feature data comprises base feature data, risk feature data, SQL execution plan feature data; the basic feature data is obtained by analyzing SQL sentences in a test environment/a production environment based on metadata; the risk characteristic data is obtained by checking the SQL statement based on a preset code checking rule; the SQL execution plan feature data is obtained based on SQL execution plan analysis of a test environment/a production environment.

5. The method of claim 3, wherein the feature engineering includes feature selection, feature transformation, feature combination; wherein the selected feature field in the feature selection comprises at least one of: SQL runtime, SQL execution start time, SQL execution end time, SQL length, SQL nesting times, SQL partition access times, SQL access table partition capacity, SQL access table partition number, SQL access table times, SQL access table number, SQL access table capacity, non-primary key association times, SQL execution plan cost value, SQL hit high risk times, SQL hit low risk times, risk category of SQL hit, SQL hit large capacity table times using Cartesian product, SQL hit large capacity table times with full table scanning risk, SQL hit table times with query column no clipping, SQL hit partition table times without partition condition, SQL hit large capacity table times using count (cancel) function, SQL hit large capacity table times using row _ number () function, SQL hit large capacity table times using order by ordering operation times, The number of times that the SQL hit large-capacity table uses a collect _ set function, the number of times that the SQL hit large-capacity table uses a collect _ list function, the number of times that the SQL hit large-capacity table uses a unixtimetamp function, the number of times that the SQL hit large-capacity table uses a regular expression, the number of times that the SQL hit join field is consistent with the group by field but the performance optimization switch is not opened, the number of times that the SQL hit large-capacity table has a non-primary key correlation risk, the number of times that the SQL hit join part exceeds a standard threshold, and the number of times that the SQL hit nested part exceeds the standard threshold.

6. The method of claim 1, wherein said invoking a first risk identification model in the test environment to risk identify the SQL code comprises:

analyzing the SQL code to obtain the SQL code comprising an SQL statement;

analyzing the SQL statement based on metadata to obtain basic feature data corresponding to the SQL code;

checking the SQL statement based on a preset code checking rule to obtain risk characteristic data corresponding to the SQL code;

analyzing the SQL execution plan of the SQL statement to obtain the SQL execution plan characteristic data corresponding to the SQL code;

and inputting basic characteristic data, risk characteristic data and SQL execution plan characteristic data corresponding to the SQL code into the first risk identification model for risk identification.

7. The method of claim 1, wherein said invoking a second risk identification model in the production environment risk identifies slow SQL in the SQL code, comprising:

analyzing the SQL code to obtain the SQL code comprising an SQL statement;

selecting SQL sentences exceeding a preset threshold value from the SQL sentences to obtain slow SQL;

analyzing the slow SQL based on metadata to obtain basic characteristic data corresponding to the SQL code;

checking the slow SQL based on a preset code checking rule to obtain risk characteristic data corresponding to the SQL code;

analyzing the SQL execution plan of the slow SQL to obtain SQL execution plan characteristic data corresponding to the SQL code;

and inputting the basic characteristic data, the risk characteristic data and the SQL execution plan characteristic data corresponding to the SQL code into the second risk identification model for risk identification.

8. The method according to claim 1, when it is identified that the SQL code has risk SQL based on a preset code check rule or a first risk identification model or a second risk identification model, the method comprises:

acquiring risk SQL and corresponding characteristic fields;

and optimizing the risk SQL based on the importance of the characteristic factor corresponding to the characteristic field to obtain the optimized SQL.

9. The method of claim 8, wherein obtaining the optimized SQL comprises:

submitting the optimized SQL to a simulation environment;

calling a preset code check rule, a first risk identification model and a second risk identification model in the simulation environment to carry out risk identification on the optimized SQL;

when confirming that no risk exists, scheduling risk SQL and optimizing SQL;

comparing the running time of the risk SQL and the optimization SQL, and determining whether the optimization SQL reaches an expected target or not based on the comparison result;

and when the expected target is confirmed to be reached, releasing the optimized SQL to a production environment.

10. An SQL risk identification apparatus, comprising:

the acquisition module is used for acquiring the submitted SQL codes and judging whether the risk SQL exists in the SQL codes or not based on a preset code check rule;

the first release module is used for releasing the SQL code to a test environment when the SQL code does not exist;

the first identification module is used for calling a first risk identification model in the test environment to carry out risk identification on the SQL code; the first risk identification model is obtained based on SQL characteristic data training in a test environment;

the second issuing module is used for issuing the SQL code to the production environment when no risk SQL exists in the identification result;

the second identification module is used for calling a second risk identification model in the production environment to carry out risk identification on the slow SQL in the SQL codes to obtain a risk identification result; wherein the second risk identification model is obtained based on SQL feature data training in a production environment; the slow SQL is the SQL exceeding a preset threshold value.

11. An SQL risk identification device comprising at least one processor and a memory storing computer executable instructions which, when executed by the processor, implement the steps of the method of any of claims 1 to 9.

12. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 9.