WO2023236776A1

WO2023236776A1 - Method and apparatus for sub-optimal query identification and index optimizations post migration

Info

Publication number: WO2023236776A1
Application number: PCT/CN2023/096064
Authority: WO
Inventors: Manoj Kulkarni
Original assignee: Huawei Cloud Computing Technologies Co., Ltd.
Priority date: 2022-06-11
Filing date: 2023-05-24
Publication date: 2023-12-14

Abstract

A method (800) and an apparatus (900) for sub-optimal query identification and index optimizations post migration to cloud storage are disclosed. The method (S800) for sub-optimal query identification, said method comprising the steps of: determining (S801) query execution statistics on a source database (SRC-DB); determining (S802) query execution statistics on a destination database (DST-DB); and comparing (S803) the query execution statistics of the SRC-DB to the DST-DB.

Description

METHOD AND APPARATUS FOR SUB-OPTIMAL QUERY IDENTIFICATION AND INDEX OPTIMIZATIONS POST MIGRATION

TECHNICAL FIELD

The present subject matter described herein, in general, relates to cloud storage, and more particularly, it relates to a method and an apparatus for sub-optimal query identification and index optimizations post migration.

BACKGROUND

Cloud storage is a model of computer data storage in which the digital data is stored in logical pools, said to be on "the cloud" . The physical storage spans multiple servers (sometimes in multiple locations) , and the physical environment is typically owned and managed by a hosting company. These cloud storage providers are responsible for keeping the data available and accessible, and the physical environment secured, protected, and running. People and organizations buy or lease storage capacity from the providers to store user, organization, or application data. Cloud storage services may be accessed through a colocated cloud computing service, a web service application programming interface (API) or by applications that use the API, such as cloud desktop storage, a cloud storage gateway or Web-based content management systems.

Customers are now opting for cloud frameworks to reduce the costs associated with upgrades and maintenance of on-premises databases. Accordingly, the cloud database market is expected to grow exponentially in the coming years. Increasing number of customers moving/migrating their on-premises databases to Relational databases on the cloud brings a challenging situation on evaluating the performance/stability/functionality on the migrated Database. The more specific challenge with customers is to identify the possible solutions on how they may reduce the cost for their clients. First option for customers is to migrate from licensed database to open source databases, however there are quite a few challenges associated with them and the same are discussed hereinbelow:

a) Tool to migrate: no common tool is available to evaluate across multiple target data bases (DBs) . All commercial vendors support migrating to their commercial Databases at cost.

b) Cost of evaluation: a team with expert Database Administrators (DBA’s ) migration, optimizations has to be setup. The cost required to setup this team will be generally very high.

c) Feedback time: The time for feedback for the following is quite high.

● Migration;

● Execution of simulated application on new DB;

● Identification of bottlenecks, sub-optimal query are complex, time consuming and mostly identified late in the cycle; and

● Fixing and regressing adds to one more cycle.

d) Automatic report generation post migration: there exists no tool which may provide a detailed report on Sub-Optimal queries, recommend index changes for space/speed optimizations.

As part of migration (schema, application and data) to a cloud database, there are currently ongoing efforts to provide frameworks/tools/capabilities to evaluate and understand the data on the source system, but very few/non-existent frameworks/capabilities to determine how the migrated data and migrated SQL’s /applications /workload will behave on the target environment. There remains a definite possibility that the migrated queries may not meet performance goals or require further tuning which becomes an extended task post migration. This requires manual effort on part of the DBA/users, which is a time consuming and difficult activity. The live applications may contain 100s to 1000s of tables where each table may have multiple indices, and the overall applications may have queries running in terms of 1000s. Therefore, practically going through each table, evaluating and confirming the same would be practically impossible.

As per the state of art industry standards and trends, most optimizer plan (s) comparison methods and index recommendation methods are used within the same vendor specific database management systems (different versions) . In this context, there are workloads where plans from same or different versions of the same database may be compared. Based on the specific query performance index recommendation and tuning methods may be provided by the DBA’s , which involves manual work. Researches on automatic query performance improvement by adjusting indices is being carried out by Microsoft Cloud service team on SQL Database (Cloud deployed) . Currently, there exists no cloud service that is capable of helping customers to evaluate on multiple target database, saving cost, time and acquiring a more detailed in-depth report for migration.

The technical problems associated with the existing prior art techniques are discussed hereinbelow:

1. Very high effort to locate and late identification: Late and manual identification of sub-optimal queries generated by migration process for new Database (ex: Oracle to PG/MySQL/….. ) :

a. in practical cases, the user/customer would identify the performance impact of sub-optimal query on the complete application at later stage after migration; and

b. this would require manual effort by DBA’s to analyze complete application and identify the exact query causing the impact.

2. High wait time to conclude: User has to wait for complete application, and data migration to evaluate and confirm to proceed with new Target DB.

a. User cannot take the decision and get a holistic view on whether the decision to migrate to a new target DB is ok, with no impact. Repeated cost/effort to evaluate new target DB.

3. Challenge in suggesting Index optimization: Complex and error prone by DBA’s to understand complete DB schema to identify unused Index/fields within the index.

a. Over the course of time, there may be changes in the schema impacting the index (s) , which is not possible to be analyzed by DBA’s manually (when customer scripts have few 1000’s of tables) . These can be analyzed and optimized during migration to destination DB.

4. No Industry tool to recommend migration across stores/engines: There exists no recommendation tool for migration between different stores/engines.

5. Attract more customers towards Cloud Native deployment: With the capability of brining in this service into Cloud, one may attract customers directly to migrate their services/applications/databases on to the Huawei Cloud.

6. CAICT Performance Level-3: This meets one of the requirements of Application Object reconstruction –Level 3 capability (CAICT –China Academy of Information and Communications Technology) “Supports execution plan comparison analysis and performance comparison analysis of collected application SQL statements in the source and target databases” .

Accordingly, there is a need to automatically perform query cost comparison between source database and migrated target database and generate a report of sub-optimal queries as an output, so as to improve space utilization and/or query performance when compared to the conventional cloud migration techniques. There is also a need to provide a detailed report of query (where condition) and index on the columns along with index optimizations so as to arrive at a cost effective and systematic approach for identifying Index optimization.

The above-described need for sub-optimal query identification and index optimizations post migration is merely intended to provide an overview of some of the shortcomings of conventional systems /mechanism /techniques, and is not intended to be exhaustive. Other problems/shortcomings with conventional systems/mechanism/techniques and corresponding benefits of the various non-limiting embodiments described herein may become further apparent upon review of the following description.

SUMMARY

This summary is provided to introduce concepts related to a method and an apparatus for sub-optimal query identification and index optimizations post migration, and the same are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.

An objective of the present invention is to provide sub-optimal query identification and index optimizations post migration to cloud storage.

Another objective of the present invention is to automatically perform query cost comparison between source database and migrated target database to generate a report of sub-optimal queries as an output.

Yet another objective of the present invention is to provide suggestions for index optimizations, to improve storage space and query performance.

Yet another objective of the present invention is to compare query cost on two different engines/vendors of the source and target databases as a part of the migration process before the actual data migration and application migration is put into production.

In particular, the present invention discloses a method and an apparatus for sub-optimal query identification and index optimizations post migration to cloud storage to ameliorate drawbacks associated with the prior arts.

According to first aspect of the invention, there is provided a method for sub-optimal query identification and index optimizations post migration. The method comprising the steps of: determining query execution statistics on a source database (SRC-DB) ; determining query execution statistics on a destination database (DST-DB) ; and comparing the query execution statistics of the SRC-DB to the DST-DB.

In a first possible implementation of the method according to the first aspect, determining (S801) the query execution statistics on the SRC-DB comprises the steps of: executing a query on the SRC-DB; extracting and capturing a query plan from the executed query; and parsing the query plan to obtain query information from the SRC-DB.

In a second possible implementation of the method according to the first aspect, determining (S802) the query execution statistics on the DST-DB comprises the steps of: executing a query on the DST-DB; extracting and capturing a query plan from the executed query; and parsing the query plan to obtain query information from the DST-DB.

In a third possible implementation of the method according to the first aspect, the query information comprises table schema, table data stats/metrics and table catalog.

In a fourth possible implementation of the method according to the first aspect, comparing (S803) the query execution statistics of the SRC-DB to the DST-DB comprises the steps of: extracting from the query plan, cost associated with executing a query on the SRC-DB; extracting from the query plan, cost associated with executing a query on the DST-DB; and comparing the cost associated with executing a query on the SRC-DB with cost associated with executing a query on the DST-DB.

According to second aspect of the invention, there is provided an apparatus for sub-optimal query identification and index optimizations post migration. The apparatus comprises a source query execution unit adapted to determine query execution statistics on a source database (SRC-DB) ; a destination query execution unit adapted to determine query execution statistics on a destination database (DST-DB) ; and a comparison unit adapted to compare the query execution statistics of the SRC-DB to the DST-DB.

In a first possible implementation of the apparatus according to the second aspect, the source query execution unit is adapted to: executes a query on the SRC-DB; extracts and captures a query plan from the executed query; and parses the query plan to obtain query information from the SRC-DB.

In a second possible implementation of the apparatus according to the second aspect, the destination query execution unit is adapted to: execute a query on the DST-DB; extracts and captures a query plan from the executed query; and parses the query plan to obtain query information from the DST-DB.

In a third possible implementation of the apparatus according to the second aspect, the comparison unit is adapted to: extract from the query plan, cost associated with executing a query on the SRC-DB; extract from the query plan, cost associated with executing a query on the DST-DB; and compare the cost associated with executing a query on the SRC-DB with cost associated with executing a query on the DST-DB.

In a fourth possible implementation of the apparatus according to the second aspect, the query information comprises table schema, table data stats/metrics and table catalog.

According to third aspect of the invention, there is provided a computer readable storage medium for sub-optimal query identification and index optimizations post migration. The computer readable storage medium stores a computer program, and when the computer program runs on a processor, the processor performs the method as described in the first aspect of the invention.

Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the invention.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the digit (s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer like features and components.

Figure 1 illustrates a block-diagram of data collection and data processing across various different stages of a sub-optimal query identification system, in an implementation of the present invention.

Figure 2 illustrates a block-diagram of a typical service deployment on Cloud, in an implementation of the present invention.

Figure 3 illustrates a block-diagram of a general query plan for sub-optimal query identification, in an implementation of the present invention.

Figure 4 illustrates a flow-chat of the process flow for sub-optimal query identification, in an implementation of the present invention.

Figure 5 illustrates a sample Query plan on Source DB and Destination DB, in an implementation of the present invention.

Figure 6 illustrates a flow-chat of the sequence to identify the Index optimizations, in an implementation of the present invention.

Figure 7 illustrates a graphical representation of columns and index columns of the table, in an implementation of the present invention.

Figure 8 illustrates flow-chart of the method for sub-optimal query identification and index optimizations post migration, in accordance with another embodiment of the present invention.

Figure 9 illustrates a block diagram of the apparatus for sub-optimal query identification and index optimizations post migration, in accordance with another embodiment of the present invention.

It is to be understood that the attached drawings are for purposes of illustrating the concepts of the invention and may not be to scale.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

The following clearly describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.

The present invention can be implemented in numerous ways, as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the invention.

Although embodiments of the invention are not limited in this regard, discussions utilizing terms such as, for example, “processing, ” “computing, ” “calculating, ” “determining, ” “establishing” , “analysing” , “checking” , or the like, may refer to operation (s) and/or process (es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that may store instructions to perform operations and/or processes.

Although embodiments of the invention are not limited in this regard, the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more” . The terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like. Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.

In the present invention, “cloud storage” refers to a cloud computing model that stores data on the Internet through a cloud computing provider who manages and operates data storage as a service.

In the present invention, “index optimization” refers to identification of unused /unnecessary indices. This optimization will help in storage space reduction and can help in performance improvement.

In the present invention, “sub-optimal query” refers to a query performance that is substandard. There is an opportunity to modify the query to improve its performance.

In the present invention, “application migration” refers to the process of moving software applications from one computing environment to another. In Database migration there are two aspects, (i) Data (ii) Application. The code which gets executed for storing/modifying/retrieval is called the application. In general, these are specific to a particular database, like functions/procedures/system views are specifically designed for better performance and are proprietary. Migrating this complete SQL code from one database to another database is called Application Migration.

In the present invention, “data migration” refers to the process of moving data from one database to another. As discussed above, Database migration involves two aspects, (i) Data (ii) Application. The data which is stored in the database needs to be migrated if the user decides to migrated from one database to another. There are specific data types or formats supported by different database vendors, to make the data available from one database to another to make the application operate on top of this data is called "Data Migration" .

In the present invention, “Query” refers the SELECT statement in SQL. This statement is used to fetch the data.

In the present invention, “Query plan” refers the plan prepared on how a query needs to be executed.

A method and an apparatus for sub-optimal query identification and index optimizations are disclosed. While aspects are described for allowing customers to evaluate the migration in the short time, with least amount of infrastructure deployment, and a detailed report when migrating from a high cost license to free/open source databases, the present invention may be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary systems, devices/nodes/apparatus, and methods.

Henceforth, embodiments of the present disclosure are explained with the help of exemplary diagrams and one or more examples. However, such exemplary diagrams and examples are provided for the illustration purpose for better understanding of the present disclosure and should not be construed as limitation on scope of the present disclosure.

In state of the art techniques and trends, optimizer plan (s) comparison methods and index recommendation methods are used within the same vendor specific database management systems (different versions) . In this context, there are workloads where plans from same or different versions of the same database may be compared. Based on the specific query performance index recommendation and tuning methods may be provided by the DBA’s , which involves manual work. Researches on automatic query performance improvement by adjusting indices is being carried out by Microsoft Cloud service team on SQL Database (Cloud deployed) . Currently, there exists no cloud service that is capable of helping customers to evaluate on multiple target database, saving cost, time and acquiring a more detailed in-depth report for migration.

The drawbacks associated with the conventional techniques involve manually scanning through 1000s of tables, index and queries to prepare such details is practically impossible, prone to errors and requires enormous effort.

Accordingly, there is a need to compare query cost on two engines of the source and target databases (different vendor/engines) as part of the migration process before the actual data migration and application migration is put into production so as to arrive at a cost effective and systematic approach for identifying index optimization. The present invention discloses a method and an apparatus for sub-optimal query identification and index optimizations post migration by running customer queries on both source and destination DB’s , collecting plan information, and generating a report to indicate the sub-optimal queries after migration based on cost.

The present invention discloses running customer queries on the source database and destination database, collecting plan information (the format, information will be different on different DBs) and generating a report to indicate the sub-optimal queries after migration based on the cost. These do require user intervention to optimize the queries further. Scanning the complete database schema, identifying the unused index/fields or suggest new indexes/fields addition in existing indexes –optimizations for efficient storage usage and queries performances. Further, when the cost of the query is higher in the destination database when compared to the source database, re-generating query with optimizations to give lower cost. If the same cannot be done, this query is added to the list of sub-optimal queries.

Significantly, the target DB for migration is decided by the user/customer. Further, data similar to the source database is re-created using the table statistics, metrics and table catalogue information. Notably, any commercially available tool to generate the similar data on the migrated DB may be employed and accordingly, the same is not discussed herein. Moreover, the present invention is not a database management system (DBMS) specific index recommendation engine or comparison framework.

Figure 1 illustrates a block-diagram of the data collection and data processing across the various different stages of the sub-optimal query identification system, in an implementation of the present invention. The SQL-pre parser will extract the table schema, table data stats/metrics and table catalog from the source database (SRC-DB) . This information is passed through the migration tool (which is a 3^rd party component) to convert the SQL scripts as per the destination database (DST-DB) . By using the Table Schema, Table Data Stats and Table catalog, the simulated Data Generator tool (which is a 3^rd party component) simulates the data on which the Migrated SQLs are executed to get the plans. The Query Executor &Plan comparison module executes the Migrated SQL on the DST DB and get the plan details. The plan from the SRC DB and DST DB is parsed, to extract the required information. The Query Executor &Plan Cost comparison module parses the plan from SRC DB and DST DB, prepare the data in tabular format for user understanding. Based on the cost involved, if the SRC-DB Cost is lower than the cost in the DST DB, this marks said particular Query for Manual Intervention from the user, as illustrated in Figure 5. Using the Table Schema (Table columns, Indices, index fields &SQLs) -a matrix is prepared to identify if:

a) There are any columns which needs to be included in the Index (Ex: #1 in the Figure 7) ; and

b) Any index field not used in any of the Queries (Ex: #3 in Figure 7) .

Figure 2 illustrates a block-diagram of a typical service deployment on Cloud, in an implementation of the present invention. In particular, how the technique of sub-optimal query identification is deployed as a service on the cloud. This is a typical service deployment on cloud with three major layers, i.e., the bottom layer, middle layer and the top layer. Significantly, the bottom most layer is the new service, while the middle layer is an infrastructure and the top most layer is the UI (user interface) . Further, there are additional supporting services/components like zookeeper, Kafka for messaging, Databases for SRC &DST, Repository DB, and the like to maintain information on the cloud. From the UI the LB (Load Balance) can be used to control the load.

The core premise of the present invention is to automatically running the customer queries in the source DB and the migrated Structured Query Language (SQL) queries on the target DB, parsing the query plan and collecting all key information such as but not limited to cost, search path, sequential/index search, kind of index search, and approximate rows returned. Populating the cost of the queries from source DB and target DB in order to identify the sub-optimal queries.

Significantly, preparing a report to list down the sub-optimal queries on the target DB, involves user/human intervention so as to review/optimize them further. Based on the tools maturity, tool may perform a few query optimizations based on the target DB architectures.

Notably, MySQL is incapable of predicting how many rows may come in result between two values, whereas PG has better predictability. Few databases such as ORACLE can create dynamic hash index when there are sub-queries involved for quicker execution.

In the present invention, a table is created by scanning through target DB SQL queries, to identify columns used in the ‘where’ condition of the query and a separate table to identify the columns used in the indices. Machine learning techniques are employed to analyse the source and target DBs so as to identify the following:

a) Unused indices, as illustrated in Figure 7;

b) Fields in the indices which do not constitute to the where condition in queries; and

c) Dominant fields which are part of the where condition in queries but not part of any indices.

■ Space optimization by deleting unused indices, field’s part of indices.

■ Query performance improvement by optimization of the indices.

The complete solution proposed in the invention may be deployed as a service in Cloud. This would allow users/customers to evaluate the migration in a short time frame, with least infrastructure deployment, and a detailed report, thereby aiding customers to migrate from a high cost license to free/open source databases. The customer can evaluate a plurality of target databases with very less effort in both person and/or cost. Advantages of utilizing the Cloud infrastructure to evaluate the target DB in itself is a significant advantage while the rest of the benefits associated with Cloud storage is already known and the same is not discussed herein.

Figure 3 illustrates a block-diagram of the general query plan, in an implementation of the present invention. In particular, it provides a pictorial representation of how plan can be visualized. The Leaf node of the plan is executed first, and thereafter, the data is propagated to the upper layer. Accordingly, the Est (estimated) row count and Est (estimated) cost indicates the weight-age and numbers of rows which can result from these nodes. The same information is populated in the Table format in figure b. A simple query plan is considered for understanding. There are three depths of the query execution. Hash join is applied on the result set obtained from Hash join and Index Scan, the next level of Hash join is done from a result set of two index scan. To start with, the overall cost of the query is considered:

Est Cost: 35, Est Rows: 200

Inferences from the above result:

1. There are queries with WHERE condition on the columns which are not part of any indices designed. (C3) –A new index may be designed or this column may be added as part of the existing Indices (which most of the queries have where condition) to improve the query performance;

2. There are queries with WHERE condition on the columns which are part of the indices; and

3. There are indices on the columns which are not part of any WHERE condition of the queries. (C25, C27) –This may be optimized to improve the space utilization and WRITE operations on the table.

Table 1: Sample table schema

Figure 4 illustrates a flow-chat of the process-flow for sub-optimal query identification, in an implementation of the present invention. This method comprises the following steps:

1.a. executing the Query on the SRC-DB;

1.b. extracting and capturing the query plan from the executed Query; and

1.c. parsing the query plan to obtain the details from the SRC-DB.

2.a. executing the Query on the DST-DB;

2.b. extracting and capturing the query plan from the executed Query; and

2.c. parsing the query plan to obtain the details from the DST-DB.

Thereafter query plan information from SRC-DB and DST-DB is fed to the Query Executor &Plan cost comparison module for comparison. In particular, these query plans are compared to check if the cost of query execution is more in SRC-DB or DST-DB.

Figure 5 illustrates a sample Query plan on Source DB and Destination DB, in an implementation of the present invention. The same query is executed is executed on Source DB and Destination DB to obtain the query plan. The information (i.e., Table Schema, Table catalogue and SQL Queries) from the query plan is extracted and populated in a tabular format. For example, when the Actual Time on Source DB is less than the Actual time on the Destination DB, the query is marked for Manual Intervention. As the performance of the query is slower in the Destination DB. This needs to be investigated by the DBA for cause and optimize the query further which is not part of the present invention.

Figure 6 illustrates a flow-chat of the sequence to identify the Index optimizations, in an implementation of the present invention. The SQL Pre-Parser extracts all the table information, viz: Table Schema, Table catalogue and SQL Queries from the source (SRC) DB. The SQL Pre-Parser then parses through this information and prepare a matrix. From this matrix a visual representation may be created to indicate:

1. Query has a where condition on a field which is not part of any index (Qry_wt is > 1 &idx_wt =0) ---This helps in performance optimization.

2. Query does not use the field on which the index is created (Qry_wt =0 &idx_wt > 0) ---This helps in space optimization.

When both are non-zero, then it indicates there exists a query which where condition on the field which is also part of the index field.

The complete schema of the database after migration is prepared, which is fed to a Machine Learning algorithm to analyze and give important insights about the query fields –index fields relationships. The parser will scan through the complete database to understand the table schema (Table + Index) and queries executed on the database. The output are 2 data files which majorly consists of sparse matrix.

The input data file is attached for reference:

Query Fields table:

1 denotes the column is part of the where condition in the query.

Index Fields table:

1 denotes the column is part of the index designed.

These data files are provided as an input to the machine learning algorithm. The first step is the DATA CLEANING, where all headers (column/row names) are removed.

This data is further processed to identify the columns on which the query is present and columns on which the indices are available.

The columns which are part of the query and is not part of any indices –is recommended/suggested to be looked up to add in the indices to improve performance.

The columns which are part of the indices and is not part of any of the query condition, is suggested to be looked and removed from the indices to improve the performance of WRITE operations.

Assume table statistics as shown below:

Table 2: Sample table statistics

Figure 7 illustrates a graphical representation of columns and index columns of the table, in an implementation of the present invention. The idx_wt (orange) or Qry_wt (blue) lines indicate the fields being part of the index and fields on which query has where condition respectively. Only idx_wt indicates the index exists, but no query has where condition on these fields. Only Qry_wt line indicates the fields used in where condition in this specific query is not part of any index. Accordingly, both of these conditions require user intervention in the form of index or space optimization which is beyond the scope of this document.

The method for sub-optimal query identification and index optimizations formats the data files and parses the contents of the input data file in the human readable format/visualization. It can be further extended to a pure Machine learning algorithm which has been trained/tested with more data. The method is limited to parsing the data and visualizing the data.

Figure 8 illustrates a flow-chart of the method (800) for sub-optimal query identification and index optimizations post migration, in accordance with a second embodiment of the present invention. The method (800) comprises the following steps:

Step 801: determining query execution statistics on a source database (SRC-DB) ;

Step 802: determining query execution statistics on a destination database (DST-DB) ; and

Step 803: comparing the query execution statistics of the SRC-DB to the DST-DB.

The step of determining (S801) the query execution statistics on the SRC-DB comprises the steps of: executing a query on the SRC-DB; extracting and capturing a query plan from the executed query; and parsing the query plan to obtain query information from the SRC-DB. The step of determining (S802) the query execution statistics on the DST-DB comprises the steps of: executing a query on the DST-DB; extracting and capturing a query plan from the executed query; and parsing the query plan to obtain query information from the DST-DB. The step of comparing (S803) the query execution statistics of the SRC-DB to the DST-DB comprises the steps of: extracting from the query plan, cost associated with executing a query on the SRC-DB; extracting from the query plan, cost associated with executing a query on the DST-DB; and comparing the cost associated with executing a query on the SRC-DB with cost associated with executing a query on the DST-DB. The query information comprises table schema, table data stats/metrics and table catalog.

Figure 9 illustrates a block diagram of the apparatus (900) for sub-optimal query identification and index optimizations post migration, in accordance with an embodiment of the present invention. The apparatus (900) comprising: a source query execution unit (901) adapted to determine query execution statistics on a source database (SRC-DB) ; a destination query execution unit (902) adapted to determine query execution statistics on a destination database (DST-DB) ; and a comparison unit (903) adapted to compare the query execution statistics of the SRC-DB to the DST-DB. The source query execution unit (901) is adapted to: executes a query on the SRC-DB; extracts and captures a query plan from the executed query; and parses the query plan to obtain query information from the SRC-DB. The destination query execution unit (902) is adapted to: executes a query on the DST-DB; extracts and captures a query plan from the executed query; and parses the query plan to obtain query information from the DST-DB. The comparison unit (903) is adapted to: extract from the query plan, cost associated with executing a query on the SRC-DB; extract from the query plan, cost associated with executing a query on the DST-DB; and compare the cost associated with executing a query on the SRC-DB with cost associated with executing a query on the DST-DB. The query information comprises table schema, table data stats/metrics and table catalog.

An embodiment of the present invention further provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and when the computer program runs on a processor, the processor performs the method as described in steps 801 to 803 of Figure 8.

Further, a computer readable medium, comprising a computer program, which when executed by a computer, causing the computer to carry out the method steps is disclosed in accordance with another embodiment of the present invention.

The present invention also discloses a computer readable medium, comprising a computer program, which when executed by a computer, causes the computer to carry out the method for sub-optimal query identification and index optimizations post migration as described hereinabove.

The present invention finds its application in the field of database migrations. The technical effect offered by the present invention is that Efficiency in terms of Quality, Time and Reliability of the SQLs (Application) migrated is substantially improved. The SQL performance will be evaluated automatically, quite early in the stage of migration, pre-network verification &network deployment. Currently there are no tools that are capable of supporting opensource/commercial databases for this kind of evaluation. All commercial database vendors want to protect their customer base. The present invention enables post migration evaluation from any DB vendor to any DB vendor and different DB engines.

Some of the non-limiting advantages of the present invention are indicated hereinbelow:

It provides a systematic and holistic approach for identifying the sub-optimal query post migration;

It provides a cost effective and systematic approach for identifying index optimization;

The same report may be used by users/customers to improve the query performance based on the index optimization report. Which is more pictorial and give 1 single view; and

It ensures that the user need not invest months of effort to evaluate and identify the issues.

Some of the future implementations of the present invention are indicated hereinbelow:

It may be further enhanced as a supplementary service on the cloud, which aids the user to save the cost/effort in post migration analysis, identifying the sub-optimal queries, and storage issues; and

It may be further enhanced to automatically evaluate the correctness of the data migrated from different tool or enhancement.

A person skilled in the art may understand that any known or new algorithms be used for the implementation of the present invention. However, it is to be noted that, the present invention provides a method and an apparatus for sub-optimal query identification and index optimizations post migration to cloud storage to achieve the above mentioned benefits and technical advancement irrespective of using any known or new algorithms.

A person of ordinary skill in the art may be aware that in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular applications and design constraint conditions of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of the present invention.

It may be clearly understood by a person skilled in the art that for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, method or system may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer node (which may be a personal computer, a server, or a network node) to perform all or a part of the steps of the methods described in the embodiment of the present invention. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.

Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices that are in communication with each other may communicate directly or indirectly through one or more intermediaries.

When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate) , it will be readily apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.

Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.

Although implementations for sub-optimal query identification and index optimizations post migration to cloud storage have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations of allowing customers to evaluate the migration in the short time, with least amount of infrastructure deployment, and a detailed report when migrating from a high cost license to free/open source databases.

Claims

A method (S800) for sub-optimal query identification, said method comprising the steps of:

determining (S801) query execution statistics on a source database (SRC-DB) ;

determining (S802) query execution statistics on a destination database (DST-DB) ; and

comparing (S803) the query execution statistics of the SRC-DB to the DST-DB.
The method as claimed in claim 1, wherein determining (S801) the query execution statistics on the SRC-DB comprises the steps of:

executing a query on the SRC-DB;

extracting and capturing a query plan from the executed query; and

parsing the query plan to obtain query information from the SRC-DB.
The method as claimed in claim 1, wherein determining (S802) the query execution statistics on the DST-DB comprises the steps of:

executing a query on the DST-DB;

extracting and capturing a query plan from the executed query; and

parsing the query plan to obtain query information from the DST-DB.
The method as claimed in claim 2 or 3, wherein the query information comprises table schema, table data stats/metrics and table catalog.
The method as claimed in claim 1, wherein comparing (S803) the query execution statistics of the SRC-DB to the DST-DB comprises the steps of:

extracting from the query plan, cost associated with executing a query on the SRC-DB;

extracting from the query plan, cost associated with executing a query on the DST-DB; and

comparing the cost associated with executing a query on the SRC-DB with cost associated with executing a query on the DST-DB.
An apparatus (900) for sub-optimal query identification, said method comprising the steps of:

a source query execution unit (901) adapted to determine query execution statistics on a source database (SRC-DB) ;

a destination query execution unit (902) adapted to determine query execution statistics on a destination database (DST-DB) ; and

a comparison unit (903) adapted to compare the query execution statistics of the SRC-DB to the DST-DB.
The apparatus (900) as claimed in claim 6, wherein the source query execution unit (901) is adapted to:

executes a query on the SRC-DB;

extracts and captures a query plan from the executed query; and

parses the query plan to obtain query information from the SRC-DB.
The apparatus (900) as claimed in claim 6, wherein the destination query execution unit (902) is adapted to:

executes a query on the DST-DB;

extracts and captures a query plan from the executed query; and

parses the query plan to obtain query information from the DST-DB.
The apparatus (900) as claimed in claim 7 or 8, wherein the query information comprises table schema, table data stats/metrics and table catalog.
The apparatus (900) as claimed in claim 6, wherein the comparison unit (903) is adapted to:

extract from the query plan, cost associated with executing a query on the SRC-DB;

extract from the query plan, cost associated with executing a query on the DST-DB; and

compare the cost associated with executing a query on the SRC-DB with cost associated with executing a query on the DST-DB.
A computer readable storage medium, wherein the computer readable storage medium stores a computer program, and when the computer program runs on a processor, the processor performs the method as claimed in any one of claims 1 to 5.

.