CN116108025B

CN116108025B - Data virtualization performance optimization method

Info

Publication number: CN116108025B
Application number: CN202310398765.4A
Authority: CN
Inventors: 王聪明; 王三明; 胡小敏; 李成坤; 赵伟帆; 尹文祥
Original assignee: Qiye Cloud Big Data Nanjing Co ltd; Anyuan Technology Co ltd
Current assignee: Qiye Cloud Big Data Nanjing Co ltd; Anyuan Technology Co ltd
Priority date: 2023-04-14
Filing date: 2023-04-14
Publication date: 2023-08-01
Anticipated expiration: 2043-04-14
Also published as: CN116108025A

Abstract

The invention relates to the technical field of data virtualization, and in particular to a data virtualization performance optimization method. An algorithm analyzes business data and monitors metadata to decide the computation scheme, so that the optimization scheme is both general-purpose and self-learning. Specifically, a rule model is designed by combining metadata technology, index optimization, caching, and SQL decomposition and pushdown, while also allowing manually annotated rules, so that generation of the model can be influenced manually. The method generalizes well: as long as the data sources are the same, the rule model remains essentially unchanged. As samples accumulate over time, the hit probability of the model increases and rule optimization completes automatically. A manual annotation management module allows specific problems to be solved by customization, so that both customization and automatic identification are supported.

Description

Data virtualization performance optimization method

Technical Field

The invention relates to the technical field of data virtualization, in particular to a data virtualization performance optimization method.

Background

Enterprise data is siloed and scattered across traditional data warehouses, enterprise applications, big data lakes, operational data stores, the cloud, and so on, which poses great challenges to business teams. Existing general-purpose solutions rely mainly on generic SQL optimization, have a relatively weak understanding of the business data, and must be adjusted manually and dynamically according to actual conditions.

In the prior art, rules must be formulated according to actual business conditions. If they are formulated manually, they cannot be migrated to other settings; and because they are business-dependent, they have to be adjusted dynamically, so the rules are never settled and their effect is not timely.

Disclosure of Invention

The invention aims to provide a data virtualization performance optimization method that solves the problems described in the Background.

The technical scheme of the invention is as follows: a data virtualization performance optimization method, comprising the steps of:

S1, labeling the optimal strategy;

S2, establishing a manual rule base;

S3, predicting with a strategy decision model;

S4, storing the sample data.

Preferably, S1 includes:

S11, first defining a strategy evaluation model, in which an evaluation index weight vector $W = (w_1, w_2, \dots, w_m)$ is defined according to business requirements; the larger the value of $w_j$, the greater the weight of the corresponding index. The evaluation value $S$ is calculated as:

$$S = Z W^{T};$$

S12, executing the SQL statement under each of 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine) to obtain 5 evaluation index vectors (Z vectors), $Z_1, Z_2, Z_3, Z_4, Z_5$;

and obtaining the evaluation value $S_i$ corresponding to each $Z_i$ from the strategy evaluation model, the optimization strategy with the largest evaluation value being the optimal strategy, from which the vector Y is obtained.
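The labeling step can be sketched as follows, assuming the evaluation value of a strategy is the weighted sum of its evaluation indexes. The strategy identifiers, the number of indexes (4), the weights, and all index values are hypothetical and chosen only for illustration; the indexes are assumed to be normalized so that higher is better.

```python
from typing import Sequence

# Hypothetical identifiers for the 5 optimization strategies named in S12.
STRATEGIES = [
    "add_index",
    "data_cache",
    "shard_database_table",
    "change_execution_mode",
    "change_execution_engine",
]


def evaluation_value(z: Sequence[float], w: Sequence[float]) -> float:
    """Weighted sum of one evaluation index vector z with the weights w."""
    if len(z) != len(w):
        raise ValueError("z and w must have the same length")
    return sum(z_j * w_j for z_j, w_j in zip(z, w))


def label_optimal_strategy(
    z_per_strategy: Sequence[Sequence[float]], w: Sequence[float]
) -> int:
    """Return the index of the strategy with the largest evaluation value."""
    scores = [evaluation_value(z, w) for z in z_per_strategy]
    return max(range(len(scores)), key=scores.__getitem__)


# Assumed weights and measurements for a single SQL statement:
# one row per strategy, one column per evaluation index.
w = [0.4, 0.3, 0.2, 0.1]
z_for_one_sql = [
    [0.9, 0.2, 0.5, 0.1],
    [0.6, 0.8, 0.7, 0.4],
    [0.3, 0.3, 0.9, 0.9],
    [0.5, 0.5, 0.5, 0.5],
    [0.1, 0.9, 0.2, 0.6],
]
y = label_optimal_strategy(z_for_one_sql, w)
print(STRATEGIES[y])
```

Repeating this for every sampled SQL statement yields one label per sample, which is how a vector Y of length n would be assembled under these assumptions.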

Preferably, S2 includes manually specifying business rules based on actual business experience; if a rule is hit, the result is returned directly, and if no rule is hit, the rule base provides no optimization strategy.
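One way to picture the manual rule base is as an ordered list of predicates over the features of an SQL statement. This is only a sketch: the rule contents, the feature keys (`table_label`, `single_table_rows`), and the strategy names are hypothetical, since the source does not specify how rules are represented.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# A rule pairs a predicate over SQL features with the strategy to return.
Rule = Tuple[Callable[[Dict[str, Any]], bool], str]

# Hypothetical hand-written rules; real rules would come from business experience.
MANUAL_RULES: List[Rule] = [
    (lambda f: f.get("table_label") == "dictionary_table", "data_cache"),
    (lambda f: f.get("single_table_rows", 0) > 100_000_000, "shard_database_table"),
]


def match_manual_rule(
    features: Dict[str, Any], rules: List[Rule] = MANUAL_RULES
) -> Optional[str]:
    """Return the strategy of the first rule that hits, or None on a miss."""
    for predicate, strategy in rules:
        if predicate(features):
            return strategy
    return None


hit = match_manual_rule({"table_label": "dictionary_table"})
miss = match_manual_rule({"table_label": "stream_table", "single_table_rows": 10})
print(hit, miss)
```

Returning `None` on a miss keeps the "no optimization strategy" case explicit, so a caller can decide what to do next.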

Preferably, S3 comprises the steps of:

S31, collecting business-related data;

S32, data preprocessing;

S33, model training and parameter tuning;

S34, model prediction.

Preferably, S4 includes, when the model performs well, directly storing the 13 features of the SQL statement together with the optimal strategy predicted by the model.

Preferably, S4 includes, when the model does not discriminate clearly, labeling the sample according to the optimal strategy labeling procedure and then storing the 13 features of the SQL statement together with the labeling result.

Preferably, S31 includes forming an X matrix from the SQL execution type, historical execution metrics, table metadata structure, table statistics, table lineage, and custom table label data in the database, and then calculating the optimal strategy vector Y according to the strategy evaluation model; X and Y together form the input dataset of the model.
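A possible in-memory shape for one row of X is sketched below, using the thirteen features enumerated in claim 1. The field names, Python types, and units are assumptions made for illustration; the source only names the features.

```python
from dataclasses import astuple, dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SqlFeatures:
    """One sample (row of X); field names and units are hypothetical."""
    table_count: int              # x1: number of tables in the query
    table_relation: str           # x2: join / union
    computation_type: str         # x3: filtering or aggregation
    execution_time_ms: float      # x4: historical execution time
    execution_frequency: float    # x5: how often the SQL runs
    result_set_size: int          # x6: size of historical result sets
    index_status: str             # x7: index situation
    data_type: str                # x8: field types, whether a BLOB is present
    single_table_rows: int        # x9: data volume of a single table
    change_frequency: float       # x10: change frequency of single-table data
    lineage_similarity: float     # x11: lineage similarity
    lineage_reference_count: int  # x12: number of times the table is referenced
    table_label: str              # x13: dimension / dictionary / time-series / stream


def build_dataset(
    records: Sequence[SqlFeatures], labels: Sequence[int]
) -> Tuple[List[tuple], List[int]]:
    """Pair each feature record with its optimal-strategy label."""
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    return [astuple(r) for r in records], list(labels)


sample = SqlFeatures(2, "join", "aggregation", 850.0, 12.0, 10_000,
                     "no_index", "varchar", 5_000_000, 0.1, 0.7, 4,
                     "dimension_table")
X, Y = build_dataset([sample], [1])
print(len(X[0]), Y)
```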

Preferably, S32 includes removing redundant data and converting text labels to numeric values.
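The two preprocessing operations can be illustrated with the standard library alone. The sketch assumes that "redundant data" means exact duplicate rows and that text labels are mapped to integer codes in sorted order; the source specifies neither detail.

```python
from typing import Dict, List, Tuple


def remove_duplicates(rows: List[tuple]) -> List[tuple]:
    """Drop exact duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def encode_labels(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct text label to an integer code (sorted order)."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


rows = [("join", "dimension_table"), ("union", "stream_table"),
        ("join", "dimension_table")]
unique_rows = remove_duplicates(rows)
codes, mapping = encode_labels([relation for relation, _ in unique_rows])
print(unique_rows, codes, mapping)
```

Keeping the returned `mapping` matters in practice: the same codes have to be reused when new SQL statements are encoded for prediction.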

Preferably, S33 includes dividing the preprocessed data into a training set and a test set in a 7:3 ratio and then training with the XGBoost algorithm, with the objective parameter set to multi:softmax and the num_class parameter set to 5; the remaining parameters are tuned by grid search for the optimal hyperparameters, and the model is then saved.
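The split and the search space can be sketched without calling XGBoost itself. In the snippet, `objective` and `num_class` take the values stated in the text, while the tuned hyperparameters (`max_depth`, `learning_rate`, `n_estimators`, named as in XGBoost's scikit-learn wrapper), their candidate values, and the random seed are assumptions.

```python
import itertools
import random
from typing import Any, Dict, Iterator, List, Sequence, Tuple


def split_train_test(
    samples: Sequence[Any], train_ratio: float = 0.7, seed: int = 42
) -> Tuple[List[Any], List[Any]]:
    """Shuffle the samples and split them 7:3 by default."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(round(len(samples) * train_ratio))
    return ([samples[i] for i in indices[:cut]],
            [samples[i] for i in indices[cut:]])


# Fixed parameters stated in the text.
BASE_PARAMS: Dict[str, Any] = {"objective": "multi:softmax", "num_class": 5}

# Hypothetical search grid for the remaining hyperparameters.
SEARCH_GRID: Dict[str, List[Any]] = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}


def candidate_params(
    base: Dict[str, Any], grid: Dict[str, List[Any]]
) -> Iterator[Dict[str, Any]]:
    """Yield one full parameter dict per point of the grid."""
    keys = sorted(grid)
    for combo in itertools.product(*(grid[k] for k in keys)):
        params = dict(base)
        params.update(zip(keys, combo))
        yield params


train_set, test_set = split_train_test(list(range(10)))
grid = list(candidate_params(BASE_PARAMS, SEARCH_GRID))
print(len(train_set), len(test_set), len(grid))
```

Each dict in `grid` would be passed to a training run, and the combination with the best validation score kept and saved.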

Preferably, S34 includes predicting SQL statements under different conditions using the data preprocessing method of S32 and the model trained in S33, obtaining the optimal strategy and its corresponding probability.
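Turning per-class scores into "optimal strategy plus probability" can be sketched as below. Note that in XGBoost's native API the `multi:softmax` objective outputs only the predicted class, whereas `multi:softprob` outputs one probability per class, so obtaining the probability may require the latter or a wrapper method such as `predict_proba`. The raw scores and strategy names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

STRATEGIES = ["add_index", "data_cache", "shard_database_table",
              "change_execution_mode", "change_execution_engine"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw per-class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_strategy(
    probabilities: Sequence[float], names: Sequence[str]
) -> Tuple[str, float]:
    """Return the most probable strategy and its probability."""
    idx = max(range(len(probabilities)), key=probabilities.__getitem__)
    return names[idx], probabilities[idx]


# Hypothetical raw scores for one SQL statement, one per strategy.
probs = softmax([0.2, 2.0, 0.1, -1.0, 0.5])
name, prob = best_strategy(probs, STRATEGIES)
print(name, round(prob, 2))
```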

Compared with the prior art, the data virtualization performance optimization method provided by the invention has the following improvements and advantages:

First: the invention designs an algorithm that decides the computation scheme by analyzing business data and monitoring metadata, so that the optimization scheme is both general-purpose and self-learning; specifically, a rule model is designed by combining metadata technology, index optimization, caching, and SQL decomposition and pushdown, while also allowing manually annotated rules, so that generation of the model can be influenced manually;

Second: the invention generalizes well, and the rule model remains essentially unchanged as long as the data sources are the same; as samples accumulate over time, the hit probability of the model increases and rule optimization completes automatically; a manual annotation management module allows specific problems to be solved by customization, so that both customization and automatic identification are supported.

Drawings

The invention is further explained below with reference to the drawings and examples:

FIG. 1 is a flow chart of a data virtualization performance optimization method of the present invention;

FIG. 2 is a diagram of a policy decision model in accordance with the present invention.

Detailed Description

The following detailed description of the present invention clearly and fully describes the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention provides an improved data virtualization performance optimization method, with the following technical scheme:

As shown in FIGS. 1-2, a data virtualization performance optimization method includes the following steps:

S1, labeling the optimal strategy, which specifically includes:

S11, first defining a strategy evaluation model, in which an evaluation index weight vector $W = (w_1, w_2, \dots, w_m)$ is defined according to business requirements; the larger the value of $w_j$, the greater the weight of the corresponding index. The evaluation value $S$ is calculated as:

$$S = Z W^{T};$$

S12, executing the SQL statement under each of 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine) to obtain 5 evaluation index vectors, namely Z vectors (n dimensions), which are distinguished by recording them as:

$$Z_1, Z_2, Z_3, Z_4, Z_5;$$

The strategy evaluation model gives the evaluation value $S_i$ corresponding to each $Z_i$; the optimization strategy corresponding to the maximum evaluation value $\max_i S_i$ is the optimal strategy, from which the vector Y is obtained. $W^{T}$ denotes the transpose of $W$;

S2, establishing a manual rule base: business rules are specified manually according to actual business experience; if a rule is hit, the result is returned directly, and if no rule is hit, the rule base provides no optimization strategy;

S3, predicting with a strategy decision model, which specifically includes the following steps:

S31, collecting business-related data: the SQL execution type, historical execution metrics, table metadata structure, table statistics, table lineage, and custom table label data in the database form the X matrix, and the optimal strategy vector Y (n dimensions) is then calculated according to the strategy evaluation model; X and Y together form the input dataset of the model;

S32, data preprocessing, including removing redundant data and converting text labels to numeric values;

S33, model training and parameter tuning: the preprocessed data is divided into a training set and a test set in a 7:3 ratio and then trained with the XGBoost algorithm, with the objective parameter set to multi:softmax (multi-class classification) and the num_class parameter set to 5 (the number of classes, corresponding to the number of optimization strategy categories); the remaining parameters are tuned by grid search for the optimal hyperparameters, and the model is then saved;

S34, model prediction: SQL statements under different conditions are predicted using the data preprocessing method of S32 and the model trained in S33, giving the optimal strategy and its corresponding probability;

S4, storing the sample data: when the model performs well, the 13 features of the SQL statement and the optimal strategy predicted by the model are stored directly; when the model does not discriminate clearly, the sample is labeled according to the optimal strategy labeling procedure, and the 13 features of the SQL statement and the labeling result are then stored.
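The storage decision in S4 can be sketched as a simple routing function on the predicted probability, as claim 1 describes. The threshold value, the in-memory list used as the sample store, and the `relabel` callback standing in for the strategy evaluation model are all assumptions; the source gives no cut-off.

```python
from typing import Callable, Dict, List, Sequence

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off; not specified in the source


def store_sample(
    features: Sequence[float],
    predicted_strategy: int,
    probability: float,
    relabel: Callable[[Sequence[float]], int],
    sample_store: List[Dict[str, object]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> int:
    """Store the 13 features with either the model's label or a re-computed one."""
    if len(features) != 13:
        raise ValueError("expected the 13 features of the SQL statement")
    if probability >= threshold:
        # Confident prediction: keep the model's strategy as the label.
        strategy, source = predicted_strategy, "model"
    else:
        # Unclear prediction: fall back to the optimal-strategy labeling step.
        strategy, source = relabel(features), "evaluation_model"
    sample_store.append(
        {"features": list(features), "strategy": strategy, "source": source}
    )
    return strategy


store: List[Dict[str, object]] = []
features = [0.0] * 13
confident = store_sample(features, 1, 0.93, lambda f: 3, store)
unsure = store_sample(features, 1, 0.41, lambda f: 3, store)
print(confident, unsure, [s["source"] for s in store])
```

Samples stored this way can feed later retraining, which is how the hit probability could grow as data accumulates.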

With the above scheme, the invention designs an algorithm that decides the computation scheme by analyzing business data and monitoring metadata, so that the optimization scheme is both general-purpose and self-learning; specifically, a rule model is designed by combining metadata technology, index optimization, caching, and SQL decomposition and pushdown, while also allowing manually annotated rules, so that generation of the model can be influenced manually;

The method generalizes well, and the rule model remains essentially unchanged as long as the data sources are the same; as samples accumulate over time, the hit probability of the model increases and rule optimization completes automatically; a manual annotation management module allows specific problems to be solved by customization, so that both customization and automatic identification are supported.

The stratified cross-validation shown in fig. 2 is stratified k-fold cross-validation, and specifically includes the following steps:

dividing the dataset into K parts stratified by class, so that the class proportions in each part match those of the original dataset; selecting one of the K parts as the test set and using the remaining K-1 parts as the training set; training the XGBoost model on the training set and evaluating its performance metrics (Macro-F1, Macro-Precision, Macro-Recall) on the test set; repeating these steps K times, each time with a different part as the test set; and averaging the K sets of test results as the estimate of model accuracy, which serves as the model's performance metric under the current K-fold cross-validation.
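The procedure above can be sketched in plain Python. This version deals the samples of each class round-robin into K folds without shuffling and reports only Macro-F1. The `fit_predict` callback is a hypothetical stand-in for training and querying the XGBoost model, and the toy labels are made up.

```python
from collections import defaultdict
from typing import Callable, List, Sequence


def stratified_folds(labels: Sequence[int], k: int) -> List[List[int]]:
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    f1_values = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


def cross_validate(
    labels: Sequence[int],
    k: int,
    fit_predict: Callable[[List[int], List[int]], List[int]],
) -> float:
    """Average Macro-F1 over k stratified train/test rounds."""
    folds = stratified_folds(labels, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        predictions = fit_predict(train_idx, test_idx)
        scores.append(macro_f1([labels[t] for t in test_idx], predictions))
    return sum(scores) / k


labels = [0] * 6 + [1] * 3
folds = stratified_folds(labels, k=3)
# Stand-in "model" that always predicts class 0.
score = cross_validate(labels, 3, lambda train_idx, test_idx: [0] * len(test_idx))
print(folds, round(score, 3))
```

Macro averaging treats every strategy class equally, which penalizes a model that ignores rarely optimal strategies; the always-class-0 stand-in scores poorly for that reason.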

The parameters used in the above scheme are described as follows:

$X = (x_1, x_2, \dots, x_{13})$ represents the 13 feature vectors of the SQL statement (where each $x_i$ is an n-dimensional vector and n is the number of samples); Y (n dimensions) represents the optimal strategy vector; Z (n × 4 dimensions) represents the features collected when the SQL statement is executed. The meaning of each sub-parameter is shown in Table 1.

Table 1 parameter description table

The previous description is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data virtualization performance optimization method, characterized by comprising the following steps:

S1, labeling the optimal strategy, which includes:

S11, first defining a strategy evaluation model, in which an evaluation index weight vector $W = (w_1, w_2, \dots, w_m)$ is defined according to business requirements; the larger the value of $w_j$, the greater the weight of the corresponding index; the evaluation value $S$ is calculated as:

$$S = Z W^{T};$$

S12, executing the SQL statement under each of 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine) to obtain 5 evaluation index vectors (Z vectors), $Z_1, Z_2, Z_3, Z_4, Z_5$;

obtaining from the strategy evaluation model the evaluation value $S_i$ corresponding to each $Z_i$, the optimization strategy corresponding to the maximum evaluation value being the optimal strategy, from which the optimal strategy vector Y is obtained; $W^{T}$ denotes the transpose of $W$;

S2, establishing a manual rule base, which includes manually specifying business rules according to actual business experience; if a rule is hit, the result is returned directly, and if no rule is hit, the rule base provides no optimization strategy;

S31, collecting business-related data, including forming an X matrix from the SQL execution type, historical execution metrics, table metadata structure, table statistics, table lineage, and custom table label data in the database, and then calculating the optimal strategy vector Y according to the strategy evaluation model, X and Y together forming the input dataset of the model;

S32, data preprocessing;

S33, model training and parameter tuning;

S4, storing the sample data: when the probability corresponding to the optimal strategy is large, the model performs well, and the 13 features of the SQL statement and the optimal strategy predicted by the model are stored directly; when the probability corresponding to the optimal strategy is small, the model does not discriminate clearly, the sample is labeled according to the optimal strategy labeling procedure, and the 13 features of the SQL statement and the labeling result are stored; wherein the matrix formed by the 13 feature vectors of the SQL statement is expressed as $X = (x_1, x_2, \dots, x_{13})$, and the 13 feature vectors of the SQL statement include:

$x_1$ is the number of tables, i.e., the number of tables involved in the query;

$x_2$ is the table relationship, expressed as join/union;

$x_3$ is the computation type, including filtering and aggregation;

$x_4$ is the execution time, i.e., the historical execution time of the SQL;

$x_5$ is the execution frequency;

$x_6$ is the result set size, i.e., the order of magnitude of the SQL's historical query results;

$x_7$ is the index status;

$x_8$ is the data type, i.e., the table field types and whether a BLOB is contained;

$x_9$ is the data volume of a single table;

$x_{10}$ is the change frequency of single-table data;

$x_{11}$ is the lineage similarity;

$x_{12}$ is the lineage reference count, i.e., the number of times the table is referenced;

$x_{13}$ is the table label, including dimension table, dictionary table, time-series table, and stream table.

2. The data virtualization performance optimization method according to claim 1, wherein S32 includes removing redundant data and converting text labels to numeric values.

3. The data virtualization performance optimization method according to claim 1, wherein S33 includes dividing the preprocessed data into a training set and a test set in a 7:3 ratio and then training with the XGBoost algorithm, with the objective parameter set to multi:softmax and the num_class parameter set to 5; the remaining parameters are tuned by grid search for the optimal hyperparameters, and the model is then saved.