CN116108025B - Data virtualization performance optimization method - Google Patents
- Publication number
- CN116108025B (application number CN202310398765.4)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- strategy
- sql
- optimization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data virtualization, and in particular to a data virtualization performance optimization method. An algorithm decides the computation scheme by analyzing service data and monitoring metadata, so that the optimization scheme is both general-purpose and self-learning. Specifically, a rule model is designed by combining metadata technology, index optimization, caching, and SQL decomposition and pushdown, and manually annotated rules are also supported so that model generation can be intervened in. The method generalizes well: the rule model remains essentially unchanged as long as the data sources are the same. As samples accumulate over time, the hit probability of the model increases and rule optimization is completed automatically. A manual annotation management module allows specific problems to be solved by customization, so that both customization and automatic identification are covered.
Description
Technical Field
The invention relates to the technical field of data virtualization, in particular to a data virtualization performance optimization method.
Background
Enterprise data is siloed and scattered across traditional data warehouses, enterprise applications, big data lakes, operational data stores, the cloud and other locations, which poses great challenges to business teams. Existing general-purpose solutions rely heavily on generic SQL optimization, have a relatively weak understanding of the business data, and must be adjusted manually and dynamically according to actual conditions.
In the prior art, the rules must be formulated according to actual service conditions, and manually formulated rules cannot meet migration requirements. Because the rules depend on the business, they must be adjusted dynamically, so the rules are never fixed and their effect is not timely.
Disclosure of Invention
The invention aims to provide a data virtualization performance optimization method to solve the problems in the background technology.
The technical scheme of the invention is as follows: a data virtualization performance optimization method, comprising the steps of:
S1, label the optimal strategy;
S2, establish a manual rule base;
S3, predict using a strategy decision model;
S4, store the sample data.
Preferably, S1 includes:
S11, first define a strategy evaluation model: an evaluation index weight vector is defined according to service requirements, where a larger component value means a larger weight. The evaluation value is calculated by a formula that was an image in the original publication and is not reproduced in this text;
S12, execute the SQL statement under each of 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine), obtaining 5 evaluation index vectors (Z vectors);
then obtain the corresponding evaluation values from the strategy evaluation model. The optimization strategy with the maximum evaluation value is the optimal strategy, which yields the vector Y.
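The labeling in S11 and S12 can be sketched as follows. This is only an illustration under stated assumptions: the evaluation value is taken to be a plain weighted sum of the evaluation indices (the original formula is not reproduced in this text), the indices are assumed to be pre-scaled so that larger is better, and the strategy identifiers, weights and index values are all made up.

```python
# Hypothetical sketch of optimal-strategy labeling (S11-S12).
# Assumption: evaluation value = weighted sum of the evaluation indices.

STRATEGIES = [
    "add_index",
    "data_cache",
    "shard_db_table",
    "change_execution_mode",
    "change_execution_engine",
]


def evaluation_value(z, weights):
    """Weighted sum of one evaluation index vector z."""
    if len(z) != len(weights):
        raise ValueError("z and weights must have the same length")
    return sum(w * v for w, v in zip(weights, z))


def label_optimal_strategy(z_vectors, weights):
    """Return (index, name, scores) for the strategy with the maximum evaluation value."""
    scores = [evaluation_value(z, weights) for z in z_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, STRATEGIES[best], scores


# Made-up weights and one made-up index vector per strategy.
weights = [0.5, 0.25, 0.125, 0.125]
z_vectors = [
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.0, 0.5, 0.0],
]
best_index, best_name, scores = label_optimal_strategy(z_vectors, weights)
```

Repeating this for every sampled SQL statement gives one label per sample, which is how a vector Y of optimal strategies could be assembled.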
Preferably, S2 includes manually specifying business rules based on actual business experience; if a rule is hit, its result is returned directly, and if no rule is hit, the rule base provides no optimization strategy.
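A rule base of this kind might be sketched as an ordered list of predicates. The `Rule` type, the `lookup_rule` helper and the example rule are hypothetical; the source does not say how rules are represented or matched.

```python
# Hypothetical sketch of the manual rule base lookup (S2).
from typing import Callable, List, NamedTuple, Optional


class Rule(NamedTuple):
    name: str
    matches: Callable[[str], bool]  # predicate over the SQL text
    strategy: str                   # strategy returned on a hit


def lookup_rule(sql: str, rules: List[Rule]) -> Optional[str]:
    """Return the strategy of the first rule that hits, or None on a miss."""
    for rule in rules:
        if rule.matches(sql):
            return rule.strategy
    return None


# One made-up rule: queries on tables prefixed "dict_" use the data cache.
rules = [
    Rule(
        name="dictionary tables are cached",
        matches=lambda sql: "dict_" in sql.lower(),
        strategy="data_cache",
    ),
]

hit = lookup_rule("SELECT * FROM dict_country", rules)
miss = lookup_rule("SELECT * FROM orders", rules)
```

A miss returns `None`, which leaves the decision to the strategy decision model of S3.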
Preferably, S3 comprises the steps of:
S31, collect service-related data;
S32, preprocess the data;
S33, train the model and tune its parameters;
S34, predict with the model.
Preferably, S4 includes, when the model performs well, directly storing the 13 features of the SQL statement together with the optimal strategy predicted by the model.
Preferably, S4 includes, when the model does not discriminate clearly, labeling the sample by the optimal strategy labeling method and then storing the 13 features of the SQL statement together with the labeling result.
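The two storage branches of S4 can be sketched as a single function. The 0.8 threshold is an assumption (the source only distinguishes a good model effect from an unclear one), and `relabel` stands in for the optimal strategy labeling of S1.

```python
# Hypothetical sketch of sample storage (S4).
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, not taken from the source


def store_sample(features, predicted_strategy, probability, relabel, store):
    """Append (features, label) to store, relabeling when confidence is low."""
    if len(features) != 13:
        raise ValueError("expected the 13 features of the SQL statement")
    if probability >= CONFIDENCE_THRESHOLD:
        label = predicted_strategy   # model effect is good: keep its prediction
    else:
        label = relabel(features)    # unclear prediction: fall back to labeling
    store.append((tuple(features), label))
    return label


store = []
features = [0.0] * 13
confident = store_sample(features, 1, 0.93, relabel=lambda f: 3, store=store)
uncertain = store_sample(features, 1, 0.41, relabel=lambda f: 3, store=store)
```

Either way a labeled sample is stored, so the training set keeps growing as statements are processed.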
Preferably, S31 includes forming an X matrix from the SQL execution type, historical execution indices, table metadata structure, table statistics, table lineage relationships, and custom table label data in the database, and then calculating the optimal strategy vector Y according to the strategy evaluation model, where X and Y together form the input dataset of the model.
Preferably, S32 includes removing redundant data and converting text labels to numeric values.
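A minimal sketch of these two preprocessing operations follows. It assumes that "redundant data" means exact duplicate rows and that text labels are mapped to integer codes in order of first appearance; both choices, and the sample rows, are illustrative only.

```python
# Hypothetical sketch of data preprocessing (S32).

def remove_duplicates(rows):
    """Drop exact duplicate rows, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


def encode_text_column(rows, column):
    """Replace text labels in one column with integer codes; return the mapping."""
    mapping = {}
    for row in rows:
        label = row[column]
        if label not in mapping:
            mapping[label] = len(mapping)
        row[column] = mapping[label]
    return mapping


# Made-up rows: (number of tables, computation type, table label).
rows = [
    [2, "filter", "dimension"],
    [1, "aggregate", "dictionary"],
    [2, "filter", "dimension"],   # exact duplicate of the first row
    [3, "filter", "time_series"],
]
clean = remove_duplicates(rows)
type_codes = encode_text_column(clean, 1)
label_codes = encode_text_column(clean, 2)
```

The returned mappings need to be kept so that the same encoding can be applied again at prediction time.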
Preferably, S33 comprises dividing the preprocessed data into a training set and a test set in a ratio of 7:3 and then training with the XGBoost algorithm, where the objective parameter is set to multi:softmax, the num_class parameter is set to 5, and the other parameters are tuned by grid search for the optimal hyperparameters, after which the model is saved.
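The split and the parameter grid might be set up as below. Only `objective` and `num_class` come from the text; the tuned parameters (`max_depth`, `eta`) and their candidate values are assumptions, and the sketch stops short of calling XGBoost so that it runs with the standard library alone.

```python
# Hypothetical sketch of the 7:3 split and the parameter grid (S33).
import itertools
import random


def split_7_3(samples, seed=0):
    """Shuffle and split samples into 70% training and 30% test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


# Fixed parameters named in the text.
BASE_PARAMS = {"objective": "multi:softmax", "num_class": 5}

# Assumed search space; the source does not list the tuned parameters.
SEARCH_SPACE = {
    "max_depth": [3, 5, 7],
    "eta": [0.05, 0.1, 0.3],
}


def candidate_params():
    """Yield one full parameter dict per point of the grid."""
    names = sorted(SEARCH_SPACE)
    for values in itertools.product(*(SEARCH_SPACE[n] for n in names)):
        yield {**BASE_PARAMS, **dict(zip(names, values))}


train, test = split_7_3(range(100))
grid = list(candidate_params())
```

Each candidate dict would then be trained and scored, keeping the best one. Note that in XGBoost the `multi:softmax` objective outputs predicted classes, while `multi:softprob` outputs per-class probabilities, so a setup that also needs probabilities would likely use the latter.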
Preferably, S34 includes predicting SQL statements under different conditions using the data preprocessing method of S32 and the model trained in S33, obtaining the optimal strategy and its corresponding probability.
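How an optimal strategy and its probability can be read from a multi-class model is sketched below. It assumes the model exposes one raw score per strategy class and that a softmax turns these into probabilities; the score values are made up.

```python
# Hypothetical sketch of turning per-class scores into a prediction (S34).
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_strategy(scores):
    """Return (class index, probability) of the most likely strategy."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, probabilities[best]


# Made-up scores for the 5 strategy classes.
class_scores = [0.2, 2.5, 0.1, -1.0, 0.4]
best, probability = predict_strategy(class_scores)
```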
Compared with the prior art, the data virtualization performance optimization method provided by the invention has the following improvements and advantages:
First: an algorithm decides the computation scheme by analyzing service data and monitoring metadata, so that the optimization scheme is both general-purpose and self-learning; specifically, a rule model is designed by combining metadata technology, index optimization, caching, and SQL decomposition and pushdown, and manually annotated rules are also supported so that model generation can be intervened in;
Second: the invention generalizes well, and the rule model remains essentially unchanged as long as the data sources are the same; as samples accumulate over time, the hit probability of the model increases and rule optimization is completed automatically; a manual annotation management module allows specific problems to be solved by customization, so that both customization and automatic identification are covered.
Drawings
The invention is further explained below with reference to the drawings and examples:
FIG. 1 is a flow chart of a data virtualization performance optimization method of the present invention;
FIG. 2 is a diagram of a policy decision model in accordance with the present invention.
Detailed Description
The following detailed description of the present invention clearly and fully describes the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIGS. 1-2, the invention provides a data virtualization performance optimization method comprising the following steps:
S1, label the optimal strategy, which specifically comprises:
S11, first define a strategy evaluation model: an evaluation index weight vector is defined according to service requirements, where a larger component value means a larger weight. The example weight values and the calculation formula of the evaluation value were images in the original publication and are not reproduced in this text;
S12, execute the SQL statement under each of 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine), obtaining 5 evaluation index vectors, namely Z vectors (n dimensions), which are given distinct labels to tell them apart (the notation was an image and is not reproduced in this text);
The strategy evaluation model gives the evaluation value corresponding to each Z vector. The optimization strategy with the maximum evaluation value is the optimal strategy, which yields the vector Y. The formula uses the transpose of one of its operands, whose symbol is likewise not reproduced;
S2, establish a manual rule base: business rules are specified manually based on actual business experience; if a rule is hit, its result is returned directly, and if no rule is hit, the rule base provides no optimization strategy;
S3, predict using a strategy decision model, which specifically comprises the following steps:
S31, collect service-related data: the SQL execution type, historical execution indices, table metadata structure, table statistics, table lineage relationships and custom table label data in the database form the X matrix; the optimal strategy vector Y (n dimensions) is then calculated according to the strategy evaluation model, and X and Y together form the input dataset of the model;
S32, preprocess the data, including removing redundant data and converting text labels to numeric values;
S33, train the model and tune its parameters: the preprocessed data is divided into a training set and a test set in a ratio of 7:3 and the model is trained with the XGBoost algorithm, where the objective parameter is set to multi-class classification, the num_class parameter is set to 5 (the number of classes, corresponding to the number of optimization strategies), and the other parameters are tuned by grid search for the optimal hyperparameters, after which the model is saved;
S34, predict with the model: SQL statements under different conditions are predicted using the data preprocessing method of S32 and the model trained in S33, obtaining the optimal strategy and its corresponding probability;
S4, store the sample data: when the model performs well, the 13 features of the SQL statement and the optimal strategy predicted by the model are stored directly; when the model does not discriminate clearly, the sample is labeled by the optimal strategy labeling method, and the 13 features of the SQL statement and the labeling result are then stored.
Based on the above scheme, an algorithm decides the computation scheme by analyzing service data and monitoring metadata, so that the optimization scheme is both general-purpose and self-learning; specifically, a rule model is designed by combining metadata technology, index optimization, caching, and SQL decomposition and pushdown, and manually annotated rules are also supported so that model generation can be intervened in;
The method generalizes well, and the rule model remains essentially unchanged as long as the data sources are the same; as samples accumulate over time, the hit probability of the model increases and rule optimization is completed automatically; a manual annotation management module allows specific problems to be solved by customization, so that both customization and automatic identification are covered.
The stratified cross-validation shown in FIG. 2 is stratified k-fold cross-validation, and specifically includes the following steps:
divide the dataset into k parts by class proportion, so that the class proportions in each part are the same as in the original dataset; select one of the k parts as the test set and use the remaining k-1 parts as the training set; train the XGBoost model on the training set and evaluate its performance indices (Macro-F1, Macro-Precision, Macro-Recall) on the test set; repeat these steps k times, selecting a different part as the test set each time; and take the average of the k groups of test results as the estimate of model accuracy, which serves as the performance index of the model under the current k-fold cross-validation.
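The procedure can be sketched in plain Python as follows. Round-robin assignment within each class is one simple way to keep class proportions per fold, and `train_and_predict` is a hypothetical callback standing in for training XGBoost and predicting on the held-out fold; only Macro-F1 is computed here.

```python
# Hypothetical sketch of stratified k-fold cross-validation with Macro-F1.
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(labels, k, train_and_predict):
    """Average Macro-F1 over k rounds, each using a different fold as test set."""
    results = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = train_and_predict(train_idx, test_idx)
        truth = [labels[i] for i in test_idx]
        results.append(macro_f1(truth, predictions))
    return sum(results) / k


labels = [0] * 6 + [1] * 3
folds = stratified_folds(labels, 3)
# Stand-in "model" that returns the true labels, used only to exercise the loop.
perfect = cross_validate(labels, 3, lambda train, test: [labels[i] for i in test])
```

Macro-Precision and Macro-Recall would follow the same per-class pattern, averaging tp / (tp + fp) and tp / (tp + fn) respectively.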
It should be further noted that the parameters in the above scheme are described as follows:
X = (x_1, x_2, ..., x_13) represents the 13 feature vectors of the SQL statement (where each x_i is a vector of dimension n, and n is the number of samples), Y (n dimensions) represents the optimal strategy vector, and Z (n × 4 dimensions) represents the indices collected when the SQL statement is executed. (The meaning of each sub-parameter is shown in Table 1.)
Table 1 parameter description table
The previous description is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (3)
1. A data virtualization performance optimization method is characterized in that: the method comprises the following steps:
S1, label the optimal strategy, comprising:
S11, first define a strategy evaluation model: an evaluation index weight vector is defined according to service requirements, where the larger the value of w_j, the larger its weight. The evaluation value is calculated by a formula that was an image in the original publication and is not reproduced in this text;
S12, execute the SQL statement under each of 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine), obtaining 5 evaluation index vectors (Z vectors);
obtain the evaluation value corresponding to each Z vector from the strategy evaluation model; the optimization strategy with the maximum evaluation value is the optimal strategy, which yields the optimal strategy vector Y; the formula uses the transpose of one of its operands, whose symbol is likewise not reproduced;
S2, establish a manual rule base, comprising manually specifying business rules based on actual business experience; if a rule is hit, its result is returned directly, and if no rule is hit, the rule base provides no optimization strategy;
S3, predict using a strategy decision model, which specifically comprises the following steps:
S31, collect service-related data, comprising forming an X matrix from the SQL execution type, historical execution indices, table metadata structure, table statistics, table lineage relationships and custom table label data in the database, and then calculating the optimal strategy vector Y according to the strategy evaluation model, where X and Y together form the input dataset of the model;
S32, preprocess the data;
S33, train the model and tune its parameters;
S34, predict with the model: SQL statements under different conditions are predicted using the data preprocessing method of S32 and the model trained in S33, obtaining the optimal strategy and its corresponding probability;
S4, store the sample data: when the probability corresponding to the optimal strategy is large, the model performs well, and the 13 features of the SQL statement and the optimal strategy predicted by the model are stored directly; when the probability corresponding to the optimal strategy is small, the model does not discriminate clearly, the sample is labeled by the optimal strategy labeling method, and the 13 features of the SQL statement and the labeling result are stored; the matrix formed by the 13 feature vectors of the SQL statement is denoted X, and the 13 feature vectors of the SQL statement include:
x_1 is the number of tables, i.e. the number of tables involved in the query;
x_2 is the table relationship, expressed as join/un;
x_3 is the computation type, including filtering and aggregation;
x_4 is the execution time, i.e. the historical execution time of the SQL;
x_5 is the execution frequency;
x_6 is the SQL result set size, i.e. the magnitude of the historical query results of the SQL;
x_7 is the index situation;
x_8 is the data type, i.e. the table field types and whether a BLOB is included;
x_9 is the data volume of a single table;
x_10 is the change frequency of the single-table data;
x_11 is the lineage similarity;
x_12 is the lineage reference count, i.e. the number of times the table is referenced;
x_13 is the table label, including dimension table, dictionary table, time-series table and stream table.
2. The data virtualization performance optimization method according to claim 1, wherein S32 includes removing redundant data and converting text labels to numeric values.
3. The data virtualization performance optimization method according to claim 1, wherein S33 includes dividing the preprocessed data into a training set and a test set in a ratio of 7:3 and then training with the XGBoost algorithm, where the objective parameter is set to multi:softmax, the num_class parameter is set to 5, and the other parameters are tuned by grid search for the optimal hyperparameters, after which the model is saved.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310398765.4A CN116108025B (en) | 2023-04-14 | 2023-04-14 | Data virtualization performance optimization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310398765.4A CN116108025B (en) | 2023-04-14 | 2023-04-14 | Data virtualization performance optimization method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116108025A (en) | 2023-05-12 |
CN116108025B (en) | 2023-08-01 |
Family
ID=86260214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310398765.4A (Active, CN116108025B) | Data virtualization performance optimization method | 2023-04-14 | 2023-04-14 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116108025B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111444220A (en) * | 2020-05-09 | 2020-07-24 | 南京大学 | Cross-platform SQL query optimization method combining rule driving and data driving |
CN112149721A (en) * | 2020-09-10 | 2020-12-29 | 南京大学 | Target detection method for reducing labeling requirements based on active learning |
CN112749041A (en) * | 2019-10-29 | 2021-05-04 | 中国移动通信集团浙江有限公司 | Virtualized network function backup strategy self-decision method and device and computing equipment |
CN113110866A (en) * | 2021-04-30 | 2021-07-13 | 深圳前海微众银行股份有限公司 | Method and device for evaluating database change script |
CN113656440A (en) * | 2021-08-20 | 2021-11-16 | 中国工商银行股份有限公司 | Database statement optimization method, device and equipment |
CN115705322A (en) * | 2021-08-13 | 2023-02-17 | 华为技术有限公司 | Database management system, data processing method and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN116108025A (en) | 2023-05-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||