CN116108025B - Data virtualization performance optimization method - Google Patents

Info

Publication number
CN116108025B
CN116108025B CN202310398765.4A CN202310398765A CN116108025B CN 116108025 B CN116108025 B CN 116108025B CN 202310398765 A CN202310398765 A CN 202310398765A CN 116108025 B CN116108025 B CN 116108025B
Authority
CN
China
Prior art keywords
data
model
strategy
sql
optimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310398765.4A
Other languages
Chinese (zh)
Other versions
CN116108025A (en)
Inventor
王聪明
王三明
胡小敏
李成坤
赵伟帆
尹文祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qiye Cloud Big Data Nanjing Co ltd
Anyuan Technology Co ltd
Original Assignee
Qiye Cloud Big Data Nanjing Co ltd
Anyuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-04-14
Filing date
2023-04-14
Publication date
2023-08-01
Application filed by Qiye Cloud Big Data Nanjing Co ltd and Anyuan Technology Co ltd
Priority to CN202310398765.4A
Publication of CN116108025A
Application granted
Publication of CN116108025B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data virtualization, and in particular to a data virtualization performance optimization method. The method decides a calculation scheme by analyzing service data and monitoring metadata, and its algorithm is designed so that the optimization scheme is general and self-learning. Specifically, a rule model is designed by combining metadata technology, index optimization technology, caching technology, and sql decomposition and pushdown technology, and manual annotation of rules is supported so that generation of the model can be intervened in. The method has good generality: as long as the data sources are the same, the rule model remains essentially unchanged. As data accumulates over time, the hit probability of the model increases and rule optimization is completed automatically. A manual annotation management module allows problems to be solved by customization, so that both customization and automatic identification are supported.

Description

Data virtualization performance optimization method
Technical Field
The invention relates to the technical field of data virtualization, in particular to a data virtualization performance optimization method.
Background
Enterprise data are isolated from one another and scattered across traditional data warehouses, enterprise applications, big data lakes, operational data stores, the cloud, and other locations, which poses great challenges for business teams. Existing general schemes rely heavily on generic sql optimization, have a weak understanding of the business data, and must be adjusted manually and dynamically according to actual conditions.
In the prior art, rules must be specified according to actual service conditions; if they are specified manually, they cannot be migrated. Because the rules are tied to the business, they must be adjusted dynamically, so the rules are never fixed and their effect is not timely.
Disclosure of Invention
The invention aims to provide a data virtualization performance optimization method that solves the problems described in the Background.
The technical scheme of the invention is as follows: a data virtualization performance optimization method, comprising the following steps:
S1, labeling the optimal strategy;
S2, establishing a manual rule base;
S3, predicting with a strategy decision model;
S4, storing sample data.
Preferably, S1 includes:
S11, first defining a strategy evaluation model, in which an evaluation index weight vector $W = (w_1, w_2, \dots)$ is defined according to service requirements; the larger the value of $w_j$, the greater the weight. The evaluation value of an evaluation index vector $Z$ is calculated as:
$$W Z^{T} = \sum_{j} w_j z_j$$
S12, executing the sql statement under each of the 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine) to obtain 5 evaluation index vectors (Z vectors), $Z_1, \dots, Z_5$,
and obtaining the corresponding evaluation values $W Z_i^{T}$ from the strategy evaluation model; the optimization strategy with the maximum evaluation value is the optimal strategy, which yields the vector Y.
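The labeling in S1 can be sketched as follows. This is a minimal illustration, assuming the evaluation value is a plain weighted sum of the evaluation indices and that every index has already been scaled so that a higher value is better; the strategy identifiers, the weights, and the sample numbers are hypothetical and do not come from the patent.

```python
from typing import Sequence

# Identifiers for the five optimization strategies listed in S12 (names are illustrative).
STRATEGIES = [
    "add_index",
    "data_cache",
    "shard_db_table",
    "change_execution_mode",
    "change_execution_engine",
]


def evaluation_value(weights: Sequence[float], indices: Sequence[float]) -> float:
    """Weighted sum of evaluation indices (assumed form of the evaluation formula)."""
    if len(weights) != len(indices):
        raise ValueError("weights and indices must have the same length")
    return sum(w * z for w, z in zip(weights, indices))


def label_optimal_strategy(weights, z_by_strategy):
    """Return (index, name, scores) of the strategy with the largest evaluation value.

    z_by_strategy holds one evaluation index vector per strategy, in STRATEGIES order.
    """
    scores = [evaluation_value(weights, z) for z in z_by_strategy]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, STRATEGIES[best], scores


# Hypothetical example: 4 evaluation indices per strategy for one sql statement.
W = [0.5, 0.2, 0.2, 0.1]
Z = [
    [0.6, 0.4, 0.5, 0.9],
    [0.9, 0.8, 0.3, 0.7],
    [0.4, 0.5, 0.6, 0.2],
    [0.5, 0.5, 0.5, 0.5],
    [0.7, 0.3, 0.8, 0.4],
]
best_index, best_name, scores = label_optimal_strategy(W, Z)
print(best_index, best_name)
```

Repeating this for each of the n sample statements and collecting the winning index would give one entry of the vector Y per statement.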
Preferably, S2 includes manually specifying business rules based on actual business experience; if a rule is hit, its result is returned directly, and if no rule is hit, the rule base provides no optimization strategy.
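One way to picture the hit/miss behaviour of the manual rule base is the sketch below. It assumes that a miss is handed on to the strategy decision model of S3; the `Rule` structure, the example rules, and the stand-in model are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """One manually specified business rule (hypothetical structure)."""
    name: str
    matches: Callable[[dict], bool]  # predicate over the sql statement's features
    strategy: str                    # strategy returned when the rule hits


def lookup_rule(rules: list, features: dict) -> Optional[str]:
    """Return the strategy of the first matching rule, or None on a miss."""
    for rule in rules:
        if rule.matches(features):
            return rule.strategy
    return None


def choose_strategy(rules, features, predict):
    """Consult the rule base first; call the decision model only on a miss."""
    hit = lookup_rule(rules, features)
    if hit is not None:
        return hit, "rule"
    return predict(features), "model"


# Hypothetical rules and a stand-in model, for illustration only.
RULES = [
    Rule("dictionary tables are cached", lambda f: f.get("label") == "dictionary", "data_cache"),
    Rule("very large single table", lambda f: f.get("single_table_rows", 0) > 10**9, "shard_db_table"),
]
stub_model = lambda f: "add_index"

hit_result = choose_strategy(RULES, {"label": "dictionary"}, stub_model)
miss_result = choose_strategy(RULES, {"label": "dimension", "single_table_rows": 1000}, stub_model)
print(hit_result, miss_result)
```

Keeping the rules as ordered predicates means a manually added rule always takes precedence over the learned model, which matches the idea that manual annotation can intervene in the automatic decision.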
Preferably, S3 comprises the steps of:
S31, collecting service-related data;
S32, preprocessing the data;
S33, training the model and tuning its parameters;
S34, predicting with the model.
Preferably, S4 includes, when the model performs well, directly storing the 13 features of the sql statement and the optimal strategy predicted by the model.
Preferably, S4 includes, when the model does not discriminate clearly, labeling according to the optimal strategy labeling method and then storing the 13 features of the sql statement and the labeling result.
Preferably, S31 includes forming an X matrix from the sql execution type, historical execution indicators, table metadata structure, table statistics, table lineage relationships, and table custom label data in the library, and then calculating the optimal strategy vector Y according to the strategy evaluation model, where X and Y together form the input dataset of the model.
Preferably, S32 includes removing redundant data and converting text labels to numeric values.
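The two preprocessing operations named in S32 can be sketched as below. The sketch assumes that "redundant data" means exact duplicate samples and that text labels are mapped to integer codes; the column layout and sample values are hypothetical.

```python
def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fit_label_encoder(values):
    """Map each distinct text label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode_column(rows, column, mapping):
    """Return new rows with the text label in `column` replaced by its code."""
    encoded = []
    for row in rows:
        new_row = list(row)
        new_row[column] = mapping[row[column]]
        encoded.append(new_row)
    return encoded


# Hypothetical samples: (number of tables, table relationship, table label).
raw = [
    (2, "join", "dimension"),
    (1, "none", "dictionary"),
    (2, "join", "dimension"),   # exact duplicate of the first row
    (3, "join", "time_series"),
]
rows = drop_duplicates(raw)
for col in (1, 2):
    mapping = fit_label_encoder(r[col] for r in rows)
    rows = encode_column(rows, col, mapping)
print(rows)
```

The same mappings would have to be saved and reused at prediction time so that a label seen in training receives the same code in S34.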
Preferably, S33 comprises dividing the preprocessed data into a training set and a test set in a ratio of 7:3, then training with the XGBoost algorithm, where the objective parameter is set to multi:softmax, the num_class parameter is set to 5, and the optimal values of the other hyperparameters are found by grid search, and then saving the model.
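The data split and the parameter search space of S33 can be laid out as in the following sketch. It uses only the standard library and stops short of training; `objective`, `num_class`, `max_depth`, and `eta` are standard XGBoost parameter names, but the grid values and the choice of which parameters to search are assumptions, since the text does not list them.

```python
import itertools
import random


def split_7_3(samples, seed=0):
    """Shuffle and split samples into a 70% training set and a 30% test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


# Fixed parameters named in the text.
BASE_PARAMS = {"objective": "multi:softmax", "num_class": 5}

# Hypothetical search grid; the source does not say which parameters are searched.
GRID = {
    "max_depth": [3, 5, 7],
    "eta": [0.05, 0.1, 0.3],
}


def parameter_grid(base, grid):
    """Yield one full parameter dict per combination of grid values."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield {**base, **dict(zip(keys, values))}


train, test = split_7_3(range(100))
candidates = list(parameter_grid(BASE_PARAMS, GRID))
print(len(train), len(test), len(candidates))
```

Each candidate dict could then be passed to `xgboost.train` and scored on held-out data. Note that in XGBoost `multi:softmax` returns only the predicted class, while `multi:softprob` returns per-class probabilities, so obtaining the probability mentioned in S34 would likely require the latter or an equivalent.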
Preferably, S34 includes predicting sql statements under different conditions by using the data preprocessing method in S32 and the model trained in S33, to obtain an optimal policy and a probability corresponding to the optimal policy.
Compared with the prior art, the data virtualization performance optimization method provided by the invention has the following improvements and advantages:
First: the invention designs an algorithm that decides the calculation scheme by analyzing service data and monitoring metadata, so that the optimization scheme is general and self-learning; specifically, a rule model is designed by combining metadata technology, index optimization technology, caching technology, and sql decomposition and pushdown technology, and manual annotation of rules is supported so that generation of the model can be intervened in;
Second: the invention has good generality, and the rule model remains essentially unchanged as long as the data sources are the same; as data accumulates over time, the hit probability of the model increases and rule optimization is completed automatically; a manual annotation management module allows problems to be solved by customization, so that both customization and automatic identification are supported.
Drawings
The invention is further explained below with reference to the drawings and examples:
FIG. 1 is a flow chart of a data virtualization performance optimization method of the present invention;
FIG. 2 is a diagram of a policy decision model in accordance with the present invention.
Detailed Description
The following detailed description of the present invention clearly and fully describes the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides an improved data virtualization performance optimization method. As shown in FIGS. 1-2, the data virtualization performance optimization method includes the following steps:
S1, performing optimal strategy labeling, which specifically comprises:
S11, first defining a strategy evaluation model, in which an evaluation index weight vector $W = (w_1, w_2, \dots)$ is defined according to service requirements; the larger the value of $w_j$, the greater the weight. The evaluation value of an evaluation index vector $Z$ is calculated as:
$$W Z^{T} = \sum_{j} w_j z_j$$
S12, executing the sql statement under each of the 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine) to obtain 5 evaluation index vectors, namely Z vectors (n dimensions), denoted $Z_1, \dots, Z_5$ to distinguish them.
The strategy evaluation model gives the evaluation value $W Z_i^{T}$ corresponding to each $Z_i$; the optimization strategy with the maximum evaluation value is the optimal strategy, which yields the vector Y, where $Z_i^{T}$ denotes the transpose of $Z_i$;
S2, establishing a manual rule base: business rules are specified manually based on actual business experience; if a rule is hit, its result is returned directly, and if no rule is hit, the rule base provides no optimization strategy;
S3, predicting with a strategy decision model, which specifically comprises the following steps:
S31, collecting service-related data: the sql execution type, historical execution indicators, table metadata structure, table statistics, table lineage relationships, and table custom label data in the library form the X matrix; the optimal strategy vector Y (n dimensions) is then calculated according to the strategy evaluation model, and X and Y together form the input dataset of the model;
S32, preprocessing the data, including removing redundant data and converting text labels to numeric values;
S33, training the model and tuning its parameters: the preprocessed data are divided into a training set and a test set in a ratio of 7:3 and then trained with the XGBoost algorithm, where the objective parameter is set to multi:softmax (multi-class classification), the num_class parameter is set to 5 (the number of classes, corresponding to the number of optimization strategies), and the optimal values of the other hyperparameters are found by grid search; the model is then saved;
S34, predicting with the model: sql statements under different conditions are predicted using the data preprocessing method of S32 and the model trained in S33, yielding the optimal strategy and its corresponding probability;
S4, storing sample data: when the model performs well, the 13 features of the sql statement and the optimal strategy predicted by the model are stored directly; when the model does not discriminate clearly, the statement is labeled according to the optimal strategy labeling method, and the 13 features of the sql statement and the labeling result are then stored.
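The storage decision in S4 can be sketched as a confidence check on the model's output. The sketch assumes that "the model performs well" means the probability of the predicted strategy exceeds a cut-off; the threshold of 0.8, the record layout, and the `relabel` callback (standing in for the S1 labeling procedure) are hypothetical.

```python
def route_sample(features, probabilities, relabel, threshold=0.8):
    """Decide which label is stored with the 13 features of an sql statement.

    probabilities: per-strategy probabilities from the decision model.
    relabel: callable that re-runs the S1 labeling for this statement.
    threshold: hypothetical cut-off between a "large" and a "small" probability.
    """
    if len(features) != 13:
        raise ValueError("expected the 13 features of the sql statement")
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[predicted] >= threshold:
        return {"features": list(features), "label": predicted, "source": "model"}
    return {"features": list(features), "label": relabel(features), "source": "labeling"}


store = []  # stand-in for the sample library
features = [0.0] * 13
store.append(route_sample(features, [0.05, 0.85, 0.04, 0.03, 0.03], relabel=lambda f: 2))
store.append(route_sample(features, [0.30, 0.25, 0.20, 0.15, 0.10], relabel=lambda f: 2))
print([(s["label"], s["source"]) for s in store])
```

Routing low-confidence statements back through the labeling step is what lets the stored sample set grow with reliable labels, so later retraining can raise the model's hit probability.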
Based on the above scheme, the invention designs an algorithm that decides the calculation scheme by analyzing service data and monitoring metadata, so that the optimization scheme is general and self-learning; specifically, a rule model is designed by combining metadata technology, index optimization technology, caching technology, and sql decomposition and pushdown technology, and manual annotation of rules is supported so that generation of the model can be intervened in;
the method has good generality, and the rule model remains essentially unchanged as long as the data sources are the same; as data accumulates over time, the hit probability of the model increases and rule optimization is completed automatically; a manual annotation management module allows problems to be solved by customization, so that both customization and automatic identification are supported.
The stratified cross-validation shown in FIG. 2 is stratified k-fold cross-validation, which specifically includes the following steps:
dividing the data set into k parts by class proportion, so that the class proportions in each part match those of the original data set; selecting one of the k parts as the test set and using the remaining k-1 parts as the training set for model training; training the XGBoost model on the training set and evaluating the model's performance indicators (Macro-F1, Macro-Precision, Macro-Recall) on the test set; repeating these steps k times, selecting a different part as the test set each time; and calculating the average of the k groups of test results as the estimate of model accuracy, which serves as the model's performance indicator under the current k-fold cross-validation.
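The stratified k-fold procedure and the Macro-F1 indicator can be sketched with the standard library alone. In this sketch a trivial majority-class predictor stands in for the XGBoost model, the fold assignment is a simple round-robin within each class, and the labels are hypothetical.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) for k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    # Deal each class's samples round-robin so every fold gets its share.
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = [0] * 6 + [1] * 3          # hypothetical strategy labels, 2:1 class ratio
fold_scores = []
for train, test in stratified_k_fold(labels, k=3):
    # Stand-in "model": always predict the majority class of the training fold.
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    y_true = [labels[i] for i in test]
    fold_scores.append(macro_f1(y_true, [majority] * len(test)))
mean_macro_f1 = sum(fold_scores) / len(fold_scores)
print(mean_macro_f1)
```

Because Macro-F1 averages the per-class scores without weighting, a predictor that ignores a rarely chosen strategy is penalized heavily, which is a reason to prefer macro indicators when the five strategy classes are imbalanced.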
It should be further noted that the following description is given of the parameters in the above scheme:
$X = (x_1, x_2, \dots, x_{13})$ represents the 13 feature vectors of the sql statement (where each $x_i$ is a vector of dimension n, and n is the number of samples); Y (n dimensions) represents the optimal strategy vector; Z (n × 4 dimensions) represents the features collected when the sql statement is executed. (The meaning of each sub-parameter is shown in Table 1.)
Table 1 parameter description table
The previous description is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. A data virtualization performance optimization method, characterized in that the method comprises the following steps:
S1, performing optimal strategy labeling, which comprises the following steps:
S11, first defining a strategy evaluation model, in which an evaluation index weight vector $W = (w_1, w_2, \dots)$ is defined according to service requirements; the larger the value of $w_j$, the greater the weight; the evaluation value of an evaluation index vector $Z$ is calculated as:
$$W Z^{T} = \sum_{j} w_j z_j$$
S12, executing the sql statement under each of the 5 optimization strategies (adding an index, data caching, database and table sharding, changing the execution mode, and changing the execution engine) to obtain 5 evaluation index vectors (Z vectors), $Z_1, \dots, Z_5$,
and obtaining from the strategy evaluation model the evaluation value $W Z_i^{T}$ corresponding to each $Z_i$; the optimization strategy corresponding to the maximum evaluation value is the optimal strategy, so that an optimal strategy vector Y is obtained; $Z_i^{T}$ denotes the transpose of $Z_i$;
S2, establishing a manual rule base, which comprises manually specifying business rules based on actual business experience; if a rule is hit, its result is returned directly, and if no rule is hit, the rule base provides no optimization strategy;
S3, predicting with a strategy decision model, which specifically comprises the following steps:
S31, collecting service-related data, including forming an X matrix from the sql execution type, historical execution indicators, table metadata structure, table statistics, table lineage relationships, and table custom label data in the library, and then calculating an optimal strategy vector Y according to the strategy evaluation model, where X and Y together form the input dataset of the model;
S32, preprocessing the data;
S33, training the model and tuning its parameters;
S34, predicting with the model, namely predicting sql statements under different conditions using the data preprocessing method of S32 and the model trained in S33 to obtain the optimal strategy and its corresponding probability;
S4, storing sample data: when the probability corresponding to the optimal strategy is large, the model performs well, and the 13 features of the sql statement and the optimal strategy predicted by the model are stored directly; when the probability corresponding to the optimal strategy is small, the model does not discriminate clearly, so the statement is labeled according to the optimal strategy labeling method and the 13 features of the sql statement and the labeling result are stored, where the matrix formed by the 13 feature vectors of the sql statement is expressed as $X = (x_1, x_2, \dots, x_{13})$. The 13 feature vectors of the sql statement include:
$x_1$ is the number of tables, i.e. the number of tables involved in the query;
$x_2$ is the table relationship, expressed as join/un;
$x_3$ is the computation type, including filtering and aggregation;
$x_4$ is the execution time, i.e. the historical execution time of the sql;
$x_5$ is the execution frequency;
$x_6$ is the sql result set size, i.e. the magnitude of the sql's historical query results;
$x_7$ is the index status;
$x_8$ is the data type, i.e. the table field types and whether a blob is contained;
$x_9$ is the data volume of a single table;
$x_{10}$ is the change frequency of single-table data;
$x_{11}$ is the lineage similarity;
$x_{12}$ is the lineage reference count, i.e. the number of times the table is referenced;
$x_{13}$ is the label, including dimension table, dictionary table, time-series table, and stream table.
2. The data virtualization performance optimization method according to claim 1, wherein S32 includes removing redundant data and converting text labels to numeric values.
3. The data virtualization performance optimization method according to claim 1, wherein S33 includes dividing the preprocessed data into a training set and a test set in a ratio of 7:3, then training with the XGBoost algorithm, where the objective parameter is set to multi:softmax, the num_class parameter is set to 5, and the optimal values of the other hyperparameters are found by grid search, and then saving the model.
CN202310398765.4A 2023-04-14 2023-04-14 Data virtualization performance optimization method Active CN116108025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310398765.4A CN116108025B (en) 2023-04-14 2023-04-14 Data virtualization performance optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310398765.4A CN116108025B (en) 2023-04-14 2023-04-14 Data virtualization performance optimization method

Publications (2)

Publication Number Publication Date
CN116108025A (en) 2023-05-12
CN116108025B (en) 2023-08-01

Family

ID=86260214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310398765.4A Active CN116108025B (en) 2023-04-14 2023-04-14 Data virtualization performance optimization method

Country Status (1)

Country Link
CN (1) CN116108025B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444220A (en) * 2020-05-09 2020-07-24 南京大学 Cross-platform SQL query optimization method combining rule driving and data driving
CN112149721A (en) * 2020-09-10 2020-12-29 南京大学 Target detection method for reducing labeling requirements based on active learning
CN112749041A (en) * 2019-10-29 2021-05-04 中国移动通信集团浙江有限公司 Virtualized network function backup strategy self-decision method and device and computing equipment
CN113110866A (en) * 2021-04-30 2021-07-13 深圳前海微众银行股份有限公司 Method and device for evaluating database change script
CN113656440A (en) * 2021-08-20 2021-11-16 中国工商银行股份有限公司 Database statement optimization method, device and equipment
CN115705322A (en) * 2021-08-13 2023-02-17 华为技术有限公司 Database management system, data processing method and equipment

Also Published As

Publication number Publication date
CN116108025A (en) 2023-05-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant