CN116303538A - Training method of calculation engine selection model, calculation engine selection method and device - Google Patents

Info

Publication number
CN116303538A
Authority
CN
China
Prior art keywords
data
sql
calculation engine
training
selection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310274395.3A
Other languages
Chinese (zh)
Inventor
张子浪
刘海滨
李小言
郝慧俊
程玉藏
郑青如
刘航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tower Co Ltd
Original Assignee
China Tower Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tower Co Ltd
Priority to CN202310274395.3A
Publication of CN116303538A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a training method for a calculation engine selection model, a calculation engine selection method, and corresponding devices, and belongs to the technical field of big data. The training method comprises: acquiring a Structured Query Language (SQL) training set together with, for each SQL statement in the training set, its text features, its vectorized data, and the data features of the data table in that statement; generating an engine label for each SQL statement, the engine label indicating the computing engine on which that statement has the shortest execution time; and training the calculation engine selection model using the text features, the vectorized data, and the data features as training data and the engine label of each SQL statement as the label value. The trained calculation engine selection model takes a target SQL statement as input and outputs the computing engine with the shortest execution time for that statement.

Description

Training method of calculation engine selection model, calculation engine selection method and device
Technical Field
The application belongs to the technical field of big data, and particularly relates to a training method of a calculation engine selection model, a calculation engine selection method and a device.
Background
The structured query language (Structured Query Language, SQL) is a database query and programming language used to access data and to query, update, and manage relational database systems.
When a technician runs interactive SQL queries on a big data platform, execution efficiency varies with the computing engine selected. Deciding which computing engine to use usually requires an experienced engineer or manual trial and error; no automatic method is available, which greatly increases time cost and resource cost.
Disclosure of Invention
The embodiment of the application aims to provide a training method of a calculation engine selection model, a calculation engine selection method and a device, which can solve the problems of high time cost and resource cost of the existing calculation engine selection mode.
In a first aspect, an embodiment of the present application provides a training method for a computing engine selection model, where the method includes:
acquiring a Structured Query Language (SQL) training set, text features of each SQL statement in the SQL training set, vectorized data and data features of a data table in each SQL statement;
generating an engine label for each SQL statement, wherein the engine label indicates the computing engine with the shortest execution time for that SQL statement;
taking the text features and the vectorized data of each SQL statement and the data features of the data table in each SQL statement as training data of the calculation engine selection model, taking the engine label corresponding to each SQL statement as the label value, and training the calculation engine selection model;
the trained calculation engine selection model is used for outputting a calculation engine with shortest execution time of the target SQL statement according to the input target SQL statement.
Optionally, the training of the calculation engine selection model, using the text features and the vectorized data of each SQL statement and the data features of the data table in each SQL statement as training data and the engine label corresponding to each SQL statement as the label value, includes:
inputting the vectorized data of each SQL statement into the calculation engine selection model, with the engine label corresponding to each SQL statement as the label value, to obtain a one-dimensional hidden layer vector representation;
and training the calculation engine selection model based on the one-dimensional hidden layer vector representation, the text features of each SQL statement, the data features of the data table in each SQL statement, and the engine label corresponding to each SQL statement.
Optionally, the acquiring the SQL training set, text features of each SQL statement in the SQL training set, vectorized data, and data features of a data table in each SQL statement includes:
performing text analysis on each SQL statement in the SQL training set to determine the text features of each SQL statement, wherein the text features comprise the number of data tables in the SQL statement and information indicating whether an aggregation function exists in the SQL statement;
performing text vectorization on each SQL statement in the SQL training set, based on natural language processing (NLP) technology, to obtain the vectorized data of each SQL statement;
determining the table name of the data table involved in the SQL statement based on the text features of the SQL statement;
and acquiring historical data of the data table from a statistical table based on the table name of the data table, and taking the maximum value of the historical data as the data feature of the data table in the SQL statement.
Optionally, the training the computing engine selection model based on the one-dimensional hidden layer vector representation, the text feature of each SQL statement, the data feature of the data table in each SQL statement, and the engine label corresponding to each SQL statement includes:
and training with the one-dimensional hidden layer vector representation, the text features of each SQL statement, and the data features of the data table in each SQL statement as training data and the engine label corresponding to each SQL statement as the label value, using the distributed gradient boosting library XGBoost, a gradient boosting decision tree (GBDT), or the light gradient boosting machine LightGBM.
In a second aspect, embodiments of the present application provide a computing engine selection method, the method including:
acquiring a calculation engine selection request, and acquiring the SQL statement in the calculation engine selection request;
inputting the SQL statement into a calculation engine selection model, and acquiring first data output by the calculation engine selection model, wherein the first data indicates the computing engine with the shortest execution time for the SQL statement;
wherein the calculation engine selection model is a model trained by the method of the first aspect.
Optionally, the inputting the SQL statement into a compute engine selection model includes:
acquiring text features of the SQL statement, vectorized data and data features of a data table in the SQL statement;
inputting the vectorized data of the SQL statement into the calculation engine selection model to obtain a one-dimensional hidden layer vector representation;
and inputting the one-dimensional hidden layer vector representation, the text features of the SQL statement, and the data features of the data table in the SQL statement into the calculation engine selection model.
Optionally, the acquiring the data characteristics of the data table in the SQL statement includes:
determining the table name of the data table in the SQL statement based on the text features;
determining historical data of the data table based on the table name, wherein the historical data comprises the file size in the Hadoop Distributed File System (HDFS);
and determining the current data of the data table by applying a time series prediction method to the historical data, wherein the current data is the data feature of the data table.
In a third aspect, embodiments of the present application provide a training apparatus for a calculation engine selection model, the apparatus comprising:
the first acquisition module is used for acquiring a Structured Query Language (SQL) training set, text features of each SQL statement in the SQL training set, vectorized data and data features of a data table in each SQL statement;
the generating module is used for generating an engine label for each SQL statement, the engine label indicating the computing engine with the shortest execution time for that SQL statement;
the training module is used for training the calculation engine selection model by taking the text features and the vectorized data of each SQL statement and the data features of the data table in each SQL statement as training data of the calculation engine selection model and taking the engine label corresponding to each SQL statement as the label value;
the trained calculation engine selection model is used for outputting a calculation engine with shortest execution time of the target SQL statement according to the input target SQL statement.
In a fourth aspect, embodiments of the present application provide a computing engine selection apparatus, the apparatus comprising:
the second acquisition module is used for acquiring a calculation engine selection request and acquiring the SQL statement in the calculation engine selection request;
the selection module is used for inputting the SQL statement into a calculation engine selection model, and acquiring first data output by the calculation engine selection model, wherein the first data indicates the computing engine with the shortest execution time for the SQL statement;
wherein the calculation engine selection model is a model trained by the method of the first aspect.
In a fifth aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method for training a calculation engine selection model as described in the first aspect or the steps of the method for calculation engine selection as described in the second aspect.
In a sixth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method for training a calculation engine selection model as described in the first aspect or the steps of the method for calculation engine selection as described in the second aspect.
In the embodiments of the application, the calculation engine selection model is trained with the text features and vectorized data of the SQL statements and the data features of the data tables in the SQL statements as training data and the engine labels corresponding to the SQL statements as label values. As the training data and corresponding label values continue to grow, the accuracy of the output of the trained calculation engine selection model continues to improve. After training is finished, when a computing engine needs to be selected for an SQL statement, it is only necessary to input the SQL statement into the calculation engine selection model, which selects the computing engine with the shortest execution time. No manual attempts are required, so the selection is automated and the time cost and resource cost are reduced.
Drawings
FIG. 1 is a flowchart of a training method of a calculation engine selection model according to an embodiment of the present application;
FIG. 2 is a flowchart of a computing engine selection method according to an embodiment of the present disclosure;
FIG. 3 is a second flowchart of a training method of a calculation engine selection model according to an embodiment of the present application;
FIG. 4 is a second flowchart of a computing engine selection method according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a training device for a calculation engine selection model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computing engine selection device according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.
The terms "first", "second", and the like in the description and in the claims are used to distinguish between similar objects and not necessarily to describe a particular sequential or chronological order. It is to be understood that terms so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in sequences other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and their number is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
As shown in fig. 1, the training method for a calculation engine selection model provided in the embodiment of the present application includes the following steps:
Step S11, obtaining a Structured Query Language (SQL) training set, together with the text features and vectorized data of each SQL statement in the SQL training set and the data features of the data table in each SQL statement.
The SQL training set comprises a plurality of SQL statements. To ensure the accuracy of the output of the trained calculation engine selection model, the number of SQL statements should be as large as possible and the types of SQL statements should be diverse. The text features of an SQL statement may be information about the data tables in the SQL statement, and may be information about whether an aggregation function exists in the SQL statement.
Text vectorization is performed on the SQL statement based on natural language processing (Natural Language Processing, NLP) technology to obtain the vectorized data of the SQL statement. The data table of the SQL statement is identified, and the data table information corresponding to that data table is looked up in the statistical table to obtain the data features of the data table.
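The text does not name a specific vectorization technique, so the following is only a minimal sketch of one common approach: tokenizing each SQL statement and mapping tokens to integer ids of fixed length. The tokenizer pattern, the function names `tokenize`, `build_vocab`, and `vectorize`, and the padding length are all assumptions made for illustration.

```python
import re

# Hypothetical tokenizer pattern: identifiers, numbers, and single symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")


def tokenize(sql: str) -> list[str]:
    """Split an SQL statement into lowercase tokens."""
    return [tok.lower() for tok in TOKEN_PATTERN.findall(sql)]


def build_vocab(statements: list[str]) -> dict[str, int]:
    """Assign an integer id to every token seen in the training set.

    Id 0 is reserved for padding and id 1 for unknown tokens.
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for sql in statements:
        for tok in tokenize(sql):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(sql: str, vocab: dict[str, int], max_len: int = 16) -> list[int]:
    """Map tokens to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(tok, 1) for tok in tokenize(sql)][:max_len]
    return ids + [0] * (max_len - len(ids))
```

A fixed-length id sequence of this kind is a typical input for a CNN or RNN embedding layer; a learned embedding or a pretrained encoder could be substituted without changing the rest of the flow.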
Step S12, generating an engine label for each SQL statement, wherein the engine label indicates the computing engine with the shortest execution time for that SQL statement.
Each SQL statement in the SQL training set is executed in turn on the different computing engines (with balanced resource configurations across the engines), and its execution time on each engine is recorded, so as to obtain the engine label of the computing engine on which the SQL statement has the shortest execution time. The computing engine may be the first-generation MapReduce-based Hive engine, the second-generation streaming Spark engine, the third-generation Flink engine with unified streaming data processing, the interactive query engine Presto, or the Greenplum engine with a massively parallel processing (Massively Parallel Processing, MPP) architecture. Execution times differ across SQL statements and across computing engines.
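Label generation reduces to picking the engine with the minimum recorded time for each statement. The sketch below assumes the execution times have already been measured and collected into a dictionary; the function names and the timing values in the usage note are illustrative, not taken from the application.

```python
def engine_label(execution_seconds: dict[str, float]) -> str:
    """Return the name of the engine with the shortest recorded execution time."""
    if not execution_seconds:
        raise ValueError("at least one engine timing is required")
    return min(execution_seconds, key=execution_seconds.get)


def label_training_set(timings: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each SQL statement to its engine label.

    `timings` maps an SQL statement to a {engine name: seconds} dictionary.
    """
    return {sql: engine_label(per_engine) for sql, per_engine in timings.items()}
```

For example, `engine_label({"hive": 42.0, "spark": 9.5, "presto": 3.1})` selects `"presto"` for these made-up timings.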
Step S13, training the calculation engine selection model based on the one-dimensional hidden layer vector representation, the text features of each SQL statement, the data features of the data table in each SQL statement, and the engine label corresponding to each SQL statement;
the trained calculation engine selection model is used for outputting the computing engine with the shortest execution time for a target SQL statement according to the input target SQL statement.
The text features and vectorized data of the SQL statements and the data features of the data tables in the SQL statements obtained in the above steps are used as training data of the calculation engine selection model, and the engine labels corresponding to the SQL statements are used as label values, to train the calculation engine selection model. As the training data and corresponding label values continue to grow, the accuracy of the output of the trained calculation engine selection model continues to improve. After training is finished, when a computing engine needs to be selected for an SQL statement, it is only necessary to input the SQL statement into the calculation engine selection model, which selects the computing engine with the shortest execution time. No manual attempts are required, so the selection is automated and the time cost and resource cost are reduced.
Optionally, step S13, training the calculation engine selection model using the text features and the vectorized data of each SQL statement and the data features of the data table in each SQL statement as training data and the engine label corresponding to each SQL statement as the label value, includes:
inputting the vectorized data of each SQL statement into the calculation engine selection model, with the engine label corresponding to each SQL statement as the label value, to obtain a one-dimensional hidden layer vector representation;
and training the calculation engine selection model based on the one-dimensional hidden layer vector representation, the text features of each SQL statement, the data features of the data table in each SQL statement, and the engine label corresponding to each SQL statement.
The calculation engine selection model of the present embodiment includes a first sub-model and a second sub-model. The first sub-model is trained with the vectorized data of the SQL statements as training data and the engine labels corresponding to the SQL statements as label values; each round of training can improve the accuracy of its output. The first sub-model also produces a corresponding one-dimensional hidden layer vector representation, which is not the output result of the first sub-model but a parameter corresponding to the first sub-model. The second sub-model is then trained with the one-dimensional hidden layer vector representation, the text features of the SQL statement, and the data features of the data table in the SQL statement as training data and the engine label corresponding to the SQL statement as the label value. The second sub-model outputs the computing engine with the shortest execution time for an SQL statement according to the input one-dimensional hidden layer vector representation, text features, and data features. In this step, the one-dimensional hidden layer vector representation is obtained from the vectorized data of the SQL statement through the first sub-model, and is combined with other related features, such as the text features of the SQL statement and the data features of the data table, to train the second sub-model and obtain the final calculation engine selection model.
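The hand-off between the two sub-models can be sketched as a feature concatenation step. The example below assumes the hidden layer vectors have already been extracted from the first sub-model; the function name `build_second_stage_rows` and the integer label encoding are illustrative choices, and the actual training call of the second sub-model is omitted.

```python
def build_second_stage_rows(hidden_vectors, text_features, data_features, engine_labels):
    """Build the training inputs of the second sub-model.

    All four arguments are parallel lists with one entry per SQL statement.
    Each row is the concatenation of the hidden layer vector, the text
    features, and the data features; engine labels become integer class ids.
    """
    engines = sorted(set(engine_labels))  # stable, reproducible label order
    label_ids = {name: i for i, name in enumerate(engines)}
    rows = [
        list(h) + list(t) + list(d)
        for h, t, d in zip(hidden_vectors, text_features, data_features)
    ]
    targets = [label_ids[name] for name in engine_labels]
    return rows, targets, engines
```

The returned `rows` and `targets` have the row-per-sample shape that gradient boosting classifiers generally expect, and `engines` lets a predicted class id be mapped back to an engine name.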
Optionally, training is performed using a convolutional neural network (Convolutional Neural Network, CNN) or a recurrent neural network (Recurrent Neural Network, RNN), with the vectorized data of each SQL statement as training data and the engine label corresponding to each SQL statement as the label value.
Optionally, step S11, obtaining an SQL training set, and text features of each SQL statement in the SQL training set, vectorized data, and data features of a data table in each SQL statement, includes:
performing text analysis on each SQL statement in the SQL training set to determine the text features of each SQL statement, wherein the text features comprise the number of data tables in the SQL statement and information indicating whether an aggregation function exists in the SQL statement.
Text analysis refers to selecting a representation of the text and its features. The text features of the SQL statement include the number of data tables (one data table or multiple data tables), the names of the data tables, and information indicating whether an aggregation function is used in the SQL statement.
performing text vectorization on each SQL statement in the SQL training set, based on natural language processing (NLP) technology, to obtain the vectorized data of each SQL statement;
determining the table name of the data table involved in the SQL statement based on the text features of the SQL statement;
and acquiring historical data of the data table from a statistical table based on the table name of the data table, and taking the maximum value of the historical data as the data feature of the data table in the SQL statement.
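The text analysis step above can be approximated with regular expressions. This is a deliberately simplified sketch, not a full SQL parser: it only recognizes table names that directly follow `FROM` or `JOIN`, ignores comma-separated table lists and subqueries, and checks for a small, assumed set of aggregate function names.

```python
import re

# Assumed set of aggregate functions to look for.
AGGREGATES = ("count", "sum", "avg", "min", "max")
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
AGG_PATTERN = re.compile(r"\b(?:%s)\s*\(" % "|".join(AGGREGATES), re.IGNORECASE)


def text_features(sql: str) -> dict:
    """Extract table names, table count, and an aggregate-function flag."""
    tables = sorted({name.lower() for name in TABLE_PATTERN.findall(sql)})
    return {
        "table_names": tables,
        "table_count": len(tables),
        "has_aggregate": bool(AGG_PATTERN.search(sql)),
    }
```

The extracted table names are what the later steps use to look up historical data in the statistical table. A production implementation would more likely rely on a real SQL parser to handle aliases, nested queries, and dialect differences.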
Optionally, min-max normalization is applied to the maximum value of the historical data, and the normalized maximum value is used as the data feature of the data table in the SQL statement.
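Min-max normalization rescales values to the range [0, 1] as (x - min) / (max - min). A minimal sketch follows; the function name and the choice to return zeros when all values are equal are assumptions, since the text does not say how that edge case is handled.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant column carries no information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applied across the per-table maximum values in the training set, this keeps very large tables from dominating the other features numerically.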
It should be noted that the data of the statistical table is obtained by querying Hive table metadata, which is an efficient way to obtain record counts for the statistical table. Preferably, the statistical table data is collected during periods without interactive queries, such as the evening. Actual statistics can be computed for a sampled portion of the telecom data tables, while the figures for other types of data tables are estimated from data table characteristics such as service type and underlying file size (which takes little time to collect). Training the calculation engine selection model on the combination of data features, text features, and vectorized data further ensures the accuracy of its output.
Optionally, training the computing engine selection model based on the one-dimensional hidden layer vector representation, the text feature of each SQL statement, the data feature of the data table in each SQL statement, and the engine label corresponding to each SQL statement, including:
and training the calculation engine selection model with the one-dimensional hidden layer vector representation, the text features of each SQL statement, and the data features of the data table in each SQL statement as training data and the engine label corresponding to each SQL statement as the label value, using the distributed gradient boosting library (eXtreme Gradient Boosting, XGBoost), a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT), or the light gradient boosting machine (Light Gradient Boosting Machine, LightGBM).
XGBoost is efficient and flexible. In addition, training the calculation engine selection model with different types of algorithms yields models whose outputs differ in accuracy; during actual calculation engine selection, the model with the highest accuracy can be chosen from among the models corresponding to the different algorithms according to the accuracy of their actual results.
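Choosing among models trained with different algorithms can be sketched as a comparison of accuracy against observed results. The example assumes each candidate model's predictions and the actually fastest engines have already been collected; the function names and the sample data are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of statements for which the predicted engine was actually fastest."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def most_accurate(predictions_by_algorithm, actual):
    """Return the best-scoring algorithm name and all accuracy scores."""
    scores = {
        name: accuracy(pred, actual)
        for name, pred in predictions_by_algorithm.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

When two algorithms tie, `max` returns the first one in dictionary order; a real deployment might break ties on prediction latency or training cost instead.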
FIG. 3 shows the training flow of the calculation engine selection model. Original SQL samples, i.e., the SQL training set, are obtained. The SQL statements are vectorized based on NLP technology to obtain their vectorized data. The SQL statements are executed on the different computing engines, and the computing engine with the shortest execution time and its corresponding engine label are obtained. Text features and data features of the SQL statements are obtained based on a text analysis program; the text features comprise the number of data tables in the SQL statement and information indicating whether an aggregation function exists in the SQL statement, and the data features comprise the normalized maximum value of the historical data of the data table in the SQL statement. The engine labels and the vectorized data are combined and used to train an RNN/CNN neural network, yielding classifier 1 and the one-dimensional hidden layer vector representation corresponding to classifier 1. Classifier 1 is the first sub-model of the calculation engine selection model described above. The engine labels, the one-dimensional hidden layer vector representation, the text features, and the data features are then combined and used to train with XGBoost, yielding classifier 2, i.e., the second sub-model of the calculation engine selection model.
As shown in fig. 2, the embodiment of the present application further provides a computing engine selection method, which includes the following steps:
Step S21, acquiring a calculation engine selection request, and acquiring the SQL statement in the calculation engine selection request.
The execution subject of the calculation engine selection method acquires and responds to the calculation engine selection request, and at the same time acquires the SQL statement contained in the request.
Step S22, inputting the SQL statement into a calculation engine selection model, and acquiring first data output by the calculation engine selection model, wherein the first data indicates the computing engine with the shortest execution time for the SQL statement;
the calculation engine selection model is a model obtained by the training method of the calculation engine selection model in the above embodiment.
In this embodiment, the SQL statement is input into the calculation engine selection model obtained by the above training method, and the model outputs the corresponding first data. The first data is the engine label data used in the training method, so the corresponding computing engine can be selected according to the engine label. No manual attempts are needed, which greatly saves time cost and resource cost.
Optionally, step S22, inputting the SQL statement into a calculation engine selection model includes:
acquiring text features of the SQL statement, vectorized data and data features of a data table in the SQL statement;
inputting the vectorized data of the SQL statement into the calculation engine selection model to obtain a one-dimensional hidden layer vector representation;
and inputting the one-dimensional hidden layer vector representation, the text features of the SQL statement, and the data features of the data table in the SQL statement into the calculation engine selection model.
It should be noted that the calculation engine selection model of this embodiment includes a first sub-model and a second sub-model. The vectorized data of the SQL statement is input into the first sub-model to obtain a one-dimensional hidden layer vector representation. The one-dimensional hidden layer vector representation, the text features of the SQL statement, and the data features of the data table in the SQL statement are then input into the second sub-model to obtain the first data, which determines the computing engine with the shortest execution time. Because the calculation engine selection model determines the first data from the text features and vectorized data of the SQL statement together with the data features of the data table in the SQL statement, the first data can accurately indicate the computing engine with the shortest execution time.
Optionally, step S22, obtaining the data characteristics of the data table in the SQL statement includes:
determining a table name of a data table in the SQL sentence based on the text feature;
determining historical data of the data table based on the table name, wherein the historical data includes a distributed file system (Hadoop Distributed File System, HDFS) file size;
and determining the current data of the data table by adopting a time sequence prediction method for the historical data, wherein the current data is the data characteristics of the data table.
The current data of the data table is determined using an autoregressive moving average (Auto Regressive Moving Average, ARMA) algorithm or an autoregressive integrated moving average (Auto Regressive Integrated Moving Average, ARIMA) algorithm. In the calculation engine selection method of this embodiment, the historical data up to day T corresponding to the data type of the data table is obtained from the statistical table based on the table name and the data type of the data table in the SQL statement, and the above algorithm is applied to obtain the data for day T+1. Assuming that the hourly data increment is constant, the current data size of the data table can then be calculated from the current time, without the time-consuming step of obtaining real-time data, which improves the efficiency of the calculation engine selection method.
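As a rough illustration of this estimate, the sketch below fits a least-squares AR(1) model to daily table sizes, forecasts day T+1, and interpolates linearly by hour. The AR(1) fit is a deliberately simple stand-in for ARMA or ARIMA, and the function names and the interpolation between day T and the forecast for day T+1 are assumptions about how the constant hourly increment would be applied.

```python
from typing import List, Tuple


def fit_ar1(series: List[float]) -> Tuple[float, float]:
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    if len(series) < 3:
        raise ValueError("need at least three daily observations")
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # Constant history: fall back to predicting the mean.
        return mean_y, 0.0
    phi = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return mean_y - phi * mean_x, phi


def forecast_next_day(series: List[float]) -> float:
    """One-step-ahead forecast for day T+1 from history ending at day T."""
    c, phi = fit_ar1(series)
    return c + phi * series[-1]


def current_size(size_day_t: float, size_day_t1: float, hour_of_day: float) -> float:
    """Assume a constant hourly increment between day T and day T+1."""
    hourly_increment = (size_day_t1 - size_day_t) / 24.0
    return size_day_t + hourly_increment * hour_of_day
```

For a history of 100, 110, 120 and 130 (in any size unit), the fit gives a forecast of 140 for day T+1, so the estimate at noon is about 135, obtained without querying the file system.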
As shown in fig. 4, a calculation engine selection request is obtained and the SQL statement in it is extracted. Vectorized data of the SQL statement is obtained based on the NLP technique and input into classifier 1, which is the first sub-model of the calculation engine selection model, to obtain a one-dimensional hidden layer vector representation. Based on the text analysis method, the text features of the SQL statement are obtained, for example isSingle, isMult and isGroup, which indicate whether the statement involves a single table or multiple tables (that is, the number of data tables) and whether an aggregation function is used. The historical data of the data table in the SQL statement is obtained, time series fitting is performed with an ARMA model to obtain the current data, and the current data is normalized by min-max normalization to obtain the data features. The data features, the text features and the one-dimensional hidden layer vector representation are input into classifier 2, which is the second sub-model of the calculation engine selection model; classifier 2 outputs the first data, and the calculation engine with the shortest execution time is selected according to the first data.
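The text-analysis and normalization steps of fig. 4 might look roughly like the sketch below. It uses regular expressions rather than a real SQL parser, treats isGroup as "an aggregation function or GROUP BY is present" (an interpretation, since the text does not define the flag precisely), and does not handle comma-separated FROM lists or subqueries. All function names are hypothetical.

```python
import re
from typing import Dict, List

# Table references after FROM or JOIN; a simplification of real SQL parsing.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
# A few common aggregation functions followed by an opening parenthesis.
AGG_RE = re.compile(r"\b(?:count|sum|avg|min|max)\s*\(", re.IGNORECASE)
GROUP_RE = re.compile(r"\bgroup\s+by\b", re.IGNORECASE)


def table_names(sql: str) -> List[str]:
    """Return distinct table names in order of first appearance."""
    seen: List[str] = []
    for name in TABLE_RE.findall(sql):
        if name.lower() not in seen:
            seen.append(name.lower())
    return seen


def text_features(sql: str) -> Dict[str, int]:
    """Binary flags named after the isSingle / isMult / isGroup features."""
    tables = table_names(sql)
    aggregated = bool(AGG_RE.search(sql)) or bool(GROUP_RE.search(sql))
    return {
        "isSingle": int(len(tables) == 1),
        "isMult": int(len(tables) > 1),
        "isGroup": int(aggregated),
    }


def min_max(value: float, lowest: float, highest: float) -> float:
    """Min-max normalization of a table size into the range [0, 1]."""
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)
```

The resulting flags and the normalized size are the values that would be concatenated with the hidden layer vector before classifier 2.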
The embodiment of the application also provides a training device for the calculation engine selection model. In this embodiment, the training device is described with reference to fig. 5, taking as an example the case where the training device executes the training method of the calculation engine selection model. The training device 500 for the calculation engine selection model includes:
a first obtaining module 501, configured to obtain a structured query language SQL training set, and text features of each SQL statement in the SQL training set, vectorized data, and data features of a data table in each SQL statement;
the generating module 502 is configured to generate an engine label of each SQL statement, where the engine label is configured to indicate a computing engine with a shortest execution time corresponding to each SQL statement;
the training module 503 is configured to use the text feature and the vectorized data of each SQL statement and the data feature of the data table in each SQL statement as training data of a calculation engine selection model, and use an engine label corresponding to each SQL statement as a label value to train the calculation engine selection model;
the trained calculation engine selection model is used for outputting a calculation engine with shortest execution time of the target SQL statement according to the input target SQL statement.
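One way the generating module could produce an engine label is to time the same statement on every candidate engine and keep the fastest. The sketch below assumes each engine is exposed as a Python callable; the `runners` mapping and the function names are hypothetical, and a real setup would likely repeat runs and handle failures.

```python
import time
from typing import Callable, Dict


def time_engines(sql: str, runners: Dict[str, Callable[[str], object]]) -> Dict[str, float]:
    """Run the statement once on every engine and record wall-clock seconds."""
    timings: Dict[str, float] = {}
    for name, run in runners.items():
        start = time.perf_counter()
        run(sql)
        timings[name] = time.perf_counter() - start
    return timings


def engine_label(timings: Dict[str, float]) -> str:
    """Label a statement with the engine that had the shortest execution time."""
    if not timings:
        raise ValueError("no engine timings available")
    return min(timings, key=timings.get)
```

With measured timings of 12.5 s, 3.2 s and 7.0 s for three engines, `engine_label` returns the name of the second one, and that name becomes the label value for the statement.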
Optionally, the training module 503 is further configured to:
the vectorization data of each SQL sentence is input into the calculation engine selection model by taking the engine label corresponding to each SQL sentence as a label value, and one-dimensional hidden layer vector representation is obtained;
and training the calculation engine selection model by taking an engine label corresponding to each SQL statement as a label value based on the one-dimensional hidden layer vector representation, the text characteristic of each SQL statement and the data characteristic of a data table in each SQL statement.
Optionally, the first obtaining module 501 is further configured to:
performing text analysis on each SQL sentence in the SQL training set, and determining text characteristics of each SQL sentence, wherein the text characteristics comprise the number of data tables in the SQL sentence and information for representing whether an aggregation function exists in the SQL sentence;
based on a natural language processing NLP technology, text vectorization is carried out on each SQL sentence in the SQL training set so as to obtain vectorized data of each SQL sentence;
determining a table name of a data table involved in the SQL statement based on text features of the SQL statement;
and acquiring historical data of the data table from a statistical table based on the table name of the data table, and taking the maximum value of the historical data as the data characteristic of the data table in the SQL sentence.
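The vectorization and data-feature steps above can be illustrated with a small sketch. The tokenizer, the integer-id encoding and the in-memory `stats` dictionary standing in for the statistical table are all assumptions chosen for brevity; the embodiment only states that an NLP technique is used and that the maximum of the historical data is taken.

```python
import re
from typing import Dict, List

PAD_ID, UNK_ID = 0, 1


def tokenize(sql: str) -> List[str]:
    """Split a statement into identifiers, numbers and single punctuation marks."""
    return re.findall(r"[A-Za-z_][\w.]*|\d+|[^\s\w]", sql.lower())


def build_vocab(statements: List[str]) -> Dict[str, int]:
    """Assign an integer id to every token seen in the training set."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for sql in statements:
        for token in tokenize(sql):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(sql: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Encode a statement as a fixed-length id sequence (truncate or pad)."""
    ids = [vocab.get(token, UNK_ID) for token in tokenize(sql)][:length]
    return ids + [PAD_ID] * (length - len(ids))


def data_feature(table: str, stats: Dict[str, List[float]]) -> float:
    """Training-time data feature: the maximum of the table's historical sizes."""
    return max(stats[table])
```

A statement that mentions a table unseen during training is still encodable, because unknown tokens map to the `<unk>` id.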
Optionally, the training module 503 is further configured to:
and training with the distributed gradient boosting library XGBoost, the gradient boosting decision tree GBDT, or the light gradient boosting machine LightGBM, taking the one-dimensional hidden layer vector representation, the text characteristic of each SQL statement and the data characteristic of a data table in each SQL statement as training data and taking an engine label corresponding to each SQL statement as a label value.
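To show the idea behind this training step without depending on a particular library, the sketch below implements plain gradient boosting with squared-error loss and depth-one trees (stumps) for two engine labels encoded as 0 and 1. It is a toy stand-in, not XGBoost, GBDT as used in the embodiment, or LightGBM, and the feature layout in the usage note is hypothetical.

```python
from typing import List, Tuple

# (feature index, threshold, value if <= threshold, value if > threshold)
Stump = Tuple[int, float, float, float]


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimises squared error on the residuals."""
    mean = sum(residuals) / len(residuals)
    best: Stump = (0, float("inf"), mean, mean)  # constant fallback
    best_err = float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, t, lv, rv)
    return best


def stump_value(stump: Stump, row: List[float]) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_boosting(X: List[List[float]], y: List[int], rounds: int = 20, lr: float = 0.5):
    """Each round fits a stump to the current residuals and adds a damped correction."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_value(stump, row) for p, row in zip(preds, X)]
    return base, lr, stumps


def predict_label(model, row: List[float]) -> int:
    """Return engine label 1 if the boosted score reaches 0.5, else 0."""
    base, lr, stumps = model
    score = base + sum(lr * stump_value(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0
```

Each row passed to `fit_boosting` would hold the hidden layer values followed by the text features and the normalized data feature, with the engine label as the target; a production system would more likely call one of the named libraries with a multi-class objective.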
The training device for the calculation engine selection model can train the calculation engine selection model based on the text features of the SQL statements, the vectorized data, the data features of the data tables in the SQL statements and the engine labels corresponding to the SQL statements, so as to improve the accuracy of the output of the calculation engine selection model.
It should be noted that the training device for the calculation engine selection model provided in the embodiment of the present application can implement all technical processes of the training method of the calculation engine selection model and can achieve the same technical effects; to avoid repetition, details are not described again here.
The embodiment of the present application also provides a calculation engine selection device. In this embodiment, the calculation engine selection device 600 is described with reference to fig. 6, taking as an example the case where the device executes the calculation engine selection method. The calculation engine selection device 600 includes:
a second obtaining module 601, configured to obtain a calculation engine selection request, and obtain an SQL statement in the calculation engine selection request;
the selection module 602 is configured to input the SQL statement into a calculation engine selection model, and obtain first data output by the calculation engine selection model, where the first data is used to indicate a calculation engine with the shortest execution time for the SQL statement;
the calculation engine selection model is a model obtained by training based on the calculation engine selection model training method.
Optionally, the selection module 602 is further configured to:
acquiring text features of the SQL statement, vectorized data and data features of a data table in the SQL statement;
inputting the vectorization data of the SQL sentence into the calculation engine selection model, and obtaining a one-dimensional hidden layer vector representation;
and inputting the one-dimensional hidden layer vector representation, the text characteristics of the SQL sentence and the data characteristics of the data table in the SQL sentence into the calculation engine selection model.
Optionally, the selection module 602 is further configured to:
determining a table name of a data table in the SQL sentence based on the text feature;
determining historical data of the data table based on the table name, wherein the historical data comprises a distributed file system HDFS file size;
and determining the current data of the data table by adopting a time sequence prediction method for the historical data, wherein the current data is the data characteristics of the data table.
According to the calculation engine selection device provided by the application, the calculation engine with the shortest execution time can be automatically selected according to the input SQL statement, and manual operation is not needed.
It should be noted that, the computing engine selecting device provided in the embodiment of the present application can implement all technical processes of the computing engine selecting method, and can achieve the same technical effects, so that repetition is avoided, and no redundant description is provided herein.
The device in the embodiment of the application may be an electronic device, or may be a component in an electronic device, for example, an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the electronic device may be a mobile electronic device such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA); it may also be a non-mobile electronic device such as a server, a network attached storage (Network Attached Storage, NAS), a personal computer (PC), a television (TV), a teller machine or a self-service machine, which is not limited here.
Optionally, as shown in fig. 7, the embodiment of the present application further provides an electronic device 700, including a processor 701 and a memory 702, where the memory 702 stores a program or an instruction that can be executed on the processor 701, and the program or the instruction implements each step of the foregoing training method of the calculation engine selection model or the calculation engine selection method embodiment when executed by the processor 701, and the steps achieve the same technical effects, so that repetition is avoided and redundant description is omitted herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device described above.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements the training method of the calculation engine selection model or each process of the calculation engine selection method embodiment, and the same technical effect can be achieved, so that repetition is avoided, and no further description is provided herein.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium such as a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in a reverse order depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of training a computing engine selection model, the method comprising:
acquiring a Structured Query Language (SQL) training set, text features of each SQL statement in the SQL training set, vectorized data and data features of a data table in each SQL statement;
generating an engine label of each SQL sentence, wherein the engine label is used for indicating a computing engine with the shortest execution time corresponding to each SQL sentence;
taking the text characteristics and the vectorization data of each SQL sentence and the data characteristics of a data table in each SQL sentence as training data of a calculation engine selection model, taking an engine label corresponding to each SQL sentence as a label value, and training the calculation engine selection model;
the trained calculation engine selection model is used for outputting a calculation engine with shortest execution time of the target SQL statement according to the input target SQL statement.
2. The method for training a selection model of a computing engine according to claim 1, wherein the training the selection model of the computing engine by using text features of each of the SQL sentences, vectorized data, and data features of a data table in each of the SQL sentences as training data of the selection model of the computing engine, and using an engine tag corresponding to each of the SQL sentences as a tag value comprises:
the vectorization data of each SQL sentence is input into the calculation engine selection model by taking the engine label corresponding to each SQL sentence as a label value, and one-dimensional hidden layer vector representation is obtained;
and training the calculation engine selection model by taking an engine label corresponding to each SQL statement as a label value based on the one-dimensional hidden layer vector representation, the text characteristic of each SQL statement and the data characteristic of a data table in each SQL statement.
3. The method for training a selection model of a computing engine according to claim 1, wherein the obtaining the SQL training set, and the text feature of each SQL statement in the SQL training set, the vectorized data, and the data feature of the data table in each SQL statement, comprises:
performing text analysis on each SQL sentence in the SQL training set, and determining text characteristics of each SQL sentence, wherein the text characteristics comprise the number of data tables in the SQL sentence and information for representing whether an aggregation function exists in the SQL sentence;
based on a natural language processing NLP technology, text vectorization is carried out on each SQL sentence in the SQL training set so as to obtain vectorized data of each SQL sentence;
determining a table name of a data table involved in the SQL statement based on text features of the SQL statement;
and acquiring historical data of the data table from a statistical table based on the table name of the data table, and taking the maximum value of the historical data as the data characteristic of the data table in the SQL sentence.
4. A method of training a computing engine selection model according to any one of claims 1 to 3, wherein the training the computing engine selection model based on the one-dimensional hidden layer vector representation, text features of each of the SQL statements, data features of a data table in each of the SQL statements, and engine labels corresponding to each of the SQL statements comprises:
and training with the distributed gradient boosting library XGBoost, the gradient boosting decision tree GBDT, or the light gradient boosting machine LightGBM, taking the one-dimensional hidden layer vector representation, the text characteristic of each SQL statement and the data characteristic of a data table in each SQL statement as training data and taking an engine label corresponding to each SQL statement as a label value.
5. A computing engine selection method, the method comprising:
acquiring a calculation engine selection request, and acquiring an SQL sentence in the calculation engine selection request;
inputting the SQL sentence into a calculation engine selection model, and acquiring first data output by the calculation engine selection model, wherein the first data is used for indicating a calculation engine with shortest execution time for the SQL sentence;
wherein the calculation engine selection model is a model trained based on the method of any one of claims 1-4.
6. The computing engine selection method of claim 5, wherein the inputting the SQL statement into a computing engine selection model comprises:
acquiring text features of the SQL statement, vectorized data and data features of a data table in the SQL statement;
inputting the vectorization data of the SQL sentence into the calculation engine selection model, and obtaining a one-dimensional hidden layer vector representation;
and inputting the one-dimensional hidden layer vector representation, the text characteristics of the SQL sentence and the data characteristics of the data table in the SQL sentence into the calculation engine selection model.
7. The method of claim 6, wherein the obtaining the data characteristics of the data table in the SQL statement comprises:
determining a table name of a data table in the SQL sentence based on the text feature;
determining historical data of the data table based on the table name, wherein the historical data comprises a distributed file system HDFS file size;
and determining the current data of the data table by adopting a time sequence prediction method for the historical data, wherein the current data is the data characteristics of the data table.
8. A training apparatus for a computing engine selection model, the apparatus comprising:
the first acquisition module is used for acquiring a Structured Query Language (SQL) training set, text features of each SQL statement in the SQL training set, vectorized data and data features of a data table in each SQL statement;
the generating module is used for generating an engine label of each SQL sentence, and the engine label is used for indicating a computing engine with the shortest execution time corresponding to each SQL sentence;
the training module is used for training the calculation engine selection model by taking the text characteristics and the vectorization data of each SQL sentence and the data characteristics of the data table in each SQL sentence as training data of the calculation engine selection model and taking the engine label corresponding to each SQL sentence as a label value;
the trained calculation engine selection model is used for outputting a calculation engine with shortest execution time of the target SQL statement according to the input target SQL statement.
9. A computing engine selection device, the device comprising:
the second acquisition module is used for acquiring a calculation engine selection request and acquiring an SQL sentence in the calculation engine selection request;
the selection module is used for inputting the SQL sentence into a calculation engine selection model, and acquiring first data output by the calculation engine selection model, wherein the first data is used for indicating a calculation engine with the shortest execution time for the SQL sentence;
wherein the calculation engine selection model is a model trained based on the method of any one of claims 1-4.
10. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the method of training the calculation engine selection model of any one of claims 1 to 4 or the steps of the calculation engine selection method of any one of claims 5 to 7.
CN202310274395.3A 2023-03-21 2023-03-21 Training method of calculation engine selection model, calculation engine selection method and device Pending CN116303538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310274395.3A CN116303538A (en) 2023-03-21 2023-03-21 Training method of calculation engine selection model, calculation engine selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310274395.3A CN116303538A (en) 2023-03-21 2023-03-21 Training method of calculation engine selection model, calculation engine selection method and device

Publications (1)

Publication Number Publication Date
CN116303538A true CN116303538A (en) 2023-06-23

Family

ID=86802814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310274395.3A Pending CN116303538A (en) 2023-03-21 2023-03-21 Training method of calculation engine selection model, calculation engine selection method and device

Country Status (1)

Country Link
CN (1) CN116303538A (en)

Similar Documents

Publication Publication Date Title
CN111026842B (en) Natural language processing method, natural language processing device and intelligent question-answering system
US20230088171A1 (en) Method and apparatus for training search recommendation model, and method and apparatus for sorting search results
CN109299245B (en) Method and device for recalling knowledge points
CN106776575A (en) A kind of system and method for real-time semantic search working opportunity
CN114186084A (en) Online multi-mode Hash retrieval method, system, storage medium and equipment
CN114647713A (en) Knowledge graph question-answering method, device and storage medium based on virtual confrontation
CN113312468A (en) Conversation mode-based conversation recommendation method, device, equipment and medium
CN114546365A (en) Flow visualization modeling method, server, computer system and medium
KR20200131736A (en) Method and server for text classification using multi-task learning
CN116910357A (en) Data processing method and related device
CN116910201A (en) Dialogue data generation method and related equipment thereof
CN116303538A (en) Training method of calculation engine selection model, calculation engine selection method and device
CN108959327B (en) Service processing method, device and computer readable storage medium
CN112817560B (en) Computing task processing method, system and computer readable storage medium based on table function
Loh et al. Implementation of artificial intelligence chatbot in semiconductor manufacturing to optimize overall equipment effectiveness
CN114970666A (en) Spoken language processing method and device, electronic equipment and storage medium
CN111539529B (en) Event reasoning method and device
CN109684466B (en) Intelligent education advisor system
CN110333844B (en) Calculation formula processing method and device
US11842379B2 (en) Method and system for obtaining item-based recommendations
CN116361341B (en) Crowd-sourced circle selection method, crowd-sourced circle selection device, computer equipment and medium
CN113127509B (en) Method and device for adapting SQL execution engine in PaaS platform
CN112307227B (en) Data classification method
US20240152805A1 (en) Systems, methods, and non-transitory computer-readable storage devices for training deep learning and neural network models using overfitting detection and prevention
CN116737964B (en) Artificial intelligence brain system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination