WO2023185264A1 - 一种基数估计方法及装置 - Google Patents

一种基数估计方法及装置 Download PDF

Info

Publication number
WO2023185264A1
WO2023185264A1 PCT/CN2023/075191 CN2023075191W WO2023185264A1 WO 2023185264 A1 WO2023185264 A1 WO 2023185264A1 CN 2023075191 W CN2023075191 W CN 2023075191W WO 2023185264 A1 WO2023185264 A1 WO 2023185264A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
model
target model
target
column
Prior art date
Application number
PCT/CN2023/075191
Other languages
English (en)
French (fr)
Inventor
孙佶
李士福
张树杰
李国良
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023185264A1 publication Critical patent/WO2023185264A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation

Definitions

  • the present application relates to the field of databases, and in particular to a cardinality estimation method and device.
  • Cardinality estimation technology is widely used in databases. It is a technology that predicts the number of result rows of query instructions or query statements before execution. Based on the estimated cardinality of the query instruction, the cost of the execution plan of the query instruction can be estimated and the optimal execution plan can be selected, which is beneficial to improving the execution efficiency of the query instruction and plays an important role in improving the overall performance of the database.
  • the current cardinality estimation method has the problem of low accuracy of cardinality estimation.
  • This application discloses a cardinality estimation method and device, which can realize the accurate construction of a model for cardinality estimation and is conducive to improving the efficiency and accuracy of cardinality estimation.
  • this application provides a cardinality estimation method, which method includes: selecting a target model type from multiple model types according to target distribution characteristics and mapping relationship information, where the mapping relationship information includes the multiple model types.
  • the mapping relationship between the target distribution characteristics and the predicted distribution characteristics, the target distribution characteristics are extracted from the sample data collected based on the analysis instructions; according to the target model type, the target model corresponding to the target model type is obtained, the The target model is used to perform cardinality estimation on query instructions to the database.
  • the method is applied to a first device, which may be, for example, a server deployed on the network side, or a component or chip in the server.
  • the first device may be deployed in a cloud environment, or the first device may also be deployed in an edge environment.
  • the first device may be an integrated device or multiple distributed devices, which is not specifically limited here.
  • the analysis instructions are used to collect statistical information related to the contents of the data tables in the database, for example, multi-column data.
  • the analysis command can be the Analyze command of the database.
  • the database can be a Structured Query Language (SQL) database, a MySQL database, an Oracle database, or other databases.
  • the predicted distribution characteristics include distribution characteristics corresponding to multiple analysis instructions.
  • the distribution characteristics corresponding to the first analysis instruction are extracted from the first sample data collected based on the first analysis instruction.
  • the analysis instructions corresponding to the above target distribution characteristics may be included in the multiple analysis instructions, or may not be included in the multiple analysis instructions.
  • different analysis instructions correspond to different target models.
  • different target models means: the target models are of different types, or they belong to the same type but have different model parameters of the target models.
  • Different analysis instructions means: the number of columns in the two analysis instructions is different, or the identifiers of the columns in the two analysis instructions are different, or the number of columns in the two analysis instructions are different and the identifiers of the columns are also different.
  • the same analysis instruction received at different times may correspond to different target models.
  • the same analysis instruction may be sent by different clients at different times, or may be sent by the same client at different times.
  • the sample data collected twice may be different because it occurs at different times. For example, it may be caused by changes in historical query records and/or changes in the data table, and then the extracted distribution The features may also be different, so the final target model obtained may also be different. It can be understood that when the target model corresponding to the same analysis instruction changes, the first device can update the relevant information (for example, model parameters, etc.) of the latest target model corresponding to the analysis instruction in the memory and/or disk accordingly.
  • a target model type suitable for the distribution characteristics is selected from multiple model types based on the distribution characteristics extracted from the sample data, providing multiple model types suitable for various application scenarios and realizing model types under different distribution characteristics. adaptive selection. The accuracy of the target model used for cardinality estimation is improved, which is beneficial to improving the accuracy of cardinality estimation.
  • obtaining a target model corresponding to the target model type according to the target model type includes: obtaining a target model corresponding to the target model type according to the target model type and the sample data.
  • the acquisition of the target model is not only related to the target model type, but also related to the sample data, achieving the accurate construction of the target model.
  • obtaining the target model corresponding to the target model type includes: from the multiple models Determine the operator corresponding to the target model type among the operators corresponding to the type; use the operator corresponding to the target model type to create an untrained model corresponding to the target model type; train the untrained model according to the sample data The trained model is used to obtain the target model corresponding to the target model type.
  • the operators corresponding to each of the multiple model types mentioned above are the smallest calculation units in the database.
  • each model type has its own corresponding operator set in the database.
  • the operator corresponding to the model type is a Bayesian network operator
  • the model corresponding to the model type is a Bayesian network model.
  • the model type is Gaussian kernel function
  • the operator corresponding to this model type is Gaussian kernel function operator
  • the model corresponding to this model type is Gaussian kernel function model.
  • the operators corresponding to each model type can also be called machine learning operators.
  • the operator corresponding to any one of the multiple model types has at least one of the following characteristics:
  • the operator corresponding to the model type is a native node in the execution plan of the database
  • Operators corresponding to the model types can directly access storage and data in the database.
  • the operator corresponding to the target model type also has at least one of the above three characteristics.
  • the data used by the untrained model corresponding to the target model type during training is related to the target model type.
  • an untrained model corresponding to the target model type is trained based on part of the sample data, which includes the following two situations: (1) When the target model type is Bayesian network, autoregressive and sum-product probability In any of the above types, train the untrained model based on part of the data in the sample data, where part of the data in the sample data refers to data collected from at least one data table in the database based on the analysis instructions, that is, the analysis instructions Corresponding multi-column data; (2) When the target model type is any type of mixed uniform distribution and Gaussian kernel function, train the untrained model based on part of the data in the sample data, where part of the data in the sample data is Refers to the data collected from historical query records based on analysis instructions, that is, the historical query instruction data corresponding to the analysis instructions.
  • an untrained model corresponding to the target model type is trained based on all the data in the sample data.
  • the target model type can be a neural network
  • all the data in the sample data refers to historical query records according to the analysis instructions.
  • data sampled from at least one data table in the database is provided.
  • the training of the untrained model corresponding to the target model type and the use of the target model may occur on the same device, for example, the first device.
  • the target model for cardinality estimation is obtained based on the sample data and the operator corresponding to the target model type in the database, which improves the efficiency of model construction.
  • using different data in the sample data to achieve targeted training of the target model will help improve the accuracy of the built model.
  • the sample data is collected from historical query records and at least one data table in the database based on the analysis instructions.
  • Historical query records include multiple historical query instructions input by the user.
  • the query instruction may be a query on a single table (or base table) or a query on a multi-table connection.
  • the query instruction includes query conditions, for example, the WHERE condition in the query instruction.
  • the query conditions include predicates and column identification information. The identifiers of columns in the query conditions are used to indicate the columns to be queried.
  • the predicates include at least one of like, between, is NULL/is NOT NULL, in, exists, and comparison operators. kind.
  • the collection of sample data comprehensively considers the collection of column data and historical query load, increasing the diversity of sample data.
  • the sample data includes multi-column data corresponding to the analysis instructions and historical query instruction data corresponding to the analysis instructions.
  • the target distribution characteristics include a first characteristic and a second characteristic, where the first characteristic is used to indicate the distribution of the above-mentioned multi-column data. Characteristics, the second characteristic is used to indicate the distribution characteristics of columns in historical query instruction data.
  • the first feature is extracted based on the multi-column data corresponding to the analysis instruction
  • the second feature is extracted based on the historical query instruction data corresponding to the analysis instruction.
  • the first feature includes at least one of the following features:
  • the correlation information between the columns in the multi-column data includes at least one of the following: the average value of the correlation coefficients between the columns in the multi-column data and the average value of the correlation coefficients between the columns in the multi-column data. maximum value.
  • the second feature includes at least one of the following features:
  • the column combination coverage information is used to indicate the ratio of the number of column combinations existing in the historical query instruction data to the total number of column combinations obtained based on the columns in the analysis instruction. For example, assume that there are three historical query instructions in the historical query instruction data. The column combinations corresponding to these three historical query instructions are: ⁇ c1,c2,c3 ⁇ , ⁇ c1,c2 ⁇ and ⁇ c2,c3 ⁇ , that is, the historical query instructions The number of existing column combinations in the data is 3. Assume that the analysis instruction is "Analyze(c1,c2,c3,c4)", and combine at least 2 of the columns c1, c2, c3, c4 in the analysis instruction to obtain The total number of column combinations is 11, so the column combinations in this historical query command The coverage information is
  • the column query range coverage information corresponding to the historical query instruction is used to indicate the ratio of the column query range of the historical query instruction to the total column query range corresponding to each column in the historical query instruction.
  • the column query similarity information is used to indicate the similarity of the column query ranges in any two historical query instructions.
  • the hit rate of the historical query load that is, column combination coverage
  • Range queries such as column query range coverage, columns of different historical query instructions, are also considered.
  • the similarity between query ranges enables the characterization of the distribution characteristics of historical query instruction data from multiple dimensions.
  • the distribution characteristics extracted from the sample data are not only fine, but also accurate, and can accurately express the distribution characteristics of column data and the distribution characteristics of historical query loads.
  • the sample data collection process specifically includes: sending sampling instruction information to at least one data node, and the sampling instruction information is used to instruct at least one data node according to The connection relationship between multiple data tables collects the data of the columns in the analysis instructions from multiple data tables; collects the historical query instructions related to the columns in the analysis instructions in the historical query records.
  • connection relationship between multiple data tables can be obtained through the primary and foreign key information of each data table.
  • the primary key of a data table is the unique identifier of a record, and the foreign key of a data table is used to associate another data table.
  • a data table has a primary key and at least one foreign key.
  • the foreign key of one data table can be the primary key of another data table.
  • the multiple columns of data corresponding to the analysis instructions in the sample data are collected by at least one data node and sent to the first device, and the historical query instruction data corresponding to the analysis instructions in the sample data are collected by the first device itself.
  • mapping relationship information is obtained based on training of an artificial intelligence algorithm.
  • the artificial intelligence algorithm can be a decision tree (Decision Tree), a random forest (Random Forest, RF), a support vector machine (Support Vector Machine, SVM) or other classification algorithms, which are not specifically limited here.
  • the device for using the artificial intelligence algorithm and the training device for the artificial intelligence algorithm may be different devices.
  • the device for using the artificial intelligence algorithm is a first device, and the device for training the artificial intelligence algorithm is independent of the first device. of another device.
  • the artificial intelligence algorithm obtains the above-mentioned mapping relationship information through training and uses the mapping relationship information to determine the target model type do not occur at the same time.
  • the mapping relationship information is obtained through training first, and then the mapping relationship information is used to determine the target model type. In this way, by using another training device to undertake the task of training to obtain the mapping relationship information, the load of the first device can be reduced.
  • the training and use of the artificial intelligence algorithm can also occur on the same device.
  • the training and use of the artificial intelligence algorithm both occur on the first device.
  • model parameters of the target model are stored in memory and/or disk.
  • model parameters of the target model when storing the model parameters of the target model to disk, they need to be serialized first to unify them into a format that can be stored persistently. Therefore, when the model parameters of the target model are read from the disk, the read model parameters need to be deserialized to restore the original format of the model parameters of the target model.
  • the first device after the first device performs cardinality estimation and reads the model parameters of the target model from the disk, the first device can also store the model parameters of the target model into the memory so that the same model can be read next time.
  • the model parameters of the target model can be obtained directly from the memory. This avoids the frequent deserialization operations caused by reading the model parameters from the disk, which is equivalent to realizing the pre-reading of the model parameters, effectively improving the reading of the target model parameters. Get efficiency.
  • the number of target models with model parameters stored in the memory is also limited. It may happen that the current memory is full but the model parameters of the target model to be called are not stored in the memory.
  • part of the model parameters of the target model in the memory can be deleted periodically or when the memory is full. For example, the deleted target model satisfies: The interval between the last time the model parameters of the target model were accessed and the current time is longer than the preset time. In this way, the model parameters of the target model to be called subsequently can be obtained directly from the memory as much as possible, or the model parameters of the target model to be called can be stored in the memory after being read from the disk.
  • the plurality of model types include at least two of the following types: Bayesian network, autoregressive, mixed uniform distribution, Gaussian kernel function, sum-product probability, and neural network.
  • this application provides a cardinality estimation device, which device includes: a type selection unit, a type selection unit configured to select a target model type from multiple model types based on target distribution characteristics and mapping relationship information, said The mapping relationship information includes the mapping relationship between the multiple model types and the predicted distribution characteristics.
  • the target distribution characteristics are extracted from the sample data collected based on the analysis instructions; a model establishment unit is used to calculate the target distribution characteristics according to the target Model type, obtain the target model corresponding to the target model type, and the target model is used to perform cardinality estimation on the query instructions of the database.
  • the model establishment unit is specifically configured to obtain a target model corresponding to the target model type according to the target model type and the sample data.
  • the model establishment unit is specifically configured to: determine the operator corresponding to the target model type from the operators corresponding to the multiple model types; use the operator corresponding to the target model type to create the target An untrained model corresponding to the model type; train the untrained model according to the sample data to obtain a target model corresponding to the target model type.
  • the sample data is collected from historical query records and at least one data table in the database based on the analysis instructions.
  • the sample data includes multi-column data corresponding to the analysis instruction and historical query instruction data corresponding to the analysis instruction, and the target distribution characteristics include a first characteristic and a second characteristic, wherein the first The feature is used to indicate the distribution characteristics of the multiple columns of data, and the second feature is used to indicate the distribution characteristics of the columns in the historical query instruction data.
  • the first feature includes at least one of the following features:
  • the second feature includes at least one of the following features:
  • the device also includes a data collection unit.
  • the data collection unit is specifically configured to: send sampling instruction information to at least one data node, and the sampling instruction information is Instructing at least one data node to collect the data of the column in the analysis instruction from multiple data tables according to the connection relationship between the multiple data tables; collecting historical query instructions related to the column in the analysis instruction in the historical query record.
  • mapping relationship information is obtained based on training of an artificial intelligence algorithm.
  • model parameters of the target model are stored in memory and/or disk.
  • the above-mentioned plurality of model types include at least two of the following types: Bayesian network, autoregressive, mixed uniform distribution, Gaussian kernel function, sum-product probability, and neural network.
  • the present application provides a device, which includes a processor and a memory, wherein the memory is used to store program instructions; the processor calls the program instructions in the memory, causing the device to execute the first aspect or the second aspect. method in any possible implementation of an aspect.
  • the present application provides a computer-readable storage medium, including computer instructions.
  • the computer instructions When the computer instructions are executed by a processor, the above-mentioned first aspect or any possible implementation of the first aspect is implemented. method.
  • the present application provides a computer program product that, when executed by a processor, implements the method in the above first aspect or any possible embodiment of the first aspect.
  • the computer program product for example, can be a software installation package. If it is necessary to use the method provided by any of the possible designs of the first aspect, the computer program product can be downloaded and executed on the processor. , to implement the method in the first aspect or any possible embodiment of the first aspect.
  • Figure 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic block diagram of a coordination node provided by an embodiment of the present application.
  • Figure 3 is a flow chart of a cardinality estimation method provided by an embodiment of the present application.
  • Figure 4 is a flow chart of yet another cardinality estimation method provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a cardinality estimation device provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of yet another cardinality estimation device provided by an embodiment of the present application.
  • prefixes such as “first” and “second” are used in this application only to distinguish different description objects, and there are no restrictions on the position, order, priority, quantity or content of the described objects. effect.
  • the objects modified by different prefix words can be the same or different. For example, if the described object is “device”, then “first device” and “second device” It can be the same device, the same type of device, or different types of devices; for another example, if the described object is "information”, then the "first information” and “second information” can be information with the same content or information with different contents. .
  • the use of prefixes used to distinguish the described objects in the embodiments of this application does not constitute a limitation on the described objects. For the description of the described objects, please refer to the claims or the context description in the embodiments. The use of such words should not be used. Prefix words constitute redundant restrictions.
  • references such as "at least one (or at least one) of a1, a2, ... and an” are used, including any one of a1, a2, ... and an.
  • the situation that exists alone also includes any combination of any number of a1, a2,... and an.
  • Each situation can exist alone.
  • the description of "at least one of a, b, and c" includes a alone, b alone, c alone, a combination of a and b, a combination of a and c, a combination of b and c, or a combination of abc. Condition.
  • the cardinality estimation method is often used to estimate the cardinality of the query command to select the least costly execution plan from multiple execution plans of the query command, thereby improving the query command execution efficiency and database query performance.
  • the coordination node of the database After monitoring the analysis instructions sent by the client, the coordination node of the database creates a statistical information analysis thread (or simply an analyzer).
  • the analysis instructions include the identifier of the column.
  • the columns all belong to the same data table (i.e. base table); the analyzer samples the base table according to the columns in the analysis instruction to obtain the column data corresponding to the columns in the analysis instruction, and analyzes the column data to obtain simple statistics Information, for example, when performing high-frequency statistical operations on column data, simple statistics include the high-frequency values in the column data, or when performing correlation analysis on column data, simple statistics include among the columns in the column data correlation coefficient between.
  • the coordination node receives the query instruction sent by the client, generates multiple execution plans based on the query instruction, and estimates the cardinality of each execution plan in the multiple execution plans.
  • execution plan 1 determines the matching execution plan 1.
  • the cardinality estimation result of each execution plan in multiple execution plans can be used to select an optimal execution plan from the multiple execution plans, so that the coordination node can execute the query instruction based on the optimal execution plan.
  • the simple statistical information used for cardinality estimation is relatively coarse-grained and single, and its application scenarios are also limited. For example, it can only achieve multi-column cardinality estimation of a base table, and the accuracy of the estimated cardinality is also low.
  • embodiments of the present application propose a cardinality estimation method, which can not only improve the accuracy of cardinality estimation, but also help improve the efficiency of cardinality estimation.
  • Figure 1 exemplarily provides a schematic diagram of a system architecture. This system is used for the estimation of cardinality.
  • the system at least includes a client, a coordination node and a data node.
  • the client and the coordination node are connected in a wired or wireless manner, and the coordination node and the data node are connected in a wired manner.
  • the client is used to send analysis instructions to the coordination node.
  • the analysis instructions include identification of multiple columns.
  • the analysis instructions are the Analyze command.
  • the data node serves as a physical storage node for user data, and the data node includes at least one node (for example, data node 1 and data node 2 in Figure 1).
  • Data nodes store data in the database, such as data tables.
  • the data node can manage and operate the data it stores according to the instructions of the coordination node. For example, the data node can collect and analyze the data of the column in the instruction from at least one data table it stores based on the sampling instruction information received from the coordination node. .
  • the coordination node is used to perform model type selection, model creation, training, and storage in response to analysis instructions, and the coordination node is also used to perform cardinality estimation using the cardinality estimation model.
  • the coordination node may be, for example, a server deployed on the network side, or a component or chip in the server.
  • the coordination node can be deployed in a cloud environment, that is, a cloud computing server, or the coordination node can also be deployed in an edge environment, that is, an edge computing server.
  • the coordination node may be an integrated device or multiple distributed devices, which is not specifically limited in the embodiment of this application.
  • the coordination node is mainly used to: receive analysis instructions from the client; sample according to the columns in the analysis instructions to obtain sample data; in the database, based on the sample data and the target distribution characteristics extracted from the sample data, Obtain the cardinality estimation model.
  • the coordination node can also receive a query instruction sent by the client, perform cardinality estimation on the query instruction using a cardinality estimation model, and the obtained cardinality is used to indicate the estimated number of result rows after the query instruction is executed.
  • the communication between the client and the coordination node, between the coordination node and the data node, and between the data node and the data node usually uses a local area network or a dedicated wired network.
  • Figure 1 is only an exemplary architecture diagram, but does not limit the number of network elements included in the system shown in Figure 1 .
  • FIG. 1 may also include other functional entities.
  • the method provided by the embodiment of the present application can be applied to the communication system shown in Figure 1.
  • the method provided by the embodiment of the present application can also be applied to other communication systems, which is not limited by the embodiment of the present application.
  • the coordination node includes a port listening daemon, an analyzer, a query optimizer and a query executor.
  • the port listening daemon is used to listen for analysis instructions (or statistical information creation instructions).
  • the analysis instructions can be sent by the client in Figure 1.
  • the analyzer is used to respond to the analysis instructions listened to by the port listening daemon and execute Sampling of sample data and type selection, model creation, training and storage of the cardinality estimation model.
  • the query optimizer can use the obtained cardinality estimation model to perform cardinality estimation.
  • the query executor can provide the analyzer with an operator in the database.
  • the operator Can be used to create cardinality estimation models.
  • the database can be a Structured Query Language (SQL) database, a MySQL database, an Oracle database, or other databases.
  • SQL Structured Query Language
  • the analyzer specifically includes a data sampling module, a model selection module and a model creation module, wherein the data sampling module is used to obtain sample data from historical query records and at least one data table in the database according to the analysis instructions, and the model selection module Used to extract target distribution features from sample data; select a target model type from multiple model types based on the extracted target distribution features.
  • the model creation module is used to obtain a target model corresponding to the target model type based on the target model type.
  • the target model is the model used for cardinality estimation, or is called the cardinality estimation model.
  • the analyzer uses the operator corresponding to the target model type in the database provided by the above query executor in the process of obtaining the cardinality estimation model.
  • This operator is the smallest calculation unit in the database.
  • the operator is a new intelligent operator set in the database.
  • This operator is related to the target model type.
  • this operator includes but is not limited to Bayesian network operator, autoregressive model operator, mixed uniform distribution model operator, Gaussian kernel function model operator, neural network model operators, sum-product probability model operators, etc.
  • the operator corresponding to the target model type has at least one of the following characteristics:
  • the operator corresponding to the target model type is the native node in the execution plan of the database
  • the operator corresponding to the target model type can be optimized in conjunction with the optimizer in the database;
  • Operators corresponding to the target model type can directly access storage and data in the database.
  • the query optimizer includes a cardinality intelligent estimation module that can use the model creation module in the analyzer
  • the output cardinality estimation model performs cardinality estimation on the query conditions in the query instruction to obtain the estimated cardinality.
  • the query optimizer further includes an execution plan generation module and an execution plan selection module, where the execution plan generation module is used to generate multiple candidate execution plans according to the query instructions, where each execution plan is Cardinality estimation of the cardinality intelligent estimation module, the execution plan selection module is used to select the optimal execution plan from multiple candidate execution plans based on the cardinality estimation result of each execution plan, and output the optimal execution plan to query execution
  • the query executor executes the query instruction according to the optimal execution plan.
  • FIG. 2 is only an example of a block diagram of a coordination node provided by the embodiment of the present application, and the coordination node is not limited to only the form shown in FIG. 2 .
  • the coordination node may also include more or fewer modules than shown in Figure 2, which is not specifically limited here.
  • FIG 3 is a flow chart of a cardinality estimation method provided by an embodiment of the present application.
  • This method is applied to the first device, which may be the coordination node in the above-mentioned Figure 1.
  • the coordination node in the above-mentioned Figure 1.
  • the method includes but is not limited to the following steps:
  • S101 Sampling according to the columns in the analysis instruction to obtain sample data.
  • the first device receives an analysis instruction from the client, where the analysis instruction includes identifiers of multiple columns to be collected.
  • the column ID indicates the column in the data table.
  • the client may be the client in Figure 1.
  • the analysis instruction may be an Analyze command, and the Analyze command may be used to collect statistical information related to the contents of the data table in the database, for example, multi-column data.
  • the analysis instructions also include an identification of the data table. Based on the number of identifiers of the data table in the analysis instruction, it can be determined whether the multiple columns to be collected in the analysis instruction are included in the same data table. When the analysis instruction includes the identifier of a data table, it means that multiple columns in the analysis instruction are included in the same data table. When the analysis instruction includes the identifiers of multiple data tables, it means that the multiple columns in the analysis instruction are included in different data tables. In this case, the analysis instruction also indicates the attribution between the identifiers of the columns and the identifiers of the data tables. relation.
  • analysis instruction 1 is Analyze(tabel_1(c1,c2,c3);tabel_2(c4,c5)), it can be seen that the multiple columns in this analysis instruction include ⁇ c1,c2,c3,c4,c5 ⁇ , the analysis instruction also includes the identification of two data tables, namely table 1tabel_1 and table 2tabel_2. Among them, columns c1, c2 and c3 are all from table 1, and columns c4 and c5 are all from table 2.
  • the sample data includes multi-column data corresponding to the analysis instruction and historical query instruction data corresponding to the analysis instruction.
  • the multi-column data corresponding to the analysis instruction includes the data where each column in the analysis instruction is located.
  • the analysis instruction corresponds to
  • the historical query instruction data includes historical query instructions related to columns in the analysis instructions.
  • the multi-column data corresponding to the analysis instruction is the five columns of data: c1, c2, c3, c4 and c5; for the above collected
  • the historical query command satisfies: the columns in the historical query command are all included in the columns in the analysis command.
  • the possible combinations of columns include the following At least one of the conditions: ⁇ c1,c2 ⁇ , ⁇ c1,c3 ⁇ , ⁇ c1,c4 ⁇ , ⁇ c2,c3 ⁇ , ⁇ c2,c4 ⁇ , ⁇ c3,c4 ⁇ , ⁇ c1,c2,c3 ⁇ , ⁇ c1,c2,c4 ⁇ , ⁇ c2,c3,c4 ⁇ , ⁇ c1,c3,c4 ⁇ and ⁇ c1,c2,c3,c4 ⁇ .
  • historical query command 1 Take historical query command 1 as an example to specifically illustrate the combination of columns in historical query command 1.
  • the column combination method in this historical query command is ⁇ c2,c3 ⁇ .
  • sampling according to the columns in the analysis instruction includes: sending first sampling instruction information to at least one data node, and the first sampling instruction information is Instructing at least one data node to collect the data of the column in the analysis instruction from the data table where the column is located; collecting historical query instructions related to the column in the analysis instruction in the historical query records.
  • sampling according to the columns in the analysis instruction includes: sending second sampling instruction information to at least one data node, the second sampling instruction information Used to instruct at least one data node to collect the data of the columns in the analysis instruction from multiple data tables based on the connection relationship between the multiple data tables; collect historical queries related to the columns in the analysis instruction in the historical query records instruction.
  • the connection relationship between multiple data tables can be obtained through the primary and foreign key information of each data table.
  • the primary key of a data table is the unique identifier of a record
  • the foreign key of a data table is used to associate another data table.
  • a data table has a primary key and at least one foreign key.
  • the foreign key of one data table can be the primary key of another data table.
  • a record includes "student number-name-age". The student number can uniquely identify a student, so the student number is the primary key, and the name and/or age can be used as foreign keys.
  • Each data node can obtain the connection relationship between multiple data tables based on the primary and foreign key information of each data table, and gradually sample multiple data tables based on the connection relationship between multiple data tables. For example, randomly select a data table from multiple data tables as the sampling starting point to collect relevant column data in the data table; then find the next data table connected to the data table based on the primary foreign key information and collect relevant column data from it. According to this method, uniform sampling of multiple data tables is achieved, which improves the collection efficiency of associated data in the data tables.
  • At least one data node After at least one data node completes collection, at least one data node sends the data it collected to the first device. It should be noted that the data collected by at least one data node is the multi-column data corresponding to the analysis instructions in the sample data mentioned above.
  • A11 When the port listening daemon of the coordination node listens to the analysis instructions sent by the client, it creates an analyzer.
  • A12 The data sampling module in the analyzer determines whether the analysis instruction includes the identification of multiple data tables. When the data sampling module determines that the analysis instruction only includes the identification of one data table, it sends the first sample to at least one data node in Figure 1 Instruction information to collect the data where the column in the analysis instruction in the data table is located; when the data sampling module determines that the analysis instruction includes the identifiers of multiple data tables, send second sampling instruction information to at least one data node in Figure 1 to collect The data for the columns in the analysis instructions in multiple data tables.
  • the first sampling instruction information and the second sampling instruction information reference may be made to the above related descriptions and will not be described again here.
  • At least one data node After at least one data node completes data collection, at least one data node will also return the collected data to the coordination node.
  • A13 The data sampling module in the analyzer also collects historical query instructions related to the columns in the analysis instructions in the historical query records.
  • S102 In the database, obtain a cardinality estimation model based on the sample data and the target distribution characteristics extracted from the sample data.
  • the sample data includes multi-column data corresponding to the analysis instruction and historical query instruction data corresponding to the analysis instruction.
  • the multiple columns of data corresponding to the analysis instructions are collected by at least one of the above-mentioned data nodes and then sent to the first device.
  • the historical query command data corresponding to the analysis command is collected by the first device itself.
  • the target distribution characteristics include a first characteristic and a second characteristic, where the first characteristic is corresponding to the analysis instruction.
  • the first feature is extracted from the multi-column data corresponding to the analysis instruction.
  • the first feature is used to indicate the distribution characteristics of the multi-column data corresponding to the analysis instruction.
  • the second feature is extracted based on the historical query instruction data corresponding to the analysis instruction.
  • the second feature is used to indicate the historical query. Distribution characteristics of columns in instruction data.
  • the first feature includes at least one of the following features:
  • the correlation information between the columns in the multi-column data includes at least one of the average value of the correlation coefficients between the columns in the multi-column data and the maximum value of the correlation coefficient between the columns in the multi-column data.
  • the second feature includes at least one of the following features:
  • the column combination coverage information is used to indicate the ratio of the number of column combinations existing in the historical query instruction data to the total number of column combinations obtained based on the columns in the analysis instruction.
  • the number of existing column combinations in the historical query command data is 3. These three column combinations are ⁇ c1, c2, c3 ⁇ , ⁇ c1,c2 ⁇ and ⁇ c2,c3 ⁇ .
  • the analysis instruction is "Analyze(c1,c2,c3,c4)"
  • there are 11 possible column combinations namely ⁇ c1,c2 ⁇ , ⁇ c1,c3 ⁇ , ⁇ c1,c4 ⁇ , ⁇ c2 ,c3 ⁇ , ⁇ c2,c4 ⁇ , ⁇ c3,c4 ⁇ , ⁇ c1,c2,c3 ⁇ , ⁇ c1,c2,c4 ⁇ , ⁇ c2,c3,c4 ⁇ , ⁇ c1,c3,c4 ⁇ and ⁇ c1 ,c2,c3,c4 ⁇
  • the column combination coverage information in the historical query instruction data is obtained by calculation:
  • the column query range coverage information corresponding to the historical query instruction is used to indicate the ratio of the column query range of the historical query instruction to the total column query range corresponding to each column in the historical query instruction.
  • the total column query range corresponding to each column is based on The maximum and minimum values of the values in each column are determined.
  • the column query range coverage information corresponding to this historical query instruction is obtained through calculation:
  • the column query similarity information is used to indicate the similarity of the column query ranges in any two historical query instructions.
  • obtaining a cardinality estimation model based on the sample data and the target distribution characteristics extracted from the sample data includes: selecting a target model from multiple model types based on the target distribution characteristics extracted from the sample data. Type; according to the target model type and sample data, obtain the target model corresponding to the target model type.
  • multiple model types include at least two of the following types: Bayesian network (Bayesian network), autoregressive (Autoregressive), mixed uniform distribution (Mixed Uniform Distribution), Gaussian kernel function (Radial Basis Function, RBF) , sum-product probability (Sum-Product probability) and neural networks (Neural Networks, ANNs).
  • Bayesian network Bayesian network
  • autoregressive Autoregressive
  • Mixed uniform distribution Mated Uniform Distribution
  • Gaussian kernel function Gaussian kernel function
  • sum-product probability Sum-product probability
  • neural networks Neural Networks, ANNs
  • obtaining the target model corresponding to the target model type includes: determining the target model type from the operators corresponding to the multiple model types. Corresponding operator; use the operator corresponding to the target model type to create an untrained model corresponding to the target model type; train the untrained model based on the sample data to obtain the target model corresponding to the standard model type.
  • the operators corresponding to each model type are the smallest computing units in the database.
  • the operators corresponding to the above multiple model types include but are not limited to Bayesian network operators, autoregressive model operators, mixed uniform distribution model operators, Gaussian kernel function model operators, neural network model operators, Sum-product probability model operators, etc.
  • the untrained model corresponding to the target model type is an untrained Bayesian network model
  • the target model corresponding to the target model type is a Bayesian network model, that is, training
  • a good Bayesian network model in this case, is the cardinality estimation model described above.
  • the model type of the base estimation model can be determined based on the target distribution characteristics extracted from the sample data.
  • the model type of the cardinality estimation model may be indicated by type indication information.
  • the type indication information may use bit mapping, binary value or other methods to indicate the model type of the cardinality estimation model.
  • Table 1 exemplarily provides a mapping table between the value of the type indication information and the model type of the cardinality estimation model. It can be seen from Table 1 that when the value of the type indication information is the first value, it indicates that the model type of the cardinality estimation model is Bayesian network; when the value of the type indication information is the second value, it indicates the model of the cardinality estimation model. The type is autoregressive; when the value of the type indication information is the third value, it indicates that the model type of the base estimation model is mixed uniform distribution,..., which is not specifically limited here.
  • Table 1 is only used as an example to reflect the correspondence between the value of the type indication information and the model type of the cardinality estimation model.
  • the text content and storage method of the correspondence can also be other.
  • the form is not specifically limited here.
  • selecting a target model type from multiple model types based on the target distribution characteristics extracted from the sample data includes: selecting a target model type from multiple model types based on the target distribution characteristics and mapping relationship information. , where the mapping relationship information includes mapping relationships between the above-mentioned multiple model types and predicted distribution characteristics.
  • the mapping relationship information can be obtained based on training of artificial intelligence algorithms.
  • the artificial intelligence algorithm can be a decision tree (Decision Tree), a random forest (Random Forest, RF), a support vector machine (Support Vector Machine, SVM) or other classification algorithms, which are not specifically limited here.
  • the predicted distribution characteristics include distribution characteristics corresponding to multiple analysis instructions.
  • the distribution characteristics corresponding to the first analysis instruction are extracted from the first sample data collected based on the first analysis instruction.
  • the analysis instructions corresponding to the above target distribution characteristics may be included in the multiple analysis instructions, or may not be included in the multiple analysis instructions.
  • the training samples of the artificial intelligence algorithm can be obtained based on expert experience.
  • the training process of the artificial intelligence algorithm can be, for example: assuming that the mapping relationships between known multiple sets of distribution features and multiple model types are obtained based on expert experience.
  • distribution feature 1 corresponds to model type 1, and distribution feature 1 is input into the artificial intelligence algorithm.
  • the intelligent algorithm performs classification, obtains the output result, and adjusts the parameters of the artificial intelligence algorithm based on the simulation error between the output result and model type 1. In this way, after training with a large number of training samples, the trained artificial intelligence algorithm can be based on the input distribution characteristics, and accurately output the model type corresponding to the distribution characteristics.
  • the artificial intelligence algorithm when the extracted distribution characteristics are input to the artificial intelligence algorithm, the artificial intelligence algorithm can learn from multiple predictions based on the mapping relationship between the predicted distribution characteristics and the predicted model type. Select the appropriate target model type for this distribution feature in the model type, that is, the artificial intelligence algorithm can output the above type indication information.
  • the extraction method of distribution features in training samples for artificial intelligence algorithm training can be: first obtain multiple analysis instructions based on expert experience, and the distribution characteristics corresponding to each analysis instruction are obtained from historical query records and databases based on the analysis instructions. extracted from the data sampled in at least one data table in .
  • the specific content of the distribution feature corresponding to the analysis instruction may refer to the specific content of the above-mentioned first feature and second feature.
  • the artificial intelligence algorithm used by the coordination node to select the target model type may be pre-trained.
  • the training of the artificial intelligence algorithm may not be completed within the coordination node, that is, the device using the artificial intelligence algorithm and the training device of the artificial intelligence algorithm may be different devices.
  • the device using the artificial intelligence algorithm is the first device and the training device is another device independent of the first device.
  • mapping relationship information is first obtained through training, and then the mapping relationship information is used to determine the target model type. That is to say, the mapping relationship information obtained through training occurred in the past time and space, and the mapping relationship information is used to determine the target model type in the present. occurs in time and space.
  • another training device undertakes the task of training to obtain mapping relationship information, which can reduce the load of the first device.
  • the device for using the artificial intelligence algorithm and the device for training the artificial intelligence algorithm may also be the same device.
  • the artificial intelligence algorithm is trained and used on the first device.
  • the determination of the target model type is related to the target distribution characteristics.
  • the target distribution feature extracted from the sample data has multiple of the following characteristics: a large number of columns (for example, greater than the first threshold), a number of columns The correlation between them is high (for example, greater than the second threshold)), the number of different values in each column is relatively even or moderate, the number of historical query instructions is small (for example, less than the third threshold), and the column combination coverage is relatively high Small (e.g., less than the fourth threshold).
  • the target distribution feature extracted from the sample data has multiple of the following characteristics: the number of columns is small (for example, less than the fifth threshold), historical query The number of instructions is large (for example, greater than the sixth threshold), the column combination coverage is high (for example, greater than the seventh threshold), and the column query range coverage corresponding to the historical query instruction is high (for example, greater than the eighth threshold).
  • the model creation module of the coordination node uses the operator corresponding to the target model type provided by the query executor to create an untrained model corresponding to the target model type.
  • the model creation module determines the model hyperparameters to select model hyperparameters with fast convergence speed and small error as much as possible.
  • Step B12 is optional.
  • the model hyperparameters may also be set by default by users or software, which are not specifically limited here.
  • model hyperparameters can be used to define the model structure or optimization strategy.
  • the model hyperparameters include at least one of the step size of gradient descent, the number of layers of the neural network, the learning rate, the number of iterations, the batch size, the regularization coefficient, etc.
  • the method for determining model hyperparameters can be a greedy algorithm, a grid search algorithm, an exhaustive search algorithm, or other algorithms, which are not specifically limited here.
  • the model creation module uses the sample data in S101 above to train the untrained model according to the model hyperparameters, and obtain the target model corresponding to the target model type.
  • the data used by the untrained model corresponding to the target model type during training is related to the target model type.
  • the data used by an untrained model during training is part of the data in the sample data
  • the data used by another untrained model during training is all the data in the sample data.
  • the sample data includes multi-column data corresponding to the analysis instructions and historical query instruction data corresponding to the analysis instructions.
  • using part of the data in the sample data to train the above-mentioned untrained model includes the following two situations: (1) When the target model type is any of the above-mentioned Bayesian network, autoregressive and sum-product probability, the untrained model is trained based on part of the data in the sample data, where part of the data in the sample data is analyzed Multiple columns of data corresponding to the instruction; (2) When the target model type is any of the above-mentioned mixed uniform distribution and Gaussian kernel functions, train the untrained model based on part of the data in the sample data, where, in the sample data Part of the data is the historical query command data corresponding to the analysis command.
  • Using all the data in the sample data to train the untrained model can be: when the target model type is a neural network, training the untrained model based on all the data in the sample data, and all the data in the sample data are analysis instructions The corresponding multi-column data and the historical query instruction data corresponding to the analysis instructions are all the data in the sample data.
  • the above-mentioned end condition when training an untrained model is related to the target model type.
  • the target model type is the above-mentioned neural network or sum-product probability
  • the end condition of training is that the output error of the model gradually stabilizes and reaches a convergence state.
  • the target model type is any one of the above-mentioned mixed uniform distribution model, Bayesian network model, and Gaussian kernel function
  • the end condition of training is to complete the input analysis of the training data.
  • model meta-information can also be obtained.
  • the model meta-information includes the identification of the target model, the analysis index The identifier of the column in the command (that is, the identifier of the column used to build the target model) and the identifier of the data table in the analysis instruction (that is, the identifier of the data table where the column used to build the target model is located).
  • the model metainformation also records The mapping relationship between the identifier of the target model, the identifier of the column in the analysis instruction, and the identifier of the data table in the analysis instruction.
  • the training of the untrained model corresponding to the target model type and the use of the target model may occur on the same device, for example, the first device.
  • the first device may store model metainformation, for example, store the model metainformation in memory and/or disk.
  • the first device stores the model metainformation into a system statistics table (or statistics table) of the database.
  • the model parameters of the target model may be stored in the memory and/or disk of the first device.
  • the first device stores the model parameters of the target model into a system model table (or system table) of the database, where the system model table is located on a disk. Before the first device stores the model parameters of the target model in the system model table, it needs to serialize the model parameters of the target model.
  • the first device may add or update the model parameters of the target model to the system model table.
  • the system model table may also store parameters of an old model, where the old model is a model parameter of a target model obtained by the first device according to historical analysis instructions. It can be seen from this that in the embodiment of the present application, the model parameters of the target model can be managed, updated and maintained by the database.
  • different analysis instructions correspond to different target models.
  • Different target models mean that different analysis instructions correspond to different types of target models, or that different analysis instructions correspond to the same type of target model but the model parameters of the target models corresponding to different analysis instructions are different.
  • Different analysis instructions means: the number of columns in the two analysis instructions is different, or the identifiers of the columns in the two analysis instructions are different, or the number of columns in the two analysis instructions are different and the identifiers of the columns are also different.
  • the coordination node is based on the received analysis instruction 1, and the cardinality estimation model obtained after the data sampling, model selection and model creation shown in Figure 2 is the target model 1; the coordination node is based on the received analysis instruction 2,
  • the cardinality estimation model obtained after the data sampling, model selection and model creation shown in Figure 2 is target model 2.
  • analysis instruction 1 is different from analysis instruction 2.
  • the type of target model 1 may be different from the type of target model 2. , it may also happen that the type 1 of target model 1 is the same as the type of target model 2 (for example, both are neural networks) but the model parameters of target model 1 are different from the model parameters of target model 2 (for example, the weights and bias values of the neural network wait).
  • the first device may receive the same analysis instruction at different times, and the target model corresponding to the same analysis instruction received at different times may also be different, wherein the analysis instruction may be issued by different clients at different times. It can be sent at any time, or it can be sent by the same client at different times, which is not specifically limited here.
  • the first device receives the analysis instruction 1 at time 1, and performs a series of operations such as sample data extraction, physical sign extraction, model type selection, model creation, and training based on the analysis instruction 1 to obtain the target model 1.
  • Time 2 receives the analysis command 1 again.
  • the target model 2 is obtained through a series of operations such as sample data extraction, physical sign extraction, model type selection, model creation, and training based on the analysis command 1.
  • the target model 2 may be the same as the target model 1. Different, specifically, the type of target model 2 is different from the type of target model 1, or the target model 2 and target model 1 are of the same type but the parameters of target model 2 are different from the parameters of target model 1.
  • the analysis instructions are received at different times. Between time 1 and time 2, the user's query instructions may be newly added to the historical query records and/or the data in each column of the data table may have changed. Changes, resulting in samples extracted based on analysis instructions at different times The data may be different, and the extracted distribution characteristics may also be different, so the final target model obtained may also be different.
  • the first device can store the newly obtained relevant information (for example, model parameters, model metainformation) of the target model corresponding to the analysis instruction in the memory and/or disk. Update to replace the relevant information of the old target model corresponding to the analysis command.
  • relevant information for example, model parameters, model metainformation
  • the first device receives the query instruction sent by the client.
  • the query command can be a query for a single table or a query for multiple tables.
  • the query for a single table is to filter the records in the data table according to the WHERE condition to form an intermediate table, and then return the query results according to the columns selected by SELECT;
  • the query for multiple tables can be to first combine the first data table and the second data
  • the table is queried according to the connection between the two tables, and the query results are used to connect and query the third data table... until all data tables are connected to form an intermediate table, and then the records of the intermediate table are filtered according to the WHERE condition and returned according to the columns specified by SELECT search result.
  • the query instruction includes query conditions, and the query conditions can be obtained according to the WHERE in the query instruction.
  • the query conditions in the query instructions include predicates and column identification information.
  • the predicates in the query conditions include at least one of like, between, is NULL/is NOT NULL, in, exists, and comparison operators, where like is used for characters
  • Partial consistency query of string is also called fuzzy query, between is used for range query, is NULL/is NOT NULL is used to determine the null value situation, specifically, is NULL is used to select the data of the column whose value is NULL (empty) , is NOT NULL is used to select the data of the column whose value is NOT NULL (not empty), in is used to find the data whose attribute value belongs to the specified set, exists is used to determine whether there are records that meet certain conditions, and the comparison operator is used To compare sizes.
  • the first device may use a cardinality estimation model to estimate the cardinality of the query instruction before executing the query instruction.
  • the execution plan generation module in the query optimizer obtains multiple candidate execution plans based on the query conditions in the query instructions, and the cardinality intelligent estimation module uses the cardinality estimation model to calculate the cardinality of each candidate execution plan. Estimation, so that each candidate execution plan has a corresponding estimated cardinality.
  • the execution plan selection module can select the optimal execution plan from multiple candidate execution plans based on the cardinality estimation results of the multiple candidate execution plans, and output it to the query executor to Let the query executor execute the query instruction according to the optimal execution plan.
  • the first device before executing the cardinality estimation, needs to first determine the target model matching the execution plan based on the model meta-information obtained above, and then obtain the target model from the memory or disk. Model parameters of the target model, and finally use the model parameters of the target model to estimate the base of the execution plan.
  • the target model matched by the execution plan may be determined based on the maximum number of columns in the execution plan that are the same as columns in each target model in the model metainformation.
  • the model metainformation includes two mapping relationships, one of which is: Target Model 1-Analysis Instructions, the column identifiers are ⁇ c1, c2, c3, c4 ⁇ , and the other mapping relationship is: Target Model 2-Analysis
  • the identifiers of the columns in the instruction are ⁇ c2, c5, c6 ⁇ .
  • the identifiers of the columns involved in execution plan 1 are ⁇ c1, c2, c3 ⁇
  • the identifiers of the columns involved in execution plan 2 are ⁇ c2, c5, c6 ⁇ .
  • execution plan 1 For execution plan 1, the comparison between execution plan 1 and target model 1 results in the same column ⁇ c1, c2, c3 ⁇ , and the comparison between execution plan 1 and target model 2 results in the same column ⁇ c2 ⁇ , so execution plan 1 matches
  • the target model is target model 1; for execution plan 2, the execution plan 2 and the target model 1 are compared and the same column is ⁇ c2 ⁇ , and the execution plan 2 and the target
  • the comparison of model 2 shows that the same columns are ⁇ c2, c5, c6 ⁇ , so the target model matched by execution plan 2 is target model 2.
  • use target model 1 to perform cardinality estimation on execution plan 1
  • use target model 2 to perform cardinality estimation on execution plan 2.
  • the first device after determining the target model based on the model meta-information, the first device can obtain the identification of the target model. Furthermore, it needs to obtain the model parameters of the target model based on the identification of the target model.
  • the first device can directly read the model parameters of the target model from the memory based on the identification of the target model. In this way, frequent deserialization operations caused by reading model parameters from disk can be avoided, which is beneficial to improving the efficiency of cardinality estimation.
  • the first device when the first device fails to read the model parameters of the target model from the memory of the first device based on the identification of the target model, the first device reads from the disk, for example, from the disk system
  • the model parameters of the target model are read from the model table, and the read model parameters of the target model are deserialized to restore the model parameters of the target model.
  • the model parameters of the target model are the parameters of the cardinality estimation model.
  • the first device whenever the first device obtains the model parameters of a target model from the disk, the first device stores the model parameters of the target model into the memory so that the same target model can be read next time.
  • the model parameters can be obtained directly from the memory, thus avoiding the frequent deserialization operations caused by reading the model parameters from the disk, which is equivalent to realizing the pre-reading of the model parameters, effectively improving the reading efficiency of the target model parameters. .
  • the storage space in the memory is limited, and the number of target models with model parameters stored in the memory is also limited, and the number of target models with model parameters stored in the memory is smaller than the number of target models with model parameters stored in the disk.
  • the memory can store up to 60 model parameters of the target model. As time goes by, the memory is filled with the model parameters of 60 target models. However, during use, it is found that some model parameters of the target model need to be called. It is not in the memory and has to be read from the disk, and there are some model parameters of the target model in the memory that have not been accessed for a long time. In this case, the model parameters of the corresponding target model in the memory can be deleted to increase the available storage of the memory. Space, for example, the deleted target model satisfies: the interval between the time when the model parameters of the target model were last accessed and the current time is longer than the preset time.
  • the type of model to be constructed for cardinality estimation is determined through the extracted distribution characteristics, and the operator corresponding to the type set in the database is used to realize the construction of the model, which not only improves the model construction
  • the accuracy rate is also conducive to improving the efficiency of cardinality estimation, ensuring the security of the database, and also facilitating the management and maintenance of relevant information of the model used for cardinality estimation in the database.
  • a variety of features are extracted from the multi-column data corresponding to the analysis instructions and the historical query instruction data, which can more accurately select the appropriate model type for different scenarios, enrich the application scenarios of cardinality estimation, and also improve the accuracy of cardinality estimation. .
  • FIG 4 is a flow chart of yet another cardinality estimation method provided by an embodiment of the present application.
  • This method is applied to the first device, which may be the coordination node in the above-mentioned Figure 1.
  • the coordination node in the above-mentioned Figure 1.
  • the method includes but is not limited to the following steps:
  • S201 Select a target model type from multiple model types according to the target distribution characteristics and mapping relationship information.
  • the mapping relationship information includes mapping relationships between the above-mentioned multiple model types and the predicted distribution characteristics.
  • the target distribution characteristics are from Extracted from sample data collected based on analysis instructions.
  • the sample data is collected from historical query records and at least one data table in the database based on the analysis instructions.
  • the sample data includes multi-column data corresponding to the analysis instruction and historical query instruction data corresponding to the analysis instruction.
  • the multi-column data corresponding to the analysis instruction is collected from at least one data table in the database based on the columns in the analysis instruction
  • the historical query instruction data corresponding to the analysis instruction is collected from historical query records based on the columns in the analysis instruction. of.
  • the target distribution characteristics include a first characteristic and a second characteristic, wherein the first characteristic is used to indicate the distribution characteristics of multiple columns of data corresponding to the analysis instruction, and the second characteristic is used to indicate the distribution characteristics of the columns in the historical query instruction data corresponding to the analysis instruction. distribution characteristics. It can be understood that the first feature is extracted from the multi-column data corresponding to the analysis instruction, and the second feature is extracted from the historical query instruction data corresponding to the analysis instruction.
  • the first feature includes at least one of the following features: the number of different values in each column of the multiple columns of data; the correlation information between the columns of the multiple columns of data; and the The number of columns in multi-column data.
  • the content of the first feature please refer to the relevant description of the first feature in the embodiment S102 of FIG. 3 and will not be described again here.
  • the second feature includes at least one of the following features: column combination coverage information of the historical query instruction data; column query range coverage corresponding to each historical query instruction in the historical query instruction data. information; and column query similarity information between historical query instructions in the historical query instruction data.
  • column combination coverage information of the historical query instruction data column query range coverage corresponding to each historical query instruction in the historical query instruction data. information
  • column query similarity information between historical query instructions in the historical query instruction data column query similarity information between historical query instructions in the historical query instruction data.
  • the sample data collection process is specifically: sending sampling instruction information to at least one data node, and the sampling instruction information is used to instruct at least one data node.
  • mapping relationship information may be obtained based on training of an artificial intelligence algorithm.
  • the plurality of model types include at least two of the following types: Bayesian network, autoregressive, mixed uniform distribution, Gaussian kernel function, sum-product probability, and neural network.
  • S202 According to the target model type, obtain the target model corresponding to the target model type.
  • the target model is used to perform cardinality estimation on the query instructions of the database.
  • obtaining a target model corresponding to the target model type according to the target model type includes: obtaining a target model corresponding to the target model type according to the target model type and sample data.
  • obtaining the target model corresponding to the target model type includes: determining from the operators corresponding to the above-mentioned multiple model types.
  • the operator corresponding to the target model type use the operator corresponding to the target model type to create an untrained model corresponding to the target model type; train the untrained model based on the sample data to obtain the target model corresponding to the target model type.
  • the target model is the cardinality estimation model obtained in S102 in the embodiment of FIG. 3.
  • the target model is the cardinality estimation model obtained in S102 in the embodiment of FIG. 3.
  • model parameters of the target model are stored in memory and/or disk.
  • the target model can also be used to perform cardinality estimation on the received query instructions.
  • the target model can also be used to perform cardinality estimation on the received query instructions.
  • the type of model to be constructed for cardinality estimation is first determined through the extracted distribution characteristics, and the model for cardinality estimation is constructed based on the determined model type, which not only improves the quality of the constructed model. It also helps to improve the accuracy of cardinality estimation.
  • distribution features are extracted from multiple dimensions such as column data values, correlations, number of columns, and query ranges corresponding to historical query loads, which increases the diversity of distribution features and enables more accurate selection of appropriate models for different scenarios. type, enriching the application scenarios of cardinality estimation.
  • FIG. 5 is a schematic structural diagram of a cardinality estimation device provided by an embodiment of the present application.
  • the cardinality estimation device 30 includes a type selection unit 310 and a model establishment unit 312 .
  • the cardinality estimation device 30 can be implemented by hardware, software, or a combination of software and hardware.
  • the type selection unit 310 is used to select a target model type from multiple model types according to target distribution characteristics and mapping relationship information, where the mapping relationship information includes mappings between the multiple model types and predicted distribution characteristics. relationship, the target distribution characteristics are extracted from the sample data collected based on the analysis instructions; the model building unit 312 is used to obtain the target model corresponding to the target model type according to the target model type, and the target model is used to execute query instructions on the database Cardinality estimate.
  • the cardinality estimation device 30 further includes a data collection unit (not shown), and the data collection unit is used to collect sample data.
  • the data collection unit is specifically configured to: send sampling instruction information to at least one data node, and the sampling instruction information is used to instruct at least one data node according to the above
  • the connection relationship between multiple data tables collects the data of the columns in the analysis instructions from these multiple data tables; collects the historical query instructions related to the columns in the analysis instructions in the historical query records.
  • the cardinality estimation device 30 can be used to implement the method described in the embodiment of FIG. 4 .
  • the type selection unit 310 can be used to perform S201
  • the model establishing unit 312 can be used to perform S202.
  • the cardinality estimation device 30 can also be used to implement the method described in the embodiment of FIG. 3. For the sake of simplicity of the description, details will not be described again here.
  • the software or firmware includes, but is not limited to, computer program instructions or code, and may be executed by a hardware processor.
  • the hardware includes but is not limited to various types of integrated circuits, such as central processing unit (CPU), digital signal processor (DSP), field-programmable gate array (FPGA) Or application-specific integrated circuit (ASIC).
  • Figure 6 is a schematic structural diagram of yet another cardinality estimation device provided by an embodiment of the present application.
  • the cardinality estimation device 40 includes: a processor 401 , a communication interface 402 , a memory 403 and a bus 404 .
  • the processor 401, the memory 403 and the communication interface 402 communicate through a bus 404. It should be understood that this application does not limit the number of processors and memories in the cardinality estimation device 40.
  • the cardinality estimation device 40 may be the coordination node in the above-mentioned Figure 1, or the first device in the embodiment shown in Figure 3.
  • the bus 404 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 6, but it does not mean that there is only one bus or one type of bus.
  • Bus 404 may include a path for transmitting information between various components of cardinality estimation device 40 (eg, memory 403, processor 401, communication interface 402).
  • the processor 401 may include any one or more of a central processing unit (CPU), a microprocessor (MP), or a digital signal processor (DSP).
  • CPU central processing unit
  • MP microprocessor
  • DSP digital signal processor
  • the memory 403 is used to provide storage space, and data such as operating systems and computer programs can be stored in the storage space.
  • the memory 403 may be a random access memory (RAM), an erasable programmable read only memory (EPROM), a read-only memory (ROM), or a portable read-only memory.
  • RAM random access memory
  • EPROM erasable programmable read only memory
  • ROM read-only memory
  • portable read-only memory One or a combination of multiple types of memory (compact disc read memory, CD-ROM), etc.
  • the memory 403 may exist alone or be integrated inside the processor 401.
  • Communication interface 402 may be used to provide information input or output to processor 401. Or alternatively, the communication interface 402 can be used to receive data sent from the outside and/or send data to the outside. It can be a wired link interface such as an Ethernet cable, or a wireless link (such as Wi-Fi, Bluetooth, general wireless transmission, etc.) interface. Or alternatively, the communication interface 402 may also include a transmitter (such as a radio frequency transmitter, an antenna, etc.), or a receiver coupled to the interface.
  • a transmitter such as a radio frequency transmitter, an antenna, etc.
  • the processor 401 in the cardinality estimation device 40 is used to read the computer program stored in the memory 403 and to execute the aforementioned method, such as the method described in Figure 3 or Figure 4 .
  • the cardinality estimation device 40 may be one or more modules in the execution body of the method shown in FIG. 3 or FIG. 4 , and the processor 401 may be used to read one or more modules stored in the memory.
  • a computer program that performs the following operations:
  • the type selection unit 310 selects a target model type from multiple model types according to the target distribution characteristics and mapping relationship information.
  • the mapping relationship information includes mapping relationships between the multiple model types and the predicted distribution characteristics.
  • the target distribution Features are extracted from sample data collected based on analysis instructions;
  • the model building unit 312 obtains a target model corresponding to the target model type according to the target model type, and the target model is used to perform cardinality estimation on the query instructions of the database.
  • storage media include read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), programmable read-only memory (Programmable Read-only Memory, PROM), erasable programmable read-only memory ( Erasable Programmable Read Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electronically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage, tape storage, or any other computer-readable medium that can be used to carry or store data.
  • Read-Only Memory Read-Only Memory
  • RAM Random Access Memory
  • PROM Programmable Read-only Memory
  • PROM Programmable Read-only Memory
  • EPROM Erasable Programmable Read Only Memory
  • OTPROM One-time Programmable Read-Only Memory
  • EEPROM Electronically Erasable Programmable Read-Only Memory
  • CD-ROM Compact Disc Read-Only Memory
  • the technical solution of the present application is essentially or contributes to the existing technology, or all or part of the technical solution can be embodied in the form of a software product.
  • the computer program product is stored in a storage medium and includes a number of instructions. So that a device (which can be a personal computer, a server, or a network device, a robot, a microcontroller, a chip, a robot, etc.) executes all or part of the steps of the method described in various embodiments of the application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

提供了一种基数估计方法及装置,方法包括:根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,映射关系信息包括这多个模型类型与预知的分布特征之间的映射关系,目标分布特征是从基于分析指令采集到的样本数据中提取的(S201);根据目标模型类型,获得目标模型类型对应的目标模型,目标模型用于对数据库的查询指令执行基数估计(S202)。

Description

一种基数估计方法及装置
本申请要求于2022年03月31日提交中国知识产权局、申请号为202210334793.5、申请名称为“一种基数估计方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据库领域,尤其涉及一种基数估计方法及装置。
背景技术
基数估计技术广泛应用于数据库,是一种对查询指令或查询语句的结果行数进行执行前预估计的技术。基于对查询指令估计出的基数,能进行该查询指令的执行计划的代价估计以及最优执行计划的选择,有利于提高查询指令的执行效率,对提高数据库的整体性能具有重要作用。但当前的基数估计方法,存在基数估计的准确率低的问题。
发明内容
本申请公开了一种基数估计方法及装置,能够实现用于基数估计的模型的准确构建,有利于提高基数估计的效率和精准率。
第一方面,本申请提供了一种基数估计方法,该方法包括:根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,所述映射关系信息包括所述多个模型类型与预知的分布特征之间的映射关系,所述目标分布特征是从基于分析指令采集到的样本数据中提取的;根据所述目标模型类型,获得所述目标模型类型对应的目标模型,所述目标模型用于对数据库的查询指令执行基数估计。
该方法应用于第一设备,第一设备例如可以是部署在网络侧的服务器、或者为该服务器中的组件或者芯片。示例性地,第一设备可以部署在云环境中,或者第一设备也可以部署在边缘环境中。第一设备可以是集成的一个设备,也可以是分布式的多个设备,在此不作具体限定。
其中,分析指令用于收集与数据库中数据表内容相关的统计信息,例如,多列数据。分析指令可以是数据库的Analyze命令。数据库可以是结构化查询语言(Structured Query Language,SQL)数据库、MySQL数据库、Oracle数据库或其他数据库。
这里,预知的分布特征包括多条分析指令对应的分布特征,例如,第一分析指令对应的分布特征是从基于该第一分析指令采集到的第一样本数据中提取的。上述目标分布特征对应的分析指令可以包含于这多条分析指令中,也可以不包含于这多条分析指令中。
一种实现方式中,不同的分析指令对应的目标模型不同。其中,目标模型不同是指:目标模型的类型不同,或者,属于同一类型但是目标模型的模型参数不同。分析指令不同是指:两条分析指令中的列的数量不同,或者,两条分析指令中的列的标识不同,或者两条分析指令中的列的数量不同且列的标识也不同。
一种实现方式中,不同时刻接收到的同一分析指令对应的目标模型也可能不同,其中,同一分析指令可以是不同客户端在不同时刻发送的,也可以是同一客户端在不同时刻发送的。 尽管接收到同一分析指令,但由于发生在不同时刻,两次采集到的样本数据可能是不同的,例如,可能是历史查询记录发生变化和/或数据表发生变化导致的,进而提取出的分布特征也可能是不同的,故最后获得的目标模型也可能是不同的。可以理解,当同一分析指令对应的目标模型发生改变时,第一设备可以将该分析指令对应的最新的目标模型的相关信息(例如,模型参数等)在内存和/或磁盘中进行相应更新。
上述方法中,基于从样本数据中提取的分布特征从多个模型类型中选择适合该分布特征的目标模型类型,提供了适用多种应用场景下的多个模型类型,实现不同分布特征下模型类型的自适应选择。提高了用于基数估计的目标模型的准确率,有利于提高基数估计的准确率。
可选地,根据所述目标模型类型,获得所述目标模型类型对应的目标模型,包括:根据所述目标模型类型和所述样本数据,获得所述目标模型类型对应的目标模型。
实施上述实现方式,目标模型的获得不仅与目标模型类型有关,还与样本数据有关,实现了目标模型的精准构建。
可选地,数据库中设置有所述多个模型类型对应的算子,根据所述目标模型类型和所述样本数据,获得所述目标模型类型对应的目标模型,包括:从所述多个模型类型对应的算子中确定所述目标模型类型对应的算子;使用所述目标模型类型对应的算子,创建所述目标模型类型对应的未训练的模型;根据所述样本数据训练所述未训练的模型,获得所述目标模型类型对应的目标模型。
其中,上述多个模型类型中每个模型类型对应的算子均是数据库中的最小计算单元。
也就是说,在数据库中设置有每个模型类型有自身对应的算子。例如,当模型类型为贝叶斯网络时,该模型类型对应的算子为贝叶斯网络算子,该模型类型对应的模型为贝叶斯网络模型。例如,当模型类型为高斯核函数时,该模型类型对应的算子为高斯核函数算子,该模型类型对应的模型为高斯核函数模型。
示例性地,每个模型类型对应的算子也可以称作机器学习算子。所述多个模型类型中任一模型类型对应的算子具有下述特征中的至少一项:
所述模型类型对应的算子是数据库的执行计划中的原生节点;
所述模型类型对应的算子可以结合数据库内的优化器进行优化;和
所述模型类型对应的算子可以在数据库内直接访问存储和数据。
可以理解,所述目标模型类型对应的算子也具有上述三个特征中的至少一项。
这里,目标模型类型对应的未训练的模型在训练时所使用的数据与目标模型类型有关。
示例性地,根据样本数据中的部分数据训练目标模型类型对应的未训练的模型,其包括下述两种情况:(1)当目标模型类型为贝叶斯网络、自回归和和-积概率中任一类型时,根据样本数据中的部分数据训练该未训练的模型,其中,样本数据中的部分数据是指基于分析指令从数据库中的至少一个数据表中采集到的数据,即分析指令对应的多列数据;(2)当目标模型类型为混合均匀分布和高斯核函数中任一类型时,根据样本数据中的部分数据训练该未训练的模型,其中,样本数据中的部分数据是指基于分析指令从历史查询记录中采集到的数据,即分析指令对应的历史查询指令数据。
示例性地,根据样本数据中的全部数据训练目标模型类型对应的未训练的模型,在此情况下,目标模型类型可以是神经网络,样本数据中的全部数据是指根据分析指令从历史查询记录和数据库中的至少一个数据表中采样得到的数据。
一种实现方式中,目标模型类型对应的未训练的模型的训练和目标模型的使用可以发生在同一设备,例如,第一设备。
实施上述实现方式,基于样本数据和数据库中目标模型类型对应的算子获得用于基数估计的目标模型,提高了模型的构建效率。另外,根据不同类型的目标模型,使用样本数据中的不同数据实现对目标模型的针对性训练,有利于提高构建的模型的准确率。
可选地,所述样本数据是基于所述分析指令从历史查询记录和所述数据库中的至少一个数据表中采集到的。
历史查询记录包括用户输入的多条历史查询指令。其中,查询指令可以是对单表(或称为基表)的查询,也可以是对多表连接的查询。查询指令包括查询条件,例如,查询指令中的WHERE条件。查询条件包括谓词和列的标识信息,其中,查询条件中列的标识用于指示待查询的列,谓词包括like、between、is NULL/is NOT NULL、in、exists和比较运算符中的至少一种。
实施上述实现方式,样本数据的采集综合考虑了列数据和历史查询负载的采集,增加了样本数据的多样性。
可选地,样本数据包括分析指令对应的多列数据和分析指令对应的历史查询指令数据,目标分布特征包括第一特征和第二特征,其中,第一特征用于指示上述多列数据的分布特性,第二特征用于指示历史查询指令数据中列的分布特性。
可以理解,第一特征是基于分析指令对应的多列数据提取的,第二特征是基于分析指令对应的历史查询指令数据提取的。
实施上述实现方式,采集样本数据时,既采集了分析指令对应的多列数据,又采集了分析指令对应的历史查询指令数据,即综合考虑了列数据和历史查询负载的采集,增加了样本数据的多样性。从获得的样本数据中提取分布特征,也提高了分布特征的多样性。
可选地,第一特征包括下述特征中的至少一种:
多列数据中每列中不同数值的数量;
多列数据中各列之间的相关性信息;和
多列数据中列的数量。
其中,多列数据中各列之间的相关性信息包括下述中的至少一项:多列数据中各列之间的相关系数的平均值和多列数据中各列之间的相关系数的最大值。
通过提取多列数据中每列不同值的数量、列之间的相关性信息以及列的数量这些特征,实现了从多个维度表征这多列数据的分布特性。且在多列数据出现分布倾斜或者高相关性的场景下,这些特征也能精准地表征多列数据的分布特点。
可选地,第二特征包括下述特征中的至少一种:
历史查询指令数据的列组合覆盖度信息;
历史查询指令数据中每条历史查询指令对应的列查询范围覆盖度信息;和
历史查询指令数据中历史查询指令之间的列查询相似度信息。
其中,列组合覆盖度信息用于指示历史查询指令数据中已有的列组合数量与基于分析指令中的列获得的列组合总数的比值。例如,假设历史查询指令数据中有三条历史查询指令,这三条历史查询指令对应的列组合分别为:{c1,c2,c3}、{c1,c2}和{c2,c3},即历史查询指令数据中已有的列组合数量为3,假设分析指令为“Analyze(c1,c2,c3,c4)”,以分析指令中的列c1,c2,c3,c4中的至少2列进行组合,获得的列组合总数为11,故该历史查询指令中的列组合 覆盖度信息为
历史查询指令对应的列查询范围覆盖度信息用于指示该历史查询指令的列查询范围与该历史查询指令中的各列对应的列查询总范围的比值,各列对应的列查询总范围是根据各列中数值的最大值和最小值确定。例如,某历史查询指令中的查询条件为“WHERE 1<=c1<=3 and 1<=c2<=5 and 2<=c3<=10”,则该条历史查询指令的列查询范围为2*4*8=64,若这三列(即c1、c2和c3)中每列中数值的最小值为0以及最大值为100,则这三列中每列的查询范围为0-100,这三列对应的列查询总范围为100*100*100=106,通过计算获得该条历史查询指令对应的列查询范围覆盖度信息为
列查询相似度信息用于指示任意两条历史查询指令中的列查询范围的相似度。
实施上述实现方式,对历史查询指令数据进行特征提取时,考虑了历史查询负载的命中率,即列组合覆盖度,还考虑了范围查询,例如,列查询范围覆盖度、不同历史查询指令的列查询范围之间的相似度,实现了从多个维度对历史查询指令数据的分布特性进行表征。
可以看出,从样本数据中提取的分布特征不仅精细,而且精准,能够准确对列数据的分布特性和历史查询负载的分布特性进行表达。
可选地,在分析指令包括数据中的多个数据表的标识的情况下,样本数据的采集过程具体为:向至少一个数据节点发送采样指示信息,采样指示信息用于指示至少一个数据节点根据多个数据表之间的连接关系从多个数据表中采集分析指令中的列所在的数据;采集历史查询记录中与分析指令中的列相关的历史查询指令。
其中,多个数据表之间的连接关系可以通过各个数据表的主外键信息获得。数据表的主键为确定一条记录的唯一标识,数据表的外键用于关联另一个数据表。一个数据表有一个主键和至少一个外键,一个数据表的外键可以是另一个数据表的主键。
可以理解,样本数据中分析指令对应的多列数据为至少一个数据节点采集到后发送给第一设备的,样本数据中分析指令对应的历史查询指令数据为第一设备自身采集到的。
实施上述实现方式,依据数据表之间的连接关系依次采集各数据表中与分析指令中的列相关的数据,其避免了物化全部连接结果造成的内存爆炸,还提高了采样速率,实现了对多表连接的采样。
可选地,映射关系信息是基于人工智能算法的训练获得的。
示例性地,人工智能算法可以是决策树(Decision Tree)、随机森林(Random Forest,RF)、支持向量机(Support Vector Machine,SVM)或其他分类算法,在此不作具体限定。
一种实现方式中,人工智能算法的使用设备和人工智能算法的训练设备可以是不同的设备,例如,人工智能算法的使用设备为第一设备,人工智能算法的训练设备为独立于第一设备的另一设备。
当人工智能算法的训练和使用发生在不同设备时,也说明人工智能算法通过训练获得上述映射关系信息和使用该映射关系信息确定目标模型类型不是同时发生的。一般是先通过训练获得映射关系信息,再使用映射关系信息确定目标模型类型。如此,通过另外的训练设备承担训练获得映射关系信息的任务,可以减轻第一设备的负荷。
另一种实现方式中,人工智能算法的训练和使用也可以发生在同一设备,例如,人工智能算法的训练和使用均发生在第一设备。
可选地,目标模型的模型参数存储于内存和/或磁盘中。
其中,目标模型的模型参数存储至磁盘时需先执行序列化处理,以统一为可持久化存储的格式。因此,当目标模型的模型参数从磁盘中读取时,还需对读取到的模型参数进行反序列化处理以恢复出目标模型的模型参数的原有格式。
在一种实现方式中,第一设备在执行基数估计从磁盘中读取目标模型的模型参数后,第一设备还可以将该目标模型的模型参数存储至内存中,以便下次读取相同的目标模型的模型参数时可以直接从内存中获得,如此避免了模型参数从磁盘读取带来的频繁反序列化操作,相当于实现了模型参数的预读取,有效提高了目标模型参数的读取效率。
在另一种实现方式中,由于内存空间有限,内存中存储有模型参数的目标模型的数量也是有限的,可能出现当前内存已满但待调用的目标模型的模型参数也未存储于内存中,为了提高目标模型的模型参数的读取效率,在实际使用中,可以周期性地或者在内存存满的情况下,删除内存中的部分目标模型的模型参数,例如,被删除的目标模型满足:最近一次访问该目标模型的模型参数的时间与当前时间之间的间隔时长大于预设时长。如此,可使得后续待调用的目标模型的模型参数能尽可能从内存中直接获得,或者,待调用的目标模型的模型参数从磁盘中读取后还能存储至内存中。
可选地,多个模型类型包括下述类型中的至少两种:贝叶斯网络、自回归、混合均匀分布、高斯核函数、和-积概率和神经网络。
实施上述实现方式,提供了多个模型类型以供选择,丰富了基数估计的应用场景,有利于提高基数估计的效率和准确率。
第二方面,本申请提供了一种基数估计装置,该装置包括:类型选择单元,类型选择单元,用于根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,所述映射关系信息包括所述多个模型类型与预知的分布特征之间的映射关系,所述目标分布特征是从基于分析指令采集到的样本数据中提取的;模型建立单元,用于根据所述目标模型类型,获得所述目标模型类型对应的目标模型,所述目标模型用于对数据库的查询指令执行基数估计。
可选地,模型建立单元,具体用于:根据所述目标模型类型和所述样本数据,获得所述目标模型类型对应的目标模型。
可选地,模型建立单元,具体用于:从所述多个模型类型对应的算子中确定所述目标模型类型对应的算子;使用所述目标模型类型对应的算子,创建所述目标模型类型对应的未训练的模型;根据所述样本数据训练所述未训练的模型,获得所述目标模型类型对应的目标模型。
可选地,所述样本数据是基于所述分析指令从历史查询记录和所述数据库中的至少一个数据表中采集到的。
可选地,所述样本数据包括所述分析指令对应的多列数据和所述分析指令对应的历史查询指令数据,所述目标分布特征包括第一特征和第二特征,其中,所述第一特征用于指示所述多列数据的分布特性,所述第二特征用于指示所述历史查询指令数据中列的分布特性。
可选地,第一特征包括下述特征中的至少一种:
多列数据中每列中不同数值的数量;
多列数据中各列之间的相关性信息;和
多列数据中列的数量。
可选地,第二特征包括下述特征中的至少一种:
历史查询指令数据的列组合覆盖度信息;
历史查询指令数据中每条历史查询指令对应的列查询范围覆盖度信息;和
历史查询指令数据中历史查询指令之间的列查询相似度信息。
可选地,该装置还包括数据采集单元,在分析指令包括数据库中的多个数据表的标识的情况下,数据采集单元具体用于:向至少一个数据节点发送采样指示信息,采样指示信息用于指示至少一个数据节点根据多个数据表之间的连接关系从多个数据表中采集分析指令中的列所在的数据;采集历史查询记录中与分析指令中的列相关的历史查询指令。
可选地,映射关系信息是基于人工智能算法的训练获得的。
可选地,目标模型的模型参数存储于内存和/或磁盘中。
可选地,上述多个模型类型包括下述类型中的至少两种:贝叶斯网络、自回归、混合均匀分布、高斯核函数、和-积概率和神经网络。
第三方面,本申请提供了一种装置,该装置包括处理器和存储器,其中,存储器用于存储程序指令;所述处理器调用所述存储器中的程序指令,使得装置执行第一方面或者第一方面的任一可能的实现方式中的方法。
第四方面,本申请提供了一种计算机可读存储介质,包括计算机指令,当所述计算机指令在被处理器运行时,实现上述第一方面或者第一方面的任一可能的实现方式中的方法。
第五方面,本申请提供了一种计算机程序产品,当该计算机程序产品被处理器执行时,实现上述第一方面或者第一方面的任一可能的实施例中的所述方法。该计算机程序产品,例如可以为一个软件安装包,在需要使用上述第一方面的任一种可能的设计提供的方法的情况下,可以下载该计算机程序产品并在处理器上执行该计算机程序产品,以实现第一方面或者第一方面的任一可能的实施例中的所述方法。
上述第二方面至第五方面的技术效果与上述第一方面相同,在此不再赘述。
附图说明
图1是本申请实施例提供的一种系统架构的示意图;
图2是本申请实施例提供的一种协调节点的框图示意图;
图3是本申请实施例提供的一种基数估计方法的流程图;
图4是本申请实施例提供的又一种基数估计方法的流程图;
图5是本申请实施例提供的一种基数估计装置的结构示意图;
图6是本申请实施例提供的又一种基数估计装置的结构示意图。
具体实施方式
在本申请实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。本申请实施例中的说明书和权利要求书中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。
需要说明的是,本申请中采用诸如“第一”、“第二”的前缀词,仅仅为了区分不同的描述对象,对被描述对象的位置、顺序、优先级、数量或内容等没有任何限定作用。此外,不同前缀词修饰的对象可以相同或不同,例如,被描述对象为“设备”,则“第一设备”和“第二设备” 可以是同一个设备、相同类型的设备或者不同类型的设备;再如,被描述对象为“信息”,则“第一信息”和“第二信息”可以是相同内容的信息或者不同内容的信息。总之,本申请实施例中对用于区分描述对象的前缀词的使用不构成对所描述对象的限制,对所描述对象的陈述参见权利要求或实施例中上下文的描述,不应因为使用这种前缀词而构成多余的限制。
需要说明的是,本申请实施例中采用诸如“a1、a2、……和an中的至少一项(或至少一个)”等的描述方式,包括了a1、a2、……和an中任意一个单独存在的情况,也包括了a1、a2、……和an中任意多个的任意组合情况,每种情况可以单独存在。例如,“a、b和c中的至少一项”的描述方式,包括了单独a、单独b、单独c、a和b组合、a和c组合、b和c组合,或abc三者组合的情况。
为了提高数据库的查询性能,在执行客户端发送的查询指令前,常采用基数估计方法对查询指令进行基数估计,以从查询指令的多个执行计划中选择代价最小的执行计划,从而提高查询指令的执行效率以及数据库的查询性能。
一般地,基数估计的具体过程如下:数据库的协调节点在监测到客户端发送的分析指令后,创建统计信息分析线程(或简称为分析器),其中,分析指令包括列的标识,分析指令中的列均属于同一个数据表(即基表);分析器根据分析指令中的列对该基表进行采样,以获得分析指令中的列对应的列数据,并对列数据进行分析获得简单统计信息,例如,当对列数据执行高频值统计操作时,简单统计信息包括列数据中的高频值,或者,当对列数据进行相关性分析时,简单统计信息包括列数据中各列之间的相关系数。然后,协调节点接收客户端发送的查询指令,根据查询指令生成多个执行计划,分别对多个执行计划中的每个执行计划进行基数估计,以执行计划1为例,确定执行计划1匹配的分析指令对应的简单统计信息,根据执行计划1匹配的简单统计信息对执行计划1进行基数估计。多个执行计划中各个执行计划的基数估计结果可以用于从这多个执行计划中选择一个最优的执行计划,从而协调节点可以基于这最优的执行计划执行该查询指令。可以看出,用于基数估计的简单统计信息较为粗粒度、单一,应用的场景也有限,例如只能实现对基表的多列基数估计,且估计的基数的准确率也低。
针对上述问题,本申请实施例提出一种基数估计方法,不仅能提高基数估计的准确率,还有利于提高基数的估计效率。
下面将结合附图,对本申请中的技术方案进行描述。
参见图1,图1示例性地给出了一种系统架构的示意图。该系统用于基数的估计。如图1所示,该系统至少包括客户端、协调节点和数据节点,其中,客户端与协调节点以有线或无线的方式进行连接,协调节点与数据节点以有线的方式进行连接。
客户端用于向协调节点发送分析指令,分析指令包括多列的标识,例如,分析指令为Analyze命令。数据节点作为用户数据的物理存储节点,数据节点包括至少一个节点(例如,图1中的数据节点1和数据节点2)。数据节点中存储有数据库中的数据,例如,数据表。数据节点可以根据协调节点的指令管理和操作自身存储的数据,例如,数据节点可以根据从协调节点接收到的采样指示信息,对自身存储的至少一个数据表中采集分析指令中的列所在的数据。协调节点用于响应于分析指令,执行模型类型的选择、模型创建、训练和存储,协调节点还用于使用基数估计模型执行基数估计。
协调节点例如可以是部署在网络侧的服务器、或者为该服务器中的组件或者芯片。示例性地,协调节点可以部署在云环境,即云计算服务器,或者协调节点也可以部署在边缘环境中,即边缘计算服务器。协调节点可以是集成的一个设备,也可以是分布式的多个设备,本申请实施例不做具体限定。
一种实现方式中,协调节点主要用于:从客户端接收分析指令;根据分析指令中的列进行采样,获得样本数据;在数据库内,根据样本数据和从样本数据中提取的目标分布特征,获得基数估计模型。在一些可能的实施例中,协调节点还可以接收客户端发送的查询指令,使用基数估计模型对该查询指令执行基数估计,获得的基数用于指示估计的查询指令执行后的结果行数。
上述系统中,考虑到数据传输的效率、安全稳定等要求,客户端与协调节点之间,协调节点与数据节点之间、数据节点与数据节点之间的通信通常使用局域网或者专线有线网络。
需要说明的是,图1仅为示例性架构图,但不限定图1所示系统包括的网元的数量。虽然图1未示出,但除图1所示的功能实体外,图1还可以包括其他功能实体。另外,本申请实施例提供的方法可以应用于图1所示的通信系统,当然本申请实施例提供的方法也可以适用其他通信系统,本申请实施例对此不予限制。
参见图2,图2是本申请实施例提供的一种协调节点的框图示意图,在图2中,协调节点内包括端口监听守护进程、分析器、查询优化器和查询执行器。其中,端口监听守护进程用于监听分析指令(或称为统计信息创建指令),分析指令可以是图1中的客户端发送的,分析器用于响应于端口监听守护进程监听到的分析指令,执行样本数据的采样以及基数估计模型的类型选择、模型创建、训练和存储,查询优化器可以使用获得的基数估计模型执行基数估计,查询执行器可以为分析器提供数据库内的算子,该算子可以用于基数估计模型的创建。其中,数据库可以是结构化查询语言(Structured Query Language,SQL)数据库、MySQL数据库、Oracle数据库或其他数据库。
示例性地,分析器具体包括数据采样模块、模型选择模块和模型创建模块,其中,数据采样模块用于根据分析指令从历史查询记录和数据库中的至少一个数据表中获得样本数据,模型选择模块用于从样本数据中提取目标分布特征;根据提取的目标分布特征,从多个模型类型中选择目标模型类型,模型创建模块用于根据目标模型类型,获得目标模型类型对应的目标模型。需要说明的是,该目标模型即为用于基数估计的模型,或称为基数估计模型。
其中,分析器在获得基数估计模型的过程中用到了上述查询执行器提供的数据库内目标模型类型对应的算子,该算子是数据库中的最小计算单元。在本申请实施例中,该算子是新设置在数据库中的智能算子。该算子与目标模型类型有关,示例性地,该算子包括但不限于贝叶斯网络算子、自回归模型算子、混合均匀分布模型算子、高斯核函数模型算子、神经网络模型算子、和-积概率模型算子等。
示例性地,目标模型类型对应的算子具有下述特征中的至少一项:
目标模型类型对应的算子是数据库的执行计划中的原生节点;
目标模型类型对应的算子可以结合数据库内的优化器进行优化;和
目标模型类型对应的算子可以在数据库内直接访问存储和数据。
查询优化器包括基数智能估计模块,基数智能估计模块能够使用分析器中模型创建模块 输出的基数估计模型对查询指令中的查询条件进行基数估计,获得估计的基数。在一些可能的实施例中,查询优化器还包括执行计划生成模块和执行计划选择模块,其中,执行计划生成模块用于根据查询指令生成多个候选的执行计划,其中,每个执行计划会经过基数智能估计模块的基数估计,执行计划选择模块用于根据每个执行计划的基数估计结果从多个候选的执行计划中选择最优的执行计划,并将该最优的执行计划输出给查询执行器,以使查询执行器根据最优的执行计划执行该查询指令。
需要说明的是,图2只是本申请实施例提供的一种协调节点的框图的示例,并不限定协调节点仅为图2所示形式。在一些可能的实施例中,协调节点也可以包括比图2所示更多或者更少的模块,在此不作具体限定。
参见图3,图3是本申请实施例提供的一种基数估计方法的流程图。该方法应用于第一设备,第一设备可以是上述图1中的协调节点,协调节点的具体构成可参考图2的相关描述,在此不再赘述。该方法包括但不限于以下步骤:
S101:根据分析指令中的列进行采样,获得样本数据。
在本申请实施例中,第一设备从客户端接收分析指令,分析指令包括待采集的多列的标识。列的标识用于指示数据表中的列。客户端可以是图1中的客户端。
其中,分析指令可以是Analyze命令,Analyze命令可以用于收集与数据库中数据表内容相关的统计信息,例如,多列数据。
在一些可能的实施例中,分析指令还包括数据表的标识。根据分析指令中数据表的标识的数量,可以确定分析指令中待采集的多列是否包含于同一个数据表。当分析指令包括一个数据表的标识时,说明分析指令中的多列包含于同一个数据表。当分析指令包括多个数据表的标识时,说明分析指令中的多列分别包含于不同的数据表中,在此情况下,分析指令还指示了列的标识与数据表的标识之间的归属关系。
示例性地,假设分析指令1为Analyze(tabel_1(c1,c2,c3);tabel_2(c4,c5)),可以看出,该分析指令中的多列包括{c1,c2,c3,c4,c5},该分析指令还包括两个数据表的标识,分别为表1tabel_1和表2tabel_2,其中,列c1、c2和c3均来自于表1,列c4和c5均来自于表2。
在本申请实施例中,样本数据包括分析指令对应的多列数据和分析指令对应的历史查询指令数据,其中,分析指令对应的多列数据包括分析指令中的各个列所在的数据,分析指令对应的历史查询指令数据包括与分析指令中的列相关的历史查询指令。
以分析指令2Analyze(tabel_1(c1,c2);tabel_2(c3,c4))为例,分析指令对应的多列数据即为列c1、c2、c3、c4和c5这五列数据;对于上述采集到的历史查询指令满足:历史查询指令中的列均包含于该分析指令中的列,在此情况下,根据分析指令1中的列采集到的历史查询指令中,列可能的组合方式包括下述情况中的至少一项:{c1,c2}、{c1,c3}、{c1,c4}、{c2,c3}、{c2,c4}、{c3,c4}、{c1,c2,c3}、{c1,c2,c4}、{c2,c3,c4}、{c1,c3,c4}和{c1,c2,c3,c4}。
以历史查询指令1为例具体说明历史查询指令1中列的组合方式,假设历史查询指令1为“SELECT*FROM a left join b ON a.c1=b.c1WHERE a.c2=”1”and b.c3>40”,其中,a和b为需要连接的两个数据表的标识,采取的连接方式为左连接left join,“ON a.c1=b.c1”表示数据表a和数据表b的连接条件,“WHERE a.c2=”1”and b.c3>40”表示查询条件,具体为对数据表a和数据表b连接后的中间表的查询条件。基于该条历史查询指令中的查询条件,可知 该条历史查询指令中列组合的方式为{c2,c3}。
一种实现方式中,当分析指令仅包括数据库中的一个数据表的标识时,根据分析指令中的列进行采样,包括:向至少一个数据节点发送第一采样指示信息,第一采样指示信息用于指示至少一个数据节点从分析指令中的列所在的数据表中采集该列数据;采集历史查询记录中与分析指令中的列相关的历史查询指令。
另一种实现方式中,当分析指令包括数据库中的多个数据表的标识时,根据分析指令中的列进行采样,包括:向至少一个数据节点发送第二采样指示信息,第二采样指示信息用于指示至少一个数据节点根据多个数据表之间的连接关系从多个数据表中采集该分析指令中的列所在的数据;采集历史查询记录中与该分析指令中的列相关的历史查询指令。
其中,多个数据表之间的连接关系可以通过各个数据表的主外键信息获得。在关系型的数据库中,数据表的主键为确定一条记录的唯一标识,数据表的外键用于关联另一个数据表。一个数据表有一个主键和至少一个外键,一个数据表的外键可以是另一个数据表的主键。例如,一条记录包括“学号-姓名-年龄”,学号能唯一确定一个学生,故学号是主键,姓名和/或年龄可以作为外键。
示例性地,当分析指令中的列包括数据库中的多个数据表的标识时,至少一个数据节点接收到第二采样指示信息后,在各个数据表的主外键信息定义完整的情况下,每个数据节点可以根据各个数据表的主外键信息获得多个数据表之间的连接关系,并根据多个数据表之间的连接关系对多个数据表进行逐步采样。例如从多个数据表中随机选择一个数据表作为采样起点,采集该数据表中相关的列数据;然后基于主外键信息查找该数据表连接的下个数据表并从中采集相关的列数据,依据此方式实现多个数据表的均匀采样,提高了数据表中的关联数据的采集效率。在至少一个数据节点完成采集后,至少一个数据节点将自身采集到的数据发送给第一设备。需要说明的是,至少一个数据节点采集到的数据即为上述中样本数据中分析指令对应的多列数据。
示例性地,结合上述图2具体说明样本数据的采样过程,具体可参见下述A11-A13:
A11:协调节点的端口监听守护进程监听到客户端发送的分析指令时,创建分析器。
A12:分析器中的数据采样模块判断分析指令是否包括多个数据表的标识,当数据采样模块确定分析指令仅包括一个数据表的标识时,向图1中的至少一个数据节点发送第一采样指示信息以采集该数据表中分析指令中的列所在的数据;在数据采样模块确定分析指令包括多个数据表的标识时,向图1中的至少一个数据节点发送第二采样指示信息以采集多个数据表中分析指令中的列所在的数据。有关第一采样指示信息和第二采样指示信息具体可参考上述相关叙述,在此不再赘述。
需要说明的是,至少一个数据节点在完成数据的采集后,至少一个数据节点还将采集到的数据返回给该协调节点。
A13:分析器中的数据采样模块还采集历史查询记录中与分析指令中的列相关的历史查询指令。
S102:在数据库内,根据样本数据和从样本数据中提取的目标分布特征,获得基数估计模型。
在本申请实施例中,样本数据包括分析指令对应的多列数据和分析指令对应的历史查询指令数据。其中,分析指令对应的多列数据是上述至少一个数据节点采集到后发送给第一设 备,分析指令对应的历史查询指令数据是第一设备自身采集到的。
基于分析指令中的列获得样本数据后,需先从样本数据中提取目标分布特征。由于样本数据包括分析指令对应的多列数据和分析指令对应的历史查询指令数据这两种数据,相应地,目标分布特征包括第一特征和第二特征,其中,第一特征是根据分析指令对应的多列数据提取出的,第一特征用于指示分析指令对应的多列数据的分布特性;第二特征是根据分析指令对应的历史查询指令数据提取出的,第二特征用于指示历史查询指令数据中列的分布特性。
下面分别介绍第一特征和第二特征的具体内容:
第一特征包括下述特征中的至少一种:
(1)多列数据中每列中不同数值的数量;
(2)多列数据中各列之间的相关性信息;和
(3)多列数据中列的数量。
多列数据中各列之间的相关性信息包括多列数据中各列之间的相关系数的平均值和多列数据中各列之间的相关系数的最大值中的至少一项。通过提取多列数据中每列不同值的数量、列之间的相关性信息以及列的数量,即使在这多列数据出现分布倾斜或者高相关性的场景下,也能准确表征多列数据的分布特性,有利于精准地选择适合的基数估计模型,还能提高估计的基数的准确率。
第二特征包括下述特征中的至少一种:
(1)历史查询指令数据中的列组合覆盖度信息;
(2)历史查询指令数据中每条历史查询指令对应的列查询范围覆盖度信息;和
(3)历史查询指令数据中历史查询指令之间的列查询相似度信息。
其中,列组合覆盖度信息用于指示历史查询指令数据中已有的列组合数量与根据分析指令中的列获得的列组合总数的比值。
例如,假设历史查询指令数据包括“select*from table where c1=1 and c2=2 and c3=3”、“select*from table where c1=1 and c2=2”和“select*from table where c2=2 and c3=3”,根据历史查询指令数据中各历史查询指令的查询条件,可知历史查询指令数据中已有的列组合数量为3,这三种列组合分别{c1,c2,c3}、{c1,c2}和{c2,c3}。假设分析指令例如为“Analyze(c1,c2,c3,c4)”,则其可能的列组合共有11种,分别为{c1,c2}、{c1,c3}、{c1,c4}、{c2,c3}、{c2,c4}、{c3,c4}、{c1,c2,c3}、{c1,c2,c4}、{c2,c3,c4}、{c1,c3,c4}和{c1,c2,c3,c4},通过计算获得该历史查询指令数据中的列组合覆盖度信息为
历史查询指令对应的列查询范围覆盖度信息用于指示该历史查询指令的列查询范围与该历史查询指令中的各列对应的列查询总范围的比值,各列对应的列查询总范围是根据各列中数值的最大值和最小值确定。
例如,假设历史查询指令数据中的某条历史查询指令中的查询条件为“WHERE 1<=c1<=3and 1<=c2<=5and 2<=c3<=10”,则该条历史查询指令的列查询范围为2*4*8=64,若这三列(即c1、c2和c3)中每列中数值的最小值为0以及最大值为100,则这三列中每列的查询范围为0-100,这三列对应的列查询总范围为100*100*100=106,通过计算获得该条历史查询指令对应的列查询范围覆盖度信息为
列查询相似度信息用于指示任意两条历史查询指令中的列查询范围的相似度。
例如,假设历史查询指令1中的查询条件为“WHERE 1<=c1<=3 and 1<=c2<=5 and  2<=c3<=10”,将该查询条件记作向量1[1,3,1,5,2,10];假设历史查询指令2中的查询条件为“WHERE 1<=c1<=6 and 2<=c2<=7”,将该查询条件记作向量2[1,6,2,7,0,0],则计算向量1与向量2之间的欧式距离为该欧式距离可以用于表征历史查询指令1和历史查询指令2中的列查询范围的相似度,故将该欧式距离作为这两条历史查询指令之间的列查询相似度信息。
可以看出,对历史查询指令数据进行特征提取时,还考虑到了范围查询,丰富了基数估计的应用场景,有利于提高后续估计的基数的精准率。
在本申请实施例中,根据样本数据和从样本数据中提取出的目标分布特征,获得基数估计模型,包括:根据从样本数据中提取出的目标分布特征,从多个模型类型中选择目标模型类型;根据目标模型类型和样本数据,获得目标模型类型对应的目标模型。
其中,多个模型类型包括下述类型中的至少两种:贝叶斯网络(Bayesian network)、自回归(Autoregressive)、混合均匀分布(Mixed Uniform Distribution)、高斯核函数(Radial Basis Function,RBF)、和-积概率(Sum-Product probability)和神经网络(Neural Networks,ANNs)。需要说明的是,本申请实施例并不限定多个模型类型仅为上述的示例,模型类型还可以是其他类型,在此不作具体限定。
进一步地,数据库中设置有上述多个模型类型对应的算子,根据目标模型类型和样本数据,获得目标模型类型对应的目标模型,包括:从多个模型类型对应的算子中确定目标模型类型对应的算子;使用目标模型类型对应的算子,创建目标模型类型对应的未训练的模型;根据样本数据训练该未训练的模型,获得标模型类型对应的目标模型。
其中,多个模型类型对应的算子中每个模型类型对应的算子均为数据库中的最小计算单元。示例性地,上述多个模型类型对应的算子包括但不限于贝叶斯网络算子、自回归模型算子、混合均匀分布模型算子、高斯核函数模型算子、神经网络模型算子、和-积概率模型算子等。
示例性地,在目标模型类型为贝叶斯网络时,目标模型类型对应的未训练的模型是未训练的贝叶斯网络模型,目标模型类型对应的目标模型为贝叶斯网络模型,即训练好的贝叶斯网络模型,在此情况下,贝叶斯网络模型即为上述基数估计模型。
也就是说,根据样本数据中提取出的目标分布特征可以确定基数估计模型的模型类型。
示例性地,基数估计模型的模型类型可以通过类型指示信息来指示。类型指示信息可以采用比特映射、二进制取值或其他方式来指示基数估计模型的模型类型。
参见表1,表1示例性地提供了一种类型指示信息的取值与基数估计模型的模型类型之间的映射表。由表1可知,当类型指示信息的取值为第一值时,表示基数估计模型的模型类型为贝叶斯网络;当类型指示信息的取值为第二值时,表示基数估计模型的模型类型为自回归;当类型指示信息的取值为第三值时,表示基数估计模型的模型类型为混合均匀分布,…,在此不作具体限定。
表1

可以理解,上述表1仅作为一个示例,以体现类型指示信息的取值与基数估计模型的模型类型之间的对应关系,在实际应用中,该对应关系的文字内容和存储方式还可以是其他形式,在此不作具体限定。
一具体实施中,根据从样本数据中提取出的目标分布特征,从多个模型类型中选择目标模型类型,包括:根据该目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,其中,映射关系信息包括上述多个模型类型与预知的分布特征之间的映射关系。
其中,映射关系信息可以是基于人工智能算法的训练获得的。示例性地,人工智能算法可以是决策树(Decision Tree)、随机森林(Random Forest,RF)、支持向量机(Support Vector Machine,SVM)或其他分类算法,在此不作具体限定。
示例性地,预知的分布特征包括多条分析指令对应的分布特征,例如,第一分析指令对应的分布特征是从基于该第一分析指令采集到的第一样本数据中提取的。上述目标分布特征对应的分析指令可以包含于这多条分析指令中,也可以不包含于这多条分析指令中。
一种实现方式中,人工智能算法的训练样本可以是基于专家经验获得的。人工智能算法的训练过程例如可以是:假设基于专家经验获得已知的多组分布特征与多个模型类型之间的映射关系,例如,分布特征1与模型类型1对应,将分布特征1输入人工智能算法进行分类,获得输出结果,根据该输出结果与模型类型1这一类型之间的模拟误差调整人工智能算法的参数,如此经过大量的训练样本的训练,训练好的人工智能算法能够根据输入的分布特征,准确地输出该分布特征对应的模型类型。因此,在人工智能算法的应用过程中,将提取出的分布特征输入该人工智能算法时,人工智能算法可以根据学习到的预知的分布特征与预知的模型类型之间的映射关系,从多个模型类型中为该分布特征选择合适的目标模型类型,即人工智能算法可以输出上述类型指示信息。
另外,用于人工智能算法训练的训练样本中的分布特征的提取方法可以是:先基于专家经验获得多条分析指令,每条分析指令对应的分布特征为根据该分析指令从历史查询记录和数据库中的至少一个数据表中采样到的数据中提取的,同样,该分析指令对应的分布特征中的具体内容可参考上述第一特征和第二特征的具体内容。
在一种实现方式中,协调节点使用的用于选择目标模型类型的人工智能算法可以是预先训练好的。人工智能算法的训练也可以不是在协调节点内完成的,即人工智能算法的使用设备和人工智能算法的训练设备可以是不同的设备,例如,人工智能算法的使用设备为第一设备,训练设备为独立于第一设备的另一设备。
人工智能算法的训练和使用分别发生在不同的设备,也说明人工智能算法通过训练获得上述映射关系信息和使用映射关系信息确定目标模型类型不是同时发生的。一般是先通过训练获得映射关系信息,再使用映射关系信息确定目标模型类型,也就是说,通过训练获得映射关系信息是在过去的时空发生的,而使用映射关系信息确定目标模型类型是在现在的时空发生的,在此场景下,通过另外的训练设备承担训练获得映射关系信息的任务,可以减轻第一设备的负荷。
在另一种实现方式中,人工智能算法的使用设备和人工智能算法的训练设备也可以是同一设备,例如,人工智能算法的训练和使用均在第一设备。
在本申请实施例中,目标模型类型的确定与目标分布特征有关。
示例性地,当选择的模型类型为贝叶斯网络时,从样本数据中提取出的目标分布特征具有下述特点中的多项:列的数量较多(例如,大于第一阈值)、列之间的相关系较高(例如,大于第二阈值))、每列中不同数值的数量比较均匀或适中、历史查询指令的数量较少(例如,小于第三阈值)和列组合覆盖度较小(例如,小于第四阈值)。
示例性地,当选择的模型类型为高斯核函数时,从样本数据中提取出的目标分布特征具有下述特点中的多项:列的数量较少(例如,小于第五阈值)、历史查询指令的数量较多(例如,大于第六阈值)、列组合覆盖度较高(例如,大于第七阈值)和历史查询指令对应的列查询范围覆盖度较高(例如,大于第八阈值)。
示例性地,结合图2说明目标模型类型对应的目标模型的获得过程进行说明,具体可参见下述B11-B13:
B11:协调节点的模型选择模块确定目标模型类型后,协调节点的模型创建模块使用查询执行器提供的该目标模型类型对应的算子创建该目标模型类型对应的未训练的模型。
B12:模型创建模块确定模型超参数,以尽可能选择收敛速度快、误差小的模型超参数。
步骤B12是可选的。在一些可能的实施例中,模型超参数也可以是用户或软件默认设置的,在此不作具体限定。
其中,模型超参数可以用于定义模型结构或优化策略。模型超参数包括梯度下降的步长、神经网络的层数、学习速率、迭代次数、批尺寸、正则化的系数等中的至少一项。模型超参数的确定方法可以是贪心算法、网格搜索算法、穷尽搜索算法或者其他算法,在此不作具体限定。
B13:模型创建模块根据模型超参数,使用上述S101中的样本数据训练该未训练的模型,获得目标模型类型对应的目标模型。
其中,目标模型类型对应的未训练的模型在训练时所使用的数据与目标模型类型有关。例如,一种未训练的模型在训练时所使用的数据为样本数据中的部分数据,另一种未训练的模型在训练时所使用的数据为样本数据中的全部数据。
例如,由S101可知样本数据包括分析指令对应的多列数据和分析指令对应的历史查询指令数据,其中,使用样本数据中的部分数据训练上述未训练的模型包括下述两种情况:(1)在目标模型类型为上述贝叶斯网络、自回归和和-积概率中任一类型的情况下,根据样本数据中的部分数据训练该未训练的模型,其中,样本数据中的部分数据为分析指令对应的多列数据;(2)在目标模型类型为上述混合均匀分布和高斯核函数中任一类型的情况下,根据样本数据中的部分数据训练该未训练的模型,其中,样本数据中的部分数据为分析指令对应的历史查询指令数据。使用样本数据中的全部数据训练该未训练的模型可以是:在目标模型类型为神经网络的情况下,根据样本数据中的全部数据训练该未训练的模型,样本数据中的全部数据为分析指令对应的多列数据和分析指令对应的历史查询指令数据,即样本数据中的全部数据。
在一种实现方式中,上述未训练的模型训练时的结束条件与目标模型类型有关。示例性地,在目标模型类型为上述神经网络或和-积概率时,训练的结束条件为模型的输出误差逐渐稳定达到收敛状态。示例性地,在目标模型类型为上述混合均匀分布模型、贝叶斯网络模型和高斯核函数中任一类型时,训练的结束条件为完成对训练数据的输入分析即可。
在本申请实施例中,还可获得模型元信息,模型元信息包括目标模型的标识、该分析指 令中的列的标识(即构建该目标模型所使用的列的标识)和该分析指令中数据表的标识(即构建目标模型使用的列所在的数据表的标识),模型元信息还记录了目标模型的标识、分析指令中的列的标识和分析指令中数据表的标识之间的映射关系。
一种实现方式中,目标模型类型对应的未训练的模型的训练和目标模型的使用可以发生在同一设备,例如,第一设备。
在本申请实施例中,第一设备可以存储模型元信息,例如,将模型元信息存储于内存和/或磁盘中。在一些可能的实施例中,第一设备将模型元信息存储至数据库的系统统计表(或称为统计表)。
在本申请实施例中,目标模型的模型参数可以存储在第一设备的内存和/或磁盘里。
一种实现方式中,第一设备将目标模型的模型参数存储至数据库的系统模型表(或称为系统表),其中,系统模型表位于磁盘中。第一设备将目标模型的模型参数存储至系统模型表之前,需先将目标模型的模型参数进行序列化处理。
在一些可能的实施例中,第一设备在获得当前分析指令对应的目标模型后,可以将目标模型的模型参数添加或更新至系统模型表中。在一些可能的实施例中,系统模型表中也可能存储有旧模型的参数,其中,旧模型为第一设备为根据历史分析指令获得的目标模型的模型参数。由此可以看出,在本申请实施例中,目标模型的模型参数可以由数据库进行管理、更新和维护。
在本申请实施例中,不同的分析指令对应的目标模型不同。其中,目标模型不同是指:不同的分析指令对应的目标模型的类型不同,或者,不同的分析指令对应同一种类型的目标模型但不同的分析指令对应的目标模型的模型参数不同。分析指令不同是指:两条分析指令中的列的数量不同,或者,两条分析指令中的列的标识不同,或者两条分析指令中的列的数量不同且列的标识也不同。
示例性地,假设协调节点基于接收到的分析指令1,经过图2所示的数据采样、模型选择以及模型创建后获得的基数估计模型为目标模型1;协调节点基于接收到的分析指令2,经过图2所示的数据采样、模型选择以及模型创建后获得的基数估计模型为目标模型2,其中,分析指令1与分析指令2不同,可能出现目标模型1的类型与目标模型2的类型不同,也可能出现目标模型1的类型1与目标模型2的类型相同(例如均为神经网络)但目标模型1的模型参数与目标模型2的模型参数不同(例如,神经网络的权重、偏置值等)。
在一些可能的实施例中,第一设备可能在不同时刻接收到相同的分析指令,不同时刻接收到的同一分析指令对应的目标模型也可能不同,其中,该分析指令可以是不同客户端在不同时刻发送的,也可以是同一客户端在不同时刻发送的,在此不作具体限定。例如,假设第一设备在时刻1接收到分析指令1,基于分析指令1执行了上述样本数据提取、体征提取、模型类型选择、模型创建以及训练等一系列操作获得目标模型1,第一设备在时刻2再次接收到分析指令1,假设此时基于分析指令1经过样本数据提取、体征提取、模型类型选择、模型创建以及训练等一系列操作获得目标模型2,则目标模型2可能与目标模型1不同,具体表现为目标模型2的类型与目标模型1的类型不同,或者,目标模型2和目标模型1为同一类型但目标模型2的参数与目标模型1的参数不同。尽管是对同一分析指令进行分析,但分析指令是在不同时刻接收到的,时刻1至时刻2期间,历史查询记录可能新增加了用户的查询指令和/或数据表中各列的数据有发生更改,导致基于不同时刻的分析指令提取出的样本 数据可能是不同的,进而提取出的分布特征也可能是不同的,故最后获得的目标模型也可能是不同的。可以理解,当同一分析指令对应的目标模型发生改变时,第一设备可以将新获得的该分析指令对应的目标模型的相关信息(例如,模型参数、模型元信息)在内存和/或磁盘中进行更新,以替换该分析指令对应的旧的目标模型的相关信息。
可选地,在一些可能的实施例中,还可以执行:
S103:使用基数估计模型对查询指令中的查询条件进行基数估计,获得的基数用于指示估计的查询指令的执行结果的行数。
在本申请实施例中,第一设备接收客户端发送的查询指令。
查询指令可以是对单表的查询,也可以是对多表的查询。其中,对单表的查询是根据WHERE条件过滤数据表中的记录形成中间表,然后根据SELECT选择的列返回查询结果;对多表的查询可以是先将第一个数据表和第二个数据表按照两表连接做查询,使用查询结果与第三个数据表进行连接查询,…,直至所有数据表完成连接形成一个中间表,然后根据WHERE条件过滤中间表的记录以及根据SELECT指定的列返回查询结果。
查询指令包括查询条件,查询条件可以根据查询指令中的WHERE获得。查询指令中的查询条件包括谓词和列的标识信息,查询条件中的谓词包括like、between、is NULL/is NOT NULL、in、exists和比较运算符中的至少一种,其中,like用于字符串的部分一致性查询或称为模糊查询,between用于范围查询,is NULL/is NOT NULL用于判断空值情况,具体地,is NULL用于选出值为NULL(空)的列的数据,is NOT NULL用于选出值为NOT NULL(非空)的列的数据,in用于查找属性值属于指定集合的数据,exists用于判断是否存在满足某种条件的记录,比较运算符用于比较大小。
为了提高该查询指令的查询效率,第一设备可以在执行该查询指令前,使用基数估计模型对该查询指令进行基数估计。
一具体实施中,结合图2,查询优化器中的执行计划生成模块根据查询指令中的查询条件获得多个候选的执行计划,基数智能估计模块使用基数估计模型对每个候选的执行计划进行基数估计,从而每个候选的执行计划有对应的估计的基数。在一些可能的实施例中,执行计划选择模块可以根据这多个候选的执行计划的基数估计结果从多个候选的执行计划中选择最优的执行计划,并将其输出给查询执行器,以使查询执行器根据最优的执行计划执行该查询指令。
一具体实施中,以一个执行计划的基数估计为例,在执行基数估计前,第一设备需先根据上述获得的模型元信息确定该执行计划匹配的目标模型,再从内存或磁盘中获取该目标模型的模型参数,最后使用该目标模型的模型参数对该执行计划进行基数估计。
进一步地,执行计划匹配的目标模型可以根据该执行计划中的列与模型元信息中每个目标模型中的列相同的最大数量确定。
例如,假设模型元信息包括2条映射关系,其中一条映射关系为:目标模型1-分析指令中列的标识为{c1,c2,c3,c4},另一条映射关系为:目标模型2-分析指令中列的标识为{c2,c5,c6},假设执行计划1中涉及的列的标识有{c1,c2,c3},执行计划2涉及的列的标识有{c2,c5,c6},对于执行计划1,执行计划1与目标模型1比较得出相同的列为{c1,c2,c3},执行计划1与目标模型2比较得出相同的列为{c2},故执行计划1匹配的目标模型为目标模型1;对于执行计划2,执行计划2与目标模型1比较得出相同的列为{c2},执行计划2与目标 模型2比较得出相同的列为{c2,c5,c6},故执行计划2匹配的目标模型为目标模型2。综上,使用目标模型1对执行计划1进行基数估计,使用目标模型2对执行计划2进行基数估计。
在本申请实施例中,第一设备在基于模型元信息确定目标模型后,即可获得目标模型的标识,进一步地,需要根据目标模型的标识获得该目标模型的模型参数。
在一种实现方式中,第一设备可以基于目标模型的标识,从内存中直接读取该目标模型的模型参数。如此,可以避免模型参数从磁盘读取带来的频繁反序列化操作,有利于提高基数的估计效率。
在另一种实现方式中,第一设备基于目标模型的标识,从第一设备的内存中读取该目标模型的模型参数失败时,第一设备从磁盘中读取,例如,从磁盘的系统模型表中读取该目标模型的模型参数,并对读取到的目标模型的模型参数进行反序列化处理,以实现对目标模型的模型参数的恢复。可以理解,目标模型的模型参数即为基数估计模型的参数。
在一些可能的实施例中,每当第一设备从磁盘中获得一目标模型的模型参数后,第一设备会将该目标模型的模型参数存储至内存中,以便下次读取相同的目标模型的模型参数时可以直接从内存中获得,如此避免了模型参数从磁盘读取带来的频繁反序列化操作,相当于实现了模型参数的预读取,有效提高了目标模型参数的读取效率。
但内存中的存储空间是有限的,内存中存储有模型参数的目标模型的数量也是有限的,且内存中存储有模型参数的目标模型的数量小于磁盘中存储有模型参数的目标模型的数量。为了提高模型参数的读取效率,可以周期性地或者在内存存满的情况下,删除内存中的部分目标模型的模型参数,例如,删除内存中目标模型总数量的三分之一的目标模型的模型参数,如此,使得后续待调用的目标模型的模型参数能尽可能从内存中直接获得,或者,待调用的目标模型的模型参数从磁盘中读取后还能存储至内存中。
例如,假设内存最多可以存储60个目标模型的模型参数,随着时间的推移,内存中存储满了60个目标模型的模型参数,但在使用过程中发现,有些待调用的目标模型的模型参数不在内存中以至于得去磁盘中读取,且内存中存在一些目标模型的模型参数很久未被访问,在此情况下,可以删除内存中相应的目标模型的模型参数以增加内存的可使用存储空间,例如,被删除的目标模型满足:最近一次访问该目标模型的模型参数的时间与当前时间之间的间隔时长大于预设时长。
可以看到,实施本申请实施例,通过提取的分布特征确定待构建的用于基数估计的模型的类型,利用数据库内设置的该类型对应的算子实现模型的构建,不仅提高了构建的模型的准确率,还有利于提高基数估计的效率,保证了数据库的安全,也方便数据库对用于基数估计的模型的相关信息的管理和维护。另外,针对分析指令对应的多列数据以及历史查询指令数据提取了多种特征,能更精准地为不同场景选择适合的模型类型,丰富了基数估计的应用场景,还提高了基数估计的准确率。
参见图4,图4是本申请实施例提供的又一种基数估计方法的流程图。该方法应用于第一设备,第一设备可以是上述图1中的协调节点,协调节点的具体构成可参考图2的相关描述,在此不再赘述。该方法包括但不限于以下步骤:
S201:根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,其中,映射关系信息包括上述多个模型类型与预知的分布特征之间的映射关系,目标分布特征是从 基于分析指令采集到的样本数据中提取的。
一具体实施中,样本数据是基于分析指令从历史查询记录和数据库中的至少一个数据表中采集到的。
一具体实施中,样本数据包括分析指令对应的多列数据和分析指令对应的历史查询指令数据。例如,分析指令对应的多列数据是基于分析指令中的列从数据库中的至少一个数据表中采集到的,分析指令对应的历史查询指令数据基于分析指令中的列从历史查询记录中采集到的。
进一步地,目标分布特征包括第一特征和第二特征,其中,第一特征用于指示分析指令对应的多列数据的分布特性,第二特征用于分析指令对应的历史查询指令数据中列的分布特性。可以理解,第一特征是从分析指令对应的多列数据中提取的,第二特征是从分析指令对应的历史查询指令数据中提取的。
一具体实施中,第一特征包括下述特征中的至少一项:所述多列数据中每列中不同数值的数量;所述多列数据中各列之间的相关性信息;和所述多列数据中列的数量。有关第一特征中的内容具体可参考图3实施例S102中第一特征的相关叙述,在此不再赘述。
一具体实施中,第二特征包括下述特征中的至少一项:所述历史查询指令数据的列组合覆盖度信息;所述历史查询指令数据中每条历史查询指令对应的列查询范围覆盖度信息;和所述历史查询指令数据中历史查询指令之间的列查询相似度信息。有关第二特征中的内容具体可参考图3实施例S102中第二特征的相关叙述,在此不再赘述。
一具体实施中,在分析指令包括数据库中的多个数据表的标识的情况下,样本数据的采集过程具体为:向至少一个数据节点发送采样指示信息,采样指示信息用于指示至少一个数据节点根据多个数据表之间的连接关系从多个数据表中采集分析指令中的列所在的数据;采集历史查询记录中与分析指令中的列相关的历史查询指令。有关样本数据的采集过程具体可参考图3实施例中对样本数据的采集的相关叙述。
一具体实施中,上述映射关系信息可以是基于人工智能算法的训练获得的。
一具体实施中,上述多个模型类型包括下述类型中的至少两种:贝叶斯网络、自回归、混合均匀分布、高斯核函数、和-积概率和神经网络。
S202:根据目标模型类型,获得目标模型类型对应的目标模型,目标模型用于对数据库的查询指令执行基数估计。
一具体实施中,根据目标模型类型,获得目标模型类型对应的目标模型,包括:根据目标模型类型和样本数据,获得目标模型类型对应的目标模型。
一具体实施中,数据库中设置有上述多个模型类型对应的算子,根据目标模型类型和样本数据,获得目标模型类型对应的目标模型,包括:从上述多个模型类型对应的算子中确定目标模型类型对应的算子;使用目标模型类型对应的算子,创建目标模型类型对应的未训练的模型;根据样本数据训练该未训练的模型,获得目标模型类型对应的目标模型。需要说明的是,目标模型即为图3实施例S102中获得的基数估计模型,目标模型的获取过程具体可参考上述S102中相应内容的叙述,在此不再赘述。
一具体实施中,目标模型的模型参数存储于内存和/或磁盘中。
一具体实施中,还可以使用该目标模型对接收到的查询指令执行基数估计,其过程具体可参考图3实施例中S103的相关叙述,在此不再赘述。
可以看到,实施本申请实施例,通过提取的分布特征先确定待构建的用于基数估计的模型的类型,基于确定的模型的类型构建用于基数估计的模型,不仅提高了构建的模型的准确率,还利于提高基数估计的准确率。另外,分布特征从列数据的值、相关性、列的数量、历史查询负载对应的查询范围等多个维度进行提取,增加了分布特征的多样性,能更精准地为不同场景选择适合的模型类型,丰富了基数估计的应用场景。
参见图5,图5是本申请实施例提供的一种基数估计装置的结构示意图,基数估计装置30包括类型选择单元310和模型建立单元312。该基数估计装置30可以通过硬件、软件或者软硬件结合的方式来实现。
其中,类型选择单元310,用于根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,所述映射关系信息包括所述多个模型类型与预知的分布特征之间的映射关系,目标分布特征是从基于分析指令采集到的样本数据中提取的;模型建立单元312,用于根据目标模型类型,获得目标模型类型对应的目标模型,目标模型用于对数据库的查询指令执行基数估计。
在一些可能的实施例中,基数估计装置30还包括数据采集单元(图未示),数据采集单元用于采集样本数据。示例性地,在分析指令包括数据库中的多个数据表的标识的情况下,数据采集单元具体用于:向至少一个数据节点发送采样指示信息,采样指示信息用于指示至少一个数据节点根据上述多个数据表之间的连接关系从这多个数据表中采集分析指令中的列所在的数据;采集历史查询记录中与分析指令中的列相关的历史查询指令。
基数估计装置30可用于实现图4实施例所描述的方法。在图4实施例中,类型选择单元310可用于执行S201,模型建立单元312可用于执行S202。该基数估计装置30还可用于实现图3实施例所描述的方法,为了说明书的简洁,在此不再赘述。
以上图5所示实施例中的各个单元的一个或多个可以软件、硬件、固件或其结合实现。所述软件或固件包括但不限于计算机程序指令或代码,并可以被硬件处理器所执行。所述硬件包括但不限于各类集成电路,如中央处理单元(central processing unit,CPU)、数字信号处理器(digital signal processor,DSP)、现场可编程门阵列(field-programmable gate array,FPGA)或专用集成电路(application-specific integrated circuit,ASIC)。
参见图6,图6是本申请实施例提供的又一种基数估计装置的结构示意图。
在图6中,基数估计装置40包括:处理器401、通信接口402、存储器403和总线404。处理器401、存储器403和通信接口402之间通过总线404通信。应理解,本申请不限定基数估计装置40中的处理器、存储器的个数。
基数估计装置40可以是上述图1中的协调节点,或者图3所示实施例中的第一设备。
总线404可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。总线404可包括在基数估计装置40各个部件(例如,存储器403、处理器401、通信接口402)之间传送信息的通路。
处理器401可以包括中央处理器(central processing unit,CPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。
存储器403用于提供存储空间,存储空间中可以存储操作系统和计算机程序等数据。存储器403可以是随机存取存储器(random access memory,RAM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、只读存储器(read-only memory,ROM),或便携式只读存储器(compact disc read memory,CD-ROM)等中的一种或者多种的组合。存储器403可以单独存在,也可以集成于处理器401内部。
通信接口402可用于为处理器401提供信息输入或输出。或者可替换的,该通信接口402可用于接收外部发送的数据和/或向外部发送数据,可以为包括诸如以太网电缆等的有线链路接口,也可以是无线链路(如Wi-Fi、蓝牙、通用无线传输等)接口。或者可替换的,通信接口402还可以包括与接口耦合的发射器(如射频发射器、天线等),或者接收器等。
该基数估计装置40中的处理器401用于读取存储器403中存储的计算机程序,用于执行前述的方法,例如图3或图4所描述的方法。
在一种可能的设计方式中,基数估计装置40可为执行图3或图4所示方法的执行主体中的一个或多个模块,该处理器401可用于读取存储器中存储的一个或多个计算机程序,用于执行以下操作:
通过类型选择单元310根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,所述映射关系信息包括所述多个模型类型与预知的分布特征之间的映射关系,目标分布特征是从基于分析指令采集到的样本数据中提取的;
通过模型建立单元312根据目标模型类型,获得目标模型类型对应的目标模型,目标模型用于对数据库的查询指令执行基数估计。
在本文上述的实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详细描述的部分,可以参见其他实施例的相关描述。
需要说明的是,本领域普通技术人员可以看到上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质包括只读存储器(Read-Only Memory,ROM)、随机存储器(Random Access Memory,RAM)、可编程只读存储器(Programmable Read-only Memory,PROM)、可擦除可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)、一次可编程只读存储器(One-time Programmable Read-Only Memory,OTPROM)、电子抹除式可复写只读存储(Electrically-Erasable Programmable Read-Only Memory,EEPROM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)或其他光盘存储器、磁盘存储器、磁带存储器、或者能够用于携带或存储数据的计算机可读的任何其他介质。
本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机程序产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是个人计算机,服务器,或者网络设备、机器人、单片机、芯片、机器人等)执行本申请各个实施例所述方法的全部或部分步骤。

Claims (25)

  1. 一种基数估计方法,其特征在于,所述方法包括:
    根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,所述映射关系信息包括所述多个模型类型与预知的分布特征之间的映射关系,所述目标分布特征是从基于分析指令采集到的样本数据中提取的;
    根据所述目标模型类型,获得所述目标模型类型对应的目标模型,所述目标模型用于对数据库的查询指令执行基数估计。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述目标模型类型,获得所述目标模型类型对应的目标模型,包括:
    根据所述目标模型类型和所述样本数据,获得所述目标模型类型对应的目标模型。
  3. 根据权利要求2所述的方法,其特征在于,所述数据库中设置有所述多个模型类型对应的算子,所述根据所述目标模型类型和所述样本数据,获得所述目标模型类型对应的目标模型,包括:
    从所述多个模型类型对应的算子中确定所述目标模型类型对应的算子;
    使用所述目标模型类型对应的算子,创建所述目标模型类型对应的未训练的模型;
    根据所述样本数据训练所述未训练的模型,获得所述目标模型类型对应的目标模型。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述样本数据是基于所述分析指令从历史查询记录和所述数据库中的至少一个数据表中采集到的。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述样本数据包括所述分析指令对应的多列数据和所述分析指令对应的历史查询指令数据,所述目标分布特征包括第一特征和第二特征,其中,所述第一特征用于指示所述多列数据的分布特性,所述第二特征用于指示所述历史查询指令数据中列的分布特性。
  6. 根据权利要求5所述的方法,其特征在于,所述第一特征包括下述特征中的至少一项:
    所述多列数据中每列中不同数值的数量;
    所述多列数据中各列之间的相关性信息;和
    所述多列数据中列的数量。
  7. 根据权利要求5或6所述的方法,其特征在于,所述第二特征包括下述特征中的至少一项:
    所述历史查询指令数据的列组合覆盖度信息;
    所述历史查询指令数据中每条历史查询指令对应的列查询范围覆盖度信息;和
    所述历史查询指令数据中历史查询指令之间的列查询相似度信息。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,在所述分析指令包括所述数据库中的多个数据表的标识的情况下,所述样本数据的采集过程具体为:
    向至少一个数据节点发送采样指示信息,所述采样指示信息用于指示所述至少一个数据节点根据所述多个数据表之间的连接关系从所述多个数据表中采集所述分析指令中的列所在的数据;
    采集历史查询记录中与所述分析指令中的列相关的历史查询指令。
  9. 根据权利要求1-8任一项所述的方法,其特征在于,所述映射关系信息是基于人工智能算法的训练获得的。
  10. 根据权利要求1-9任一项所述的方法,其特征在于,所述目标模型的模型参数存储于内存和/或磁盘中。
  11. 根据权利要求1-10任一项所述的方法,其特征在于,所述多个模型类型包括下述类型中的至少两种:
    贝叶斯网络、自回归、混合均匀分布、高斯核函数、和-积概率和神经网络。
  12. 一种基数估计装置,其特征在于,所述装置包括:
    类型选择单元,用于根据目标分布特征和映射关系信息,从多个模型类型中选择目标模型类型,所述映射关系信息包括所述多个模型类型与预知的分布特征之间的映射关系,所述目标分布特征是从基于分析指令采集到的样本数据中提取的;
    模型建立单元,用于根据所述目标模型类型,获得所述目标模型类型对应的目标模型,所述目标模型用于对数据库的查询指令执行基数估计。
  13. 根据权利要求12所述的装置,其特征在于,所述模型建立单元,具体用于:
    根据所述目标模型类型和所述样本数据,获得所述目标模型类型对应的目标模型。
  14. 根据权利要求12或13所述的装置,其特征在于,所述数据库中设置有所述多个模型类型对应的算子,所述模型建立单元,具体用于:
    从所述多个模型类型对应的算子中确定所述目标模型类型对应的算子;
    使用所述目标模型类型对应的算子,创建所述目标模型类型对应的未训练的模型;
    根据所述样本数据训练所述未训练的模型,获得所述目标模型类型对应的目标模型。
  15. 根据权利要求12-14任一项所述的装置,其特征在于,所述样本数据是基于所述分析指令从历史查询记录和所述数据库中的至少一个数据表中采集到的。
  16. 根据权利要求12-15任一项所述的装置,其特征在于,所述样本数据包括所述分析指令对应的多列数据和所述分析指令对应的历史查询指令数据,所述目标分布特征包括第一特征和第二特征,其中,所述第一特征用于指示所述多列数据的分布特性,所述第二特征用于指示所述历史查询指令数据中列的分布特性。
  17. 根据权利要求16所述的装置,其特征在于,所述第一特征包括下述特征中的至少一种:
    所述多列数据中每列中不同数值的数量;
    所述多列数据中各列之间的相关性信息;和
    所述多列数据中列的数量。
  18. 根据权利要求16或17所述的装置,其特征在于,所述第二特征包括下述特征中的至少一种:
    所述历史查询指令数据的列组合覆盖度信息;
    所述历史查询指令数据中每条历史查询指令对应的列查询范围覆盖度信息;和
    所述历史查询指令数据中历史查询指令之间的列查询相似度信息。
  19. 根据权利要求12-18任一项所述的装置,其特征在于,所述装置还包括数据采集单元,在所述分析指令包括所述数据库中的多个数据表的标识的情况下,所述数据采集单元,具体用于:
    向至少一个数据节点发送采样指示信息,所述采样指示信息用于指示所述至少一个数据节点根据所述多个数据表之间的连接关系从所述多个数据表中采集所述分析指令中的列所在 的数据;
    采集历史查询记录中与所述分析指令中的列相关的历史查询指令。
  20. 根据权利要求12-19任一项所述的装置,其特征在于,所述映射关系信息是基于人工智能算法的训练获得的。
  21. 根据权利要求12-20任一项所述的装置,其特征在于,所述目标模型的模型参数存储于内存和/或磁盘中。
  22. 根据权利要求12-21任一项所述的装置,其特征在于,所述多个模型类型包括下述类型中的至少两种:
    贝叶斯网络、自回归、混合均匀分布、高斯核函数、和-积概率和神经网络。
  23. 一种基数估计装置,其特征在于,所述装置包括存储器和处理器,所述存储器存储计算机程序指令,所述处理器运行所述计算机程序指令以使所述装置执行如权利要求1-11任一项所述的方法。
  24. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有程序指令,所述程序指令用于实现权利要求1-11中任一项所述的方法。
  25. 一种计算机程序产品,其特征在于,当所述计算机程序产品在处理器上运行时,使得用于基数估计的装置执行如权利要求1-11任一项所述的方法。
PCT/CN2023/075191 2022-03-31 2023-02-09 一种基数估计方法及装置 WO2023185264A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210334793.5A CN116932580A (zh) 2022-03-31 2022-03-31 一种基数估计方法及装置
CN202210334793.5 2022-03-31

Publications (1)

Publication Number Publication Date
WO2023185264A1 true WO2023185264A1 (zh) 2023-10-05

Family

ID=88199099

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/075191 WO2023185264A1 (zh) 2022-03-31 2023-02-09 一种基数估计方法及装置

Country Status (2)

Country Link
CN (1) CN116932580A (zh)
WO (1) WO2023185264A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040128287A1 (en) * 2002-12-20 2004-07-01 International Business Machines Corporation Self tuning database retrieval optimization using regression functions
US20090077054A1 (en) * 2007-09-13 2009-03-19 Brian Robert Muras Cardinality Statistic for Optimizing Database Queries with Aggregation Functions
CN107885865A (zh) * 2017-11-22 2018-04-06 星环信息科技(上海)有限公司 一种代价优化器与代价估计的方法及其设备
US10942923B1 (en) * 2018-12-14 2021-03-09 Teradata Us, Inc. Deep learning for optimizer cardinality estimation
US20220067045A1 (en) * 2020-08-27 2022-03-03 International Business Machines Corporation Automated query predicate selectivity prediction using machine learning models
CN114153874A (zh) * 2021-11-12 2022-03-08 北京人大金仓信息技术股份有限公司 基数估计方法、装置、电子设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040128287A1 (en) * 2002-12-20 2004-07-01 International Business Machines Corporation Self tuning database retrieval optimization using regression functions
US20090077054A1 (en) * 2007-09-13 2009-03-19 Brian Robert Muras Cardinality Statistic for Optimizing Database Queries with Aggregation Functions
CN107885865A (zh) * 2017-11-22 2018-04-06 星环信息科技(上海)有限公司 一种代价优化器与代价估计的方法及其设备
US10942923B1 (en) * 2018-12-14 2021-03-09 Teradata Us, Inc. Deep learning for optimizer cardinality estimation
US20220067045A1 (en) * 2020-08-27 2022-03-03 International Business Machines Corporation Automated query predicate selectivity prediction using machine learning models
CN114153874A (zh) * 2021-11-12 2022-03-08 北京人大金仓信息技术股份有限公司 基数估计方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN116932580A (zh) 2023-10-24

Similar Documents

Publication Publication Date Title
WO2022068196A1 (zh) 跨模态的数据处理方法、装置、存储介质以及电子装置
CN108038183B (zh) 结构化实体收录方法、装置、服务器和存储介质
WO2021089013A1 (zh) 空间图卷积网络的训练方法、电子设备及存储介质
WO2021052177A1 (zh) 日志解析方法、装置、服务器和存储介质
CN111046035B (zh) 数据自动化处理方法、系统、计算机设备及可读存储介质
CN110471916A (zh) 数据库的查询方法、装置、服务器及介质
US20230139783A1 (en) Schema-adaptable data enrichment and retrieval
JP6355840B2 (ja) ストップワード識別方法および装置
CN113010547A (zh) 一种基于图神经网络的数据库查询优化方法及系统
CN110569289B (zh) 基于大数据的列数据处理方法、设备及介质
CN111627552B (zh) 一种医疗流式数据血缘关系分析、存储方法及装置
WO2021159655A1 (zh) 数据属性填充方法、装置、设备及计算机可读存储介质
US20230315727A1 (en) Cost-based query optimization for untyped fields in database systems
CN111125199B (zh) 一种数据库访问方法、装置及电子设备
CN110018997B (zh) 一种基于hdfs的海量小文件存储优化方法
US20240095529A1 (en) Neural Network Optimization Method and Apparatus
WO2023185264A1 (zh) 一种基数估计方法及装置
CN107861769B (zh) 应用清理方法、装置、存储介质及电子设备
CN110162574B (zh) 数据重分布方式的确定方法、装置、服务器及存储介质
CN114036345A (zh) 一种轨迹数据的处理方法、设备及存储介质
WO2021143010A1 (zh) 一种分布式计算任务的响应方法及设备
Ma et al. Feature selection using forest optimization algorithm based on contribution degree
CN112860736A (zh) 大数据查询优化方法、设备及可读存储介质
WO2021012211A1 (zh) 一种为数据建立索引的方法以及装置
CN114817315B (zh) 数据处理方法以及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777643

Country of ref document: EP

Kind code of ref document: A1