CN116361329A - Method and device for estimating search data volume - Google Patents


Publication number
CN116361329A
Authority
CN
China
Prior art keywords
data
query
query request
data table
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310344062.3A
Other languages
Chinese (zh)
Inventor
李俊虎
徐泉清
聂铁铮
杨传辉
申德荣
寇月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oceanbase Technology Co Ltd
Original Assignee
Beijing Oceanbase Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oceanbase Technology Co Ltd filed Critical Beijing Oceanbase Technology Co Ltd
Priority to CN202310344062.3A priority Critical patent/CN116361329A/en
Publication of CN116361329A publication Critical patent/CN116361329A/en
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24542 Plan optimisation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Operations Research (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a method and a device for estimating the retrieved data volume. For a specific data query request, the method considers not only the query features of the request but also the table information of the data tables that the request involves, and extracts data features from that table information. Both data features and query features can therefore be used for data volume estimation. The method introduces the data information of the data tables as a supplement under the Query-Driven architecture, which strengthens the query encoding capability and improves the accuracy of data volume estimation.

Description

Method and device for estimating search data volume
Technical Field
One or more embodiments of the present disclosure relate to the field of database query optimization, and in particular, to a method and apparatus for estimating a search data volume.
Background
SQL is a structured query language that typically tells the database only what result is wanted, not how to obtain it. How the result is obtained is decided by an "execution plan" (also referred to as a query strategy), which is determined by the database's "brain", the optimizer. As the volume of information on the internet grows, developers place increasing demands on database software performance, and one important indicator is how efficiently the database executes external query statements (such as SQL statements). This performance is typically improved by the database query optimizer, which optimizes the query strategy of each query statement. That optimization is usually based on cardinality estimation. Cardinality estimation generally means estimating the number of data entries in a given dataset that satisfy the external query conditions; from this estimate, the query optimizer generates a physical execution plan with the lowest possible cost. Cardinality estimation therefore has an extremely important effect on database query performance.
Disclosure of Invention
One or more embodiments of the present specification describe a method, apparatus, and system for estimating an amount of retrieved data to address one or more of the problems mentioned in the background.
According to a first aspect, there is provided a method of estimating an amount of retrieved data for estimating an amount of retrieved data in a database for a received retrieval statement; the method comprises the following steps: under the condition that a data query request is received, analyzing the data query request through an encoder so as to determine query characteristics; determining each data table involved in the data query request; respectively analyzing the table information of each data table to obtain corresponding data characteristics through the table information; and processing the query characteristics and the data characteristics by using a prediction model, so as to determine the retrieval data quantity of the data query request according to the output result of the prediction model.
In one embodiment, the table information includes at least one of: the number of columns of the data table, the maximum value, the minimum value and the number of non-repeated elements of each attribute column.
In one embodiment, the data features include feature maps in one-to-one correspondence with the data tables, and the feature map corresponding to a single data table is determined by: acquiring a data density heat map of the single data table, wherein the data density heat map is an M×n two-dimensional tensor that describes, with each of the M attribute columns of the single data table divided into a fixed number n of buckets, the data density distribution of each attribute column over the corresponding data points or data ranges; and fusing the numeric range that the data query request defines on the attribute columns of the single data table with the data density heat map to obtain the corresponding feature map.
In one embodiment, fusing the numeric range that the data query request defines on the attribute columns of the single data table with the data density heat map to obtain the corresponding feature map includes: determining, according to the data query request, the mapping between the numeric range defined on an attribute column of the single data table and the data density heat map; and retaining the values at the mapped positions of the data density heat map while replacing all other positions with 0 or null values, to obtain the corresponding feature map.
In one embodiment, when the number of data tables involved in the data query request is k and k is greater than 1, the k feature maps each correspond to one feature channel, forming a k×M×n three-dimensional tensor.
In one embodiment, the query features include at least one of: data table features, connection features between tables, predicate features.
In one embodiment, each training sample of the prediction model corresponds to a sample query statement and a data volume label, and the prediction model is trained by: extracting sample query features and sample data features from the sample query statement; taking the sample query features and the sample data features as input to the prediction model to obtain a sample output result of the prediction model; determining the model loss by comparing the sample output result of the prediction model with the data volume label; and adjusting the undetermined parameters of the prediction model in the direction that reduces the model loss, thereby updating the prediction model.
In a further embodiment, the data range defined by a sample query statement is determined by predicates that a data generator generates in a uniformly random manner.
According to a second aspect, there is provided an apparatus for estimating an amount of search data for estimating an amount of query data in a database for a received search statement; the device comprises:
the analysis unit is configured to analyze the data query request through the encoder under the condition of receiving the data query request, so as to obtain query characteristics;
a determining unit configured to determine respective data tables to which the data query request relates;
an acquisition unit configured to parse the table information of each data table so as to obtain corresponding data features from the table information, wherein the data features comprise feature maps in one-to-one correspondence with the data tables;
and a prediction unit configured to process the query feature and the data feature by using a prediction model, so as to determine the retrieval data amount of the data query request according to the output result of the prediction model.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, characterised in that the memory has executable code stored therein, the processor implementing the method of the first aspect when executing the executable code.
With the method and the device provided by the embodiments of this specification, estimating the query data volume for a specific data retrieval request considers not only the query features of the request but also the table information of the data tables that the request involves, from which data features are extracted. Both data features and query features can therefore be used for data volume estimation. Introducing the data information of the data tables as a supplement under the Query-Driven architecture strengthens the query encoding capability and improves the accuracy of data volume estimation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture for a database query of the present specification;
FIG. 2 is a schematic diagram of a query flow under the technical concept of the present specification;
FIG. 3 shows a flow diagram of estimating an amount of retrieved data according to one embodiment of the present description;
FIG. 4 shows a schematic representation of a data density heat map of one embodiment of the present description;
FIG. 5 shows a schematic block diagram of an apparatus for estimating the amount of retrieved data according to an embodiment of the present specification.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
Fig. 1 illustrates one implementation architecture of the present description. The implementation architecture involves at least one service server and a database. The service server may provide service support for the services (e.g., search, query, payment, and navigation services) that users perform on their terminals. While providing this support, the service server can write data to and read data from the database. For example, the service server may support the shopping, lending, and navigation services of a terminal. Accordingly, it may write to or read from the database shopping record data (e.g., shopping category, amount, time), loan record data (e.g., loan amount, loan category, repayment time), and navigation record data (e.g., navigation time, arrival time at the destination, navigation route, actual route). The database can provide data storage services for a single service server or for several service servers. In the latter case, a single service server may read data from multiple service data tables of the database, based on mutual authorization among the service servers.
The service server typically reads data from the database by means of a query request. The query request may be processed by a computing platform (such as a query server) that is provided at the database end or that operates on the database. The service server sends the query request, and the database determines a corresponding query strategy through the computing platform and retrieves the corresponding data from the database tables. To reduce the load on the database, the computing platform typically determines multiple candidate query strategies from the query request and selects a lower-cost strategy among them as the target strategy for the data query. The computing platform may be a computer, device, or server connected to the database device, or an executor embedded in or running on the database device, such as an optimizer. The cost of a query strategy may be measured by the query cardinality.
Learning-based estimators in conventional techniques generally fall into two types: query-based (Query-Driven) and data-analysis-based (Data-Driven). A Query-Driven estimator often uses a deep neural network (DNN) to design the encoder and treats cardinality estimation as a regression problem in order to capture the relevant features of the query. However, current encodings of SQL are limited to extracting features from the range information of predicates.
For example, consider a field age whose value domain is Domain(age) = [12, 60], and a query statement whose predicate defines the range condition "age <= 18". Conventional preprocessing of the query predicate can produce two kinds of result:
1. The query range is determined as the intersection of the value domain of the age field and the range that the query defines on that field, for example: Domain(age) ∩ Range(Q, age) = [12, 18];
2. A min-max scaling (linear normalization) mechanism is introduced to map the range defined by the query condition to a value between 0 and 1. Here the value is (18 - 12)/60 = 0.1, indicating that the range endpoint defined by the query predicate lies at the 10% quantile position of the field's value domain in the data table.
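The two preprocessing results above can be sketched in a few lines of Python. This is a minimal illustration, not the conventional implementation itself: the function names `intersect` and `min_max_scale` are hypothetical, and the open lower bound of "age <= 18" is modeled as negative infinity. Note that the textbook min-max formula divides by the width of the domain (60 - 12 = 48), which gives 0.125 for this example, whereas the figure of 0.1 quoted in the text divides by the upper bound 60.

```python
def intersect(domain, query_range):
    """Intersect two closed intervals; return None if they do not overlap."""
    lo = max(domain[0], query_range[0])
    hi = min(domain[1], query_range[1])
    return (lo, hi) if lo <= hi else None


def min_max_scale(value, domain):
    """Textbook min-max scaling of a value into [0, 1] over a closed domain."""
    lo, hi = domain
    return (value - lo) / (hi - lo)


domain_age = (12, 60)
# "age <= 18" has no lower bound; assumed here to be -infinity.
query_age = (float("-inf"), 18)

effective = intersect(domain_age, query_age)      # Domain(age) ∩ Range(Q, age)
scaled = min_max_scale(effective[1], domain_age)  # position of the endpoint 18
print(effective, scaled)
```

Both outputs depend only on the query range and the static domain bounds, which is the limitation the following paragraph points out.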
However, both of these methods estimate cardinality by featurizing only the query range. In a dynamic-data scenario, the same query executed several times may need to yield different cardinality estimates as the underlying data changes, and range-only features cannot reflect that difference.
In view of this, the present description provides a solution that considers the context information of both the query and the data, thereby establishing an association between the query and the data. Fig. 2 shows a schematic diagram of an embodiment of the technical idea of the present specification. As shown in Fig. 2, a received query statement is, on the one hand, parsed by an encoder to determine query features; on the other hand, data table information is obtained by a data sniffer (Sniffer), so that data features are determined from the data tables in combination with the query. The query features and the data features are then input together into the prediction model to obtain its output result, from which the corresponding cardinality estimate is obtained. The cardinality estimate may be used to determine a query strategy, and the database is queried by executing that query strategy.
In this way, the additional input data features strengthen the expressive power of the query statement encoding, and the estimation accuracy of the cardinality (i.e., the query data volume) is improved.
The technical idea of the present specification will be described in detail with reference to a specific embodiment shown in fig. 3.
Fig. 3 shows a flow of estimating the amount of retrieved data according to one embodiment of the present description. The flow may be executed by a computer, device, or server with sufficient computing power, more specifically, for example, by the computing platform of Fig. 1 or Fig. 2. The executing entity may be provided in the database device, or may be a control device or the like capable of accessing the database. The flow shown in Fig. 3 may be used during database queries to determine the query cardinality, that is, the amount of data to be queried, for a query request.
As shown in Fig. 3, the process of estimating the amount of retrieved data may include: step 302, when a data query request is received, parsing the data query request through an encoder to determine query features; step 304, determining each data table involved in the data query request; step 306, parsing the table information of each data table to obtain corresponding data features from the table information, wherein the data features comprise feature maps in one-to-one correspondence with the data tables; step 308, processing the query features and the data features with the prediction model, so as to determine the retrieved data amount of the data query request from the output result of the prediction model.
First, in step 302, in the event that a data query request is received, the data query request is parsed by an encoder to determine query characteristics.
The data query request may come from a user terminal or from a service server. For example, the service server may generate a query request from a form page submitted by the user terminal, such as a query for "the regional distribution of users who purchased watches".
The data query request describes the desired target data. It may therefore include query conditions such as the data table to query, a target field (corresponding to a target attribute column), and attribute filter conditions (e.g., age less than 20). It will be appreciated that a data query request may involve a single data table or multiple data tables. In the case of multiple data tables, the request may also include query conditions such as the join relationships between tables. For example, the query target "regional distribution of users who purchased watches" may involve a shopping record table (e.g., containing information on the goods users purchased) and a user information table (e.g., containing users' geographic location information), both of which may contain a unique user identifier (which can be used to describe the join relationship). On the one hand, the users who purchased a watch can be queried from the shopping record table (e.g., by unique user identifier). For example, if the purchased-goods item is c3, the user identifier item is c1, watches are identified by a, and the shopping record table is t1, then the c1 values of the entries in t1 with c3 = a can be queried, with c3 = a in table t1 as the query condition, and the result recorded as a set b. On the other hand, the regions corresponding to the identifiers of users who purchased a watch can be queried from the user information table t2. Assuming the geographic location item is c2, the value of c2 (i.e., the region value) in the corresponding entry is queried for each element b_i in set b.
By parsing these descriptions, corresponding features can be extracted as query features. The parsing may be performed by an encoder, denoted for example as the Query Encoder. The query features may include, for example, at least one of data table features, join features between tables, and predicate features. The data table features may be the names of the queried data tables, their attribute columns, and so on. For example, T.a denotes the "a" attribute column (e.g., an expenditure attribute column) of table T. The join features between tables are, for example, whether a join relationship exists between tables and the attribute columns on which an equi-join occurs. Predicates are typically predefined grammatical terms that constrain query conditions or query results. For example, the predicate "age <= 18" may define the query condition that the value of the age field is less than or equal to 18, and the predicate "ROWNUM <= 10" may specify that the number of entries returned for the current page of a paged query does not exceed 10. The predicate features may include the constrained terms "age" and "ROWNUM", the comparison operator "<=", and the bounding endpoint values "18" and "10". Alternatively, the predicates "age <= 18" and "ROWNUM <= 10" may be used directly as features.
Meanwhile, while the data query request is being parsed, each data table involved in the request may be determined in step 304, for example the shopping record table t1 and the user information table t2 of the example above.
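As a rough illustration of how the predicate features described above (constrained term, comparison operator, endpoint value) might be pulled out of a simple predicate string, consider the following sketch. The regular expression, the function name `predicate_features`, and the dictionary layout are all assumptions made for this example; the specification does not state how its Query Encoder parses predicates, and a real system would work from the parsed SQL syntax tree rather than from raw text.

```python
import re

# Hypothetical pattern for predicates of the form "<column> <operator> <number>".
# Two-character operators are listed first so "<=" is not read as "<".
PREDICATE_RE = re.compile(
    r"^\s*([A-Za-z_][\w.]*)\s*(<=|>=|<>|!=|=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$"
)


def predicate_features(predicate):
    """Split a simple numeric predicate into column, operator, and endpoint."""
    match = PREDICATE_RE.match(predicate)
    if match is None:
        raise ValueError(f"unsupported predicate: {predicate!r}")
    column, operator, value = match.groups()
    return {"column": column, "operator": operator, "value": float(value)}


for text in ("age <= 18", "ROWNUM <= 10", "T.a >= 20"):
    print(predicate_features(text))
```

The resulting fields could then be encoded numerically (for instance, a one-hot operator code plus a scaled endpoint) before being passed to a model; that encoding step is outside this sketch.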
Further, the table information of each data table is parsed, via step 306, to obtain corresponding data features through the table information.
It will be appreciated that once the data tables involved in the data query request are determined, the table information of each data table (which may also be called metadata, i.e., information about the data) can be obtained from the database. The table information is descriptive information about the data table. It may include, for example, the data table name (e.g., denoted T.name), the number of data rows in the table, the field corresponding to each attribute column, the maximum value of each attribute column (e.g., denoted Max(Attr)), the minimum value (e.g., denoted Min(Attr)), and the number of distinct values of the column (Number of Distinct Values, e.g., denoted NDV(Attr)). In addition, the value domain of an attribute column (e.g., denoted Domain(Attr)) may be constructed. The smaller the NDV, the more discrete the corresponding attribute column. For example, a "payment status" attribute column may have an NDV of 2, with values such as {"to pay", "paid"}, and is a discrete attribute column. A "height" attribute column, by contrast, has a larger NDV, because "height" can be any real (floating-point) value within its value domain. In the extreme case, NDV(Attr) may equal the number of data rows in the whole table, i.e., every row takes a different value on that attribute column.
The table information above can be extracted as data features. A portion of the data features may be extracted from each data table, and the features extracted from all the data tables together constitute the data features. For example, if the feature extracted from a single data table is a vector, the data features form a vector when there is one data table and may form a two-dimensional tensor when there is more than one. As another example, if the feature extracted from a single data table is a two-dimensional tensor, the data features form a two-dimensional tensor when there is one data table and may form a three-dimensional tensor when there is more than one.
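The table information listed above can be illustrated with a small in-memory sketch. In practice this metadata would normally be read from the statistics the database already maintains; the function name `column_stats`, the tuple-based row layout, and the sample rows below are assumptions for illustration only, and the function expects at least one row.

```python
def column_stats(rows, columns):
    """Compute per-column Min, Max, and NDV for a non-empty list of row tuples."""
    stats = {}
    for idx, name in enumerate(columns):
        values = [row[idx] for row in rows]
        stats[name] = {
            "min": min(values),
            "max": max(values),
            "ndv": len(set(values)),  # number of distinct values
        }
    return {"row_count": len(rows), "column_count": len(columns), "columns": stats}


# Hypothetical sample data: a discrete column and a continuous column.
rows = [("paid", 172.5), ("to pay", 160.0), ("paid", 181.2), ("to pay", 168.4)]
info = column_stats(rows, ["pay_status", "height"])
print(info["columns"]["pay_status"]["ndv"])  # discrete column: small NDV
print(info["columns"]["height"]["ndv"])      # here NDV equals the row count
```

In this toy table the "height" column illustrates the extreme case mentioned above, where every row has a different value.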
According to one possible design, the data features include feature maps in one-to-one correspondence with the data tables, and the feature map of a single data table is determined by: acquiring the data density heat map of that data table, and fusing the numeric range that the data query request defines on the attribute columns of that data table with the data density heat map to obtain the corresponding feature map. The data density heat map contains information describing the data distribution density and may take the form of an image, an array, a set, and so on. In a data density heat map, each value represents the data density of the corresponding data point or data range. The data density may be expressed as a number of rows (e.g., 100 rows), as a percentage (e.g., 1%), and so on.
Fig. 4 shows a schematic diagram of a data density heat map. In Fig. 4, each row corresponds to one attribute column, and the attribute values of that column are divided into several intervals. Each interval corresponds to a data point (e.g., 15 years of service) or a data range (e.g., expenditure of 3000 to 4000 yuan), and the number of data rows in each interval is indicated by color (the color bar on the right shows the colors representing 0 to 50000 rows), yielding the data density heat map shown in Fig. 4.
In an alternative implementation, each attribute column of the data table may be bucketed, e.g., a single attribute column is divided into n buckets. For that attribute column, a data density can then be determined from the number of data rows in each bucket, i.e., for each corresponding data point or data range, giving n densities in total. A single attribute column thus has a density set of n data densities, which may be expressed as an array, a vector, and so on. Assuming a single data table has M attribute columns, M density sets are obtained.
Optionally, these density sets may be represented as an M×n two-dimensional tensor. Further, when the current data query request involves multiple data tables, each two-dimensional tensor can be regarded as a feature map that corresponds to one channel, and the feature maps can be arranged along the channel dimension to form a three-dimensional tensor. For example, the feature maps of k data tables can be arranged into a k×M×n three-dimensional tensor.
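The bucketing and stacking just described can be sketched with plain nested lists. The equal-width bucketing rule, the function names `density_row` and `density_heat_map`, and the sample columns are assumptions for this illustration; the specification does not fix the bucketing scheme. The example also assumes both tables have the same number of attribute columns M, since a k×M×n tensor needs a common M.

```python
def density_row(values, n_buckets):
    """Split one attribute column into n equal-width buckets; return densities."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # fall back to 1.0 for a constant column
    counts = [0] * n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)  # clamp the maximum value
        counts[idx] += 1
    return [c / len(values) for c in counts]


def density_heat_map(table_columns, n_buckets):
    """Build an M x n heat map: one density row per attribute column."""
    return [density_row(col, n_buckets) for col in table_columns]


# Two hypothetical tables, each with M = 2 numeric attribute columns.
t1_columns = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              [100, 100, 100, 100, 100, 500, 500, 500, 500, 900]]
t2_columns = [[20, 25, 30, 35, 40], [1.0, 1.0, 2.0, 3.0, 5.0]]

# Stack the per-table heat maps along the channel dimension: k x M x n.
tensor = [density_heat_map(cols, 5) for cols in (t1_columns, t2_columns)]
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Each row of a heat map sums to 1 here because densities are expressed as fractions of the table's row count; row counts could be stored instead, as the text allows.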
Fusing the numeric range that the data query request defines on the attribute columns of the single data table with the data density heat map may be performed as follows: the mapping between the numeric range defined on an attribute column of the single data table and the data density heat map is determined according to the data query request; the values at the mapped positions of the data density heat map are retained, and all other positions are replaced with 0 or null values, giving the corresponding feature map. For example, suppose the query request Q defines the query range Range(Q, T.a) = [20, 40] on the "a" field (e.g., years of service) of table T, and the "a" field is bucketed on the data density heat map as [1, 6], [7, 12], [13, 18], [19, 24], [25, 30], [31, 36], [37, 42], with corresponding density values [0.25, 0.15, 0.1, 0.05, 0.15, 0.2, 0.1]. The intervals to which the query range maps are [19, 24], [25, 30], [31, 36], and [37, 42]. Keeping the density values of these intervals and setting all others to 0, the feature values of the "a" field of table T in the corresponding feature map are [0, 0, 0, 0.05, 0.15, 0.2, 0.1]. If the data query request places no constraint on the other fields, the feature values of those fields in the feature map are all 0. In Fig. 4, the numeric ranges of the fields that correspond to the query conditions are covered by color patches.
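The worked example above can be reproduced with a short masking routine. The names `mask_density` and `feature_map`, the overlap test on closed intervals, and the second column "b" with its densities are assumptions introduced for this sketch; only the buckets, densities, and query range of field "a" come from the text.

```python
def mask_density(buckets, densities, query_range):
    """Keep the density of every bucket that overlaps the query range, else 0."""
    q_lo, q_hi = query_range
    return [
        d if (b_lo <= q_hi and b_hi >= q_lo) else 0.0
        for (b_lo, b_hi), d in zip(buckets, densities)
    ]


def feature_map(columns, query_ranges):
    """columns: {name: (buckets, densities)}; unconstrained columns become zeros."""
    rows = []
    for name, (buckets, densities) in columns.items():
        if name in query_ranges:
            rows.append(mask_density(buckets, densities, query_ranges[name]))
        else:
            rows.append([0.0] * len(densities))
    return rows


buckets = [(1, 6), (7, 12), (13, 18), (19, 24), (25, 30), (31, 36), (37, 42)]
densities = [0.25, 0.15, 0.1, 0.05, 0.15, 0.2, 0.1]
feature_row = mask_density(buckets, densities, (20, 40))
print(feature_row)  # [0.0, 0.0, 0.0, 0.05, 0.15, 0.2, 0.1]

# Hypothetical second column "b" that the query does not constrain.
columns = {"a": (buckets, densities),
           "b": (buckets, [0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1])}
fmap = feature_map(columns, {"a": (20, 40)})
```

Because the retained values are densities rather than range endpoints, the same query produces a different feature map whenever the underlying data distribution changes.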
In other embodiments, other data features may be obtained from the table information, for example, the percentage features of the data related to the single field in the data query request, and the like, which are not described herein.
The query features and data features are then processed using the predictive model to determine the amount of retrieved data for the data query request based on the output of the predictive model in step 308.
The prediction model is a model for data volume estimation. It receives the query features and the data features together as input and outputs an estimate of the data volume, such as 200,000 rows. The query features and the data features may be combined by concatenation, addition, or the like before being input into the prediction model, or they may be input into the prediction model through separate channels.
The prediction model may be trained in advance. Each training sample may consist of a query statement and the actual data volume. Sample query features and sample data features are extracted from the query statement according to the methods of steps 302 to 308 and used as the input of the prediction model, with the actual data volume as the label. The model loss is determined by comparing the output of the prediction model with the label, and the undetermined parameters of the prediction model are adjusted in the direction that reduces the model loss. In general, training is complete when a predetermined end condition is met, such as convergence of the model loss, convergence of the model parameters, or the accuracy of the prediction model on a test set reaching a predetermined value.
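Of the combination options mentioned above, concatenation is the simplest to illustrate. The sketch below flattens a nested data-feature tensor and appends it to a query-feature vector; the helper names `flatten` and `combine_by_concatenation` and the sample values are hypothetical, and a model that takes the two inputs through separate channels would not need this step.

```python
def flatten(tensor):
    """Flatten an arbitrarily nested list of numbers into a flat list of floats."""
    if isinstance(tensor, (int, float)):
        return [float(tensor)]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat


def combine_by_concatenation(query_features, data_features):
    """Concatenate the query-feature vector with the flattened data features."""
    return flatten(query_features) + flatten(data_features)


query_vec = [1, 0, 0.125]                          # hypothetical query encoding
data_tensor = [[[0.0, 0.05], [0.0, 0.0]]]          # hypothetical 1 x 2 x 2 tensor
model_input = combine_by_concatenation(query_vec, data_tensor)
print(model_input)
```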
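The training procedure described above (forward pass, loss from comparing output and label, parameter update in the loss-reducing direction) can be sketched with a deliberately simple stand-in model. The linear model, the mean squared error loss, the learning rate, and the toy samples are all assumptions for illustration; the specification does not prescribe the architecture or loss of its prediction model, and labels are expressed here as fractions of the table rather than raw row counts to keep the numbers small.

```python
def predict(weights, bias, features):
    """Stand-in prediction model: a linear function of the input features."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(weights, bias, samples):
    """Model loss: mean squared difference between outputs and labels."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in samples) / len(samples)


def train(samples, epochs=200, lr=0.1):
    """Adjust the parameters by gradient descent in the loss-reducing direction."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in samples:
            err = predict(weights, bias, x) - y
            for i in range(n):
                grad_w[i] += 2 * err * x[i] / len(samples)
            grad_b += 2 * err / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Hypothetical samples: ([query feature, data feature], label as table fraction).
samples = [([0.1, 0.05], 0.05), ([0.5, 0.30], 0.30),
           ([0.9, 0.70], 0.70), ([0.3, 0.15], 0.15)]
initial_loss = mean_squared_error([0.0, 0.0], 0.0, samples)
weights, bias = train(samples)
final_loss = mean_squared_error(weights, bias, samples)
print(initial_loss, final_loss)
```

A fixed epoch count stands in for the end condition here; checking for loss convergence or test-set accuracy, as the text describes, would replace the fixed loop bound.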
Those skilled in the art will appreciate that the training samples of the predictive model may be determined based on historical query collection. In an alternative embodiment, predicates in the sample query statement can also be generated in a uniform and random mode, and the database is queried according to the generated query statement, so that the actual number of query data is obtained as the tag data size. Here, the predicate is used to define information such as a query range satisfying a condition, or the number of output data pieces of the current page. The uniform randomization may be for each field in the data table, e.g., the numerical ranges defined for a field predicate in a query statement are uniformly distributed, e.g., 10 for all age fields, and 0 to 10, 20 to 30, and 10 to 20 … …, respectively. Thus, by randomly generating end points or intermediate values, a data range is determined based on the generated values.
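One possible reading of the uniformly random sample generation is sketched below: both endpoints of a range predicate are drawn uniformly from the field's domain, and the label is obtained by counting matching rows. The in-memory list `ages` stands in for querying the real database, and the names `random_range` and `make_samples`, the seed, and the domain are assumptions for this example.

```python
import random


def random_range(rng, domain):
    """Draw two endpoints uniformly from the domain and order them."""
    lo, hi = domain
    a, b = rng.uniform(lo, hi), rng.uniform(lo, hi)
    return (min(a, b), max(a, b))


def make_samples(values, domain, count, seed=0):
    """Generate (range predicate, actual row count) training samples."""
    rng = random.Random(seed)  # seeded so the generated samples are reproducible
    samples = []
    for _ in range(count):
        q_lo, q_hi = random_range(rng, domain)
        # Stand-in for executing "q_lo <= age <= q_hi" against the database.
        label = sum(1 for v in values if q_lo <= v <= q_hi)
        samples.append(((q_lo, q_hi), label))
    return samples


ages = [12, 15, 18, 22, 27, 31, 36, 44, 52, 60]  # hypothetical age column
training_samples = make_samples(ages, (12, 60), count=5)
for query_range, label in training_samples:
    print(query_range, label)
```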
In summary, when estimating the data volume for a specific data retrieval request, not only are the query features of the query request considered, but the data table information of the relevant data tables is also determined according to the query request, and data features are extracted from it. Both the data features and the query features are thus taken into account when estimating the query data volume. Introducing the data information of the data tables into a query-driven architecture supplements the query information, which can strengthen the query encoding capability and improve the accuracy of the data volume estimate. In addition, because data change information can be captured through the data features, updates caused by significant data changes can be distinguished from updates caused by minor changes; update time increases only when the data changes significantly, which improves update efficiency for minor changes and enhances adaptability to data changes.
According to another aspect, the present disclosure also provides a device for estimating the amount of retrieved data. The device may be located at the database end or at a database control end (such as a query server), for example on the computing platforms shown in fig. 1 and fig. 2. Fig. 5 illustrates an apparatus 500 for estimating an amount of retrieved data according to a particular embodiment.
As shown in fig. 5, the apparatus 500 includes:
a parsing unit 501 configured to, when a data query request is received, parse the data query request through an encoder to obtain query features;
a determining unit 502 configured to determine the respective data tables involved in the data query request;
an obtaining unit 503 configured to parse the table information of each data table respectively, so as to obtain corresponding data features from the table information, where the data features include feature maps in one-to-one correspondence with the data tables;
and a prediction unit 504 configured to process the query features and the data features using the prediction model, thereby determining the retrieval data amount of the data query request according to the output result of the prediction model.
According to one possible design, the apparatus 500 may further comprise a mapping unit (not shown) configured to determine the feature map corresponding to the single data table by:
acquiring a data density heat map of the single data table, wherein the data density heat map is an M×n two-dimensional tensor describing, with each of the M attribute columns of the single data table divided into a fixed number n of buckets, the data density of each attribute column at the corresponding data point or data range;
and fusing the numerical range that the data query request defines for the attribute columns of the single data table with the data density heat map to obtain the corresponding feature map.
In one embodiment, the mapping unit is further configured to: determine, according to the data query request, a mapping relation between the numerical range defined for the attribute columns of the single data table and the data density heat map; and retain the values at the mapped positions on the data density heat map while replacing the values at the other positions with 0 or null values, to obtain the corresponding feature map.
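The construction of the M×n density map and its fusion with the query range can be sketched as follows. This is a minimal illustration under stated assumptions: equal-width buckets, densities expressed as fractions of the row count, and columns without a predicate being kept unmasked are all choices made here, not details confirmed by the disclosure, and the function names and data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Range = Tuple[float, float]


def bucket_index(value: float, lo: float, hi: float, n_buckets: int) -> int:
    """Map a value to one of n equal-width buckets over [lo, hi] (assumption)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_buckets)
    return min(max(idx, 0), n_buckets - 1)


def density_map(rows: Sequence[Sequence[float]], col_ranges: List[Range],
                n_buckets: int) -> List[List[float]]:
    """Build an M x n map: fraction of rows falling in each bucket of each column."""
    counts = [[0.0] * n_buckets for _ in col_ranges]
    for row in rows:
        for c, value in enumerate(row):
            lo, hi = col_ranges[c]
            counts[c][bucket_index(value, lo, hi, n_buckets)] += 1
    total = max(len(rows), 1)
    return [[cell / total for cell in line] for line in counts]


def fuse(heat: List[List[float]], col_ranges: List[Range],
         predicates: Dict[int, Range]) -> List[List[float]]:
    """Keep values at positions covered by the query range; set the rest to 0."""
    n_buckets = len(heat[0])
    fused = []
    for c, line in enumerate(heat):
        if c not in predicates:
            fused.append(list(line))  # no predicate on this column: keep it (assumption)
            continue
        lo, hi = col_ranges[c]
        first = bucket_index(predicates[c][0], lo, hi, n_buckets)
        last = bucket_index(predicates[c][1], lo, hi, n_buckets)
        fused.append([v if first <= i <= last else 0.0 for i, v in enumerate(line)])
    return fused


# Hypothetical table with M = 2 attribute columns and n = 4 buckets per column.
rows = [(5, 100), (15, 200), (25, 300), (35, 400)]
col_ranges = [(0, 40), (0, 400)]
heat = density_map(rows, col_ranges, n_buckets=4)
feature_map = fuse(heat, col_ranges, predicates={0: (10, 29)})
```

In this example the predicate on column 0 covers the second and third buckets, so only those two positions of the first row keep their density values.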
It should be noted that the apparatus 500 shown in fig. 5 corresponds to the method embodiment shown in fig. 3, and thus, the description related to the method in fig. 3 may also be applicable to the apparatus 500 shown in fig. 5, which is not repeated herein.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 3 and the like.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method described in connection with fig. 3 and the like.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the embodiments of the present disclosure may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted as one or more instructions or code on, a computer-readable medium.
The specific embodiments described above further explain the technical concept of the present disclosure in detail. It should be understood that the above description covers only specific embodiments of that technical concept and is not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the embodiments of the present disclosure shall fall within the scope of the technical concept of the present disclosure.

Claims (12)

1. A method for estimating the amount of search data in a database for a received search statement; the method comprises the following steps:
under the condition that a data query request is received, analyzing the data query request through an encoder so as to determine query characteristics;
determining each data table involved in the data query request;
respectively analyzing the table information of each data table to obtain corresponding data characteristics through the table information;
and processing the query characteristics and the data characteristics by using a prediction model, so as to determine the retrieval data quantity of the data query request according to the output result of the prediction model.
2. The method of claim 1, wherein the table information comprises at least one of: the number of columns of the data table, the maximum value, the minimum value and the number of non-repeated elements of each attribute column.
3. The method of claim 1, wherein the data features comprise respective feature maps that are in one-to-one correspondence with respective data tables, the feature map corresponding to a single data table being determined by:
acquiring a data density heat map of the single data table, wherein the data density heat map is an M×n two-dimensional tensor describing the data density distribution of each attribute column at the corresponding data point or data range when each of the M attribute columns of the single data table is divided into a fixed number n of buckets;
and fusing the numerical range defined by the data query request for the attribute columns of the single data table with the data density heat map, so as to obtain the corresponding feature map.
4. The method of claim 3, wherein said fusing the numerical range defined by the data query request for the attribute columns of the single data table with the data density heat map to obtain the corresponding feature map comprises:
determining, according to the data query request, a mapping relation between the numerical range defined for the attribute columns of the single data table and the data density heat map;
and retaining the values at the mapped positions on the data density heat map, and replacing the values at the other positions with 0 or null values, to obtain the corresponding feature map.
5. The method of claim 3, wherein, in the case where the number of data tables involved in the data query request is k and k is greater than 1, the k feature maps respectively correspond to individual feature channels, thereby constituting a k×M×n three-dimensional tensor.
6. The method of claim 1, wherein the query features include at least one of: data table features, connection features between tables, predicate features.
7. The method of claim 1, wherein the samples for training the prediction model correspond to sample query statements and data volume labels, the prediction model being trained by:
extracting sample query features and sample data features from the sample query statement;
taking the sample query features and the sample data features as input information of the prediction model to obtain a sample output result of the prediction model;
determining model loss according to comparison between a sample output result of the prediction model and a data quantity label;
and adjusting undetermined parameters in the prediction model towards the direction of reducing model loss, so as to update the prediction model.
8. The method of claim 7, wherein the data range defined by the sample query statement is determined by a data generator generating the respective predicates in a uniformly random manner.
9. A device for estimating the amount of retrieved data, configured to estimate the amount of query data in a database for a received search statement; the device comprises:
the analysis unit is configured to analyze the data query request through the encoder under the condition of receiving the data query request, so as to obtain query characteristics;
a determining unit configured to determine respective data tables to which the data query request relates;
the acquisition unit is configured to respectively analyze the table information of each data table so as to acquire corresponding data features through the table information, wherein the data features comprise feature graphs which are in one-to-one correspondence with each data table;
and a prediction unit configured to process the query feature and the data feature by using a prediction model, so as to determine the retrieval data amount of the data query request according to the output result of the prediction model.
10. The apparatus of claim 9, wherein the apparatus further comprises a mapping unit configured to determine the feature map corresponding to the single data table by:
acquiring a data density heat map of the single data table, wherein the data density heat map is an M×n two-dimensional tensor describing the data density distribution of each attribute column at the corresponding data point or data range when each of the M attribute columns of the single data table is divided into a fixed number n of buckets;
and fusing the numerical range defined by the data query request for the attribute columns of the single data table with the data density heat map, so as to obtain the corresponding feature map.
11. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-8.
12. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-8.
CN202310344062.3A 2023-03-31 2023-03-31 Method and device for estimating search data volume Pending CN116361329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310344062.3A CN116361329A (en) 2023-03-31 2023-03-31 Method and device for estimating search data volume

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310344062.3A CN116361329A (en) 2023-03-31 2023-03-31 Method and device for estimating search data volume

Publications (1)

Publication Number Publication Date
CN116361329A true CN116361329A (en) 2023-06-30

Family

ID=86941650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310344062.3A Pending CN116361329A (en) 2023-03-31 2023-03-31 Method and device for estimating search data volume

Country Status (1)

Country Link
CN (1) CN116361329A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination