CN116821436A - Fuzzy query-oriented character string predicate accurate selection estimation method - Google Patents
Fuzzy query-oriented character string predicate accurate selection estimation method
- Publication number
- CN116821436A CN116821436A CN202311072853.1A CN202311072853A CN116821436A CN 116821436 A CN116821436 A CN 116821436A CN 202311072853 A CN202311072853 A CN 202311072853A CN 116821436 A CN116821436 A CN 116821436A
- Authority
- CN
- China
- Prior art keywords
- time step
- character
- hidden state
- input
- current time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A fuzzy query-oriented character string predicate accurate selection estimation method uses acquired query statements and the corpus in a database as the input of an autoregressive neural language model architecture and trains the autoregressive neural language model. Single characters of the predicates in an actual query statement are taken in turn as the input of the current time step, and the hidden state of the current time step is determined by combining it with the hidden state of the previous time step. The selectivity evaluation probability of each predicate is then determined from the probability distribution of the next character of each predicate in the actual query statement. Neural language models (NLMs) have in the past been used mainly for natural language processing. The method regards the traditional approach as a simple language model and proposes applying an NLM to database string predicate selectivity estimation; the NLM can produce estimates without constructing dictionaries or statistical summaries, opening up a new, efficient solution for the database string predicate selectivity estimation task.
Description
Technical Field
The application relates to the field of database searching methods, in particular to a fuzzy query-oriented character string predicate accurate selection estimation method.
Background
In databases, accurate selectivity estimation for string predicates has been a long-standing research challenge. In database query optimization, selectivity estimation is a critical step that determines which execution plan the optimizer selects to minimize the execution cost of the query. For queries containing string predicates, particularly those involving string pattern matching (e.g., prefix, substring, and suffix matching), selectivity estimation becomes more complex and difficult: the complexity and diversity of strings make accurate estimation a challenging task.
Conventional approaches typically employ pruned summary data structures (e.g., tries) to build an index that speeds up string pattern matching, and then estimate string predicate selectivity from statistical correlations. However, this approach has limitations. First, pruned summary data structures may require a large amount of memory, especially for databases containing a large number of strings. Second, statistical correlations may produce inaccurate cardinality estimates when processing string pattern matches, causing the query optimizer to select a suboptimal execution plan.
Therefore, a method for accurately selecting and estimating the predicates of the character string for fuzzy query is needed.
Summary of the application:
the application provides a fuzzy query-oriented character string predicate accurate selection estimation method to solve the problems, described in the background art, that existing approaches to accurate selectivity estimation over databases require large amounts of memory and produce inaccurate estimates. The specific technical scheme of the application is as follows:
in a first aspect, the present application provides a fuzzy query-oriented character string predicate accurate selection estimation method, including:
training the autoregressive neural language model by taking the acquired query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, so that the autoregressive neural language model generates a hidden state of a time step;
sequentially taking single characters of a character string in an actual query sentence as the input of a current time step, determining the hidden state of the current time step by combining the hidden state of a previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each predicate;
and determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, wherein the selective evaluation probability is used for an execution optimizer of a database to select an optimal plan.
Further, the training the autoregressive neural language model by using the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture includes:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with a current input to obtain the hidden state of the current time step, wherein the current input is a character vector input by the autoregressive language model framework;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Further, taking the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, and then further comprising:
in the model training process, determining the difference between the prediction probability distribution and the actual target character through a cross entropy loss function;
and correcting the trainable parameters of the autoregressive neural language model architecture based on the difference between the predicted probability distribution and the actual target character.
Further, the step of determining the hidden state of the current time step by sequentially taking the single characters of the character string in the actual query sentence as the input of the current time step and combining the hidden state of the previous time step includes:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
and combining the input of the current time step and the hidden state of the previous time step to determine the hidden state of the current time step.
Further, after the determining the hidden state of the current time step, the method further includes:
and determining the probability distribution of the next character based on the hidden state of the current time step.
Further, the determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query sentence includes:
determining the conditional probability corresponding to each single byte based on the probability distributions corresponding to all the single bytes in each character string in the actual query statement;
and determining the selectivity evaluation probability of the character string based on the conditional probabilities corresponding to all the single bytes of the character string.
Further, the determining of the selectivity evaluation probability of the predicate based on the conditional probabilities corresponding to all the single bytes of the character string is specifically:

$$\hat{P}(s) \;=\; \prod_{t=1}^{m} p(s_t \mid h_{t-1})$$

where $\hat{P}(s)$ is the selectivity evaluation probability for the string of each predicate, $p(s_t \mid h_{t-1})$ is the conditional probability of each individual character, and $p(s_{t+1} \mid h_t)$ represents the probability of generating the next character $s_{t+1}$ given state $h_t$.
In a second aspect, the present application further provides a fuzzy query-oriented string predicate accurate selection estimation system, where the system includes: the system comprises a model training module, a probability prediction module and a selectivity evaluation module;
the model training module is used for taking the acquired query statement and the corpus in the database as the input of an autoregressive neural language model architecture and training the autoregressive neural language model, so that the autoregressive neural language model generates a hidden state for each time step;
the probability prediction module is used for sequentially taking single characters of the character strings in the actual query statement as the input of the current time step, determining the hidden state of the current time step by combining the hidden state of the previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string;
the selectivity evaluation module is used for determining the selectivity evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, and the selectivity evaluation probability is used for an execution optimizer of the database to select an optimal plan.
Preferably, the model training module is further configured to:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with a current input to obtain the hidden state of the current time step, wherein the current input is a character vector input by the autoregressive language model framework;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Preferably, the probability prediction module is further configured to:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
combining the input of the current time step with the hidden state of the previous time step to determine the hidden state of the current time step;
and determining probability distribution of the next character based on the hidden state of the current time step.
The application has the following beneficial effects:
the application provides a fuzzy query-oriented character string predicate accurate selection estimation method that maps each character to a vector space and uses a sequence model to learn the associations between characters, so that the selectivity of any substring can be estimated accurately, which makes the method well suited to string queries. The probability distribution of the next character is determined from the hidden state of the autoregressive neural language model architecture, and from it the selectivity evaluation probability of the character string is determined for use by the execution optimizer of the database in selecting an optimal plan. The application performs string predicate selectivity estimation efficiently and accurately, overcoming the limitations of traditional methods and general-purpose models on this task.
Description of the drawings:
FIG. 1 is a flow chart of a fuzzy query oriented character string predicate accurate selection estimation method of the present application;
FIG. 2 is a diagram of a neuro-linguistic model architecture of a fuzzy query-oriented character string predicate accurate selection estimation method of the present application;
FIG. 3 is a diagram of a predicate selection function implementation process of the fuzzy query-oriented character string predicate accurate selection estimation method.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
String predicates refer to operators in SQL queries used for fuzzy matching of strings, such as the SQL LIKE operator. These operators are typically used for string pattern matching, allowing queries to specify fuzzy-match conditions in terms of wildcards (e.g., %) and placeholders (e.g., _).
In an example, "name LIKE 'abc%' AND zipcode LIKE '%123' AND ssn LIKE '%1234%'", each predicate is a string predicate performing fuzzy matching on the corresponding column (name, zipcode, ssn). Such a query returns all tuples that satisfy the three predicate conditions.
In optimizing a query, the query optimizer needs to estimate the selectivity of each predicate, i.e., the ratio of the number of tuples meeting a particular predicate condition to the total number of tuples. The purpose of selectivity estimation is to help the optimizer determine which predicate should be processed first when executing a query, so as to obtain a more efficient query execution plan. Selectivity estimation is important for determining the join order and which indexes to use when optimizing a query, because low-selectivity predicates may return large amounts of data, while high-selectivity predicates can narrow the result set faster.
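Expressed as a formula (a standard definition consistent with the description above; the symbols $\mathrm{sel}$, $\theta$, $R$, $n$ are chosen here for illustration), the selectivity of a predicate $\theta$ over a relation $R$ containing $n$ tuples is:

$$\mathrm{sel}(\theta) \;=\; \frac{\bigl|\{\, t \in R \;:\; t \text{ satisfies } \theta \,\}\bigr|}{n}$$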
Because of the fuzzy matching nature of string predicates, estimating their selectivity may be more difficult because queries may be matched based on prefixes, suffixes, substrings, or combinations thereof. Inaccurate selectivity estimates may cause the query optimizer to select an inappropriate execution plan, thereby affecting query performance.
Thus, in optimizing queries, accurate selectivity estimation is important to obtain an optimal query execution plan for queries involving string predicates.
Example 1
The application provides a fuzzy query-oriented character string predicate accurate selection estimation method to solve the problems, described in the background art, that existing approaches to accurate selectivity estimation over databases require large amounts of memory and produce inaccurate estimates. The specific technical scheme of the application is as follows:
in a first aspect, the present application provides a fuzzy query-oriented character string predicate accurate selection estimation method, including: training the autoregressive neural language model by taking the acquired query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, so that the autoregressive neural language model generates a hidden state of a time step; sequentially taking single characters of a character string in an actual query sentence as the input of a current time step, determining the hidden state of the current time step by combining the hidden state of a previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each predicate; and determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, wherein the selective evaluation probability is used for an execution optimizer of a database to select an optimal plan.
In this embodiment, first, the autoregressive neural language model architecture is trained with existing query statements and the corpus in the database. The query statements used in training the model are existing query statements whose corresponding query results are already known. After training is completed, hidden states are formed within the autoregressive neural language model; a hidden state is an abstract representation of the structure among neurons after training of the neural language model is complete. As shown in FIG. 2, the probability distribution of the next character is determined from the hidden state. When the model is used, the information of the character strings and predicates of the actual query statement is acquired; in a specific application, the actual query statement refers to the query statement of the current round.
Further, the training the autoregressive neural language model by using the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture includes: vectorizing the whole corpus of the database to generate a character vector sequence; vectorizing the acquired query sentence to generate a character vector; model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture; combining the hidden state of the previous time step with a current input to obtain the hidden state of the current time step, wherein the current input is a character vector input by the autoregressive language model framework; and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Further, taking the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, and then further comprising: in the model training process, determining the difference between the prediction probability distribution and the actual target character through a cross entropy loss function; and correcting the trainable parameters of the autoregressive neural language model architecture based on the difference between the predicted probability distribution and the actual target character.
Further, the step of determining the hidden state of the current time step by sequentially taking the single characters of the character string in the actual query sentence as the input of the current time step and combining the hidden state of the previous time step includes: sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step; and combining the input of the current time step and the hidden state of the previous time step to determine the hidden state of the current time step.
Further, after the determining the hidden state of the current time step, the method further includes: and determining the probability distribution of the next character based on the hidden state of the current time step.
Further, the determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query sentence includes: determining the conditional probability corresponding to each single byte based on the probability distributions corresponding to all the single bytes in each character string in the actual query statement; and determining the selectivity evaluation probability of the character string based on the conditional probabilities corresponding to all the single bytes of the character string.
Further, the determining of the selectivity evaluation probability of the predicate based on the conditional probabilities corresponding to all the single bytes of the character string is specifically:

$$\hat{P}(s) \;=\; \prod_{t=1}^{m} p(s_t \mid h_{t-1})$$

where $\hat{P}(s)$ is the selectivity evaluation probability for the string of each predicate, $p(s_t \mid h_{t-1})$ is the conditional probability of each individual character, and $p(s_{t+1} \mid h_t)$ represents the probability of generating the next character $s_{t+1}$ given state $h_t$.
Example 2
1. Basic information
Let $A$ be a finite alphabet. We have a relation $R = \{s_1, s_2, \dots, s_n\}$, where each $s_i \in A^{*}$. This relation contains $n$ strings.
Querying with character string predicates: SQL supports two wildcard characters, % and _, for specifying string patterns. The percent sign matches any sequence of zero or more characters, and the underscore matches exactly one character. The query "LIKE %Sam%" matches all strings that contain the substring "Sam". The query "LIKE s_m" matches all three-character strings whose first character is s and whose last character is m. Various queries may be expressed using these wildcard characters. A prefix query matches all strings beginning with abc and may be denoted "LIKE abc%". The suffix query "LIKE %abc" matches all strings ending with abc. The substring query "LIKE %abc%" matches all strings containing the substring "abc".
q-gram: let s.epsilon.R be a string of characters with a length of |s|. Let s [ i ] denote the ith character of s and s [ i, j ] denote the substring from the ith character to the jth character. Given a positive number q, q-gram of s is the set obtained by sliding a window of length q over s. For example, if q=2, s=sam, then q-gram is { sa, am }.
2. Training targets
In this task, we can obtain the probability of a given character sequence by training a character-based language model. For the prefix query "q%", we want to find all strings beginning with the string "q". In the language model, we can use a character sequence $s_1 s_2$ to represent the string "q%", where $s_1$ denotes the character "q" and $s_2$ denotes "%".

Now, from the probabilistic definition of the language model, we can calculate the probability of the string "q%" (i.e., of the sequence $s_1 s_2$) as:

$$P(s_1 s_2) \;=\; p(s_1)\, p(s_2 \mid s_1)$$

where $p(s_1)$ represents the probability of the character "q", and $p(s_2 \mid s_1)$ represents the probability of each subsequent character given the preceding characters; in general, $P(s_1 \cdots s_m) = p(s_1) \prod_{i=2}^{m} p(s_i \mid s_1 \cdots s_{i-1})$. Therefore, $P(s_1 s_2)$ represents the probability of occurrence of the entire character sequence "q%". Since the prefix query "q%" matches strings beginning with "q", the probability of the character sequence "q%" is equivalent to a selectivity estimate for the prefix query "q%". Therefore, with the trained language model, we can use $P(s_1 s_2)$ to estimate the selectivity of the prefix query "q%"; similarly, we can obtain a selectivity estimate of the suffix query "%q" by training a language model on the reversed corpus.
To achieve this, we do not need to build a suffix tree; we can train a neural language model directly at the character level to learn these probabilities.
3. Data preparation and input:
given a set of character stringsEach of which is->Is a sequence of characters that may be a query condition entered by a user or a known fuzzy query sample. Creating a large corpus by concatenating all strings>. For example, if->Q in the q-gram is set to 1, resulting inWherein->Is a special token that indicates the beginning and end of a word.Sequences represented as characters:WhereinRepresentation corpus->Middle->A character. Processing +.>The representation of each character as a vector is achieved using coding techniques and the entire corpus will be converted into a sequence of character vectors as input to the NLM.
4. NLM training:
The NLM is a neural language model architecture. At each time step $t$, the model receives a character vector $x_t$ as input; this vector represents the character information of the current time step. The character vectors are taken sequentially from the corpus $D$. In embodiments of the application, the training model may be a general neural language model, including recurrent and Transformer architectures, GPT-series models, and the like. The hidden state of the previous time step, $h_{t-1}$, is combined with the current input $x_t$ to obtain the hidden state $h_t$, namely $h_t = f(h_{t-1}, x_t)$. This hidden state contains information learned from the preceding characters, reflecting the statistical correlation and semantic information between characters. The model then feeds the hidden state $h_t$ into a fully connected layer whose output passes through a Sigmoid activation function, $\sigma(z) = 1/(1 + e^{-z})$. After the activation function, the model uses the softmax function to convert the output of the fully connected layer into a probability distribution $p(c_{t+1} \mid c_1 \cdots c_t)$. The softmax function maps each element to a real number between 0 and 1 and ensures that the sum of all elements equals 1, thus forming a probability distribution. In this way, the probability estimate of the next character given the preceding characters is obtained.
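The recurrence $h_t = f(h_{t-1}, x_t)$ followed by the fully connected layer, Sigmoid, and softmax can be sketched in plain numpy as follows; the tanh cell, the weight names, and the sizes are illustrative assumptions (the patent leaves $f$ open), and `vocab` comes from the data-preparation sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = len(vocab), 64               # vocabulary size, hidden size (H assumed)
Wxh = rng.normal(0, 0.1, (H, V))    # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))    # hidden-to-hidden (recurrent) weights
Why = rng.normal(0, 0.1, (V, H))    # fully connected output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(h_prev, x_onehot):
    """One time step: h_t = f(h_{t-1}, x_t), then FC -> Sigmoid -> softmax."""
    h_t = np.tanh(Wxh @ x_onehot + Whh @ h_prev)
    p_next = softmax(sigmoid(Why @ h_t))   # distribution over the next character
    return h_t, p_next
```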
5. Output prediction:
Having obtained the probability distribution, the model selects the character with the highest probability as the output prediction for the current time step. During the training phase, this prediction is compared with the actual next character, the loss (e.g., cross-entropy loss) is calculated, and the parameters of the model are then updated using a back-propagation algorithm and an optimizer to minimize the loss. In the inference phase, the next character can be sampled according to the predicted probability distribution and then fed as the input to the next time step.
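The two prediction modes just described, greedy selection for the training-time comparison and sampling at inference, might look like this in the running sketch (function names are ours):

```python
def greedy_next(p_next):
    """Training phase: the highest-probability character is the prediction."""
    return int(np.argmax(p_next))

def sample_next(p_next, rng=np.random.default_rng(1)):
    """Inference phase: sample the next character from the distribution."""
    return int(rng.choice(len(p_next), p=p_next))
```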
6. Loss calculation:
During training, a cross-entropy loss function is used to measure the difference between the predicted probability distribution and the actual target character, so that predictions move closer to the true values and the next character can be predicted more accurately.
$$\ell_t \;=\; \mathrm{CE}(\hat{y}_t,\, y_t) \;=\; -\sum_{k=1}^{|A|} y_{t,k} \log \hat{y}_{t,k}$$

denotes the language model loss function at time $t$, the local cross entropy; CE represents cross entropy, and $\hat{y}_{t,k}$ and $y_{t,k}$ are the predicted and actual probabilities of the $k$-th character of the alphabet at position $t$. The cross entropy between the prediction $\hat{y}_t$ and the target $y_t$ at time $t$ is the local cross entropy, which is used to measure the error of the current prediction; its function is to quantify how accurately the model predicts the next character given the context of the preceding characters.

$$L \;=\; \frac{1}{N} \sum_{t=1}^{N} \ell_t$$

represents the global cross-entropy loss over the whole corpus $D$. It computes the total loss across the entire data set by averaging the local cross-entropy losses over the entire corpus, where $N$ is the total number of characters in the corpus and $\ell_t$ is the loss at time $t$. The formula averages the losses at all times $t$ to obtain the overall training loss of the language model on the corpus. The purpose of the global cross entropy is to comprehensively evaluate the performance of the model on the whole corpus and to measure the accuracy of the model's character predictions over the entire sequence.
The local cross entropy loss helps the model adjust its parameters at each time step so that it makes accurate predictions on individual characters. In order to optimize the local cross entropy loss, the parameters of the model need to be fine-tuned according to the loss. This process is implemented by a back-propagation algorithm that calculates the gradient of the local cross entropy loss with respect to the weight parameters of the model. The weight parameters include all trainable parameters in the model, such as connection weights, bias terms, and the like.
Minimizing the global cross-entropy loss ensures that the model predicts characters more consistently across the entire data set, improving performance on the string selectivity estimation task. As with the local loss, the gradient of the global cross-entropy loss is computed with respect to the weight parameters of the model, which include all trainable parameters such as connection weights and bias terms.
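Continuing the sketch, the local and global cross-entropy losses of the two formulas above can be computed as follows (gradient updates omitted; `step`, `V`, `H`, and `encoded` come from the earlier sketches):

```python
def local_ce(p_t, target_idx):
    """Local loss: -log of the probability assigned to the true next character."""
    return -float(np.log(p_t[target_idx] + 1e-12))

def global_ce(seq):
    """Global loss L: average the local losses over the whole corpus (N chars)."""
    h, losses = np.zeros(H), []
    for t in range(len(seq) - 1):
        x = np.zeros(V); x[seq[t]] = 1.0
        h, p_next = step(h, x)
        losses.append(local_ce(p_next, seq[t + 1]))
    return sum(losses) / len(losses)

print(global_ce(encoded))  # overall training loss of the (untrained) sketch model
```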
7. Selectivity estimation:
In the fuzzy query-oriented character string predicate accurate selection estimation method, we focus on the probability estimate of the whole character string sequence $s = s_1 s_2 \cdots s_m$. To calculate $P(s)$ we use a product of conditional probabilities. Specifically, the probability of the entire string sequence,

$$P(s_1 s_2 \cdots s_m) \;=\; \prod_{i=1}^{m} p(s_i \mid s_1 \cdots s_{i-1}),$$

can be determined by multiplying together the conditional probabilities $p(s_i \mid s_1 \cdots s_{i-1})$ of the respective characters. This conditional probability expresses the probability that the next character $s_i$ occurs given the prefix $s_1 \cdots s_{i-1}$. Equivalently, we can use $p(s_{t+1} \mid h_t)$ to represent the probability of generating the next character $s_{t+1}$ given state $h_t$.
Here we use a neural language model (NLM) to implement the state-memory function. The NLM receives a character $s_t$ as input and, through the model's computation, produces a state $h_t$. This state $h_t$ remembers, at the current time step, the characters $s_1 \cdots s_t$ processed so far, thereby learning the statistical correlation of the prefix characters. In the language model, the state $h_t$ is used to output the probability distribution of the next character $s_{t+1}$.
The training objective is to minimize the overall cross-entropy loss $L$ by adjusting the parameters (weights and biases) of the model using gradient-descent optimization techniques, so that the model estimates the probability distributions of characters more accurately; at the same time, the model learns and captures the statistical correlation of prefix characters during training, so as to predict the likelihood of the next character given a prefix more accurately. The same method also applies to suffix queries: by reversing the corpus $D$ and training a language model on it, the selectivity of the suffix query "%q" can be estimated. This further enhances the performance of the model on fuzzy-query tasks. Thus, modeling the probability distribution of string sequences by effectively utilizing the NLM shows good application potential in fuzzy-query scenarios.
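Putting the pieces together, a hedged sketch of prefix-query selectivity estimation via the chain-rule product, conditioning on the begin-of-word token (in practice the model would be trained first; `step`, `vocab`, `BOW`, `V`, and `H` are from the sketches above):

```python
def prefix_selectivity(q: str) -> float:
    """Estimate sel('q%') as the product of per-character conditionals."""
    h, prob, prev = np.zeros(H), 1.0, BOW
    for ch in q:
        x = np.zeros(V); x[vocab[prev]] = 1.0
        h, p_next = step(h, x)        # state h_t remembers the processed prefix
        prob *= p_next[vocab[ch]]     # multiply p(ch | prefix so far)
        prev = ch
    return prob

print(prefix_selectivity("s"))        # estimate for the prefix query "s%"
```

For the suffix query "%q", the same function would be applied to a model trained on the reversed corpus.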
8. State reset:
The model described in the previous steps may be used directly to estimate the selectivity of a prefix query or a suffix query.
For substring query tasks, however, consider a running example in which it is desirable to estimate the selectivity of the substring query '%a%': a value close to 0 is produced. In particular, the model tries to estimate the probability of "a" occurring at the start of a string. Since the NLM does not see any strings beginning with "a" in such a corpus, this probability will be estimated close to 0. The root cause is that the NLM processes each string from the first character to the last character; therefore it only learns the conditional probabilities corresponding to prefix queries. In the context of natural language processing this limitation is natural, because for the various tasks of language modeling the NLM processes strings from the beginning. However, estimating substring queries requires processing fragments of sentences, which is a very important requirement: to answer substring queries, a model must be able to estimate the probability of any substring in the corpus. None of the existing NLMs can be used to estimate substring selectivity. Building on the training described earlier, a novel NLM training adaptation method is proposed to alleviate this problem.
NLM training is adjusted to fit substring queries: the ability of the NLM to accurately learn prefix-query selectivity is preserved while its use is expanded to substring queries. The reason for poor substring-query performance is entirely that default NLM training does not attempt to learn the conditional probabilities required by substring queries. In our running example, the substring query '%a%' is estimated inaccurately because the probability of "a" given the begin-of-word state is not accurately defined: at the beginning of a string, only the characters that actually start some string in the corpus produce non-zero results, because they are the only characters observed at the beginning of any string.
By making a simple change to training, namely state reset, the NLM can learn these conditional probabilities. While continuously processing the corpus, the following change is considered: at each time step $t$, the hidden state is randomly reset to the initial state with a certain small probability. Consider the sequence $c_1 c_2 \cdots c_t c_{t+1} \cdots$: the NLM processes $c_t$ and generates a new state $h_t$; in an unmodified NLM this yields the probability $p(c_{t+1} \mid c_1 \cdots c_t)$. However, suppose that before processing $c_{t+1}$ the random process resets $h_t$ to the initial state $h_0$. In other words, the model now effectively sees an input sequence that begins at $c_{t+1}$, and the NLM will attempt to estimate $p(c_{t+1} \mid h_0)$, the probability of $c_{t+1}$ occurring at the start of a sequence. This results in better substring-query estimates than the unmodified NLM.
Specifically, the state reset is governed by a hyperparameter (denoted here $\gamma$) that controls the probability of resetting the state. Setting $\gamma = 0$ is equivalent to training a classical NLM, while a larger $\gamma$ results in a higher reset probability. The concept of state reset is similar to dropout: in a feed-forward neural network, nodes are randomly discarded with a certain dropout probability, and although dropout may destabilize the training process, it acts as an effective regularization method that can improve generalization ability. The idea of state reset, by contrast, is to support substring queries. Although some dropout variants work on recurrent networks, they apply only to non-recurrent connections so as to avoid impeding memory capacity; state reset instead explicitly resets the recurrent connection, allowing the NLM to learn how to estimate the selectivity of substrings in the corpus.
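A minimal sketch of the state-reset modification to a training pass (the hyperparameter name `gamma` and the loop structure are ours; the back-propagation update is omitted, and `step`, `local_ce`, `V`, `H` are from the sketches above):

```python
def epoch_with_state_reset(seq, gamma=0.05, seed=2):
    """One pass over the corpus: with probability gamma the hidden state is
    reset to h0 before a step, so p(c | h0) is also learned for characters in
    the middle of strings; gamma = 0 recovers classical NLM training."""
    rng = np.random.default_rng(seed)
    h0 = np.zeros(H)
    h, total = h0, 0.0
    for t in range(len(seq) - 1):
        if rng.random() < gamma:
            h = h0                     # state reset: forget the current prefix
        x = np.zeros(V); x[seq[t]] = 1.0
        h, p_next = step(h, x)
        total += local_ce(p_next, seq[t + 1])
        # parameter update via back-propagation would go here
    return total / (len(seq) - 1)
```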
Accurate selectivity estimation for prefix, substring, and suffix queries is a challenging problem. By training a neural language model, the method provided by the application yields a fuzzy query-oriented character string predicate accurate selection estimation method that can adapt to prefix, substring, and suffix query tasks, solving the challenging problem of accurately estimating string-predicate selectivity in a database.
Example 3
In a second aspect, the present application further provides a fuzzy query-oriented string predicate accurate selection estimation system, where the system includes: the system comprises a model training module, a probability prediction module and a selectivity evaluation module;
the model training module is used for taking the acquired query statement and the corpus in the database as the input of an autoregressive neural language model architecture and training the autoregressive neural language model, so that the autoregressive neural language model generates a hidden state for each time step;
the probability prediction module is used for sequentially taking single characters of predicates in an actual query statement as input of a current time step, determining the hidden state of the current time step by combining the hidden state of a previous time step, wherein the hidden state of the current time step is used for predicting probability distribution of the next character of each predicate;
the selectivity evaluation module is used for determining the selectivity evaluation probability of each predicate based on the probability distribution of the next character of each predicate in the actual query statement, wherein the selectivity evaluation probability is used for an execution optimizer of a database to select an optimal plan.
Preferably, the model training module is further configured to:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with a current input to obtain the hidden state of the current time step, wherein the current input is a character vector input by the autoregressive language model framework;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Preferably, the probability prediction module is further configured to:
sequentially acquiring single characters of predicates in an actual query sentence as input of a current time step;
combining the input of the current time step with the hidden state of the previous time step to determine the hidden state of the current time step;
and determining probability distribution of the next character based on the hidden state of the current time step.
Preferably, the selectivity evaluation module is further configured to:
determining the conditional probability corresponding to the single byte based on probability distribution corresponding to all single bytes in each predicate in an actual query statement;
a selectivity evaluation probability of the predicate is determined based on conditional probabilities corresponding to all individual bytes of the predicate.
Preferably, the selectivity evaluation module is specifically configured to compute:

$$\hat{P}(s) \;=\; \prod_{t=1}^{m} p(s_t \mid h_{t-1})$$

where $\hat{P}(s)$ is the selectivity evaluation probability for the string of each predicate, $p(s_t \mid h_{t-1})$ is the conditional probability of each individual character, and $p(s_{t+1} \mid h_t)$ represents the probability of generating the next character $s_{t+1}$ given state $h_t$.
Example 4
To further illustrate the specific application of the present application, a neuro-linguistic model architecture diagram is disclosed as shown in fig. 2.
Input layer: representing the input character sequence, such as "BOW, t, i, m".
An embedding layer: the input characters are mapped to a word vector representation.
Hidden state layer: a model is utilized to capture long-term dependencies of the character sequence and output a state representation.
Output layer: the probability distribution of the next character, such as $p(\cdot \mid \text{BOW}, t, i, m)$, is output based on the model state.
Specifically:
a character sequence "BOW, t, i, m" is entered.
The embedding layer maps each character into a vector representation.
The hidden state layer outputs a state vector based on the current input "m" and the preceding character sequence. This state vector encodes the relevant information of the previous sequence.
Based on the hidden state layer, the output layer gives the probability distribution $p(\cdot \mid \text{BOW}, t, i, m)$ of the next character; the character "e" (completing "time") appears to be the most likely one.
Claims (10)
1. A fuzzy query-oriented character string predicate accurate selection estimation method is characterized by comprising the following steps:
training the autoregressive neural language model by taking the acquired query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, so that the autoregressive neural language model generates a hidden state of a time step;
sequentially taking single characters of a character string in an actual query sentence as input of a current time step, determining the hidden state of the current time step by combining the hidden state of a previous time step, and determining probability distribution of a next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string;
and determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, wherein the selective evaluation probability is used for an execution optimizer of a database to select an optimal plan.
2. The method for estimating accurate selection of a predicate for a fuzzy query according to claim 1, wherein training the autoregressive neural language model by using the obtained query sentence and the corpus in the database as inputs of the autoregressive neural language model architecture comprises:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with a current input to obtain the hidden state of the current time step, wherein the current input is a character vector input by the autoregressive language model framework;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
3. The method for accurately selecting and estimating a predicate of a character string for fuzzy query according to claim 2, wherein the input of the obtained query sentence and the corpus in the database as the autoregressive neuro-language model structure further comprises:
in the model training process, determining the difference between the prediction probability distribution and the actual target character through a cross entropy loss function;
and correcting the trainable parameters of the autoregressive neural language model architecture based on the difference between the predicted probability distribution and the actual target character.
4. The method for accurately selecting and estimating a predicate of a character string for fuzzy query according to claim 1, wherein the step of determining the hidden state of the current time step by sequentially using individual characters of the character string in the actual query sentence as the input of the current time step and combining the hidden state of the previous time step comprises the steps of:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
and combining the input of the current time step and the hidden state of the previous time step to determine the hidden state of the current time step.
5. The method for accurately selecting and estimating a predicate of a character string for fuzzy query according to claim 4, further comprising, after said determining the hidden state of the current time step:
and determining the probability distribution of the next character based on the hidden state of the current time step.
6. The method for estimating accurate selection of a predicate for a character string for a fuzzy query according to claim 1, wherein the determining the probability of estimating the selectivity of each character string based on the probability distribution of the next character of each character string in the actual query sentence comprises:
determining the conditional probability corresponding to the single byte based on probability distribution corresponding to all the single bytes in each character string in the actual query statement;
and determining the selectivity evaluation probability of the character string based on the conditional probabilities corresponding to all the single bytes of the character string.
7. The method for accurately selecting and estimating the predicate of a character string for fuzzy query according to claim 6, wherein the determining of the selectivity evaluation probability of the predicate based on the conditional probabilities corresponding to all the single bytes of the character string is specifically:

$$\hat{P}(s) \;=\; \prod_{t=1}^{m} p(s_t \mid h_{t-1})$$

where $\hat{P}(s)$ is the selectivity evaluation probability for the string of each predicate, $p(s_t \mid h_{t-1})$ is the conditional probability of each individual character, and $p(s_{t+1} \mid h_t)$ represents the probability of generating the next character $s_{t+1}$ given state $h_t$.
8. A fuzzy query-oriented string predicate accurate selection estimation system, the system comprising: the system comprises a model training module, a probability prediction module and a selectivity evaluation module;
the model training module is used for taking the acquired query statement and the corpus in the database as the input of an autoregressive neural language model architecture and training the autoregressive neural language model, so that the autoregressive neural language model generates a hidden state for each time step;
the probability prediction module is used for sequentially taking single characters of the character strings in the actual query statement as the input of the current time step, determining the hidden state of the current time step by combining the hidden state of the previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string;
the selectivity evaluation module is used for determining the selectivity evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, and the selectivity evaluation probability is used for an execution optimizer of the database to select an optimal plan.
9. The fuzzy query oriented string predicate accurate selection estimation system of claim 8, wherein the model training module is further to:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with a current input to obtain the hidden state of the current time step, wherein the current input is a character vector input by the autoregressive language model framework;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
10. The fuzzy query oriented string predicate accurate selection estimation system of claim 9, wherein the probabilistic predictive module is further configured to:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
combining the input of the current time step with the hidden state of the previous time step to determine the hidden state of the current time step;
and determining probability distribution of the next character based on the hidden state of the current time step.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311072853.1A CN116821436B (en) | 2023-08-24 | 2023-08-24 | Fuzzy query-oriented character string predicate accurate selection estimation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311072853.1A CN116821436B (en) | 2023-08-24 | 2023-08-24 | Fuzzy query-oriented character string predicate accurate selection estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116821436A true CN116821436A (en) | 2023-09-29 |
CN116821436B CN116821436B (en) | 2024-01-02 |
Family
ID=88127730
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311072853.1A Active CN116821436B (en) | 2023-08-24 | 2023-08-24 | Fuzzy query-oriented character string predicate accurate selection estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116821436B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117892711A (en) * | 2023-12-11 | 2024-04-16 | 中新金桥数字科技(北京)有限公司 | Method for obtaining text correlation based on large model |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060206477A1 (en) * | 2004-11-18 | 2006-09-14 | University Of Washington | Computing probabilistic answers to queries |
US20100010989A1 (en) * | 2008-07-03 | 2010-01-14 | The Regents Of The University Of California | Method for Efficiently Supporting Interactive, Fuzzy Search on Structured Data |
US20140108378A1 (en) * | 2012-10-17 | 2014-04-17 | International Business Machines Corporation | Technique for factoring uncertainty into cost-based query optimization |
WO2018033030A1 (en) * | 2016-08-19 | 2018-02-22 | 中兴通讯股份有限公司 | Natural language library generation method and device |
WO2022033073A1 (en) * | 2020-08-12 | 2022-02-17 | 哈尔滨工业大学 | Cognitive service-oriented user intention recognition method and system |
US20220108081A1 (en) * | 2020-10-01 | 2022-04-07 | Naver Corporation | Method and system for controlling distributions of attributes in language models for text generation |
CN115965033A (en) * | 2023-03-16 | 2023-04-14 | 安徽大学 | Generation type text summarization method and device based on sequence level prefix prompt |
CN116245106A (en) * | 2023-03-14 | 2023-06-09 | 北京理工大学 | Cross-domain named entity identification method based on autoregressive model |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060206477A1 (en) * | 2004-11-18 | 2006-09-14 | University Of Washington | Computing probabilistic answers to queries |
US20100010989A1 (en) * | 2008-07-03 | 2010-01-14 | The Regents Of The University Of California | Method for Efficiently Supporting Interactive, Fuzzy Search on Structured Data |
US20140108378A1 (en) * | 2012-10-17 | 2014-04-17 | International Business Machines Corporation | Technique for factoring uncertainty into cost-based query optimization |
WO2018033030A1 (en) * | 2016-08-19 | 2018-02-22 | 中兴通讯股份有限公司 | Natural language library generation method and device |
WO2022033073A1 (en) * | 2020-08-12 | 2022-02-17 | 哈尔滨工业大学 | Cognitive service-oriented user intention recognition method and system |
US20220108081A1 (en) * | 2020-10-01 | 2022-04-07 | Naver Corporation | Method and system for controlling distributions of attributes in language models for text generation |
CN116245106A (en) * | 2023-03-14 | 2023-06-09 | 北京理工大学 | Cross-domain named entity identification method based on autoregressive model |
CN115965033A (en) * | 2023-03-16 | 2023-04-14 | 安徽大学 | Generation type text summarization method and device based on sequence level prefix prompt |
Non-Patent Citations (2)
Title |
---|
HAOYUAN GUAN ET AL.: "Efficient SPARQL Query Processing Based on Adjacent-Predicate Structure Index", IEEE Xplore *
ZHANG ZONGREN ET AL.: "SPARQL ontology query based on natural language understanding", Journal of Computer Applications (计算机应用), no. 12 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117892711A (en) * | 2023-12-11 | 2024-04-16 | 中新金桥数字科技(北京)有限公司 | Method for obtaining text correlation based on large model |
Also Published As
Publication number | Publication date |
---|---|
CN116821436B (en) | 2024-01-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wu et al. | Hyperparameter optimization for machine learning models based on Bayesian optimization | |
CN112487807B (en) | Text relation extraction method based on expansion gate convolutional neural network | |
CN106502985B (en) | neural network modeling method and device for generating titles | |
US20200012953A1 (en) | Method and apparatus for generating model | |
CN106897371B (en) | Chinese text classification system and method | |
CN111414749B (en) | Social text dependency syntactic analysis system based on deep neural network | |
JP2020520516A5 (en) | ||
CN110928993A (en) | User position prediction method and system based on deep cycle neural network | |
CN112232087B (en) | Specific aspect emotion analysis method of multi-granularity attention model based on Transformer | |
CN110458181A (en) | A kind of syntax dependency model, training method and analysis method based on width random forest | |
CN116821436B (en) | Fuzzy query-oriented character string predicate accurate selection estimation method | |
Santacroce et al. | What matters in the structured pruning of generative language models? | |
CN115422369B (en) | Knowledge graph completion method and device based on improved TextRank | |
CN113935489A (en) | Variational quantum model TFQ-VQA based on quantum neural network and two-stage optimization method thereof | |
CN112766603A (en) | Traffic flow prediction method, system, computer device and storage medium | |
CN110851584A (en) | Accurate recommendation system and method for legal provision | |
CN113157919A (en) | Sentence text aspect level emotion classification method and system | |
WO2020100738A1 (en) | Processing device, processing method, and processing program | |
Azeraf et al. | Highly fast text segmentation with pairwise markov chains | |
US11941360B2 (en) | Acronym definition network | |
CN112347783B (en) | Alarm condition and stroke data event type identification method without trigger words | |
Seilsepour et al. | Self-supervised sentiment classification based on semantic similarity measures and contextual embedding using metaheuristic optimizer | |
CN116720519B (en) | Seedling medicine named entity identification method | |
CN117436451A (en) | Agricultural pest and disease damage named entity identification method based on IDCNN-Attention | |
CN114997155A (en) | Fact verification method and device based on table retrieval and entity graph reasoning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |