CN116821436B - Fuzzy query-oriented character string predicate accurate selection estimation method - Google Patents


Info

Publication number
CN116821436B
CN116821436B (application CN202311072853.1A)
Authority
CN
China
Prior art keywords
time step
character
input
hidden state
current time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311072853.1A
Other languages
Chinese (zh)
Other versions
CN116821436A (en)
Inventor
张睿恒
赵怡婧
闫紫滕
刘雨蒙
苏毅
王潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Remote Sensing Equipment
Original Assignee
Beijing Institute of Remote Sensing Equipment
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Remote Sensing Equipment
Priority to CN202311072853.1A
Publication of CN116821436A
Application granted
Publication of CN116821436B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 16/90335 Query processing
    • G06F 16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A fuzzy query-oriented character string predicate accurate selection estimation method takes acquired query statements and the corpus in a database as input to an autoregressive neural language model architecture and trains the autoregressive neural language model. The single characters of the predicates in an actual query statement are taken in turn as the input of the current time step, and the hidden state of the current time step is determined by combining it with the hidden state of the previous time step. The selectivity evaluation probability of each predicate is then determined from the probability distribution over the next character of each predicate in the actual query statement. Neural language models have so far been used mainly for natural language processing. This method regards the traditional estimators as simple language models and proposes applying an NLM to the selectivity estimation of database string predicates; the NLM can produce estimates without constructing dictionaries or statistical summaries, opening up an efficient new solution for the database string predicate selectivity estimation task.

Description

Fuzzy query-oriented character string predicate accurate selection estimation method
Technical Field
The invention relates to the field of database searching methods, in particular to a fuzzy query-oriented character string predicate accurate selection estimation method.
Background
In databases, accurate selectivity estimation for string predicates has long been a research challenge. In database query optimization, selectivity estimation is a critical step: it determines which execution plan the optimizer selects to minimize the execution cost of the query. For queries containing string predicates, particularly pattern matching over strings (e.g., prefix, substring, and suffix matching), the complexity and diversity of strings make accurate selectivity estimation especially difficult.
Conventional approaches typically employ pruned summary data structures (e.g., tries) to build an index that speeds up string pattern matching, and then estimate string predicate selectivity from statistical correlations. This approach has limitations. First, pruned summary data structures may require a large amount of memory, especially for databases containing many strings. Second, the statistical correlations may produce inaccurate cardinality estimates for string pattern matching, causing the query optimizer to select a suboptimal execution plan.
Therefore, a method for accurately selecting and estimating the predicates of the character string for fuzzy query is needed.
Summary of the invention:
The invention provides a fuzzy query-oriented character string predicate accurate selection estimation method to solve the problems described in the background art, namely the large storage requirements and inaccurate estimates of existing approaches to selectivity estimation. The specific technical scheme of the invention is as follows:
in a first aspect, the present invention provides a fuzzy query-oriented character string predicate accurate selection estimation method, including:
training the autoregressive neural language model by taking the acquired query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, so that the autoregressive neural language model generates a hidden state for each time step;
sequentially taking the single characters of a character string in an actual query sentence as the input of the current time step, determining the hidden state of the current time step by combining it with the hidden state of the previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string;
and determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, wherein the selective evaluation probability is used for an execution optimizer of a database to select an optimal plan.
Further, the training the autoregressive neural language model by using the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture includes:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with the current input to obtain the hidden state of the current time step, wherein the current input is the character vector input to the autoregressive neural language model architecture;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Further, taking the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, and then further comprising:
in the model training process, determining the difference between the prediction probability distribution and the actual target character through a cross entropy loss function;
and correcting the trainable parameters of the autoregressive neural language model architecture based on the difference between the predicted probability distribution and the actual target character.
Further, the step of determining the hidden state of the current time step by sequentially taking the single characters of the character string in the actual query sentence as the input of the current time step and combining the hidden state of the previous time step includes:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
and combining the input of the current time step and the hidden state of the previous time step to determine the hidden state of the current time step.
Further, after the determining the hidden state of the current time step, the method further includes:
and determining the probability distribution of the next character based on the hidden state of the current time step.
Further, the determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query sentence includes:
determining the conditional probability corresponding to the single byte based on probability distribution corresponding to all the single bytes in each character string in the actual query statement;
based on conditional probabilities corresponding to all individual bytes of the predicate, a selectivity evaluation probability of the string is determined.
Further, the determining of the selectivity evaluation probability of the predicate based on the conditional probabilities corresponding to all the single bytes of the character string is specifically:

\[ P(s) = \prod_{i=1}^{|s|} P(s_i \mid s_1, \dots, s_{i-1}) \]

where \(P(s)\) is the selectivity evaluation probability of the string of a single predicate; \(P(s_i \mid s_1, \dots, s_{i-1})\) is the conditional probability of the respective character; and \(P(s_{i+1} \mid h_i)\) represents the probability of generating the next character \(s_{i+1}\) given the state \(h_i\).
In a second aspect, the present invention further provides a fuzzy query-oriented string predicate accurate selection estimation system, where the system includes: the system comprises a model training module, a probability prediction module and a selectivity evaluation module;
the model training module is used for taking the acquired query statement and the corpus in the database as the input of an autoregressive neural language model architecture and training the autoregressive neural language model, so that the autoregressive neural language model generates a hidden state for each time step;
the probability prediction module is used for sequentially taking single characters of the character strings in the actual query statement as the input of the current time step, determining the hidden state of the current time step by combining the hidden state of the previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string;
the selectivity evaluation module is used for determining the selectivity evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, and the selectivity evaluation probability is used for an execution optimizer of the database to select an optimal plan.
Preferably, the model training module is further configured to:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with the current input to obtain the hidden state of the current time step, wherein the current input is the character vector input to the autoregressive neural language model architecture;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Preferably, the probability prediction module is further configured to:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
combining the input of the current time step with the hidden state of the previous time step to determine the hidden state of the current time step;
and determining probability distribution of the next character based on the hidden state of the current time step.
The invention has the following effects:
the invention provides a fuzzy query-oriented character string predicate accurate selection estimation method, which maps each character to a vector space, learns the association between character strings by using a sequence model, can accurately estimate the selectivity of any sub-character string, and is more suitable for character string query. And determining probability distribution of the next character through the hidden state of the autoregressive neural language model framework, thereby determining the selective evaluation probability of the character string, and using an execution optimizer of the database to select an optimal plan. The invention efficiently and accurately carries out the selective estimation of the predicates of the character strings, and solves the limitation of the traditional method and the general model on the task.
Description of the drawings:
FIG. 1 is a flow chart of a fuzzy query oriented character string predicate accurate selection estimation method of the present invention;
FIG. 2 is a diagram of a neural language model architecture of the fuzzy query-oriented character string predicate accurate selection estimation method of the present invention;
FIG. 3 is a diagram of a predicate selection function implementation process of the fuzzy query-oriented character string predicate accurate selection estimation method.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
String predicates are operators in SQL queries used for fuzzy matching of strings, such as the SQL LIKE operator. These operators are typically used for string pattern matching, allowing queries to specify fuzzy match conditions using wildcards (e.g., %) and placeholders (e.g., _).
For example, in "name LIKE 'abc%' AND zipcode LIKE '%123' AND ssn LIKE '%1234%'", each predicate is a string predicate performing fuzzy matching on its column (name, zipcode, ssn). Such a query returns all tuples that satisfy the three predicate conditions.
When optimizing a query, the query optimizer needs to estimate the selectivity of each predicate, i.e., the ratio of tuples satisfying the predicate condition to the total number of tuples. The purpose of selectivity estimation is to help the optimizer determine which predicate should be processed first when executing a query, so as to obtain a more efficient query execution plan. Selectivity estimation is important for determining the join order and which indexes to use when optimizing a query, because low-selectivity predicates may return a large amount of data, while high-selectivity predicates can narrow the result set faster.
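To make this ratio concrete, the following minimal Python sketch (with hypothetical helper names, not part of the patent) computes the true selectivity of a LIKE predicate by brute force over an in-memory column; the optimizer's estimator must approximate this value without scanning the data.

    import re

    def like_to_regex(pattern):
        # Translate SQL LIKE syntax into an anchored regex: % -> .*, _ -> .
        body = re.escape(pattern).replace("%", ".*").replace("_", ".")
        return re.compile("^" + body + "$")

    def true_selectivity(pattern, column):
        # Fraction of tuples whose value matches the LIKE pattern.
        rx = like_to_regex(pattern)
        return sum(1 for s in column if rx.match(s)) / len(column)

    names = ["Sam", "Samuel", "Samantha", "Tom", "Tim"]
    print(true_selectivity("Sam%", names))   # prefix query    -> 0.6
    print(true_selectivity("%m", names))     # suffix query    -> 0.6
    print(true_selectivity("%im%", names))   # substring query -> 0.2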
Because of the fuzzy matching nature of string predicates, estimating their selectivity may be more difficult because queries may be matched based on prefixes, suffixes, substrings, or combinations thereof. Inaccurate selectivity estimates may cause the query optimizer to select an inappropriate execution plan, thereby affecting query performance.
Thus, in optimizing queries, accurate selectivity estimation is important to obtain an optimal query execution plan for queries involving string predicates.
Specific example 1:
The invention provides a fuzzy query-oriented character string predicate accurate selection estimation method to solve the problems described in the background art, namely the large storage requirements and inaccurate estimates of existing approaches to selectivity estimation. The specific technical scheme of the invention is as follows:
In a first aspect, the present invention provides a fuzzy query-oriented character string predicate accurate selection estimation method, including: training the autoregressive neural language model by taking the acquired query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, so that the autoregressive neural language model generates a hidden state for each time step; sequentially taking the single characters of a character string in an actual query sentence as the input of the current time step, determining the hidden state of the current time step by combining it with the hidden state of the previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string; and determining the selectivity evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, wherein the selectivity evaluation probability is used by an execution optimizer of a database to select an optimal plan.
In this embodiment, the autoregressive neural language model architecture is first trained on existing query sentences and the corpus in the database. The query statements used in training the model are existing ones whose query results are already known. After training is completed, hidden states are formed within the autoregressive neural language model; a hidden state is an abstract representation of the structure between neurons once training of the neural language model is complete. As shown in fig. 2, the probability distribution of the next character is determined from the hidden state. When the model is used, the character strings and predicates of the actual query statement are acquired; in this specific application, the actual query statement refers to the query statement of the current round.
Further, the training the autoregressive neural language model by using the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture includes: vectorizing the whole corpus of the database to generate a character vector sequence; vectorizing the acquired query sentence to generate a character vector; model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture; combining the hidden state of the previous time step with a current input to obtain the hidden state of the current time step, wherein the current input is a character vector input by the autoregressive language model framework; and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Further, taking the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, and then further comprising: in the model training process, determining the difference between the prediction probability distribution and the actual target character through a cross entropy loss function; and correcting the trainable parameters of the autoregressive neural language model architecture based on the difference between the predicted probability distribution and the actual target character.
Further, the step of determining the hidden state of the current time step by sequentially taking the single characters of the character string in the actual query sentence as the input of the current time step and combining the hidden state of the previous time step includes: sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step; and combining the input of the current time step and the hidden state of the previous time step to determine the hidden state of the current time step.
Further, after the determining the hidden state of the current time step, the method further includes: and determining the probability distribution of the next character based on the hidden state of the current time step.
Further, the determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query sentence includes: determining the conditional probability corresponding to the single byte based on probability distribution corresponding to all the single bytes in each character string in the actual query statement; based on conditional probabilities corresponding to all individual bytes of the predicate, a selectivity evaluation probability of the string is determined.
Further, the determining of the selectivity evaluation probability of the predicate based on the conditional probabilities corresponding to all the single bytes of the character string is specifically:

\[ P(s) = \prod_{i=1}^{|s|} P(s_i \mid s_1, \dots, s_{i-1}) \]

where \(P(s)\) is the selectivity evaluation probability of the string of a single predicate; \(P(s_i \mid s_1, \dots, s_{i-1})\) is the conditional probability of the respective character; and \(P(s_{i+1} \mid h_i)\) represents the probability of generating the next character \(s_{i+1}\) given the state \(h_i\).
Specific example 2:
1. basic information
Let \(A\) be a finite alphabet. We have a relation \(R = \{s^1, s^2, \dots, s^n\}\), where each \(s^i \in A^*\). There are n strings in this relation.
Querying with string predicates: SQL supports two wildcard characters, % and _, for specifying string patterns. The percent sign and the underscore allow one or more characters of a string to be replaced. The query "LIKE '%Sam%'" matches all strings containing the substring "Sam". The query "LIKE 's_m'" matches all three-character strings whose first letter is s and whose last letter is m. A variety of queries can be expressed using these wildcards: a prefix query matches all strings beginning with abc and may be written "LIKE 'abc%'"; the suffix query "LIKE '%abc'" matches all strings ending with abc; and the substring query "LIKE '%abc%'" matches all strings containing the substring "abc".
q-gram: let s.epsilon.R be a string of characters with a length of |s|. Let s [ i ] denote the ith character of s and s [ i, j ] denote the substring from the ith character to the jth character. Given a positive number q, q-gram of s is the set obtained by sliding a window of length q over s. For example, if q=2, s=sam, then q-gram is { sa, am }.
2. Training targets
In this task, we can obtain the probability of a given character sequence by training a character-based language model. For the prefix query "q%", we want to find all strings beginning with the string "q". In the language model, we can use a character sequence \((w_1, w_2)\) to represent the string "q%", where \(w_1\) denotes the character "q" and \(w_2\) denotes "%".

Now, from the probabilistic definition of the language model, we can calculate the probability of the string "q%" (i.e., the sequence \((w_1, w_2)\)) as:

\[ P(w_1 w_2) = P(w_1)\, P(w_2 \mid w_1) \]

where \(P(w_1)\) represents the probability of the character "q", and \(P(w_2 \mid w_1)\) represents the probability of each subsequent character given the preceding characters. \(P(w_1 w_2)\) therefore represents the probability of occurrence of the entire character sequence "q%". Since the prefix query "q%" denotes the strings beginning with "q", the probability of the character sequence "q%" is equivalent to a selectivity estimate for the prefix query "q%". Thus, with the trained language model we can use \(P(w_1 w_2)\) to estimate the selectivity of the prefix query "q%"; similarly, a selectivity estimate for the suffix query "%q" can be obtained by training a language model on the reversed corpus.
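The chain-rule computation can be sketched as follows, assuming a callable cond_prob(c, prefix) that returns the model's estimate of P(c | prefix) (a hypothetical interface standing in for the trained NLM):

    def sequence_probability(seq, cond_prob):
        # P(seq) = prod_i P(seq[i] | seq[:i]) by the chain rule.
        p = 1.0
        for i, c in enumerate(seq):
            p *= cond_prob(c, seq[:i])
        return p

    # The selectivity of the prefix query "q%" is then P("q") * P("%" | "q").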
To achieve this, we do not need to build a suffix tree; a neural language model can be trained directly at the character level to learn these probabilities.
3. Data preparation and input:
given a set of character stringsEach of which is->Is a sequence of characters that may be a query condition entered by a user or a known fuzzy query sample. Creating a large corpus by concatenating all strings>. For example, if->Q in the q-gram is set to 1, resulting inWhereinIs a special token that indicates the beginning and end of a word. />Sequences represented as characters:wherein->Representation corpus->Middle->A character. Processing +.>The representation of each character as a vector is achieved using coding techniques and the entire corpus will be converted into a sequence of character vectors as input to the NLM.
4. NLM training:
The NLM is a neural language model architecture. At each time step \(t\), the model receives the character vector \(x_t\) as input; this vector represents the character information of the current time step, and the character vectors are taken sequentially from the corpus \(D\). Alternatively, in embodiments of the present application, the training model may be any autoregressive neural language model, including a Transformer, a GPT-series model, or the like. The hidden state of the previous time step, \(h_{t-1}\), is combined with the current input \(x_t\) to obtain the hidden state of the current time step, namely \(h_t = f(h_{t-1}, x_t)\). This hidden state contains the information learned from the preceding characters, reflecting the statistical correlations and semantic information between characters. The model then feeds the hidden state \(h_t\) into a fully connected layer; the output of the fully connected layer passes through a Sigmoid activation function, \(\sigma(z) = 1/(1 + e^{-z})\). After the activation function, the model uses a softmax function to convert the output of the fully connected layer into a probability distribution \(P(c_{t+1} \mid c_1, \dots, c_t)\). The softmax function maps each element to a real number between 0 and 1 and ensures that the sum of all elements equals 1, thus forming a probability distribution. In this way, the probability estimate of the next character given the preceding characters is obtained.
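The following PyTorch sketch shows one possible realization of such an architecture, using a GRU as the recurrent core (the embodiment equally allows Transformer- or GPT-style models); the layer sizes are illustrative assumptions, and the softmax is deferred to the loss computation, with the intermediate Sigmoid omitted for brevity.

    import torch
    import torch.nn as nn

    class CharNLM(nn.Module):
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)            # character vectors x_t
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # h_t = f(h_{t-1}, x_t)
            self.fc = nn.Linear(hidden_dim, vocab_size)                 # fully connected layer

        def forward(self, x, h=None):
            e = self.embed(x)         # (batch, seq_len, embed_dim)
            out, h = self.rnn(e, h)   # hidden state at every time step
            logits = self.fc(out)     # softmax of these gives P(c_{t+1} | c_1..c_t)
            return logits, h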
5. Output prediction:
Having obtained the probability distribution, the model selects the character with the highest probability as the output prediction for the current time step. In the training phase, this prediction is compared with the actual next character, the loss (e.g., cross entropy loss) is calculated, and the parameters of the model are updated using a back-propagation algorithm and an optimizer to minimize the loss. In the inference phase, the next character can be sampled from the predicted probability distribution and then fed as the input to the next time step.
6. Loss calculation:
During training, a cross entropy loss function is used to measure the difference between the predicted probability distribution and the actual target character, driving the predictions closer to the true values and enabling better prediction of the next character.
\(\ell_t\) denotes the language model loss at time \(t\), the local cross entropy:

\[ \ell_t = \mathrm{CE}(\hat{y}_t, y_t) = -\sum_{k=1}^{|A|} y_{t,k} \log \hat{y}_{t,k} \]

where CE denotes cross entropy, and \(\hat{y}_{t,k}\) and \(y_{t,k}\) are the predicted and actual probabilities of the k-th character of the alphabet \(A\) at position \(t\). The cross entropy between the prediction \(\hat{y}_t\) and the target \(y_t\) at time \(t\) is the local cross entropy; it measures the error of the current prediction and quantifies how accurately the model predicts the next character given the preceding context.

\[ L = \frac{1}{N} \sum_{t=1}^{N} \ell_t \]

represents the global cross entropy loss over the whole corpus \(D\). It computes the total loss over the entire data set by averaging the local cross entropy losses over the corpus, where \(N\) is the total number of characters in the corpus and \(\ell_t\) is the loss at time \(t\). The formula averages the losses over all times \(t\) to obtain the overall training loss of the language model on the corpus. The purpose of the global cross entropy is to comprehensively evaluate the performance of the model on the whole corpus and to measure the accuracy of its character predictions over the entire sequence.
The local cross entropy loss helps the model adjust its parameters at each time step so that it makes accurate predictions on individual characters. In order to optimize the local cross entropy loss, the parameters of the model need to be fine-tuned according to the loss. This process is implemented by a back-propagation algorithm that calculates the gradient of the local cross entropy loss with respect to the weight parameters of the model. The weight parameters include all trainable parameters in the model, such as connection weights, bias terms, and the like.
Minimizing the global cross entropy loss ensures that the model predicts characters consistently across the entire dataset, improving performance on the string selectivity estimation task. As with the local loss, the gradients of the global cross entropy loss are computed with respect to the model's weight parameters, which include all trainable parameters in the model, such as connection weights and bias terms.
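Continuing the sketch (CharNLM, vocab, and encoded are taken from the snippets above), one training pass that averages the local losses into the global loss and backpropagates might look like:

    import torch
    import torch.nn.functional as F

    model = CharNLM(vocab_size=len(vocab))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    seq = torch.tensor(encoded).unsqueeze(0)   # (1, N): the corpus as one sequence
    inputs, targets = seq[:, :-1], seq[:, 1:]  # next-character prediction pairs

    logits, _ = model(inputs)
    # F.cross_entropy averages the local losses, i.e. L = (1/N) * sum_t l_t.
    loss = F.cross_entropy(logits.reshape(-1, len(vocab)), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()  # gradients w.r.t. all trainable parameters (weights, biases)
    opt.step()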
7. Selectivity estimation:
In the fuzzy query-oriented character string predicate accurate selection estimation method, we are concerned with the probability estimate of the whole character string sequence \(s = (s_1, \dots, s_m)\). To calculate \(P(s)\), a product of conditional probabilities is used; specifically,

\[ P(s) = \prod_{i=1}^{m} P(s_i \mid s_1, \dots, s_{i-1}) \]

i.e., the probability of the entire string sequence can be determined by multiplying the conditional probabilities \(P(s_i \mid s_1, \dots, s_{i-1})\) of the individual characters. Each such conditional probability expresses the probability that the next character \(s_i\) occurs given the prefix \(s_1 \dots s_{i-1}\). Equivalently, we can use \(P(s_{i+1} \mid h_i)\) to represent the probability of generating the next character \(s_{i+1}\) given the state \(h_i\).

Here we use a neural language model (NLM) to implement the state-memory function. The NLM receives a character \(s_i\) as input and computes, through the model, the state \(h_i\). This state \(h_i\) remembers, at the current time step, the characters \(s_1 \dots s_{i-1}\) processed so far, thereby learning the statistical correlations of the prefix characters. In the language model, the state \(h_i\) is used to output the probability distribution of the next character \(s_{i+1}\).

The training objective is to minimize the overall cross entropy loss \(L\) by adjusting the parameters (weights and biases) of the model using gradient-descent optimization techniques, so that the model estimates the probability distribution of the characters more accurately; at the same time, the model learns and captures the statistical correlations of the prefix characters during training, so as to predict more accurately the likelihood of the next character given a prefix. The same method also applies to suffix queries: by reversing the corpus \(D\) and training a language model on it, the selectivity of the suffix query "%q" can be estimated, which further enhances the performance of the model on fuzzy query tasks. By effectively using the NLM to model the probability distribution of string sequences, the method therefore has good application potential in fuzzy query scenarios.
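As a sketch of how the trained model would be queried (again reusing the hypothetical names from the earlier snippets), the prefix-query selectivity is the running product of the model's conditional probabilities:

    import torch

    def estimate_prefix_selectivity(model, prefix):
        # P(prefix) = prod_i P(s_i | s_1..s_{i-1}), seeded with the <BOW> token.
        model.eval()
        p, h = 1.0, None
        prev = torch.tensor([[char2idx[BOW]]])
        with torch.no_grad():
            for c in prefix:
                logits, h = model(prev, h)
                probs = torch.softmax(logits[0, -1], dim=-1)
                p *= probs[char2idx[c]].item()      # multiply in P(c | context)
                prev = torch.tensor([[char2idx[c]]])
        return p  # estimate of the selectivity of the query "prefix%"

    # For the suffix query "%q", the same function is used with a model
    # trained on the reversed corpus.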
8. State reset:
The model described in the previous steps may be used directly to estimate the selectivity of a prefix query or a suffix query.

For substring query tasks, however, consider the running example with corpus \(D = \langle BOW \rangle sam \langle EOW \rangle \langle BOW \rangle tim \langle EOW \rangle\): we wish to estimate the selectivity of the substring query '%a%', but the model produces a value close to 0. Specifically, it attempts to estimate \(P(a \mid \langle BOW \rangle)\). Since the NLM never sees any string beginning with 'a', it estimates this probability as close to 0. The root cause is that the NLM processes each string from its first character to its last, so it only learns the probabilities corresponding to prefix queries. In the context of natural language processing this limitation is natural, because NLMs process strings from the beginning for the various tasks of language modeling. Estimating substring queries, however, requires processing fragments of strings, which is an essential requirement here: to answer substring queries, the model must be able to estimate the probability of any substring in the corpus. No existing NLM can be used directly to estimate substring selectivity. In view of this, a novel adaptation of NLM training is proposed to alleviate the problem.

NLM training is adjusted to fit substring queries: the NLM's ability to accurately learn prefix-query selectivity is preserved while its use is extended to substring queries. The poor performance on substring queries is entirely due to the fact that default NLM training does not attempt to learn the conditional probabilities needed by substring queries. In our running example, the estimate for the substring query '%a%' is inaccurate because \(P(a \mid \langle BOW \rangle)\) is not accurately learned: among the probabilities \(P(\cdot \mid \langle BOW \rangle)\), only those of 's' and 't' are non-zero, because they are the only two characters at the beginning of any string.

By making a simple change to training, namely state reset, the NLM can learn these conditional probabilities. While processing the corpus continuously, consider the following change: at each time step \(t\), we randomly reset the state \(h_t\) to the initial state \(h_0\) with some small probability. Consider the sequence \(\langle BOW \rangle sam \langle EOW \rangle\). The NLM processes 's' and generates a new state \(h_1\). In the unmodified NLM, the next step computes the probability \(P(a \mid h_1)\). Suppose, however, that before processing 'a' the random process resets \(h_1\) to \(h_0\); in other words, the NLM effectively sees an input sequence that begins at 'a'. The NLM will now compute \(P(a \mid h_0)\), which yields a better estimate for the substring query than the unmodified NLM.

Specifically, the state reset is governed by a hyperparameter \(\gamma\) that controls the probability of resetting the state. Setting \(\gamma = 0\) is equivalent to training a classical NLM, while a larger \(\gamma\) results in a higher reset probability. The concept of state reset is similar to dropout: in a feed-forward neural network, nodes are randomly discarded with a certain dropout probability, and although dropout may destabilize the training process, it acts as an effective regularization method that can improve generalization ability. State reset, by contrast, is aimed at supporting substring queries: whereas the dropout variants that work on recurrent networks are applied only to the non-recurrent connections so as not to impair memory capacity, state reset explicitly resets the recurrent connections, allowing the NLM to learn how to estimate the selectivity of substrings in the corpus.
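A sketch of the state-reset modification, with gamma as the reset-probability hyperparameter described above (the character-by-character loop and the learning rate are illustrative choices, not prescribed by the patent):

    import random
    import torch
    import torch.nn.functional as F

    def train_with_state_reset(model, encoded, gamma=0.05, lr=1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        h = None                          # h = None plays the role of h_0
        for t in range(len(encoded) - 1):
            if random.random() < gamma:
                h = None                  # state reset: h_t -> h_0
            x = torch.tensor([[encoded[t]]])
            y = torch.tensor([encoded[t + 1]])
            logits, h = model(x, h)
            loss = F.cross_entropy(logits[0], y)  # local cross entropy l_t
            opt.zero_grad()
            loss.backward()
            opt.step()
            h = h.detach()                # truncate backpropagation through time

    # gamma = 0 recovers classical NLM training; a larger gamma resets more often.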
Accurate selectivity estimation for prefix, substring, and suffix queries is a challenging problem. By training a neural language model, the method provided by the invention yields a fuzzy query-oriented character string predicate accurate selection estimation method that adapts to prefix, substring, and suffix query tasks, solving the challenging problem of accurately estimating string predicate selectivity in a database.
Specific example 3:
in a second aspect, the present invention further provides a fuzzy query-oriented string predicate accurate selection estimation system, where the system includes: the system comprises a model training module, a probability prediction module and a selectivity evaluation module;
the model training module is used for taking the acquired query statement and the corpus in the database as the input of an autoregressive neural language model architecture and training the autoregressive neural language model, so that the autoregressive neural language model generates a hidden state for each time step;
the probability prediction module is used for sequentially taking single characters of predicates in an actual query statement as input of a current time step, determining the hidden state of the current time step by combining the hidden state of a previous time step, wherein the hidden state of the current time step is used for predicting probability distribution of the next character of each predicate;
the selectivity evaluation module is used for determining the selectivity evaluation probability of each predicate based on the probability distribution of the next character of each predicate in the actual query statement, wherein the selectivity evaluation probability is used for an execution optimizer of a database to select an optimal plan.
Preferably, the model training module is further configured to:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with the current input to obtain the hidden state of the current time step, wherein the current input is the character vector input to the autoregressive neural language model architecture;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
Preferably, the probability prediction module is further configured to:
sequentially acquiring single characters of predicates in an actual query sentence as input of a current time step;
combining the input of the current time step with the hidden state of the previous time step to determine the hidden state of the current time step;
and determining probability distribution of the next character based on the hidden state of the current time step.
Preferably, the selectivity evaluation module is further configured to:
determining the conditional probability corresponding to the single byte based on probability distribution corresponding to all single bytes in each predicate in an actual query statement;
a selectivity evaluation probability of the predicate is determined based on conditional probabilities corresponding to all individual bytes of the predicate.
Preferably, the selectivity evaluation module is specifically configured to:
\[ P(s) = \prod_{i=1}^{|s|} P(s_i \mid s_1, \dots, s_{i-1}) \]

where \(P(s)\) is the selectivity evaluation probability of the string of a single predicate;

\(P(s_i \mid s_1, \dots, s_{i-1})\) is the conditional probability of the respective character;

\(P(s_{i+1} \mid h_i)\) represents the probability of generating the next character \(s_{i+1}\) given the state \(h_i\).
Specific example 4:
To further illustrate the specific application of the present invention, a neural language model architecture diagram is disclosed, as shown in fig. 2.
Input layer: representing the input character sequence, such as "BOW, t, i, m".
An embedding layer: the input characters are mapped to a word vector representation.
Hidden state layer: the model is used to capture long-term dependencies of the character sequence and output a state representation.
Output layer: outputs the probability distribution of the next character, such as p(· | BOW, t, i, m), based on the model state.
Specifically:
a character sequence "BOW, t, i, m" is entered.
The embedding layer maps each character into a vector representation.
The hidden state layer outputs a state vector based on the current input "m" and the preceding character sequence. This state vector encodes the relevant information of the previous sequence.
Based on the hidden state layer, the output layer gives the probability distribution p(· | BOW, t, i, m) of the next character; the character with the highest probability under this distribution is taken as the most likely next character.
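A usage sketch matching this walkthrough, reusing the hypothetical names from the earlier snippets (and assuming every queried character appears in vocab):

    import torch

    seq = torch.tensor([[char2idx[c] for c in BOW + "tim"]])  # <BOW>, t, i, m
    logits, _ = model(seq)
    next_probs = torch.softmax(logits[0, -1], dim=-1)  # p(. | BOW, t, i, m)
    print(vocab[int(next_probs.argmax())])             # the most likely next character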

Claims (8)

1. A fuzzy query-oriented character string predicate accurate selection estimation method is characterized by comprising the following steps:
training the autoregressive neural language model by taking the acquired query sentence and the corpus in the database as the input of the autoregressive neural language model architecture, so that the autoregressive neural language model generates a hidden state for each time step;
sequentially taking single characters of a character string in an actual query sentence as input of a current time step, determining the hidden state of the current time step by combining the hidden state of a previous time step, and determining probability distribution of a next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string;
determining the selective evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, wherein the selective evaluation probability is used for an execution optimizer of a database to select an optimal plan;
the training of the autoregressive neural language model by taking the acquired query sentence and the corpus in the database as the input of the autoregressive neural language model architecture comprises the following steps:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with the current input to obtain the hidden state of the current time step, wherein the current input is the character vector input to the autoregressive neural language model architecture;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
2. The method for accurately selecting and estimating a predicate of a character string for fuzzy query according to claim 1, wherein taking the obtained query sentence and the corpus in the database as the input of the autoregressive neural language model architecture further comprises:
in the model training process, determining the difference between the prediction probability distribution and the actual target character through a cross entropy loss function;
and correcting the trainable parameters of the autoregressive neural language model architecture based on the difference between the predicted probability distribution and the actual target character.
3. The method for accurately selecting and estimating a predicate of a character string for fuzzy query according to claim 1, wherein the step of determining the hidden state of the current time step by sequentially using individual characters of the character string in the actual query sentence as the input of the current time step and combining the hidden state of the previous time step comprises the steps of:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
and combining the input of the current time step and the hidden state of the previous time step to determine the hidden state of the current time step.
4. The method for accurately selecting and estimating a predicate of a character string for fuzzy query according to claim 3, further comprising, after said determining the hidden state of the current time step:
and determining the probability distribution of the next character based on the hidden state of the current time step.
5. The method for estimating accurate selection of a predicate for a character string for a fuzzy query according to claim 1, wherein the determining the probability of estimating the selectivity of each character string based on the probability distribution of the next character of each character string in the actual query sentence comprises:
determining the conditional probability corresponding to the single byte based on probability distribution corresponding to all the single bytes in each character string in the actual query statement;
and determining the selectivity evaluation probability of the character string based on the conditional probabilities corresponding to all the single bytes of the character string.
6. The method for accurately selecting and estimating a predicate of a character string for fuzzy query according to claim 5, wherein the determining of the selectivity evaluation probability of the predicate based on the conditional probabilities corresponding to all single bytes of the character string is specifically:
\[ P(s) = \prod_{i=1}^{|s|} P(s_i \mid s_1, \dots, s_{i-1}) \]

where \(P(s)\) is the selectivity evaluation probability of the string of a single predicate;

\(P(s_i \mid s_1, \dots, s_{i-1})\) is the conditional probability of the respective character;

\(P(s_{i+1} \mid h_i)\) represents the probability of generating the next character \(s_{i+1}\) given the state \(h_i\).
7. A fuzzy query-oriented string predicate accurate selection estimation system, the system comprising: the system comprises a model training module, a probability prediction module and a selectivity evaluation module;
the model training module is used for taking the acquired query statement and the corpus in the database as the input of an autoregressive neural language model architecture and training the autoregressive neural language model, so that the autoregressive neural language model generates a hidden state for each time step;
the probability prediction module is used for sequentially taking single characters of the character strings in the actual query statement as the input of the current time step, determining the hidden state of the current time step by combining the hidden state of the previous time step, and determining the probability distribution of the next character, wherein the hidden state of the current time step is used for predicting the probability distribution of the next character of each character string;
the selectivity evaluation module is used for determining the selectivity evaluation probability of each character string based on the probability distribution of the next character of each character string in the actual query statement, wherein the selectivity evaluation probability is used for an execution optimizer of a database to select an optimal plan;
the model training module is further configured to:
vectorizing the whole corpus of the database to generate a character vector sequence;
vectorizing the acquired query sentence to generate a character vector;
model training is carried out by taking the character vector sequence and the character vector as the input of the autoregressive neural language model architecture;
combining the hidden state of the previous time step with the current input to obtain the hidden state of the current time step, wherein the current input is the character vector input to the autoregressive neural language model architecture;
and determining the prediction probability distribution of the current input based on the hidden state of the current time step.
8. The fuzzy query oriented string predicate accurate selection estimation system of claim 7, wherein the probabilistic predictive module is further configured to:
sequentially acquiring single characters of a character string in an actual query sentence as input of a current time step;
combining the input of the current time step with the hidden state of the previous time step to determine the hidden state of the current time step;
and determining probability distribution of the next character based on the hidden state of the current time step.
CN202311072853.1A 2023-08-24 2023-08-24 Fuzzy query-oriented character string predicate accurate selection estimation method Active CN116821436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311072853.1A CN116821436B (en) 2023-08-24 2023-08-24 Fuzzy query-oriented character string predicate accurate selection estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311072853.1A CN116821436B (en) 2023-08-24 2023-08-24 Fuzzy query-oriented character string predicate accurate selection estimation method

Publications (2)

Publication Number Publication Date
CN116821436A CN116821436A (en) 2023-09-29
CN116821436B (en) 2024-01-02

Family

ID=88127730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311072853.1A Active CN116821436B (en) 2023-08-24 2023-08-24 Fuzzy query-oriented character string predicate accurate selection estimation method

Country Status (1)

Country Link
CN (1) CN116821436B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7363299B2 (en) * 2004-11-18 2008-04-22 University Of Washington Computing probabilistic answers to queries
CN102084363B (en) * 2008-07-03 2014-11-12 加利福尼亚大学董事会 A method for efficiently supporting interactive, fuzzy search on structured data
US9208198B2 (en) * 2012-10-17 2015-12-08 International Business Machines Corporation Technique for factoring uncertainty into cost-based query optimization
EP3979121A1 (en) * 2020-10-01 2022-04-06 Naver Corporation Method and system for controlling distributions of attributes in language models for text generation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018033030A1 (en) * 2016-08-19 2018-02-22 中兴通讯股份有限公司 Natural language library generation method and device
WO2022033073A1 (en) * 2020-08-12 2022-02-17 哈尔滨工业大学 Cognitive service-oriented user intention recognition method and system
CN116245106A (en) * 2023-03-14 2023-06-09 北京理工大学 Cross-domain named entity identification method based on autoregressive model
CN115965033A (en) * 2023-03-16 2023-04-14 安徽大学 Generation type text summarization method and device based on sequence level prefix prompt

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Efficient SPARQL Query Processing Based on Adjacent-Predicate Structure Index; Haoyuan Guan et al.; IEEE Xplore; full text *
SPARQL ontology query based on natural language understanding; Zhang Zongren et al.; Computer Applications, No. 12; full text *

Also Published As

Publication number Publication date
CN116821436A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
US11501182B2 (en) Method and apparatus for generating model
CN110309514B (en) Semantic recognition method and device
US10929744B2 (en) Fixed-point training method for deep neural networks based on dynamic fixed-point conversion scheme
US11256487B2 (en) Vectorized representation method of software source code
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN106502985B (en) neural network modeling method and device for generating titles
Mueller et al. Siamese recurrent architectures for learning sentence similarity
CN112487807B (en) Text relation extraction method based on expansion gate convolutional neural network
CN110503192A (en) The effective neural framework of resource
Guo et al. Question generation from sql queries improves neural semantic parsing
JP2020520516A5 (en)
CN106897371B (en) Chinese text classification system and method
CN111414749B (en) Social text dependency syntactic analysis system based on deep neural network
CN110458181A (en) A kind of syntax dependency model, training method and analysis method based on width random forest
CN112232087B (en) Specific aspect emotion analysis method of multi-granularity attention model based on Transformer
CN111274790A (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN105830058B (en) Dialog manager
CN113935489A (en) Variational quantum model TFQ-VQA based on quantum neural network and two-stage optimization method thereof
CN113157919A (en) Sentence text aspect level emotion classification method and system
Santacroce et al. What matters in the structured pruning of generative language models?
Azeraf et al. Highly fast text segmentation with pairwise markov chains
CN110851584A (en) Accurate recommendation system and method for legal provision
US11941360B2 (en) Acronym definition network
CN112347783B (en) Alarm condition and stroke data event type identification method without trigger words
Seilsepour et al. Self-supervised sentiment classification based on semantic similarity measures and contextual embedding using metaheuristic optimizer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant