CN111274267A - Database query method and device and computer readable storage medium - Google Patents

Database query method and device and computer readable storage medium Download PDF

Info

Publication number
CN111274267A
CN111274267A CN201911407758.6A CN201911407758A CN111274267A CN 111274267 A CN111274267 A CN 111274267A CN 201911407758 A CN201911407758 A CN 201911407758A CN 111274267 A CN111274267 A CN 111274267A
Authority
CN
China
Prior art keywords
module
database
user input
information
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911407758.6A
Other languages
Chinese (zh)
Inventor
周慕哲
汤斯亮
凌立刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Quantity Intelligent Technology Co ltd
Original Assignee
Hangzhou Quantity Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Quantity Intelligent Technology Co ltd filed Critical Hangzhou Quantity Intelligent Technology Co ltd
Priority to CN201911407758.6A priority Critical patent/CN111274267A/en
Publication of CN111274267A publication Critical patent/CN111274267A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • G06F16/24522Translation of natural language queries to structured queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/248Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a database query method, a database query device and a computer readable storage medium, wherein the database query method comprises the following steps: s1, acquiring user input and selecting a database to be queried; s2, converting the user input into an SQL statement and performing database query to obtain a data table; s3, converting the user input and the data sheet into a visual chart; and S4, displaying the visual chart. The database query method, the database query device and the computer readable storage medium can improve the efficiency of querying and analyzing data by a user; when the visualization form is generated, the use of all data in the data table is avoided, certain sensitive information in the data is prevented from being leaked, the data safety is ensured, and the data processing speed of the model is improved; meanwhile, the connection between the modules is strengthened, so that the model can fully understand the input information.

Description

Database query method and device and computer readable storage medium
Technical Field
The present invention relates to the field of database query technologies, and in particular, to a database query method, an apparatus, and a computer-readable storage medium.
Background
Relational databases are widely used in various fields such as medicine, finance, business, etc. as an effective way to store information, which users acquire by using structured query sentences. However, although the SQL statement is simple and efficient enough, it still has high learning requirement, and the user needs to learn it for a long time, so that only professional developers usually have the ability to query using SQL. Meanwhile, for the queried data, certain relations often exist among the data, and the visualization can help a user to analyze the data better if the data can be displayed in a visualization form, and the visualization needs enough knowledge about the data and experience related to the visualization operation, which can be achieved by professional developers. With the popularization of intelligent portable terminals, more and more common users want to interact with a computer by directly using natural language in daily communication so as to quickly inquire information. In this context, there is an increasing need for natural language to be translated into structured query statements, and for the results to be presented in a suitable form.
At present, no work is carried out at home and abroad to realize a whole set of end-to-end system, only independent work of generating SQL sentences by natural language and generating recommended visual forms according to a data table is completed respectively, and the efficiency of querying and analyzing data by a user is low.
Disclosure of Invention
In view of this, the technical problem to be solved by the present invention is to provide a database query method, apparatus and computer readable storage medium, which can improve the efficiency of querying and analyzing data by a user.
The technical scheme of the invention is realized as follows:
a database query method, comprising the steps of:
s1, acquiring user input and selecting a database to be queried;
s2, converting the user input into an SQL statement and performing database query to obtain a data table;
s3, converting the user input and the data sheet into a visual chart;
and S4, displaying the visual chart.
Preferably, in S1, the method further includes:
and judging whether the user input is valid input.
Preferably, the S2 specifically includes:
s21, completing the mapping of the user input and the database information to a vector space through a word vector tool and inputting the mapping to a Bi-LSTM network;
s22, encoding the user input and the database information by using a Bi-LSTM network, and acquiring the intermediate semantic vector representation of the user input and the database information;
s23, extracting the contact information of two intermediate semantics through an Attention mechanism to complete coding;
s24, decoding; the SQL sentence splicing method comprises the following steps of dividing a plurality of modules according to SQL keywords, predicting by each module, and filling missing information of each module to complete splicing of SQL sentences;
and S25, querying the database through the SQL statement to obtain a data table.
Preferably, the S24 specifically includes:
and dividing the SQL keywords into a SELECT module, a WHERE module, an ORDER module and a GROUP module, decoding, and then decoding the FROM module.
Preferably, the S3 specifically includes:
s31, completing the mapping of the user input and the data table to a vector space through a word vector tool and inputting the mapping to a Bi-LSTM network;
s32, encoding the user input and the data table by using a Bi-LSTM network, taking the field names of the database as text sequences, and inputting the aggregation function information of each field and the text sequences into the Bi-LSTM network; acquiring intermediate semantic vector representation of the two;
s33, extracting the contact information of two intermediate semantics through an Attention mechanism to complete the input coding;
s34, decoding; dividing a task for generating visualization into a plurality of sub-modules, and predicting a generated result by splicing each module to obtain necessary information for generating visualization;
and S35, obtaining a visualization chart according to the necessary information for generating the visualization.
Preferably, the S34 includes:
s341, a LAYOUT decoding module;
s342, judging whether a plurality of coordinate series exist in the visual chart, if so, executing S343, otherwise, executing S344;
s343, a decode services module,
s344, decoding the X-AXIS module;
s345, decoding the Y-AXIS module;
and S346, splicing the results generated by the prediction of the modules to obtain necessary information for generating visualization.
The invention also provides a database query device, which comprises:
the acquisition module is used for acquiring user input and selecting a database to be queried;
the query module is used for converting the user input into an SQL statement and performing database query to obtain a data table;
the conversion module is used for converting the user input and the data table into a visual chart;
and the display module is used for displaying the visual chart.
Preferably, the acquiring module further comprises a judging unit;
the judging unit is used for judging whether the user input is valid input.
Preferably, the query module includes:
the first mapping unit is used for completing the mapping of the user input and the database information to a vector space through a word vector tool and inputting the mapping to the Bi-LSTM network;
the first coding unit is used for coding the user input and the database information by using a Bi-LSTM network to obtain the intermediate semantic vector representation of the user input and the database information;
the first extraction unit is used for extracting the contact information of two intermediate semantics through an Attention mechanism to complete coding;
a first decoding unit for decoding; the SQL sentence splicing method comprises the following steps of dividing a plurality of modules according to SQL keywords, predicting by each module, and filling missing information of each module to complete splicing of SQL sentences;
the query unit is used for querying the database through SQL statements to obtain a data table;
and/or;
the conversion unit comprises:
the second mapping unit is used for completing the mapping from the user input and the data table to the vector space through a word vector tool and inputting the mapping to the Bi-LSTM network;
the second coding unit is used for coding the user input and the data table by using a Bi-LSTM network, taking the field names of the database as text sequences, and inputting the aggregation function information of each field and the text sequences into the Bi-LSTM network; acquiring intermediate semantic vector representation of the two;
the second extraction unit is used for extracting the contact information of two intermediate semantics through an Attention mechanism to complete the input coding;
a second decoding unit for decoding; dividing a task for generating visualization into a plurality of sub-modules, and predicting a generated result by splicing each module to obtain necessary information for generating visualization;
and the generating unit is used for obtaining a visualization chart according to the necessary information for generating the visualization.
The present invention also provides a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the database query method according to any one of claims 1 to 6.
According to the database query method, the database query device and the computer readable storage medium, when the user uses the database, the user only needs to input the problem by using the natural language and select the database which the user wants to query, and then can complete processing and output the queried data table and the recommended visual chart in the background, so that the user can pay attention to the problem without considering the internal structure of the database system, and the efficiency of querying and analyzing data by the user is obviously improved. By using the database query method, the database query device and the computer readable storage medium, a user without computer related experience can also obtain desired data through daily used natural language, and can visually find out the internal relation of the data through data visualization, so that the user does not need to learn complicated SQL grammar and control the operation of a visualization form, and the cost of querying and analyzing the data is reduced. Finally, when the user uses the database query method, the database query device and the computer readable storage medium, the user can use different sentences to express the same meaning in the query process because the input natural language is not limited by strict grammar any more, and the technical process in the operation can be simplified.
Drawings
Fig. 1 is a flowchart of a database query method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a database query method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a database query method according to an embodiment of the present invention;
fig. 4 is a block diagram of a database query device according to an embodiment of the present invention;
FIG. 5 is a diagram showing the fields selected in the SELECT section;
FIG. 6 is a schematic diagram of a decode WHERE module;
FIG. 7 is a schematic diagram of a decode ORDER module;
fig. 8 is a schematic diagram of a decoding GROUP module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a database query method, including the following steps:
and S1, acquiring user input and selecting the database to be queried.
The user inputs the question of the inquiry and selects the database desired to be searched, in which it is necessary to judge whether the input is a normal sentence according to the length of the input sentence, the word property of each word, if only some meaningless phrases or symbols are input, an error is suggested and re-input is required, otherwise S2 is performed.
And S2, converting the user input into an SQL statement and performing database query to obtain a data table.
Firstly, mapping of natural language and database information to a vector space is completed through a word vector tool to be used as input of a neural network model, then two Bi-LSTM networks are used for coding problem input of the natural language and corresponding database information to obtain intermediate semantic vector representation of the natural language and the corresponding database information, and then an Attention mechanism is used for extracting connection information of two intermediate semantics so as to complete input coding. The system decoding part is composed of a plurality of modules divided according to SQL keywords, each module independently predicts, and the final splicing of SQL sentences is completed by filling the missing information of each module.
And S3, converting the user input and the data sheet into a visual chart.
After the mapping from the input information to the vector space is completed, two Bi-LSTM models are used for processing the problem input and the corresponding data field names, meanwhile, the used aggregation function information is given to the field names, and after the processing is completed, another intermediate semantic is used for decoding. The decoding part also divides the task of generating the visualization into a plurality of sub-modules, and necessary information required by the generation of the visualization can be finally obtained by splicing the results generated by predicting each module.
And S4, displaying the visual chart.
And returning the data table queried in the S2 and the results of the visual chart generated in the S3 to the front end, and displaying the results in a proper mode by the front end.
In a preferred embodiment of the present invention, as shown in fig. 2, S2 specifically includes:
and S21, completing the mapping of user input and database information to vector space through a word vector tool and inputting the mapping to the Bi-LSTM network.
And S22, encoding the user input and the database information by using the Bi-LSTM network, and acquiring the intermediate semantic vector representation of the user input and the database information.
Encoding natural language question input: in this step, the system maps the natural language used by the person into a computer-understandable digital space, using a two-way LSTM model that adequately extracts contextual information.
Coding database information: the database field names may also be entered into the LSTM model as a text sequence. However, fields in the database only represent composition components in a certain data field, are local features, and cannot help the system to completely extract features of database information, so that the table name and the database name are input into the system, the information contains domain knowledge related to stored data, and the information is global information of the database.
And S23, extracting the contact information of two intermediate semantics through an Attention mechanism, and finishing coding.
Certain specific words in the text sequence have higher influence on the prediction of a specific field, and because the encoding of the natural language and the encoding of the database information are independently carried out, the related specific gravity information is ignored, so that the invention adds an attention mechanism to calculate and extract the specific gravity of the part. Firstly, obtaining intermediate semantic information in the step B2/B3, then calculating the similarity between the database intermediate coding vector and the hidden layer vector of each word of the natural language, normalizing the numerical value through a Softmax function, and finally multiplying the weight proportion and the intermediate coding vector of the text to obtain the problem input expression merged into the database information.
S24, decoding; the SQL sentence splicing method comprises the following steps of dividing a plurality of modules according to SQL keywords, predicting by each module, and filling missing information of each module to complete splicing of SQL sentences;
and dividing the SQL keywords into a SELECT module, a WHERE module, an ORDER module and a GROUP module, decoding, and then decoding the FROM module.
Decoding SELECT module: this module first needs to predict the field cases chosen by the SELECT part and then for each field the Aggregate function it refers to, predicting only one part at a time. The following figure clearly shows the structure and the predictive content of the SELECT module, the purpose of which is to fill in the spaces.
① prediction of the number of SELECT fields
For an SQL statement, a SELECT part may SELECT a plurality of fields, so the decoder needs to predict the number of selected fields first, which can be regarded as a multi-classification problem, and the classification result is
Figure RE-GDA0002478954120000071
The number of fields selected. In the present system, it is assumed that the final result is 0-4, i.e., a maximum of 4 fields are selected at the same time. Equation (1.1) describes the way in which the classification probability is obtained, with the aim of obtaining a value of K that maximizes the probability.
Wherein Q represents the question of the input; u1 and U2 represent two trainable parameter matrices, which are gradually optimized during training; EQ | Q represents a final hidden layer state set obtained by applying an attention mechanism after Bi-LSTM processes a natural language problem sequence; softmax () is a normalization function and tanh () is an activation function.
② prediction of SELECT field
After getting the number of fields that can occur, K, we need to predict the first K most likely field names to be selected. Through the formula (1.2), the probability of selecting the field corresponding to each col in the database under the condition of inputting Q by the current natural language question can be obtained, and after all the probabilities are obtained, descending sorting is carried out, wherein the first K fields are fields related to the predicted question.
Figure RE-GDA0002478954120000072
Wherein Ecol represents a final hidden layer state set obtained after transmitting a word vector of a database list field to a Bi-LSTM network for processing, EQ | col represents a hidden layer state set of a natural language input part using an attention mechanism based on database information, ua represents a trainable parameter vector, Uc and Uq represent trainable parameter matrices, and σ represents an activation function.
③ prediction of SELECT field function
For each field name, we need to predict the Aggregate function it relates to, and we assume that there are six functional forms { NONE, MAX, MIN, SUM, COUNT, AVG }, and formula (1.3) shows a specific prediction mode.
Figure RE-GDA0002478954120000081
Wherein, col represents the currently selected field, and the module needs to obtain the result predicted by the previous part for determination when performing prediction.
B5. A decoding WHERE module: as shown in fig. 6, this module is part of the SQL statement judgment condition, and it is necessary to predict the number of conditions first, and then to further predict the values of the condition fields, the condition operators, and the specific judgment conditions.
① prediction of the number of WHERE fields
For the WHERE part of the SQL statement, which may also appear as occurring in the SELECT module, multiple fields are selected, and a k classifier is first required to predict the number of fields.
② prediction of WHERE field
This section also needs to predict the first k most likely fields to occur, as occurs in the SELECT block.
③ prediction of WHERE operator
For each field involved in the WHERE section, there are 12 possible operator cases, respectively { NOT, betweeen, >, <, |! IN, LIKE, IS, EXISTS, so the prediction operator IS a 12-class problem and we predict the probability by the formula (1.4).
Figure RE-GDA0002478954120000082
④ prediction of specific value of WHERE
When a user queries a database, in order to limit the query range and obtain specific data expected to be obtained by the user, various query conditions need to be added, and the condition is embodied in the WHERE part of an SQL statement. Therefore, the module needs to extract some segments of the user natural language input as the value of the conditional query, and for this requirement, the module adopts a Seq2Seq model to implement.
In the experiment, the sequence finally generated by the seq2seq model is taken from the input sequence, however, rare words which are not frequently appeared may exist in the input sequence, and for the network, the training is not enough and the generation of the words cannot be completed, so a special generation network named as a pointer network is adopted in the text. Unlike the conventional Seq2Seq model, which selects a generated word from a vocabulary, the model can directly copy a word from a source text and can better abstract the text.
In the case of using a pointer network, we accomplish the prediction of specific condition values by equation (1.5) and equation (1.6):
Pwhereval(i|Q,col,h)=softmax(a(h)) (1.5)
Figure RE-GDA0002478954120000091
wherein h represents the hidden layer state at the previous moment of each step in the Decoder model, and HQ represents the output of each sequence element after Bi-LSTM network processing.
B6. Decode ORDER module, as shown in fig. 7: this module is used to process the ORDER BY portion of the SQL statement. For the SQL statement, the three modules designed above are basic and common, and the ORDER BY keyword is not necessary, so the module first predicts the existence of the keyword, and if so, further predicts the subsequent part.
① prediction of the existence of ORDER BY keywords
The part firstly predicts whether the ORDER key words exist in the SQL generating statement or not, and predicts through a formula (1.7).
Figure RE-GDA0002478954120000092
② prediction of ORDER BY field
The user sorts the query results using the ORDER BY statement to specify the columns, which in part predicts the columns specified in the user input BY equation (1.8).
Figure RE-GDA0002478954120000093
③ ORDER BY field function prediction
The field specified in this section may also apply the AGGREGRATE function, predicted here by equation (1.9).
Figure RE-GDA0002478954120000101
④ ORDER prediction
The ORDER BY statement ORDERs the records BY default using ascending, but the actual SQL statement can be changed as needed. It is assumed that there are five ordering cases { NONE, ASC, DESC, ASC LIMIT 1, DESC LIMIT 1}, and the prediction is performed by equation (1.10).
Figure RE-GDA0002478954120000102
B7. Decoding the GROUP module: as shown in fig. 8, there is a module keyword for grouping in the SQL statement, and the prediction of this keyword is done by the module. Whether the keywords exist is also predicted firstly, and then the subsequent prediction is completed. Inside this module, there is a havingmodule for SQL statements, which needs to be further processed.
① prediction of the existence of GROUP keywords
When the SQL statement is generated, whether the GROUP BY keyword exists needs to be predicted, and the judgment is completed according to the formula (1.11).
Figure RE-GDA0002478954120000103
② prediction of GROUP field
This part performs the computational prediction according to equation (1.12) and is the basis for the grouping.
Figure RE-GDA0002478954120000104
③ prediction of HAVING Presence
In SQL statements, sometimes a join aggregate function is used, while WHERE keywords cannot be used with these functions, in which case the HAVING clause is used. The presence or absence of the HAVING keyword is determined herein by the formula (1.13).
Figure RE-GDA0002478954120000105
④ prediction of HAVING field
The specific fields to which the aggregation function is limited in this section are as follows using the formula (1.14).
Figure RE-GDA0002478954120000111
⑤ prediction of HAVING operator
In the case of operators used for this part of the prediction, equation (1.15) is a detailed implementation.
Figure RE-GDA0002478954120000112
⑥ prediction of HAVING judgment value
Similar to the WHERE section, this module also uses a network of pointers to predict the specific values involved. The equation (1.16) (1.17) similar to the previous one was used.
Phavingval(i|Q,col,h)=softmax(a(h)) (1.16)
Figure RE-GDA0002478954120000113
B8. Decoding the FROM module: the From module is generated based on the fields selected in the SELECT module, and completes the connection of the From part of the SQL statement by querying the table where the fields are located in the database, and the part is spliced purely according to the database table information without using a neural network model.
And S25, querying the database through the SQL statement to obtain a data table.
B9. And (3) splicing to generate SQL sentences: splicing the results generated by the prediction of each module to generate a complete executable SQL statement
B10. Querying a database: executing the SQL statement generated in the step B9 in the database selected by the user to inquire the expected data
B11. And returning a query result: the data table obtained by the query is passed to the next step.
In a preferred embodiment of the present invention, as shown in fig. 3, S3 specifically includes:
s31, completing user input and mapping of a data table to a vector space through a word vector tool and inputting the mapping to a Bi-LSTM network;
s32, encoding the user input and the data table by using the Bi-LSTM network, taking the field names of the database as text sequences, and inputting the aggregation function information and the text sequences of each field into the Bi-LSTM network; and acquiring intermediate semantic vector representation of the two.
Coding natural language questions: the natural language questions of the user are input into a Bi-LSTM model, the characteristic information of the input text is extracted, and the system is helped to understand the intention of the user when inquiring to generate a proper visualization form.
Data table obtained by encoding query: the searched database field names are regarded as a text sequence and input into the LSTM model, and the aggregation function information of each field is also input into the model to help the model to understand.
And S33, extracting the contact information of two intermediate semantics through an Attention mechanism, and finishing the input coding.
Certain words in the question may have a higher impact on the prediction of a particular visualization type, and therefore, an attention mechanism may also need to be added. And calculating to obtain a data table representation integrated with the problem information.
S34, decoding; the task of generating the visualization is divided into a plurality of sub-modules, and necessary information for generating the visualization is obtained by splicing the results generated by predicting each module.
Decoding the LAYOUT module: the invention limits the visualization types to 4 types, namely none, a broken line graph, a histogram and a pie graph, so that the module is a sub-module of a four-classification problem, and the probability of each visualization type under the existing input condition is calculated through the following formula:
Figure RE-GDA0002478954120000121
wherein W1、W2Representing proof of trainable parameters, HqA data table representation representing the merged problem information.
C5. Judging whether a plurality of coordinate series exist in the graph: since a plurality of series possibly exist in the generated visualization graph, namely the situation of sharing coordinate axes, the system needs to predict the specific situation through a model, and the prediction is realized through the following formula:
Figure RE-GDA0002478954120000122
if the final prediction result shows that the graph contains a plurality of coordinate series, executing the step C6, otherwise, directly executing the step C7
C6. Decoding a SERIES module: for those visualizations with multiple coordinate series, the system needs to select one of the input spreadsheet fields as the field name to distinguish the series. In the process, the system calculates the probability of each field as a division standard through a formula and picks out the field name with the maximum probability:
Figure RE-GDA0002478954120000123
where i represents the number of locations of a field in all query fields.
C7. Decoding the X-AXIS module: the system selects one of the entered spreadsheet fields as the most appropriate X-axis field for visualizing the chart. Since the series field, the X-axis field and the Y-axis field cannot select the same field, and the prediction result of the C6 step influences the step, the system transfers the hidden layer state input in the C6 step to the C7 step, and the connection between the two modules is strengthened. The step is calculated by using the following formula:
Figure RE-GDA0002478954120000131
wherein Hq|seriesRepresenting the hidden layer state at the time the series field was selected in the previous step.
C8. Decoding the Y-AXIS module: the system selects one of the entered spreadsheet fields as the most appropriate X-axis field for visualizing the chart. The fields selected in the steps C6 and C7 cannot be selected in this step, so the hidden layer state of the previous step needs to be introduced to obtain this part of information, and repeated selection is avoided. The formula used in this step is as follows:
Figure RE-GDA0002478954120000132
wherein Hq|xRepresenting the hidden layer state when the x-axis field was selected in the previous step.
And S35, obtaining a visualization chart according to the necessary information for generating the visualization.
Splicing to generate information required for visualization: and splicing the results generated by the prediction of each module to generate the information required by a complete visualization form.
C10. Returning a generated result: the information needed for visualization is returned to the system for rendering in a suitable form by the system.
From the inside of the system, when the part which is converted from the natural language problem into the SQL statement is designed, the invention introduces the global information of the database, so that the system can more fully understand the input information, thereby improving the accuracy of the prediction result. The target SQL statement is divided into a plurality of sub-modules, the accuracy of each module is evaluated respectively, meanwhile, the comparison is carried out with the existing relevant model, and the following results are obtained on the public data set Spider:
Figure RE-GDA0002478954120000141
when the recommended visual chart is generated from the inquired data table, the method can automatically learn the characteristics from the input without determining the characteristic items in advance, can effectively avoid errors caused by selection errors or defects of the characteristic items, and simplifies the preprocessing flow. Because the characteristics do not need to be acquired by using a statistical method, the system does not need to input complete data into the system, certain sensitive information in the data can be prevented from being leaked, and the data safety is ensured. The hardware pressure brought to the system is effectively reduced without processing specific data, and the input processing speed is improved. Finally, the method and the system have the advantages that the important role of problem input in the process of recommending the visualization form is noticed, and the system is helped to better understand the data table by adding the information of the natural language problem, so that the accuracy of the generated visualization form can be improved. We also split the target visualization form into multiple sub-modules and evaluated the performance of each module, and obtained the following results on the labeled data Spider dataset:
Figure RE-GDA0002478954120000142
the system is a complete end-to-end system, when a user uses the system, the user only needs to input problems by using natural language and select a database which is expected to be inquired, the system completes processing in the background and outputs the inquired data table and the recommended visual chart, so that the user can pay attention to the problems without considering the internal structure of the database system, and the efficiency of inquiring and analyzing data by the user is obviously improved. By using the system, a user without computer related experience can also obtain the desired data through daily used natural language, and can visually find the internal relation of the data through data visualization, so that the user does not need to learn complicated SQL grammar and control the operation of a visualization form, and the cost of inquiring and analyzing the data is reduced. Finally, when the user uses the system, because the input is natural language, the system is not limited by strict grammar any more, different sentences can be used for expressing the same meaning in the inquiry process, and the technical process in the operation can be simplified.
In part of the conversion of natural language into SQL statements, there is a scheme that uses trees to facilitate the conversion. Firstly, segmenting words of input problems according to parts of speech, then designing an analysis framework, wherein the framework takes the dependency relationship between words as a core and gives a judgment rule, and the preprocessed word sequences are processed one by one according to the rule to construct a semantic dependency tree. And then traversing the dependency tree from bottom to top, and processing and integrating the structural information and the semantic information in the nodes according to the node types to obtain corresponding semantic clusters, namely aggregation blocks. Because the collection block and the SQL language have good mapping relation, the final SQL statement can be obtained by processing the collection block finally obtained by traversing and integrating.
Firstly, because the same Chinese sentence may have different meanings under different environments, the model needs to provide related knowledge base background to eliminate ambiguity, so the model is often limited in a specific field and cannot be effectively applied to other databases. Secondly, natural language is highly condensed language used in daily life of human beings, even if partial information is omitted during communication, sentences can be understood through reasoning usually, however, the traditional method cannot enable a computer to have such reasoning capability, and therefore the accuracy of conversion cannot be guaranteed.
In the part of converting the data table into the visualization form, enumeration operation is firstly carried out to obtain the combinations of all fields and visualization types, then screening is carried out once to remove the condition that the visualization chart cannot be formed, and then a strict scoring rule is designed to score, so that the combination with the highest score obtains the highest weight, and the combination is taken as the optimal form of showing data. The effect of the scheme depends on the selection of the scoring rule, and the existing scoring scheme is mostly based on expressiveness and effectiveness standards proposed by Mackinlay, and the two standards list conditions which should be met by an excellent visualization form.
The disadvantage of this rule-based generation method is that designing detailed scoring rules is a very difficult process, and it requires a professional to judge all possible situations, and at the same time, adding some visualization form may make existing rules contradictory, and the system has no strong robustness, and more importantly, when the size of data input into the system becomes large, the number of combinations generated by enumeration will increase sharply, resulting in system breakdown.
As shown in fig. 4, the present invention further provides a database query apparatus, including:
the system comprises an acquisition module 1, a query module and a query module, wherein the acquisition module is used for acquiring user input and selecting a database to be queried;
the query module 2 is used for converting the user input into SQL statements and performing database query to obtain a data table;
the conversion module 3 is used for converting the user input and the data table into a visual chart;
and the display module 4 is used for displaying the visual chart.
In a preferred embodiment of the present invention, the obtaining module further includes a judging unit;
the judging unit is used for judging whether the user input is valid input.
In a preferred embodiment of the present invention, the query module comprises:
the first mapping unit is used for completing the mapping from user input and database information to a vector space through a word vector tool and inputting the mapping to the Bi-LSTM network;
the first coding unit is used for coding the user input and the database information by using a Bi-LSTM network and acquiring the intermediate semantic vector representation of the user input and the database information;
the first extraction unit is used for extracting the contact information of two intermediate semantics through an Attention mechanism to complete coding;
a first decoding unit for decoding; the SQL sentence splicing method comprises the following steps of dividing a plurality of modules according to SQL keywords, predicting by each module, and filling missing information of each module to complete splicing of SQL sentences;
the query unit is used for querying the database through SQL statements to obtain a data table;
and/or;
the conversion unit comprises:
the second mapping unit is used for completing the mapping from user input and a data table to a vector space through a word vector tool and inputting the mapping to the Bi-LSTM network;
the second coding unit is used for coding the user input and the data table by using the Bi-LSTM network, taking the field names of the database as text sequences, and inputting the aggregation function information and the text sequences of all the fields into the Bi-LSTM network; acquiring intermediate semantic vector representation of the two;
the second extraction unit is used for extracting the contact information of two intermediate semantics through an Attention mechanism to complete the input coding;
a second decoding unit for decoding; dividing a task for generating visualization into a plurality of sub-modules, and predicting a generated result by splicing each module to obtain necessary information for generating visualization;
and the generating unit is used for obtaining the visualization chart according to the necessary information for generating the visualization.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to execute the steps in the database query method as claimed above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including special-purpose integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, the implementation of a software program is more preferable. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the method of the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, e.g., the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. A computer-readable storage medium may be any available medium that a computer can store or a data storage device, such as a server, a data center, etc., that is integrated with one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
According to the database query method, the database query device and the computer readable storage medium, when the user uses the database, the user only needs to input the problem by using the natural language and select the database which the user wants to query, and then can complete processing and output the queried data table and the recommended visual chart in the background, so that the user can pay attention to the problem without considering the internal structure of the database system, and the efficiency of querying and analyzing data by the user is obviously improved. By using the database query method, the database query device and the computer readable storage medium, a user without computer related experience can also obtain desired data through daily used natural language, and can visually find out the internal relation of the data through data visualization, so that the user does not need to learn complicated SQL grammar and control the operation of a visualization form, and the cost of querying and analyzing the data is reduced. Finally, when the user uses the database query method, the database query device and the computer readable storage medium, the user can use different sentences to express the same meaning in the query process because the input natural language is not limited by strict grammar any more, and the technical process in the operation can be simplified.
In summary, the embodiments of the present invention can at least achieve the following effects:
in the embodiment of the invention, a complete system which is end-to-end, converts the natural language into the SQL statement and visually presents the result is designed and realized.
In the embodiment of the invention, when the SQL statement is converted, the global information of the database is added to help the model to understand the natural language question.
In the embodiment of the invention, when the visualization form is generated, the use of all data in the data table is avoided, certain sensitive information in the data is prevented from being leaked, the data safety is ensured, and the speed of processing the data by the model is also improved.
In the embodiment of the invention, the connection between the natural language question and the visualization form is found, and the question is used as an input help model for prediction.
In the embodiment of the invention, the complex task is divided into a plurality of sub-modules, and meanwhile, the connection between the modules is strengthened, so that the model can fully understand the input information.
Finally, it is to be noted that: the above description is only a preferred embodiment of the present invention, and is only used to illustrate the technical solutions of the present invention, and not to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A database query method is characterized by comprising the following steps:
s1, acquiring user input and selecting a database to be queried;
s2, converting the user input into an SQL statement and performing database query to obtain a data table;
s3, converting the user input and the data sheet into a visual chart;
and S4, displaying the visual chart.
2. The database query method according to claim 1, wherein in S1, further comprising:
and judging whether the user input is valid input.
3. The database query method according to claim 1, wherein the S2 specifically includes:
s21, completing the mapping of the user input and the database information to a vector space through a word vector tool and inputting the mapping to a Bi-LSTM network;
s22, encoding the user input and the database information by using a Bi-LSTM network, and acquiring the intermediate semantic vector representation of the user input and the database information;
s23, extracting the contact information of two intermediate semantics through an Attention mechanism to complete coding;
s24, decoding; the SQL sentence splicing method comprises the following steps of dividing a plurality of modules according to SQL keywords, predicting by each module, and filling missing information of each module to complete splicing of SQL sentences;
and S25, querying the database through the SQL statement to obtain a data table.
4. The database query method according to claim 3, wherein the S24 specifically includes:
and dividing the SQL keywords into a SELECT module, a WHERE module, an ORDER module and a GROUP module, decoding, and then decoding the FROM module.
5. The database query method according to claim 1, wherein the S3 specifically includes:
s31, completing the mapping of the user input and the data table to a vector space through a word vector tool and inputting the mapping to a Bi-LSTM network;
s32, encoding the user input and the data table by using a Bi-LSTM network, taking the field names of the database as text sequences, and inputting the aggregation function information of each field and the text sequences into the Bi-LSTM network; acquiring intermediate semantic vector representation of the two;
s33, extracting the contact information of two intermediate semantics through an Attention mechanism to complete the input coding;
s34, decoding; dividing a task for generating visualization into a plurality of sub-modules, and predicting a generated result by splicing each module to obtain necessary information for generating visualization;
and S35, obtaining a visualization chart according to the necessary information for generating the visualization.
6. The database query method according to claim 5, wherein said S34 includes:
s341, a LAYOUT decoding module;
s342, judging whether a plurality of coordinate series exist in the visual chart, if so, executing S343, otherwise, executing S344;
s343, a decode services module,
s344, decoding the X-AXIS module;
s345, decoding the Y-AXIS module;
and S346, splicing the results generated by the prediction of the modules to obtain necessary information for generating visualization.
7. A database query device, comprising:
the acquisition module is used for acquiring user input and selecting a database to be queried;
the query module is used for converting the user input into an SQL statement and performing database query to obtain a data table;
the conversion module is used for converting the user input and the data table into a visual chart;
and the display module is used for displaying the visual chart.
8. The database query device according to claim 7, wherein the obtaining module further includes a judging unit;
the judging unit is used for judging whether the user input is valid input.
9. The database querying device of claim 7, wherein the querying module comprises:
the first mapping unit is used for completing the mapping of the user input and the database information to a vector space through a word vector tool and inputting the mapping to the Bi-LSTM network;
the first coding unit is used for coding the user input and the database information by using a Bi-LSTM network to obtain the intermediate semantic vector representation of the user input and the database information;
the first extraction unit is used for extracting the contact information of two intermediate semantics through an Attention mechanism to complete coding;
a first decoding unit for decoding; the SQL sentence splicing method comprises the following steps of dividing a plurality of modules according to SQL keywords, predicting by each module, and filling missing information of each module to complete splicing of SQL sentences;
the query unit is used for querying the database through SQL statements to obtain a data table;
and/or;
the conversion unit comprises:
the second mapping unit is used for completing the mapping from the user input and the data table to the vector space through a word vector tool and inputting the mapping to the Bi-LSTM network;
the second coding unit is used for coding the user input and the data table by using a Bi-LSTM network, taking the field names of the database as text sequences, and inputting the aggregation function information of each field and the text sequences into the Bi-LSTM network; acquiring intermediate semantic vector representation of the two;
the second extraction unit is used for extracting the contact information of two intermediate semantics through an Attention mechanism to complete the input coding;
a second decoding unit for decoding; dividing a task for generating visualization into a plurality of sub-modules, and predicting a generated result by splicing each module to obtain necessary information for generating visualization;
and the generating unit is used for obtaining a visualization chart according to the necessary information for generating the visualization.
10. A computer-readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the database query method according to any one of claims 1-6.
CN201911407758.6A 2019-12-31 2019-12-31 Database query method and device and computer readable storage medium Pending CN111274267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911407758.6A CN111274267A (en) 2019-12-31 2019-12-31 Database query method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911407758.6A CN111274267A (en) 2019-12-31 2019-12-31 Database query method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111274267A true CN111274267A (en) 2020-06-12

Family

ID=71001582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911407758.6A Pending CN111274267A (en) 2019-12-31 2019-12-31 Database query method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111274267A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797619A (en) * 2020-07-15 2020-10-20 浙江赢康医疗科技有限公司 Device and method for automatically calculating observation items according to formula
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN111831626A (en) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 Graph structure generation method of database logical relation, data query method and device
CN112036189A (en) * 2020-08-10 2020-12-04 中国人民大学 Method and system for recognizing gold semantic
CN112632136A (en) * 2020-12-11 2021-04-09 北京国电通网络技术有限公司 Data statistical analysis method and device, electronic equipment and storage medium
CN112966140A (en) * 2021-03-10 2021-06-15 北京百度网讯科技有限公司 Field identification method, field identification device, electronic device, storage medium, and program product
CN112989010A (en) * 2021-04-26 2021-06-18 广州思迈特软件有限公司 Data query method, data query device and electronic equipment
CN112989011A (en) * 2021-04-26 2021-06-18 广州思迈特软件有限公司 Data query method, data query device and electronic equipment
CN113761334A (en) * 2020-11-13 2021-12-07 北京沃东天骏信息技术有限公司 Visual recommendation method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180288086A1 (en) * 2017-04-03 2018-10-04 Royal Bank Of Canada Systems and methods for cyberbot network detection
CN109446221A (en) * 2018-10-29 2019-03-08 北京百分点信息科技有限公司 A kind of interactive data method for surveying based on semantic analysis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180288086A1 (en) * 2017-04-03 2018-10-04 Royal Bank Of Canada Systems and methods for cyberbot network detection
CN109446221A (en) * 2018-10-29 2019-03-08 北京百分点信息科技有限公司 A kind of interactive data method for surveying based on semantic analysis

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797619B (en) * 2020-07-15 2024-04-09 联赢佳士比医疗科技(浙江)股份有限公司 Device and method for automatically calculating observation items according to formulas
CN111797619A (en) * 2020-07-15 2020-10-20 浙江赢康医疗科技有限公司 Device and method for automatically calculating observation items according to formula
WO2022011837A1 (en) * 2020-07-15 2022-01-20 浙江赢康医疗科技有限公司 Device and method for automatically calculating observation items according to formulas
WO2021139270A1 (en) * 2020-07-16 2021-07-15 平安科技(深圳)有限公司 Graph structure generation method for database logical relationship, and data query method and device
CN111831626A (en) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 Graph structure generation method of database logical relation, data query method and device
CN112036189A (en) * 2020-08-10 2020-12-04 中国人民大学 Method and system for recognizing gold semantic
CN111813802B (en) * 2020-09-11 2021-06-29 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN113761334A (en) * 2020-11-13 2021-12-07 北京沃东天骏信息技术有限公司 Visual recommendation method, device, equipment and storage medium
CN112632136A (en) * 2020-12-11 2021-04-09 北京国电通网络技术有限公司 Data statistical analysis method and device, electronic equipment and storage medium
CN112966140A (en) * 2021-03-10 2021-06-15 北京百度网讯科技有限公司 Field identification method, field identification device, electronic device, storage medium, and program product
CN112966140B (en) * 2021-03-10 2023-08-08 北京百度网讯科技有限公司 Field identification method, field identification device, electronic device, storage medium and program product
CN112989010A (en) * 2021-04-26 2021-06-18 广州思迈特软件有限公司 Data query method, data query device and electronic equipment
CN112989011A (en) * 2021-04-26 2021-06-18 广州思迈特软件有限公司 Data query method, data query device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111274267A (en) Database query method and device and computer readable storage medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
US20220277005A1 (en) Semantic parsing of natural language query
WO2020005601A1 (en) Semantic parsing of natural language query
WO2021204014A1 (en) Model training method and related apparatus
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN110765277A (en) Online equipment fault diagnosis platform of mobile terminal based on knowledge graph
CN116719520B (en) Code generation method and device
WO2020258303A1 (en) Semantic model instantiation method, system and device
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN114579104A (en) Data analysis scene generation method, device, equipment and storage medium
CN112989829B (en) Named entity recognition method, device, equipment and storage medium
CN113821588A (en) Text processing method and device, electronic equipment and storage medium
CN113297251A (en) Multi-source data retrieval method, device, equipment and storage medium
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
CN112417170A (en) Relation linking method for incomplete knowledge graph
CN114969001B (en) Database metadata field matching method, device, equipment and medium
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
CN115982322A (en) Water conservancy industry design field knowledge graph retrieval method and retrieval system
TWI636370B (en) Establishing chart indexing method and computer program product by text information
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114238595A (en) Metallurgical knowledge question-answering method and system based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination