CN114090627B - Data query method and device - Google Patents


Info

Publication number
CN114090627B
Authority
CN
China
Prior art keywords
query
node
data
nodes
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210058343.8A
Other languages
Chinese (zh)
Other versions
CN114090627A (en
Inventor
田有朋
李俊
刘海波
朱文嘉
黄亚东
王小卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210058343.8A
Priority to CN202210873120.7A (CN115221198A)
Publication of CN114090627A
Application granted
Publication of CN114090627B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages

Abstract

An embodiment of the present specification provides a data query method and device. The method includes: acquiring a first statement for performing a data query on a data storage system, wherein the first statement is based on a natural language; determining natural semantics of the first statement, the natural semantics comprising a query condition on which the data query is based and a description of a target data item that the data query is intended to obtain; generating a query logic tree based on the natural semantics, wherein the query logic tree indicates intermediate logic steps for obtaining the target data item according to the query condition, and the intermediate logic steps are independent of the data storage structure; and generating, according to the query logic tree, a second statement based on a data query language.

Description

Data query method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of natural language processing and the field of data analysis, and in particular, to a data query method and apparatus.
Background
A natural language query is a data query performed using natural language. Natural language query lowers the technical threshold of data query operations, so that an operator can conveniently query data without mastering a professional data query language. In particular, where an enterprise has too few professionals skilled in data query languages, or where its data query demand is very large, natural language query helps the enterprise relieve the resulting data query workload bottleneck and improves the efficiency of enterprise data query. However, existing natural language data query methods still suffer from low query accuracy and insufficient support for complex query operations.
Therefore, a better data query method is needed.
Disclosure of Invention
Embodiments in this specification aim to provide a new data query method. The method generates, from a natural query statement in natural language form, a query logic tree that expresses the query logic and is independent of any specific data storage mode, and then generates a query execution statement based on a data query language according to the query logic tree. By using the method, the difficulty of converting a natural query statement into a query execution statement can be greatly reduced, the quality of the conversion can be improved, the accuracy of query results obtained from natural language can be greatly improved, and the support for complex query operations expressed in natural language can be improved.
According to a first aspect, there is provided a data query method, including:
acquiring a first statement for carrying out data query on a data storage system, wherein the first statement is based on a natural language;
determining natural semantics of the first sentence, the natural semantics including a query condition on which the data query is based and a description of a target data item that the data query is intended to obtain;
generating a query logic tree based on the natural semantics, wherein the query logic tree indicates an intermediate logic step for obtaining the target data item according to the query condition, and the intermediate logic step is independent of a data storage structure;
and generating a second statement based on the data query language according to the query logic tree.
In one possible embodiment, determining the natural semantics of the first sentence comprises:
generating a natural semantic tree of the first sentence based on the syntactic and semantic analysis of the first sentence;
generating a query logic tree based on the natural semantics, comprising:
and generating a query logic tree according to the natural semantic tree.
In one possible implementation, generating the second statement based on the data query language includes:
and generating a second statement according to the query logic tree and a data storage structure of the data storage system.
In one possible embodiment, the data storage structure includes a data table structure of the data storage system and a relationship between data tables.
In one possible implementation, the data query language includes one of SQL language and SPARK data query language.
In one possible embodiment, the generating a query logic tree based on the natural semantics includes:
and generating a query logic tree based on the query condition, the target data item and the predefined logic node.
In one possible implementation manner, the logic nodes include a query node, a screening condition node and a query data item node, wherein the screening condition node and the query data item node serve as child nodes of the query node;
generating a query logic tree based on the query condition and the target data item, and predefined logic nodes, including:
determining a plurality of query steps which are sequentially progressive, and query data items and screening conditions which respectively correspond to the query steps based on the query conditions and the target data items;
determining query nodes corresponding to the query steps, screening condition nodes and query data item nodes owned by the query nodes and membership between the query nodes based on the query steps, the sequence of the query steps, the query data items and the screening conditions;
and generating a query logic tree according to the query nodes and the subordination relationship among the query nodes.
In one possible embodiment, the query logic tree further includes a screening condition group node and a query data item group node; the screening condition group node serves as the parent node of the plurality of screening condition nodes corresponding to a query step and as a child node of the query node corresponding to that query step, and the query data item group node serves as the parent node of the plurality of query data item nodes and as a child node of the query node corresponding to that query step.
In one possible implementation, the query logic tree further includes, as child nodes of the query data item node, data item operation function nodes corresponding to operation functions applied to the query data items, and data item identification nodes corresponding to identifications of the query data items.
In a possible implementation manner, the query result dataset corresponding to the query node is the union, intersection, or difference of a plurality of sub-query datasets, and the query node has sub-query nodes respectively corresponding to the plurality of sub-query datasets and a set operation child node corresponding to the union, intersection, or difference operation between the sub-query datasets.
In a possible implementation manner, the query node has a pre-step child node for indicating the query node corresponding to the step preceding the query step to which the query node corresponds;
determining an affiliation between query nodes based on the order of the querying steps, including:
determining a pre-step child node for each query node based on the order of the query steps.
In one possible embodiment, the pre-step child node has either a query child node, corresponding to the query node of the preceding query step, or a no-pre-step identifier child node indicating that there is no preceding query step.
According to a second aspect, there is provided a data query apparatus comprising:
the natural sentence acquisition unit is configured to acquire a first sentence used for carrying out data query on a data storage system, wherein the first sentence is based on a natural language;
a natural semantics determining unit configured to determine natural semantics of the first sentence, the natural semantics including a query condition on which the data query is based and a description of a target data item that the data query is intended to obtain;
a query logic determination unit configured to generate a query logic tree based on the natural semantics, the query logic tree indicating an intermediate logic step of obtaining the target data item according to the query condition, and the intermediate logic step being independent of a data storage structure;
and the query statement generation unit is configured to generate a second statement based on the data query language according to the query logic tree.
In a possible implementation, the natural semantics determining unit is further configured to:
generating a natural semantic tree of the first sentence based on the syntactic and semantic analysis of the first sentence;
generating a query logic tree based on the natural semantics, comprising:
and generating a query logic tree according to the natural semantic tree.
In one possible embodiment, the query statement generation unit is further configured to:
and generating a second statement according to the query logic tree and a data storage structure of the data storage system.
In one possible implementation, the data storage structure includes a data table structure of the data storage system and a relationship between the data tables.
In a possible implementation, the query logic determining unit is further configured to:
and generating a query logic tree based on the query condition, the target data item and the predefined logic node.
In one possible implementation manner, the logic nodes include a query node, a screening condition node and a query data item node, wherein the screening condition node and the query data item node serve as child nodes of the query node;
a query logic determination unit further configured to:
determining a plurality of query steps which are sequentially progressive, and query data items and screening conditions which respectively correspond to the query steps based on the query conditions and the target data items;
determining query nodes corresponding to the query steps, screening condition nodes and query data item nodes owned by the query nodes and membership between the query nodes based on the query steps, the sequence of the query steps, the query data items and the screening conditions;
and generating a query logic tree according to the query nodes and the subordination relationship among the query nodes.
In one possible embodiment, the query logic tree further includes a screening condition group node and a query data item group node; the screening condition group node serves as the parent node of the plurality of screening condition nodes corresponding to a query step and as a child node of the query node corresponding to that query step, and the query data item group node serves as the parent node of the plurality of query data item nodes and as a child node of the query node corresponding to that query step.
In one possible implementation, the query logic tree further includes, as child nodes of the query data item node, a data item operation function node corresponding to an operation function applied to the query data item, and a data item identification node corresponding to an identification of the query data item.
In a possible implementation manner, the query result dataset corresponding to the query node is the union, intersection, or difference of a plurality of sub-query datasets, and the query node has sub-query nodes respectively corresponding to the plurality of sub-query datasets and a set operation child node corresponding to the union, intersection, or difference operation between the sub-query datasets.
In a possible implementation manner, the query node has a pre-step child node for indicating the query node corresponding to the step preceding the query step to which the query node corresponds;
a query logic determination unit further configured to:
determining a pre-step child node for each query node based on the order of the query steps.
In one possible embodiment, the pre-step child node has either a query child node, corresponding to the query node of the preceding query step, or a no-pre-step identifier child node indicating that there is no preceding query step.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
With one or more of the method, apparatus, storage medium, and computing device of the above aspects, the accuracy of query results obtained from natural language can be greatly improved, as can the support for complex query operations expressed in natural language.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 illustrates a schematic diagram of a data query method according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a method for querying data according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a natural semantic tree in accordance with an embodiment of the present description;
FIG. 4 illustrates a schematic diagram of a query logic tree in accordance with an embodiment of the present description;
FIG. 5 shows a schematic diagram of a generated SQL statement according to an embodiment of the present specification;
FIG. 6 is a block diagram of a data query apparatus according to an embodiment of the present specification.
Detailed Description
The solution provided by the present specification will be described below with reference to the accompanying drawings.
As described above, data query using natural language can greatly lower the technical threshold of data query operations, so that a query operator can conveniently query data without mastering a professional data query language. However, using natural language for data queries currently has problems. For data retrieval, the query statement obtained by analyzing the natural language must be essentially completely accurate. This differs from traditional algorithms in the field of natural language processing (NLP), which generally produce probabilistic results rather than completely accurate ones. Therefore, the currently mainstream technical solutions for natural language data query (e.g., seq2sql) generally first translate the natural language into a data query language (e.g., Structured Query Language, SQL for short), and then perform the data query using the data query language.
However, this solution also has problems. The syntactic gap between natural language and data query languages such as SQL is very large. For example: 1) SQL syntax has the JOIN keyword for expressing associations between tables, whereas table associations do not usually appear in natural language expressions. 2) SQL syntax uses the keyword GROUP BY for grouping, and the grouping data items must appear both after GROUP BY and after the query keyword SELECT; such expressions do not usually appear in natural language. 3) SQL syntax uses a dedicated keyword, HAVING, to introduce screening conditions on aggregated data (e.g., screening after grouping), whereas natural language does not distinguish between ordinary data screening and aggregate screening. It is therefore difficult to translate natural language directly into SQL. For example, current state-of-the-art seq2sql algorithms achieve only about 80% query accuracy for single-table, single-level aggregation, and can hardly support the varied and complex data analysis requirements of real enterprise scenarios.
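The three gaps listed above can be made concrete with a small example. The following is a minimal sketch using Python's built-in `sqlite3` module; the tables `users` and `payments` and their columns are hypothetical and are not taken from the patent. The question "which users in each city paid less than 100,000 in total?" mentions no tables, grouping, or aggregate filtering, yet the SQL needs all three.

```python
import sqlite3

# Hypothetical two-table storage structure, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE payments (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Hangzhou'), (2, 'Hangzhou'), (3, 'Shanghai');
    INSERT INTO payments VALUES (1, 40000), (1, 30000), (2, 150000), (3, 20000);
""")

# The natural-language question never says JOIN, GROUP BY, or HAVING,
# but the SQL statement needs each of them.
sql = """
    SELECT u.city, p.user_id
    FROM payments AS p
    JOIN users AS u ON u.user_id = p.user_id   -- 1) table association
    GROUP BY u.city, p.user_id                 -- 2) grouped items repeat after SELECT
    HAVING SUM(p.amount) < 100000              -- 3) aggregate screening uses HAVING
    ORDER BY u.city, p.user_id
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

With the sample rows above, user 2 is excluded because the summed amount exceeds the threshold.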
In order to solve this technical problem, the embodiments of the present specification provide a data query method. FIG. 1 is a schematic diagram illustrating a data query method according to an embodiment of the present disclosure. As shown in FIG. 1, the core idea of the method is as follows. First, a query statement in natural language input by a user is obtained. Then, the natural query statement is parsed to obtain the natural semantics it contains, including, for example, query conditions, a query purpose, or a query calculation formula. According to the natural semantics, the logic steps of the query are determined (these logic steps are independent of how the data are stored in the database), and a query logic tree representing those steps is generated using a pre-designed Intermediate Language (IL) for representing query logic and a graphical language for representing the query logic as a tree structure. Finally, a corresponding query statement in a data query language (such as an SQL statement) for acquiring data from the database is generated according to the query logic tree. The data query statement can then be executed to obtain a query result dataset from the database.
Compared with prior-art approaches that convert the natural language statement directly into the query execution statement, this method can greatly reduce the complexity of statement conversion, improve its accuracy, produce more complex query execution statements, and thus meet more complex data analysis requirements. Specifically, the query logic tree determines each logic step of the query only according to the semantics of the natural query statement, and is independent of any specific data storage structure, such as the data tables and table relationships of a specific database. That is, the query logic tree does not include content related to a specific data storage structure. In the syntax of a data query language (e.g., SQL), the parts that differ most from natural language syntax are usually table association statements (e.g., JOIN in SQL), grouping statements (e.g., GROUP BY in SQL), and aggregation screening statements (e.g., HAVING in SQL), which are tied to the specific data storage structure, such as the data tables and table relationships. Accordingly, the query logic tree need not include content related to, for example, JOIN, GROUP BY, or HAVING statements. Generating the query logic tree from the natural query statement is therefore much less demanding, in both syntax conversion and computational complexity, than converting the natural query statement directly into the data query language. After the query logic tree is obtained, an executable data query statement based on the data query language can be generated by combining the specific data storage structure with the query logic tree. Over the whole conversion process, the method can greatly reduce complexity and improve conversion accuracy.
Moreover, obtaining the final query execution statement from the query logic tree is much less difficult than obtaining it directly from the natural language, so the method can produce more complex query execution statements and meet complex data analysis requirements.
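To illustrate how a storage-independent logic step might be combined with a storage structure to produce SQL, here is a minimal sketch. It is not the patented generation procedure: the classes `Item`, `Filter`, and `Step`, the mapping tables `COLUMNS` and `JOINS`, and the table and column names are all assumptions made for this example. The logic step itself contains no JOIN, GROUP BY, or HAVING; those appear only when the step is rendered against the assumed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class Item:
    name: str                   # logical data item, e.g. "city"
    func: Optional[str] = None  # optional aggregate function, e.g. "SUM"

@dataclass
class Filter:
    item: Item
    op: str
    value: Union[str, int, float]

@dataclass
class Step:                     # one storage-independent query step
    items: List[Item]
    filters: List[Filter] = field(default_factory=list)

# Hypothetical storage structure: logical item -> (table, column), plus table relations.
COLUMNS = {"city": ("users", "city"), "user_id": ("payments", "user_id"),
           "amount": ("payments", "amount"), "pay_date": ("payments", "pay_date")}
JOINS = {frozenset({"users", "payments"}): "payments.user_id = users.user_id"}

def ref(item: Item) -> str:
    table, column = COLUMNS[item.name]
    expr = f"{table}.{column}"
    return f"{item.func}({expr})" if item.func else expr

def literal(value) -> str:
    # Naive quoting without escaping; acceptable only for this illustration.
    return f"'{value}'" if isinstance(value, str) else str(value)

def conditions(filters: List[Filter]) -> str:
    return " AND ".join(f"{ref(f.item)} {f.op} {literal(f.value)}" for f in filters)

def to_sql(step: Step) -> str:
    every_item = step.items + [f.item for f in step.filters]
    tables = list(dict.fromkeys(COLUMNS[i.name][0] for i in every_item))
    sql = "SELECT " + ", ".join(ref(i) for i in step.items) + " FROM " + tables[0]
    for t in tables[1:]:        # table associations come from the schema, not the step
        sql += f" JOIN {t} ON {JOINS[frozenset({tables[0], t})]}"
    where = [f for f in step.filters if f.item.func is None]
    having = [f for f in step.filters if f.item.func is not None]
    if where:
        sql += " WHERE " + conditions(where)
    plain = [ref(i) for i in step.items if i.func is None]
    if plain and (having or any(i.func for i in step.items)):
        sql += " GROUP BY " + ", ".join(plain)
    if having:                  # aggregate screening conditions become HAVING
        sql += " HAVING " + conditions(having)
    return sql

# 'yesterday' is kept as a placeholder literal, mirroring the example in the text.
step1 = Step(items=[Item("city"), Item("user_id")],
             filters=[Filter(Item("pay_date"), "=", "yesterday"),
                      Filter(Item("amount", "SUM"), "<", 100000)])
sql = to_sql(step1)
print(sql)
```

Changing only `COLUMNS` and `JOINS` would retarget the same logic step to a different storage layout, which is the separation the passage above describes.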
Fig. 2 shows a flow chart of a data query method according to an embodiment of the present description. As shown in fig. 2, the method comprises the steps of:
First, in step 21, a first statement for performing a data query on a data storage system is obtained.
In this step, the acquired first sentence is a natural language sentence used to perform a data query on the data storage system. In different embodiments, the natural language sentence may be obtained in different specific ways. In one example, a natural language sentence input by a user through an interface may be obtained. The embodiments of the present specification focus on the language processing and data query process that follow acquisition of the natural language sentence, not on the specific way in which it is acquired. The data storage system is a storage system that stores the data the user wishes to obtain by means of the natural language sentence. In different embodiments, the data storage system may be of different specific types and have different specific architectures, which are not limited in this specification.
After the first sentence is obtained, the natural semantics of the first sentence are determined in step 22. In this step, the determined natural semantics include a query condition on which the data query is based and a description of a target data item that the data query is intended to obtain.
The semantics of a statement is the meaning it carries. That is, from the acquired natural language statement intended for data query, its underlying meaning, including the query condition and the query purpose (the target data item), is determined. In different embodiments, the determined natural semantics may have different concrete representations, which are not limited in this specification. For example, natural semantics may be represented as a tree diagram, i.e., a natural semantic tree. Thus, in one embodiment, a natural semantic tree of the first sentence may be generated based on syntactic and semantic analysis of the first sentence. FIG. 3 illustrates a schematic diagram of a natural semantic tree in accordance with an embodiment of the present description. As shown in FIG. 3, in this example, the natural query sentence input by the user is 'the number of customers in each city whose total amount yesterday was less than 100,000'. Semantic analysis of the sentence yields its semantic tree, in which the 'time' node represents the time range the user wishes to query (a query condition), the 'data dimension' node represents a target data item the user wishes to query, or a basic data item from which the target data item is obtained, the 'calculation formula' node represents a calculation condition set by the user's query (a query condition), and the 'destination' node represents the calculation purpose the user sets on the basic data items (a description of the target data item).
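One way to picture such a natural semantic tree is as a small typed tree whose children carry the four node kinds named above. The sketch below is an assumption-laden illustration: the class `SemanticNode`, the helper `split_semantics`, and the node texts are hypothetical, and the actual content of FIG. 3 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SemanticNode:
    kind: str   # "time", "data dimension", "calculation formula", "destination"
    text: str
    children: List["SemanticNode"] = field(default_factory=list)

# Hypothetical tree for the example sentence in the text.
tree = SemanticNode("sentence", "customers per city with total amount < 100,000 yesterday", [
    SemanticNode("time", "yesterday"),
    SemanticNode("data dimension", "city"),
    SemanticNode("calculation formula", "total amount < 100000"),
    SemanticNode("destination", "number of customers"),
])

# Per the description, 'time' and 'calculation formula' nodes act as query
# conditions; the other two kinds describe the target data items.
CONDITION_KINDS = {"time", "calculation formula"}

def split_semantics(root: SemanticNode) -> Tuple[List[str], List[str]]:
    conditions = [c.text for c in root.children if c.kind in CONDITION_KINDS]
    targets = [c.text for c in root.children if c.kind not in CONDITION_KINDS]
    return conditions, targets

conditions, targets = split_semantics(tree)
print(conditions, targets)
```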
Then, at step 23, a query logic tree is generated based on the natural semantics. In this step, a tree diagram, i.e., a query logic tree, expressing query logic may be generated according to the natural semantics determined in step 22. The query logic tree may indicate intermediate logic steps for obtaining the target data item according to query conditions included in the natural semantics, and the intermediate logic steps are independent of the data storage structure.
Specifically, the process of obtaining the final query result (query dataset) that the user desires, as expressed by the natural semantics, may be decomposed into a plurality of query steps, each of which may correspond to a single query. The final query result is obtained by the final query step, which may be a progressive query on top of the query result obtained by the preceding step.
For example, in the embodiment shown in FIG. 3, the user wants to query 'the number of customers in each city whose total amount yesterday was less than 100,000'. From the natural semantic tree in FIG. 3, it can be determined that the data items the query statement is intended to obtain are each city and, for each city, the number of customers satisfying the condition (total amount less than 100,000). To determine the number of customers satisfying the condition in each city, it may first be determined which customers in each city satisfy the condition. Thus, in one embodiment, the query may be broken down into two progressive queries, i.e., two progressive steps:
In the first step, for each city, the user IDs whose total payment amount is less than 100,000 are queried.
The data screening conditions can be expressed as: transaction time = 'yesterday', SUM(amount) < 100000.
The data items included in the acquired data set can be expressed as: city, user ID.
In the second step, the number of users in each city whose payment amount is less than 100,000 is determined from the data set obtained in the first step.
The data screening condition is: none.
The data items included in the acquired data set are: city, number of users.
The number of users can be counted from the user IDs in the data set obtained in the first step, expressed for example as COUNT(user ID).
It can be seen that none of the above query steps involves the specific storage structure of the queried data in, for example, a database, such as which tables the city and the user ID are stored in, and what the relationships between those tables are.
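The two progressive steps can be sketched directly on in-memory records, without reference to any table layout. This is a minimal illustration only: the sample records and the function names `step_one` and `step_two` are invented for this example.

```python
from collections import defaultdict

# Hypothetical flat records; how they are stored in a database is irrelevant here.
records = [
    {"city": "Hangzhou", "user_id": 1, "amount": 40000, "day": "yesterday"},
    {"city": "Hangzhou", "user_id": 1, "amount": 30000, "day": "yesterday"},
    {"city": "Hangzhou", "user_id": 2, "amount": 150000, "day": "yesterday"},
    {"city": "Shanghai", "user_id": 3, "amount": 20000, "day": "yesterday"},
    {"city": "Shanghai", "user_id": 4, "amount": 5000, "day": "earlier"},
]

def step_one(rows):
    """Step 1: (city, user ID) pairs whose total amount yesterday is below 100,000."""
    totals = defaultdict(int)
    for r in rows:
        if r["day"] == "yesterday":                   # transaction time = 'yesterday'
            totals[(r["city"], r["user_id"])] += r["amount"]
    return [key for key, total in totals.items() if total < 100000]  # SUM(amount) < 100000

def step_two(step_one_result):
    """Step 2: count the user IDs per city in the data set produced by step 1."""
    counts = defaultdict(int)
    for city, _user_id in step_one_result:
        counts[city] += 1                             # COUNT(user ID)
    return dict(counts)

first = step_one(records)
second = step_two(first)
print(first, second)
```

Step 2 has no screening condition of its own; it only consumes the result of step 1, which is what makes the two steps progressive.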
By using the query logic tree, the query steps and the order between them can be represented as a tree structure diagram. Thus, according to one embodiment, a query logic tree may be generated based on the query conditions and target data items included in the natural semantics, together with predefined logic nodes. The logic nodes may represent finer-grained logic elements within a logic step, such as the query dataset obtained in that step, its screening conditions, and its query data items. In different embodiments, the logic nodes may have different specific definitions. In one embodiment, the logic nodes may include a query node, a screening condition node, and a query data item node, wherein the screening condition node and the query data item node are child nodes of the query node. The query node may correspond to the query dataset obtained in one query step, and the screening condition node and the query data item node may correspond, respectively, to a data screening condition for obtaining that dataset and a data item included in it. Accordingly, a plurality of sequentially progressive query steps, together with the query data items and screening conditions corresponding to each step, can be determined based on the query conditions and the target data items. Then, based on the query steps, their order, the query data items, and the screening conditions, the query node corresponding to each query step, the screening condition nodes and query data item nodes owned by each query node, and the membership between the query nodes are determined. Finally, the query logic tree is generated from the query nodes and the membership between them.
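A minimal sketch of these three node kinds, and of building a tree from an ordered list of query steps, might look as follows. The class names, the `previous` field used to record membership between query nodes, and the `build_tree` helper are assumptions for illustration, not definitions from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ScreeningConditionNode:
    expression: str

@dataclass
class QueryDataItemNode:
    name: str

@dataclass
class QueryNode:
    label: str
    conditions: List[ScreeningConditionNode] = field(default_factory=list)
    items: List[QueryDataItemNode] = field(default_factory=list)
    previous: Optional["QueryNode"] = None  # membership: the step this one builds on

def build_tree(steps: List[Tuple[List[str], List[str]]]) -> Optional[QueryNode]:
    """steps: ordered (screening conditions, data items) pairs; returns the last step's node."""
    node = None
    for number, (conditions, items) in enumerate(steps, start=1):
        node = QueryNode(
            label=f"query node {number}",
            conditions=[ScreeningConditionNode(c) for c in conditions],
            items=[QueryDataItemNode(i) for i in items],
            previous=node,
        )
    return node

root = build_tree([
    (["transaction time = 'yesterday'", "SUM(amount) < 100000"], ["city", "user ID"]),
    ([], ["city", "COUNT(user ID)"]),
])
print(root.label, "->", root.previous.label)
```

Here the root of the tree is the node for the final step, and earlier steps hang beneath it.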
FIG. 4 illustrates a schematic diagram of a query logic tree in accordance with an embodiment of the present description. As shown in fig. 4, according to the two decomposition steps obtained from fig. 3, a corresponding query logic tree may be generated, where the query logic tree includes query nodes corresponding to the first step (query step 1) and the second step (query step 2), i.e., query node 1 and query node 2, and the query conditions and query data items of the first step and the second step respectively correspond to the query node 1 and the query node, which respectively have a screening condition child node and a query data item child node.
In different embodiments, a single query may also include a plurality of different query conditions. Thus, in one embodiment, the query logic tree may further include a filter condition group node and a query data item group node; the screening condition group node is used as a father node of a plurality of screening condition nodes corresponding to each query step and is used as a child node of a query node corresponding to each query step, and the query data item group node is used as a father node of a plurality of query data item nodes and is used as a child node of the query node corresponding to each query step. For example, in the embodiment shown in FIG. 4, query node 2 has a query data item group child node, data item group node 1, and data item group node 1 includes query data item node 1 and query data item node 2, which correspond to two query filter criteria, respectively.
In various embodiments, a function operation may be performed on a query data item in a data query. Thus, in one embodiment, the query logic tree further includes, as child nodes of the query data item node, a data item operation function node corresponding to an operation function applied to the query data item, and a data item identification node corresponding to an identification of the query data item. For example, in the embodiment shown in FIG. 4, query data item node 1 has an operation function child node and a data item identification child node, corresponding respectively to the operation function (count, the counting function) applied to the query data item and the identification (user id) of the data item. In different embodiments, different types of data item operation functions may be included, for example a count function, an average value function, a maximum value function, or a minimum value function, which is not limited in this specification. It should be noted that, depending on the nature of the operation function, applying it to an original data item may yield either a data item of the same kind or a different kind of data item. For example, counting the user id yields the number of users rather than a user id, whereas taking the maximum of the user id still yields a user id.
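As a rough illustration of a query data item node with an operation function child and an identification child, the following Python sketch renders a data item group as a comma-separated select list. The mapping from function names to SQL aggregate keywords and all identifiers (`SQL_FUNCTIONS`, `render_item`, `render_group`) are assumptions introduced for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mapping from operation-function names to SQL aggregate keywords.
SQL_FUNCTIONS = {"count": "COUNT", "average": "AVG", "maximum": "MAX", "minimum": "MIN"}


@dataclass
class QueryDataItemNode:
    identification: str              # data item identification child, e.g. "user_id"
    operation: Optional[str] = None  # data item operation function child, if any


def render_item(item: QueryDataItemNode) -> str:
    """Render one data item, wrapping it in its operation function when present."""
    if item.operation is None:
        return item.identification
    return f"{SQL_FUNCTIONS[item.operation]}({item.identification})"


def render_group(items: List[QueryDataItemNode]) -> str:
    """Render a data item group node as a comma-separated select list."""
    return ", ".join(render_item(item) for item in items)


# A group holding a plain data item and a counted data item.
example_group = [QueryDataItemNode("city"), QueryDataItemNode("user_id", "count")]
example_select_list = render_group(example_group)
```

Keeping the function and the identification as separate children means the same data item can be reused with a different function without changing its identification.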
In various embodiments, a query result may also be a union, intersection, or difference of multiple other data query results. Therefore, in one embodiment, the query result data set corresponding to the query node is a union/intersection/difference set of several sub-query data sets, and the query node has sub-query nodes corresponding to the sub-query data sets respectively, and a set operation sub-node corresponding to a union operation (UNION), an intersection operation (INTERSECT), or a difference operation (EXCEPT) between the sub-query data sets.
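A minimal sketch of a query node with a set operation sub-node might look as follows in Python. The class `SetQueryNode`, the table names `a` and `b`, and the representation of sub-queries as already-rendered SELECT strings are assumptions for illustration; the specification does not prescribe this form.

```python
from dataclasses import dataclass
from typing import List

# SQL keywords for the three set operations named in the text.
SET_OPERATORS = {"union": "UNION", "intersection": "INTERSECT", "difference": "EXCEPT"}


@dataclass
class SetQueryNode:
    operation: str          # set operation sub-node: "union", "intersection" or "difference"
    sub_queries: List[str]  # sub-query nodes, represented here as rendered SELECT statements


def render_set_query(node: SetQueryNode) -> str:
    """Join the sub-queries with the keyword of the node's single set operation."""
    keyword = SET_OPERATORS[node.operation]
    return f" {keyword} ".join(node.sub_queries)


# Hypothetical sub-queries over two illustrative tables a and b.
example_node = SetQueryNode(
    "intersection",
    ["SELECT user_id FROM a", "SELECT user_id FROM b"],
)
example_sql = render_set_query(example_node)
```

Because each node carries exactly one set operation, questions of precedence between different operators do not arise in this sketch; mixed operations would be expressed by nesting nodes.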
As previously described, the overall query logic may contain several sequentially progressive query steps, and each query step other than the first has a preceding step. Therefore, in the query logic tree, the query node corresponding to a subsequent step may indicate the node corresponding to its preceding step. Thus, in one embodiment, a query node may have a pre-step child node for indicating the query node corresponding to the step preceding the query step to which the query node corresponds; the pre-step child node of each query node may be determined based on the order of the query steps. In a specific embodiment, the pre-step child node may have either a query child node, corresponding to the other query node to which the preceding query step corresponds, or a no-pre-step identification child node, indicating that there is no preceding query step. For example, in the embodiment shown in FIG. 4, query node 2 has pre-step child node 2, whose child nodes include query node 1, indicating that the pre-step of query node 2 is query node 1. The pre-step child node of query node 1 has a no-pre-step identification (NONE) child node, indicating that query node 1 has no pre-step.
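The pre-step relationship can be illustrated with a short Python sketch that follows the pre-step links from the last query node back to the first and returns the steps in execution order. Using Python's `None` to stand in for the NONE identification child, and the names `QueryNode` and `execution_order`, are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryNode:
    label: str
    # Pre-step child: the query node of the preceding step.
    # None plays the role of the NONE identification (no preceding step).
    pre_step: Optional["QueryNode"] = None


def execution_order(root: QueryNode) -> List[str]:
    """Follow pre-step links from the root and return labels from first step to last."""
    order = []
    node = root
    while node is not None:
        order.append(node.label)
        node = node.pre_step
    order.reverse()
    return order


# Query node 2 has query node 1 as its pre-step; query node 1 has none.
example_node_1 = QueryNode("query node 1")
example_node_2 = QueryNode("query node 2", pre_step=example_node_1)
example_order = execution_order(example_node_2)
```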
After the query logic tree is obtained, a second statement based on the data query language is generated from the query logic tree at step 24.
In this step, a query statement based on the data query language is generated according to the query logic tree. The statement may be executed, for example, by a database management system, a data computation engine, or another data storage and computing system, to obtain the final query result data set from the data storage system. In different embodiments, different specific types of data query language may be used, which is not limited by this specification. In one embodiment, the data query language may be one of the SQL language and the SPARK data query language.
Specifically, generating a query statement that can obtain data from the data storage system may depend on the data storage structure of that system, for example, the specific data tables in which each target data item of the query is actually stored and the relationships between those data tables. Thus, in one embodiment, the second statement may be generated based on the query logic tree and the data storage structure of the data storage system. In a particular embodiment, the data storage structure may include a data table structure of the data storage system and relationships between data tables.
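To illustrate how a data storage structure might be consulted when generating the second statement, the following Python sketch resolves logical data items to table-qualified columns and builds a FROM clause from table relationships. The table names, column names, and the helpers `resolve_column` and `from_clause` are hypothetical; a real system would need a more general join planner than this single pass over the relationships.

```python
from typing import Dict, List, Tuple

# Hypothetical data table structure: table name -> column names.
TABLES: Dict[str, List[str]] = {
    "F": ["user_id", "city"],
    "D": ["user_id", "amount", "pay_date"],
}
# Hypothetical relationships between tables: (left table, right table, shared key).
RELATIONS: List[Tuple[str, str, str]] = [("F", "D", "user_id")]


def resolve_column(item: str, tables: Dict[str, List[str]]) -> str:
    """Return the first table-qualified column that stores the given data item."""
    for table, columns in tables.items():
        if item in columns:
            return f"{table}.{item}"
    raise KeyError(f"no table stores data item {item!r}")


def from_clause(needed: List[str], relations: List[Tuple[str, str, str]]) -> str:
    """Build a FROM clause joining the needed tables along known relationships.

    Single pass over the relationships; enough for this two-table illustration.
    """
    needed = list(dict.fromkeys(needed))  # remove duplicates, keep order
    clause = needed[0]
    joined = {needed[0]}
    for left, right, key in relations:
        condition = f"{left}.{key} = {right}.{key}"
        if left in joined and right in needed and right not in joined:
            clause += f" JOIN {right} ON {condition}"
            joined.add(right)
        elif right in joined and left in needed and left not in joined:
            clause += f" JOIN {left} ON {condition}"
            joined.add(left)
    return clause


example_from = from_clause(["F", "D"], RELATIONS)
```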
FIG. 5 shows a schematic diagram of a generated SQL statement according to an embodiment of the present specification. As shown in FIG. 5, an executable SQL statement is generated from the query logic tree shown in FIG. 4 in combination with the specific manner in which the data to be queried is stored in the data storage system. In the SQL statement, the query statement (SELECT statement) corresponding to query step 1 is nested inside the SELECT statement corresponding to query step 2. Specifically, the customers and their consumption data are stored in table F and table D. Therefore, in the SELECT statement corresponding to step 1, table F and table D may be connected through a JOIN clause (the join condition may be based on the actual table structures and the association between table F and table D; for example, if both tables have a user_id field with the same data definition, corresponding to the user ID, the two tables may be joined on user_id). Next, the WHERE clause restricts the query to the consumption records of 'yesterday'. Then, the GROUP BY clause and the HAVING clause are used to obtain, for each city, the IDs of the users whose consumption amount exceeds 100,000. Finally, the SELECT statement corresponding to step 2 takes the result of the SELECT statement corresponding to step 1 and obtains, for each city, the number of users whose consumption amount exceeds 100,000. It can be seen that the SQL statement is generated based on both the query logic represented by the query logic tree and the specific storage structure of the data to be queried in the database.
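The nesting described for FIG. 5 can be sketched with Python's standard `sqlite3` module. The statement below is not the statement shown in FIG. 5: the column names (`city`, `amount`, `pay_date`), the sample rows, and the use of a fixed date parameter in place of 'yesterday' are assumptions chosen only so that the example runs.

```python
import sqlite3


def step1_sql() -> str:
    """Inner SELECT: per city, IDs of users whose total amount on a given day exceeds a threshold."""
    return (
        "SELECT F.city AS city, F.user_id AS user_id "
        "FROM F JOIN D ON F.user_id = D.user_id "
        "WHERE D.pay_date = ? "
        "GROUP BY F.city, F.user_id "
        "HAVING SUM(D.amount) > ?"
    )


def step2_sql(inner_sql: str) -> str:
    """Outer SELECT: count the users per city, nesting the step 1 statement."""
    return (
        "SELECT city, COUNT(user_id) AS user_count "
        f"FROM ({inner_sql}) AS step1 "
        "GROUP BY city"
    )


def run_demo():
    """Create hypothetical tables F and D in memory and run the nested query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE F (user_id INTEGER, city TEXT);
        CREATE TABLE D (user_id INTEGER, amount REAL, pay_date TEXT);
        INSERT INTO F VALUES (1, 'Hangzhou'), (2, 'Hangzhou'), (3, 'Shanghai');
        INSERT INTO D VALUES
            (1, 60000, '2022-01-18'), (1, 50000, '2022-01-18'),
            (2, 90000, '2022-01-18'),
            (3, 120000, '2022-01-18'), (3, 500000, '2022-01-17');
        """
    )
    sql = step2_sql(step1_sql()) + " ORDER BY city"
    # Parameters: the day standing in for 'yesterday', then the amount threshold.
    rows = conn.execute(sql, ("2022-01-18", 100000)).fetchall()
    conn.close()
    return rows
```

With these sample rows, user 1 (two payments totalling 110,000) and user 3 qualify on the chosen day while user 2 does not, so each city ends up with one qualifying user. The day and the threshold are passed as bound parameters rather than formatted into the SQL text.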
The above describes a data query method according to an embodiment of the present specification. According to an embodiment of another aspect, a data query device is also provided. Fig. 6 shows a block diagram of a data query device according to an embodiment of the present specification. As shown in fig. 6, the apparatus 600 includes:
a natural language sentence acquisition unit 61 configured to acquire a first sentence for performing data query on the data storage system, the first sentence being based on a natural language;
a natural semantics determining unit 62 configured to determine natural semantics of the first sentence, the natural semantics including a query condition under which the data query is based and a description of a target data item for which the data query is intended;
a query logic determination unit 63 configured to generate a query logic tree based on the natural semantics, where the query logic tree indicates an intermediate logic step for obtaining the target data item according to the query condition, and the intermediate logic step is unrelated to a data storage structure;
a query statement generating unit 64 configured to generate a second statement based on the data query language according to the query logic tree.
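The cooperation of the four units can be pictured as a simple pipeline. The following Python sketch only shows the data flow between the units; the class `DataQueryDevice`, its field names, and the placeholder callables are hypothetical and do not implement any actual language analysis.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DataQueryDevice:
    acquire_sentence: Callable[[], str]          # role of unit 61
    determine_semantics: Callable[[str], Any]    # role of unit 62
    determine_query_logic: Callable[[Any], Any]  # role of unit 63
    generate_statement: Callable[[Any], str]     # role of unit 64

    def run(self) -> str:
        """Pass the output of each unit to the next and return the second statement."""
        first_sentence = self.acquire_sentence()
        semantics = self.determine_semantics(first_sentence)
        logic_tree = self.determine_query_logic(semantics)
        return self.generate_statement(logic_tree)


# Placeholder callables that only demonstrate the wiring between units.
example_device = DataQueryDevice(
    acquire_sentence=lambda: "how many users per city",
    determine_semantics=lambda sentence: {"text": sentence},
    determine_query_logic=lambda semantics: ("tree", semantics["text"]),
    generate_statement=lambda tree: f"-- generated for: {tree[1]}",
)
example_statement = example_device.run()
```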
In one embodiment, the natural semantics determining unit may be further configured to:
generate a natural semantic tree of the first sentence based on syntactic and semantic analysis of the first sentence;
and the query logic determination unit may be further configured to:
generate the query logic tree according to the natural semantic tree.
In one embodiment, the query statement generation unit may be further configured to,
and generating a second statement according to the query logic tree and a data storage structure of the data storage system.
In one embodiment, the data storage structure may include a data table structure of the data storage system and relationships between data tables.
In one embodiment, the query logic determining unit may be further configured to:
and generating a query logic tree based on the query condition, the target data item and the predefined logic node.
In one embodiment, the logical nodes may include a query node, a screening condition node, and a query data item node, where the screening condition node and the query data item node serve as child nodes of the query node;
a query logic determination unit, which may be further configured to:
determining a plurality of query steps which are sequentially progressive, and query data items and screening conditions which respectively correspond to the query steps based on the query conditions and the target data items;
determining query nodes corresponding to the query steps, screening condition nodes and query data item nodes owned by the query nodes and membership between the query nodes based on the query steps, the sequence of the query steps, the query data items and the screening conditions;
and generating a query logic tree according to the query nodes and the subordinate relations among the query nodes.
In one embodiment, the query logic tree may further include a screening condition group node and a query data item group node; the screening condition group node serves as the parent node of a plurality of screening condition nodes corresponding to each query step and as a child node of the query node corresponding to that query step, and the query data item group node serves as the parent node of a plurality of query data item nodes and as a child node of the query node corresponding to that query step.
In one embodiment, the query logic tree may further include, as child nodes of the query data item node, data item operation function nodes corresponding to operation functions applied to the query data items, and data item identification nodes corresponding to identifications of the query data items.
In one embodiment, the query result data set corresponding to the query node may be a union/intersection/difference set of a plurality of sub-query data sets, and the query node has sub-query nodes corresponding to the plurality of sub-query data sets respectively, and a set operation sub-node corresponding to a union/intersection/difference operation between the sub-query data sets.
In one embodiment, the query node may have a pre-step sub-node for indicating a query node corresponding to a step preceding the query step corresponding to the query node;
a query logic determination unit, which may be further configured to:
determining a pre-step child node for each query node based on the order of the query steps.
In one embodiment, the pre-step child node may have either a query child node, corresponding to the other query node to which the preceding query step corresponds, or a no-pre-step identification child node, indicating that there is no preceding query step.
According to an embodiment of yet another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the above-mentioned method.
According to an embodiment of still another aspect, there is also provided a computing device including a memory and a processor, the memory storing executable code, and the processor implementing the method when executing the executable code.
It is to be understood that the terms "first," "second," and the like are used herein only to distinguish between similar concepts for ease of description and are not limiting.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The embodiments above further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (23)

1. A method of data query, comprising:
acquiring a first statement for carrying out data query on a data storage system, wherein the first statement is based on a natural language;
determining natural semantics of the first sentence, the natural semantics including a query condition under which the data query is based and a description of a target data item for which the data query is intended; the determining the natural semantics of the first sentence comprises generating a natural semantic tree of the first sentence based on the syntax analysis and semantic analysis of the first sentence;
generating a query logic tree based on the natural semantics, wherein the query logic tree indicates an intermediate logic step for obtaining the target data item according to the query condition, and the intermediate logic step is composed of a plurality of query steps which are sequentially progressive and is unrelated to a data storage structure; the query logic tree at least comprises a plurality of query nodes respectively corresponding to the query steps, wherein the query nodes have subordination relations, and the subordination relations correspond to the sequential progressive relations among the query steps; generating a query logic tree based on the natural semantics, including generating a query logic tree according to the natural semantics tree;
and generating a second statement based on the data query language according to the query logic tree.
2. The method of claim 1, wherein generating a second statement based on a data query language comprises:
and generating a second statement according to the query logic tree and a data storage structure of the data storage system.
3. The method of claim 2, wherein the data storage structure comprises a data table structure of a data storage system and relationships between data tables.
4. The method of claim 1, wherein the data query language comprises one of a SQL language, a SPARK data query language.
5. The method of claim 1, wherein the generating a query logic tree based on the natural semantics comprises:
and generating a query logic tree based on the query condition, the target data item and the predefined logic node.
6. The method of claim 5, wherein the logical nodes comprise a query node, a filter condition node, and a query data item node, wherein the filter condition node and the query data item node are child nodes of the query node;
generating a query logic tree based on the query condition and the target data item, and predefined logic nodes, including:
determining a plurality of query steps which are sequentially progressive, and query data items and screening conditions which respectively correspond to the query steps based on the query conditions and the target data items;
determining query nodes corresponding to the query steps, screening condition nodes and query data item nodes owned by the query nodes and membership between the query nodes based on the query steps, the sequence of the query steps, the query data items and the screening conditions;
and generating a query logic tree according to the query nodes and the subordination relationship among the query nodes.
7. The method of claim 6, wherein the query logic tree further comprises a screening condition group node and a query data item group node; the screening condition group node is used as a parent node of a plurality of screening condition nodes corresponding to each query step and is used as a child node of the query node corresponding to each query step, and the query data item group node is used as a parent node of a plurality of query data item nodes and is used as a child node of the query node corresponding to each query step.
8. The method of claim 7, wherein the query logic tree further comprises, as child nodes of the query data item node, a data item operation function node corresponding to an operation function applied to the query data item, and a data item identification node corresponding to an identification of the query data item.
9. The method of claim 6, wherein the query result dataset corresponding to the query node is a union/intersection/difference set of a plurality of sub-query datasets, and the query node has sub-query nodes corresponding to the plurality of sub-query datasets respectively and a union operation sub-node corresponding to a union/intersection/difference operation between the sub-query datasets.
10. The method according to claim 6, wherein the query node has a pre-step sub-node for indicating the query node corresponding to the previous step of the query step corresponding to the query node;
determining an affiliation between query nodes based on the order of the querying steps, including:
determining a pre-step child node for each query node based on the order of the query steps.
11. The method of claim 10, wherein the pre-step child node has a query child node corresponding to the other query node to which the previous query step corresponds, or a no-pre-step identification child node indicating that there is no previous query step.
12. A data query apparatus, comprising:
the natural sentence acquisition unit is configured to acquire a first sentence used for carrying out data query on a data storage system, wherein the first sentence is based on a natural language;
a natural semantics determining unit configured to determine natural semantics of the first sentence, the natural semantics including a query condition under which the data query is based and a description of a target data item for which the data query is intended; the determining the natural semantics of the first sentence comprises generating a natural semantic tree of the first sentence based on the syntax analysis and semantic analysis of the first sentence;
a query logic determination unit configured to generate a query logic tree based on the natural semantics, wherein the query logic tree indicates an intermediate logic step consisting of a plurality of query steps which are sequentially advanced according to the query condition to obtain the target data item, and the intermediate logic step is unrelated to a data storage structure; the query logic tree at least comprises a plurality of query nodes respectively corresponding to the query steps, wherein the query nodes have subordination relations, and the subordination relations correspond to the sequential progressive relations among the query steps; generating a query logic tree based on the natural semantics, including generating a query logic tree according to the natural semantics tree;
and the query statement generation unit is configured to generate a second statement based on the data query language according to the query logic tree.
13. The apparatus of claim 12, wherein the query statement generation unit is further configured to,
and generating a second statement according to the query logic tree and a data storage structure of the data storage system.
14. The apparatus of claim 13, wherein the data storage structure comprises a data table structure of a data storage system and relationships between data tables.
15. The apparatus of claim 12, wherein the query logic determination unit is further configured to:
and generating a query logic tree based on the query condition, the target data item and the predefined logic node.
16. The apparatus of claim 15, wherein the logical nodes comprise a query node, a filter condition node, and a query data item node, wherein the filter condition node and the query data item node are child nodes of the query node;
a query logic determination unit further configured to:
determining a plurality of query steps which are sequentially progressive, and query data items and screening conditions which respectively correspond to the query steps based on the query conditions and the target data items;
determining query nodes corresponding to the query steps, screening condition nodes and query data item nodes owned by the query nodes and membership between the query nodes based on the query steps, the sequence of the query steps, the query data items and the screening conditions;
and generating a query logic tree according to the query nodes and the subordination relationship among the query nodes.
17. The apparatus of claim 16, wherein the query logic tree further comprises a filter condition group node and a query data item group node; the screening condition group node is used as a father node of a plurality of screening condition nodes corresponding to each query step and is used as a child node of a query node corresponding to each query step, and the query data item group node is used as a father node of a plurality of query data item nodes and is used as a child node of the query node corresponding to each query step.
18. The apparatus of claim 16, wherein the query logic tree further comprises, as child nodes of the query data item node, a data item operation function node corresponding to an operation function applied to the query data item, and a data item identification node corresponding to an identification of the query data item.
19. The apparatus of claim 16, wherein the query result dataset corresponding to the query node is a union/intersection/difference set of a plurality of sub-query datasets, and the query node has sub-query nodes corresponding to the plurality of sub-query datasets respectively and a union operation sub-node corresponding to a union/intersection/difference operation between the sub-query datasets.
20. The apparatus according to claim 16, wherein the query node has a pre-step sub-node for indicating the query node corresponding to the pre-step of the query step corresponding to the query node;
a query logic determination unit further configured to:
determining a pre-step child node for each query node based on the order of the query steps.
21. The apparatus of claim 20, wherein the pre-step child node has a query child node corresponding to the other query node to which the previous query step corresponds, or a no-pre-step identification child node indicating that there is no previous query step.
22. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
23. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-11.
CN202210058343.8A 2022-01-19 2022-01-19 Data query method and device Active CN114090627B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210058343.8A CN114090627B (en) 2022-01-19 2022-01-19 Data query method and device
CN202210873120.7A CN115221198A (en) 2022-01-19 2022-01-19 Data query method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210058343.8A CN114090627B (en) 2022-01-19 2022-01-19 Data query method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210873120.7A Division CN115221198A (en) 2022-01-19 2022-01-19 Data query method and device

Publications (2)

Publication Number Publication Date
CN114090627A CN114090627A (en) 2022-02-25
CN114090627B true CN114090627B (en) 2022-05-31

Family

ID=80308511

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210873120.7A Pending CN115221198A (en) 2022-01-19 2022-01-19 Data query method and device
CN202210058343.8A Active CN114090627B (en) 2022-01-19 2022-01-19 Data query method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202210873120.7A Pending CN115221198A (en) 2022-01-19 2022-01-19 Data query method and device

Country Status (1)

Country Link
CN (2) CN115221198A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451153A (en) * 2016-05-31 2017-12-08 北京京东尚科信息技术有限公司 The method and apparatus of export structure query statement
CN107885786A (en) * 2017-10-17 2018-04-06 东华大学 Towards the Natural Language Query Interface implementation method of big data
CN109947794A (en) * 2019-02-21 2019-06-28 东华大学 A kind of interactive natural language inquiry conversion method
CN110727839A (en) * 2018-06-29 2020-01-24 微软技术许可有限责任公司 Semantic parsing of natural language queries
WO2021189195A1 (en) * 2020-03-23 2021-09-30 深圳市欢太科技有限公司 Data querying method and apparatus, server, and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080235199A1 (en) * 2007-03-19 2008-09-25 Yunyao Li Natural language query interface, systems, and methods for a database
CN103577590A (en) * 2013-11-12 2014-02-12 北京润乾信息系统技术有限公司 Data query method and system
KR101661198B1 (en) * 2014-07-10 2016-10-04 네이버 주식회사 Method and system for searching by using natural language query
CN104657439B (en) * 2015-01-30 2019-12-13 欧阳江 Structured query statement generation system and method for precise retrieval of natural language
CN104657440B (en) * 2015-01-30 2020-05-15 欧阳江 Structured query statement generation system and method
WO2016174682A1 (en) * 2015-04-29 2016-11-03 Yellai Mahesh Method for generating visual representations of data based on controlled natural language queries and system thereof
US11036726B2 (en) * 2018-09-04 2021-06-15 International Business Machines Corporation Generating nested database queries from natural language queries
WO2020198319A1 (en) * 2019-03-25 2020-10-01 Jpmorgan Chase Bank, N.A. Method and system for implementing a natural language interface to data stores using deep learning
CN112580357A (en) * 2019-09-29 2021-03-30 微软技术许可有限责任公司 Semantic parsing of natural language queries
CN113032418B (en) * 2021-02-08 2022-11-11 浙江大学 Method for converting complex natural language query into SQL (structured query language) based on tree model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GIS中文查询系统中SQL语句的形成;徐爱萍等;《测绘科学》;20061020;第31卷(第05期);第110-113,107页 *
自然语言生成多表SQL查询语句技术研究;曹金超等;《计算机科学与探索》;20191014;第14卷(第07期);第1133-1141页 *

Also Published As

Publication number Publication date
CN115221198A (en) 2022-10-21
CN114090627A (en) 2022-02-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant