CN109947794B - Interactive natural language query conversion method - Google Patents

Interactive natural language query conversion method Download PDF

Info

Publication number
CN109947794B
CN109947794B CN201910129037.7A CN201910129037A CN109947794B CN 109947794 B CN109947794 B CN 109947794B CN 201910129037 A CN201910129037 A CN 201910129037A CN 109947794 B CN109947794 B CN 109947794B
Authority
CN
China
Prior art keywords
query
node
natural language
clause
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910129037.7A
Other languages
Chinese (zh)
Other versions
CN109947794A (en
Inventor
王梅
陈德华
潘乔
李继云
王丽敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN201910129037.7A priority Critical patent/CN109947794B/en
Publication of CN109947794A publication Critical patent/CN109947794A/en
Application granted granted Critical
Publication of CN109947794B publication Critical patent/CN109947794B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an interactive natural language query conversion method, which comprises the following steps: semantic analysis; mapping nodes; pattern matching; predefining a function operation; inquiring and interacting; and (5) result interaction. Aiming at the difficulty of non-professional user query and database use and the semantic gap between the ambiguity and abstract of the natural language in expression and the accuracy and certainty of the structured query language in big data-oriented application, the invention analyzes the natural language query description input by the user, establishes the corresponding relation between the semantic source and the database table as well as the fields, generates basic query, adds function operation on the basis to obtain the final query, combines the traditional natural language query interface with the interactive query, so that the common user can query the database in the natural language description mode, and simultaneously better captures the query intention of the user. By defining the interaction function and the result feedback mechanism, the accuracy and the high efficiency of complex query conversion are improved.

Description

Interactive natural language query conversion method
Technical Field
The invention relates to a method for interactively converting natural language query into structured query language SQL, belonging to the technical field of big data processing.
Background
With the continuous increase of the data size and the continuous increase of the data openness, how to promote the public, especially the non-professional personnel, the ability to use the data, provide friendly and convenient query and search service, and become the key problem to be solved. The research of the problem has important significance for promoting the data opening process and the large data development of China.
Currently, the standard query language for relational datasets is the structured query language (Structured Query Language, SQL). Although structured query language can accurately express the query intention of the user, writing structured query sentences requires the user to be skilled in grasping the query language used and is familiar with the structure of tables in a database, etc.
Even for professional users, it is difficult to understand the complex professional query language grammar, let alone the application-oriented average user. Natural language, on the other hand, is a natural form of human expression demand. If the natural language query can be automatically converted into SQL, the query requirement of the user can be conveniently met. However, the ambiguity and abstract nature of natural language in expression and the accuracy and certainty of structured query language present significant challenges for accurate query transformations. On the one hand, natural language query descriptions often have conciseness and ambiguity, whereas SQL is converted into a query execution plan, contains accurate specific information, has strong semantic inequality, has poor direct one-to-one conversion accuracy, and is particularly difficult to directly convert complex query intentions containing sub-queries and the like. More importantly, the SQL language is suitable for determining the query intention, but for the common user, the SQL language is not a data producer, and the query intention is difficult to determine in advance on the premise of not knowing the data, and complex interactive query is often needed to realize. To this end, it is necessary to design and implement interactive natural language query transformations.
Disclosure of Invention
The invention aims to solve the technical problems that: aiming at the query intention of the user natural language description, how to automatically convert the natural language query into SQL sentences, an interactive function and a result feedback mechanism are designed in the conversion process, the accuracy and the high efficiency of complex query conversion are improved, and the convenience of non-database professional users for accessing data is improved.
In order to solve the technical problems, the technical scheme of the invention is to provide an interactive natural language query conversion method, which is characterized by comprising the following steps:
step 1, semantic analysis;
splitting a natural language query sentence input by a user into individual words, and analyzing and processing the query sentence by a natural language analysis tool to obtain a semantic dependency tree capable of expressing complex semantic relationships, wherein leaf nodes are words in the sentence, and non-leaf nodes represent semantic dependency relationships among the nodes;
step 2, node mapping;
mapping the nodes in the semantic tree to the corresponding node types, and discarding the meaningless segmentation;
step 3, pattern matching;
establishing connection among nodes completing mapping according to a database structure stored in advance, and finding out connection relation among tables;
step 4, generating a template T of the basic query bq
Step 5, predefining function operation;
according to the syntax of SQL, defining 5 function operation types, including field selection operation, screening condition operation, grouping operation, screening operation after grouping and sorting operation;
step 6, inquiring and interacting;
the user inputs the added operation description in a natural language form, a mapping pair is obtained through a word segmentation and node mapping method, the system converts the mapping pair into a specific function operation according to a defined grammar rule, the incremental query is further obtained through conversion on the basis of the preposed query, and the final query is obtained through continuous iteration;
step 7, result interaction;
the system returns the added condition of each operation to the user, and the user judges the correctness of the natural language conversion to a certain extent by checking the converted SQL and the returned result of the SQL executed in the database, including whether the converted SQL is the correct SQL conforming to the syntax and whether the converted SQL meets the semantic requirement of the user; through interaction and feedback of the intermediate query result, the user can know the data returned by the current query, and perform the next adding operation according to the data, so as to assist the user to acquire information in a progressive information acquisition mode.
Preferably, in step 1, in order to ensure that the specific vocabulary in the database used can be correctly segmented, a corresponding auxiliary corpus is constructed according to the database, and the corpus contains the specific vocabulary which appears in the database but is segmented into more than one word by the segmentation tool.
Preferably, the step 2 specifically includes:
step 2.1: mapping the nodes in the semantic tree to the corresponding node types;
firstly, classifying node types ON a semantic tree, including a selection node SN, AN operation node ON, a logic node LN, a function node FN, a sequencing node ODN, a grouping node GN, a post-grouping screening node HN, a table name node TN, a field name node AN and a value node VN, so that the node types correspond to corresponding parts in SQL;
step 2.2: creating a value-attribute inverted index for the value node;
acquiring values of fields in a database, recording fields corresponding to each value to obtain an inverted list item set, sorting according to a dictionary sequence of the values to obtain a first-level inverted index, classifying the first-level inverted index, and storing the index by using a data structure of a B+ tree;
step 2.3, node type mapping;
for the word segmentation result, firstly judging whether the word is an enumeration type node, then judging whether the word is a database table name or attribute name node, and if not, finally judging whether the word is a value node; for the value node, searching a field corresponding to the obtained value according to a value-attribute mapping method based on the inverted index, and establishing mapping from the value to the attribute; and determining the node type of each word and the data table related to query through a node mapping stage.
Preferably, the step 3 specifically includes:
step 3.1: extracting and storing the structure of the database, defining a database schema graph DSG for representing the main key and the field owned by each table and the connection relation of the main external key between the tables;
DSG is expressed as:wherein (1)>The set representing the concept of the database consists of two parts, the table name node +.>And attribute node->Likewise, the->Representing relationships between concepts, including the dependency edge->And major foreign bond relationship edge->Two types of->By the name->Point to the field contained in the table +.>By fieldPoint to the name +.>For indicating that there is a primary foreign key relationship between the two tables, and the field is the foreign key of the pointed table;
step 3.2, obtaining the pattern diagram of the current query
Default each edge of the original DSGFor a set of node map pairs queried, when a field appears, setting a weight of 1 for the dependency edge pointing to the field; when two or more tables appear, setting the main external key relation side between the tables as 1; if the connection between the two tables needs the third table as the intermediary, the corresponding connection edge of the intermediary table is also set to be 1, and the pattern diagram corresponding to the current query is obtained according to the edge with the 1 weightAnd generates a base query based on this graph.
Preferably, in the step 4, the template T of the basic query bq See table 1;
table 1 basic query template
Table 1 shows the table i Representing table names and tables in a database i PKey represents a table i Is included in the connection field of (a).
Generated according to step 3Determining the table involved in the query after the From clause, while in the white clause, the query structure is ++>The main external key edge in the list obtains the Join relation between the lists, and a Basic query is generated.
Preferably, the step 5 specifically includes:
step 5.1, defining a selection field operation: select (f||func (F), pre_q);
based on the Pre-query pre_q, adding the selected field F or the aggregation operation result Func (F) of the field F, wherein the Func (F) operation is divided into the following steps according to the aggregation function in the SQL: SUM () summing function, COUNT () counting function, MAX () maximum function, MIN () minimum function, and AVG () averaging function; defining a grammar rule of a select clause (SeleClause) as shown in formula (4-1);
SeleClause=SN+AN|SN+AN+FN|SN+FN+AN (4-1)
step 5.2, defining screening condition operation: filter (FOP value F OP sub_Q, pre_Q);
the Where operation is divided into an explicit screening condition and a nested Sub-query condition, OP represents an operation symbol, and Sub_Q represents a nested Sub-query; defining a conditional clause (CondClause) grammar rule as shown in formula (4-2);
CondClause=AN+ON+VN|ON+VN|CondClause
+LN+CondClause|AN+ON+Sub Q (4-2)
if the Where clause does not exist in the Pre_Q, adding the Where clause and the newly obtained screening condition, if the Where clause exists in the Pre_Q, directly adding the newly obtained condition to the back of the original condition, and judging the connection of the screening condition in terms of ' and ' or ' through logic words;
step 5.3, defining grouping operation: group by (F, pre_Q);
on the basis of Pre_Q, adding grouping operation, generating a Group by clause, if the Group by clause does not exist in the Pre_Q, adding a Group by () clause, otherwise, directly adding a new field into the original Group by clause; defining a group operation (GroClause) grammar rule as shown in formula (4-3):
GroClause=AN+GN (4-3);
step 5.4, defining screening operation after grouping: having (Fun (F) OP value, pre_q);
selecting screening operation after adding the grouping on the basis of Pre_Q, and adding the obtained Having clause to the back of the Group by clause; considering the usual HavingClause with aggregation function, value is a specific number NUM, defining a grammar rule as shown in the screening clause (HavingClause) after grouping of formula (4-4):
HavingClause=HN+FN+AN+ON+NUM (4-4);
step 5.5, defining a sorting operation: order by (F||Func (F), pre_Q);
for the sorting operation in the SQL grammar, a sorting function is defined, when the interactive operation input by the user is that the selection result is sorted according to a certain field, the sorting function is triggered, and the sorting clause is at the end of the whole SQL sentence, a sorting operation (OrderClause) grammar rule shown in a formula (4-5) is defined:
OrderClause=ODN+AN+ACS|DESC (4-5)。
the invention provides a method for interactively converting natural language query into Structured Query Language (SQL). Aiming at the difficulties of non-professional user query and database use in data open sharing application facing big data, and the ambiguity and abstract of natural language in expression and the accuracy and deterministic semantic gap of structured query language, the invention adopts the method of analyzing the natural language query description input by a user, establishing the corresponding relation between the original meaning and the database table and fields, generating basic query, adding function operation on the basis to obtain the final query, and combines the traditional natural language query interface and interactive query, so that the ordinary user can query the database in the natural language description mode, and simultaneously better capture the user query intention. By defining the interaction function and the result feedback mechanism, the accuracy and the high efficiency of complex query conversion are improved.
Drawings
FIG. 1 is a database DSG diagram;
FIG. 2 is a pattern diagram corresponding to query Q
FIG. 3 is a template of a base query;
FIG. 4 is a diagram of an example progressive query;
fig. 5 is a schematic diagram of a progressive query result.
Detailed Description
The invention will be further illustrated with reference to specific examples.
The embodiment provides an interactive natural language query conversion method, which comprises the steps of firstly analyzing natural language query description input by a user, establishing a corresponding relation between a meaning and a database table and a field, and generating a basic query. Based on basic inquiry, user interaction is introduced again, the user inputs an interaction function, the system automatically generates new inquiry and returns an inquiry result to the user. The user may update the query based on the query results. The overall steps are as follows:
and step 1, semantic analysis. The natural language query sentence input by the user is segmented into individual words. After the query sentence is analyzed and processed by a natural language analysis tool, a semantic dependency tree capable of expressing complex semantic relations is obtained, wherein leaf nodes are words in the sentence, and non-leaf nodes represent semantic dependency relations among the nodes. In order to ensure that the specific vocabulary in the database can be correctly segmented, a corresponding auxiliary corpus is constructed according to the database, and the corpus contains the specific vocabulary which appears in the database but is segmented into more than one word by a word segmentation tool. For example, a plurality of professional medical terms exist in a medical data set, such as thyroidectomy, renal failure, glycerinum fructose needle and the like, attribute values of all fields in the adopted data set are analyzed in advance, fields of words with stronger professionals are extracted, a special medical dictionary auxiliary corpus is established, and word segmentation accuracy is improved. For query statement Q, a patient with low hemoglobin content among patients with hyperthyroidism, after the auxiliary corpus is established in advance, "thyroid", "functional" and "hyperthyroidism" are not divided into three words and can be recognized as one specialized vocabulary. The query example sentence Q can be obtained after the first step of semantic analysis: query/hyperthyroidism/patient/medium/hemoglobin/index/content/low/patient.
And 2, node mapping. Nodes in the semantic tree are mapped to corresponding node types while meaningless segmentations such as "medium" are discarded. The method comprises the following steps:
in step 2.1, in order to better understand the semantic tree from the database perspective, the nodes in the semantic tree are mapped to the corresponding node types, and the node types ON the semantic tree are firstly classified, including a selection node SN, AN operation node ON, a logic node LN, a function node FN, a sorting node ODN, a grouping node GN, a post-grouping screening node HN, a table name node TN, a field name node AN and a value node VN, so that the nodes can correspond to the corresponding parts in the SQL.
And 2.2, creating a value-attribute inverted index for the value node. The method comprises the steps of obtaining values of fields in a database, recording the fields corresponding to each value to obtain an inverted list item set, sorting according to the dictionary sequence of the values to obtain a first-level inverted index, classifying the first-level inverted index, and storing the index by using a data structure of a B+ tree. Such as: { hyperthyroidism- > (diagnosis. ICD name, pathology report. Main diagnosis. Pathology report. Outpatient diagnosis), white blood cell- > -test index. Index name, 3.8778- > -test index. Test result }.
And 2.3, mapping the node types. For the word segmentation result, firstly judging whether the word is an enumeration type node, then judging whether the word is a database table name or attribute name node, if not, finally judging whether the word is a value node, and for the value type node, searching a field corresponding to the obtained value according to a value-attribute mapping method based on an inverted index, and establishing a mapping from the value to the attribute. And determining the node type of each word and the data table related to query through a node mapping stage. For example, for the word segmentation result of query Q in step 1, a mapping pair may be obtained: (1) query->Select, (2) hyperthyroidism->ICD name, (3) hemoglobin->Index name (4) low->Abnormal prompt for patient->Patient basic information table. Wherein (1) belongs to enumeration key mapping Value SN (2) (3) (4) Value node mapping Value vN (5) node mapping Value belonging to table name TN Meanwhile, a basic information table, a diagnosis table and a check index table related to the patient table are determined and inquired according to the mapping result.
And 3, pattern matching. And establishing the relation among the nodes which complete the mapping according to the database structure stored in advance, and finding the Join connection relation among the tables. The method comprises the following steps:
and 3.1, extracting and storing the structure of the database, and defining a database schema graph DSG (Database Schema Graph) for representing the main keys and fields owned by each table and the connection relation of the main external keys between the tables. D (D)SG can be expressed as:wherein (1)>The set representing the concept of the database consists of two parts, the table name node +.>And attribute node->Likewise, the->Representing relationships between concepts, including the dependency edge->And major foreign bond relationship edge->Two types of->By the name->Point to the field contained in the table +.>By field->Pointing to a table nameFor indicating that there is a primary foreign key relationship between the two tables and that this field is the foreign key of the table pointed to. As in figure 1 is a simple database DSG (to make the figure simplerClearly, the fields contained in each table are not all shown).
Step 3.2, obtaining the pattern diagram of the current queryDefault that the weight of each side of the original DSG is 0, for a node mapping pair set of a query, when a field appears, the weight is set to be 1 for the subordinate relation side pointing to the field, when two or more tables appear, the weight is set to be 1 for the main external key relation side between the tables, if the connection between the two tables needs a third table as an intermediary, the corresponding connection side of the intermediary table is also set to be 1, and a pattern diagram corresponding to the current query can be obtained according to the side with the weight of 1>And generates a base query based on this graph. For example, query Q in step 1 corresponds to pattern +.>As shown in fig. 2.
And 4, generating a basic query. Template T of basic query bq See fig. 3. Generated according to step 3The table involved in the query after the From clause can be determined, while in the white clause, the query structure is ++>The primary and outer key edges in the table can obtain Join relations among the tables, and Basic query is generated.
For example, query statement Q is generated in step 3 from the base query templateThe graph may get its corresponding base query:
Select*
from patient table basic information, visit, examination report, test index
The Where patient table basic information, medical card number = visit, medical card number
And, treating the patient, and reporting the patient
And inspection report number = test index report number
And 5, predefining function operation. According to the syntax of SQL, 5 kinds of function operation types are defined, including a field selection operation, a screening condition operation, a grouping operation, a screening operation after grouping, and a sorting operation. The method comprises the following steps:
step 5.1, defining a selection field operation: select (F||Func (F), pre_Q)
On the basis of the Pre-query pre_q (i.e., an add operation on the basis of this query), a selected field F or an aggregate operation result Func (F) on the field F is added, which adds the selected field to the "Select" word, defining a grammar rule as shown in formula 4-1.
SeleClause=SN+AN|SN+AN+FN|SN+FN+AN (4-1)
Func (F) operations are divided into according to the aggregation function in SQL: SUM () summing function, COUNT () counting function, MAX () maximum function, MIN () minimum function, and AVG () averaging function. For example, the user inputs "Select age of patient", and based on the word segmentation and node mapping results, a "Select patient basic information, age" clause will be obtained.
Step 5.2, defining screening condition operation: filter (FOP value F OP Sub Q, pre Q)
The Where operation is divided into an explicit screening condition and a nested Sub-query condition, OP represents an operation symbol, and sub_Q represents a nested Sub-query. Grammar rules are defined as shown in formula 4-2.
CondClause=AN+ON+VN|ON+VN|CondClause
+LN+CondClause|AN+ON+Sub Q (4-2)
If the Where clause does not exist in the Pre_Q, the Where clause and the newly obtained screening condition are added, if the Where clause exists in the Pre_Q, the newly obtained condition is directly added to the back of the original condition, and the screening condition is judged to be connected with ' and ' or ' through a logic word. For example, a user may input "patient with operation name of double-first full-cut" and may get "white operation.
Step 5.3, defining grouping operation: group by (F, pre_Q)
On the basis of Pre_Q, adding grouping operation, generating a Group by clause, if the Group by clause does not exist in the Pre_Q, adding the Group by () clause, otherwise, directly adding a new field into the original Group by clause. Grammar rules are defined as shown in formulas 4-3.
GroClause=AN+GN (4-3)
For example, a user input "Group departments" will get a "Group by" clause.
Step 5.4, defining screening operation after grouping: having (Fun (F) OP value, pre_Q)
And selecting screening operation after adding the packet on the basis of Pre_Q, and adding the obtained suspension clause to the back of the Group by clause. Considering the usual Having clause with an aggregation function, value is a specific number NUM, defining the grammar rules as shown in formulas 4-4.
HavingClause=HN+FN+AN+ON+NUM (4-4)
For example, the user inputs "patients with more than 3 medications", a "Having count (medication name > 3)" may be obtained and added to the Group by clause.
Step 5.5, defining a sorting operation: orderby (F|Func (F), pre_Q)
For the sorting operation in the SQL grammar, a sorting function is defined, when the interactive operation input by the user is that the selection result is sorted according to a certain field, the sorting function is triggered, and the sorting clause is at the end of the whole SQL sentence, and the grammar rule shown in the formulas 4-5 is defined.
OrderClause=ODN+AN+ACS|DESC (4-5)
For example, the user inputs "Order descending Order of patient's age", will get the "Order by patient basic information.
And 6, inquiring and interacting. The user inputs the added operation description in a natural language form, the mapping pair is obtained through a word segmentation and node mapping method, the system converts the mapping pair into specific function operation according to defined grammar rules, the incremental query is further obtained through conversion on the basis of the preposed query, and the final query is obtained through continuous iteration. For example, for the query sentence "query name of patient with low hemoglobin content in hyperthyroidism patient", the basic query pre_q may be obtained through step 1, step 2 and step 3:
“Select*
from patient table basic information, visit, examination report, test index
The Where patient table basic information, medical card number = visit, medical card number
And, treating the patient, and reporting the patient
And inspection report number = test index report number'
The user inputs the added operation description through interaction, such as ' selecting the name of a patient ', ' screening ICD name is hyperthyroidism ', ' index name is hemoglobin ', ' abnormal prompt is low ', the corresponding selection condition ' Select patient table basic information, name ', screening condition ' visit ', ' ICD name= ' hyperthyroidism ', ' test index ', ' index name= ' hemoglobin ', ' test index ', ' abnormal prompt= ' abnormal prompt is low ', and the operation description is sequentially added into the pre-query according to the grammar rule of the operation, so that the final query can be obtained:
"Select patient table basic information. Name
From patient table basic information, visit, examination report, test index
The Where patient table basic information, medical card number = visit, medical card number
And, treating the patient, and reporting the patient
And inspection report number = test index report number
And visit ICD name = 'hyperthyroidism'
And test index name = 'hemoglobin'
And test index anomaly prompt = 'low'
And 7, result interaction. The system can return the added condition of each operation to the user, and the user can judge the correctness of the natural language conversion to a certain extent by checking the converted SQL and the returned result of the SQL executed in the database, including whether the converted SQL is the correct SQL conforming to the syntax and whether the converted SQL meets the semantic requirement of the user. Through interaction and feedback of the intermediate query result, the user can know the data returned by the current query, and perform the next adding operation according to the data, so as to assist the user to acquire information in a progressive information acquisition mode.
For example, in the progressive query example in fig. 4, after obtaining the SQL statement of the query Q1 "the query is performed for hyperthyroidism patient check record more than 2 times," the user may execute and return the result in the database, and through simple analysis on the query result, the user may continue to add operations on the query, and Q2 and Q3 in fig. 3 are respectively added with "check index" index name = 'TRAB' "and" check index "on the basis of Q1. And FIG. 5 shows the results of partial queries performed by Q1, Q2, Q3 in the system, to better assist the user in obtaining the final desired query result in a progressive manner through interaction of the results.
While the invention has been described with respect to preferred embodiments thereof, it will be understood by those skilled in the art that various modifications and additions may be made without departing from the scope of the invention. Equivalent embodiments of the present invention will be apparent to those skilled in the art having the benefit of the teachings disclosed herein, when considered in the light of the foregoing disclosure, and without departing from the spirit and scope of the invention; meanwhile, any equivalent changes, modifications and evolution of the above embodiments according to the essential technology of the present invention still fall within the scope of the technical solution of the present invention.

Claims (5)

1. An interactive natural language query conversion method is characterized by comprising the following steps:
step 1, semantic analysis;
dividing a natural language query sentence input by a user into individual words, wherein the words comprise words in the sentence and meaningless words, analyzing and processing the query sentence by a natural language analysis tool to obtain a semantic dependency tree capable of expressing complex semantic relations, wherein leaf nodes are words in the sentence, and non-leaf nodes represent semantic dependency relations among the nodes;
step 2, node mapping;
mapping the nodes in the semantic tree to the corresponding node types, and discarding the meaningless segmentation;
step 3, pattern matching;
establishing connection among nodes completing mapping according to a database structure stored in advance, and finding out connection relation among tables;
step 4, generating a template T of the basic query bq
Step 5, predefining function operation;
according to the syntax of SQL, defining 5 function operation types, including field selection operation, screening condition operation, grouping operation, screening operation after grouping and sorting operation;
step 6, inquiring and interacting;
the user inputs the added operation description in a natural language form, a mapping pair is obtained through a word segmentation and node mapping method, the system converts the mapping pair into a specific function operation according to a defined grammar rule, the incremental query is further obtained through conversion on the basis of the preposed query, and the final query is obtained through continuous iteration;
step 7, result interaction;
the system returns the added condition of each operation to the user, and the user judges the correctness of the natural language conversion by checking the converted SQL and the returned result of the SQL executed in the database, including whether the converted SQL is the correct SQL conforming to the syntax and whether the converted SQL meets the semantic requirement of the user; through interaction and feedback of the intermediate query result, the user can know the data returned by the current query and perform the next adding operation according to the data, and the user is assisted to acquire information in a progressive information acquisition mode;
the step 2 specifically includes:
step 2.1: mapping the nodes in the semantic tree to the corresponding node types;
firstly, classifying node types ON a semantic tree, including a selection node SN, AN operation node ON, a logic node LN, a function node FN, a sequencing node ODN, a grouping node GN, a post-grouping screening node HN, a table name node TN, a field name node AN and a value node VN, so that the node types correspond to corresponding parts in SQL;
step 2.2: creating a value-attribute inverted index for the value node;
acquiring values of fields in a database, recording fields corresponding to each value to obtain an inverted list item set, sorting according to a dictionary sequence of the values to obtain a first-level inverted index, classifying the first-level inverted index, and storing the index by using a data structure of a B+ tree;
step 2.3, node type mapping;
for the word segmentation result, firstly judging whether the word is an enumeration type node, then judging whether the word is a database table name or attribute name node, and if not, finally judging whether the word is a value node; for the value node, searching a field corresponding to the obtained value according to a value-attribute mapping method based on the inverted index, and establishing mapping from the value to the attribute; and determining the node type of each word and the data table related to query through a node mapping stage.
2. The interactive natural language query conversion method of claim 1, wherein: in step 1, in order to ensure that the specific vocabulary in the database can be correctly segmented, a corresponding auxiliary corpus is constructed according to the database, and the corpus contains the specific vocabulary which appears in the database but is segmented into more than one word by a word segmentation tool.
3. The interactive natural language query conversion method of claim 1, wherein: the step 3 specifically includes:
step 3.1: extracting and storing the structure of the database, defining a database schema graph DSG for representing the main key and the field owned by each table and the connection relation of the main external key between the tables;
DSG is expressed as:wherein (1)>The set representing the concept of the database consists of two parts, the table name node +.>And attribute node->Likewise, the->Representing relationships between concepts, including dependency edgesAnd major foreign bond relationship edge->Two types of->By the name->Point to the field contained in the table +.> By field->Point to the name +.>For indicating that there is a primary foreign key relationship between the two tables, and the field is the foreign key of the pointed table;
step 3.2, obtaining the pattern diagram of the current query
Default the weight of each edge of the original DSG is 0, for a queried node mapping pair set, when a field appears, set the weight for the subordinate relation edge pointing to the field as 1; when two or more tables appear, setting the main external key relation side between the tables as 1; if the connection between the two tables needs the third table as the intermediary, the corresponding connection edge of the intermediary table is also set to be 1, and the pattern diagram corresponding to the current query is obtained according to the edge with the 1 weightAnd generates a base query based on this graph.
4. The interactive natural language query conversion method of claim 1, wherein: in the step 4, the template T of the basic query bq The following are provided:
Select*
From table i ,…,table k the// query involving a data table
Wheretable i .PKey=table j .PKey
And…
And table j .PKey=table k PKey// selection table Join connection conditions
Generated according to step 3After determining the From clause, the table involved is queried, while in the while clause, the query pattern is +.>The main external key edge in the list obtains the Join relation between the lists, and a Basic query is generated.
5. The interactive natural language query conversion method of claim 1, wherein: the step 5 specifically includes:
step 5.1, defining a selection field operation: select (f||func (F), pre_q);
based on the Pre-query pre_q, adding the selected field F or the aggregation operation result Func (F) of the field F, wherein the Func (F) operation is divided into the following steps according to the aggregation function in the SQL: SUM () summing function, COUNT () counting function, MAX () maximum function, MIN () minimum function, and AVG () averaging function; defining a grammar rule as shown in formula (4-1);
SeleClause=SN+AN|SN+AN+FN|SN+FN+AN (4-1)
step 5.2, defining screening condition operation: filter (FOP value F OP sub_Q, pre_Q);
the Where operation is divided into an explicit screening condition and a nested Sub-query condition, OP represents an operation symbol, and Sub_Q represents a nested Sub-query; defining a grammar rule as shown in formula (4-2);
CondClause=AN+ON+VN|ON+VN|CondClause+LN+CondClause|AN+ON+Sub_Q (4-2)
if the Where clause does not exist in the Pre_Q, adding the Where clause and the newly obtained screening condition, if the Where clause exists in the Pre_Q, directly adding the newly obtained condition to the back of the original condition, and judging the connection of the screening condition in terms of ' and ' or ' through logic words;
step 5.3, defining grouping operation: group by (F, pre_Q);
on the basis of Pre_Q, adding grouping operation, generating a Group by clause, if the Group by clause does not exist in the Pre_Q, adding a Group by () clause, otherwise, directly adding a new field into the original Group by clause; defining a grammar rule as shown in formula (4-3):
GroClause=AN+GN (4-3);
step 5.4, defining screening operation after grouping: having (Fun (F) OP value, pre_q);
selecting screening operation after adding the grouping on the basis of Pre_Q, and adding the obtained Having clause to the back of the Group by clause; considering the usual Having clause with an aggregation function, value is a specific number NUM, defining a grammar rule as shown in formula (4-4):
HavingClause=HN+FN+AN+ON+NUM (4-4);
step 5.5, defining a sorting operation: order by (F||Func (F), pre_Q);
for the sorting operation in the SQL grammar, a sorting function is defined, when the interactive operation input by the user is that the selection result is sorted according to a certain field, the sorting function is triggered, and the sorting clause is at the end of the whole SQL sentence, and the grammar rule shown in the formula (4-5) is defined:
OrderClause=ODN+AN+ACS|DESC (4-5)。
CN201910129037.7A 2019-02-21 2019-02-21 Interactive natural language query conversion method Active CN109947794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910129037.7A CN109947794B (en) 2019-02-21 2019-02-21 Interactive natural language query conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910129037.7A CN109947794B (en) 2019-02-21 2019-02-21 Interactive natural language query conversion method

Publications (2)

Publication Number Publication Date
CN109947794A CN109947794A (en) 2019-06-28
CN109947794B true CN109947794B (en) 2023-09-01

Family

ID=67006952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910129037.7A Active CN109947794B (en) 2019-02-21 2019-02-21 Interactive natural language query conversion method

Country Status (1)

Country Link
CN (1) CN109947794B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035506A (en) * 2019-10-28 2020-12-04 竹间智能科技(上海)有限公司 Semantic recognition method and equipment
CN111008309B (en) * 2019-12-06 2023-08-08 北京百度网讯科技有限公司 Query method and device
CN111190920B (en) * 2019-12-30 2023-09-15 南京诚勤教育科技有限公司 Data interaction query method and system based on natural language
CN111324631B (en) * 2020-03-19 2022-04-22 成都海天数联科技有限公司 Method for automatically generating sql statement by human natural language of query data
CN111414380B (en) * 2020-03-20 2023-06-13 华泰证券股份有限公司 Method, equipment and storage medium for generating SQL (structured query language) sentences of Chinese database
CN111831626A (en) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 Graph structure generation method of database logical relation, data query method and device
CN112507098B (en) * 2020-12-18 2022-01-28 北京百度网讯科技有限公司 Question processing method, question processing device, electronic equipment, storage medium and program product
CN114138817A (en) * 2021-12-03 2022-03-04 中国建设银行股份有限公司 Data query method, device, medium and product based on relational database
CN114090627B (en) * 2022-01-19 2022-05-31 支付宝(杭州)信息技术有限公司 Data query method and device
CN115329753B (en) * 2022-10-13 2023-03-24 北京谊慧信息技术有限公司 Intelligent data analysis method and system based on natural language processing
CN116992888A (en) * 2023-09-25 2023-11-03 天津华来科技股份有限公司 Data analysis method and system based on natural semantics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106293725A (en) * 2016-08-04 2017-01-04 深圳市微我科技有限公司 A kind of natural language hybrid programming method based on rear realization
CN107885786A (en) * 2017-10-17 2018-04-06 东华大学 Towards the Natural Language Query Interface implementation method of big data
CN107943481A (en) * 2017-05-23 2018-04-20 清华大学 C programmer code specification building method based on multi-model
CN109241076A (en) * 2018-08-01 2019-01-18 上海依图网络科技有限公司 A kind of data query method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106293725A (en) * 2016-08-04 2017-01-04 深圳市微我科技有限公司 A kind of natural language hybrid programming method based on rear realization
CN107943481A (en) * 2017-05-23 2018-04-20 清华大学 C programmer code specification building method based on multi-model
CN107885786A (en) * 2017-10-17 2018-04-06 东华大学 Towards the Natural Language Query Interface implementation method of big data
CN109241076A (en) * 2018-08-01 2019-01-18 上海依图网络科技有限公司 A kind of data query method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
受限领域自然语言数据库查询接口研究;余正涛 等;《昆明理工大学学报(理工版)》;20040831;第29卷(第04期);第133-138页 *

Also Published As

Publication number Publication date
CN109947794A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109947794B (en) Interactive natural language query conversion method
US5870739A (en) Hybrid query apparatus and method
US5873079A (en) Filtered index apparatus and method
US20050154708A1 (en) Information exchange between heterogeneous databases through automated identification of concept equivalence
US11816156B2 (en) Ontology index for content mapping
Lbath et al. Schema inference for property graphs
CN105723366A (en) Method for preparing a system for searching databases and system and method for executing queries to a connected data source
CN110674229A (en) AST-based relational database SQL table relational analysis and display method
CN109857736A (en) The data encoding of hospital's heterogeneous system unitized method and system, equipment, medium
CN111243748A (en) Needle pushing health data standardization system
Ruan et al. QAnalysis: a question-answer driven analytic tool on knowledge graphs for leveraging electronic medical records for clinical research
Sun Methods for automated concept mapping between medical databases
Thamer et al. A Semantic Approach for Extracting Medical Association Rules.
CN117290376A (en) Two-stage Text2SQL model, method and system based on large language model
Sander et al. Integrating terminologies into standard SQL: a new approach for research on routine data
Ren et al. Application of ontology in medical heterogeneous data integration
KR101162468B1 (en) Automatic data store architecture detection
Saripalle Current status of ontologies in Biomedical and Clinical informatics
Safari et al. An enhancement on Clinical Data Analytics Language (CliniDAL) by integration of free text concept search
CN111460173A (en) Method for constructing disease ontology model of thyroid cancer
CN110020428B (en) Method for jointly identifying and normalizing Chinese medicine symptom names based on semi-Markov
El-Sappagh et al. A proposed SNOMED CT ontology-based encoding methodology for diabetes diagnosis case-base
Kirsten et al. Metadata management for data integration in medical sciences
Veerappa et al. Syntax and Table Aware Parsing Based Naturalized Structured Query Language.
CN110188169A (en) A kind of knowledge matching process, system and equipment based on simplified label

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant