CN111324631A - Method for automatically generating sql statement by human natural language of query data - Google Patents
Method for automatically generating sql statement by human natural language of query data Download PDFInfo
- Publication number
- CN111324631A CN111324631A CN202010196066.8A CN202010196066A CN111324631A CN 111324631 A CN111324631 A CN 111324631A CN 202010196066 A CN202010196066 A CN 202010196066A CN 111324631 A CN111324631 A CN 111324631A
- Authority
- CN
- China
- Prior art keywords
- query
- natural language
- condition
- analysis
- field
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2452—Query translation
- G06F16/24522—Translation of natural language queries to structured queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2468—Fuzzy queries
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Automation & Control Theory (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for converting a human natural language of query data into an sql statement. A user inputs a natural sentence, performs word segmentation and stop word removal on the sentence, performs automatic table name analysis, performs further processing on the natural language with the word segmented and the stop word removed and the obtained table name if the table name can be obtained, and feeds back prompt information after fuzzy matching if the natural language is not matched with the stop word; after the table name is obtained, judging whether multi-table association exists or not according to metadata information, if so, carrying out query condition processing according to an association rule, otherwise, directly carrying out single-table query condition processing; in the query condition processing, time analysis processing is respectively carried out on the natural language, and query field analysis, judgment condition analysis, cascade query analysis and query aggregation analysis processing are carried out according to metadata information; finally, query sql is generated. According to the invention, personnel are not required to label data, the investment of human resources is reduced, the expenditure of labor cost is reduced, and the problem of system non-reusability caused by machine learning is solved.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a method for automatically generating sql sentences from human natural language of query data.
Background
With the development of national economy, more and more systems are put into use, and a large amount of data is generated. When a non-professional user wants to browse data, the data is usually displayed in a chart form in advance in a traditional mode, and the data cannot be freely inquired and browsed. At present, in a semantic to sql conversion mode on the market, a large amount of labeling information is collected through machine learning, and sql sentences are generated through long-time learning. In the middle process, a large amount of data needs to be labeled by personnel, and when new data or a new system is accessed, labeling and long-time machine learning are usually carried out again, so that the problems of uncontrollable training time, irreversibility of the system and the like exist.
Disclosure of Invention
The invention aims to provide a method for automatically generating sql sentences from human natural language of query data, aiming at the problems, and the method does not need personnel to label data and solves the problem of system non-reusability caused by machine learning.
A method of automatically generating sql statements from human natural language for querying data, comprising the steps of: s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
s42: analyzing time data of the natural language;
s43: analyzing the aggregation query condition of the natural language subjected to time processing;
s44: analyzing the nouns labeled in the S2 according to the metadata information, wherein the nouns are analyzed by the query field and the query condition field; analyzing the query judgment condition according to the negative word marked in the S2;
s45: analyzing the query aggregation information;
s46: assembling and generating an sql statement according to the table name, the field name, the query condition field, the query judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
Further, stop words do not include negative words.
Further, the condition analysis specifically includes: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words.
Further, the time data analysis specifically includes: and analyzing and finding out data information which represents time in the segmented natural language after condition processing.
Further, the feedback question-asking process includes the following steps: and judging whether the similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step s 1.
Further, the feedback question asking process further includes a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step is carried out, and then the step is carried out in step S4.
The invention has the beneficial effects that: personnel are not needed to label data, investment of human resources is reduced, labor cost is lowered, and the problem that a system cannot be reused due to machine learning is solved.
Drawings
FIG. 1 is a block flow diagram of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
A method of automatically generating sql statements from human natural language for querying data, the method comprising the steps of:
s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
wherein the associated nesting specifically comprises: nested queries are constructed based on the metadata relationships of the tables. For example, if the achievements are address value fields in the address table, but the sales table corresponds to an ID field of the address table, then the mapping can be done. Table 1 shows a metadata specific structure of the table, and table 2 shows a metadata specific structure of the field.
Table_name | Table name, str |
Mapping_name | Mapping name, str |
Maping_value | Mapping value, str |
Table 1 table metadata structure
TABLE 2 field metadata Structure
If the query path graph is a multi-table, judging whether to construct a Cartesian product according to the metadata or construct multi-stage query according to the query path graph.
Cartesian product: if there is no aggregation, the multi-tiered query is converted to a multi-table Cartesian query. And if the aggregation exists, outputting the relation between the columns and the corresponding columns of the outer layer table after the aggregation according to the query path graph. And resolving the multi-layer query according to the feasible query path.
S42: and performing time processing on the natural language after condition processing: the method comprises the steps of firstly constructing a time trigger word dictionary in a certain mode, for example, summarizing the trigger word dictionary from a training corpus or human experience, then constructing a time affix word dictionary according to modifiers around the trigger words, prefixes, suffixes and other components, and finally identifying a time expression through the combination of the trigger words and the affix words. Taking the explanation of "2019, 3, and 4" as an example, wherein "year", "month", and "day" are trigger words, then the time description can be interpreted as a standardized time format "2019/03/04" according to the components of the trigger word prefixes "2019", "3", and "4". Through the time analysis process, all the time descriptions in the query statement can be unified into a standard format to be constructed by a subsequent data model.
S43: aggregating the natural language after the time processing;
polymerization and analysis: judging the aggregation condition, judging whether to need to carry out aggregation analysis according to the regular expression at present, and finding out the corresponding aggregation condition according to the dependency relationship if the natural language input by the user contains aggregation vocabularies such as maximum, minimum, total, sum, average and the like. If "maximum tobacco sales amount" is input, "the" maximum "corresponding aggregation vocabulary is obtained as" max, "the" maximum "modified" sales amount "is known according to the dependency relationship, and the composition aggregation information" max (sales amount) "can be obtained to be spliced by subsequent sql.
S44: assembling and generating an inquiry sql statement according to the table name, the field name, the inquiry condition field, the inquiry judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
Stop words include stop, yes, o, query, look, know, etc. stop words but not stop, not, etc. stop words.
The feedback questioning process comprises the following steps: judging whether a similar table name exists or not, if so, giving a table name prompt; when there is no similar table name, the query record with higher query frequency of the user is obtained according to the query record of the user, the user is prompted to input, and then the step s1 is entered.
The feedback question asking process further comprises a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step S4 is carried out.
The method for automatically generating the sql sentences by the human natural language of the query data does not need personnel to label the data, reduces the investment of human resources, reduces the expenditure of labor cost, and solves the problem of system reusability caused by machine learning.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (6)
1. A method for automatically generating sql statements from human natural language for querying data, comprising the steps of:
s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
s42: analyzing time data of the natural language;
s43: analyzing the aggregation query condition of the natural language subjected to time processing;
s44: analyzing the nouns labeled in the S2 according to the metadata information, wherein the nouns are analyzed by the query field and the query condition field; analyzing the query judgment condition according to the negative word marked in the S2;
s45: analyzing the query aggregation information;
s46: assembling and generating an sql statement according to the table name, the field name, the query condition field, the query judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
2. The method of claim 1, wherein the stop word does not include a negative word.
3. The method of claim 1, wherein the conditional parsing specifically comprises: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words.
4. The method of claim 1, wherein the temporal data parsing is specifically: and analyzing and finding out data information which represents time in the segmented natural language after condition processing.
5. The method of claim 1, wherein the feedback question process comprises the following steps: and judging whether the similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step s 1.
6. The method as claimed in claim 5, wherein the feedback question-asking process further includes a step of determining a special vocabulary, and if the special vocabulary exists, the method prompts to input the corresponding vocabulary, and enters step S4 after replacing the semantic sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010196066.8A CN111324631B (en) | 2020-03-19 | 2020-03-19 | Method for automatically generating sql statement by human natural language of query data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010196066.8A CN111324631B (en) | 2020-03-19 | 2020-03-19 | Method for automatically generating sql statement by human natural language of query data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111324631A true CN111324631A (en) | 2020-06-23 |
CN111324631B CN111324631B (en) | 2022-04-22 |
Family
ID=71171640
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010196066.8A Active CN111324631B (en) | 2020-03-19 | 2020-03-19 | Method for automatically generating sql statement by human natural language of query data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111324631B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112182022A (en) * | 2020-11-04 | 2021-01-05 | 北京安博通科技股份有限公司 | Data query method and device based on natural language and translation model |
CN112347121A (en) * | 2020-11-02 | 2021-02-09 | 中科曙光南京研究院有限公司 | Configurable method and system for converting natural language into sql |
CN112463819A (en) * | 2020-11-26 | 2021-03-09 | 北京宏景世纪软件股份有限公司 | Computing method, device and equipment based on Chinese expression and storage medium |
CN114625748A (en) * | 2021-04-23 | 2022-06-14 | 亚信科技(南京)有限公司 | SQL query statement generation method and device, electronic equipment and readable storage medium |
CN116821168A (en) * | 2023-08-24 | 2023-09-29 | 吉奥时空信息技术股份有限公司 | Improved NL2SQL method based on large language model |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737049A (en) * | 2011-04-11 | 2012-10-17 | 腾讯科技(深圳)有限公司 | Method and system for database query |
US20130297654A1 (en) * | 2012-05-03 | 2013-11-07 | Salesforce.Com, Inc. | Method and system for generating database access objects |
CN104123346A (en) * | 2014-07-02 | 2014-10-29 | 广东电网公司信息中心 | Structural data searching method |
CN107180021A (en) * | 2016-03-09 | 2017-09-19 | 北京京东尚科信息技术有限公司 | A kind of data processing method, system and its server |
CN107451153A (en) * | 2016-05-31 | 2017-12-08 | 北京京东尚科信息技术有限公司 | The method and apparatus of export structure query statement |
CN108108426A (en) * | 2017-12-15 | 2018-06-01 | 杭州网蛙科技有限公司 | Understanding method, device and the electronic equipment that natural language is putd question to |
JP2018163490A (en) * | 2017-03-24 | 2018-10-18 | 三菱電機インフォメーションネットワーク株式会社 | Access control device and access control program |
CN109947794A (en) * | 2019-02-21 | 2019-06-28 | 东华大学 | A kind of interactive natural language inquiry conversion method |
-
2020
- 2020-03-19 CN CN202010196066.8A patent/CN111324631B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737049A (en) * | 2011-04-11 | 2012-10-17 | 腾讯科技(深圳)有限公司 | Method and system for database query |
US20130297654A1 (en) * | 2012-05-03 | 2013-11-07 | Salesforce.Com, Inc. | Method and system for generating database access objects |
CN104123346A (en) * | 2014-07-02 | 2014-10-29 | 广东电网公司信息中心 | Structural data searching method |
CN107180021A (en) * | 2016-03-09 | 2017-09-19 | 北京京东尚科信息技术有限公司 | A kind of data processing method, system and its server |
CN107451153A (en) * | 2016-05-31 | 2017-12-08 | 北京京东尚科信息技术有限公司 | The method and apparatus of export structure query statement |
JP2018163490A (en) * | 2017-03-24 | 2018-10-18 | 三菱電機インフォメーションネットワーク株式会社 | Access control device and access control program |
CN108108426A (en) * | 2017-12-15 | 2018-06-01 | 杭州网蛙科技有限公司 | Understanding method, device and the electronic equipment that natural language is putd question to |
CN109947794A (en) * | 2019-02-21 | 2019-06-28 | 东华大学 | A kind of interactive natural language inquiry conversion method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347121A (en) * | 2020-11-02 | 2021-02-09 | 中科曙光南京研究院有限公司 | Configurable method and system for converting natural language into sql |
CN112347121B (en) * | 2020-11-02 | 2024-05-28 | 中科曙光南京研究院有限公司 | Configurable natural language sql conversion method and system |
CN112182022A (en) * | 2020-11-04 | 2021-01-05 | 北京安博通科技股份有限公司 | Data query method and device based on natural language and translation model |
CN112182022B (en) * | 2020-11-04 | 2024-04-16 | 北京安博通科技股份有限公司 | Data query method and device based on natural language and translation model |
CN112463819A (en) * | 2020-11-26 | 2021-03-09 | 北京宏景世纪软件股份有限公司 | Computing method, device and equipment based on Chinese expression and storage medium |
CN114625748A (en) * | 2021-04-23 | 2022-06-14 | 亚信科技(南京)有限公司 | SQL query statement generation method and device, electronic equipment and readable storage medium |
CN116821168A (en) * | 2023-08-24 | 2023-09-29 | 吉奥时空信息技术股份有限公司 | Improved NL2SQL method based on large language model |
CN116821168B (en) * | 2023-08-24 | 2024-01-23 | 吉奥时空信息技术股份有限公司 | Improved NL2SQL method based on large language model |
Also Published As
Publication number | Publication date |
---|---|
CN111324631B (en) | 2022-04-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111324631B (en) | Method for automatically generating sql statement by human natural language of query data | |
CN109684448B (en) | Intelligent question and answer method | |
US20200334418A1 (en) | Applied Artificial Intelligence Technology for Using Natural Language Processing and Concept Expression Templates to Train a Natural Language Generation System | |
US8832147B2 (en) | Relational meta-model and associated domain context-based knowledge inference engine for knowledge discovery and organization | |
US20090300043A1 (en) | Text based schema discovery and information extraction | |
US10706045B1 (en) | Natural language querying of a data lake using contextualized knowledge bases | |
US11693855B2 (en) | Automatic creation of schema annotation files for converting natural language queries to structured query language | |
US20220277005A1 (en) | Semantic parsing of natural language query | |
CN115576984A (en) | Method for generating SQL (structured query language) statement and cross-database query by Chinese natural language | |
CA2853627C (en) | Automatic creation of clinical study reports | |
US20180293300A1 (en) | Speech-based database access | |
TWI735380B (en) | Natural language processing method and computing apparatus thereof | |
WO2020258303A1 (en) | Semantic model instantiation method, system and device | |
US20210073216A1 (en) | Business intelligence system based on artificial intelligence and analysis method thereof | |
CN112632106A (en) | Knowledge graph query method, device, equipment and storage medium | |
CN117540004B (en) | Industrial domain intelligent question-answering method and system based on knowledge graph and user behavior | |
US10678827B2 (en) | Systematic mass normalization of international titles | |
Si et al. | Research and Implementation of Data Extraction Method Based on NLP | |
US11544304B2 (en) | System and method for parsing user query | |
CN112183110A (en) | Artificial intelligence data application system and application method based on data center | |
US12093265B2 (en) | Semantics based data and metadata mapping | |
Kiyavitskaya et al. | Applying software analysis technology to lightweight semantic markup of document text | |
CN115017271A (en) | Method and system for intelligently generating RPA flow component block | |
WO2019142094A1 (en) | System and method for semantic text search | |
CN118394954B (en) | Knowledge graph construction method and system for standard data elements of biomedical data set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |