CN111324631A - Method for automatically generating sql statement by human natural language of query data - Google Patents

Method for automatically generating sql statement by human natural language of query data Download PDF

Info

Publication number
CN111324631A
CN111324631A CN202010196066.8A CN202010196066A CN111324631A CN 111324631 A CN111324631 A CN 111324631A CN 202010196066 A CN202010196066 A CN 202010196066A CN 111324631 A CN111324631 A CN 111324631A
Authority
CN
China
Prior art keywords
query
natural language
condition
analysis
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010196066.8A
Other languages
Chinese (zh)
Other versions
CN111324631B (en
Inventor
刘科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Htdata Technology Co ltd
Original Assignee
Chengdu Htdata Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Htdata Technology Co ltd filed Critical Chengdu Htdata Technology Co ltd
Priority to CN202010196066.8A priority Critical patent/CN111324631B/en
Publication of CN111324631A publication Critical patent/CN111324631A/en
Application granted granted Critical
Publication of CN111324631B publication Critical patent/CN111324631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • G06F16/24522Translation of natural language queries to structured queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468Fuzzy queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Automation & Control Theory (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for converting a human natural language of query data into an sql statement. A user inputs a natural sentence, performs word segmentation and stop word removal on the sentence, performs automatic table name analysis, performs further processing on the natural language with the word segmented and the stop word removed and the obtained table name if the table name can be obtained, and feeds back prompt information after fuzzy matching if the natural language is not matched with the stop word; after the table name is obtained, judging whether multi-table association exists or not according to metadata information, if so, carrying out query condition processing according to an association rule, otherwise, directly carrying out single-table query condition processing; in the query condition processing, time analysis processing is respectively carried out on the natural language, and query field analysis, judgment condition analysis, cascade query analysis and query aggregation analysis processing are carried out according to metadata information; finally, query sql is generated. According to the invention, personnel are not required to label data, the investment of human resources is reduced, the expenditure of labor cost is reduced, and the problem of system non-reusability caused by machine learning is solved.

Description

Method for automatically generating sql statement by human natural language of query data
Technical Field
The invention relates to the field of natural language processing, in particular to a method for automatically generating sql sentences from human natural language of query data.
Background
With the development of national economy, more and more systems are put into use, and a large amount of data is generated. When a non-professional user wants to browse data, the data is usually displayed in a chart form in advance in a traditional mode, and the data cannot be freely inquired and browsed. At present, in a semantic to sql conversion mode on the market, a large amount of labeling information is collected through machine learning, and sql sentences are generated through long-time learning. In the middle process, a large amount of data needs to be labeled by personnel, and when new data or a new system is accessed, labeling and long-time machine learning are usually carried out again, so that the problems of uncontrollable training time, irreversibility of the system and the like exist.
Disclosure of Invention
The invention aims to provide a method for automatically generating sql sentences from human natural language of query data, aiming at the problems, and the method does not need personnel to label data and solves the problem of system non-reusability caused by machine learning.
A method of automatically generating sql statements from human natural language for querying data, comprising the steps of: s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
s42: analyzing time data of the natural language;
s43: analyzing the aggregation query condition of the natural language subjected to time processing;
s44: analyzing the nouns labeled in the S2 according to the metadata information, wherein the nouns are analyzed by the query field and the query condition field; analyzing the query judgment condition according to the negative word marked in the S2;
s45: analyzing the query aggregation information;
s46: assembling and generating an sql statement according to the table name, the field name, the query condition field, the query judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
Further, stop words do not include negative words.
Further, the condition analysis specifically includes: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words.
Further, the time data analysis specifically includes: and analyzing and finding out data information which represents time in the segmented natural language after condition processing.
Further, the feedback question-asking process includes the following steps: and judging whether the similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step s 1.
Further, the feedback question asking process further includes a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step is carried out, and then the step is carried out in step S4.
The invention has the beneficial effects that: personnel are not needed to label data, investment of human resources is reduced, labor cost is lowered, and the problem that a system cannot be reused due to machine learning is solved.
Drawings
FIG. 1 is a block flow diagram of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
A method of automatically generating sql statements from human natural language for querying data, the method comprising the steps of:
s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
wherein the associated nesting specifically comprises: nested queries are constructed based on the metadata relationships of the tables. For example, if the achievements are address value fields in the address table, but the sales table corresponds to an ID field of the address table, then the mapping can be done. Table 1 shows a metadata specific structure of the table, and table 2 shows a metadata specific structure of the field.
Table_name Table name, str
Mapping_name Mapping name, str
Maping_value Mapping value, str
Table 1 table metadata structure
Figure BDA0002417648930000031
TABLE 2 field metadata Structure
If the query path graph is a multi-table, judging whether to construct a Cartesian product according to the metadata or construct multi-stage query according to the query path graph.
Cartesian product: if there is no aggregation, the multi-tiered query is converted to a multi-table Cartesian query. And if the aggregation exists, outputting the relation between the columns and the corresponding columns of the outer layer table after the aggregation according to the query path graph. And resolving the multi-layer query according to the feasible query path.
S42: and performing time processing on the natural language after condition processing: the method comprises the steps of firstly constructing a time trigger word dictionary in a certain mode, for example, summarizing the trigger word dictionary from a training corpus or human experience, then constructing a time affix word dictionary according to modifiers around the trigger words, prefixes, suffixes and other components, and finally identifying a time expression through the combination of the trigger words and the affix words. Taking the explanation of "2019, 3, and 4" as an example, wherein "year", "month", and "day" are trigger words, then the time description can be interpreted as a standardized time format "2019/03/04" according to the components of the trigger word prefixes "2019", "3", and "4". Through the time analysis process, all the time descriptions in the query statement can be unified into a standard format to be constructed by a subsequent data model.
S43: aggregating the natural language after the time processing;
polymerization and analysis: judging the aggregation condition, judging whether to need to carry out aggregation analysis according to the regular expression at present, and finding out the corresponding aggregation condition according to the dependency relationship if the natural language input by the user contains aggregation vocabularies such as maximum, minimum, total, sum, average and the like. If "maximum tobacco sales amount" is input, "the" maximum "corresponding aggregation vocabulary is obtained as" max, "the" maximum "modified" sales amount "is known according to the dependency relationship, and the composition aggregation information" max (sales amount) "can be obtained to be spliced by subsequent sql.
S44: assembling and generating an inquiry sql statement according to the table name, the field name, the inquiry condition field, the inquiry judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
Stop words include stop, yes, o, query, look, know, etc. stop words but not stop, not, etc. stop words.
The feedback questioning process comprises the following steps: judging whether a similar table name exists or not, if so, giving a table name prompt; when there is no similar table name, the query record with higher query frequency of the user is obtained according to the query record of the user, the user is prompted to input, and then the step s1 is entered.
The feedback question asking process further comprises a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step S4 is carried out.
The method for automatically generating the sql sentences by the human natural language of the query data does not need personnel to label the data, reduces the investment of human resources, reduces the expenditure of labor cost, and solves the problem of system reusability caused by machine learning.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (6)

1. A method for automatically generating sql statements from human natural language for querying data, comprising the steps of:
s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
s42: analyzing time data of the natural language;
s43: analyzing the aggregation query condition of the natural language subjected to time processing;
s44: analyzing the nouns labeled in the S2 according to the metadata information, wherein the nouns are analyzed by the query field and the query condition field; analyzing the query judgment condition according to the negative word marked in the S2;
s45: analyzing the query aggregation information;
s46: assembling and generating an sql statement according to the table name, the field name, the query condition field, the query judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
2. The method of claim 1, wherein the stop word does not include a negative word.
3. The method of claim 1, wherein the conditional parsing specifically comprises: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words.
4. The method of claim 1, wherein the temporal data parsing is specifically: and analyzing and finding out data information which represents time in the segmented natural language after condition processing.
5. The method of claim 1, wherein the feedback question process comprises the following steps: and judging whether the similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step s 1.
6. The method as claimed in claim 5, wherein the feedback question-asking process further includes a step of determining a special vocabulary, and if the special vocabulary exists, the method prompts to input the corresponding vocabulary, and enters step S4 after replacing the semantic sentence.
CN202010196066.8A 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data Active CN111324631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010196066.8A CN111324631B (en) 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010196066.8A CN111324631B (en) 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data

Publications (2)

Publication Number Publication Date
CN111324631A true CN111324631A (en) 2020-06-23
CN111324631B CN111324631B (en) 2022-04-22

Family

ID=71171640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010196066.8A Active CN111324631B (en) 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data

Country Status (1)

Country Link
CN (1) CN111324631B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182022A (en) * 2020-11-04 2021-01-05 北京安博通科技股份有限公司 Data query method and device based on natural language and translation model
CN112347121A (en) * 2020-11-02 2021-02-09 中科曙光南京研究院有限公司 Configurable method and system for converting natural language into sql
CN112463819A (en) * 2020-11-26 2021-03-09 北京宏景世纪软件股份有限公司 Computing method, device and equipment based on Chinese expression and storage medium
CN114625748A (en) * 2021-04-23 2022-06-14 亚信科技(南京)有限公司 SQL query statement generation method and device, electronic equipment and readable storage medium
CN116821168A (en) * 2023-08-24 2023-09-29 吉奥时空信息技术股份有限公司 Improved NL2SQL method based on large language model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737049A (en) * 2011-04-11 2012-10-17 腾讯科技(深圳)有限公司 Method and system for database query
US20130297654A1 (en) * 2012-05-03 2013-11-07 Salesforce.Com, Inc. Method and system for generating database access objects
CN104123346A (en) * 2014-07-02 2014-10-29 广东电网公司信息中心 Structural data searching method
CN107180021A (en) * 2016-03-09 2017-09-19 北京京东尚科信息技术有限公司 A kind of data processing method, system and its server
CN107451153A (en) * 2016-05-31 2017-12-08 北京京东尚科信息技术有限公司 The method and apparatus of export structure query statement
CN108108426A (en) * 2017-12-15 2018-06-01 杭州网蛙科技有限公司 Understanding method, device and the electronic equipment that natural language is putd question to
JP2018163490A (en) * 2017-03-24 2018-10-18 三菱電機インフォメーションネットワーク株式会社 Access control device and access control program
CN109947794A (en) * 2019-02-21 2019-06-28 东华大学 A kind of interactive natural language inquiry conversion method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737049A (en) * 2011-04-11 2012-10-17 腾讯科技(深圳)有限公司 Method and system for database query
US20130297654A1 (en) * 2012-05-03 2013-11-07 Salesforce.Com, Inc. Method and system for generating database access objects
CN104123346A (en) * 2014-07-02 2014-10-29 广东电网公司信息中心 Structural data searching method
CN107180021A (en) * 2016-03-09 2017-09-19 北京京东尚科信息技术有限公司 A kind of data processing method, system and its server
CN107451153A (en) * 2016-05-31 2017-12-08 北京京东尚科信息技术有限公司 The method and apparatus of export structure query statement
JP2018163490A (en) * 2017-03-24 2018-10-18 三菱電機インフォメーションネットワーク株式会社 Access control device and access control program
CN108108426A (en) * 2017-12-15 2018-06-01 杭州网蛙科技有限公司 Understanding method, device and the electronic equipment that natural language is putd question to
CN109947794A (en) * 2019-02-21 2019-06-28 东华大学 A kind of interactive natural language inquiry conversion method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347121A (en) * 2020-11-02 2021-02-09 中科曙光南京研究院有限公司 Configurable method and system for converting natural language into sql
CN112347121B (en) * 2020-11-02 2024-05-28 中科曙光南京研究院有限公司 Configurable natural language sql conversion method and system
CN112182022A (en) * 2020-11-04 2021-01-05 北京安博通科技股份有限公司 Data query method and device based on natural language and translation model
CN112182022B (en) * 2020-11-04 2024-04-16 北京安博通科技股份有限公司 Data query method and device based on natural language and translation model
CN112463819A (en) * 2020-11-26 2021-03-09 北京宏景世纪软件股份有限公司 Computing method, device and equipment based on Chinese expression and storage medium
CN114625748A (en) * 2021-04-23 2022-06-14 亚信科技(南京)有限公司 SQL query statement generation method and device, electronic equipment and readable storage medium
CN116821168A (en) * 2023-08-24 2023-09-29 吉奥时空信息技术股份有限公司 Improved NL2SQL method based on large language model
CN116821168B (en) * 2023-08-24 2024-01-23 吉奥时空信息技术股份有限公司 Improved NL2SQL method based on large language model

Also Published As

Publication number Publication date
CN111324631B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN111324631B (en) Method for automatically generating sql statement by human natural language of query data
CN109684448B (en) Intelligent question and answer method
US20200334418A1 (en) Applied Artificial Intelligence Technology for Using Natural Language Processing and Concept Expression Templates to Train a Natural Language Generation System
US8832147B2 (en) Relational meta-model and associated domain context-based knowledge inference engine for knowledge discovery and organization
US20090300043A1 (en) Text based schema discovery and information extraction
US10706045B1 (en) Natural language querying of a data lake using contextualized knowledge bases
US11693855B2 (en) Automatic creation of schema annotation files for converting natural language queries to structured query language
US20220277005A1 (en) Semantic parsing of natural language query
CN115576984A (en) Method for generating SQL (structured query language) statement and cross-database query by Chinese natural language
CA2853627C (en) Automatic creation of clinical study reports
US20180293300A1 (en) Speech-based database access
TWI735380B (en) Natural language processing method and computing apparatus thereof
WO2020258303A1 (en) Semantic model instantiation method, system and device
US20210073216A1 (en) Business intelligence system based on artificial intelligence and analysis method thereof
CN112632106A (en) Knowledge graph query method, device, equipment and storage medium
CN117540004B (en) Industrial domain intelligent question-answering method and system based on knowledge graph and user behavior
US10678827B2 (en) Systematic mass normalization of international titles
Si et al. Research and Implementation of Data Extraction Method Based on NLP
US11544304B2 (en) System and method for parsing user query
CN112183110A (en) Artificial intelligence data application system and application method based on data center
US12093265B2 (en) Semantics based data and metadata mapping
Kiyavitskaya et al. Applying software analysis technology to lightweight semantic markup of document text
CN115017271A (en) Method and system for intelligently generating RPA flow component block
WO2019142094A1 (en) System and method for semantic text search
CN118394954B (en) Knowledge graph construction method and system for standard data elements of biomedical data set

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant