CN111324631B - Method for automatically generating sql statement by human natural language of query data - Google Patents

Method for automatically generating sql statement by human natural language of query data Download PDF

Info

Publication number
CN111324631B
CN111324631B CN202010196066.8A CN202010196066A CN111324631B CN 111324631 B CN111324631 B CN 111324631B CN 202010196066 A CN202010196066 A CN 202010196066A CN 111324631 B CN111324631 B CN 111324631B
Authority
CN
China
Prior art keywords
query
natural language
condition
analysis
aggregation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010196066.8A
Other languages
Chinese (zh)
Other versions
CN111324631A (en
Inventor
刘科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Haitian Shulian Technology Co ltd
Original Assignee
Chengdu Haitian Shulian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Haitian Shulian Technology Co ltd filed Critical Chengdu Haitian Shulian Technology Co ltd
Priority to CN202010196066.8A priority Critical patent/CN111324631B/en
Publication of CN111324631A publication Critical patent/CN111324631A/en
Application granted granted Critical
Publication of CN111324631B publication Critical patent/CN111324631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • G06F16/24522Translation of natural language queries to structured queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468Fuzzy queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Automation & Control Theory (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for converting a human natural language of query data into an sql statement. A user inputs a natural sentence, performs word segmentation and stop word removal on the sentence, performs automatic table name analysis, performs further processing on the natural language with the word segmented and the stop word removed and the obtained table name if the table name can be obtained, and feeds back prompt information after fuzzy matching if the natural language is not matched with the stop word; after the table name is obtained, judging whether multi-table association exists or not according to metadata information, if so, carrying out query condition processing according to an association rule, otherwise, directly carrying out single-table query condition processing; in the query condition processing, time analysis processing is respectively carried out on the natural language, and query field analysis, judgment condition analysis, cascade query analysis and query aggregation analysis processing are carried out according to metadata information; finally, query sql is generated. According to the invention, personnel are not required to label data, the investment of human resources is reduced, the expenditure of labor cost is reduced, and the problem of system non-reusability caused by machine learning is solved.

Description

Method for automatically generating sql statement by human natural language of query data
Technical Field
The invention relates to the field of natural language processing, in particular to a method for automatically generating sql sentences from human natural language of query data.
Background
With the development of national economy, more and more systems are put into use, and a large amount of data is generated. When a non-professional user wants to browse data, the data is usually displayed in a chart form in advance in a traditional mode, and the data cannot be freely inquired and browsed. At present, in a semantic to sql conversion mode on the market, a large amount of labeling information is collected through machine learning, and sql sentences are generated through long-time learning. In the middle process, a large amount of data needs to be labeled by personnel, and when new data or a new system is accessed, labeling and long-time machine learning are usually carried out again, so that the problems of uncontrollable training time, irreversibility of the system and the like exist.
Disclosure of Invention
The invention aims to provide a method for automatically generating sql sentences from human natural language of query data, aiming at the problems, and the method does not need personnel to label data and solves the problem of system non-reusability caused by machine learning.
A method of automatically generating sql statements from human natural language for querying data, comprising the steps of: s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
s42: analyzing time data of the natural language;
s43: analyzing the aggregation query condition of the natural language subjected to time processing;
s44: analyzing the nouns labeled in the S2 according to the metadata information, wherein the nouns are analyzed by the query field and the query condition field; analyzing the query judgment condition according to the negative word marked in the S2;
s45: analyzing the query aggregation information;
s46: assembling and generating an sql statement according to the table name, the field name, the query condition field, the query judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
Further, stop words do not include negative words.
Further, the condition analysis specifically includes: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words.
Further, the time data analysis specifically includes: and analyzing and finding out data information which represents time in the segmented natural language after condition processing.
Further, the feedback question-asking process includes the following steps: and judging whether the similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step s 1.
Further, the feedback question asking process further includes a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step is carried out, and then the step is carried out in step S4.
The invention has the beneficial effects that: personnel are not needed to label data, investment of human resources is reduced, labor cost is lowered, and the problem that a system cannot be reused due to machine learning is solved.
Drawings
FIG. 1 is a block flow diagram of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
A method of automatically generating sql statements from human natural language for querying data, the method comprising the steps of:
s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
wherein the associated nesting specifically comprises: nested queries are constructed based on the metadata relationships of the tables. For example, if the achievements are address value fields in the address table, but the sales table corresponds to an ID field of the address table, then the mapping can be done. Table 1 shows a metadata specific structure of the table, and table 2 shows a metadata specific structure of the field.
Table_name Table name, str
Mapping_name Mapping name, str
Maping_value Mapping value, str
Table 1 table metadata structure
Figure BDA0002417648930000031
TABLE 2 field metadata Structure
If the query path graph is a multi-table, judging whether to construct a Cartesian product according to the metadata or construct multi-stage query according to the query path graph.
Cartesian product: if there is no aggregation, the multi-tiered query is converted to a multi-table Cartesian query. And if the aggregation exists, outputting the relation between the columns and the corresponding columns of the outer layer table after the aggregation according to the query path graph. And resolving the multi-layer query according to the feasible query path.
S42: and performing time processing on the natural language after condition processing: the method comprises the steps of firstly constructing a time trigger word dictionary in a certain mode, for example, summarizing the trigger word dictionary from a training corpus or human experience, then constructing a time affix word dictionary according to modifiers around the trigger words, prefixes, suffixes and other components, and finally identifying a time expression through the combination of the trigger words and the affix words. Taking the explanation of "2019, 3, and 4" as an example, wherein "year", "month", and "day" are trigger words, then the time description can be interpreted as a standardized time format "2019/03/04" according to the components of the trigger word prefixes "2019", "3", and "4". Through the time analysis process, all the time descriptions in the query statement can be unified into a standard format to be constructed by a subsequent data model.
S43: aggregating the natural language after the time processing;
polymerization and analysis: judging the aggregation condition, judging whether to need to carry out aggregation analysis according to the regular expression at present, and finding out the corresponding aggregation condition according to the dependency relationship if the natural language input by the user contains aggregation vocabularies such as maximum, minimum, total, sum, average and the like. If "maximum tobacco sales amount" is input, "the" maximum "corresponding aggregation vocabulary is obtained as" max, "the" maximum "modified" sales amount "is known according to the dependency relationship, and the composition aggregation information" max (sales amount) "can be obtained to be spliced by subsequent sql.
S44: assembling and generating an inquiry sql statement according to the table name, the field name, the inquiry condition field, the inquiry judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.
Stop words include stop, yes, o, query, look, know, etc. stop words but not stop, not, etc. stop words.
The feedback questioning process comprises the following steps: judging whether a similar table name exists or not, if so, giving a table name prompt; when there is no similar table name, the query record with higher query frequency of the user is obtained according to the query record of the user, the user is prompted to input, and then the step s1 is entered.
The feedback question asking process further comprises a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step S4 is carried out.
The method for automatically generating the sql sentences by the human natural language of the query data does not need personnel to label the data, reduces the investment of human resources, reduces the expenditure of labor cost, and solves the problem of system reusability caused by machine learning.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (2)

1. A method for automatically generating sql statements from human natural language for querying data, comprising the steps of:
s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;
s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;
s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;
s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;
s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;
the association nesting rule is specifically as follows: constructing a nested query based on the metadata relationship of the table; if the query path graph is a multi-table, judging whether to construct a Cartesian product according to the metadata or construct multi-stage query according to the query path graph; cartesian product: if no aggregation exists, converting the multi-layer query into a multi-table Cartesian query; if the aggregation exists, outputting the relation between the columns and the corresponding columns of the outer layer table after the aggregation according to the query path diagram; analyzing the multi-layer query according to the feasible query path;
the condition analysis specifically comprises the following steps: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words;
s42: analyzing time data of the natural language; the time data analysis specifically comprises: analyzing and finding out data information which represents time in the segmented natural language after condition processing; firstly, a time trigger word dictionary is built in a certain mode, for example, the trigger word dictionary is summarized from a training corpus or artificial experience, then a time affix word dictionary is built according to modifiers around the trigger words, prefixes and suffixes, and finally a time expression is identified through the combination of the trigger words and the affix words, all time descriptions in a query sentence are unified into a standard format, and a data model to be built is constructed;
s43: analyzing the aggregation query condition of the natural language subjected to time processing; aggregate query condition parsing process: judging the aggregation condition, and judging whether to perform aggregation analysis according to the regular expression at present; finding out corresponding aggregation conditions according to the dependency relationship of the aggregation vocabularies in the natural language input by the user to form aggregation information for subsequent sql splicing;
s44: analyzing the nouns labeled in the S2 according to the metadata information, wherein the nouns are analyzed by the query field and the query condition field; analyzing the query judgment condition according to the negative word marked in the S2;
s45: analyzing the query aggregation information;
s46: assembling and generating an sql statement according to the table name, the field name, the query condition field, the query judgment condition and the aggregation information;
s5: carrying out fuzzy query on natural language input by a user;
s51: judging whether a matched table name exists or not, and entering a feedback question process if the matched table name does not exist;
the feedback question asking process comprises the following steps: judging whether similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step S1; the feedback question asking process further comprises a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step S4 is carried out.
2. The method of claim 1, wherein the stop word does not include a negative word.
CN202010196066.8A 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data Active CN111324631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010196066.8A CN111324631B (en) 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010196066.8A CN111324631B (en) 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data

Publications (2)

Publication Number Publication Date
CN111324631A CN111324631A (en) 2020-06-23
CN111324631B true CN111324631B (en) 2022-04-22

Family

ID=71171640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010196066.8A Active CN111324631B (en) 2020-03-19 2020-03-19 Method for automatically generating sql statement by human natural language of query data

Country Status (1)

Country Link
CN (1) CN111324631B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182022B (en) * 2020-11-04 2024-04-16 北京安博通科技股份有限公司 Data query method and device based on natural language and translation model
CN116821168B (en) * 2023-08-24 2024-01-23 吉奥时空信息技术股份有限公司 Improved NL2SQL method based on large language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737049A (en) * 2011-04-11 2012-10-17 腾讯科技(深圳)有限公司 Method and system for database query
CN104123346A (en) * 2014-07-02 2014-10-29 广东电网公司信息中心 Structural data searching method
CN107180021A (en) * 2016-03-09 2017-09-19 北京京东尚科信息技术有限公司 A kind of data processing method, system and its server
CN107451153A (en) * 2016-05-31 2017-12-08 北京京东尚科信息技术有限公司 The method and apparatus of export structure query statement
CN108108426A (en) * 2017-12-15 2018-06-01 杭州网蛙科技有限公司 Understanding method, device and the electronic equipment that natural language is putd question to
JP2018163490A (en) * 2017-03-24 2018-10-18 三菱電機インフォメーションネットワーク株式会社 Access control device and access control program
CN109947794A (en) * 2019-02-21 2019-06-28 东华大学 A kind of interactive natural language inquiry conversion method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621372B2 (en) * 2012-05-03 2020-04-14 Salesforce.Com, Inc. Method and system for generating database access objects

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737049A (en) * 2011-04-11 2012-10-17 腾讯科技(深圳)有限公司 Method and system for database query
CN104123346A (en) * 2014-07-02 2014-10-29 广东电网公司信息中心 Structural data searching method
CN107180021A (en) * 2016-03-09 2017-09-19 北京京东尚科信息技术有限公司 A kind of data processing method, system and its server
CN107451153A (en) * 2016-05-31 2017-12-08 北京京东尚科信息技术有限公司 The method and apparatus of export structure query statement
JP2018163490A (en) * 2017-03-24 2018-10-18 三菱電機インフォメーションネットワーク株式会社 Access control device and access control program
CN108108426A (en) * 2017-12-15 2018-06-01 杭州网蛙科技有限公司 Understanding method, device and the electronic equipment that natural language is putd question to
CN109947794A (en) * 2019-02-21 2019-06-28 东华大学 A kind of interactive natural language inquiry conversion method

Also Published As

Publication number Publication date
CN111324631A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN109684448B (en) Intelligent question and answer method
US7930322B2 (en) Text based schema discovery and information extraction
US8700658B2 (en) Relational meta model and associated domain context-based knowledge inference engine for knowledge discovery and organization
US10140333B2 (en) Trusted query system and method
CN102207948B (en) Method for generating incident statement sentence material base
US10706045B1 (en) Natural language querying of a data lake using contextualized knowledge bases
CA2853627C (en) Automatic creation of clinical study reports
US11693855B2 (en) Automatic creation of schema annotation files for converting natural language queries to structured query language
CN111324631B (en) Method for automatically generating sql statement by human natural language of query data
CN110795932B (en) Geological report text information extraction method based on geological ontology
WO2020258303A1 (en) Semantic model instantiation method, system and device
US20210073216A1 (en) Business intelligence system based on artificial intelligence and analysis method thereof
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
Cui et al. Introducing Explorer of Taxon Concepts with a case study on spider measurement matrix building
US20070282804A1 (en) Apparatus and method for extracting database information from a report
JP4005343B2 (en) Information retrieval system
CN112183110A (en) Artificial intelligence data application system and application method based on data center
US20200089697A1 (en) System and method for parsing user query
US20230044287A1 (en) Semantics based data and metadata mapping
CN112559550B (en) Multi-data-source NL2SQL system based on semantic rules and multi-dimensional model
Kiyavitskaya et al. Applying software analysis technology to lightweight semantic markup of document text
CN112818005A (en) Structured data searching method, device, equipment and storage medium
WO2019142094A1 (en) System and method for semantic text search
Zhu et al. Doc2Vec on similar document suggestion for pharmaceutical collections
Khashfeh et al. A Text Mining Algorithm Optimising the Determination of Relevant Studies

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant