CN111324631A

CN111324631A - Method for automatically generating sql statement by human natural language of query data

Info

Publication number: CN111324631A
Application number: CN202010196066.8A
Authority: CN
Inventors: 刘科
Original assignee: Chengdu Htdata Technology Co ltd
Current assignee: Chengdu Htdata Technology Co ltd
Priority date: 2020-03-19
Filing date: 2020-03-19
Publication date: 2020-06-23
Anticipated expiration: 2040-03-19
Also published as: CN111324631B

Abstract

The invention discloses a method for converting a human natural language of query data into an sql statement. A user inputs a natural sentence, performs word segmentation and stop word removal on the sentence, performs automatic table name analysis, performs further processing on the natural language with the word segmented and the stop word removed and the obtained table name if the table name can be obtained, and feeds back prompt information after fuzzy matching if the natural language is not matched with the stop word; after the table name is obtained, judging whether multi-table association exists or not according to metadata information, if so, carrying out query condition processing according to an association rule, otherwise, directly carrying out single-table query condition processing; in the query condition processing, time analysis processing is respectively carried out on the natural language, and query field analysis, judgment condition analysis, cascade query analysis and query aggregation analysis processing are carried out according to metadata information; finally, query sql is generated. According to the invention, personnel are not required to label data, the investment of human resources is reduced, the expenditure of labor cost is reduced, and the problem of system non-reusability caused by machine learning is solved.

Description

Method for automatically generating sql statement by human natural language of query data

Technical Field

The invention relates to the field of natural language processing, in particular to a method for automatically generating sql sentences from human natural language of query data.

Background

With the development of national economy, more and more systems are put into use, and a large amount of data is generated. When a non-professional user wants to browse data, the data is usually displayed in a chart form in advance in a traditional mode, and the data cannot be freely inquired and browsed. At present, in a semantic to sql conversion mode on the market, a large amount of labeling information is collected through machine learning, and sql sentences are generated through long-time learning. In the middle process, a large amount of data needs to be labeled by personnel, and when new data or a new system is accessed, labeling and long-time machine learning are usually carried out again, so that the problems of uncontrollable training time, irreversibility of the system and the like exist.

Disclosure of Invention

The invention aims to provide a method for automatically generating sql sentences from human natural language of query data, aiming at the problems, and the method does not need personnel to label data and solves the problem of system non-reusability caused by machine learning.

A method of automatically generating sql statements from human natural language for querying data, comprising the steps of: s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;

s2: removing stop words from the natural sentences after word segmentation, and identifying the marked nouns and negative words;

s3: regularly splicing nouns, matching the nouns with the constructed table name metadata information to obtain corresponding table names, and if the nouns can be obtained, entering a step S4, and if the nouns cannot be obtained, entering a step S5;

s4: performing SQL generation processing on the natural language with the words being segmented and the stop words being removed and the acquired table name;

s41: judging whether the obtained table has multi-table correlation according to the metadata information of the table, if so, performing condition analysis on the natural language with words segmented and stop words removed according to a correlation nesting rule, otherwise, directly performing single-table query condition analysis;

s42: analyzing time data of the natural language;

s43: analyzing the aggregation query condition of the natural language subjected to time processing;

s44: analyzing the nouns labeled in the S2 according to the metadata information, wherein the nouns are analyzed by the query field and the query condition field; analyzing the query judgment condition according to the negative word marked in the S2;

s45: analyzing the query aggregation information;

s46: assembling and generating an sql statement according to the table name, the field name, the query condition field, the query judgment condition and the aggregation information;

s5: carrying out fuzzy query on natural language input by a user;

s51: and judging whether the matched table name exists or not, and if the matched table name does not exist, entering a feedback question-asking process.

Further, stop words do not include negative words.

Further, the condition analysis specifically includes: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words.

Further, the time data analysis specifically includes: and analyzing and finding out data information which represents time in the segmented natural language after condition processing.

Further, the feedback question-asking process includes the following steps: and judging whether the similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step s 1.

Further, the feedback question asking process further includes a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step is carried out, and then the step is carried out in step S4.

The invention has the beneficial effects that: personnel are not needed to label data, investment of human resources is reduced, labor cost is lowered, and the problem that a system cannot be reused due to machine learning is solved.

Drawings

FIG. 1 is a block flow diagram of the present invention.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.

A method of automatically generating sql statements from human natural language for querying data, the method comprising the steps of:

s1: a user inputs a natural sentence of query data and performs word segmentation on the natural sentence;

wherein the associated nesting specifically comprises: nested queries are constructed based on the metadata relationships of the tables. For example, if the achievements are address value fields in the address table, but the sales table corresponds to an ID field of the address table, then the mapping can be done. Table 1 shows a metadata specific structure of the table, and table 2 shows a metadata specific structure of the field.

Table_name	Table name, str
		Mapping_name	Mapping name, str
Maping_value	Mapping value, str

Table 1 table metadata structure

TABLE 2 field metadata Structure

If the query path graph is a multi-table, judging whether to construct a Cartesian product according to the metadata or construct multi-stage query according to the query path graph.

Cartesian product: if there is no aggregation, the multi-tiered query is converted to a multi-table Cartesian query. And if the aggregation exists, outputting the relation between the columns and the corresponding columns of the outer layer table after the aggregation according to the query path graph. And resolving the multi-layer query according to the feasible query path.

S42: and performing time processing on the natural language after condition processing: the method comprises the steps of firstly constructing a time trigger word dictionary in a certain mode, for example, summarizing the trigger word dictionary from a training corpus or human experience, then constructing a time affix word dictionary according to modifiers around the trigger words, prefixes, suffixes and other components, and finally identifying a time expression through the combination of the trigger words and the affix words. Taking the explanation of "2019, 3, and 4" as an example, wherein "year", "month", and "day" are trigger words, then the time description can be interpreted as a standardized time format "2019/03/04" according to the components of the trigger word prefixes "2019", "3", and "4". Through the time analysis process, all the time descriptions in the query statement can be unified into a standard format to be constructed by a subsequent data model.

S43: aggregating the natural language after the time processing;

polymerization and analysis: judging the aggregation condition, judging whether to need to carry out aggregation analysis according to the regular expression at present, and finding out the corresponding aggregation condition according to the dependency relationship if the natural language input by the user contains aggregation vocabularies such as maximum, minimum, total, sum, average and the like. If "maximum tobacco sales amount" is input, "the" maximum "corresponding aggregation vocabulary is obtained as" max, "the" maximum "modified" sales amount "is known according to the dependency relationship, and the composition aggregation information" max (sales amount) "can be obtained to be spliced by subsequent sql.

S44: assembling and generating an inquiry sql statement according to the table name, the field name, the inquiry condition field, the inquiry judgment condition and the aggregation information;

s5: carrying out fuzzy query on natural language input by a user;

Stop words include stop, yes, o, query, look, know, etc. stop words but not stop, not, etc. stop words.

The feedback questioning process comprises the following steps: judging whether a similar table name exists or not, if so, giving a table name prompt; when there is no similar table name, the query record with higher query frequency of the user is obtained according to the query record of the user, the user is prompted to input, and then the step s1 is entered.

The feedback question asking process further comprises a special vocabulary judging step, if the special vocabulary exists, the input of the corresponding vocabulary is prompted, and after the semantic sentence is replaced, the step S4 is carried out.

The method for automatically generating the sql sentences by the human natural language of the query data does not need personnel to label the data, reduces the investment of human resources, reduces the expenditure of labor cost, and solves the problem of system reusability caused by machine learning.

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A method for automatically generating sql statements from human natural language for querying data, comprising the steps of:

s42: analyzing time data of the natural language;

s45: analyzing the query aggregation information;

s5: carrying out fuzzy query on natural language input by a user;

2. The method of claim 1, wherein the stop word does not include a negative word.

3. The method of claim 1, wherein the conditional parsing specifically comprises: the recognized nouns are matched through metadata, and if the field names of the query conditions are found and the query values are found at the same time, one query condition is considered to be analyzed and completed; if the matching has no corresponding value of the field, considering whether the field is numerical type, matching with negative word analysis and numerical type query judgment conditions of less than or equal to the character analysis, and considering that one query condition analysis is completed; if only the query value is matched without the query field, the field corresponding to the query value in the metadata is automatically used as the query field, and one query condition is considered to be analyzed and completed; if a plurality of parallel query judgment values are analyzed, the query judgment values correspond to a column name and are analyzed into in operation or not in operation in the SQL language in cooperation with negative words.

4. The method of claim 1, wherein the temporal data parsing is specifically: and analyzing and finding out data information which represents time in the segmented natural language after condition processing.

5. The method of claim 1, wherein the feedback question process comprises the following steps: and judging whether the similar table names exist or not, if so, giving a table name prompt, otherwise, obtaining a query record with higher query frequency of the user according to the query record of the user, prompting the user to input, and then entering the step s 1.

6. The method as claimed in claim 5, wherein the feedback question-asking process further includes a step of determining a special vocabulary, and if the special vocabulary exists, the method prompts to input the corresponding vocabulary, and enters step S4 after replacing the semantic sentence.