CN116821168A - Improved NL2SQL method based on large language model - Google Patents

Improved NL2SQL method based on large language model

Info

Publication number
CN116821168A
Authority
CN
China
Prior art keywords
vector
column
language model
key
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311070932.9A
Other languages
Chinese (zh)
Other versions
CN116821168B (en)
Inventor
段春先
刘梦军
李颖
陈胜鹏
许继伟
雷振
刘高
李小超
王敬佩
付卓
韩小乐
夏帷
王�锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Geospace Information Technology Co ltd
Original Assignee
Geospace Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Geospace Information Technology Co ltd filed Critical Geospace Information Technology Co ltd
Priority to CN202311070932.9A priority Critical patent/CN116821168B/en
Publication of CN116821168A publication Critical patent/CN116821168A/en
Application granted granted Critical
Publication of CN116821168B publication Critical patent/CN116821168B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of natural language processing and provides an improved NL2SQL method based on a generative large language model, comprising the following steps: S1, preprocessing the table information of a database; S2, preprocessing a natural language question; S3, matching the target table; S4, obtaining the large language model output; and S5, extracting and outputting the SQL statement. The invention uses a generative large language model and constructs prompt statements in a chain-of-thought style to improve matching precision; no end-to-end retraining is needed, which gives the method good applicability in production environments where the database table structure is updated frequently. In addition, by pre-computing weighted scores for tables and columns, the method solves the problem of recognizing database query semantics amid the interference of repeated columns across multiple tables in real production environments, helps the large language model better understand and match the table and column information referenced in the natural language question, and improves the accuracy of the final result.

Description

Improved NL2SQL method based on large language model
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to an improved NL2SQL method based on a generative large language model.
Background
Natural Language to SQL (NL2SQL) is a technique that converts a user's natural language description into a structured SQL query statement. The technique aims to lower the threshold for database retrieval, so that non-technical users without database expertise can quickly obtain data by entering natural language text. For example, when a user enters "query the number of floating population requiring key attention", NL2SQL can convert it into the executable SQL query "SELECT COUNT(*) FROM p_floating WHERE IS_FORM=1" and thereby obtain the data from the database.
With the development of artificial intelligence, deep-learning-based NL2SQL methods have gradually emerged, overcoming the limitations and inflexibility of traditional rule-matching methods. However, a deep learning model requires substantial manual effort to build a training dataset, its implementation cost is high, and in production environments where table structure information is updated frequently it must often be retrained to maintain high accuracy, which hinders adoption. A series of methods based on pre-trained language models followed; but because a semantic gap exists between natural language and SQL statements, a typical pre-trained language model (such as BERT and its derivatives, which only encode and do not decode) cannot fully capture the complex semantic and structural correspondence between the two, especially in cross-domain or multi-turn settings, and must be combined with other modules or techniques, such as abstract syntax trees, relation-aware attention, or text-table data linking, to improve the accuracy and robustness of SQL generation. Current SoTA improvements include intermediate grammar trees, semantic analysis combined with semantic component matching, and introducing an SQL expertise knowledge base as an external dictionary. The SQL matching steps of these methods are complex and their robustness is limited.
For example, patent CN115408506A discloses an NL2SQL method combining semantic analysis with semantic component matching: after encoding the user's question with a pre-trained BERT model, natural language is converted into SQL over multiple rounds, which improves accuracy but incurs high time complexity. Moreover, the method only performs matching prediction for aggregation function types, operator types and column fields, and does not analyze the overall semantics to identify the target table, so it struggles with the confusable or repeated columns found in actual production. For example, given database schema one "{key teenager information table: {ID, name, guardian name, ...}}" and schema two "{left-behind personnel information table: {ID, name, guardian name, ...}}", for the natural language question "query the guardian name of the person with ID 1" it cannot distinguish whether the target table is schema one or schema two.
As another example, patent CN115658729A discloses a method for converting natural language into SQL statements based on a pre-trained model, namely a RoBERTa model: the encoding layer uses the RoBERTa-large network structure and the decoding layer uses a bidirectional LSTM. The method has low matching accuracy for natural language questions with imprecise semantics. And, like the method in CN115408506A, it only applies matching to columns and does not analyze the overall semantics to identify the target table, so it likewise struggles with the confusable or repeated columns found in actual production.
In recent years, generative large language models such as ChatGPT and ChatGLM have offered a new solution for NL2SQL: through prompt engineering, the large language model can be made to output the desired result for a specific NL2SQL task without retraining. Because a generative large language model has strong semantic understanding capability and a prompting mechanism, an executable SQL statement can be obtained for an NL2SQL task simply by supplying table and column information, the natural language question, and a simple SQL example; and prompting techniques such as CoT (Chain of Thought), CoT-SC (Self-Consistency with Chain of Thought) and ToT (Tree of Thoughts) improve the quality of the model output, so the quality of the prompt statement is crucial.
To realize NL2SQL with a generative large language model, a target table must be matched from the candidate database according to the user's natural language input and sent to the model as part of the prompt statement; the accuracy of the prompt content determines the performance of the model. Because production databases contain large numbers of confusable and repeated columns, the tables differ in column count and their meta-information differs in length, so a matching method that directly computes the semantic similarity between the user input and the meta-information struggles to reach production-level accuracy.
Disclosure of Invention
In view of the above problems, the invention aims to provide an NL2SQL method based on a generative large language model, to solve the technical problem of the low matching accuracy of existing methods.
The invention adopts the following technical scheme:
the improved NL2SQL method based on the generative large language model is characterized by comprising the following steps:
s1, preprocessing table information of a database to obtain an annotated text feature vector set, and calculating weight of each column of a table;
s2, word segmentation is carried out on an input natural language question, and then a text feature vector set of the word segmentation is obtained through text embedding operation;
s3, calculating weighted cosine similarity according to the text feature vector set and the column weight of the notes and the segmented words, and finally obtaining a matching target table of the natural language question;
s4, splicing the list information of the matching target list into a prompt instruction, inputting a generated large language model, and outputting a large language model result;
and S5, extracting SQL sentences from the large language model result output and outputting the SQL sentences.
Further, the specific process of step S1 is as follows:
S11, obtain all table annotations and column annotations in the database, de-duplicate the column annotations, then apply the text embedding operation and normalization to the table and column annotations to obtain their text feature vector set U;
S12, calculate the weight of each column of the database, wherein the weight of column col_x is expressed as score(col_x) = (num(table) − num(col_x)) / (num(table) − 1), num(table) is the total number of tables in the database, and num(col_x) is the number of tables containing column col_x.
Further, the specific process of step S2 is as follows:
S21, remove stop words from the input natural language question L, then segment it to obtain word set C_1;
S22, obtain the expanded word set C_2, wherein the elements of the expanded word set C_2 are phrases composed of p adjacent words of word set C_1;
S23, merge word set C_1 and expanded word set C_2 to obtain the final word result set C of the natural language question L;
S24, obtain the text feature vector set V of the word result set C through the text embedding operation.
Further, the specific process of step S3 is as follows:
S31, traverse each vector of the text feature vector set V: for the currently acquired vector V_i, query the k vectors of the text feature vector set U with the highest cosine similarity to V_i and arrange them from high to low to form set K, in which the j-th vector is denoted K_j;
S32, for vector V_i, traverse set K: for the currently traversed vector K_j, if K_j corresponds to a table of the database, the weight is the preset table weight constant T, and if K_j corresponds to some column col_x of the database, the weight is the column weight score(col_x); calculate the weighted cosine similarity of vectors K_j and V_i as sim_w(ij) = sim(ij) × weight,
wherein sim(ij) is the cosine similarity of vectors K_j and V_i;
S33, when sim_w(ij) is less than or equal to the preset threshold P, continue traversing the next vector of set K;
S34, when sim_w(ij) is greater than the preset threshold P, if vector K_j corresponds to a table of the database, perform the update operation on that table, and if vector K_j corresponds to a column of the database, obtain all tables containing that column and perform the update operation on each of them;
the update operation is as follows: if the table is in the key set of the matching-table score set tableMaxScore and the value stored for it is smaller than sim_w(ij), update that value to sim_w(ij); if the table is not in the key set, add a new key-value pair to tableMaxScore whose key is the table name of the current table and whose value is sim_w(ij);
S35, for vector V_i, after set K has been traversed, for each key-value pair (key, val) of the matching-table score set tableMaxScore: if key is in the key set of the target-table score set tableSumScore of the natural language question L, increase the value stored under that key in tableSumScore by val; otherwise, add the current key-value pair to tableSumScore;
S36, after the text feature vector set V has been traversed, sort the target-table score set tableSumScore by value from high to low and output the matched target table result.
The beneficial effects of the invention are as follows: the improved NL2SQL method based on a large language model automatically generates high-quality prompt statements for the specific NL2SQL task. Specifically, candidate target tables are matched from the input question, the relevant table and column information of the target table is organized into a prompt statement, the pre-trained large language model is invoked, and the SQL query statement is output. On the one hand, the invention exploits the strong semantic understanding capability of generative large language models, avoids the complex intermediate steps of prior methods, and improves matching speed; through the prompting mechanism of the generative large language model, a chain of thought is constructed to resolve the large understanding deviations caused by unclear semantics in natural language questions, thereby improving matching accuracy. On the other hand, the invention adopts weighted table and column scores, which improves the accuracy of identifying the table and column information referenced in the natural language question and solves the low target-table matching accuracy caused by confusable or repeated columns in actual production.
Practical verification shows that the method outperforms both the multi-table matching accuracy reported in current open-source literature and the accuracy measured on open-source projects; it maintains high accuracy on fuzzy natural language inputs that do not affect the semantics, in real production environments containing confusable columns; its matching steps are simpler; and in terms of model response speed it is more efficient than approaches that construct complex chains or trees of thought.
Drawings
FIG. 1 is a flowchart of an improved NL2SQL method based on a generative large language model provided by an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
As shown in FIG. 1, the NL2SQL method based on a generative large language model provided by this embodiment includes the following steps:
Step S1, preprocess the table information of the database: preprocess the table information of the database to obtain the annotation text feature vector set, and calculate the weight of each column of each table.
This step mainly preprocesses the database table information. The specific process is as follows:
S11, obtain all table annotations and column annotations in the database, de-duplicate the column annotations, then apply the text embedding operation and normalization to the table and column annotations to obtain their text feature vector set U.
This step does not restrict the text embedding operation; options include, but are not limited to, one-hot encoding and pre-trained natural language models such as word2vec or BERT. In this embodiment the text embedding uses word2vec. This embodiment stores and retrieves the feature vectors with a Milvus database, but any other store that supports vector similarity retrieval may be used.
S12, calculate the weight of each column of the database, wherein the weight of column col_x is expressed as score(col_x) = (num(table) − num(col_x)) / (num(table) − 1), num(table) is the total number of tables in the database, and num(col_x) is the number of tables containing column col_x.
For example, in a database warehouse H with num(table) = 100 tables, if the number of tables containing column A is num(col_A) = 100, the number containing column B is num(col_B) = 1, and the number containing column C is num(col_C) = 30, then score(col_A) = 0.0, score(col_B) = 1.0, and score(col_C) = 0.71.
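For clarity, the preprocessing of S11-S12 can be sketched in Python as follows (a minimal sketch: embed is a hypothetical stand-in for the word2vec embedding, and a plain dictionary stands in for the Milvus vector store used in this embodiment):

import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # L2-normalize so that a dot product between vectors equals cosine similarity
    return v / np.linalg.norm(v)

def preprocess_database(tables: dict[str, list[str]], embed) -> tuple[dict, dict]:
    """tables maps each table annotation to the list of its column annotations.
    Returns the annotation vector set U and the column weight table (S11-S12)."""
    num_table = len(tables)
    # S11: embed every table annotation and every de-duplicated column annotation
    notes = set(tables)
    for cols in tables.values():
        notes.update(cols)
    U = {note: normalize(embed(note)) for note in notes}
    # S12: a column occurring in many tables is less discriminative, so its
    # weight falls; the weights are mapped onto the [0, 1] interval
    counts = {}
    for cols in tables.values():
        for col in set(cols):
            counts[col] = counts.get(col, 0) + 1
    col_weight = {col: (num_table - n) / max(num_table - 1, 1)
                  for col, n in counts.items()}
    return U, col_weight

With the example above (num(table) = 100), this sketch reproduces score(col_A) = 0.0, score(col_B) = 1.0 and score(col_C) = 70/99 ≈ 0.71.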
S2, preprocessing a natural language question: and segmenting the input natural language question, and obtaining a text feature vector set of the segmented word through text embedding operation.
The specific process of this step is as follows:
S21, remove stop words from the input natural language question L, then segment it to obtain word set C_1.
For example, for the natural language question L = "query the number of floating population requiring key attention", the word set C_1 after stop-word removal is ['need', 'focus', 'attention', 'floating population', 'quantity'].
S22, obtain the expanded word set C_2, wherein the elements of the expanded word set C_2 are phrases composed of p adjacent words of word set C_1.
For example, p is set to 2 in this embodiment, and the expanded set C_2 of the word set C_1 from step S21 is ['need focus', 'focus attention', 'floating population quantity'].
Since table and column annotations are generally phrases containing two or three words, while the words obtained by direct segmentation contain only a single word each, the computed cosine similarity is generally low because the word counts differ. The expanded-word method yields higher cosine similarities, which benefits the subsequent calculations.
S23, word segmentation set C 1 Expanded word segmentation set C 2 Merging to obtain a final word segmentation result set C of the natural language question L;
setting the final word segmentation result set of the natural language question L as C, wherein the set C is formed by C 1 And C 2 Combining, i.e. c=c 1 +C 2 The number of words in set C is recorded as num (C).
For example, c=c1+c2= [ ' need ', ' focus ', ' attention ', ' floating population ', ' quantity ', ' need focus ', ' focus attention ', ' floating population quantity ', '.
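As a sketch of S21-S23 (jieba and the stop-word list are assumptions, since the patent does not name a segmenter; note also that the worked example keeps fewer expanded phrases than a full p-gram expansion produces, suggesting additional filtering in practice):

import jieba  # assumed Chinese word segmenter; any segmenter works

STOP_WORDS = {"的", "了", "查询"}  # illustrative stop words, not from the patent

def build_word_set(question: str, p: int = 2) -> list[str]:
    # S21: remove stop words, then segment the question
    c1 = [w for w in jieba.lcut(question) if w not in STOP_WORDS and w.strip()]
    # S22: expand with phrases of p adjacent words, since table and column
    # annotations are usually multi-word phrases
    c2 = ["".join(c1[i:i + p]) for i in range(len(c1) - p + 1)]
    # S23: merge both sets into the final word result set C
    return c1 + c2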
S24, obtaining a text feature vector set V of the word segmentation result set C through text embedding operation.
And (3) obtaining a text feature vector set V of the word segmentation result set C of the natural language question L by adopting the same text embedding operation mode as that in the step (S11).
Step S3, matching a target table: and calculating weighted cosine similarity according to the text feature vector set and the column weight of the notes and the segmented words, and finally obtaining a matching target table of the natural language question.
The specific matching process of this step is as follows:
S31, traverse each vector of the text feature vector set V: for the currently acquired vector V_i, query the k vectors of the text feature vector set U with the highest cosine similarity to V_i and arrange them from high to low to form set K, in which the j-th vector is denoted K_j;
S32, for vector V_i, traverse set K: for the currently traversed vector K_j, if K_j corresponds to a table of the database, the weight is the preset table weight constant T, and if K_j corresponds to some column col_x of the database, the weight is the column weight score(col_x); calculate the weighted cosine similarity of vectors K_j and V_i as sim_w(ij) = sim(ij) × weight,
wherein sim(ij) is the cosine similarity of vectors K_j and V_i;
S33, when sim_w(ij) is less than or equal to the preset threshold P, continue traversing the next vector of set K;
S34, when sim_w(ij) is greater than the preset threshold P, if vector K_j corresponds to a table of the database, perform the update operation on that table, and if vector K_j corresponds to a column of the database, obtain all tables containing that column and perform the update operation on each of them;
the update operation is as follows: if the table is in the key set of the matching-table score set tableMaxScore and the value stored for it is smaller than sim_w(ij), update that value to sim_w(ij); if the table is not in the key set, add a new key-value pair to tableMaxScore whose key is the table name of the current table and whose value is sim_w(ij);
S35, for vector V_i, after set K has been traversed, for each key-value pair (key, val) of the matching-table score set tableMaxScore: if key is in the key set of the target-table score set tableSumScore of the natural language question L, increase the value stored under that key in tableSumScore by val; otherwise, add the current key-value pair to tableSumScore;
S36, after the text feature vector set V has been traversed, sort the target-table score set tableSumScore by value from high to low and output the matched target table result.
In summary: for each vector of the word text feature vector set V, the k most similar vectors of the annotation text feature vector set U are queried to form set K; for each vector of set K the weighted cosine similarity is calculated and compared against the threshold P to decide whether an update is needed, following the matching-table score set update procedure disclosed above. After set K has been traversed for each vector V_i, the target-table score set is updated. After set V has been traversed, the target-table score set is sorted by value from high to low and output.
In this embodiment, the target-table score set tableSumScore is a key-value object whose keys are table names and whose values are matching scores. The matching-table score set tableMaxScore is likewise a key-value object with table names as keys and matching scores as values.
The invention uses the table and column information referenced in the natural language question to improve target-table matching accuracy; particularly in production environments with confusable and repeated columns, the method performs distinctly better than directly computing semantic similarity. The column weight calculation accounts for occurrence counts: columns occurring in more tables receive lower weights, and the weights are mapped onto the [0, 1] interval, which benefits the subsequent weighted cosine similarity calculation.
In terms of algorithm logic, step S3 is implemented as follows:
S3101. Set i = 1 and let the target-table score set matched by the natural language question L be tableSumScore.
S3102. Obtain the i-th vector of the text feature vector set V, denoted V_i; query the k vectors of the text feature vector set U ranked top-k by cosine similarity to V_i, and arrange them from high to low into set K.
S3103. Set j = 1 and let the matching-table score set of V_i be tableMaxScore.
S3104. Denote the j-th vector of set K as K_j. If K_j corresponds to a table tb_x of the database (e.g. warehouse H), the weight is the preset table weight constant T; if K_j corresponds to a column col_x of the database, the column weight is score(col_x).
The cosine similarity of vectors K_j and V_i is denoted sim(ij).
Let the weighted cosine similarity of K_j and V_i be sim_w(ij), calculated as sim_w(ij) = sim(ij) × weight.
S3105. Compare sim_w(ij) with the threshold P: if sim_w(ij) is greater than P, execute S3106; otherwise execute S3108.
S3106. If K_j corresponds to a table of the database, execute S3107 on that table; if K_j corresponds to a column of the database, obtain all tables containing that column and execute S3107 on each of them in turn.
S3107. If the table is in the key set of the matching-table score set tableMaxScore, then tableMaxScore[table name] = max(tableMaxScore[table name], sim_w(ij)); otherwise tableMaxScore[table name] = sim_w(ij).
S3108. j = j + 1. If j <= k, i.e. set K is not fully traversed, execute S3104-S3107 for the next vector; otherwise execute S3109.
S3109. For each key-value pair (key, val) in tableMaxScore: if key is in the key set of tableSumScore, then tableSumScore[key] = tableSumScore[key] + val; otherwise tableSumScore[key] = val.
S3110. i = i + 1. If i <= num(C), i.e. set V is not fully traversed, execute S3102-S3109; otherwise execute S3111.
S3111. Sort the target-table score set tableSumScore by value from high to low and output the target table matching result.
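Under the same assumptions as before (L2-normalized vectors and an in-memory dictionary U in place of Milvus), S3101-S3111 reduce to the following Python sketch; the hand-traced example below walks through exactly these steps:

import numpy as np

def match_tables(V, U, col_weight, col_to_tables, table_notes,
                 k=3, T=1.06, P=0.9):
    """V: word vectors of the question; U: annotation -> normalized vector;
    col_to_tables: column annotation -> tables containing that column;
    table_notes: the set of table annotations. Returns tableSumScore."""
    table_sum_score = {}
    notes = list(U)
    mat = np.stack([U[n] for n in notes])        # brute-force similarity search
    for v_i in V:                                 # S3102: top-k set K for V_i
        sims = mat @ v_i                          # cosine similarity (normalized)
        top = np.argsort(sims)[-k:][::-1]
        table_max_score = {}                      # S3103
        for j in top:                             # S3104: table or column weight
            note = notes[j]
            if note in table_notes:
                weight, tabs = T, [note]
            else:
                weight, tabs = col_weight[note], col_to_tables[note]
            sim_w = sims[j] * weight              # weighted cosine similarity
            if sim_w <= P:                        # S3105/S3108: below threshold
                continue
            for t in tabs:                        # S3106-S3107: update operation
                table_max_score[t] = max(table_max_score.get(t, 0.0), sim_w)
        for key, val in table_max_score.items():  # S3109: accumulate per table
            table_sum_score[key] = table_sum_score.get(key, 0.0) + val
    # S3111: sort by accumulated score, highest first
    return dict(sorted(table_sum_score.items(), key=lambda kv: -kv[1]))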
A specific example follows. Based on practical experience, k is set to 3, the table weight constant T to 1.06 and the preset threshold P to 0.9, which gives good results. For readability, the feature vector corresponding to each 'word' of the sets V and K is denoted W(word); e.g. the feature vector of 'need' is written W(need).
For the set C = ['need', 'focus', 'attention', 'floating population', 'quantity', 'need focus', 'focus attention', 'floating population quantity'], num(C) = 8.
S3101. i = 1, tableSumScore = {};
S3102. V_1 = W(need); querying U for the vectors whose cosine similarity to V_1 ranks in the top three gives K = (W(whether remediation is required), W(resolution deadline), W(service unit/place)), with cosine similarity scores 0.4902482032775879, 0.38061994314193726 and 0.34967041015625 respectively.
S3103. j = 1, tableMaxScore = {};
S3104. K_1 = W(whether remediation is required), which corresponds to the 'whether remediation is required' column of H with weight 1.0; sim(ij) = 0.4902482032775879, sim_w(ij) = sim(ij) × 1.0 = 0.4902482032775879.
S3105. sim_w(ij) < 0.9, so execute S3108.
S3108. j = 2; j <= 3, execute S3104-S3107.
After S3104-S3107 have been repeated 3 times in total, tableMaxScore = {};
S3109. tableSumScore = {};
S3110. i = i + 1 = 2. i <= num(C), execute S3102-S3109.
S3102. V_2 = W(focus); after S3104-S3107 have been repeated 3 times in total, tableMaxScore = {};
S3109. tableSumScore = {};
S3110. i = i + 1 = 3. i <= num(C), execute S3102-S3109.
S3102. V_3 = W(attention); after S3104-S3107 have been repeated 3 times in total, tableMaxScore = {};
S3109. tableSumScore = {};
S3110. i = i + 1 = 4. i <= num(C), execute S3102-S3109.
S3102. V_4 = W(floating population);
S3103. j = 1, tableMaxScore = {};
S3104. K_1 = W(floating population information table), which corresponds to the 'floating population information table' table of H with weight 1.06; sim(ij) = 0.9087840616703033, sim_w(ij) = sim(ij) × 1.06 = 0.9633111053705216.
S3105. sim_w(ij) is greater than P.
S3107. 'floating population information table' is not in the key set of tableMaxScore, so tableMaxScore['floating population information table'] = 0.9633111053705216; at this point tableMaxScore = {'floating population information table': 0.9633111053705216}.
S3108. j = 2. j <= 3, execute S3104-S3107.
After S3104-S3107 have been repeated 3 times in total, tableMaxScore = {'floating population information table': 0.9633111053705216}.
S3109. tableSumScore = {'floating population information table': 0.9633111053705216};
S3110. i = i + 1 = 5. i <= num(C), execute S3102-S3109.
S3102. V_5 = W(quantity); after S3104-S3107 have been repeated 3 times in total, tableMaxScore = {};
S3109. tableSumScore = {'floating population information table': 0.9633111053705216};
S3110. i = i + 1 = 6. i <= num(C), execute S3102-S3109.
S3102. V_6 = W(need focus); after S3104-S3107 have been repeated 3 times in total, tableMaxScore = {};
S3109. tableSumScore = {'floating population information table': 0.9633111053705216};
S3110. i = i + 1 = 7. i <= num(C), execute S3102-S3109.
S3102. V_7 = W(focus attention); after S3104-S3107 have been repeated 3 times in total, tableMaxScore = {'floating population information table': 0.9235531017184258, 'key teenager information table': 0.90173406124115}.
S3109. tableSumScore = {'floating population information table': 1.8868642070889474, 'key teenager information table': 0.90173406124115};
S3110. i = i + 1 = 8. i <= num(C), execute S3102-S3109.
S3102. V_8 = W(floating population quantity); after S3104-S3107 have been repeated 3 times in total, tableMaxScore = {'floating population information table': 0.9550383374094964}.
S3109. tableSumScore = {'floating population information table': 2.8419025444984438, 'key teenager information table': 0.90173406124115};
S3110. i = i + 1 = 9. i > num(C), execute S3111.
S3111. Sorting tableSumScore from high to low gives tableSumScore = {'floating population information table': 2.8419025444984438, 'key teenager information table': 0.90173406124115}, which is output as the target table matching result.
Step S4, obtain the large language model output: splice the table and column information of the matched target table into a prompt instruction, input it to the generative large language model, and obtain and output the large language model result.
For example, prompt instruction = "There is a floating population information table, the table name is p_floating, and the fields include: UID, P_MARK_ID, FLOW_REASON, REGISTER_DATE, RESIDENCE_TYPE, IS_FORM, SOURCE_TYPE, wherein the unique identifier column is named UID, the person ID column is named P_MARK_ID, the inflow reason column is named FLOW_REASON, the registration date column is named REGISTER_DATE, the residence type column is named RESIDENCE_TYPE, the key-attention flag column is named IS_FORM, and the data source column is named SOURCE_TYPE. Please output the SQL statement according to the question. Question: query the number of floating population requiring key attention."
Large language model output: "The following SQL statement can be used to query the number of floating population requiring key attention: SELECT COUNT(*) FROM p_floating WHERE IS_FORM=1. This SQL statement uses a conditional query to filter out the floating population satisfying the condition, and it uses the COUNT function to calculate the number of floating population requiring key attention."
It should be noted that the prompt above only specifies the content that must be included; the order and format of the prompt content are not strictly prescribed, examples such as "SELECT * FROM p_floating" may be added to the prompt, and other descriptive content may be added as well.
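As a sketch of the prompt assembly in S4 (the template wording is an assumption; per the note above, only the content matters, not the exact format):

def build_prompt(table_note: str, table_name: str, columns, question: str) -> str:
    # columns: list of (field_name, field_annotation) pairs of the matched table
    fields = ", ".join(name for name, _ in columns)
    column_notes = " ".join(
        f"The column for '{note}' is named {name}." for name, note in columns
    )
    return (
        f"There is a {table_note}, the table name is {table_name}, "
        f"and the fields include: {fields}. {column_notes} "
        f"Please output the SQL statement according to the question. "
        f"Question: {question}"
    )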
Step S5, extract and output the SQL statement: extract the SQL statement from the large language model output and output it.
For example, this embodiment uses a regular expression to extract the SQL statement from the large language model output; other extraction methods may also be used, which is not limited here.
The final output is SELECT COUNT(*) FROM p_floating WHERE IS_FORM=1.
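A sketch of the regular-expression extraction of S5 (the pattern is an assumption; it presumes the model terminates the statement with a semicolon or a line break):

import re

def extract_sql(model_output: str) -> str | None:
    # Capture the first SELECT statement up to a semicolon or line break;
    # multi-line statements would need a looser pattern.
    m = re.search(r"SELECT\b[^;\n]+", model_output, re.IGNORECASE)
    return m.group(0).strip() if m else None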
If the user determines that the SQL statement is incorrect, the user may be prompted to supply additional information. The initial natural language question L, the large language model output of step S5 (as history information) and the user-supplied information are spliced into a prompt instruction and input to the large language model together, which outputs a corrected SQL statement. For natural language questions whose wording is unclear or ambiguous, the correct answer can be approached step by step in this way.
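A sketch of this correction loop (the message layout is an assumption; the patent only requires that the question, the previous output and the user's supplement be spliced into one prompt):

def build_correction_prompt(question: str, previous_sql: str, supplement: str) -> str:
    # Splice the history and the user-supplied information into a follow-up prompt
    return (
        f"Question: {question}\n"
        f"Previously generated SQL: {previous_sql}\n"
        f"The user points out: {supplement}\n"
        "Please output the corrected SQL statement."
    )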
Finally, the effect of the method is verified with a practical application example.
The test database warehouse contains 56 tables, including a left-behind personnel information table and a key teenager information table, and the tables contain confusable and repeated columns.
The left-behind personnel information table is as follows:
CREATE TABLE `p_people` (
  `UID` bigint(20) NOT NULL COMMENT 'unique identifier',
  `P_MARK_ID` varchar(50) NOT NULL COMMENT 'person ID',
  `NAME` varchar(50) NOT NULL COMMENT 'name',
  `REAR_TYPE` varchar(5) NOT NULL COMMENT 'left-behind type',
  `IS_CONSISTENT` varchar(2) NOT NULL COMMENT 'user identity',
  `HEALTH` varchar(5) NOT NULL COMMENT 'health status',
  `HELPER` varchar(50) DEFAULT NULL COMMENT 'helper person ID',
  `DIFFICULT` varchar(1024) DEFAULT NULL COMMENT 'difficulties and appeals',
  `HELP_CONDITION` varchar(1024) DEFAULT NULL COMMENT 'help situation',
  `SOURCE_TYPE` varchar(20) DEFAULT NULL COMMENT 'data source',
  `REMARK` varchar(200) DEFAULT NULL COMMENT 'remarks',
  `CREATER` varchar(50) NOT NULL COMMENT 'creator',
  `CREATETIME` date NOT NULL COMMENT 'creation time',
  `UPDATER` varchar(50) DEFAULT NULL COMMENT 'modifier',
  `UPDATETIME` date DEFAULT NULL COMMENT 'modification time',
  `IS_VALID` int(11) DEFAULT NULL COMMENT 'whether valid',
  `EXPIRED_DATE` date DEFAULT NULL COMMENT 'expiry time',
  PRIMARY KEY (`UID`),
  KEY `P_REAR_FK1` (`P_MARK_ID`),
  CONSTRAINT `P_REAR_FK1` FOREIGN KEY (`P_MARK_ID`) REFERENCES `p_person_info` (`P_MARK_ID`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='left-behind personnel information table';
the key teenager table pattern is as follows:
CREATE TABLE `p_teenager` (
  `UID` bigint(20) NOT NULL COMMENT 'unique identifier',
  `P_MARK_ID` varchar(50) NOT NULL COMMENT 'person ID',
  `TEENAGER_TYPE` varchar(5) DEFAULT NULL COMMENT 'teenager type',
  `FAMILY_STATUS` varchar(200) DEFAULT NULL COMMENT 'family conditions',
  `GUARDIAN_NAME` varchar(50) DEFAULT NULL COMMENT 'guardian name',
  `GUARDIAN_RELATED` varchar(5) DEFAULT NULL COMMENT 'relationship with guardian',
  `HELPER` varchar(50) DEFAULT NULL COMMENT 'helper person ID',
  `HELP_METHOD` varchar(5) DEFAULT NULL COMMENT 'help method',
  `HELP_CONDITION` varchar(1024) DEFAULT NULL COMMENT 'help situation',
  `SOURCE_TYPE` varchar(20) DEFAULT NULL COMMENT 'data source',
  `REMARK` varchar(200) DEFAULT NULL COMMENT 'remarks',
  `CREATER` varchar(50) NOT NULL COMMENT 'creator',
  `CREATETIME` date NOT NULL COMMENT 'creation time',
  `UPDATER` varchar(50) DEFAULT NULL COMMENT 'modifier',
  `UPDATETIME` date DEFAULT NULL COMMENT 'modification time',
  `IS_VALID` int(11) DEFAULT NULL COMMENT 'whether valid',
  `EXPIRED_DATE` date DEFAULT NULL COMMENT 'expiry time',
  PRIMARY KEY (`UID`),
  KEY `P_TEENAGER_FK1` (`P_MARK_ID`),
  CONSTRAINT `P_TEENAGER_FK1` FOREIGN KEY (`P_MARK_ID`) REFERENCES `p_person_info` (`P_MARK_ID`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='key teenager information table';
the matching results of word2vec and the method of the present invention for natural language questions "number of floating population that requires major attention for query" are shown in the following table.
word2vec:{'p_teenager': 0.8875935077667236}
The method of the invention outputs { 'p_floating': 2.841902544498444}.
The target table should be the floating population information table p_floating, and the method of the present invention gives the correct result.
When matching the target table, the invention calculates weighted table and column scores and comprehensively uses the table and column information of the natural language question, which improves target-table matching accuracy. The generative large language model used by the invention has strong semantic understanding capability, and through its prompting mechanism the accuracy of SQL generation for the specific NL2SQL task can be improved.
The method understands the natural language question and generates the query language with a pre-trained generative large language model. Its advantage is that the conversion relies on the strong semantic understanding and generation capability of the model, no large amount of labeled training data is needed, and the prompting mechanism can guide the generated result toward the expected target. Based on zero-shot learning, the method migrates easily to inference over new data structures and adapts to table structure changes in actual production environments.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (4)

1. An improved NL2SQL method based on a generative large language model, the method comprising the steps of:
S1, preprocess the table information of the database to obtain the annotation text feature vector set, and calculate the weight of each column of each table;
S2, segment the input natural language question into words, then obtain the text feature vector set of the words through a text embedding operation;
S3, calculate weighted cosine similarities from the annotation and word text feature vector sets and the column weights, finally obtaining the matched target table of the natural language question;
S4, splice the table and column information of the matched target table into a prompt instruction, input it to the generative large language model, and obtain the large language model output;
S5, extract the SQL statement from the large language model output and output it.
2. The improved NL2SQL method based on the generative large language model of claim 1, wherein the specific procedure of step S1 is as follows:
S11, obtain all table annotations and column annotations in the database, de-duplicate the column annotations, then apply the text embedding operation and normalization to the table and column annotations to obtain their text feature vector set U;
S12, calculate the weight of each column of the database, wherein the weight of column col_x is expressed as score(col_x) = (num(table) − num(col_x)) / (num(table) − 1), num(table) is the total number of tables in the database, and num(col_x) is the number of tables containing column col_x.
3. The improved NL2SQL method based on the generative large language model of claim 2, wherein the specific procedure of step S2 is as follows:
S21, remove stop words from the input natural language question L, then segment it to obtain word set C_1;
S22, obtain the expanded word set C_2, wherein the elements of the expanded word set C_2 are phrases composed of p adjacent words of word set C_1;
S23, merge word set C_1 and expanded word set C_2 to obtain the final word result set C of the natural language question L;
S24, obtain the text feature vector set V of the word result set C through the text embedding operation.
4. The improved NL2SQL method based on the generative large language model of claim 3, wherein the specific procedure of step S3 is as follows:
S31, traverse each vector of the text feature vector set V: for the currently acquired vector V_i, query the k vectors of the text feature vector set U with the highest cosine similarity to V_i and arrange them from high to low to form set K, in which the j-th vector is denoted K_j;
S32, for vector V_i, traverse set K: for the currently traversed vector K_j, if K_j corresponds to a table of the database, the weight is the preset table weight constant T, and if K_j corresponds to some column col_x of the database, the weight is the column weight score(col_x); calculate the weighted cosine similarity of vectors K_j and V_i as sim_w(ij) = sim(ij) × weight,
wherein sim(ij) is the cosine similarity of vectors K_j and V_i;
S33, when sim_w(ij) is less than or equal to the preset threshold P, continue traversing the next vector of set K;
S34, when sim_w(ij) is greater than the preset threshold P, if vector K_j corresponds to a table of the database, perform the update operation on that table, and if vector K_j corresponds to a column of the database, obtain all tables containing that column and perform the update operation on each of them;
the update operation is as follows: if the table is in the key set of the matching-table score set tableMaxScore and the value stored for it is smaller than sim_w(ij), update that value to sim_w(ij); if the table is not in the key set, add a new key-value pair to tableMaxScore whose key is the table name of the current table and whose value is sim_w(ij);
S35, for vector V_i, after set K has been traversed, for each key-value pair (key, val) of the matching-table score set tableMaxScore: if key is in the key set of the target-table score set tableSumScore of the natural language question L, increase the value stored under that key in tableSumScore by val; otherwise, add the current key-value pair to tableSumScore;
S36, after the text feature vector set V has been traversed, sort the target-table score set tableSumScore by value from high to low and output the matched target table result.
CN202311070932.9A 2023-08-24 2023-08-24 Improved NL2SQL method based on large language model Active CN116821168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311070932.9A CN116821168B (en) 2023-08-24 2023-08-24 Improved NL2SQL method based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311070932.9A CN116821168B (en) 2023-08-24 2023-08-24 Improved NL2SQL method based on large language model

Publications (2)

Publication Number Publication Date
CN116821168A true CN116821168A (en) 2023-09-29
CN116821168B CN116821168B (en) 2024-01-23

Family

ID=88127775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311070932.9A Active CN116821168B (en) 2023-08-24 2023-08-24 Improved NL2SQL method based on large language model

Country Status (1)

Country Link
CN (1) CN116821168B (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688394A (en) * 2019-09-29 2020-01-14 浙江大学 NL generation SQL method for novel power supply urban rail train big data operation and maintenance
CN111324631A (en) * 2020-03-19 2020-06-23 成都海天数联科技有限公司 Method for automatically generating sql statement by human natural language of query data
US20200293617A1 (en) * 2019-03-14 2020-09-17 International Business Machines Corporation Predictive natural language rule generation
US20200334252A1 (en) * 2019-04-18 2020-10-22 Sap Se Clause-wise text-to-sql generation
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN113111158A (en) * 2021-04-14 2021-07-13 杭州电子科技大学 Intelligent data visualization oriented conversational question-answering implementation method
CN114547329A (en) * 2022-01-25 2022-05-27 阿里巴巴(中国)有限公司 Method for establishing pre-training language model, semantic analysis method and device
US11520815B1 (en) * 2021-07-30 2022-12-06 Dsilo, Inc. Database query generation using natural language text
CN115658729A (en) * 2022-11-02 2023-01-31 广东工业大学 Method for converting natural language into SQL (structured query language) statement based on pre-training model
CN116010575A (en) * 2023-01-19 2023-04-25 桂林电子科技大学 Dialogue generation method integrating basic knowledge and user information
CN116127020A (en) * 2023-03-03 2023-05-16 北京百度网讯科技有限公司 Method for training generated large language model and searching method based on model

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200293617A1 (en) * 2019-03-14 2020-09-17 International Business Machines Corporation Predictive natural language rule generation
US20200334252A1 (en) * 2019-04-18 2020-10-22 Sap Se Clause-wise text-to-sql generation
CN110688394A (en) * 2019-09-29 2020-01-14 浙江大学 NL generation SQL method for novel power supply urban rail train big data operation and maintenance
CN111324631A (en) * 2020-03-19 2020-06-23 成都海天数联科技有限公司 Method for automatically generating sql statement by human natural language of query data
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN113111158A (en) * 2021-04-14 2021-07-13 杭州电子科技大学 Intelligent data visualization oriented conversational question-answering implementation method
US11520815B1 (en) * 2021-07-30 2022-12-06 Dsilo, Inc. Database query generation using natural language text
US20230037077A1 (en) * 2021-07-30 2023-02-02 Dsilo, Inc. Database generation from natural language text documents
CN114547329A (en) * 2022-01-25 2022-05-27 阿里巴巴(中国)有限公司 Method for establishing pre-training language model, semantic analysis method and device
CN115658729A (en) * 2022-11-02 2023-01-31 广东工业大学 Method for converting natural language into SQL (structured query language) statement based on pre-training model
CN116010575A (en) * 2023-01-19 2023-04-25 桂林电子科技大学 Dialogue generation method integrating basic knowledge and user information
CN116127020A (en) * 2023-03-03 2023-05-16 北京百度网讯科技有限公司 Method for training generated large language model and searching method based on model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117371406A (en) * 2023-10-07 2024-01-09 星环信息科技(上海)股份有限公司 Annotation generation method, device, equipment and medium based on large language model
CN117453717A (en) * 2023-11-06 2024-01-26 星环信息科技(上海)股份有限公司 Data query statement generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN116821168B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN108519890B (en) Robust code abstract generation method based on self-attention mechanism
CN116821168B (en) Improved NL2SQL method based on large language model
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112667818B (en) GCN and multi-granularity attention fused user comment sentiment analysis method and system
CN110390049B (en) Automatic answer generation method for software development questions
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN112632250A (en) Question and answer method and system under multi-document scene
CN114332519A (en) Image description generation method based on external triple and abstract relation
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN114742069A (en) Code similarity detection method and device
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN117634615A (en) Multi-task code retrieval method based on mode irrelevant comparison learning
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114579729A (en) FAQ question-answer matching method and system fusing multi-algorithm model
CN114239555A (en) Training method of keyword extraction model and related device
CN114282537A (en) Social text-oriented cascade linear entity relationship extraction method
Liu et al. The BERT-BiLSTM-CRF question event information extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant