CN117851445A - Large language model Text2SQL chart generation method and device - Google Patents

Large language model Text2SQL chart generation method and device

Info

Publication number
CN117851445A
CN117851445A (application CN202410264178.0A)
Authority
CN
China
Prior art keywords
model
text2sql
architecture
query
sql
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410264178.0A
Other languages
Chinese (zh)
Inventor
王宾
李照川
张峰
张尧臣
张悦
李捷明
王飞
林浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Technology Co Ltd
Original Assignee
Inspur Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Technology Co Ltd filed Critical Inspur Software Technology Co Ltd
Priority to CN202410264178.0A
Publication of CN117851445A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • G06F16/287Visualization; Browsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and in particular to a large language model Text2SQL chart generation method and device. A Text2SQL large model architecture is selected and the model is fine-tuned with a LoRA-based method so that it masters the ability to understand text and convert it into the SQL language; then a first-stage Text2SQL prompt word engineering architecture is designed using the LangChain development framework; finally, a chart generation task understanding prompt word architecture is selected to automatically capture and analyze new requirement problems. Compared with the prior art, the method and device allow users to query and generate data through natural language, greatly simplifying database operations and improving working efficiency.

Description

Large language model Text2SQL chart generation method and device
Technical Field
The invention relates to the technical field of data processing, and in particular provides a large language model Text2SQL chart generation method and device.
Background
At present, there are two main solutions to the data query and processing problem in the field of industry digital governance applications, but each has its own problems and challenges. The first solution is the traditional SQL query, a data query and processing method that has long been used in industry digital governance applications. Its core is to rely on a relational database structure and to retrieve and manipulate data by writing structured SQL query statements. To ensure the accuracy and effectiveness of queries, operation and maintenance personnel and industry personnel need a certain grounding in the SQL language. For example, an operator needs a deep understanding of the database table structure, including the relationships between tables and the data types and meanings of fields. They also need to master the writing of SQL query statements, from basic insert, delete, update, and query operations to more advanced operations such as aggregation, joins, and sub-queries. This is a difficult task for non-technical staff, because the SQL language itself involves a degree of complexity and specialized expertise. In addition, the traditional SQL query method has scalability problems. As industry digital governance applications continue to develop, the data volume keeps growing and the data structure may also change. In this case, the original SQL query statements may require extensive modification and optimization to accommodate the new requirements. This undoubtedly increases the difficulty and cost of operation and maintenance, and also limits the development speed of industry digital governance applications.
The second solution attempts to solve the data query and processing problem with Natural Language Processing (NLP) technology and the BERT language model architecture. The core idea is to train a language model so that it can understand and analyze a user's natural language query and then convert it into the corresponding SQL query statement. Although this approach is attractive in theory, it faces a number of challenges in practical applications. First, there is a natural semantic gap between natural language and SQL. While advanced language models such as BERT have made significant progress in natural language understanding, they still struggle to fully understand and accurately translate complex query intent. For example, for complex requirements involving multi-table join queries, nested queries, or aggregation operations, NLP-based approaches often fail to generate the correct SQL statement. Second, NLP-based approaches also have significant shortcomings in generalization ability. Due to limitations of the training data and of the model itself, these methods are often difficult to adapt to new database structures or query requirements. When faced with new scenarios that differ from the training data distribution, the performance of the model may degrade significantly, to the point of failing to generate valid SQL statements.
In traditional data processing, operators must master technical languages such as SQL to perform database queries, which is a challenge for industry personnel who are not familiar with computer technology.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a highly practical large language model Text2SQL chart generation method.
A further aim of the invention is to provide a large language model Text2SQL chart generation device that is reasonably designed, safe, and applicable.
The technical scheme adopted for solving the technical problems is as follows:
a large language model Text2SQL chart generation method: first, a Text2SQL large model architecture is selected and the model is fine-tuned with a LoRA-based method so that it masters the ability to understand text and convert it into the SQL language;
then a first-stage Text2SQL prompt word engineering architecture is designed using the LangChain development framework; finally, a chart generation task understanding prompt word architecture is selected to automatically capture and analyze new requirement problems.
Further, in the Text2SQL large model architecture, the specific design is as follows:
a1, designing a natural language encoder;
a2, designing a structured encoder;
a3, designing a Text2SQL architecture decoder;
a4, model fine-tuning;
a5, training and evaluation.
Further, in step A1, the NL Encoder accepts the entire text sequence and encodes it into a sequence of hidden states;
the text sequence is produced from the unstructured text in a preprocessing stage as a set of non-overlapping text spans; the words in each span are first converted into word vectors, and the type of each span is also converted into a one-hot vector;
a one-hot vector is a vector representation in which one element is 1 and all other elements are 0; it is commonly used to represent the uniqueness of a categorical variable;
finally, the mean of each span's word vectors is taken as the span embedding vector; the NL Encoder then feeds all span embedding vectors into a bidirectional long short-term memory network (Bi-LSTM), and the forward and backward output hidden states of the Bi-LSTM are concatenated;
Bi-LSTM is a deep learning model that passes the data forward and backward through two independent LSTM networks and combines the information from both directions, so that it captures forward and backward dependencies in sequence data.
Further, in step A2, the Schema encoder first converts each word of a field name into a word vector and converts the field category vector into an embedding vector; the Schema encoder then takes the mean of the word vectors as the initial representation of the field; at the same time, an attention mechanism is performed over the span embedding vectors to obtain a context vector for the field;
finally, the Schema encoder represents each field as the sum of its initial embedding, its category embedding, and its context vector.
Further, in step A3, a structured query language model (SemQL) query is synthesized; when a field needs to be selected, it is first determined whether the field is selected from the internal memory or extracted directly from the database schema, and once a specific field is selected, it is removed from the database schema and recorded in the memory.
Further, in step A4, a basic large model in an INT4-quantized version is selected, and this basic large model is fine-tuned on the WikiSQL dataset using a LoRA-based fine-tuning architecture, a technique for efficiently fine-tuning large models;
a trainable low-rank decomposition matrix is injected into each Transformer layer, the new matrices are trained to adapt to the new data while keeping the number of updated parameters low, and finally the original weights and the adapted weights are combined to produce the final result.
Further, in step A5, the following two indexes are used to comprehensively evaluate the basic large model after Text2SQL fine-tuning;
first, index one is execution accuracy, which reflects the proportion of SQL queries generated by the model that execute correctly on the dataset;
second, index two is logical form accuracy, which measures the degree of match between the SQL query generated by the model and the standard SQL.
Further, in the first-stage Text2SQL prompt word engineering architecture, the specific operation steps are as follows:
b1, problem understanding:
first, the input query question is deeply analyzed; key information elements related to the database are identified and extracted from the question through the fine-tuned large language model, and the type of the queried target database is determined from the extracted key information;
b2, according to a preset prompt template format, the question text and the database schema information are concatenated to form a prompt that is semantically complete and gives explicit instructions;
b3, the model is expected to perform deep reasoning based on the received question and the database schema;
the model output is expected to contain the SQL query statement for the question, and this SQL statement is accurately extracted from the model's reasoning result;
b4, SQL execution and result feedback;
the SQL query statement generated in the previous stage is executed on the target database, and the query result is returned to the large model for second-stage deep reasoning; the first-stage architecture is designed to automatically query the industry data governance database and return the SQL query result to the large model.
Further, in the chart generation task understanding prompt word architecture, the specific steps include:
c1, a chart generator performs multi-modal output of the structured data generated by the upstream SQL query;
c2, after the SQL query is executed and the results are returned, the prompt word architecture receives the results and performs deep understanding again using the large model; in this process, new requirement problems are automatically captured and analyzed, and the prompt word architecture calls the corresponding chart generation module to convert the query result into an intuitive bar chart for output;
if the user does not state a particular output requirement, the prompt word architecture defaults to outputting a result list of the SQL query.
A large language model Text2SQL chart generation device comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to execute the large language model Text2SQL chart generation method.
Compared with the prior art, the large language model Text2SQL chart generation method and device have the following outstanding beneficial effects:
the invention has wide application potential in the multi-element scene of industry digital treatment. By constructing an efficient database management query system and integrating rich knowledge question-answer library resources, powerful database retrieval and data analysis capabilities are provided for a large language model. The integrated solution not only greatly improves the industry data processing efficiency, but also reduces the threshold of complex inquiry and analysis, so that related staff can acquire the required information more conveniently.
The method provides a more intuitive and easier way of querying data for industry workers who lack programming or SQL skills. By simplifying the query flow and the operation interface, operation and maintenance personnel can easily handle data processing tasks in the digital governance field, reducing their workload and improving overall working efficiency.
By providing an efficient and intuitive data processing tool to assist the industry, the invention optimizes the data governance process and enhances data insight and decision-making capability. This not only helps improve the quality of public services and industry transparency, but also promotes efficient interaction and communication between the industry and the public.
Therefore, it not only advances the modernization of industry digital governance, but also provides the industry sector with a powerful data support tool that helps it process and analyze massive data resources in a more scientific and efficient manner.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a method for generating a large language model Text2SQL graph.
Detailed Description
In order to provide a better understanding of the aspects of the present invention, the present invention will be described in further detail with reference to specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A preferred embodiment is given below:
as shown in fig. 1, in the large language model Text2SQL chart generation method of this embodiment, first, a Text2SQL large model architecture is selected and the model is fine-tuned with a LoRA-based method so that it masters the ability to understand text and convert it into the SQL language;
then a first-stage Text2SQL prompt word engineering architecture is designed using the LangChain development framework; finally, a chart generation task understanding prompt word architecture is selected to automatically capture and analyze new requirement problems.
In the Text2SQL large model architecture, the specific design is as follows:
a1, designing a natural language encoder;
first, the natural language (NL) Encoder accepts the entire input text sequence and encodes it into a sequence of hidden states. The text sequence is produced from the unstructured text in a preprocessing stage as a set of non-overlapping text spans.
Thereafter, the words in each span are first converted into word vectors, and the type of each span is also converted into a one-hot vector. A one-hot vector is a vector representation in which one element is 1 and all other elements are 0; it is commonly used to represent the uniqueness of a categorical variable. Finally, the mean of each span's word vectors is taken as the span embedding vector.
Finally, the NL Encoder feeds all span embedding vectors into a bidirectional long short-term memory network (Bi-LSTM), and the forward and backward output hidden states of the Bi-LSTM are concatenated. Bi-LSTM is a deep learning model that passes the data forward and backward through two independent LSTM networks and combines the information from both directions, so that it can capture forward and backward dependencies in sequence data simultaneously.
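As a concrete illustration of this NL encoder (a minimal sketch only, not the patent's actual implementation; the module name, dimensions, and span-type encoding are assumptions), the following PyTorch code averages each span's word vectors, appends a one-hot span-type vector, and runs the span sequence through a Bi-LSTM whose forward and backward hidden states are concatenated:

```python
import torch
import torch.nn as nn

class NLEncoder(nn.Module):
    """Sketch of the natural-language encoder: span-averaged word vectors -> Bi-LSTM."""

    def __init__(self, vocab_size=30000, num_span_types=4, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.num_span_types = num_span_types
        # Bi-LSTM over span embeddings; forward/backward hidden states are concatenated.
        self.bilstm = nn.LSTM(emb_dim + num_span_types, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, spans, span_types):
        # spans: list of LongTensors, each holding the word ids of one span
        # span_types: LongTensor of shape (num_spans,) with the type id of each span
        span_vecs = []
        for word_ids, t in zip(spans, span_types):
            word_vecs = self.word_emb(word_ids)          # (span_len, emb_dim)
            span_mean = word_vecs.mean(dim=0)            # mean of the span's word vectors
            one_hot = torch.zeros(self.num_span_types)
            one_hot[t] = 1.0                             # one-hot span-type vector
            span_vecs.append(torch.cat([span_mean, one_hot]))
        x = torch.stack(span_vecs).unsqueeze(0)          # (1, num_spans, emb_dim + num_span_types)
        hidden_states, _ = self.bilstm(x)                # (1, num_spans, 2 * hidden_dim)
        return hidden_states.squeeze(0)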
A2, designing a structured encoder;
the Schema encoder first converts each word of a field name into a word vector and converts the field category vector into an embedding vector. The Schema encoder then takes the mean of the word vectors as the initial representation of the field. At the same time, an attention mechanism is performed over the span embedding vectors to obtain a context vector for the field. Finally, the Schema encoder represents each field as the sum of its initial embedding, its category embedding, and its context vector.
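The field representation just described can be sketched as follows (illustrative only; the function name, tensor shapes, and the simple dot-product attention are assumptions on top of the patent text):

```python
import torch
import torch.nn.functional as F

def encode_field(field_word_vecs, category_vec, span_embeddings):
    """field_word_vecs: (num_words, d); category_vec: (d,); span_embeddings: (num_spans, d)."""
    # Initial field representation: mean of the field-name word vectors.
    init_emb = field_word_vecs.mean(dim=0)        # (d,)
    # Attention of the field over the question span embeddings -> context vector.
    scores = span_embeddings @ init_emb           # (num_spans,)
    weights = F.softmax(scores, dim=0)            # attention weights
    context = weights @ span_embeddings           # (d,)
    # Field representation: initial embedding + category embedding + context vector.
    return init_emb + category_vec + context
```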
A3, designing a Text2SQL architecture decoder;
the goal is to synthesize a structured query language model (SemQL) query. The innovation of the invention is that it not only gives the tree structure of SemQL, but also accurately models the generation process of a SemQL query through the sequential application of a series of actions according to a set of rules based on natural language grammar.
In addition, an advanced memory-augmented pointer network mechanism is employed to implement complex selective query behavior. During operation of the decoder, when a field needs to be selected, the decoder first determines whether the field is selected from the internal memory or extracted directly from the database structure. This decision mechanism allows query generation to be handled more flexibly and accurately. Once a particular field is selected, it is removed from the database structure (schema) and recorded in the memory, thereby ensuring the logical consistency and accuracy of the query. This unique processing mode enables the decoder to effectively reduce errors and improve the efficiency and accuracy of query generation when handling complex queries.
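The select-from-memory-or-schema behaviour can be illustrated with the following simplified sketch (a hypothetical illustration, not the decoder itself): once a column is chosen it is removed from the schema and recorded in the memory, so a later selection of the same column is resolved from memory.

```python
def select_column(column, schema_columns, memory):
    """Illustrative memory-augmented selection: pick a column either from memory
    or directly from the database schema, then record it in memory."""
    if column in memory:
        source = "memory"                      # already selected earlier
    elif column in schema_columns:
        source = "schema"
        schema_columns.remove(column)          # remove from the schema ...
        memory.append(column)                  # ... and record it in memory
    else:
        raise ValueError(f"unknown column: {column}")
    return column, source

# Usage: selecting the same column twice resolves from schema first, then memory.
schema, mem = ["city", "population", "gdp"], []
print(select_column("population", schema, mem))   # ('population', 'schema')
print(select_column("population", schema, mem))   # ('population', 'memory')
```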
A4, model fine-tuning;
a basic large model in an INT4-quantized version is selected. After quantization, the model retains high performance while its GPU memory requirement is greatly reduced: it can run with only 10 GB of GPU memory, which greatly improves deployment flexibility and cost effectiveness in practical applications. The basic large model is fine-tuned on the WikiSQL dataset using a LoRA-based fine-tuning architecture, a technique for efficiently fine-tuning large models.
Compared with the traditional fine-tuning approach, it represents the weight update by a low-rank decomposition. In this approach, the original pre-trained model weights remain frozen, while a trainable low-rank decomposition matrix is injected into each Transformer layer. These new matrices are trained to adapt to the new data while keeping the number of updated parameters low. Finally, the original weights and the adapted weights are combined to produce the final result.
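For illustration, a LoRA fine-tuning setup of this kind can be written with the Hugging Face peft library roughly as follows; the base-model path, target module names, and hyperparameters are assumptions and are not specified in the patent.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "path/to/int4-quantized-base-model"   # placeholder; the patent does not name the base model
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

# Freeze the original weights and inject trainable low-rank matrices into each layer.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank decomposition (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # assumed attention projection names; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA matrices are trainable
# The adapted model is then fine-tuned on Text2SQL pairs (e.g. WikiSQL) with a standard trainer,
# and the LoRA weights are merged with the frozen base weights to produce the final model.
```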
A5, training and evaluation;
the following two indexes are used for a comprehensive evaluation of the basic large model after Text2SQL fine-tuning.
First, index one is execution accuracy, which mainly reflects the proportion of SQL queries generated by the model that execute correctly on the dataset; it should be noted that this index may be overestimated.
Second, index two is logical form accuracy, which mainly measures the degree of match between the SQL query generated by the model and the standard SQL; it should be noted that this index carries a certain risk of underestimation. By combining these two indexes, the performance of the model can be evaluated more comprehensively.
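These two indexes can be computed roughly as in the sketch below (illustrative only; the `run_query` callable and the example record format are assumptions): execution accuracy compares result sets, while logical form accuracy compares normalized SQL strings and may therefore underestimate, since semantically equivalent but differently written SQL counts as a mismatch.

```python
def execution_accuracy(examples, run_query):
    """Share of generated queries whose execution result matches the gold result."""
    correct = 0
    for ex in examples:
        try:
            pred_rows = run_query(ex["predicted_sql"])
            gold_rows = run_query(ex["gold_sql"])
            correct += pred_rows == gold_rows
        except Exception:          # invalid SQL counts as incorrect
            pass
    return correct / len(examples)

def logical_form_accuracy(examples):
    """Share of generated queries that match the gold SQL after simple normalization."""
    norm = lambda sql: " ".join(sql.lower().split())
    return sum(norm(ex["predicted_sql"]) == norm(ex["gold_sql"]) for ex in examples) / len(examples)
```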
In the first-stage Text2SQL prompt word engineering architecture, the specific operation steps are as follows:
B1, problem understanding;
first, the input query question is deeply analyzed, and the key information elements related to the database are identified and extracted through the fine-tuned large language model. Such key information includes, but is not limited to, database entity names, attributes, relationships, and query conditions; the type of the target database being queried is determined from the extracted key information. On this basis, the database schema corresponding to that database type is further determined to ensure the accuracy and validity of subsequent query operations.
B2, according to a preset prompt template format, the question text and the database schema information are concatenated to form a prompt that is semantically complete and gives explicit instructions. The prompt not only reflects the understanding of the question and the database schema, but also requires the model to output the SQL query statement, the query result, the answer to the question, and other related information in a specific format.
B3, the model is expected to perform deep reasoning based on the received question and database schema. The model output will contain a SQL query statement for the question, and this SQL statement is accurately extracted from the model's reasoning result. This design ensures the validity and relevance of the generated SQL and provides a solid foundation for subsequent data querying;
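A minimal sketch of steps B2 and B3 using the LangChain PromptTemplate class is given below; the template wording, variable names, and the regular expression used to extract the SQL are illustrative assumptions rather than the patent's actual prompt.

```python
import re
from langchain.prompts import PromptTemplate

# B2: splice the question and the database schema into a single prompt.
text2sql_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=(
        "You are a Text2SQL assistant.\n"
        "Database schema:\n{schema}\n\n"
        "Question: {question}\n"
        "Return the answer as a single SQL statement inside ```sql ...``` fences."
    ),
)

def build_prompt(schema: str, question: str) -> str:
    return text2sql_prompt.format(schema=schema, question=question)

# B3: extract the SQL statement from the model's reasoning output.
def extract_sql(model_output: str) -> str:
    match = re.search(r"```sql\s*(.+?)```", model_output, re.DOTALL | re.IGNORECASE)
    return (match.group(1) if match else model_output).strip()
```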
B4, SQL execution and result feedback;
in this stage, the SQL query statement generated in the previous stage is executed on the target database, and the query result is returned to the large model for second-stage deep reasoning.
In particular, the first-stage architecture is designed to automatically query the industry data governance database and return the SQL query result to the large model for deeper understanding and analysis. This mechanism effectively improves the model's ability to process query results and its overall reasoning effect.
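Step B4 can be sketched as follows (illustrative; sqlite3 stands in for the actual industry data governance database, and `llm` is a hypothetical callable wrapping the fine-tuned large model):

```python
import sqlite3

def execute_and_feed_back(db_path: str, sql: str, question: str, llm) -> str:
    """Run the generated SQL on the target database and hand the rows back to the
    large model for second-stage reasoning over the query result."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(sql).fetchall()
    finally:
        conn.close()
    feedback_prompt = (
        f"Question: {question}\n"
        f"Executed SQL: {sql}\n"
        f"Query result rows: {rows}\n"
        "Answer the question based on the query result."
    )
    return llm(feedback_prompt)
```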
In the chart generation task understanding prompt word architecture, the specific steps include:
c1, a chart generator;
it has the ability to perform multi-modal output of the structured data generated by the upstream SQL query. This design allows the system not only to process and parse the data, but also to present it to the user in an intuitive, understandable manner. Specifically, the system can generate various forms of visual charts from the query results, such as ranking tables, line charts, and bar charts.
C2, after the SQL query is executed and the results are returned, the prompt word architecture receives the results and performs deep understanding again using the large model.
In this process, the system can automatically capture and analyze new requirement problems; for example, the user may wish to present the query results in the form of a bar chart. The prompt word architecture then calls the corresponding chart generation module to convert the query result into an intuitive bar chart for output.
If the user does not state a particular output requirement, the prompt word architecture defaults to outputting a result list of the SQL query. This design ensures the flexibility and ease of use of the system, enabling the user to select different output modes according to actual needs.
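When the user asks for a bar chart, the chart generation module can be as simple as the following matplotlib sketch (the expected `(label, value)` row layout and the output file name are assumptions):

```python
import matplotlib.pyplot as plt

def render_bar_chart(rows, x_label="category", y_label="value", out_file="chart.png"):
    """Turn SQL result rows of the form [(label, numeric_value), ...] into a bar chart."""
    labels = [str(r[0]) for r in rows]
    values = [float(r[1]) for r in rows]
    plt.figure(figsize=(8, 4))
    plt.bar(labels, values)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.tight_layout()
    plt.savefig(out_file)      # the saved chart is returned to the user as the output
    plt.close()
    return out_file
```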
Based on the above method, a large language model Text2SQL chart generating device in this embodiment includes: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to execute the large language model Text2SQL chart generation method.
The above-mentioned specific embodiments are merely specific examples of the present invention, and the scope of the present invention is not limited to the specific embodiments, and any suitable changes or substitutions made by those skilled in the art, which conform to the technical solutions described in the claims of the present invention, should fall within the scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A large language model Text2SQL chart generation method, characterized in that first, a Text2SQL large model architecture is selected and the model is fine-tuned with a LoRA-based method so that it masters the ability to understand text and convert it into the SQL language;
then a first-stage Text2SQL prompt word engineering architecture is designed using the LangChain development framework; finally, a chart generation task understanding prompt word architecture is selected to automatically capture and analyze new requirement problems.
2. The large language model Text2SQL chart generation method according to claim 1, wherein in the Text2SQL large model architecture, the specific design is:
a1, designing a natural language encoder;
a2, designing a structured encoder;
a3, designing a Text2SQL architecture decoder;
a4, model fine-tuning;
a5, training and evaluation.
3. The method of claim 2, wherein in step A1, the NL Encoder accepts the entire text sequence and encodes it into a sequence of hidden states;
the text sequence is produced from the unstructured text in a preprocessing stage as a set of non-overlapping text spans; the words in each span are first converted into word vectors, and the type of each span is also converted into a one-hot vector;
a one-hot vector is a vector representation in which one element is 1 and all other elements are 0; it is commonly used to represent the uniqueness of a categorical variable;
finally, the mean of each span's word vectors is taken as the span embedding vector; the NL Encoder then feeds all span embedding vectors into a bidirectional long short-term memory network (Bi-LSTM), and the forward and backward output hidden states of the Bi-LSTM are concatenated;
Bi-LSTM is a deep learning model that passes the data forward and backward through two independent LSTM networks and combines the information from both directions, so that it captures forward and backward dependencies in sequence data.
4. The large language model Text2SQL chart generation method according to claim 3, wherein in step A2, the Schema encoder first converts each word of a field name into a word vector and converts the field category vector into an embedding vector; the Schema encoder then takes the mean of the word vectors as the initial representation of the field; at the same time, an attention mechanism is performed over the span embedding vectors to obtain a context vector for the field;
finally, the Schema encoder represents each field as the sum of its initial embedding, its category embedding, and its context vector.
5. The method according to claim 4, wherein in step A3, a structured query language model (SemQL) query is synthesized; when a field needs to be selected, it is first determined whether the field is selected from the internal memory or extracted directly from the database schema, and once a specific field is selected, it is removed from the database schema and recorded in the memory.
6. The large language model Text2SQL chart generation method of claim 5, wherein in step A4, a basic large model in an INT4-quantized version is selected, and this basic large model is fine-tuned on the WikiSQL dataset using a LoRA-based fine-tuning architecture, a technique for efficiently fine-tuning large models;
a trainable low-rank decomposition matrix is injected into each Transformer layer, the new matrices are trained to adapt to the new data while keeping the number of updated parameters low, and finally the original weights and the adapted weights are combined to produce the final result.
7. The large language model Text2SQL chart generation method according to claim 6, wherein in step A5, the following two indexes are used to comprehensively evaluate the basic large model after Text2SQL fine-tuning;
first, index one is execution accuracy, which reflects the proportion of SQL queries generated by the model that execute correctly on the dataset;
second, index two is logical form accuracy, which measures the degree of match between the SQL query generated by the model and the standard SQL.
8. The large language model Text2SQL chart generation method according to claim 7, wherein in the first-stage Text2SQL prompt word engineering architecture, the specific operation steps are as follows:
b1, problem understanding:
first, the input query question is deeply analyzed; key information elements related to the database are identified and extracted from the question through the fine-tuned large language model, and the type of the queried target database is determined from the extracted key information;
b2, according to a preset prompt template format, the question text and the database schema information are concatenated to form a prompt that is semantically complete and gives explicit instructions;
b3, the model is expected to perform deep reasoning based on the received question and the database schema;
the model output is expected to contain the SQL query statement for the question, and this SQL statement is accurately extracted from the model's reasoning result;
b4, SQL execution and result feedback;
the SQL query statement generated in the previous stage is executed on the target database, and the query result is returned to the large model for second-stage deep reasoning; the first-stage architecture is designed to automatically query the industry data governance database and return the SQL query result to the large model.
9. The large language model Text2SQL chart generation method according to claim 8, wherein in the chart generation task understanding prompt word architecture, the specific steps include:
c1, a chart generator performs multi-modal output of the structured data generated by the upstream SQL query;
c2, after the SQL query is executed and the results are returned, the prompt word architecture receives the results and performs deep understanding again using the large model; in this process, new requirement problems are automatically captured and analyzed, and the prompt word architecture calls the corresponding chart generation module to convert the query result into an intuitive bar chart for output;
if the user does not state a particular output requirement, the prompt word architecture defaults to outputting a result list of the SQL query.
10. A large language model Text2SQL chart generation device, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor being configured to invoke the machine readable program to perform the method of any of claims 1 to 9.
CN202410264178.0A 2024-03-08 2024-03-08 Large language model Text2SQL chart generation method and device Pending CN117851445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410264178.0A CN117851445A (en) 2024-03-08 2024-03-08 Large language model Text2SQL chart generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410264178.0A CN117851445A (en) 2024-03-08 2024-03-08 Large language model Text2SQL chart generation method and device

Publications (1)

Publication Number Publication Date
CN117851445A true CN117851445A (en) 2024-04-09

Family

ID=90548454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410264178.0A Pending CN117851445A (en) 2024-03-08 2024-03-08 Large language model Text2SQL chart generation method and device

Country Status (1)

Country Link
CN (1) CN117851445A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035320A (en) * 2024-04-10 2024-05-14 北京枫清科技有限公司 Data query method, device, equipment and medium for fusion database

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658729A (en) * 2022-11-02 2023-01-31 广东工业大学 Method for converting natural language into SQL (structured query language) statement based on pre-training model
CN117370378A (en) * 2023-09-15 2024-01-09 国网浙江省电力有限公司营销服务中心 Method, device, equipment and medium for converting natural language into database statement
CN117609470A (en) * 2023-12-08 2024-02-27 中科南京信息高铁研究院 Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658729A (en) * 2022-11-02 2023-01-31 广东工业大学 Method for converting natural language into SQL (structured query language) statement based on pre-training model
CN117370378A (en) * 2023-09-15 2024-01-09 国网浙江省电力有限公司营销服务中心 Method, device, equipment and medium for converting natural language into database statement
CN117609470A (en) * 2023-12-08 2024-02-27 中科南京信息高铁研究院 Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAQI GUO ET AL.: "Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation", ARXIV, 30 May 2019 (2019-05-30), pages 3 - 5 *
张宜成 (Zhang Yicheng): "Design and Implementation of an Intelligent Voice Interactive Chart Generation System" (智能语音交互式图表生成系统设计与实现), China Master's Theses Full-text Database, Information Science and Technology, vol. 2021, no. 02, 15 February 2021 (2021-02-15), pages 29 - 41 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035320A (en) * 2024-04-10 2024-05-14 北京枫清科技有限公司 Data query method, device, equipment and medium for fusion database

Similar Documents

Publication Publication Date Title
CN111159223B (en) Interactive code searching method and device based on structured embedding
US11410031B2 (en) Dynamic updating of a word embedding model
CN116701431A (en) Data retrieval method and system based on large language model
CN117851445A (en) Large language model Text2SQL chart generation method and device
Chen et al. Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes.
Barnett et al. Seven failure points when engineering a retrieval augmented generation system
CN118170894B (en) Knowledge graph question-answering method, knowledge graph question-answering device and storage medium
Shen et al. Toward best-effort information extraction
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
Liu et al. Toward a better alignment between the research and practice of code search engines
Kobdani et al. Relational feature engineering of natural language processing
Shaukat et al. Comment extraction using declarative crowdsourcing (CoEx Deco)
CN118070925B (en) Model training method, device, electronic equipment, storage medium and program product
Du et al. A new matchmaking approach based on abductive conjunctive query answering
Hettiarachchi et al. A Scenario-based ER Diagram and Query Generation Engine
CN118210818B (en) SQL sentence generation method, device, electronic equipment and storage medium
Ye et al. Structured Knowledge Base Q&A System Based on TorchServe Deployment
Shunmughavel et al. Semantic enrichment in ontology mapping using concept similarity computing
CN118484516A (en) Industry large model-oriented multi-level theme type search enhancement generation method and system
Samih et al. * Improving Natural Language Queries Search and Retrieval through Semantic Image Annotation Understanding
Radovanovic Introducing Natural Language Interface to Databases for Data-Driven Small and Medium Enterprises: This paper summarizes major challenges and current approaches in the context of constructing Natural Language Interfaces to Databases for data-driven small and medium enterprises.
Qiao et al. GTR: An SQL Generator With Transition Representation in Cross-Domain Database Systems
Tripathi et al. Generating Structured Database Queries Using Deeply-Bidirectional Natural Language Encodings
Asplund GENERATING SQL FROM NATURAL LANGUAGE IN FEW-SHOT AND ZERO-SHOT SCENARIOS
Dong Design of Translation Accuracy Correction Algorithm for English Translation Software Based on Semantic Relations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination