CN114547068A

CN114547068A - Data generation method, device, equipment and computer readable storage medium

Info

Publication number: CN114547068A
Application number: CN202210087092.6A
Authority: CN
Inventors: 耿瑞莹; 石翔; 黎槟华; 惠彬原; 李永彬; 孙健
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2022-01-25
Filing date: 2022-01-25
Publication date: 2022-05-27

Abstract

Embodiments of the application disclose a data generation method, apparatus, device, and computer-readable storage medium. The technical solution comprises: acquiring an SQL (Structured Query Language) syntax tree, wherein the SQL syntax tree is generated in advance using SQL syntax rules; generating one or more SQL statements by combining the SQL syntax tree with sampled table data; and generating, based on pre-configured utterance templates, a corresponding natural language text for each SQL statement to obtain a plurality of sample pairs, each sample pair consisting of an SQL statement and its corresponding natural language text. The application can automatically generate a large number of sample pairs of SQL statements and their natural language texts as training data for a semantic parsing model, which saves cost, improves efficiency, and facilitates large-scale adoption.

Description

Data generation method, device, equipment and computer readable storage medium

Technical Field

The present application relates to the field of computer application technologies, and in particular, to a data generation method, apparatus, device, and computer-readable storage medium.

Background

At present, intelligent question answering is widely applied in scenarios such as customer service, marketing, and service acquisition. It supplements the Graphical User Interface (GUI) to provide users with an efficient and personalized experience, and is even integrated into hardware devices such as smart speakers, smart home devices, and smart navigation systems. TableQA (table question answering) is one form of intelligent question answering and is widely used in various scenarios. In TableQA, knowledge is organized in the form of tables. For example, a merchant's product information is stored in a table; when a user asks about a specific attribute of a certain product, the question answering model can convert the natural language query entered by the user into an SQL (Structured Query Language) statement and then locate the answer in the table.

Natural language is usually converted into SQL by a semantic parsing model such as NL2SQL (Natural Language to Structured Query Language). Such a model requires a large amount of training data. If the training data is obtained by manual labeling, the cost is very high, the efficiency is low, and large-scale adoption is difficult.

Disclosure of Invention

In view of this, the present application provides a data generation method, apparatus, device and computer readable storage medium, so as to reduce the cost of obtaining training data of a semantic parsing model and improve the efficiency.

The application provides the following scheme:

In a first aspect, a data generation method is provided, the method including:

acquiring a Structured Query Language (SQL) syntax tree, wherein the SQL syntax tree is generated in advance by utilizing SQL syntax rules;

generating one or more SQL statements by combining the SQL syntax tree with sampled table data;

generating, based on pre-configured utterance templates, a corresponding natural language text for each SQL statement to obtain a plurality of sample pairs, each sample pair consisting of an SQL statement and its corresponding natural language text;

sampling at least one sample pair from the obtained plurality of sample pairs;

respectively taking each sampled sample pair as a sample pair of the first round of conversation to generate a sample pair of the subsequent N rounds of conversations, wherein N is a positive integer;

and forming a group of multi-turn dialogue sample pairs by using the sample pair of the first turn of dialogue and the sample pair of the subsequent N turns of dialogue.

According to one implementable manner in an embodiment, the SQL syntax tree comprises an operation part node and a condition part node;

the operation part node comprises an operation keyword, a focus parameter and an aggregation function;

the condition part nodes comprise condition keywords, condition parameters and condition operators;

wherein the focus parameter and/or the condition parameter indicate a corresponding table data type.

According to one implementation of an embodiment, generating one or more SQL statements using the SQL syntax tree in combination with the sampled table data comprises:

traversing the SQL syntax tree, and filling the traversed focus parameters and/or condition parameters with the sampled table data to obtain one or more SQL statements.

According to an implementation of an embodiment, generating corresponding natural language texts for the respective SQL statements based on the pre-configured utterance templates includes:

determining word-granularity utterance templates corresponding respectively to the focus parameters, condition parameters, aggregation functions and condition operators in an SQL statement;

combining the determined word-granularity utterance templates according to the logical relationships among the focus parameters, condition parameters, aggregation functions and condition operators in the SQL statement to obtain one or more phrase-granularity utterance templates;

combining the one or more phrase-granularity utterance templates according to the logical relationships among the phrases in the SQL statement to obtain one or more sentence-granularity utterance templates;

and determining the natural language text corresponding to the SQL statement from the one or more sentence-granularity utterance templates.

According to one implementation manner in the embodiment, the generating the sample pairs of the subsequent N-rounds of dialogues based on the sample pairs of the first-round dialogues includes:

taking the sample pair of the first round of conversation as the sample pair of the current round;

determining an editing strategy applicable to a sample pair of the current round from a preset editing strategy set;

sampling an editing strategy from the determined editing strategies, and editing the SQL sentences and the natural language texts in the sample pairs of the current round respectively by using the sampled editing strategies to obtain the edited SQL sentences and the edited natural language texts as the sample pairs of the next round of conversation;

and taking the edited sample pair of the next round of conversation as the sample pair of the current round, and switching to the operation of determining the editing strategy applicable to the sample pair of the current round from the preset editing strategy set until obtaining the sample pair of N rounds of conversation after the sample pair of the first round of conversation.

According to an implementable manner of an embodiment, the set of editing policies comprises at least one of the following editing policies:

add condition, modify condition, delete condition, modify focus, modify aggregation function, delete focus, restart, reject identification, and table switch.

In a second aspect, a data generation method is provided, the method including:

generating, based on pre-configured utterance templates, a corresponding natural language text for each SQL statement to obtain a plurality of sample pairs, each sample pair consisting of an SQL statement and its corresponding natural language text.

According to one implementation of an embodiment, populating the traversed focus parameters and/or condition parameters with the sampled table data includes:

filling the focus parameter with the sampled table name and/or column name; and/or

and filling the condition parameters by using the sampled column names and the sampled record values.

According to a third aspect, there is provided a method of acquiring training data, the method comprising:

obtaining sample pairs as training data, wherein the sample pairs comprise pairs of multi-turn dialog samples generated by the method of the first aspect or pairs of samples generated by the method of the second aspect;

the training data is used to train a table-based pre-trained language model or a table-based semantic parsing model.

According to a fourth aspect, there is provided a data generating apparatus, the apparatus comprising:

a syntax tree obtaining unit configured to obtain a Structured Query Language (SQL) syntax tree, wherein the SQL syntax tree is generated in advance by using SQL syntax rules;

an SQL statement generation unit configured to generate one or more SQL statements by combining the SQL syntax tree with sampled table data;

a text generation unit configured to generate, based on pre-configured utterance templates, a corresponding natural language text for each SQL statement to obtain a plurality of sample pairs, each sample pair consisting of an SQL statement and its corresponding natural language text;

a multi-round sample generation unit configured to sample at least one sample pair from the obtained plurality of sample pairs; respectively taking each sampled sample pair as a sample pair of the first round of conversation to generate a sample pair of the subsequent N rounds of conversations, wherein N is a positive integer; and forming a group of multi-turn dialogue sample pairs by using the sample pair of the first turn of dialogue and the sample pair of the subsequent N turns of dialogue.

According to a fifth aspect, there is provided a data generating apparatus, the apparatus comprising:

a text generation unit configured to generate, based on pre-configured utterance templates, a corresponding natural language text for each SQL statement to obtain a plurality of sample pairs, each sample pair consisting of an SQL statement and its corresponding natural language text.

According to a sixth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above first aspects.

According to a seventh aspect, there is provided an electronic device, comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the first aspects described above.

The following advantages may be achieved according to the specific embodiments provided in the present application:

1) The application can automatically generate a large number of sample pairs of SQL statements and their natural language texts as training data for a semantic parsing model, which saves cost, improves efficiency, and facilitates large-scale adoption.

2) The application ensures the accuracy of the natural language text by generating it from preset utterance templates, and ensures its diversity by dividing the utterance templates at a fine granularity.

3) The application can further generate multi-turn dialogue sample pairs on the basis of the obtained sample pairs, providing training data for training a semantic parsing model that supports multi-turn dialogue.

Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 illustrates an exemplary system architecture diagram to which embodiments of the present application may be applied;

FIG. 2 is a main flowchart of a data generation method provided by an embodiment of the present application;

FIG. 3 is a partial schematic content of an SQL syntax tree provided by an embodiment of the present application;

FIG. 4 is a flow chart of generating pairs of multi-turn dialog samples provided by an embodiment of the present application;

FIG. 5 shows a schematic block diagram of the data generation apparatus according to one embodiment;

FIG. 6 is a schematic architecture diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist, e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

FIG. 1 illustrates an exemplary system architecture to which embodiments of the present application may be applied.

As shown in FIG. 1, the system architecture may include terminal devices 101 and 102, a network 103, and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. The network 103 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

A user may interact with the server 104 through the network 103 using the terminal devices 101 and 102. Various applications, such as a voice interaction application, a web browser application, a communication application, etc., may be installed on the terminal devices 101 and 102.

The terminal devices 101 and 102 may be various electronic devices, with or without screens, including but not limited to smart phones, tablet computers, smart speakers, smart televisions, PCs (personal computers), wearable devices, and the like.

The server 104 may be a single server, a server group consisting of multiple servers, or a cloud server. A cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The data generation apparatus provided by the present application may be deployed and run in the server 104. It may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module, which is not specifically limited herein.

The data generation apparatus in the server 104 generates <SQL, Text> sample pairs in the manner provided by the present application, where SQL refers to an SQL statement and Text refers to the natural language text corresponding to that SQL statement. These sample pairs may be used as training data for a semantic parsing model. A semantic parsing model is an important component of models such as the TableQA model. After receiving a question from the terminal device 101 or the terminal device 102, the server 104 may obtain the corresponding answer using the TableQA model and return the answer to the terminal device that sent the question.

The questions and answers may be in text form or in speech form. If speech is used, the server 104 may further include corresponding speech processing components, such as modules for speech analysis and speech synthesis, which is not limited in this application.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

FIG. 2 is a main flowchart of a data generation method provided by an embodiment of the present application, which may be executed by the server 104 in the architecture shown in FIG. 1. Besides a server, other computer devices with strong computing power may also execute the method. As shown in FIG. 2, the method may comprise the following main steps:

Step 201: acquire an SQL syntax tree, wherein the SQL syntax tree is generated in advance using SQL syntax rules.

Step 202: generate one or more SQL statements by combining the SQL syntax tree with sampled table data.

Step 203: generate, based on pre-configured utterance templates, a corresponding natural language text for each SQL statement to obtain a plurality of sample pairs, each sample pair consisting of an SQL statement and its corresponding natural language text.

In this flow, SQL statements are automatically generated by combining the SQL syntax tree with sampled table data, and the natural language texts of the SQL statements are automatically generated based on pre-configured utterance templates, yielding a large number of sample pairs that can be used as training data for a semantic parsing model. Compared with manual labeling, this greatly saves cost, improves efficiency and facilitates large-scale adoption.

Further, in order to support model training in a multi-round interactive scenario, the above process may further include step 204: sampling at least one sample pair from the obtained multiple sample pairs, and taking each sampled sample pair as a sample pair of the first round of conversation respectively to generate a sample pair of the subsequent N rounds of conversations; and forming a group of multi-turn dialogue sample pairs by using the sample pair of the first turn of dialogue and the sample pair of the subsequent N turns of dialogue.

The following describes each step in the above-described flow in detail. First, the step 201 of "obtaining SQL syntax tree" will be described in detail with reference to the embodiments.

SQL is a standard computer language for accessing and operating database systems. SQL can execute queries against a database, retrieve data from a database, insert new records into a database, update data in a database, delete records from a database, create a new database, create new tables in a database, and so on. Most operations on a database can be performed through SQL statements.

SQL has specific syntax rules, and according to the SQL syntax rules, an abstract syntax tree can be generated, which contains possible syntax structures of the SQL statement. The SQL syntax tree mainly comprises operation part nodes and condition part nodes.

The operation part nodes mainly comprise operation keywords, focus parameters and aggregation functions. Operation keywords indicate specific operations, such as "select", "from", "update", "delete", etc. (SQL keywords are case-insensitive). An aggregation function is a specific function name, such as avg for the average, max for the maximum value, min for the minimum value, and so on. A focus parameter represents the focus of the SQL query; in the SQL syntax tree, a focus parameter node may indicate the corresponding table data type, such as table name or column name.

The condition part nodes mainly comprise condition keywords, condition parameters and condition operators. A condition keyword is a keyword that indicates a condition, such as "where". A condition operator may be a specific function name, a logical operator, a mathematical operator, or the like, for example and, or, not, >, <, and so on.

FIG. 3 shows part of an SQL syntax tree according to an embodiment of the present application. As shown in FIG. 3, a select statement may include an operation part node and a condition part node. The operation part node comprises the operation keywords "select" and "from". The fields section after select may include the focus parameter "field name", which may contain indication information indicating the corresponding column name. The fields section may also include function names function(), specifically functionA(), functionB(), functionC() and the like. The tables section after from may include the focus parameter "table", which may contain indication information indicating the corresponding table name. The tables section may also include other select statements. The condition part node includes the condition keyword "where". The condition section after where may include specific condition parameters such as expA, expB, expC, etc., which may contain indication information indicating the corresponding column names. The condition section may also include operators such as and, or, not, >, <, and so on.

Some parts may be marked as optional in the SQL syntax tree, such as the condition part and some keywords of the operation part. It should be noted that FIG. 3 is only schematic; a specific SQL syntax tree may adopt other structures, but the underlying principle is similar.
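The structure described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the tree of FIG. 3: the class name Node, the kind labels, and the helper functions build_select_tree and parameter_slots are hypothetical names introduced here, and the tree covers only a single select statement.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a simplified SQL syntax tree (hypothetical structure)."""
    kind: str               # e.g. "keyword", "focus", "aggregation", "condition_param", "operator"
    value: str              # keyword text, or the table data type a parameter indicates
    optional: bool = False  # True if the node may be omitted from a statement
    children: List["Node"] = field(default_factory=list)


def build_select_tree() -> Node:
    """Build a tiny tree for: select [agg] column from table [where column op value]."""
    operation = Node("part", "operation", children=[
        Node("keyword", "select"),
        Node("aggregation", "avg|max|min", optional=True),
        Node("focus", "column_name"),
        Node("keyword", "from"),
        Node("focus", "table_name"),
    ])
    condition = Node("part", "condition", optional=True, children=[
        Node("keyword", "where"),
        Node("condition_param", "column_name"),
        Node("operator", ">|<"),
        Node("condition_param", "record_value"),
    ])
    return Node("statement", "select", children=[operation, condition])


def parameter_slots(node: Node) -> List[str]:
    """Collect, in traversal order, the table data types indicated by focus/condition parameters."""
    slots = [node.value] if node.kind in ("focus", "condition_param") else []
    for child in node.children:
        slots.extend(parameter_slots(child))
    return slots


tree = build_select_tree()
print(parameter_slots(tree))
```

The optional flag mirrors the idea that the condition part and some operation-part elements may be omitted; the slots returned by parameter_slots are the places that sampled table data would later fill.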

The following describes the above step 202, i.e. generating more than one SQL statement by using the SQL syntax tree and the sampled table data, in detail with reference to the embodiments.

In this step, expressions of various SQL statements, which are context-free expressions, can be obtained by traversing the SQL syntax tree. In the process of traversing the SQL syntax tree, for the traversed focus parameters and/or condition parameters, table data may be sampled from the database according to the table data type indicated by the syntax tree, and the traversed focus parameters and/or condition parameters are filled with the sampled table data to obtain the SQL statements. Through the combination of the traversal and the sampling, a plurality of SQL sentences can be obtained.

For example, in the process of traversing the SQL syntax tree, the expression of one SQL statement is obtained as follows: "select (column name) from (table name) where (column name) > (record value)". The focus parameters and condition parameters are shown in parentheses and are represented in the expression by the table data types they indicate. Filling the expression with column names, table names, record values and the like sampled from table data yields SQL statements such as the following:

select name from student_registry where height > 180

select name from student_registry where body_weight > 100

select product_name from product_table where price > 200

……

Instead of traversing the SQL syntax tree as described above, all templates may be sorted out in advance according to the SQL syntax tree and then filled with sampled table data, or other feasible approaches may be adopted, which are not enumerated here. Traversing the SQL syntax tree in combination with sampling table data is preferred, however, because it is more efficient.
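The filling of an expression with sampled table data can be sketched as follows. This is a minimal sketch under assumptions: the in-memory TABLES dictionary, the single fixed EXPRESSION, and the function names sample_statement and generate_statements are all hypothetical, and a real implementation would obtain the expression by traversing the syntax tree and would sample from an actual database.

```python
import random

# Hypothetical in-memory "database": table name -> list of rows (column name -> value).
TABLES = {
    "student_registry": [
        {"name": "A", "height": 175, "body_weight": 60},
        {"name": "B", "height": 182, "body_weight": 80},
    ],
    "product_table": [
        {"product_name": "P1", "price": 150},
        {"product_name": "P2", "price": 260},
    ],
}

# One expression as it might be obtained from the syntax tree; slots name table data types.
EXPRESSION = "select {focus_column} from {table} where {cond_column} > {value}"


def sample_statement(rng: random.Random) -> str:
    """Fill the expression with a sampled table name, column names and a record value."""
    table = rng.choice(sorted(TABLES))
    rows = TABLES[table]
    columns = sorted(rows[0])
    focus_column = rng.choice(columns)
    # Assumption of this sketch: only numeric columns are paired with the ">" operator.
    numeric = [c for c in columns if isinstance(rows[0][c], (int, float))]
    cond_column = rng.choice(numeric)
    value = rng.choice(rows)[cond_column]
    return EXPRESSION.format(
        focus_column=focus_column, table=table, cond_column=cond_column, value=value
    )


def generate_statements(n: int, seed: int = 0) -> list:
    """Generate n SQL statements by repeated sampling with a seeded generator."""
    rng = random.Random(seed)
    return [sample_statement(rng) for _ in range(n)]


for sql in generate_statements(3):
    print(sql)
```

Seeding the generator makes a generated data set reproducible, which is a design choice of this sketch rather than something the application specifies.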

Step 203, i.e., generating corresponding natural language texts for each SQL statement based on the pre-configured utterance templates to obtain a plurality of sample pairs, is described in detail below with reference to the embodiment.

After a large number of SQL statements are obtained, corresponding natural language texts need to be generated for each of them, because the training data required by the semantic parsing model is a large number of <SQL, Text> sample pairs. Natural language means a language that evolves naturally with culture, i.e., the form of language in which humans usually express themselves, including Chinese, English, Japanese, etc. The embodiments of the present application take Chinese as an example.

Generally, the capability of a semantic parsing model is determined to a great extent by the accuracy and diversity of its training data: accuracy determines whether the service can be effectively supported, and diversity determines whether the effect of the model can be guaranteed. In the embodiments of the present application, the accuracy of the natural language text is ensured by presetting utterance templates, and the diversity of the natural language text is ensured by dividing the utterance templates at a fine granularity.

First, the service provider is allowed to preset some utterance templates that accurately describe user requirements, which correspond to operations on the tables in the database. Specifically, to ensure diversity, the utterance templates can be set at three levels: word granularity, phrase granularity and sentence granularity.

First, at word granularity, word-granularity utterance templates corresponding to the focus parameters, condition parameters, aggregation functions, condition operators and the like in the SQL statement can be determined. For example, for a financial product, the word-granularity utterance template for the aggregation function max may be predefined as "highest". Chinese offers many synonyms, so diversity at word granularity can be achieved through synonym replacement. For example, after synonym replacement at word granularity, max can also yield the word-granularity utterance template "maximum".

Next, at phrase granularity, the determined word-granularity utterance templates are combined according to the logical relationships among the focus parameters, condition parameters, aggregation functions and condition operators in the SQL statement to obtain one or more phrase-granularity utterance templates. For example, for a financial product, combining the condition parameter "rate of return" with max yields the phrase-granularity utterance template "maximum rate of return". Synonym replacement likewise yields the phrase-granularity utterance templates "highest rate of return" and "most aggressive".

Then, at sentence granularity, the obtained phrase-granularity utterance templates are combined according to the logical relationships among the phrases in the SQL statement to obtain one or more sentence-granularity utterance templates. Take the SQL statement "select financial product where income type is 'principal-guaranteed' and max(rate of return)" as an example. The phrase-granularity utterance template corresponding to "income type is 'principal-guaranteed'" is "principal-guaranteed type"; the phrase-granularity utterance templates corresponding to "max(rate of return)" are "highest rate of return", "most aggressive" and "maximum rate of return"; and the phrase-granularity utterance templates corresponding to "select financial product" are variants of "what financial products are there". After combination, natural language texts such as the following can be obtained for the SQL statement:

"what are the principal-guaranteed financial products with the highest rate of return",

"which financial products are principal-guaranteed and have the highest rate of return",

"what are the most aggressive principal-guaranteed financial products",

"which principal-guaranteed financial products have the maximum rate of return",

……。

It can be seen that combining utterance templates of different granularities allows data to be produced at an exponential scale, which improves the diversity of the generated natural language texts and in turn improves the effect of a semantic parsing model trained on them.
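The multiplication of variants across granularities can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the WORD_TEMPLATES dictionary, its English phrasings, the two sentence patterns, and the functions phrase_templates and sentence_templates are hypothetical and are not the templates used in the application.

```python
from itertools import product

# Hypothetical word-granularity templates: SQL element -> synonymous wordings.
WORD_TEMPLATES = {
    "max": ["highest", "maximum"],
    "rate_of_return": ["rate of return"],
    "principal_guaranteed": ["principal-guaranteed"],
    "financial_product": ["financial products"],
}


def phrase_templates(*slots):
    """Combine the word-granularity templates of the given slots into phrases."""
    return [" ".join(words) for words in product(*(WORD_TEMPLATES[s] for s in slots))]


def sentence_templates():
    """Combine phrase-granularity templates into sentence-granularity templates."""
    agg_phrases = phrase_templates("max", "rate_of_return")
    cond_phrases = phrase_templates("principal_guaranteed")
    focus_phrases = phrase_templates("financial_product")
    # Hypothetical sentence patterns expressing the same logical relationship.
    patterns = [
        "what are the {cond} {focus} with the {agg}",
        "which {cond} {focus} have the {agg}",
    ]
    return [
        p.format(cond=c, focus=f, agg=a)
        for p, c, f, a in product(patterns, cond_phrases, focus_phrases, agg_phrases)
    ]


sentences = sentence_templates()
for s in sentences:
    print(s)
print(len(sentences))  # 2 patterns x 1 condition x 1 focus x 2 aggregation phrases = 4
```

Each additional synonym or pattern multiplies the number of outputs, which is why the count grows quickly as templates are added at any of the three levels.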

The natural language text corresponding to the SQL statement may then be determined from the sentence-granularity utterance templates. For example, all the determined sentence-granularity utterance templates can be used as natural language texts corresponding to the SQL statement. Alternatively, the candidate texts may be scored using other natural language evaluation methods, for example on fluency and grammatical correctness, and the sentence-granularity utterance templates whose scores meet a preset requirement are selected as the natural language texts corresponding to the SQL statement. The scoring can use existing natural language models and is not detailed here.

Thus, a large number of sample pairs consisting of SQL statements and their corresponding natural language texts can be obtained. These sample pairs are all context-free, i.e., they are not generated on the premise of specific contextual semantics. In actual scenarios, however, a user often asks questions over multiple rounds, and the rounds are usually semantically related, so a semantic parsing model for multi-turn dialogue is required. Step 204 therefore needs to be executed as well and is described in detail below with reference to the embodiment.

After a large number of context-free <SQL, Text> sample pairs are obtained through the above steps, multi-turn dialogue sample pairs may be generated for all of the sample pairs, or for only some of them. In the latter case, at least one sample pair may be sampled from the previously obtained context-free <SQL, Text> sample pairs, and multi-turn dialogue sample pairs are generated for each sampled sample pair. That is, each sampled sample pair is taken as the sample pair of the first round of dialogue, and the sample pairs of the subsequent N rounds of dialogue are generated, so that the sample pair of the first round and the sample pairs of the subsequent N rounds form a group of multi-turn dialogue sample pairs, where N is a positive integer.

Taking N = 2 as an example, assume that one of the context-free sample pairs is <SQL1, Text1>. The second-round sample pair <SQL2, Text2> may be generated from <SQL1, Text1>, and the third-round sample pair <SQL3, Text3> may then be generated from <SQL2, Text2>. The generation process is described in detail below.

As one realizable way, when one of the sampled sample pairs is taken as the sample pair of the first-round dialogue and the sample pairs of the subsequent N rounds of dialogue are determined, the flow shown in Fig. 4 may be adopted, which includes the following steps:

Step 401: Take the sample pair of the first-round dialogue as the sample pair of the current round.

For example, <SQL1, Text1> is first taken as the sample pair of the current round.

Step 402: Determine, from a preset editing strategy set, the editing strategies applicable to the sample pair of the current round.

Analysis of multi-round questions occurring in real scenarios shows that most questions are obtained by editing the question of the previous round (for example, deleting some conditions or modifying some conditions). Therefore, in the embodiments of the present application, the relationships between rounds in real scenarios may be analyzed in advance, and a number of editing strategies may be preset to form an editing strategy set. The editing strategies in this set reflect the semantic associations between questions of adjacent rounds, that is, which types of edits are made semantically.

The editing strategies may include, but are not limited to: add condition, modify condition, delete condition, modify focus, modify aggregation function, delete focus, restart, rejection, and table switch. Details are given in the next step.

Different sample pairs admit different editing strategies. For example, only an SQL statement that contains a condition can use the modify condition and delete condition strategies, and only an SQL statement that contains an aggregation function can use the modify aggregation function strategy. Therefore, the applicable editing strategies can be set in advance for the grammars of the various SQL statements (since an SQL statement and its corresponding natural language text have similar grammatical structures, the strategies can also be set for the grammar of the natural language text). In this step, the editing strategies applicable to the current-round sample pair are determined first.
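The applicability check in this step might look like the sketch below. The `SqlQuery` structure and the strategy names are illustrative assumptions, and the rule that delete focus needs at least two focuses is inferred from the requirement that a focus must remain after deletion.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SqlQuery:
    # Each focus is (aggregation function or None, focus parameter).
    focuses: List[Tuple[Optional[str], str]]
    # Each condition is (column name, operator, record value).
    conditions: List[Tuple[str, str, str]] = field(default_factory=list)


def applicable_strategies(query: SqlQuery) -> List[str]:
    """Return the editing strategies that the structure of `query` allows."""
    strategies = ["add_condition", "modify_focus", "restart",
                  "rejection", "table_switch"]
    if query.conditions:
        # Only statements with a condition can modify or delete one.
        strategies += ["modify_condition", "delete_condition"]
    if any(agg is not None for agg, _ in query.focuses):
        strategies.append("modify_aggregation")
    if len(query.focuses) > 1:
        # Assumed rule: at least one focus must remain after deletion.
        strategies.append("delete_focus")
    return strategies


with_condition = SqlQuery(focuses=[(None, "financial products")],
                          conditions=[("profitability", ">", "5%")])
without_condition = SqlQuery(focuses=[("count", "financial products")])
```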

Step 403: Sample one editing strategy from the determined editing strategies, and use the sampled editing strategy to edit the SQL statement and the natural language text in the current-round sample pair, obtaining an edited SQL statement and an edited natural language text as the sample pair of the next round of dialogue.

Assume that the editing strategies determined to be applicable to <SQL1, Text1> are editing strategy 1, editing strategy 2, and editing strategy 3. One editing strategy is sampled from these; for example, editing strategy 1 is sampled at random. SQL1 and Text1 are then edited with editing strategy 1 to obtain SQL2 and Text2, and <SQL2, Text2> is used as the second-round sample pair.

The following examples are given for each editing strategy:

Example 1: SQL1 is "select financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%". If the sampled editing strategy is add condition, SQL2 may be "select financial products where profitability > 5% and yield type is guaranteed", and Text2 is "which financial products have a profitability greater than 5% and a guaranteed yield type". The condition added to the SQL statement must be consistent with the condition added to the natural language text. A condition can be selected at random from the SQL syntax tree and then filled with a column name and a record value sampled from the table data, in the same way as in the SQL generation process described earlier. The phrase-granularity utterance template corresponding to the added condition is appended to the natural language text to obtain the edited natural language text.
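A minimal sketch of the add condition edit is given below. It assumes a simplified representation in which a query is a focus plus a list of (column, operator, value) conditions; the table contents, the fixed "=" operator for the new condition, and the operator phrasings are all hypothetical. Rendering the SQL statement and the text from the same condition list is what keeps the two sides consistent.

```python
import random
from typing import Dict, List, Tuple

Condition = Tuple[str, str, str]  # (column name, operator, record value)

# Hypothetical word-level phrasings for condition operators.
OPERATOR_WORDS = {">": "greater than", "<": "less than", "=": "equal to"}


def add_condition(conditions: List[Condition],
                  table: Dict[str, List[str]],
                  rng: random.Random) -> List[Condition]:
    """Append a condition built from a sampled unused column and record value."""
    used = {column for column, _, _ in conditions}
    unused = sorted(column for column in table if column not in used)
    if not unused:
        raise ValueError("no unused column left to build a new condition")
    column = rng.choice(unused)
    value = rng.choice(table[column])
    return conditions + [(column, "=", value)]


def render_pair(focus: str, conditions: List[Condition]) -> Tuple[str, str]:
    """Render SQL and text from the same (non-empty) condition list."""
    sql_where = " and ".join(f"{c} {op} {v}" for c, op, v in conditions)
    text_where = " and ".join(
        f"{c} {OPERATOR_WORDS[op]} {v}" for c, op, v in conditions)
    return (f"select {focus} where {sql_where}",
            f"which {focus} have {text_where}")


table = {"profitability": ["5%", "8%"], "yield type": ["guaranteed", "floating"]}
turn1 = [("profitability", ">", "5%")]
turn2 = add_condition(turn1, table, random.Random(0))
sql2, text2 = render_pair("financial products", turn2)
```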

Example 2: SQL1 is "select financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%". If the sampled editing strategy is modify condition, the condition can be modified by sampling a new column name and a new record value, giving SQL2 as "select financial products where withdrawal rate > 20%". In the natural language text, the phrase-granularity utterance template of the original condition is replaced by that of the modified condition, so that Text2 is "which financial products have a withdrawal rate greater than 20%".

Example 3: SQL1 is "select financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%". If the sampled editing strategy is delete condition, SQL2 may be "select financial products", and Text2 is "which financial products are there".
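The modify condition and delete condition edits of Examples 2 and 3 can be sketched on a list of (column, operator, value) conditions. Keeping the original operator when a condition is modified is an assumption made here for simplicity, and the table contents are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Condition = Tuple[str, str, str]  # (column name, operator, record value)


def modify_condition(conditions: List[Condition], index: int,
                     table: Dict[str, List[str]],
                     rng: random.Random) -> List[Condition]:
    """Replace one condition with a newly sampled column name and record value."""
    _, operator, _ = conditions[index]
    column = rng.choice(sorted(table))
    value = rng.choice(table[column])
    edited = list(conditions)  # copy so the previous turn is left untouched
    edited[index] = (column, operator, value)
    return edited


def delete_condition(conditions: List[Condition], index: int) -> List[Condition]:
    """Drop one condition; the remaining conditions keep their order."""
    return conditions[:index] + conditions[index + 1:]


turn1 = [("profitability", ">", "5%")]
modified = modify_condition(turn1, 0, {"withdrawal rate": ["20%"]}, random.Random(0))
deleted = delete_condition(turn1, 0)
```

The natural language text would be re-rendered from the edited condition list, so that the phrase for the old condition is replaced or removed in step with the SQL statement.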

Example 4: SQL1 is "select financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%". If the sampled editing strategy is modify focus, the focus parameter can be modified using a new table name or column name; for example, SQL2 may be "select stock where profitability > 5%". In the natural language text, the focus is replaced using the phrase-granularity utterance template corresponding to the new focus parameter, so that Text2 is "which stocks have a profitability greater than 5%".

Example 5: SQL1 is "select count financial products where profitability > 5%", and Text1 is "how many financial products have a profitability greater than 5%". If the sampled editing strategy is modify aggregation function, the aggregation function in SQL1 can be replaced with a new aggregation function, giving SQL2 as "select top5 financial products where profitability > 5%". In the natural language text, the aggregation wording is replaced using the word-granularity utterance template corresponding to the new aggregation function, so that Text2 is "the top 5 financial products with a profitability greater than 5%".
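As a rough sketch of this edit, the aggregation function can be swapped in the SQL statement while the matching word-granularity wording is swapped in the text. The `AGG_WORDS` mapping and the rendering format are illustrative assumptions, not templates taken from the application.

```python
import random
from typing import Tuple

# Hypothetical word-granularity wordings for aggregation functions.
AGG_WORDS = {"count": "how many", "top5": "the top 5", "max": "the highest"}


def modify_aggregation(current: str, rng: random.Random) -> str:
    """Sample a different aggregation function than the current one."""
    return rng.choice(sorted(a for a in AGG_WORDS if a != current))


def render(agg: str, focus: str, cond_sql: str, cond_text: str) -> Tuple[str, str]:
    """Render the SQL statement and its text from the same aggregation choice."""
    sql = f"select {agg} {focus} where {cond_sql}"
    text = f"{AGG_WORDS[agg]} {focus} with {cond_text}"
    return sql, text


new_agg = modify_aggregation("count", random.Random(0))
sql2, text2 = render("top5", "financial products",
                     "profitability > 5%", "a profitability greater than 5%")
```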

Example 6: SQL1 is "select financial products, count financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%, and how many are there". If the sampled editing strategy is delete focus, at least one focus can be deleted from SQL1, provided that at least one focus still remains in the new SQL2. For example, SQL2 is "select financial products where profitability > 5%", and Text2 is "which financial products have a profitability greater than 5%".
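The constraint that a focus must remain after deletion can be made explicit in code. In the sketch below each focus is an (aggregation function or None, focus parameter) pair, which is an assumed representation, and exactly one focus is deleted per edit.

```python
import random
from typing import List, Optional, Tuple

Focus = Tuple[Optional[str], str]  # (aggregation function or None, focus parameter)


def delete_focus(focuses: List[Focus], rng: random.Random) -> List[Focus]:
    """Delete one sampled focus, refusing if no focus would remain."""
    if len(focuses) < 2:
        raise ValueError("delete focus needs at least two focuses so that one remains")
    index = rng.randrange(len(focuses))
    return focuses[:index] + focuses[index + 1:]


turn1 = [(None, "financial products"), ("count", "financial products")]
turn2 = delete_focus(turn1, random.Random(0))
```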

Example 7: SQL1 is "select financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%". If the sampled editing strategy is restart, a new SQL2 is generated, for example "select count stock where release time > 3 years", with Text2 "how many stocks have a release time greater than 3 years". A restart is equivalent to generating a new SQL statement and natural language text: the SQL statement may first be generated using steps 201-202 shown in Fig. 2, and the corresponding natural language text may then be generated for the new SQL statement using step 203 shown in Fig. 2. In this case, however, the queried table is not replaced.

Example 8: SQL1 is "select financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%". If the sampled editing strategy is rejection, no new SQL2 and Text2 are generated; that is, the sample pair of the next round is empty, or the next round is still <SQL1, Text1>.

Example 9: SQL1 is "select financial products where profitability > 5%", and Text1 is "which financial products have a profitability greater than 5%". If the sampled editing strategy is table switch, a new SQL2 is generated, for example "select song name where sings XX and works YY", with Text2 "which songs sung by XX were made by YY". This case is similar to the restart in Example 7, except that here the queried table must be replaced.
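The only difference between restart and table switch is which table the regenerated sample pair is drawn from. A small sketch of that choice follows; the function and table names are hypothetical, and the regeneration itself (steps 201-203) is not shown.

```python
import random
from typing import List


def pick_table(tables: List[str], current: str, strategy: str,
               rng: random.Random) -> str:
    """Choose the table to query when a new sample pair is generated from scratch."""
    if strategy == "restart":
        return current  # restart keeps the queried table
    if strategy == "table_switch":
        others = [name for name in tables if name != current]
        if not others:
            raise ValueError("table switch needs at least one other table")
        return rng.choice(others)  # table switch must change the table
    raise ValueError(f"unsupported strategy: {strategy}")


tables = ["financial products", "stocks", "songs"]
same = pick_table(tables, "financial products", "restart", random.Random(0))
other = pick_table(tables, "financial products", "table_switch", random.Random(0))
```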

It should be noted that the editing strategies above are only examples listed in this embodiment; the editing strategies can be extended, provided that they remain within the scope of the present application.

Step 404: Determine whether the sample pairs of the N rounds following the first-round dialogue have been obtained. If not, execute step 405; if so, execute step 406.

Step 405: Take the edited sample pair of the next round of dialogue as the sample pair of the current round, and return to step 402.

Step 406: Form a group of multi-turn dialogue sample pairs from the sample pair of the first-round dialogue and the sample pairs of the subsequent N rounds of dialogue, and end the process.
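The flow of steps 401 to 406 can be sketched as a simple loop. The `applicable` callback and the `editors` mapping are placeholders for the strategy set; the two toy editors below use deliberately simplified texts and are assumptions for illustration only.

```python
import random
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (SQL statement, natural language text)
Editor = Callable[[Pair, random.Random], Pair]


def generate_multi_turn(first: Pair, n: int,
                        applicable: Callable[[Pair], List[str]],
                        editors: Dict[str, Editor],
                        rng: random.Random) -> List[Pair]:
    current = first                            # step 401
    dialogue = [first]
    while len(dialogue) < n + 1:               # step 404
        strategies = applicable(current)       # step 402
        name = rng.choice(strategies)          # step 403: sample a strategy
        current = editors[name](current, rng)  # step 403: edit; step 405
        dialogue.append(current)
    return dialogue                            # step 406


def applicable(pair: Pair) -> List[str]:
    sql, _ = pair
    return ["delete_condition", "restart"] if " where " in sql else ["restart"]


def delete_condition(pair: Pair, rng: random.Random) -> Pair:
    sql, text = pair
    return sql.split(" where ")[0], text.split(" with ")[0]


def restart(pair: Pair, rng: random.Random) -> Pair:
    return ("select stock where release time > 3 years",
            "stocks with release time greater than 3 years")


first = ("select financial products where profitability > 5%",
         "financial products with profitability greater than 5%")
editors = {"delete_condition": delete_condition, "restart": restart}
dialogue = generate_multi_turn(first, 2, applicable, editors, random.Random(0))
```

With N = 2 the returned list holds three sample pairs: the first-round pair followed by the two edited pairs.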

Assume that N is 2. SQL1 and Text1 are taken as the current round and are each edited based on a sampled editing strategy to obtain SQL2 and Text2. Then SQL2 and Text2 are taken as the current round and are each edited based on a sampled editing strategy to obtain SQL3 and Text3. This results in a group of multi-turn dialogue sample pairs:

<SQL1, Text1>, <SQL2, Text2> and <SQL3, Text3>.

In addition to the context-free sample pairs automatically generated in this embodiment, the automatically generated multi-turn dialogue sample pairs can also be used as training data to train the semantic analysis model. This helps the semantic analysis model understand the semantic relations among questions across multiple rounds and strengthens its understanding capability.

It should be noted that the large amount of automatically generated sample pair data (including context-free sample pairs or multi-turn dialogue sample pairs) may also be used as training data for a pre-trained language model; after the pre-trained language model is obtained through training, a semantic analysis model is obtained by further fine-tuning on the basis of the pre-trained language model. The present application is not limited in this regard.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

According to an embodiment of another aspect, a data generation apparatus is provided. Fig. 5 shows a schematic block diagram of the data generation apparatus according to an embodiment; the apparatus is deployed on the server side in the architecture shown in Fig. 1. The apparatus may be an application located on the server side, a functional unit such as a plug-in or Software Development Kit (SDK) within such an application, or may be located on a computer terminal with strong computing capability, which is not particularly limited in the embodiments of the present application. As shown in Fig. 5, the apparatus 500 includes a syntax tree obtaining unit 501, an SQL statement generating unit 502, and a text generating unit 503, and may further include a multi-round sample generation unit 504. The main functions of each unit are as follows:

The syntax tree obtaining unit 501 is configured to obtain an SQL syntax tree, wherein the SQL syntax tree is generated in advance using SQL syntax rules.

The SQL statement generating unit 502 is configured to generate one or more SQL statements by using an SQL syntax tree in combination with the sampled table data.

The text generating unit 503 is configured to generate a corresponding natural language text for each SQL statement based on pre-configured utterance templates, so as to obtain a plurality of sample pairs, each sample pair being composed of an SQL statement and its corresponding natural language text.

The SQL syntax tree comprises an operation part node and a condition part node. The operation part node comprises an operation keyword, a focus parameter, and an aggregation function; the condition part node comprises a condition keyword, condition parameters, and a condition operator. The focus parameter and/or the condition parameters are marked with the corresponding table data type.
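One possible in-memory shape for such a syntax tree is sketched below. The class names, field names, and data type labels are assumptions chosen for illustration; the application does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    """A parameter placeholder marked with the table data type that may fill it."""
    data_type: str  # e.g. "table_name", "column_name", "record_value"


@dataclass
class OperationNode:
    keyword: str                       # operation keyword, e.g. "select"
    focus: Slot                        # focus parameter
    aggregation: Optional[str] = None  # aggregation function, e.g. "count"


@dataclass
class ConditionNode:
    keyword: str   # condition keyword, e.g. "where"
    column: Slot   # condition parameter filled with a column name
    operator: str  # condition operator, e.g. ">"
    value: Slot    # condition parameter filled with a record value


@dataclass
class SqlSyntaxTree:
    operation: OperationNode
    conditions: List[ConditionNode] = field(default_factory=list)


tree = SqlSyntaxTree(
    operation=OperationNode(keyword="select", focus=Slot("column_name"),
                            aggregation="count"),
    conditions=[ConditionNode(keyword="where", column=Slot("column_name"),
                              operator=">", value=Slot("record_value"))],
)
```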

The SQL statement generating unit 502 may be specifically configured to traverse the SQL syntax tree and fill the traversed focus parameter and/or condition parameters with the sampled table data to obtain one or more SQL statements.

Specifically, the SQL statement generating unit 502 may fill the focus parameter with a sampled table name and/or column name, and/or fill the condition parameters with a sampled column name and record value.
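The filling step can be illustrated with a flattened traversal in which placeholders are replaced by sampled table data. The placeholder tokens (`<focus>`, `<column>`, `<value>`) and the dictionary-based table are assumptions for this sketch; a real implementation would walk the syntax tree nodes instead of a token list.

```python
import random
from typing import Dict, List


def fill_statement(tokens: List[str], table_name: str,
                   table: Dict[str, List[str]], rng: random.Random) -> str:
    """Traverse the tokens and fill each placeholder with sampled table data."""
    filled = []
    column = None
    for token in tokens:
        if token == "<focus>":
            filled.append(table_name)  # focus filled with the sampled table name
        elif token == "<column>":
            column = rng.choice(sorted(table))  # sampled column name
            filled.append(column)
        elif token == "<value>":
            if column is None:
                raise ValueError("a <value> slot must follow its <column> slot")
            filled.append(rng.choice(table[column]))  # record value of that column
        else:
            filled.append(token)  # keywords and operators are kept as-is
    return " ".join(filled)


tokens = ["select", "<focus>", "where", "<column>", ">", "<value>"]
table = {"profitability": ["5%"]}
sql = fill_statement(tokens, "financial products", table, random.Random(0))
```

Sampling the value from the same column that was just chosen keeps the generated condition meaningful for the table.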

As a preferred embodiment, the text generating unit 503 may be specifically configured to: determine the word-granularity utterance templates corresponding to the focus parameter, condition parameters, aggregation function, and condition operator in the SQL statement; combine the determined word-granularity utterance templates according to the logical relationship among the focus parameter, condition parameters, aggregation function, and condition operator in the SQL statement to obtain one or more phrase-granularity utterance templates; combine the one or more phrase-granularity utterance templates according to the logical relationship between the phrases in the SQL statement to obtain one or more sentence-granularity utterance templates; and determine the natural language text corresponding to the SQL statement from the one or more sentence-granularity utterance templates.
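The word, phrase, and sentence levels of this composition can be sketched as nested combinations. The `WORD_TEMPLATES` entries and the fixed sentence frame are hypothetical examples; the actual templates are pre-configured and are not given in this section.

```python
from itertools import product
from typing import List

# Hypothetical word-granularity templates: each SQL element has several wordings.
WORD_TEMPLATES = {
    "financial products": ["financial products", "wealth management products"],
    "profitability": ["profitability", "rate of return"],
    ">": ["greater than", "above"],
}


def phrase_templates(column: str, operator: str, value: str) -> List[str]:
    """Combine word-level wordings of one condition into phrase-level templates."""
    return [f"{c} {o} {value}"
            for c, o in product(WORD_TEMPLATES[column], WORD_TEMPLATES[operator])]


def sentence_templates(focus: str, condition_phrases: List[str]) -> List[str]:
    """Combine focus wordings with condition phrases into sentence-level templates."""
    return [f"which {f} have a {p}"
            for f, p in product(WORD_TEMPLATES[focus], condition_phrases)]


phrases = phrase_templates("profitability", ">", "5%")
sentences = sentence_templates("financial products", phrases)
```

Because every level multiplies the available wordings, a single SQL statement yields several candidate texts, from which the final natural language text is then chosen.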

The multi-round sample generation unit 504 is configured to: sample at least one sample pair from the obtained plurality of sample pairs; take each sampled sample pair as the sample pair of the first-round dialogue and generate sample pairs of the subsequent N rounds of dialogue, where N is a positive integer; and form a group of multi-turn dialogue sample pairs from the sample pair of the first-round dialogue and the sample pairs of the subsequent N rounds of dialogue.

The multi-round sample generation unit 504 may specifically perform the following operations when generating sample pairs for subsequent N-round conversations based on the sample pair for the first-round conversation:

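taking the sample pair of the first-round dialogue as the sample pair of the current round; determining, from a preset editing strategy set, the editing strategies applicable to the sample pair of the current round; sampling one editing strategy from the determined editing strategies, and editing the SQL statement and the natural language text in the current-round sample pair with the sampled editing strategy to obtain an edited SQL statement and an edited natural language text as the sample pair of the next round of dialogue; and taking the edited sample pair of the next round of dialogue as the sample pair of the current round, and returning to the operation of determining, from the preset editing strategy set, the editing strategies applicable to the sample pair of the current round, until the sample pairs of the N rounds of dialogue following the first-round dialogue are obtained.

The editing strategy set comprises at least one of the following editing strategies: add condition, modify condition, delete condition, modify focus, modify aggregation function, delete focus, restart, rejection, and table switch.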

The embodiments in the present specification are described in a progressive manner; for the parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for relevant details, refer to the corresponding parts of the method embodiments. The apparatus and system embodiments described above are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

It should be noted that user data may be used in the embodiments of the present application. In practical applications, user-specific personal data may be used in the scheme described herein only within the scope permitted by applicable laws and regulations and in compliance with the requirements of the applicable laws and regulations of the country concerned (for example, with the user's explicit consent and after the user has been informed).

In addition, the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method described in any of the preceding method embodiments.

The present application also provides an electronic device, comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.

Fig. 6 illustrates an architecture of an electronic device, which may specifically include a processor 610, a video display adapter 611, a disk drive 612, an input/output interface 613, a network interface 614, and a memory 620. The processor 610, the video display adapter 611, the disk drive 612, the input/output interface 613, the network interface 614, and the memory 620 may be communicatively connected by a communication bus 630.

The processor 610 may be implemented by a general-purpose CPU, a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the present application.

The memory 620 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 620 may store an operating system 621 for controlling the operation of the electronic device 600 and a Basic Input Output System (BIOS) 622 for controlling low-level operations of the electronic device 600. In addition, a web browser 623, a data storage management system 624, a data generation device 625, and the like may also be stored. The data generation device 625 may be an application program that implements the operations of the foregoing steps in the embodiments of the present application. In summary, when the technical solution provided in the present application is implemented by software or firmware, the relevant program code is stored in the memory 620 and called for execution by the processor 610.

The input/output interface 613 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in the device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The network interface 614 is used for connecting a communication module (not shown in the figure) to realize communication between the device and other devices. The communication module can communicate in a wired manner (such as USB or network cable) or in a wireless manner (such as mobile network, Wi-Fi, or Bluetooth).

Bus 630 includes a path that transfers information between the various components of the device, such as processor 610, video display adapter 611, disk drive 612, input/output interface 613, network interface 614, and memory 620.

It should be noted that although the above devices only show the processor 610, the video display adapter 611, the disk drive 612, the input/output interface 613, the network interface 614, the memory 620, the bus 630, etc., in a specific implementation, the device may also include other components necessary for normal operation. In addition, it will be understood by those skilled in the art that the above-described apparatus may also include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in the figures.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The data generation method, apparatus, device and computer-readable storage medium provided by the present application are described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims

1. A method of data generation, the method comprising:

sampling at least one sample pair from the obtained plurality of sample pairs;

2. The method of claim 1, wherein the SQL syntax tree comprises an operation part node and a condition part node;

3. The method of claim 2, wherein generating one or more SQL statements using the SQL syntax tree in conjunction with sampled tabular data comprises:

4. The method of claim 1, wherein the generating, based on the preconfigured conversational template, a corresponding natural language text for each SQL statement, respectively, comprises:

5. The method of claim 1, wherein generating sample pairs for subsequent N-rounds of conversations based on the sample pairs for the first round of conversations comprises:

6. The method of claim 5, wherein the set of editing policies comprises at least one of the following editing policies:

7. A method of data generation, the method comprising:

8. The method of claim 7, wherein the SQL syntax tree comprises an operation part node and a condition part node;

9. The method of claim 8, wherein generating one or more SQL statements using the SQL syntax tree in conjunction with sampled tabular data comprises:

10. The method of claim 7, wherein the generating, based on the preconfigured conversational templates, a corresponding natural language text for each SQL statement, respectively, comprises:

11. A method of acquiring training data, the method comprising:

obtaining a sample pair as training data, wherein the sample pair comprises a multi-turn dialog sample pair generated by the method of any one of claims 1 to 6, or a sample pair generated by the method of any one of claims 7 to 10;

12. A data generation apparatus, the apparatus comprising:

the SQL statement generating unit is configured to generate one or more SQL statements by using the SQL syntax tree in combination with the sampled table data;

13. A data generation apparatus, the apparatus comprising:

14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 11.

15. An electronic device, comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 11.