CN114780582A

CN114780582A - Natural answer generation system and method based on table question answering

Info

Publication number: CN114780582A
Application number: CN202210505859.2A
Authority: CN
Inventors: 奚雪峰; 李智; 崔志明; 左严
Original assignee: Jiangsu New Hope Technology Co ltd; Suzhou University of Science and Technology
Current assignee: Jiangsu New Hope Technology Co ltd; Suzhou University of Science and Technology
Priority date: 2022-05-10
Filing date: 2022-05-10
Publication date: 2022-07-22

Abstract

The invention relates to a system and a method for generating natural answers based on table question answering. A data preprocessing module performs Chinese word segmentation and regular-expression disambiguation on a question text posed by a user; a text encoding module converts the question text and the knowledge base entities into data types and computing units that a computer can process; a structured query statement generation module generates the structured query statement corresponding to the question from the text encoding; and a natural answer generation module generates a natural answer from the generated structured query statement, comprising subject generation, predicate generation and object generation. Regular expressions and Chinese word segmentation are used to semantically complete and refine the question text and the knowledge base entities; a pre-trained model is used to produce semantic representations of the question text and the knowledge base entities; and slot filling converts the semantic representations into a complete structured query statement, from which a natural answer is generated.

Description

Natural answer generation system and method based on table question answering

Technical Field

The invention relates to a natural answer generation system and a natural answer generation method based on table question answering.

Background

At present, artificial intelligence technology is developing rapidly, and existing technology needs to be applied to problems that arise in everyday life so that people genuinely benefit from it. Table question answering arose against a background in which table knowledge bases are proliferating and information retrieval and integration must be achieved through question-answering technology. A table knowledge base contains a large amount of knowledge from different fields, the efficiency of retrieving from it can be improved, and the result of executing a structured query statement on the knowledge base serves as the input to the question-answering result; natural answer generation based on table question answering is therefore an important extension of table question answering. Whether a question-answering result is accepted by the user is judged in two ways. The first is machine evaluation, which follows the elements of the generated natural answer: the three elements subject, predicate and object are set, and the evaluation comprises subject generation accuracy, predicate generation accuracy and object generation accuracy. The second is manual evaluation, which assesses the generated answer in terms of fluency, consistency and diversity: fluency indicates whether the answer sentence reads smoothly and is easily readable, and consistency indicates whether the answer sentence matches the intent of the question text. The focus of the task is the generation of a structured query statement oriented to the question text.
Compared with English, encoding Chinese question text is more complex and more difficult owing to factors such as word segmentation; moreover, current table question answering is carried out on general-purpose corpora in English or Chinese, and has seen essentially no application to natural answer generation.

The main technical approaches to natural answer generation based on table question answering are as follows. The slot filling method relies mainly on linguistic experts to construct text templates by hand, selects slot patterns such as subject, predicate and object, combines them with text classification, and separately uses the execution result of the SQL in the database to fill the slots.

The method based on the execution result of a structured query statement uses the result of executing the generated complete structured query statement in the database as the natural answer, i.e. an NL2SQL task. It requires no extensive database knowledge and can be completed in a short time, and it is the earliest method used for natural answer generation based on table question answering.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provides a natural answer generation system and a natural answer generation method based on table question answering.

The purpose of the invention is realized by the following technical scheme:

a natural answer generation system based on table question answering, characterized in that: the system comprises a data preprocessing module, a text encoding module, a structured query statement generation module and a natural answer generation module; the data preprocessing module is used for performing Chinese word segmentation and regular-expression disambiguation on a question text posed by a user;

the text encoding module is used for converting the question text posed by the user and the knowledge base entities into data types and computing units that a computer can process;

the structured query statement generation module generates the structured query statement corresponding to the question according to the text encoding, and comprises a SELECT clause generation function and a WHERE clause generation function;

and the natural answer generation module is used for generating a natural answer from the generated structured query statement, comprising subject generation, predicate generation and object generation.

Further, in the natural answer generation system based on table question answering, the data preprocessing module comprises a regular expression processing module and a Chinese word segmentation module; the regular expression processing module operates on the question text and uses regular expressions to convert fuzzy numeric entities into entities that conform to the storage form of the knowledge base, the fuzzy numeric entities comprising fuzzy year entities, fuzzy percentage entities and fuzzy number entities; the Chinese word segmentation module operates on the Chinese question text posed by the user and uses a word segmentation tool to split the text into a sequence of words.

Further, in the natural answer generation system based on table question answering, the text encoding module gives a mathematical representation of the question text data processed by the data preprocessing module and of the knowledge base entities, maps the high-order vectors into a high-dimensional Euclidean space, uses one-hot encoding to encode the text into a coding sequence that can be understood and processed, and obtains the corresponding text semantic features through a pre-trained model.

Further, in the natural answer generation system based on table question answering, the text encoding module comprises a question text encoding module and a knowledge base entity encoding module; the question text encoding module encodes the vector representation of the question text, and the knowledge base entity encoding module encodes the vector representation of the knowledge base entities.

Further, in the natural answer generation system based on table question answering, the structured query statement generation module comprises a SELECT clause generation module and a WHERE clause generation module; the SELECT clause generation module uses the pre-trained model to generate word vectors of the question text and the knowledge base entities, and then generates the SELECT clause through the multi-class classifier Softmax, the SELECT clause comprising column name selection and aggregation function selection; the WHERE clause generation module likewise uses the pre-trained model to generate word vectors of the question text and the knowledge base entities, and then generates the WHERE clause through the multi-class classifier Softmax, the WHERE clause comprising column name selection, operator selection, value extraction and selection of the connector between conditions.

Further, in the natural answer generation system based on table question answering, the natural answer generation module splices the clauses produced by the SELECT clause generation module and the WHERE clause generation module of the structured query statement generation module into a complete structured query statement; the natural answer adopts a subject-predicate-object architecture, and the module comprises a subject generation module, a predicate generation module and an object generation module; the subject is generated from the result of the SELECT clause; the predicate is generated by setting the predicate labels { "yes", "there" } and selecting a label by classification based on the question text; and the object part takes the execution result of the complete structured query statement in the database.

The invention relates to a natural answer generation method based on table question answering, which comprises the following steps:

firstly, data preprocessing is performed on the input of the question answering, namely the question text and the entities in the knowledge base, the question text being disambiguated and word-segmented;

secondly, the preprocessed question text data and the knowledge base entities are one-hot encoded and then fed into a pre-trained model to generate the corresponding word vectors;

thirdly, the feature vectors are input into the structured query statement generation module, and a SELECT clause and a WHERE clause are generated and combined into a complete SQL statement; structured query statement generation comprises six subtasks, each of which is a multi-class classification model; the probability of each label category is determined from the semantic representations of the question text and the knowledge base entities, and the category with the highest probability is the final output;

and finally, a subject-predicate-object generation framework is designed according to the requirements of natural answer generation, the result of executing the complete SQL statement in the database is taken as the object generation result, and the parts are spliced into the final natural answer.

Furthermore, in the natural answer generation method based on table question answering, the data preprocessing module processes the fuzzy entities of the question text and performs Chinese word segmentation on the question text; the text encoding module produces a semantic representation of the question text-knowledge base entity sequence; the structured query statement generation module takes the word vectors as input and generates a structured query statement through the SELECT clause generation module and the WHERE clause generation module; and the natural answer generation module generates a natural answer based on the structured query statement as the final result of the question answering.

Furthermore, in the natural answer generation method based on table question answering, the data preprocessing module performs the disambiguation and Chinese word segmentation tasks on the question text: the regular expression processing module removes the fuzzy numeric entities of the question text by means of regular expressions, and the Chinese word segmentation module segments the question text into a sequence of words using the jieba word segmentation tool;

the text encoding module encodes the question text and the knowledge base entity into a semantic vector which can be understood and processed;

the SELECT clause generation module of the structured query statement generation module generates the column name and the aggregation function of the SELECT clause according to the input semantic vectors, wherein the column name is obtained from the semantic representation of the knowledge base entities and the aggregation functions comprise { AVG, MIN, MAX, COUNT, SUM }; the WHERE clause generation module of the structured query statement generation module generates the column names, operators, value texts and the connector between conditions of the WHERE clause according to the input semantic vectors, wherein the operator set is { >, <, =, != } and the connectors comprise { and, or };

the SELECT clause generating module and the WHERE clause generating module adopt a slot filling method, and the structured query statement is generated by filling slots;

six slots are set according to the characteristics of the structured query statement, namely the SEL_COL task, the SEL_AGG task, the W_CONN task, the W_COL task, the W_OP task and the W_VAL task;

to generate the semantic representation, the text encoding module adopts a BERT model as the pre-trained model, whose input vector comprises word embedding vectors, segment embedding vectors and position embedding vectors; the natural language question text and the knowledge base entities are input into the pre-trained model in the form the model requires;
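For illustration, the way the three embedding vectors form one input vector per token can be sketched as below. In the published BERT architecture the token, segment and position embeddings are added element-wise; the tiny two-dimensional lookup tables and the function names `add_vectors` and `embed_inputs` are assumptions made for this sketch and are not taken from the invention.

```python
from typing import Dict, List, Sequence


def add_vectors(*vectors: Sequence[float]) -> List[float]:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def embed_inputs(
    token_ids: Sequence[int],
    segment_ids: Sequence[int],
    token_table: Dict[int, List[float]],
    segment_table: Dict[int, List[float]],
    position_table: List[List[float]],
) -> List[List[float]]:
    """Build one input vector per token as token + segment + position embedding.

    The tables are hypothetical stand-ins for learned embedding matrices.
    """
    embedded = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        embedded.append(
            add_vectors(
                token_table[token_id],       # word (token) embedding
                segment_table[segment_id],   # segment embedding: question vs. column names
                position_table[position],    # position embedding
            )
        )
    return embedded
```

In such a sketch the question tokens would carry one segment id and the column name tokens another, so that the model can tell the two parts of the spliced input apart.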

CLS denotes the starting position of the text tokens, and SEP denotes a separator position between text tokens, i.e. the end of a segment; the input of the module comprises two parts: the first is the question text tokens, Q = (C₁, C₂, ..., C_n); the second is the text tokens from the knowledge base, T = (h₁, h₂, ..., h_m), which are the column names of the knowledge base; finally the inputs are fused, i.e. the question text is spliced with the set of knowledge base column names, and the input formula of the data preprocessing module is as follows:

Input = concatenation(CLS, Question, SEP, Column1, SEP, ...) (1)
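The splicing in formula (1) can be sketched as follows. This is a minimal illustration under stated assumptions: the literal marker strings `[CLS]` and `[SEP]` follow the usual BERT convention, each column name is kept as a single token for readability, and the function name `build_input` is hypothetical.

```python
from typing import List, Sequence

CLS = "[CLS]"  # marks the start of the spliced input
SEP = "[SEP]"  # closes the question and each column name


def build_input(question_tokens: Sequence[str], column_names: Sequence[str]) -> List[str]:
    """Splice question tokens and column names as CLS, Question, SEP, Column1, SEP, ..."""
    tokens = [CLS, *question_tokens, SEP]
    for name in column_names:
        tokens.extend([name, SEP])
    return tokens
```

For a question segmented into the words "market value", "greater than", "three billion" and the column names "stock code" and "stock abbreviation", the function returns the question between CLS and the first SEP, followed by each column name closed by its own SEP.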

once the input is determined, the input data is fed into the pre-training model for processing, and a separate semantic vector representation is obtained for each part, { E_CLS, E_Q, (E_h1, ..., E_hm) }; the pre-training model is expressed by the following formula:

Pretrain_Model(input)＝{E_CLS，E_Q，E_h} (2)

wherein, in the output of the pre-training model, E_CLS is a semantic representation of the entire input, E_Q is a semantic representation of the question text, and E_h is formed by concatenating the semantic representations corresponding to the individual knowledge base entities, i.e. E_h = { E_h1, E_h2, ..., E_hm };

First, based on the pre-trained model, the input question text obtains the word vector representations { E_CLS, E_Q/h, E_h/Q }; for the natural language question representation and the knowledge base representation output by the text encoding module, the structured query statement generation module divides the question-answering task into six subtasks, namely the SEL_COL task, the SEL_AGG task, the W_CONN task, the W_COL task, the W_OP task and the W_VAL task;

S-COL task: since an SQL statement is the output of the structured query statement generation module, determining which column names are selected is the most basic task, namely the SELECT-COLUMN task; this task applies a Softmax operation to the output of the pre-trained model to obtain the probability of each column, and the formula is as follows:

E_Q is the representation of the question text output by the pre-trained model, and E_h represents the corresponding column name information; the natural language question text is spliced with all column names in the knowledge base and normalized to obtain a fusion vector;
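The column selection step can be illustrated with a small sketch. The scoring function used here, a dot product between the question vector E_Q and each column vector E_hi, is an assumption made only so that the example runs; the text does not reproduce the exact fusion formula. The Softmax normalization and the choice of the most probable column follow the description. The names `softmax` and `select_column` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable Softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def select_column(
    e_q: Sequence[float], e_h: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index of the most probable column and all column probabilities.

    Assumption: each column is scored by the dot product of E_Q and E_hi.
    """
    scores = [sum(q * h for q, h in zip(e_q, column)) for column in e_h]
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, probabilities
```

The same pattern, a score per label followed by Softmax and an argmax, would apply to the other multi-class subtasks, with the label set replaced by the aggregation functions, operators or connectors.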

S-AGG task: for the columns in the SELECT clause generated by the question-answering model, selecting the corresponding aggregation symbol for each target column is an important task; the aggregation symbols include SUM, AVG, MIN, MAX and so on, and the generation formula is as follows:

the probability of each aggregation symbol is obtained by fusing the question text representation with the knowledge base information and applying a normalization operation;

W-CONN task: in WHERE clause generation, attention must be paid to the relation between the conditions of the WHERE clause; the relation set has the three elements { and, or, None }, where and denotes a conjunctive relation between conditions, or denotes a disjunctive relation between conditions, and None denotes that there is no relation because only one condition exists; the generation formula is as follows:

W-COL task: after the SELECT clause has been generated, the structured query statement generation module turns to the selection of the column names of the WHERE clause; unlike the SELECT clause generation task, the number of column names in the WHERE clause is not fixed, so data analysis is required; the generation formula is as follows:

W-OP task: once the column names in the WHERE clause have been generated, the corresponding operator is selected for each generated column name; the operators include >, <, = and so on, and the generation formula is as follows:

only one step remains to complete the SQL statement, namely the prediction of WHERE-VAL; analysis of the training data samples shows that WHERE-VAL predictions fall into two types, text value prediction and numerical value prediction, and the extraction methods for the two types differ;

in text value prediction, owing to the particular nature of the text type, the operator in the WHERE clause can only be the "=" symbol; the candidate text values corresponding to the column name are then obtained;

each candidate text value is then combined with the question text as the input of the pre-trained model, binary classification is performed on the output semantic representation to judge whether the current candidate text is the value extraction target of the WHERE clause, and the label probability formula is as follows:

for the extraction of numerical values, the operator corresponding to the column name in the WHERE clause is not limited to one type; analysis of the training samples shows that the values in the WHERE clause come not only from the values stored in the knowledge base but also from the question text, where the values have been processed by regular expressions and appear in a standardized form, so the text value extraction process cannot simply be copied;

after the column names of the WHERE clause are obtained, the WHERE clause generation module needs to pair the current column name with its candidate values, the candidate values coming from the numeric entities of the question text and from the values stored under the corresponding column name;
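The arrangement of candidate values for one WHERE column might look like the following sketch. It assumes the question text has already been normalized so that numbers appear as Arabic digits, represents the knowledge base as a plain dictionary from column name to stored values, and replaces the pre-trained binary classifier with a caller-supplied scoring function; the names `numeric_candidates`, `extract_values` and the 0.5 threshold are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Sequence


def numeric_candidates(
    question: str, column: str, table: Dict[str, Sequence[object]]
) -> List[str]:
    """Collect candidate values from the question text and from the stored column values."""
    from_question = re.findall(r"\d+(?:\.\d+)?", question)
    from_table = [str(value) for value in table.get(column, [])]
    ordered: List[str] = []
    for value in from_question + from_table:
        if value not in ordered:  # keep first occurrence, drop duplicates
            ordered.append(value)
    return ordered


def extract_values(
    question: str,
    column: str,
    table: Dict[str, Sequence[object]],
    score: Callable[[str, str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep the candidates that the binary scorer labels 1 (probability >= threshold).

    `score` stands in for the pre-trained model plus binary classification head.
    """
    candidates = numeric_candidates(question, column, table)
    return [value for value in candidates if score(question, column, value) >= threshold]
```

Passing the scorer in as a parameter keeps the candidate arrangement separate from the model, which is convenient for testing the arrangement logic without loading a pre-trained model.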

finally, following the text value extraction flow, the question text and the candidate values are paired, a semantic representation is obtained through the pre-trained model, and a binary classification judgment is made with the labels {0, 1}; the label probability formula is as follows:

the results of the six subtasks are combined, and the outputs of the SELECT clause generation module and the WHERE clause generation module are spliced to generate the SQL statement corresponding to the natural language question; the resulting structured query statement is input to the natural answer generation module to generate a natural answer;
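The splicing of the six slot results into one SQL statement can be sketched as follows. The data class `SlotPrediction`, the explicit table name field, the quoting helper and the use of `?` placeholders for values are assumptions introduced for this illustration; the description only specifies the six slots and their label sets.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

AGG_FUNCS = {"AVG", "MIN", "MAX", "COUNT", "SUM"}
OPERATORS = {">", "<", "=", "!="}
CONNECTORS = {"and", "or"}


@dataclass
class SlotPrediction:
    """Hypothetical container for the outputs of the six subtasks."""
    table: str                      # table to query (assumed to be known)
    sel_col: str                    # SEL_COL
    sel_agg: Optional[str] = None   # SEL_AGG, None when no aggregation applies
    w_conn: Optional[str] = None    # W_CONN, None when there is a single condition
    # Each condition is (W_COL, W_OP, W_VAL).
    conditions: List[Tuple[str, str, object]] = field(default_factory=list)


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def assemble_sql(pred: SlotPrediction) -> Tuple[str, List[object]]:
    """Splice the slot values into a parameterized SQL statement."""
    target = quote_ident(pred.sel_col)
    if pred.sel_agg is not None:
        if pred.sel_agg not in AGG_FUNCS:
            raise ValueError(f"unknown aggregation function: {pred.sel_agg}")
        target = f"{pred.sel_agg}({target})"
    sql = f"SELECT {target} FROM {quote_ident(pred.table)}"
    params: List[object] = []
    if pred.conditions:
        if len(pred.conditions) > 1 and pred.w_conn not in CONNECTORS:
            raise ValueError("several conditions require 'and' or 'or'")
        parts = []
        for column, operator, value in pred.conditions:
            if operator not in OPERATORS:
                raise ValueError(f"unknown operator: {operator}")
            parts.append(f"{quote_ident(column)} {operator} ?")
            params.append(value)
        joiner = f" {pred.w_conn.upper()} " if len(parts) > 1 else ""
        sql += " WHERE " + joiner.join(parts)
    return sql, params
```

Returning the values separately from the statement text is a design choice of this sketch: it lets the database driver bind the extracted values instead of splicing them into the string.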

the natural answer generation module divides natural answer generation into three slots, subject, predicate and object, to be filled: the subject generation module takes the column name selection from the SELECT clause generation module; the predicate generation module generates a predicate label by binary text classification over the preset predicate set { yes }; the result of the object generation module is the execution result of the structured query statement in the database; and a complete natural answer is produced by splicing the subject, predicate and object slots.
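The final splicing of the subject, predicate and object slots can be sketched with an in-memory SQLite database. The function name `generate_answer`, the space-separated English output, the comma-joined object and the fallback text for an empty result are assumptions for this illustration; in the described method the predicate label comes from a binary text classifier, which is represented here by a plain argument.

```python
import sqlite3
from typing import Sequence


def generate_answer(
    conn: sqlite3.Connection,
    subject: str,
    predicate: str,
    sql: str,
    params: Sequence[object] = (),
) -> str:
    """Fill the three slots: subject from SELECT, predicate label, object from execution."""
    rows = conn.execute(sql, params).fetchall()
    # The object slot is the execution result of the structured query statement.
    values = [str(row[0]) for row in rows]
    obj = ", ".join(values) if values else "no matching result"  # hypothetical fallback
    return f"{subject} {predicate} {obj}"
```

A caller would pass the column name chosen for the SELECT clause as `subject`, the classified predicate label as `predicate`, and the assembled SQL statement with its bound values as `sql` and `params`.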

Compared with the prior art, the invention has obvious advantages and beneficial effects, which are embodied in the following aspects:

firstly, regular expressions and Chinese word segmentation are used to semantically complete the question text and the knowledge base entities, which eliminates fuzzy numeric entities in the question text and enriches its semantic information;

secondly, a pre-trained model is used to produce semantic representations of the question text and the knowledge base entities, raising the dimension of the semantic representation from 128 to 768 and yielding a finer-grained semantic representation of the question text and the knowledge base entities;

thirdly, slot filling is used to convert the semantic representation into a complete structured query statement, and a natural answer is generated based on that statement; this modularizes the generation of the structured query statement and, through the subject-predicate-object slot structure, turns the answer from an isolated fragment into a complete natural sentence.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description.

Drawings

FIG. 1: schematic diagram of the architecture of the system of the invention;

FIG. 2: flow chart of the method of the invention;

FIG. 3: schematic diagram of the architecture of the data preprocessing module;

FIG. 4: schematic diagram of the architecture of the structured query statement generation module;

FIG. 5: schematic diagram of the architecture of the natural answer generation module.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, specific embodiments will now be described in detail.

The invention first uses regular expressions and Chinese word segmentation to semantically complete and refine the question text and the knowledge base entities; it then uses a pre-trained model to produce semantic representations of the question text and the knowledge base entities; finally, it uses slot filling to convert the semantic representation into a complete structured query statement and generates a natural answer based on that statement.

As shown in FIG. 1, the natural answer generation system based on table question answering comprises a data preprocessing module 1, a text encoding module 2, a structured query statement generation module 3 and a natural answer generation module 4; the data preprocessing module 1 is used for performing Chinese word segmentation and regular-expression disambiguation on a question text posed by a user;

the text encoding module 2 is used for converting the question text posed by the user and the knowledge base entities into data types and computing units that a computer can process;

the structured query statement generation module 3 is used for generating the structured query statement corresponding to the question according to the text encoding, and comprises a SELECT clause generation function and a WHERE clause generation function;

the natural answer generation module 4 generates a natural answer from the generated structured query statement, comprising subject generation, predicate generation and object generation.

As shown in FIG. 3, the data preprocessing module 1 comprises a regular expression processing module 101 and a Chinese word segmentation module 102; the regular expression processing module 101 operates on the question text and uses regular expressions to convert fuzzy numeric entities into entities that conform to the storage form of the knowledge base, the fuzzy numeric entities comprising fuzzy year entities, fuzzy percentage entities and fuzzy number entities; the Chinese word segmentation module 102 operates on the Chinese question text posed by the user and uses a word segmentation tool to split the text into a sequence of words.
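The conversion of fuzzy numeric entities could be sketched as below. This is a minimal illustration, not the rule set of the invention: the three rules (Chinese-numeral percentages, two-digit years assumed to fall in the 2000s, and multi-character Chinese numerals), the supported numeral subset, and the names `chinese_to_int` and `normalize_question` are assumptions. Word segmentation is left to a tool such as jieba and is not shown.

```python
import re

# Hypothetical tables covering a small subset of Chinese numerals.
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}
NUMERAL = "[零一二两三四五六七八九十百千万亿]+"


def chinese_to_int(text: str) -> int:
    """Convert a simple Chinese numeral such as '三十亿' to an integer."""
    total = 0    # value of completed 万/亿 sections
    section = 0  # value accumulated inside the current section
    digit = 0    # last digit seen, waiting for a unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare '十' means 10, so a missing digit counts as 1.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            total += (section + digit) * BIG_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def _replace_number(match: "re.Match[str]") -> str:
    text = match.group(0)
    # Single characters such as '一' are often ordinary words, so leave them.
    return str(chinese_to_int(text)) if len(text) > 1 else text


def normalize_question(question: str) -> str:
    """Rewrite fuzzy percentage, year and number entities into a stored-value form."""
    # Percentage rule: '百分之十' -> '10%'.
    question = re.sub(
        "百分之(" + NUMERAL + ")",
        lambda m: f"{chinese_to_int(m.group(1))}%",
        question,
    )
    # Year rule: '19年' -> '2019年' (assumes years in the 2000s).
    question = re.sub(r"(?<!\d)(\d{2})年", lambda m: f"20{m.group(1)}年", question)
    # Number rule: remaining multi-character Chinese numerals -> Arabic digits.
    return re.sub(NUMERAL, _replace_number, question)
```

The normalized string would then be passed to the word segmentation tool, so that the numeric entities reach the encoder in the same form in which the knowledge base stores them.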

The text encoding module 2 comprises a question text encoding module 201 and a knowledge base entity encoding module 202; the question text encoding module 201 encodes the vector representation of the question text, and the knowledge base entity encoding module 202 encodes the vector representation of the knowledge base entities. The text encoding module 2 gives a mathematical representation of the question text data processed by the data preprocessing module 1 and of the knowledge base entities, maps the high-order vectors into a high-dimensional Euclidean space, uses one-hot encoding to encode the text into a coding sequence that can be understood and processed, and obtains the corresponding text semantic features through a pre-trained model.

As shown in FIG. 4, the structured query statement generation module 3 comprises a SELECT clause generation module 301 and a WHERE clause generation module 302; the SELECT clause generation module 301 uses the pre-trained model to generate word vectors of the question text and the knowledge base entities (column names), and then generates the SELECT clause through the multi-class classifier Softmax, the SELECT clause comprising column name selection and aggregation function selection; the WHERE clause generation module 302 likewise uses the pre-trained model to generate word vectors of the question text and the knowledge base entities, and then generates the WHERE clause through the multi-class classifier Softmax, the WHERE clause comprising column name selection, operator selection, value extraction and selection of the connector between conditions.

As shown in FIG. 5, the natural answer generation module 4 splices the clauses produced by the SELECT clause generation module 301 and the WHERE clause generation module 302 of the structured query statement generation module 3 into a complete structured query statement; the natural answer adopts a subject-predicate-object architecture, and the module comprises a subject generation module 401, a predicate generation module 402 and an object generation module 403; the subject is generated from the result of the SELECT clause; the predicate is generated by setting the predicate labels { "yes", "there" } and selecting a label by classification based on the question text; and the object part takes the execution result of the complete structured query statement in the database.

The natural answer generation method based on the table question answering as shown in fig. 2 comprises the following steps:

secondly, the preprocessed question text data and the knowledge base entities are one-hot encoded and then fed into a pre-trained model to generate the corresponding word vectors;

thirdly, the feature vectors are input into the structured query statement generation module, and a SELECT clause and a WHERE clause are generated and combined into a complete SQL statement; structured query statement generation comprises six subtasks, each of which is a multi-class classification model; the probability of each label category is determined from the semantic representations of the question text and the knowledge base entities, and the category with the highest probability is the final output;

The data preprocessing module 1 processes the fuzzy entities of the question text and performs Chinese word segmentation on the question text; the text encoding module 2 produces a semantic representation of the question text-knowledge base entity sequence; the structured query statement generation module 3 takes the word vectors as input and generates a structured query statement through the SELECT clause generation module and the WHERE clause generation module; and the natural answer generation module 4 generates a natural answer based on the structured query statement as the final result of the question answering.

The data preprocessing module 1 performs the disambiguation and Chinese word segmentation tasks on the question text: the regular expression processing module 101 removes the fuzzy numeric entities of the question text by means of regular expressions, and the Chinese word segmentation module 102 segments the question text into a sequence of words using the jieba word segmentation tool;

the text encoding module 2 encodes the question text and the knowledge base entities into semantic vectors that can be understood and processed;

the SELECT clause generation module 301 of the structured query statement generation module 3 generates the column name and the aggregation function of the SELECT clause according to the input semantic vectors, wherein the column name is obtained from the semantic representation of the knowledge base entities and the aggregation functions comprise { AVG, MIN, MAX, COUNT, SUM }; the WHERE clause generation module 302 of the structured query statement generation module 3 generates the column names, operators, value texts and the connector between conditions of the WHERE clause according to the input semantic vectors, wherein the operator set is { >, <, =, != } and the connectors comprise { and, or };

the SELECT clause generating module 301 and the WHERE clause generating module 302 adopt a slot filling method, and realize the generation of the structured query statement through the filling of slots;

to generate the semantic representation, the text encoding module adopts a BERT model as the pre-trained model; the model has 209M parameters, and its input vector comprises a word embedding vector (Token embedding), a segment embedding vector (Segment embedding) and a position embedding vector (Position embedding); the natural language question text and the knowledge base entities are input into the pre-trained model in the form the model requires, and the input contents are as shown in the figure:

[Figure: question text word segmentation; knowledge base text word segmentation; input fusion]

Wherein CLS represents the starting position of the text participle, SEP represents the segmentation position of the text participle, namely the ending; the input of the module contains two aspects, the first is question text participle Q ═ (C)₁，C₂，…，C_n) As shown in the above figure "some stocks with market value greater than three billion"; the second is text participle in knowledge base, T ═ T, (h)₁，h₂，…，h_n) A column name from a knowledge base; such as "stock codes", "stock abbreviations", etc.; and finally, fusing the input, namely splicing the problem text and the column name set of the knowledge base, wherein the input formula of the data preprocessing module is as follows:

Input＝concat(CLS，Question，SEP，Column1，SEP，...) (1)
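Formula (1) can be illustrated with a short Python sketch. It is a minimal illustration only: the marker strings `[CLS]` and `[SEP]`, the helper name `build_input`, and the sample tokens are assumptions made for this example, and a real BERT tokenizer would further split each entry into sub-word tokens.

```python
from typing import List

CLS = "[CLS]"  # assumed marker string for the start of the sequence
SEP = "[SEP]"  # assumed marker string for a segment boundary


def build_input(question_tokens: List[str], column_names: List[str]) -> List[str]:
    """Concatenate CLS, the question tokens, and SEP-terminated column names."""
    sequence = [CLS] + list(question_tokens) + [SEP]
    for name in column_names:
        # Each knowledge base column name is followed by its own SEP marker.
        sequence += [name, SEP]
    return sequence


# Hypothetical segmented question and column names.
tokens = build_input(
    ["market value", "greater than", "three billion", "stocks"],
    ["stock code", "stock abbreviation"],
)
print(tokens)
```

Keeping one SEP after every column name lets a later stage locate the representation of each column by position.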

after the input is determined, the input data is sent to the pre-trained model for processing, and the different parts of the input yield different semantic vector representations {E_CLS, E_Q, (E_h1, ..., E_hm)}; the formula of the pre-trained model is as follows:

Pretrain_Model(input)＝{E_CLS，E_Q，E_h} (2)

First, the input question text obtains a word vector representation {E_CLS, E_Q/h, E_h/Q} based on the pre-trained model; because the question text has been processed by the regular expression processing module, no fuzzy entity remains in the text, and performance is better than that of a question-answering model without regular expression processing; for the natural language question representation and knowledge base representation output by the text coding module, the structured query statement generating module divides the question-answering task into six subtasks, namely the SEL_COL task, the SEL_AGG task, the W_CONN task, the W_COL task, the W_OP task and the W_VAL task;

E_Q is the representation of the question text output by the pre-trained model, and E_h represents the corresponding column name information; the natural language question text is spliced with all column names of the knowledge base, and the result is normalized to obtain a fusion vector;

S-AGG task: for the columns in the SELECT clauses generated by the question-answering model, selecting corresponding aggregation symbols according to the target columns is an important task; the aggregate symbols include SUM, AVG, MIN, MAX, and so on, and are generated as follows:

and the probability corresponding to each aggregation symbol is obtained through the fusion of the problem text representation and knowledge base information and the normalization operation.
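The normalization step can be sketched in Python as a softmax over one score per aggregation symbol. This is only an illustration under stated assumptions: the scores are hypothetical, the layer that projects the fused representation to scores is not shown, and the helper names `softmax` and `pick_aggregation` are invented for the example.

```python
import math
from typing import Dict, List

# Aggregation symbols named in the description.
AGG_SYMBOLS = ["AVG", "MIN", "MAX", "COUNT", "SUM"]


def softmax(logits: List[float]) -> List[float]:
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_aggregation(logits: List[float]) -> Dict[str, object]:
    """Return the most probable aggregation symbol and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"symbol": AGG_SYMBOLS[best], "probs": dict(zip(AGG_SYMBOLS, probs))}


# Hypothetical scores, one per symbol in AGG_SYMBOLS.
result = pick_aggregation([0.2, -1.0, 2.5, 0.1, 0.4])
print(result["symbol"])  # MAX
```

The same pattern, with a different label set, would apply to the other classification subtasks.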

W-CONN task: in WHERE clause generation, attention must be paid to the association relation between the conditions of the WHERE clause; the association relation set has three elements {and, or, None}, where "and" denotes a conjunctive relation between conditions, "or" denotes a disjunctive relation between conditions, and None denotes that there is no relation between conditions because only one condition exists; the generation formula is as follows:

W-COL tasks: when a SELECT clause is generated, the structured query statement generation module pays attention to the selection of the column names of the WHERE clause, which is different from a SELECT clause generation task, and because the number of the column names in the WHERE clause is uncertain, data analysis needs to be performed, and a generation formula is as follows:

W-OP task: after the column names in the WHERE clause are generated, the corresponding operation symbol is selected for each generated column name; the operation symbols comprise >, <, = and the like, and the generation formula is as follows:

in text value prediction, because of the particularity of the text type, the operation symbol in the WHERE clause can only be the "=" symbol; as shown in the following figure, for the question text "how much is the stock named notoginseng entertaining worth?", the column name in the WHERE clause is "stock name", the corresponding operation symbol is "=", and the operator is followed by a candidate text value of the corresponding column name;

after the column names of the WHERE clause are obtained, the WHERE clause generating module needs to pair the current column name with its candidate values, wherein the candidate values come from the numeric entities in the question text and the values stored under the corresponding column name; as shown in the following figure, for the question text "which stocks have a market value greater than three billion", the corresponding column name is "market value", the operation symbol is ">", and the candidate numeric entities are "300" and "464";
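The candidate arrangement step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names `numeric_entities` and `arrange_candidates` are hypothetical, the sample question is assumed to already contain digits, and the stored values are supplied by hand instead of being read from a knowledge base.

```python
import re
from typing import List, Tuple


def numeric_entities(question: str) -> List[str]:
    """Extract digit runs (optionally with a decimal part) from the question."""
    return re.findall(r"\d+(?:\.\d+)?", question)


def arrange_candidates(
    question: str, column: str, stored_values: List[str]
) -> List[Tuple[str, str, str]]:
    """Pair the question and column name with every distinct candidate value."""
    candidates: List[str] = []
    for value in numeric_entities(question) + list(stored_values):
        if value not in candidates:  # keep first occurrence, drop duplicates
            candidates.append(value)
    return [(question, column, value) for value in candidates]


# Hypothetical inputs: one numeric entity in the question, one stored value.
pairs = arrange_candidates(
    "which stocks have a market value greater than 300", "market value", ["464"]
)
for pair in pairs:
    print(pair)
```

Each resulting triple would then be scored by a classifier that decides whether the candidate is the value to extract.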

finally, following the same flow as text value extraction, the question text is paired with each candidate value, a semantic representation is obtained through the pre-trained model, and a binary classification is performed with labels {0, 1}; the label probability formula is as follows:

synthesizing the results of the six subtasks, and splicing the results of the SELECT clause generating module 301 and the WHERE clause generating module 302 to generate an SQL statement corresponding to the natural language question, and inputting the obtained structured query statement to the natural answer generating module 4 to generate a natural answer;
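The splicing of the six subtask results into one SQL statement can be sketched in Python. This is a hedged illustration, not the implementation of modules 301 and 302: the classes `Condition` and `QuerySlots`, the helpers `quote` and `to_sql`, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Condition:
    column: str  # W-COL result
    op: str      # W-OP result
    value: str   # W-VAL result


@dataclass
class QuerySlots:
    sel_col: str                     # S-COL result
    sel_agg: Optional[str] = None    # S-AGG result, None when no aggregation
    conditions: List[Condition] = field(default_factory=list)
    connector: Optional[str] = None  # W-CONN result: "and", "or" or None


def quote(value: str) -> str:
    """Quote text values; leave numeric literals bare."""
    try:
        float(value)
        return value
    except ValueError:
        return "'" + value.replace("'", "''") + "'"


def to_sql(slots: QuerySlots, table: str) -> str:
    """Fill the SELECT and WHERE slots and splice them into one statement."""
    target = f"{slots.sel_agg}({slots.sel_col})" if slots.sel_agg else slots.sel_col
    sql = f"SELECT {target} FROM {table}"
    if slots.conditions:
        joiner = f" {(slots.connector or 'and').upper()} "
        sql += " WHERE " + joiner.join(
            f"{c.column} {c.op} {quote(c.value)}" for c in slots.conditions
        )
    return sql


slots = QuerySlots("stock_name", None, [Condition("market_value", ">", "300")])
print(to_sql(slots, "stocks"))
# SELECT stock_name FROM stocks WHERE market_value > 300
```

String splicing is used here only to mirror the slot structure; code that runs against a real database would normally pass values as query parameters.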

the natural answer generation module 4 divides natural answer generation into three slots, subject, predicate and object, to be filled; the output of the subject generation module 401 is taken from the column name selected by the SELECT clause generation module 301; the predicate generation module 402 determines and generates the predicate label through binary text classification over a preset predicate set {"is", "has"}; the result of the object generation module 403 comes from executing the structured query statement on the database; and a complete natural answer is generated by splicing the subject, predicate and object slots.
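The slot splicing described above can be sketched in Python. This is a minimal illustration under stated assumptions: the table `stocks`, its rows, the helper `generate_answer` and the English predicate strings are hypothetical, the predicate is passed in directly instead of being produced by a classifier, and SQLite stands in for whatever database holds the knowledge base.

```python
import sqlite3
from typing import List, Tuple

# Assumed English renderings of the preset predicate labels.
PREDICATES = ("is", "has")


def generate_answer(subject: str, predicate: str, rows: List[Tuple]) -> str:
    """Splice the subject, predicate and object slots into one sentence."""
    if predicate not in PREDICATES:
        raise ValueError(f"unknown predicate label: {predicate}")
    # Object slot: first column of every row returned by the query.
    obj = ", ".join(str(row[0]) for row in rows)
    return f"{subject} {predicate} {obj}"


# Hypothetical in-memory knowledge base table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (stock_name TEXT, market_value REAL)")
conn.executemany("INSERT INTO stocks VALUES (?, ?)", [("A", 464.0), ("B", 120.0)])

# Object slot comes from executing the generated structured query statement.
rows = conn.execute(
    "SELECT stock_name FROM stocks WHERE market_value > 300"
).fetchall()
answer = generate_answer("stock name", "is", rows)
print(answer)  # stock name is A
```

The subject string here stands for the column name chosen for the SELECT clause, so the three slots map one-to-one onto modules 401, 402 and 403.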

The data preprocessing module carries out deblurring operation on the input problem text through the regular expression processing module, and simultaneously carries out word segmentation with the knowledge base entity through the Chinese word segmentation module, and the flow is shown in figure 3; the text coding module respectively carries out semantic vector representation on the problem text and the knowledge base entity based on the problem text coding module and the knowledge base entity coding module; the structured query sentence generation module converts the semantic representation of the text coding module into a structured query sentence through the SELECT clause generation module and the WHERE clause generation module, as shown in fig. 4; the natural answer generation module converts the structured query statement into a natural answer through the subject generation module, the predicate generation module and the object generation module, as shown in fig. 5.

In conclusion, regular expressions and Chinese word segmentation are used to fill in and complete the semantics of the question text and the knowledge base entity, eliminating fuzzy numeric entities in the question text and enriching its semantic information; a pre-trained model is used for the semantic representation of the question text and the knowledge base entity, raising the dimensionality of the semantic representation from 128 to 768 and giving a finer-grained representation of both; slot filling is used to convert the semantic representation into a complete structured query statement, and a natural answer is generated from that statement, which modularizes structured statement generation, while the subject-predicate-object slot structure turns the answer from an isolated fragment into a complete natural answer.

It should be noted that: the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention; modifications and changes that are apparent to those skilled in the relevant art from the foregoing description and that fall within the true spirit of the invention are intended to be covered by the appended claims.

Claims

1. The natural answer generating system based on the table question-answering is characterized in that: the system comprises a data preprocessing module (1), a text coding module (2), a structured query statement generating module (3) and a natural answer generating module (4); the data preprocessing module (1) is used for carrying out Chinese word segmentation and regular deblurring processing on a problem text proposed by a user;

the text coding module (2) converts the question text proposed by the user and the knowledge base entity into data types and computational units that a computer can process;

the structured query statement generating module (3) generates a structured query statement corresponding to the question according to the text coding, and comprises SELECT clause generation and WHERE clause generation;

and the natural answer generation module (4) generates natural answers according to the generated structured query sentence, and comprises subject generation, predicate generation and object generation.

2. The system of generating natural answers based on the form question answering according to claim 1, wherein: the data preprocessing module (1) comprises a regular expression processing module (101) and a Chinese word segmentation module (102); the regular expression processing module (101) is oriented to the problem text, and converts fuzzy digital entities into entities conforming to the storage form of the knowledge base by adopting a regular expression, wherein the fuzzy digital entities comprise year fuzzy entities, percentage fuzzy entities and digital fuzzy entities; the Chinese word segmentation module (102) is used for segmenting words in a text facing a Chinese problem text proposed by a user, and a word segmentation tool is adopted to obtain a string of word sequences.

3. The system of generating natural answers based on the form question answering according to claim 1, wherein: the text encoding module (2) comprises a question text encoding module (201) and a knowledge base entity encoding module (202), wherein the question text encoding module (201) is used for encoding vector representation of a question text, and the knowledge base entity encoding module (202) is used for encoding vector representation of a knowledge base entity.

4. The system for generating natural answers based on the form question answering according to claim 1 or 3, wherein: the text coding module (2) is used for mathematically representing the question text data and knowledge base entities processed by the data preprocessing module (1), mapping high-order vectors into a high-dimensional Euclidean space, encoding the text into an understandable and processable coding sequence by one-hot encoding, and acquiring the corresponding text semantic features through a pre-trained model.

5. The system of generating natural answers based on the form question answering according to claim 1, wherein: the structured query statement generating module (3) comprises a SELECT clause generating module (301) and a WHERE clause generating module (302); the SELECT clause generating module (301) adopts a pre-training model to train and generate a problem text and a word vector of a knowledge base entity, and then generates a SELECT clause through a multi-classification model Softmax, wherein the SELECT clause comprises column name selection and aggregation function selection; the WHERE clause generation module (302) adopts a pre-training model to train and generate word vectors of a problem text and a knowledge base entity, and then generates a WHERE clause through a multi-classification model Softmax, wherein the WHERE clause comprises column name selection, operation symbol selection, numerical value extraction and condition association symbol selection.

6. The system of generating natural answers based on the form question answering according to claim 1, wherein: the natural answer generating module (4) operates on the complete structured query statement formed by splicing the statements produced by the SELECT clause generating module (301) and the WHERE clause generating module (302) of the structured query statement generating module (3); the natural answer adopts a subject-predicate-object framework, and the module comprises a subject generating module (401), a predicate generating module (402) and an object generating module (403); the subject is generated from the result of SELECT clause generation; the predicate is generated by label judgment on the question text over the preset predicate labels {"is", "has"}; and the object is the result of executing the complete structured query statement in the database.

7. The natural answer generation method based on the table question answering is characterized by comprising the following steps: the method comprises the following steps:

firstly, performing data preprocessing on an input part of a question and answer, namely a question text and an entity in a knowledge base, and deblurring the question text and segmenting words of the question text;

secondly, inputting the feature vectors into a structured query statement generating module, and splicing the results of SELECT clause generation and WHERE clause generation to form the complete SQL; the structured query statement generation comprises six subtasks, each of which is a multi-classification model; the probability of each label category is judged based on the semantic representation of the question text and the knowledge base entity, and the category with the highest probability is the final output result;

and finally, designing a subject-predicate-object generation framework according to the natural answer generation requirement, taking the execution result of the complete SQL in the database as the object generation result, and splicing the slots into the final natural answer.

8. The natural answer generating method based on table questions and answers as claimed in claim 7, wherein: processing a fuzzy entity of a problem text and performing Chinese word segmentation on the problem text by a data preprocessing module (1); the text coding module (2) is used for carrying out semantic representation on the problem text-knowledge base entity sequence; the word vector is input by a structured query sentence generating module (3), and the structured query sentence is generated by a SELECT clause generating module and a WHERE clause generating module; and generating a natural answer based on the structured query sentence by a natural answer generating module (4) as a final result of the question answering.

9. The natural answer generation method based on table question answering according to claim 7 or 8, characterized in that: a regular expression processing module (101) of the data preprocessing module (1) is used for performing data deblurring and Chinese word segmentation tasks on the problem text, and removing fuzzy digital entities of the problem text through a regular expression; the Chinese word segmentation module (102) is used for segmenting words in the problem text and obtaining a string of word sequences by adopting a jieba word segmentation tool;

the text encoding module (2) encodes the question text and the knowledge base entity into a semantic vector which can be understood and processed;

a SELECT clause generating module (301) of the structured query statement generating module (3) generates the column names and aggregation functions in the SELECT clause according to the input semantic vectors, wherein the column names are obtained from the semantic representation of the knowledge base entity, and the aggregation functions comprise {AVG, MIN, MAX, COUNT, SUM}; a WHERE clause generating module (302) of the structured query statement generating module (3) generates the column names, operation symbols, value texts and inter-condition associated symbols of the WHERE clause according to the input semantic vectors, wherein the operation symbol set is {>, <, =, !=} and the associated symbols comprise {and, or};

the SELECT clause generating module (301) and the WHERE clause generating module (302) adopt a slot filling method, and the structured query statement is generated by filling slots;

CLS represents the starting position of the segmented text, and SEP represents a segmentation position, namely the ending position, of the segmented text; the input to the module contains two parts: the first is the segmented question text Q = (C_1, C_2, ..., C_n); the second is the segmented knowledge base text T = (h_1, h_2, ..., h_n), taken from the column names in the knowledge base; finally, the inputs are fused, namely the question text is spliced with the column name set of the knowledge base, and the input formula of the data preprocessing module is as follows:

Input＝concat(CLS，Question，SEP，Column1，SEP，...) (1)

Pretrain_Model(input)＝{E_CLS，E_Q，E_h} (2)

wherein, for the input to the pre-trained model, E_CLS is the semantic representation of the entire input, E_Q is the semantic representation of the question text, and E_h is formed by splicing the semantic representations corresponding to each knowledge base entity, i.e. E_h = {E_h1, E_h2, ..., E_hm};

First, the input question text obtains a word vector representation {E_CLS, E_Q/h, E_h/Q} based on the pre-trained model; for the natural language question representation and knowledge base representation output by the text coding module, the structured query statement generating module divides the question-answering task into six subtasks, namely the SEL_COL task, the SEL_AGG task, the W_CONN task, the W_COL task, the W_OP task and the W_VAL task;

S-COL task: an SQL statement is the generation result of the structured query statement generation module, and determining which column names are selected is the most basic task, namely the SELECT-COLUMN task; this task performs a Softmax operation on the output of the pre-trained model to obtain the probability corresponding to each column, and the formula is as follows:

S-AGG task: selecting corresponding aggregation symbols according to a target column for columns in a SELECT clause generated by a question and answer model is an important task; the aggregate symbols include SUM, AVG, MIN, MAX, etc., and are generated as follows:

W-OP task: after the column names in the WHERE clause are generated, the corresponding operation symbol is selected for each generated column name; the operation symbols comprise >, <, = and the like, and the generation formula is as follows:

only one step then remains in constructing the SQL statement, namely the prediction of WHERE-VAL; according to analysis of the training data samples, WHERE-VAL prediction results fall into two types, text value prediction and numeric value prediction, and the extraction methods for the two types differ;

in text value prediction, because of the particularity of the text type, the operation symbol in the WHERE clause can only be "=", and the operation symbol is followed by the candidate text value of the corresponding column name;

each candidate text value is then combined with the question text as the input of the pre-trained model, a binary classification is performed on the output semantic representation to judge whether the current candidate text is the value extraction target of the WHERE clause, and the label probability formula is as follows:

after the column names corresponding to the WHERE clauses are obtained, the WHERE clause generating module needs to arrange the current column names and corresponding candidate values, wherein the candidate values are from the digital entities of the problem texts and the storage values of the corresponding column names;

synthesizing the results of the six subtasks, and splicing the results of the SELECT clause generation module (301) and the WHERE clause generation module (302) to generate an SQL statement corresponding to the natural language question, and inputting the obtained structured query statement into a natural answer generation module (4) to generate a natural answer;

the natural answer generation module (4) divides natural answer generation into three slots, subject, predicate and object, to be filled; the output of the subject generation module (401) is taken from the column name selected by the SELECT clause generation module (301); the predicate generation module (402) determines and generates the predicate label through binary text classification over a preset predicate set {"is", "has"}; the result of the object generation module (403) comes from executing the structured query statement in the database; and a complete natural answer is generated by splicing the subject, predicate and object slots.