CN113111158A - Intelligent data visualization oriented conversational question-answering implementation method - Google Patents
- Publication number
- CN113111158A (CN202110399195.1A)
- Authority
- CN
- China
- Prior art keywords
- sql
- visualization
- question
- name
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
Abstract
The invention discloses a conversational question-answering implementation method oriented to intelligent data visualization, comprising the following steps. First, a data set is constructed through database collection, construction of the SQL function set of the analysis methods, question creation with SQL and visualization scheme labeling, SQL statement review, question text review, and overall review. Second, the specific problem is formalized mathematically on the basis of the data set. Third, a model framework is established for converting text into analytical SQL and for extracting the visualization scheme from text. Fourth, an evaluation scheme combining automatic evaluation and human evaluation is established. The invention makes it possible to construct a data question-answering system that supports more analysis methods than a common BI data question-answering system.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a business intelligence data visualization oriented conversational knowledge question-answering implementation method.
Background
With the progress of artificial intelligence (AI) technology and the support of big data, AI is being widely applied in fields such as image recognition, object detection, image generation, machine translation, knowledge graphs, and dialogue question answering. In these fields, AI helps people reduce repetitive labor, improve working efficiency, and carry out creative work. In business intelligence, people gain insight to support decisions by transforming raw data and applying analysis algorithms. At present, an analyst usually goes through the following steps to obtain an analysis result: understanding the data structure, transforming the raw data, selecting an analysis method or writing an analysis function, and designing a visual presentation to obtain the result. The drawbacks of the prior art are that this takes a great deal of time, the process is hard to reproduce, and much of the labor is repetitive.
Disclosure of Invention
In view of this situation, a technical scheme that simplifies these steps would greatly reduce the complexity of analysis, letting analysts focus on understanding the insight behind the data and obtain more knowledge from it. The invention therefore provides a conversational question-answering implementation method oriented to intelligent data visualization, which comprises a natural-language-based algorithm and further lays a foundation for knowledge-graph visualization of the system's question-answering knowledge.
The invention adopts the following technical scheme:
A conversational question-answering implementation method oriented to intelligent data visualization realizes business intelligence analysis, conversion of natural language into SQL (Structured Query Language), and automatic visualization. A new NL2SQL data set containing common analysis methods (the NL2BISQL data set) is constructed from business intelligence analysis methods; the model of the invention then carries out a question-answering process oriented to business intelligence visualization and returns a visualization result as the answer. The method comprises the following steps:
The first step, data set construction: construct a data set through database collection, construction of the SQL function set of the analysis methods, question creation with SQL and visualization scheme labeling, SQL statement review, question text review, and overall review;
The second step, problem formulation: formalize the specific problem mathematically on the basis of the data set;
The third step, model framework: establish a model framework for converting text into analytical SQL and for extracting the visualization scheme from text;
The fourth step, evaluation scheme: establish an evaluation scheme combining automatic evaluation and human evaluation.
Further, the first step is specifically as follows:
The data set of the invention is extended from the Spider data set, which contains 200 databases with an average of 5.1 tables each. Inspection shows that the data set generalizes to different domains; data from several business scenarios are nevertheless added so that the effect in business intelligence scenarios can be verified more directly.
The invention collects common analysis methods from descriptive statistics and inferential statistics and constructs a mechanism that lets users extend the collection as needed. Once the analysis methods are determined, a default visualization scheme is formulated, and the SQL function set of the analysis methods is finally formed.
Once the SQL function set of the analysis methods is available, generation of the question text, the corresponding SQL, and the visualization can begin. First, 20 to 30 SQL statements are generated for each database, following these rules: 1) the statements cover 50% of the analysis methods; 2) the statements involve every table of the database; 3) SQL is generated automatically from the table attributes; 4) because the same result can be expressed by different SQL statements, an SQL protocol is specified, and every generated statement must conform to it.
For a question with multiple possible SQL translations, a reviewer checks whether the SQL label was selected correctly according to the protocol. The reviewer then checks whether the SQL statements for the current database together cover 50% of the analysis methods and reference all tables in the database.
After the SQL labels are reviewed, other annotators check whether each question is clear and contains enough information to answer the query.
Finally, the SQL is rechecked. If several reviewers are unsure about an annotation, one designated annotator makes the final decision.
Further, the second step is specifically as follows:
On the basis of this data set, the specific problem is formalized mathematically. The invention is based on RAT-SQL (relation-aware self-attention) and modifies it according to the features of the invention. Specifically, given a natural language question $Q$, a relational database object set $S = \langle C, T \rangle$, and an analysis method function set $COMP$, the goal is to generate the corresponding SQL query statement $P$ and the corresponding visualization result $VIS$. The question is a sequence of words $Q = q_1, q_2, \ldots, q_{|Q|}$; the database object set $S$ consists of the columns $C = \{c_1, c_2, \ldots, c_{|C|}\}$ and the table names $T = \{t_1, t_2, \ldots, t_{|T|}\}$; the analysis method function set consists of several functions written in SQL, $COMP = \{comp_1, comp_2, \ldots, comp_{|COMP|}\}$. Each column name $c_i$ contains the words $c_{i,1}, \ldots, c_{i,|c_i|}$, and each table name $t_i$ contains the words $t_{i,1}, \ldots, t_{i,|t_i|}$. The SQL query $P$ is represented by an abstract syntax tree (AST) $T$. The visualization result $VIS$ consists of $\langle name, color, axis \rangle$, where name is the name of the visualization, color is the color set used, and axis is the axis position, written as three binary digits: the first digit (0 or 1) indicates whether the first dimension is mapped to the horizontal axis (which also fixes the second dimension on the vertical axis), the second digit (0 or 1) indicates whether the horizontal axis is placed at the top or the bottom, and the third digit (0 or 1) indicates whether the vertical axis is placed on the left or the right. When the natural language does not specify axis information, the default is 000.
Some columns in the schema are primary keys, which uniquely index the corresponding tables, and some are foreign keys, which reference primary key columns in other tables. In addition, each column has a type $\tau \in \{number, text\}$. The database object set is therefore represented formally as a directed graph $G = \langle V, E \rangle$. Its nodes $V = C \cup T$ are the columns and tables of the object set, each labeled with the words in its name (for a column, its type $\tau$ is placed in front of the label). Its edges $E$ are defined by the pre-existing database relations described in Table 1.
Table 1 node types and edge construction rules of the directed graph G.
If two nodes satisfy a description in the table, an edge is constructed and marked with the corresponding label; if no relation holds, no edge is constructed.
The directed graph $G$ encodes the database object set. To relate the question to the contents of the database object set, a new directed graph of the database object set with the question as context is defined, $G_Q = \langle V_Q, E_Q \rangle$, where $V_Q = V \cup Q = C \cup T \cup Q$ and $E_Q$ extends $E$ with the edges obtained by linking the question to the contents of the database object set. How these edges are obtained is described under object set linking.
With these definitions, the problem is divided into two subtasks: (1) converting the text into analytical SQL; (2) converting the text into a visualization scheme $VIS$. The text-to-analytical-SQL task uses an encoder-decoder architecture. After an analysis method $comp$ is selected (described in the model framework), $comp$ and the directed graph $G_Q$ are encoded by an encoder $f_{enc}$ as $c_i$, $t_i$, $q_i$, and $comp$, where $c_i$ is the encoding of a column, $t_i$ the encoding of a table name, and $q_i$ the encoding of a word in the question. The decoder $f_{dec}$ takes these as input and computes $P(P \mid G_Q, COMP)$.
Further, the third step is specifically:
The invention improves on the relation-aware input encoding proposed in RAT-SQL and provides a hierarchical self-attention mechanism, an encoding model for semi-structured input sequences. Compared with RAT-SQL, the model can better jointly encode the hierarchical relation structure already present in the input.
The idea of the self-attention mechanism is that each element can be expressed by its relationship to other elements, i.e. the relationship information is encoded into the element. The calculation method is as follows:
However, this self-attention mechanism only computes single-level relations between pairs of elements, whereas in practice an element may be more directly related to a combination of several elements. Algorithm 1 of the invention is as follows:
The method excludes one element of the sequence, computes the self-attention encoding of the remaining elements, and then includes the excluded element in a further self-attention encoding, so that hierarchical relations can be captured. In this task, once the analysis method vocabulary has been identified, the question-contextualized directed graph $G_Q$ of the database object set is encoded. Specifically, the column names $c_i$ and table names $t_i$ in the directed graph $G$ are first embedded with GloVe, and a bidirectional LSTM is then run over the embeddings to obtain their initial encodings. The question $Q$ is encoded with a bidirectional LSTM that outputs an encoding for each word, including the identified vocabulary related to the analysis method, whose encoding is denoted $comp^{init}$.
These preliminary encodings are independent of one another and lack the node connection information of the directed graph $G_Q$. To represent the relations among these elements in the encoding, a self-attention mechanism is used. In NL2CompSQL, the natural language contains a description of the analysis method, and it is crucial that the analysis method receives the correct input, which can be regarded as a whole. The association between the analysis-method vocabulary and combinations of input-related vocabulary should therefore be captured, and it is computed with the hierarchical self-attention mechanism. Because the main aim is to capture the relation between the analysis-method vocabulary and combinations of the other vocabulary, Algorithm 2 modifies Algorithm 1 so that only the analysis-method vocabulary is excluded in the computation:
Database relation set linking helps align the table, column, and value references in the natural language question with the database relation set. Alignment is divided into two parts: name linking and value linking.
Name linking is the matching of a column or table name to a natural language vocabulary. The matching is divided into full matching and partial matching. Specifically, n-grams of length 1 to 5 in the natural language question are first calculated and then it is determined whether they completely match the column name or table name or whether the n-grams are a subsequence of the column name or table name, thereby obtaining 4 relationships, TEM (full match of table names), TPM (partial match of table names), CEM (full match of column names), CPM (partial match of column names).
Matching the question against the database object set also involves matching the question against values in the database. Although matching a question to a column name does not in itself require value matching, the corresponding column is hard to find when the question does not mention the column name explicitly, and background knowledge is needed to alleviate this. The values in the database are a good source of such background knowledge, so the question is matched against the values in the database, and this relation is defined as CELLMATCH. Matching values requires querying the database, so SQL clauses such as SELECT and LIKE are used to construct the query statements.
Intuitively, the columns and tables that appear in the SQL query $P$ usually have corresponding references in the natural language question. To capture this intuition in the model, relation-aware attention is used as a pointer mechanism between each element of $y$ and all columns and tables to compute the column and table alignment matrices.
The decoder generates the SQL as an abstract syntax tree (AST) in depth-first traversal order. At each step an LSTM outputs an action: either APPLYRULE, which expands the most recently generated node using a grammar rule, or SELECTCOL or SELECTTAB, which selects a column or a table from the object set. The process by which the decoder generates SQL can be expressed as $P(P \mid Y) = \prod_t p(a_t \mid a_{<t}, Y)$, where $Y = f_{enc}(G_Q)$ is the final output of the encoder and $a_{<t}$ denotes all actions before step $t$.
The invention encodes the analysis method function $f$ with a tree-based LSTM to obtain $f_{emd} = \mathrm{LSTM}(f)$. The encoder output is modified to:
$Y = f_{enc}(G_Q, COMP)$
Extraction of the visualization scheme from text. The visualization scheme is simplified as follows: only two-dimensional data is visualized, and the controllable dimensions are limited to the visualization type name (for example pie chart or line chart), the color set color, and the axis position axis.
The problem therefore reduces to computing $P((name, color, axis) \mid Q, P)$. The result is conditioned on the generated SQL $P$ because the question may contain no description of the visualization, in which case the default visualization scheme of the analysis method is used. The question $Q$ is encoded with a bidirectional LSTM, and the last step outputs the discrete probability distributions $p(name)$, $p(color)$, and $p(axis)$ for the dimensions of the visualization scheme. When the maximum of a probability distribution is below a given threshold, the default scheme is selected; when it exceeds the threshold, the value with the maximum probability is taken.
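The threshold rule described above can be illustrated with a minimal Python sketch. The function names, the threshold of 0.5, the candidate values, and the probabilities are all hypothetical; in the described system the distributions would come from the last step of the bidirectional LSTM, and the text does not say what happens when the maximum equals the threshold (this sketch takes the maximum in that case).

```python
def choose_dimension(distribution, default, threshold=0.5):
    """Pick the most probable value, or the default when the peak is below the threshold."""
    best = max(distribution, key=distribution.get)
    return best if distribution[best] >= threshold else default


def choose_visualization(p_name, p_color, p_axis, default, threshold=0.5):
    """Apply the threshold rule independently to name, color, and axis."""
    return {
        "name": choose_dimension(p_name, default["name"], threshold),
        "color": choose_dimension(p_color, default["color"], threshold),
        "axis": choose_dimension(p_axis, default["axis"], threshold),
    }


# Hypothetical default scheme of the selected analysis method.
default_scheme = {"name": "bar", "color": "default", "axis": "000"}
scheme = choose_visualization(
    p_name={"pie": 0.1, "line": 0.8, "bar": 0.1},          # confident peak
    p_color={"warm": 0.4, "cool": 0.35, "default": 0.25},  # no confident peak
    p_axis={"000": 0.9, "100": 0.1},
    default=default_scheme,
)
print(scheme)
```

Here the name distribution has a clear peak, so "line" is taken, while the color distribution stays below the threshold and falls back to the default.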
Further, the fourth step is specifically as follows:
The invention uses automatic evaluation to assess the accuracy of the generated analytical SQL statements and visualization schemes. Following the evaluation method of Spider, the NL2CompSQL data set is evaluated on three aspects: component matching, exact matching, and execution accuracy.
For human evaluation, the invention evaluates from a horizontal and a vertical perspective. Horizontal evaluation resembles automatic evaluation: the completeness and accuracy of the data answers produced by different artificial intelligence algorithms are compared, except that the answers are scored by humans. Vertical evaluation assesses the effect of the system on business intelligence analysis and is designed as follows: a group of data sets and analysis targets is fixed in advance, and 20 business intelligence analysis experts are invited and divided equally into two groups; one group uses a common business intelligence (BI) question-answering tool (for example MS Power BI) to build visualizations and gain insight, and the other group uses the present system. The completion time, satisfaction, user experience, and similar measures of the two groups are recorded.
Compared with the prior art, the invention has the following advantages:
1. The invention makes it possible to construct a data question-answering system that supports more analysis methods than a common BI data question-answering system.
2. The invention is novel in its definition of the natural language grammar.
3. The invention provides an overall evaluation method and a unified framework capable of processing multi-type problems.
Drawings
FIG. 1 is a diagram of a business intelligence data visualization process of the present invention.
FIG. 2 is a schematic diagram of the data construction of the present invention.
FIG. 3 is a test error graph of the system. The ordinate is the error and the abscissa is the number of training sessions.
Detailed Description
The following describes embodiments of the present invention in detail with reference to the accompanying drawings.
As shown in FIG. 1, this embodiment carries out a conversational knowledge question-answering method oriented to business intelligence data visualization. The method proceeds as follows:
The first step: construct the data set, including the SQL function set of the analysis methods.
1) Database collection
The data set is extended from the Spider data set, which contains 200 databases with an average of 5.1 tables each. Inspection shows that the data set generalizes to different domains; data from several business scenarios are added so that the effect in business intelligence scenarios can be verified more directly.
2) SQL function set construction of analysis method
The invention collects analysis methods from descriptive statistics and inferential statistics and constructs a mechanism that lets users extend them as needed. Once the analysis methods are determined, students with a deep understanding of SQL convert each method into an SQL function, a default visualization scheme is formulated, and the SQL function set of the analysis methods is finally formed.
3) Question creation and SQL, visualization scheme tagging
Once the SQL function set of the analysis methods is available, generation of the question text and the corresponding SQL and visualization begins. First, 20 to 30 SQL statements are generated for each database, following these rules: the statements cover 50% of the analysis methods; the statements involve every table of the database; SQL is generated automatically from the table attributes; and, because the same result can be expressed by different SQL statements, the statements conform to a specified SQL protocol.
4) SQL statement and visualization scheme review
For a question with multiple possible SQL translations, a reviewer checks whether the SQL label was selected correctly according to the protocol. The reviewer then checks whether the SQL statements for the current database together cover 50% of the analysis methods and reference all tables in the database.
5) Question text review
After the SQL labels are reviewed, other annotators check whether each question is clear and contains enough information to answer the query. Students who have passed the College English Test Band 6 (CET-6) then review and correct each question. They first check that the question is grammatically correct and natural. Next, they make sure that the question reflects the meaning of its SQL label, analysis method, and visualization. Finally, to increase the variety of questions, annotators add paraphrased versions of some questions.
6) Overall review
If several reviewers are unsure about an annotation, one designated annotator makes the final decision. A script is then run to execute and parse all SQL labels to ensure that they are correct.
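A checking script of the kind mentioned above might look like the following sketch. It assumes the databases are available as SQLite files (here an in-memory database stands in for one); the function name, the table, and the SQL labels are hypothetical and are not taken from the data set.

```python
import sqlite3


def failing_labels(conn, sql_labels):
    """Execute every SQL label and return those that fail, with the error message."""
    failures = []
    for sql in sql_labels:
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error as exc:  # syntax errors, unknown tables or columns, ...
            failures.append((sql, str(exc)))
    return failures


# Hypothetical database and labels used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

labels = [
    "SELECT AVG(price) FROM orders",   # valid label
    "SELECT discount FROM orders",     # refers to a column that does not exist
]
for sql, error in failing_labels(conn, labels):
    print("needs review:", sql, "->", error)
```

Labels that fail to execute are returned for another review round rather than being corrected automatically.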
The second step: on the basis of the data set, the specific problem is formalized mathematically.
Given a natural language question $Q$, a relational database object set $S = \langle C, T \rangle$, and an analysis method function set $COMP$, the goal is to generate the corresponding SQL query statement $P$ and the corresponding visualization result $VIS$. The question is a sequence of words $Q = q_1, q_2, \ldots, q_{|Q|}$; the database object set $S$ consists of the columns $C = \{c_1, c_2, \ldots, c_{|C|}\}$ and the table names $T = \{t_1, t_2, \ldots, t_{|T|}\}$; the analysis method function set consists of several functions written in SQL, $COMP = \{comp_1, comp_2, \ldots, comp_{|COMP|}\}$. Each column name $c_i$ contains the words $c_{i,1}, \ldots, c_{i,|c_i|}$, and each table name $t_i$ contains the words $t_{i,1}, \ldots, t_{i,|t_i|}$. The SQL query $P$ is represented by an abstract syntax tree (AST) $T$. The visualization result $VIS$ consists of $\langle name, color, axis \rangle$, where name is the name of the visualization, color is the color set used, and axis is the axis position, written as three binary digits: the first digit (0 or 1) indicates whether the first dimension is mapped to the horizontal axis (which also fixes the second dimension on the vertical axis), the second digit (0 or 1) indicates whether the horizontal axis is placed at the top or the bottom, and the third digit (0 or 1) indicates whether the vertical axis is placed on the left or the right. When the natural language does not specify axis information, the default is 000.
Some columns in the schema are primary keys, which uniquely index the corresponding tables, and some are foreign keys, which reference primary key columns in other tables. In addition, each column has a type $\tau \in \{number, text\}$. The database object set is therefore represented formally as a directed graph $G = \langle V, E \rangle$. Its nodes $V = C \cup T$ are the columns and tables of the object set, each labeled with the words in its name (for a column, its type $\tau$ is placed in front of the label). Its edges $E$ are defined by the pre-existing database relations described in Table 1.
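The visualization triple can be pictured with a small Python sketch. The class name, field types, and flag names are assumptions made for illustration; in particular, the text does not state which digit value corresponds to top or bottom, or to left or right, so the sketch only splits the code into its three binary flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisScheme:
    name: str           # visualization type, e.g. "pie" or "line"
    color: str          # name of the color set
    axis: str = "000"   # three binary digits; "000" is the default given in the text


def decode_axis(axis: str) -> dict:
    """Split the three-digit axis code into its three binary flags."""
    if len(axis) != 3 or any(ch not in "01" for ch in axis):
        raise ValueError(f"axis must be three binary digits, got {axis!r}")
    first, second, third = (int(ch) for ch in axis)
    return {
        "first_dimension_flag": first,    # is dimension 1 mapped to the horizontal axis
        "horizontal_axis_flag": second,   # top versus bottom placement
        "vertical_axis_flag": third,      # left versus right placement
    }


vis = VisScheme(name="line", color="default")  # no axis given, so "000" is used
print(vis, decode_axis(vis.axis))
```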
Table 1 node type and edge construction rule of directed graph G
If two nodes satisfy a description in the table, an edge is constructed and marked with the corresponding label; if no relation holds, no edge is constructed.
The directed graph $G$ encodes the database object set. To relate the question to the contents of the database object set, a new directed graph of the database object set with the question as context is defined, $G_Q = \langle V_Q, E_Q \rangle$, where $V_Q = V \cup Q = C \cup T \cup Q$ and $E_Q$ extends $E$ with the edges obtained by linking the question to the contents of the database object set. How these edges are obtained is described in part (3), database relation set linking, of the third step. With these definitions, the problem is divided into two subtasks: (1) converting the text into analytical SQL; (2) converting the text into a visualization scheme $VIS$. The text-to-analytical-SQL task uses an encoder-decoder architecture. After an analysis method $comp$ is selected (described in the model framework), $comp$ and the directed graph $G_Q$ are encoded by an encoder $f_{enc}$ as $c_i$, $t_i$, $q_i$, and $comp$, where $c_i$ is the encoding of a column, $t_i$ the encoding of a table name, and $q_i$ the encoding of a word in the question. The decoder $f_{dec}$ takes these as input and computes $P(P \mid G_Q, COMP)$.
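Since Table 1 is not reproduced in this text, the following Python sketch only illustrates the general idea of building a labeled directed graph from a schema. The three edge labels (BELONGS_TO, PRIMARY_KEY, FOREIGN_KEY), the input format, and the splitting of names on underscores are assumptions for illustration, not the rules of Table 1.

```python
def build_schema_graph(tables, foreign_keys):
    """Build node labels and labeled directed edges for a schema.

    tables: {table: [(column, type, is_primary_key), ...]}
    foreign_keys: [((table, column), (table, column)), ...]
    """
    nodes = {}  # node id -> list of label words
    edges = {}  # (source id, target id) -> relation label
    for table, columns in tables.items():
        nodes[table] = table.split("_")
        for column, col_type, is_pk in columns:
            col_id = f"{table}.{column}"
            # For a column, the type is placed in front of the words of its name.
            nodes[col_id] = [col_type] + column.split("_")
            edges[(col_id, table)] = "PRIMARY_KEY" if is_pk else "BELONGS_TO"
    for (src_table, src_col), (dst_table, dst_col) in foreign_keys:
        edges[(f"{src_table}.{src_col}", f"{dst_table}.{dst_col}")] = "FOREIGN_KEY"
    return nodes, edges


# Hypothetical two-table schema.
schema = {
    "customer": [("id", "number", True), ("name", "text", False)],
    "orders": [("id", "number", True), ("customer_id", "number", False)],
}
fks = [(("orders", "customer_id"), ("customer", "id"))]
nodes, edges = build_schema_graph(schema, fks)
print(nodes["orders.customer_id"], edges[("orders.customer_id", "customer.id")])
```

Pairs of nodes with no relation simply have no entry in `edges`, matching the rule that no edge is constructed in that case.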
The third step: text conversion to analytical SQL, and text visualization.
(1) Analytical method identification
The goal of this part is to identify, in the sentence, the vocabulary that denotes the analysis method. Given a question $Q$, the probability $p(IS\_COMP \mid q_i)$ that each word $q_i$ denotes an analysis method is computed. A bidirectional LSTM (BiLSTM) outputs this probability for each word, and the word $comp$ with the highest probability is taken as the analysis method described by the question.
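The selection step alone can be sketched in a few lines of Python. The per-word probabilities below are made up and stand in for the BiLSTM output, which is not modeled here; the function name is likewise hypothetical.

```python
def pick_analysis_word(words, probs):
    """Return the word with the highest p(IS_COMP | q_i) and that probability."""
    if not words or len(words) != len(probs):
        raise ValueError("words and probs must be non-empty and of equal length")
    best = max(range(len(words)), key=lambda i: probs[i])
    return words[best], probs[best]


question = ["show", "the", "correlation", "of", "price", "and", "sales"]
is_comp = [0.01, 0.01, 0.93, 0.01, 0.20, 0.01, 0.15]  # hypothetical model output
print(pick_analysis_word(question, is_comp))
```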
(2) Relationship-aware input coding based on hierarchical self-attention mechanism
The invention improves on the relation-aware input encoding proposed in RAT-SQL and provides a hierarchical self-attention mechanism, an encoding model for semi-structured input sequences.
Consider a set of inputs in which the elements $x_i$ may be heterogeneous. The idea is that each element can be represented by its relations to the other elements, that is, the relation information is encoded into the element. The calculation method is as follows:
softmax: the normalized exponential function. LayerNorm: the layer normalization function. ReLU: the rectified linear unit. SelfAttn: the self-attention function. Concat: a function that concatenates multiple vectors or arrays. LSTM: a long short-term memory network.
However, this self-attention mechanism only computes single-level relations between pairs of elements, whereas in practice an element may be more directly related to a combination of several elements. The invention therefore proposes a hierarchical self-attention mechanism, whose algorithm is as follows:
The method excludes one element of the sequence, computes the self-attention encoding of the remaining elements, and then includes the excluded element in a further self-attention encoding, so that hierarchical relations can be captured.
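The exclude-then-include idea can be illustrated with the following Python sketch. It is an interpretation of the prose only, since the algorithm listing is not reproduced here: it uses plain scaled dot-product self-attention with identity projections and omits the learned weights, relation embeddings, layer normalization, and feed-forward layers that a full model would have. All function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Scaled dot-product self-attention with identity query/key/value projections."""
    dim = len(xs[0])
    out = []
    for query in xs:
        weights = softmax(
            [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in xs]
        )
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(dim)])
    return out


def hierarchical_self_attention(xs, excluded):
    """Encode all elements except xs[excluded], then attend again with it included."""
    if len(xs) < 2:
        raise ValueError("need at least two elements")
    rest = [x for i, x in enumerate(xs) if i != excluded]
    rest_encoded = self_attention(rest)          # first level: the other elements
    merged = rest_encoded[:excluded] + [xs[excluded]] + rest_encoded[excluded:]
    return self_attention(merged)                # second level: excluded element rejoins


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = hierarchical_self_attention(vectors, excluded=2)
print(encoded)
```

Passing the index of the analysis-method word as `excluded` corresponds to the variant in which only the analysis-method vocabulary is excluded.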
In the invention, once the analysis method vocabulary has been identified, the question-contextualized directed graph $G_Q$ of the database object set is encoded. Specifically, the column names $c_i$ and table names $t_i$ in the directed graph $G$ are first embedded with GloVe, and a bidirectional LSTM is then run over the embeddings to obtain their initial encodings. The question $Q$ is encoded with the bidirectional LSTM of (1), which outputs an encoding for each word, including the identified vocabulary related to the analysis method, whose encoding is denoted $comp^{init}$.
These preliminary encodings are independent of one another and lack the node connection information of the directed graph $G_Q$. To represent the relations among these elements in the encoding, a self-attention mechanism is used. In NL2CompSQL, the natural language contains a description of the analysis method, and it is crucial that the analysis method receives the correct input, which can be regarded as a whole. The association between the analysis-method vocabulary and combinations of input-related vocabulary should therefore be captured, and it is computed with the hierarchical self-attention mechanism. Because the main aim is to capture the relation between the analysis-method vocabulary and combinations of the other vocabulary, Algorithm 1 is modified so that only the analysis-method vocabulary is excluded in the computation:
Algorithm 2: hierarchical self-attention mechanism for the analysis method
(3) Database relational set linking
Database relation set linking helps align the table, column, and value references in the natural language question with the database relation set. Alignment is divided into two parts: name linking and value linking.
Name linking is the matching of a column or table name to a natural language vocabulary. The matching is divided into full matching and partial matching. Specifically, n-grams of length 1 to 5 in the natural language question are first calculated and then it is determined whether they completely match the column name or table name or whether the n-grams are a subsequence of the column name or table name, thereby obtaining 4 relationships, TEM (full match of table names), TPM (partial match of table names), CEM (full match of column names), CPM (partial match of column names).
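A minimal Python sketch of this name linking follows. It assumes that names are split into words on underscores and that "subsequence" means a contiguous run of words; both are assumptions for illustration, as are the function names and the example schema.

```python
def is_contiguous_sublist(gram, words):
    """True when gram occurs as a contiguous run inside words."""
    n = len(gram)
    return any(words[i:i + n] == gram for i in range(len(words) - n + 1))


def name_links(question_tokens, columns, tables, max_n=5):
    """Return (start, end, schema name, relation) for n-grams of length 1..max_n."""
    tokens = [t.lower() for t in question_tokens]
    names = [(c, "C") for c in columns] + [(t, "T") for t in tables]
    links = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            gram = tokens[start:start + n]
            for name, kind in names:
                words = name.lower().split("_")
                if gram == words:
                    links.append((start, start + n, name, kind + "EM"))  # full match
                elif is_contiguous_sublist(gram, words):
                    links.append((start, start + n, name, kind + "PM"))  # partial match
    return links


links = name_links(
    ["average", "unit", "price", "per", "customer"],
    columns=["unit_price", "customer_id"],
    tables=["customer"],
)
for link in links:
    print(link)
```

The token positions are zero-based and the end index is exclusive, so `(1, 3, "unit_price", "CEM")` marks the n-gram "unit price" as a full column-name match.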
Matching the question against the database object set also involves matching the question against values in the database. Although matching a question to a column name does not in itself require value matching, the corresponding column is hard to find when the question does not mention the column name explicitly, and background knowledge is needed to alleviate this. The values in the database are a good source of such background knowledge, so the question is matched against the values in the database, and this relation is defined as CELLMATCH. Matching values requires querying the database, so SQL clauses such as SELECT and LIKE are used to construct the query statements.
The columns and tables that appear in the SQL query $P$ usually have corresponding references in the natural language question. To capture this intuition in the model, relation-aware attention is used as a pointer mechanism between each element of $y$ and all columns and tables to compute the column and table alignment matrices.
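Value linking with SELECT and LIKE can be sketched as follows, assuming an SQLite database. The function name, the table, and the rows are hypothetical; the sketch also does not escape the LIKE wildcards `%` and `_` inside question words, which a real implementation would need to handle.

```python
import sqlite3


def cell_matches(conn, question_tokens, text_columns):
    """Return (token, table, column) triples whose relation would be CELLMATCH."""
    matches = []
    for token in question_tokens:
        for table, column in text_columns:
            # Identifiers cannot be bound as parameters; they come from the schema here.
            sql = f'SELECT 1 FROM "{table}" WHERE "{column}" LIKE ? LIMIT 1'
            if conn.execute(sql, (f"%{token}%",)).fetchone():
                matches.append((token, table, column))
    return matches


# Hypothetical database used only for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (name TEXT, city TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?)", [("Alice", "Paris"), ("Bob", "Berlin")])

found = cell_matches(
    db,
    ["customers", "from", "paris"],
    [("customer", "name"), ("customer", "city")],
)
print(found)
```

The question never mentions the column name "city", yet the word "paris" is linked to it through the stored value, which is the background knowledge the text refers to.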
(4) Decoder
Following the approach of Yin et al., the decoder generates the SQL as an Abstract Syntax Tree (AST) in depth-first traversal order. An LSTM outputs one action at each step: either APPLYRULE, which expands the most recently generated node according to a grammar rule, or SELECTCOL / SELECTTAB, which selects a column or table from the set of database objects.
Thus, the process by which the decoder generates SQL can be expressed as $P(P \mid Y) = \prod_t p(a_t \mid a_{<t}, Y)$, where $Y = f_{enc}(G_Q)$ is the final output of the encoder and $a_{<t}$ denotes all actions before step $t$.
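The factorization above can be illustrated numerically. In this sketch the per-step distributions are made-up numbers standing in for the LSTM outputs, and the action labels are hypothetical; the sum is taken in log space, the usual way to avoid underflow on long action sequences.

```python
import math

def sequence_log_prob(step_distributions, actions):
    """log P(P | Y) = sum over t of log p(a_t | a_<t, Y).

    step_distributions[t] maps each candidate action at step t to its
    probability, already conditioned on the earlier actions and on Y.
    """
    return sum(math.log(dist[a])
               for dist, a in zip(step_distributions, actions))

# Hypothetical distributions for a three-action derivation.
steps = [
    {"APPLYRULE[select]": 0.9, "APPLYRULE[agg]": 0.1},
    {"SELECTCOL[revenue]": 0.8, "SELECTCOL[city]": 0.2},
    {"SELECTTAB[store]": 0.5, "SELECTTAB[orders]": 0.5},
]
actions = ["APPLYRULE[select]", "SELECTCOL[revenue]", "SELECTTAB[store]"]

prob = math.exp(sequence_log_prob(steps, actions))  # 0.9 * 0.8 * 0.5
```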
(5) Enhancement using SQL analysis functions
When applying an analysis method, a person usually holds abstract, function-like knowledge of that method, and this knowledge helps in constructing complex analytical statements. It is therefore reasonable to assume that SQL analysis functions can further help the model build analytical SQL. The encoding of the SQL function corresponding to the analysis method identified in step (1) is added to the encoder output, so that the decoder can use the analysis method function when generating the final SQL statement. The analysis method function f is encoded with a tree-based LSTM to obtain $f_{emd} = \mathrm{LSTM}(f)$. The encoder output is modified to:
$Y = f_{enc}(G_Q, COMP)$.
The fourth step: automatic evaluation and human evaluation
The invention uses automatic evaluation to assess the accuracy of the generated analytical SQL statements and visualization schemes. Following the evaluation method of the Spider data set, NL2CompSQL is evaluated from three aspects: component matching, exact matching, and execution accuracy.
For human evaluation, the invention evaluates from two angles, transverse and longitudinal. Transverse evaluation is similar to automatic evaluation: it compares the completeness and accuracy of the data answers produced by different artificial intelligence algorithms, with humans scoring only the data answers. Longitudinal evaluation assesses the effect of the system on business intelligence analysis and is designed as follows: a set of data sets and analysis goals is fixed in advance, and 20 business intelligence analysis experts are invited and divided equally into two groups, one using a common BI question-answering tool (e.g., MS Power BI) to build visualizations and gain insight, the other using the present system. The completion time, satisfaction, user experience, and so on of the two groups are recorded.
Through business intelligence analysis and visualization, a business-intelligence-oriented visual multi-turn dialogue system can be realized: the user conducts a multi-turn dialogue with the system to ask data questions or knowledge questions; for data questions the system returns visualized answers, and for knowledge questions it returns textual answers directly.
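Two of the three metrics can be sketched as below. This is a simplified illustration, not Spider's official evaluation script: exact matching is approximated here by comparing crudely normalized strings, whereas Spider compares parsed SQL components, and the function names and sample table are hypothetical.

```python
import sqlite3
from collections import Counter

def normalize(sql):
    """Crude normalization: trim, drop trailing ';', lowercase, collapse spaces."""
    return " ".join(sql.strip().rstrip(";").lower().split())

def exact_match(pred, gold):
    """String-level stand-in for exact matching."""
    return normalize(pred) == normalize(gold)

def execution_match(conn, pred, gold):
    """True when the predicted query runs and returns the same multiset of rows."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold_rows = conn.execute(gold).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (city TEXT, revenue REAL)")
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("Hangzhou", 120.0), ("Shanghai", 300.0)])

gold = "SELECT AVG(revenue) FROM store"
pred = "SELECT SUM(revenue) / COUNT(*) FROM store"
em = exact_match(pred, gold)            # different text
ex = execution_match(conn, pred, gold)  # same result on this database
```

The example shows why the metrics are reported separately: two differently written queries can fail exact matching while still producing the same result when executed.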
In this embodiment, answers can be generated automatically by the algorithm. Implementing the dialogue system involves the following work: (1) a survey and summary of business intelligence analysis methods, that is, summarizing the analysis methods and habits of business intelligence analysts and designing a formal representation for them; (2) automatic visualization generation, that is, selecting a suitable visual encoding strategy, recommending query statements, and automatically generating the visualization result; (3) multi-turn dialogue, which resolves the coreference and information omission that often occur in conversation and supports further modifying or adding constraints to a question; (4) a unified multi-task framework for conversational question answering, since a business intelligence system covers many kinds of data and users have different question needs, mainly data questions and knowledge questions; (5) generation of a business intelligence visualization dialogue data set: because artificial intelligence algorithms are used, enough data must be collected to ensure the training effect, but manpower and material resources are limited, so a labor-saving way of generating the data set must be designed; (6) evaluation of how much the work helps BI analysts, with appropriate evaluation metrics designed and evaluation carried out from both the automatic and the human angle.
While the preferred embodiments and principles of this invention have been described in detail, it will be apparent to those skilled in the art that variations may be made in the embodiments based on the teachings of the invention and such variations are considered to be within the scope of the invention.
Claims (6)
1. A conversational question-answering implementation method oriented to intelligent data visualization, characterized by comprising the following steps:
the first step: constructing a data set through database collection, construction of the SQL function set of analysis methods, question creation with SQL and visualization scheme labeling, SQL statement review, question text review, and overall review;
the second step: formulating the specific problem mathematically on the basis of the data set;
the third step: establishing a model framework for converting text into analytical SQL and for extracting the visualization scheme from text;
the fourth step: establishing an evaluation scheme of automatic evaluation and human evaluation.
2. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 1, wherein the first step is as follows:
the data set is extended from the Spider data set and comprises 200 databases, each containing 5.1 tables on average;
common analysis methods are collected from descriptive analysis and inferential statistics, and a mechanism is built so that users can extend them as needed; after the analysis methods are determined, a default visualization scheme is formulated for each, and finally the SQL function set of the analysis methods is formed;
once the SQL function set of the analysis methods is available, generation of the text and the corresponding SQL and visualizations may begin: first, 20-30 SQL statements are generated for each database, following these rules: 1) coverage of the analysis methods is 50%; 2) every table of the database is involved; 3) SQL is generated automatically from the table attributes; 4) because the same result can be expressed by different SQL, an SQL convention is specified, and generated SQL must satisfy it.
3. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 2, wherein the second step is as follows:
given a natural language question Q, a set of relational database objects $S = \langle C, T \rangle$, and an analysis method function set COMP, the aim is to generate a corresponding SQL query statement P and a corresponding visualization result VIS; the question consists of a sequence of words $Q = q_1, q_2, \ldots, q_{|Q|}$; the database object set S is composed of the column names $C = \{c_1, c_2, \ldots, c_{|C|}\}$ and the table names $T = \{t_1, t_2, \ldots, t_{|T|}\}$; the analysis method function set comprises a plurality of functions written in SQL, $COMP = \{comp_1, comp_2, \ldots, comp_{|COMP|}\}$; each column name $c_i$ and each table name $t_i$ consists of one or more words; the SQL query P is represented by an Abstract Syntax Tree (AST) T; the visualization result VIS is composed of $\langle name, color, axis \rangle$, where name is the visualization name, color is a color set, and axis is the axis position, represented by three digits that each take the value 0 or 1: the first indicates whether the first dimension is mapped to the horizontal axis, the second whether the horizontal axis is at the top or the bottom, and the third whether the vertical axis is on the left or the right; when the natural language does not specify axis information, the default is 000;
some columns in the schema are primary keys that uniquely index the corresponding tables, and some are foreign keys that reference primary key columns in other tables; furthermore, each column has a type $\tau \in \{number, text\}$; thus, the set of database objects is formally represented as a directed graph $G = \langle V, E \rangle$, whose nodes $V = C \cup T$ are the column names and table names of the object set, each labeled with the words in its name;
the directed graph G is an encoding of the set of database objects; a new question-contextualized directed graph of the database object set, $G_Q = \langle V_Q, E_Q \rangle$, is defined, where $V_Q = V \cup Q = C \cup T \cup Q$ and $E_Q$ additionally contains the edges obtained by matching the question against the contents of the database object set; how these edges are obtained is described under object set linking;
after the above definitions, the problem is divided into two subtasks: (1) converting text into analytical SQL; (2) converting text into a visualization scheme VIS; for the task of converting text into analytical SQL, the basic structure is an encoder-decoder structure: after an analysis method comp is selected, comp and the directed graph $G_Q$ are encoded by an encoder $f_{enc}$ as $c_i$, $t_i$, $q_i$, and $comp$, where $c_i$ is the encoding of a column, $t_i$ is the encoding of a table name, and $q_i$ is the encoding of a word in the question; the decoder $f_{dec}$ takes these as input and computes $P(P \mid G_Q, COMP)$.
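The visualization triple defined in this claim can be pictured as a small data structure. The sketch below is only an illustration: the class name `Vis`, the field types, and the sample values are assumptions, and because the text does not state which digit value corresponds to which placement, the digits are exposed without interpretation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vis:
    name: str          # visualization type name, e.g. "bar"
    color: tuple       # color set
    axis: str = "000"  # three binary digits; "000" when unspecified

    def __post_init__(self):
        # Reject anything that is not exactly three characters from {0, 1}.
        if len(self.axis) != 3 or set(self.axis) - {"0", "1"}:
            raise ValueError("axis must be three digits, each 0 or 1")

    def axis_flags(self):
        """Return the three digits as integers, in the order given in the text."""
        first_dim, horizontal, vertical = (int(d) for d in self.axis)
        return {"first_dim_digit": first_dim,
                "horizontal_axis_digit": horizontal,
                "vertical_axis_digit": vertical}

default_vis = Vis(name="bar", color=("#4c78a8",))            # axis not specified
custom_vis = Vis(name="line", color=("#f58518",), axis="101")
```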
4. The intelligent data visualization-oriented conversational question-answering implementation method of claim 3, wherein in the third step, the text conversion into analytical SQL specifically comprises:
the idea of the self-attention mechanism is that each element can be expressed by its relationship with other elements, i.e. the relationship information is encoded into the element by:
softmax: the normalized exponential function;
LayerNorm: the layer normalization function;
ReLU: the rectified linear unit function;
SelfAttn: the self-attention mechanism function;
concat: a function that concatenates a plurality of vectors or arrays;
LSTM: a long short-term memory network;
however, this self-attention mechanism captures only the single-level relationship between pairs of elements, whereas in practice an element may relate more directly to a combination of several elements; Algorithm 1 is as follows:
first, GloVe embeddings are used to encode the column names $c_i$ and the table names $t_i$ in the directed graph G, yielding their initial embeddings, and a bidirectional LSTM is then run over them to obtain the preliminary column and table encodings; the question Q is likewise encoded with a bidirectional LSTM, which outputs an encoding for each word, including the recognized vocabulary associated with the analysis method, denoted $comp^{init}$;
Algorithm 2 is as shown in the following table, in which only the analysis-method-related vocabulary is singled out for the calculation:
database relation set linking helps align the table, column, and value references in the natural language question with the database relation set; alignment is divided into two parts: name linking and value linking;
name linking matches column or table names against the natural language vocabulary; matches are either full or partial; specifically, n-grams of length 1 to 5 in the natural language question are first computed, and it is then judged whether each fully matches a column name or table name or is a subsequence of one, thereby obtaining 4 relationships: TEM, TPM, CEM, and CPM;
the values in the database are a good source of background knowledge, so the question can be matched against the values in the database, and the relationship is defined as CELLMATCH; value matching requires querying the database, so SQL clauses such as SELECT and LIKE are used to construct the query statements;
to capture this intuition in the model, relation-aware attention is used as a pointer mechanism between each element in y and all columns and tables to compute the column and table alignment matrices;
the decoder generates the result by traversing the abstract syntax tree in depth-first order; an LSTM outputs one action at each step: either APPLYRULE, which expands the most recently generated node according to a grammar rule, or SELECTCOL / SELECTTAB, which selects a column or table from the set of objects; the process by which the decoder generates SQL may be expressed as $P(P \mid Y) = \prod_t p(a_t \mid a_{<t}, Y)$, where $Y = f_{enc}(G_Q)$ is the final output of the encoder and $a_{<t}$ denotes all actions before step $t$;
the analysis method function f is encoded using a tree-based LSTM to obtain $f_{emd}$, and the encoder output is modified to:
$Y = f_{enc}(G_Q, COMP)$.
5. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 3 or 4, wherein in the third step, the visualization scheme is extracted from the text as follows: the visualization scheme is simplified, first by handling only visualizations of two-dimensional data, and second by restricting the visualization control dimensions to the visualization type name, the color, and the axis position axis;
the problem is then reduced to $p((name, color, axis) \mid Q, P)$; the resulting SQL P is included as a condition because the question may contain no visualization description, in which case the default visualization scheme of the analysis method is used; the question Q is encoded with a bidirectional LSTM, and the last step outputs a discrete probability distribution for each dimension of the visualization scheme, p(name), p(color), and p(axis); when the maximum of a distribution is below a certain threshold, the default scheme is selected; when it exceeds the threshold, the value corresponding to the maximum is taken.
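The thresholded choice described in this claim can be sketched as follows. The distributions, the default scheme, and the threshold of 0.5 are made-up values for illustration, the function names are hypothetical, and treating a maximum exactly equal to the threshold as confident is an assumption, since the text leaves that case open.

```python
def pick(distribution, default, threshold=0.5):
    """Return the most probable label, or `default` if it is not confident enough."""
    label, prob = max(distribution.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else default

def choose_vis(p_name, p_color, p_axis, default, threshold=0.5):
    """Apply the threshold rule independently to each visualization dimension."""
    return {
        "name": pick(p_name, default["name"], threshold),
        "color": pick(p_color, default["color"], threshold),
        "axis": pick(p_axis, default["axis"], threshold),
    }

# Hypothetical default scheme attached to the selected analysis method.
default_scheme = {"name": "bar", "color": "default-palette", "axis": "000"}

vis = choose_vis(
    p_name={"bar": 0.2, "line": 0.7, "pie": 0.1},        # confident: line
    p_color={"blue": 0.4, "red": 0.35, "green": 0.25},   # not confident
    p_axis={"000": 0.1, "100": 0.9},                     # confident: 100
    default=default_scheme,
)
```

In this example the question is taken to describe the chart type and axis clearly but not the color, so only the color falls back to the default scheme.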
6. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 5, wherein the fourth step is as follows:
evaluating the accuracy of the generated analytical SQL statements and the visualization scheme by automatic evaluation;
for human evaluation, evaluation is carried out from two angles, transverse and longitudinal; transverse evaluation: the completeness and accuracy of the data answers under different artificial intelligence algorithms are compared, and only the data answers are manually scored; longitudinal evaluation: a set of data sets and analysis targets is determined in advance, a number of business intelligence analysts are invited and divided into two groups, one group using a common business intelligence question-answering tool and the other group using the present method; the data of the two groups are recorded.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110399195.1A CN113111158B (en) | 2021-04-14 | 2021-04-14 | Intelligent data visualization oriented conversational question-answering implementation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113111158A (en) | 2021-07-13 |
CN113111158B (en) | 2022-05-10 |
Family
ID=76716746
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110399195.1A Active CN113111158B (en) | 2021-04-14 | 2021-04-14 | Intelligent data visualization oriented conversational question-answering implementation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113111158B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170061809A1 (en) * | 2015-01-30 | 2017-03-02 | Xerox Corporation | Method and system for importing hard copy assessments into an automatic educational system assessment |
CN110209787A (en) * | 2019-05-29 | 2019-09-06 | 袁琦 | A kind of intelligent answer method and system based on pet knowledge mapping |
CN111400469A (en) * | 2020-03-12 | 2020-07-10 | 法雨科技(北京)有限责任公司 | Intelligent generation system and method for voice question answering |
CN111542813A (en) * | 2017-10-09 | 2020-08-14 | 塔谱软件有限责任公司 | Using object models of heterogeneous data to facilitate building data visualizations |
Non-Patent Citations (1)
Title |
---|
曾帅等: "面向知识自动化的自动问答研究进展", 《自动化学报》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114782122A (en) * | 2022-03-15 | 2022-07-22 | 福建亿力电力科技有限责任公司 | Automatic analysis method and system for bidder information in bidding material |
CN116821168A (en) * | 2023-08-24 | 2023-09-29 | 吉奥时空信息技术股份有限公司 | Improved NL2SQL method based on large language model |
CN116821168B (en) * | 2023-08-24 | 2024-01-23 | 吉奥时空信息技术股份有限公司 | Improved NL2SQL method based on large language model |
Also Published As
Publication number | Publication date |
---|---|
CN113111158B (en) | 2022-05-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220824. Patentee after: Hangzhou Taoyi Data Technology Co.,Ltd. (address after: Room 405, 6-8 Jiaogong Road, Xihu District, Hangzhou City, Zhejiang Province, 310013). Patentee before: HANGZHOU DIANZI University (address before: 310018 no.1158, No.2 street, Baiyang street, Hangzhou Economic and Technological Development Zone, Zhejiang Province). |