CN113111158A - Intelligent data visualization oriented conversational question-answering implementation method - Google Patents


Info

Publication number
CN113111158A
Authority
CN
China
Prior art keywords
sql
visualization
question
name
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110399195.1A
Other languages
Chinese (zh)
Other versions
CN113111158B (en
Inventor
李齐良
李舒琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Taoyi Data Technology Co ltd
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110399195.1A priority Critical patent/CN113111158B/en
Publication of CN113111158A publication Critical patent/CN113111158A/en
Application granted granted Critical
Publication of CN113111158B publication Critical patent/CN113111158B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor

Abstract

The invention discloses a conversational question-answering implementation method for intelligent data visualization, comprising the following steps. First, a data set is constructed through database collection, construction of the SQL function set of analysis methods, question creation with SQL and visualization scheme labeling, SQL statement review, question text review, and overall review. Second, the specific problem is formalized mathematically on the basis of the data set. Third, a model framework is established for converting text into analytical SQL and for extracting the visualization scheme from the text. Fourth, an evaluation scheme combining automatic evaluation and human evaluation is established. The invention makes it possible to build a data question-answering system that supports more analysis methods than common BI data question-answering systems.

Description

Intelligent data visualization oriented conversational question-answering implementation method
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a business intelligence data visualization oriented conversational knowledge question-answering implementation method.
Background
With the progress of artificial intelligence (AI) technology and the support of big data, AI is now widely applied in fields such as image recognition, object detection, image generation, machine translation, knowledge graphs, and dialogue question answering. In these fields AI helps reduce repetitive labor, improve working efficiency, and assist creative work. In business intelligence, people obtain insights that support decisions by transforming raw data and applying analysis algorithms. Currently, an analyst usually goes through the following steps to obtain an analysis result: understanding the data structure, transforming the raw data, selecting an analysis method or writing an analysis function, and designing a visual display scheme. This process takes a great deal of time, is difficult to reproduce, and involves much repeated labor.
Disclosure of Invention
In view of the above, the invention develops a technical scheme that simplifies these steps, which greatly reduces the complexity of analysis and lets the analyst focus on understanding the insight behind the data and obtain more knowledge from it. To this end, the invention provides a conversational question-answering implementation method for intelligent data visualization. The method comprises a natural-language-based algorithm and lays a foundation for knowledge-graph visualization of the system's question-answering knowledge.
The invention adopts the following technical scheme:
A conversational question-answering implementation method for intelligent data visualization realizes business intelligence analysis, conversion of natural language into SQL (Structured Query Language), and automatic visualization. A new NL2SQL data set containing common analysis methods (the NL2BISQL data set) is constructed from business intelligence analysis methods; the model of the invention then implements a question-answering process oriented to business intelligence visualization, and a visualization result is returned as the answer. The method comprises the following steps:
The first step, data set construction: a data set is constructed through database collection, construction of the SQL function set of analysis methods, question creation with SQL and visualization scheme labeling, SQL statement review, question text review, and overall review;
The second step, problem formulation: the specific problem is formalized mathematically on the basis of the data set;
The third step, model framework: a model framework is established for converting text into analytical SQL and for extracting the visualization scheme from the text;
The fourth step, evaluation scheme: an evaluation scheme combining automatic evaluation and human evaluation is established.
Further, the first step is specifically as follows:
The invention extends the Spider data set, which contains 200 databases with an average of 5.1 tables each. On inspection, although the data set generalizes to different fields, data from some business scenarios are added so that the effect in business intelligence scenarios can be verified more directly.
The invention collects common analysis methods from descriptive statistics and inferential statistics and constructs a mechanism that lets users extend them as needed. After the analysis methods are determined, a default visualization scheme is formulated for each, and finally the SQL function set of the analysis methods is formed.
Once the SQL function set of the analysis methods is available, generation of the question text, the corresponding SQL, and the visualization can begin. First, 20 to 30 SQL statements are generated for each data set, following these rules: 1) they cover 50% of the analysis methods; 2) they involve every table in the database; 3) SQL is generated automatically from the table attributes; 4) because the same result can be expressed by different SQL statements, an SQL protocol is specified that must be satisfied when SQL is generated.
For a question with multiple possible SQL translations, a reviewer checks whether the SQL label was selected correctly according to the protocol. The reviewer then checks whether all SQL statements for the current database cover 50% of the analysis methods and reference all tables in the database.
After the SQL labels are reviewed, other annotators check whether each question is clear and contains enough information to answer the query.
Finally, the SQL is rechecked. If several reviewers are unsure about certain annotation questions, one annotator makes the final decision.
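The reviewer's coverage check described above (at least 50% of the analysis methods used, every table referenced) could be automated roughly as follows. This is a minimal sketch under stated assumptions: the function name `review_sql_coverage`, the example function names (`mean`, `corr`, `median`, `ttest`), and the regex-based detection are all hypothetical and are not taken from the patent.

```python
import re

def review_sql_coverage(sql_statements, analysis_functions, tables, min_ratio=0.5):
    """Return (method_ratio, missing_tables, passed) for one database.

    Assumption: an analysis method counts as covered when its function name
    appears as a call, and a table counts as referenced when its name appears
    as a whole word anywhere in the statements.
    """
    text = " ".join(sql_statements).lower()
    used = {f for f in analysis_functions
            if re.search(r"\b" + re.escape(f.lower()) + r"\s*\(", text)}
    missing = {t for t in tables
               if not re.search(r"\b" + re.escape(t.lower()) + r"\b", text)}
    ratio = len(used) / len(analysis_functions) if analysis_functions else 0.0
    return ratio, missing, ratio >= min_ratio and not missing

# Hypothetical statements for a two-table database.
sqls = [
    "SELECT MEAN(price) FROM orders",
    "SELECT CORR(o.price, c.age) FROM orders o JOIN customers c ON o.cid = c.id",
]
ratio, missing, ok = review_sql_coverage(
    sqls, ["mean", "corr", "median", "ttest"], ["orders", "customers"])
print(ratio, missing, ok)
```

A text-level check like this is only a first filter; a real review would parse the SQL rather than match strings.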
Further, the second step is specifically as follows:
On the basis of this data set, the specific problem is formalized mathematically. The present invention is based on RAT (relation-aware self-attention)-SQL and is modified according to the features of the present invention. Specifically, given a natural language question $Q$, a relational database object set $S = \langle C, T \rangle$, and an analysis method function set $COMP$, the goal is to generate the corresponding SQL query statement $P$ and the corresponding visualization result $VIS$. The question is a sequence of words $Q = q_1, q_2, \ldots, q_{|Q|}$; the database object set $S$ consists of the columns $C = \{c_1, c_2, \ldots, c_{|C|}\}$ and the table names $T = \{t_1, t_2, \ldots, t_{|T|}\}$; the analysis method function set consists of functions written in SQL, $COMP = \{comp_1, comp_2, \ldots, comp_{|COMP|}\}$. Each column name $c_i$ contains the words $c_{i,1}, \ldots, c_{i,|c_i|}$, and each table name $t_i$ contains the words $t_{i,1}, \ldots, t_{i,|t_i|}$.
The SQL query $P$ is represented by an abstract syntax tree (AST) $T$. The visualization result $VIS$ is a triple $\langle name, color, axis \rangle$, where $name$ is the name of the visualization, $color$ is the color set used, and $axis$ gives the axis positions as three binary digits: the first digit (0 or 1) indicates whether the first dimension is mapped to the horizontal axis (the second dimension is then mapped to the vertical axis); the second digit (0 or 1) indicates whether the horizontal axis is at the top or the bottom; the third digit (0 or 1) indicates whether the vertical axis is on the left or the right. When the natural language question specifies no axis information, the default is 000.
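The $VIS$ triple can be pictured as a small record. The sketch below is illustrative only: the class name `Vis` and the function `decode_axis` are hypothetical, and because the text does not state which of 0 and 1 means top or bottom, left or right, the mapping used here (0 is the conventional layout, so the default 000 puts the first dimension on a bottom horizontal axis with the vertical axis on the left) is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vis:
    name: str           # visualization type, e.g. "pie" or "line"
    color: str          # identifier of the color set
    axis: str = "000"   # default when the question gives no axis information

def decode_axis(axis):
    """Decode the three binary digits of `axis`.

    ASSUMPTION: 0 means the conventional choice for every digit; the source
    only says each digit selects between the two alternatives.
    """
    if len(axis) != 3 or any(ch not in "01" for ch in axis):
        raise ValueError("axis must be three binary digits, e.g. '000'")
    first, second, third = axis
    return {
        "first_dim_on": "horizontal" if first == "0" else "vertical",
        "horizontal_axis_at": "bottom" if second == "0" else "top",
        "vertical_axis_at": "left" if third == "0" else "right",
    }

default_layout = decode_axis(Vis(name="line", color="default").axis)
print(default_layout)
```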
Some columns in the schema are primary keys that uniquely index the corresponding tables, and some are foreign keys that reference primary key columns in other tables. Further, each column has a type $\tau \in \{number, text\}$. The database object set is therefore formally represented as a directed graph $G = \langle V, E \rangle$. Its nodes $V = C \cup T$ are the columns and tables of the object set, each labeled with the words in its name (for a column, its type $\tau$ is placed in front of the label). Its edges $E$ are defined by the pre-existing database relationships described in Table 1.
Table 1. Node types and edge construction rules of the directed graph G.
[Table 1 is given as an image in the original publication and is not reproduced here.]
If the two nodes are in accordance with the description, an edge is constructed, corresponding labels are marked, and if any relation does not exist, the edge is not constructed.
The directed graph $G$ encodes the database object set. To relate the question to the contents of the database object set, a new question-contextualized directed graph $G_Q = \langle V_Q, E_Q \rangle$ is defined, where $V_Q = V \cup Q = C \cup T \cup Q$ and $E_Q$ consists of $E$ together with the edges obtained by matching the question against the contents of the database object set. How these additional edges are obtained is described under object set linking.
With these definitions, the problem is divided into two subtasks: (1) converting the text into analytical SQL; (2) converting the text into a visualization scheme $VIS$. For the task of converting text into analytical SQL, the basic structure is an encoder-decoder architecture. After an analysis method $comp$ is selected (as described in the model framework), the encoder $f_{enc}$ encodes $comp$ and the directed graph $G_Q$ into $c_i$, $t_i$, $q_i$, and $comp$, where $c_i$ is the encoding of a column, $t_i$ is the encoding of a table name, and $q_i$ is the encoding of a word in the question. The decoder $f_{dec}$ takes these as input and computes $\Pr(P \mid G_Q, COMP)$.
Further, the third step is specifically:
The invention improves on the relation-aware input encoding proposed in RAT-SQL and provides a hierarchical self-attention mechanism, an encoding model for semi-structured input sequences. Compared with RAT-SQL, the model can better jointly encode the hierarchical relational structure present in the input.
The idea of the self-attention mechanism is that each element can be expressed through its relationships to the other elements, that is, the relationship information is encoded into the element. The calculation is as follows:
[The equations are given as images in the original publication and are not reproduced here; the only line recoverable from the text is:]
$y_i = \mathrm{SelfAttn}(x_i, X)$
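For orientation, a generic single-head scaled dot-product self-attention can be written as below. This is the standard Transformer form and is offered only as a reference point: the patent's own equations are not recoverable from the text and may differ, for example by adding relation terms as in RAT-SQL or by using multiple heads. The projection matrices $W_Q$, $W_K$, $W_V$, the dimension $d$, and the weights $w_{ij}$ are notation assumed here, not taken from the source.

```latex
e_{ij} = \frac{(x_i W_Q)\,(x_j W_K)^{\top}}{\sqrt{d}}, \qquad
w_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
z_i = \sum_{j} w_{ij}\,(x_j W_V)
```

Here $z_i$ is a weighted mixture of all inputs, which is the sense in which each element is expressed through its relationships to the others.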
However, this self-attention mechanism computes only a single-level relationship between two elements, whereas in practice an element may be more directly related to a combination of several elements. Algorithm 1 of the present invention addresses this.
[Algorithm 1 is given as an image in the original publication and is not reproduced here.]
The method excludes a certain element from the sequence, computes the self-attention encodings of the other elements, and then includes the excluded element in the self-attention encoding, so that hierarchical relationships can be captured. In this task, once the analysis method vocabulary has been identified, the question-contextualized database object set directed graph $G_Q$ is encoded. Specifically, the GloVe embedding method is first used to embed the words of each column name $c_i$ and table name $t_i$ in the directed graph $G$; a bidirectional LSTM is then run over each name to obtain its initial encoding. The question $Q$ is also encoded with a bidirectional LSTM, which outputs an encoding for each word, including the recognized analysis-method-related vocabulary, whose encoding is denoted $comp^{init}$.
These preliminary column, table, and question encodings are independent of each other and lack the node connection information in the directed graph $G_Q$. To represent the relationships among these elements in the encoding, a self-attention mechanism is used. In the present invention the natural language question contains a description of the analysis method, and in NL2CompSQL it is crucial that the analysis method receives the correct inputs, which can be regarded as a whole. It is therefore desirable to capture the association between the analysis-method-related vocabulary and combinations of the input-related vocabulary, which is computed with the hierarchical self-attention mechanism. Because the main aim is to capture the relation between the analysis-method-related vocabulary and combinations of other vocabulary, Algorithm 2 modifies Algorithm 1 so that only the analysis-method-related vocabulary is excluded in the computation.
[Algorithm 2 is given as an image in the original publication and is not reproduced here.]
Database relationship set linking helps align the table, column, and value references in the natural language question with the database relationship set. The alignment has two main parts: name linking and value linking.
Name linking matches column or table names to natural language vocabulary. Matches are divided into full matches and partial matches. Specifically, the n-grams of length 1 to 5 in the natural language question are first computed, and each n-gram is then checked for whether it matches a column name or table name exactly or is a subsequence of one. This yields four relationships: TEM (table name exact match), TPM (table name partial match), CEM (column name exact match), and CPM (column name partial match).
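The n-gram matching just described can be sketched in a few lines. The sketch makes several assumptions that the text does not fix: names are split into words on underscores, the question is split on whitespace, comparison is case-insensitive, and "subsequence" is read as a contiguous run of words. The function names are hypothetical.

```python
def ngrams(tokens, max_n=5):
    """Yield every n-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def is_contiguous_sub(sub, seq):
    """True when `sub` occurs as a contiguous run inside `seq`."""
    n = len(sub)
    return any(tuple(seq[i:i + n]) == sub for i in range(len(seq) - n + 1))

def name_links(question, columns, tables):
    """Return a set of (question n-gram, schema name, relation) triples."""
    q_tokens = question.lower().split()
    links = set()
    for gram in ngrams(q_tokens):
        for kind, names in (("C", columns), ("T", tables)):
            for name in names:
                words = tuple(name.lower().split("_"))
                if gram == words:
                    links.add((" ".join(gram), name, kind + "EM"))  # exact match
                elif is_contiguous_sub(gram, words):
                    links.add((" ".join(gram), name, kind + "PM"))  # partial match
    return links

links = name_links("show average unit price per customer",
                   columns=["unit_price", "customer_id"],
                   tables=["customer"])
for link in sorted(links):
    print(link)
```

In this hypothetical example "unit price" is an exact column match, while "customer" is both an exact table match and a partial match for the column `customer_id`.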
Matching a question against the database object set also involves matching the question against values in the database. Matching a question to a column name does not in itself require value matching, but when the question does not mention the column name explicitly it is difficult to find the corresponding column, and background knowledge is needed to alleviate the problem. The values in the database are a good source of such background knowledge, so the question is also matched against the values in the database, and this relationship is defined as CELLMATCH. Matching values requires querying the database, so SQL clauses such as SELECT and LIKE are used to construct the query statements.
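A value lookup of this kind might be built with SELECT and LIKE as follows. This is a sketch only, using Python's standard `sqlite3` module with an in-memory database; the table `store`, its columns, and the function `cell_matches` are hypothetical, and matching single whitespace-separated words by substring is an assumption.

```python
import sqlite3

def cell_matches(conn, question, text_columns):
    """Return (word, table.column, 'CELLMATCH') for each question word found in a column.

    `text_columns` is a list of (table, column) pairs taken from the schema,
    not from user input, which is why they are interpolated as identifiers.
    The word itself is always passed as a bound parameter.
    """
    links = set()
    for word in question.lower().split():
        for table, column in text_columns:
            sql = f'SELECT 1 FROM "{table}" WHERE lower("{column}") LIKE ? LIMIT 1'
            if conn.execute(sql, (f"%{word}%",)).fetchone():
                links.add((word, f"{table}.{column}", "CELLMATCH"))
    return links

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (city TEXT, manager TEXT)")
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("Hangzhou", "Li"), ("Shanghai", "Wang")])
links = cell_matches(conn, "total sales in Hangzhou",
                     [("store", "city"), ("store", "manager")])
print(links)
```

Here the question never names the column `city`, yet the value "Hangzhou" links the question to it, which is the background-knowledge effect the paragraph describes.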
Intuitively, the columns and tables that appear in the SQL query $P$ usually have corresponding references in the natural language question. To capture this intuition in the model, relation-aware attention is applied as a pointer mechanism between each element in $y$ and all columns and tables to compute the column and table alignment matrices.
[The defining formulas of the alignment matrices are given as images in the original publication and are not reproduced here.]
The decoder generates the abstract syntax tree (AST) in depth-first traversal order. An LSTM outputs one action at each step: either APPLYRULE, which expands the most recently generated node by a grammar rule, or SELECTCOL or SELECTTAB, which selects a column or a table from the object set. The process by which the decoder generates SQL can be expressed as $\Pr(P \mid Y) = \prod_t \Pr(\alpha_t \mid \alpha_{<t}, Y)$, where $Y = f_{enc}(G_Q)$ is the final output of the encoder and $\alpha_{<t}$ denotes all actions before step $t$.
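To make the action sequence concrete, the sketch below linearizes a toy AST depth-first into APPLYRULE, SELECTCOL, and SELECTTAB actions. It illustrates only the ordering of actions, not the LSTM that predicts them; the tuple-based tree representation and the grammar rule strings are hypothetical.

```python
def to_actions(node):
    """Depth-first, left-to-right linearization of an AST into decoder actions.

    A node is (kind, payload): ("column", name), ("table", name),
    or ("rule", (rule_text, children)).
    """
    kind, payload = node
    if kind == "column":
        return [("SELECTCOL", payload)]
    if kind == "table":
        return [("SELECTTAB", payload)]
    rule, children = payload
    actions = [("APPLYRULE", rule)]        # expand the current node first
    for child in children:                 # then descend into each child in order
        actions.extend(to_actions(child))
    return actions

# Hypothetical AST for a query that aggregates one column of one table.
ast = ("rule", ("sql -> select from", [
    ("rule", ("select -> agg(column)", [("column", "price")])),
    ("rule", ("from -> table", [("table", "orders")])),
]))
actions = to_actions(ast)
for step, action in enumerate(actions, start=1):
    print(step, action)
```

The product in the formula above runs over exactly such a sequence: each factor is the probability of one action given all earlier actions and the encoder output.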
The invention uses a tree-based LSTM to encode the analysis method function $f$, obtaining $f_{emd} = \mathrm{LSTM}(f)$. The encoder output is accordingly modified to:
$Y = f_{enc}(G_Q, COMP)$
The visualization scheme is then extracted from the text. The visualization scheme is simplified as follows: only two-dimensional data are visualized, and the controllable dimensions are limited to the visualization type name $name$ (for example, pie chart or line chart), the color $color$, and the axis position $axis$.
The problem therefore reduces to computing $\Pr((name, color, axis) \mid Q, P)$. The generated SQL $P$ is included as a condition because the question may contain no description of the visualization, in which case the default visualization scheme of the analysis method is used. The question $Q$ is encoded with a bidirectional LSTM, and at the last step the discrete probability distributions $p(name)$, $p(color)$, and $p(axis)$ over each dimension of the visualization scheme are output. For each dimension, when the maximum of the probability distribution is below a certain threshold, the default is selected; when it exceeds the threshold, the value corresponding to the maximum is taken.
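The threshold rule can be sketched as follows. The threshold value 0.5, the example distributions, and the default scheme are hypothetical; the text does not say what happens when the maximum equals the threshold, so treating equality as "use the prediction" is an assumption.

```python
def pick(dist, default, threshold=0.5):
    """Return the most probable value, or `default` when the model is not confident."""
    best = max(dist, key=dist.get)
    # ASSUMPTION: a maximum exactly at the threshold counts as confident.
    return best if dist[best] >= threshold else default

# Hypothetical default scheme of the selected analysis method.
default_vis = {"name": "bar", "color": "default", "axis": "000"}

# Hypothetical per-dimension outputs of the question encoder.
predicted = {
    "name": {"pie": 0.82, "line": 0.10, "bar": 0.08},
    "color": {"default": 0.40, "warm": 0.35, "cool": 0.25},
    "axis": {"000": 0.30, "100": 0.45, "010": 0.25},
}

vis = {dim: pick(dist, default_vis[dim]) for dim, dist in predicted.items()}
print(vis)
```

In this example only the chart type is predicted confidently, so the color and axis fall back to the analysis method's defaults.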
Further, the fourth step is specifically as follows:
The invention uses automatic evaluation to assess the accuracy of the generated analytical SQL statements and visualization schemes. Following the evaluation method of the Spider data set, the invention evaluates three aspects: component matching, exact matching, and execution accuracy.
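Of the three metrics, execution accuracy is the easiest to illustrate: a predicted query counts as correct when it returns the same rows as the reference query. The sketch below uses Python's standard `sqlite3` module and compares results as multisets, ignoring row order; this is a simplification assumed for illustration and is not the Spider evaluation script itself. The table `t` and the function `execution_match` are hypothetical.

```python
import sqlite3
from collections import Counter

def execution_match(conn, pred_sql, gold_sql):
    """True when both queries return the same multiset of rows."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False                      # an invalid prediction is simply wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

same = execution_match(conn, "SELECT x FROM t WHERE x > 1 ORDER BY x DESC",
                       "SELECT x FROM t WHERE x >= 2")
broken = execution_match(conn, "SELEC x FROM t", "SELECT x FROM t")
print(same, broken)
```

The first pair differs as text but agrees on execution, which is why execution accuracy is reported alongside the stricter matching metrics.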
For human evaluation, the invention evaluates from both a horizontal and a vertical perspective. Horizontal evaluation is similar to automatic evaluation: the completeness and accuracy of the data answers produced by different artificial intelligence algorithms are compared, except that the answers are scored by humans. Vertical evaluation assesses the effect of the system on business intelligence analysis and is designed as follows: a group of data sets and analysis targets is determined in advance, and 20 business intelligence analysis experts are invited and divided equally into two groups. One group uses a common business intelligence (BI) question-answering tool (for example, MS Power BI) to produce visualizations and gain insight, and the other group uses the present system. The completion time, satisfaction, user experience, and similar measures of the two groups are recorded.
Compared with the prior art, the invention has the following advantages:
1. the invention can construct a data question-answering system which supports more analysis, and can support more analysis methods compared with the common BI data question-answering system.
2. The present invention builds on the novelty of the definition of natural language grammar.
3. The invention provides an overall evaluation method and a unified framework capable of processing multi-type problems.
Drawings
FIG. 1 is a diagram of a business intelligence data visualization process of the present invention.
FIG. 2 is a schematic diagram of the data construction of the present invention.
FIG. 3 is a test error graph of the system. The ordinate is the error and the abscissa is the number of training sessions.
Detailed Description
The following describes embodiments of the present invention in detail with reference to the accompanying drawings.
As shown in FIG. 1, this embodiment carries out a conversational knowledge question-answering method oriented to business intelligence data visualization. The method is as follows:
The first step: construct the SQL function set of the analysis methods.
1) Database collection
The data set is an extension of the Spider data set, which contains 200 databases with an average of 5.1 tables each. On inspection, the data set generalizes to different fields; data from some business scenarios are added so that the effect in business intelligence scenarios can be verified more directly.
2) SQL function set construction of analysis method
The invention collects analysis methods from descriptive statistics and inferential statistics and constructs a mechanism that lets users extend them as needed. After the analysis methods are determined, students with a deep understanding of SQL convert them into SQL functions, a default visualization scheme is formulated, and finally the SQL function set of the analysis methods is formed.
3) Question creation and SQL, visualization scheme tagging
Once the SQL function set of the analysis methods is available, generation of the question text, the corresponding SQL, and the visualization begins. First, 20 to 30 SQL statements are generated for each data set, following these rules: they cover 50% of the analysis methods; they involve every table in the database; SQL is generated automatically from the table attributes; and, because the same result can be expressed by different SQL statements, an SQL protocol is specified that must be satisfied when SQL is generated.
4) SQL statement and visualization scheme review
For a question with multiple possible SQL translations, a reviewer checks whether the SQL label was selected correctly according to the protocol. The reviewer then checks whether all SQL statements for the current database cover 50% of the analysis methods and reference all tables in the database.
5) Question text review
After the SQL labels are reviewed, other annotators check whether each question is clear and contains enough information to answer the query. A student who has passed the College English Test Band 6 (CET-6) then reviews and corrects each question, first checking that the question is grammatically correct and natural, and next ensuring that it reflects the meaning of its corresponding SQL label, analysis method, and visualization. Finally, to increase the variety of questions, annotators are asked to add paraphrased versions of certain questions.
6) Overall review
If several reviewers are unsure about certain annotation questions, one annotator makes the final decision. A script is then run to execute and parse all SQL labels to ensure that they are correct.
The second step is that: on the basis of the data set, the specific problem is mathematically processed.
Given a natural language question $Q$, a relational database object set $S = \langle C, T \rangle$, and an analysis method function set $COMP$, the goal is to generate the corresponding SQL query statement $P$ and the corresponding visualization result $VIS$. The question is a sequence of words $Q = q_1, q_2, \ldots, q_{|Q|}$; the database object set $S$ consists of the columns $C = \{c_1, c_2, \ldots, c_{|C|}\}$ and the table names $T = \{t_1, t_2, \ldots, t_{|T|}\}$; the analysis method function set consists of functions written in SQL, $COMP = \{comp_1, comp_2, \ldots, comp_{|COMP|}\}$. Each column name $c_i$ contains the words $c_{i,1}, \ldots, c_{i,|c_i|}$, and each table name $t_i$ contains the words $t_{i,1}, \ldots, t_{i,|t_i|}$.
The SQL query $P$ is represented by an abstract syntax tree (AST) $T$. The visualization result $VIS$ is a triple $\langle name, color, axis \rangle$, where $name$ is the name of the visualization, $color$ is the color set used, and $axis$ gives the axis positions as three binary digits: the first digit (0 or 1) indicates whether the first dimension is mapped to the horizontal axis (the second dimension is then mapped to the vertical axis); the second digit (0 or 1) indicates whether the horizontal axis is at the top or the bottom; the third digit (0 or 1) indicates whether the vertical axis is on the left or the right. When the natural language question specifies no axis information, the default is 000.
Some columns in the schema are primary keys that uniquely index the corresponding tables, and some are foreign keys that reference primary key columns in other tables. Further, each column has a type $\tau \in \{number, text\}$. Formally, the database object set is therefore represented as a directed graph $G = \langle V, E \rangle$. Its nodes $V = C \cup T$ are the columns and tables of the object set, each labeled with the words in its name (for a column, its type $\tau$ is placed in front of the label). Its edges $E$ are defined by the pre-existing database relationships described in Table 1.
Table 1. Node types and edge construction rules of the directed graph G.
[Table 1 is given as an image in the original publication and is not reproduced here.]
If the two nodes are in accordance with the description, an edge is constructed, corresponding labels are marked, and if any relation does not exist, the edge is not constructed.
The directed graph $G$ encodes the database object set. To relate the question to the contents of the database object set, a new question-contextualized directed graph $G_Q = \langle V_Q, E_Q \rangle$ is defined, where $V_Q = V \cup Q = C \cup T \cup Q$ and $E_Q$ consists of $E$ together with the edges obtained by matching the question against the contents of the database object set. How these additional edges are obtained is described under database relationship set linking.
With these definitions, the problem is divided into two subtasks: (1) converting the text into analytical SQL; (2) converting the text into a visualization scheme $VIS$. For the task of converting text into analytical SQL, the basic structure is an encoder-decoder architecture. After an analysis method $comp$ is selected (as described in the model framework), the encoder $f_{enc}$ encodes $comp$ and the directed graph $G_Q$ into $c_i$, $t_i$, $q_i$, and $comp$, where $c_i$ is the encoding of a column, $t_i$ is the encoding of a table name, and $q_i$ is the encoding of a word in the question. The decoder $f_{dec}$ takes these as input and computes $\Pr(P \mid G_Q, COMP)$.
The third step: text conversion to analytical SQL, and text visualization.
(1) Analytical method identification
The goal of this part is to identify, in the sentence, the vocabulary that represents the analysis method. Given a question $Q$, the probability $p(\mathrm{IS\_COMP} \mid q_i)$ that each word $q_i$ denotes an analysis method is computed. A bidirectional LSTM (BiLSTM) outputs this probability for each word, and the word $comp$ with the highest probability is taken as the analysis method described by the question.
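The selection step amounts to an argmax over per-word probabilities. The sketch below shows only that step; the probabilities are made-up stand-ins for BiLSTM outputs, and the function name is hypothetical.

```python
def identify_analysis_word(tokens, probs):
    """Return (word, probability) for the word most likely to name the analysis method."""
    if len(tokens) != len(probs):
        raise ValueError("one probability per token is required")
    idx = max(range(len(tokens)), key=probs.__getitem__)
    return tokens[idx], probs[idx]

tokens = ["show", "the", "correlation", "between", "price", "and", "age"]
# Hypothetical values of p(IS_COMP | q_i) for each token.
probs = [0.02, 0.01, 0.91, 0.05, 0.10, 0.01, 0.08]

comp, p = identify_analysis_word(tokens, probs)
print(comp, p)
```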
(2) Relationship-aware input coding based on hierarchical self-attention mechanism
The invention improves on the relation-aware input encoding proposed in RAT-SQL and provides a hierarchical self-attention mechanism, an encoding model for semi-structured input sequences.
Consider a set of inputs $X$ with elements $x_i$, where the elements need not be uniform. The idea is that each element can be expressed through its relationships to the other elements, that is, the relationship information is encoded into the element. The calculation is as follows:
[The equations are given as images in the original publication and are not reproduced here; the only line recoverable from the text is:]
$y_i = \mathrm{SelfAttn}(x_i, X)$
Here softmax is the normalized exponential function; LayerNorm is the layer normalization function; ReLU is the rectified linear unit; SelfAttn is the self-attention function; Concat concatenates multiple vectors or arrays; LSTM is the long short-term memory network.
However, this self-attention mechanism computes only a single-level relationship between two elements, whereas in practice an element may be more directly related to a combination of several elements. The present invention therefore proposes a hierarchical self-attention mechanism (Algorithm 1).
[Algorithm 1 is given as an image in the original publication and is not reproduced here.]
The method excludes a certain element from the sequence, computes the self-attention encodings of the other elements, and then includes the excluded element in the self-attention encoding, so that hierarchical relationships can be captured.
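One possible reading of this exclude-then-include procedure is sketched below. It is not the patent's algorithm, which is available only as an image: the attention here is a parameter-free dot-product attention with no learned projections or layer normalization, and the function names are hypothetical. The sketch needs at least two input vectors.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attn(query, keys):
    """Parameter-free attention: a softmax-weighted average of `keys`."""
    weights = softmax([dot(query, k) / math.sqrt(len(query)) for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(len(query))]

def hierarchical_self_attn(X, i):
    """Encode X[i] against a context whose other elements have already attended to each other."""
    rest = X[:i] + X[i + 1:]
    rest_encoded = [self_attn(x, rest) for x in rest]   # level 1: x_i is excluded
    return self_attn(X[i], rest_encoded + [X[i]])       # level 2: x_i rejoins the sequence

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = hierarchical_self_attn(X, 2)
print(y)
```

Under this reading, restricting the excluded index `i` to the position of the analysis method word would correspond to the variant in which only the analysis-method-related vocabulary is excluded.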
In the invention, once the analysis method vocabulary has been identified, the question-contextualized database object set directed graph $G_Q$ is encoded. Specifically, the GloVe embedding method is first used to embed the words of each column name $c_i$ and table name $t_i$ in the directed graph $G$; a bidirectional LSTM is then run over each name to obtain its initial encoding. The question $Q$ is encoded with the bidirectional LSTM from (1), which outputs an encoding for each word, including the recognized analysis-method-related vocabulary, whose encoding is denoted $comp^{init}$.
These preliminary column, table, and question encodings are independent of each other and lack the node connection information in the directed graph $G_Q$. To represent the relationships among these elements in the encoding, a self-attention mechanism is used. In the present invention the natural language question contains a description of the analysis method, and in NL2CompSQL it is crucial that the analysis method receives the correct inputs, which can be regarded as a whole. It is therefore desirable to capture the association between the analysis-method-related vocabulary and combinations of the input-related vocabulary, which is computed with the hierarchical self-attention mechanism. Because the main aim is to capture the relation between the analysis-method-related vocabulary and combinations of other vocabulary, Algorithm 2 modifies Algorithm 1 so that only the analysis-method-related vocabulary is excluded in the computation.
Algorithm 2: hierarchical self-attention mechanism oriented to the analysis method.
[Algorithm 2 is given as an image in the original publication and is not reproduced here.]
(3) Database relational set linking
Database relation set linking helps align the table, column and value references in the natural language question with the database relation set. Alignment is divided into two parts: name linking and value linking.
Name linking is the matching of a column or table name to a natural language vocabulary. The matching is divided into full matching and partial matching. Specifically, n-grams of length 1 to 5 in the natural language question are first calculated and then it is determined whether they completely match the column name or table name or whether the n-grams are a subsequence of the column name or table name, thereby obtaining 4 relationships, TEM (full match of table names), TPM (partial match of table names), CEM (full match of column names), CPM (partial match of column names).
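The name-linking step just described can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, tokenization is plain whitespace splitting, and "subsequence" is assumed here to mean a contiguous run of words inside the column or table name.

```python
def ngrams(tokens, max_n=5):
    # Yield all n-grams of length 1..max_n as tuples of words.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def is_contiguous_part(gram, name_words):
    # True if `gram` occurs as a contiguous run inside `name_words`.
    n = len(gram)
    return any(tuple(name_words[i:i + n]) == gram
               for i in range(len(name_words) - n + 1))

def name_links(question, columns, tables):
    # Return (n-gram, schema name, relation) triples, where the relation is
    # one of CEM, CPM, TEM, TPM as defined in the text.
    tokens = question.lower().split()
    links = set()
    for gram in ngrams(tokens):
        for prefix, names in (("C", columns), ("T", tables)):
            for name in names:
                words = tuple(name.lower().split())
                if gram == words:
                    links.add((" ".join(gram), name, prefix + "EM"))
                elif is_contiguous_part(gram, words):
                    links.add((" ".join(gram), name, prefix + "PM"))
    return links

links = name_links(
    "show the average order amount per customer",
    columns=["order amount", "customer id"],
    tables=["customer"],
)
```

In this toy question, "order amount" fully matches a column name, while "customer" fully matches a table name and partially matches the column "customer id".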
Matching a question against the database object set also involves matching the question against values in the database. Although matching a question to a column name does not itself require value matching, it is difficult to find the corresponding column when the question does not mention the column name explicitly, and background knowledge is needed to alleviate this problem. The values in the database are a good source of such background knowledge, so the question is matched against the values in the database, and this relationship is defined as CELLMATCH. Value matching requires querying the database, so SQL clauses such as SELECT and LIKE are used to construct the query statements.
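A value-linking query of the kind mentioned above might look like the following sketch, which uses Python's built-in `sqlite3` module. The helper name `cell_matches` and the toy table are assumptions for illustration; the patent does not specify the database engine or the exact query form.

```python
import sqlite3

def cell_matches(conn, table, column, words):
    # Return the question words that occur inside some value of table.column.
    # Identifiers cannot be bound as parameters, so they are quoted here and
    # are assumed to come from the schema, never from user input.
    sql = f'SELECT 1 FROM "{table}" WHERE "{column}" LIKE ? LIMIT 1'
    matched = []
    for word in words:
        # Note: '%' or '_' inside `word` would act as LIKE wildcards.
        if conn.execute(sql, (f"%{word}%",)).fetchone():
            matched.append(word)
    return matched

# Toy database used only to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT)")
conn.executemany("INSERT INTO city VALUES (?)", [("Hangzhou",), ("Shanghai",)])
hits = cell_matches(conn, "city", "name", ["sales", "hangzhou"])
```

Each word returned by the helper would give rise to a CELLMATCH relation between that question word and the column.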
Columns and tables appearing in the SQL query P will typically have corresponding references in the natural language question. To capture this intuition in the model, relation-aware attention is used as a pointer mechanism between each element of y and all columns/tables to compute the column and table alignment matrices
Figure BDA0003019718140000141
Figure BDA0003019718140000142
Figure BDA0003019718140000143
Figure BDA0003019718140000144
(4) Decoder
The decoder is based on the abstract syntax tree (AST) approach proposed by Yin et al. and generates the result by depth-first traversal. At each step an LSTM outputs an action: one kind of action, APPLYRULE, expands the most recently generated node into a grammar rule; the other kind, SELECTCOL or SELECTTAB, selects a column or table from the object set.
Thus, the process by which the decoder generates SQL can be expressed as P(P | Y) = ∏_t p(a_t | a_{<t}, Y), where Y = f_enc(G_Q) is the final output of the encoder and a_{<t} denotes all actions before step t.
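The factorization above means that the probability of a complete SQL tree is the product of the per-step action probabilities, which in practice is accumulated as a sum of logarithms. The following worked example is a sketch with made-up probabilities; `sequence_log_prob` is a hypothetical helper name.

```python
import math

def sequence_log_prob(step_probs):
    # log P(P | Y) = sum over t of log p(a_t | a_<t, Y)
    return sum(math.log(p) for p in step_probs)

# Illustrative action sequence with assumed per-step probabilities.
actions = [("APPLYRULE", 0.9), ("SELECTTAB", 0.8), ("SELECTCOL", 0.5)]
log_p = sequence_log_prob(p for _, p in actions)
prob = math.exp(log_p)  # equals 0.9 * 0.8 * 0.5
```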
(5) Enhancing functions using SQL analysis
When people apply an analysis method, they usually hold an abstract, function-like understanding of that method, and this understanding helps them construct complex analysis statements. It is therefore reasonable to assume that SQL analysis functions can further help the model build analytical SQL. The encoding of the SQL function corresponding to the analysis method identified in step (1) is added to the encoder output, so that the decoder can use the analysis method function when generating the final SQL statement. The analysis method function f is encoded with a tree-based LSTM to obtain f_emd = LSTM(f). The encoder output is modified to:
Y = f_enc(G_Q, COMP).
The fourth step: automatic evaluation and human evaluation
The invention uses automatic evaluation to assess the accuracy of the generated analytical SQL statements and visualization schemes. Following the Spider evaluation method as applied to the NL2CompSQL data set, the invention evaluates three aspects: component matching, exact matching and execution accuracy.
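Of the three automatic metrics, execution accuracy is the simplest to illustrate: a predicted query counts as correct when it returns the same rows as the reference query. The sketch below uses `sqlite3` and compares results as unordered multisets; both the helper names and that comparison rule are assumptions, and component matching and exact matching are not covered.

```python
import sqlite3
from collections import Counter

def execution_match(conn, predicted_sql, gold_sql):
    # A prediction that fails to execute counts as wrong.
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets of rows, ignoring row order (an assumption).
    return Counter(predicted_rows) == Counter(gold_rows)

def execution_accuracy(conn, pairs):
    # Fraction of (predicted, gold) pairs whose results agree.
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)

# Toy database and query pairs used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
pairs = [
    ("SELECT x FROM t WHERE x > 1", "SELECT x FROM t WHERE x >= 2"),
    ("SELECT x FROM t", "SELECT x FROM t WHERE x > 1"),
    ("SELEC x", "SELECT x FROM t"),
]
accuracy = execution_accuracy(conn, pairs)
```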
For human evaluation, the invention evaluates from two angles, horizontal and vertical. Horizontal evaluation is similar to automatic evaluation: the completeness and accuracy of the data answers produced by different artificial intelligence algorithms are compared, except that the answers are scored by humans. Vertical evaluation assesses the effect of the system on business intelligence analysis, and is designed as follows: a set of data sets and analysis targets is fixed in advance, and 20 business intelligence analysts are invited and divided equally into two groups, one group using a common BI question-answering tool (e.g., MS Power BI) to build visualizations and gain insight, and the other group using the present system. The completion time, satisfaction, user experience and similar measures of the two groups are recorded.
The business-oriented intelligent visual multi-turn dialogue system can be realized through business intelligent analysis and visualization, namely, a user can carry out multi-turn dialogue with the system to inquire data problems or knowledge problems, and the system returns visual answers to the data problems; for knowledge questions, the system returns textual answers directly.
In this embodiment, answers can be generated automatically by the algorithm. The dialogue system covers the following: (1) a survey and summary of business intelligence analysis methods, i.e. summarizing the analysis methods and habits of business intelligence analysts and designing a formal representation of the analysis methods; (2) automatic visualization generation, i.e. selecting a suitable visual encoding strategy, recommending query statements, and automatically generating the visualization result; (3) multi-turn dialogue, which resolves the coreference and ellipsis that frequently occur in conversation and supports further modification of, or additional constraints on, the question; (4) a unified multi-task framework for dialogue question answering, since a business intelligence system covers many kinds of data and users may have different needs, mainly data questions and knowledge questions; (5) generation of a business intelligence visualization dialogue data set: because artificial intelligence algorithms are used, sufficient data must be collected to ensure training quality, but manpower and material resources are limited, so a labor-saving way of generating the data set must be designed; (6) evaluation of how much the system helps BI analysts, for which appropriate evaluation metrics are designed and evaluation is carried out from both the automatic and the human perspective.
While the preferred embodiments and principles of this invention have been described in detail, it will be apparent to those skilled in the art that variations may be made in the embodiments based on the teachings of the invention and such variations are considered to be within the scope of the invention.

Claims (6)

1. A dialogue type question-answering realizing method facing intelligent data visualization is characterized by comprising the following steps:
the first step: constructing a data set through database collection, construction of the analysis-method SQL function set, question creation with SQL and visualization-scheme labeling, SQL statement review, question text review, and overall review;
the second step: formulating the specific problem mathematically on the basis of the data set;
the third step: establishing a model framework for converting text into analytical SQL and for extracting the visualization scheme from text;
the fourth step: establishing an evaluation scheme comprising automatic evaluation and human evaluation.
2. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 1, wherein the first step is as follows:
the method is expanded on the basis of a Spider data set, and comprises 200 databases, wherein each database contains 5.1 tables on average;
a descriptive analysis and reasoning statistics collection common analysis method is adopted, and a mechanism is constructed to enable a user to expand according to needs; after the analysis method is determined, a default visualization scheme is formulated, and finally, an SQL function set of the analysis method is formed;
once the SQL function set of the analysis methods is available, generation of the text and of the corresponding SQL and visualization can begin: first, 20-30 SQL statements are generated for each data set, following these rules: 1) covering 50% of the analysis methods; 2) involving every table of the database; 3) generating the SQL automatically from the table attributes; 4) since the same result can be expressed by different SQL statements, an SQL convention is specified, and generated SQL must conform to it.
3. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 2, wherein the second step is as follows:
given a natural language question Q, a relational database object set S = <C, T>, and an analysis method function set COMP, the aim is to generate the corresponding SQL query statement P and the corresponding visualization result VIS; the question consists of a word sequence Q = q_1, q_2, ..., q_|Q|; the database object set S consists of the column names C = {c_1, c_2, ..., c_|C|} and the table names T = {t_1, t_2, ..., t_|T|}; the analysis method function set consists of functions written in SQL, COMP = {comp_1, comp_2, ..., comp_|COMP|}; each column name c_i contains the words
Figure FDA0003019718130000021
each table name t_i contains the words
Figure FDA0003019718130000022
the SQL query statement P is represented by an abstract syntax tree (AST) T; the visualization result VIS is a triple <name, color, axis>, where name is the visualization name, color is the color set, and axis is the axis position, represented by three digits: the first digit (0 or 1) indicates whether the first dimension is mapped to the horizontal axis, the second digit (0 or 1) indicates whether the horizontal axis is at the top or the bottom, and the third digit (0 or 1) indicates whether the vertical axis is on the left or the right; when no axis information is specified in the natural language, the default is 000;
some columns in the schema are primary keys that uniquely index the corresponding table, and some are foreign keys that reference primary key columns in other tables; furthermore, each column has a type τ ∈ {number, text}; the database object set is therefore formally represented as a directed graph G = <V, E>, whose nodes V = C ∪ T are the column names and table names of the object set, each labeled with the words in its name;
the directed graph G encodes the database object set; a new question-context directed graph of the database object set is defined as G_Q = <V_Q, E_Q>, where V_Q = V ∪ Q = C ∪ T ∪ Q,
Figure FDA0003019718130000023
Figure FDA0003019718130000024
where the edges obtained by matching the question against the contents of the database object set are acquired as described in the object set linking;
after the above definitions, the problem is divided into two subtasks: (1) converting text into analytical SQL; (2) converting text into a visualization scheme VIS; for the task of converting text into analytical SQL, the basic structure is an encoder-decoder structure: after an analysis method comp is selected, comp and the directed graph G_Q are encoded by an encoder f_enc into c_i, t_i, q_i and comp, where c_i is the encoding of a column, t_i is the encoding of a table name, and q_i is the encoding of a word in the question; the decoder f_dec takes these as input and computes P(P | G_Q, COMP).
4. The intelligent data visualization-oriented conversational question-answering implementation method of claim 3, wherein in the third step, the text conversion into analytical SQL specifically comprises:
the idea of the self-attention mechanism is that each element can be expressed by its relationship with other elements, i.e. the relationship information is encoded into the element by:
Figure FDA0003019718130000031
Figure FDA0003019718130000032
Figure FDA0003019718130000033
Figure FDA0003019718130000034
yi=SelfAttn(xi,X);
Figure FDA0003019718130000035
softmax refers to normalized exponential function;
LayerNorm is a layer normalization function;
ReLU: the rectified linear unit;
SelfAttn: the self-attention function;
Concat: a function that concatenates multiple vectors or arrays;
LSTM: a long short-term memory network;
however, this self-attention mechanism only calculates a single-level relationship of some two elements, and in practical cases, an element may have a more direct relationship with a combination of some elements; the algorithm one is as follows:
Figure FDA0003019718130000036
first, the GloVe embedding method is used to encode the column names c_i and table names t_i in the directed graph G, obtaining
Figure FDA0003019718130000037
and
Figure FDA0003019718130000038
a bidirectional LSTM is then run to obtain
Figure FDA0003019718130000039
and
Figure FDA00030197181300000310
for the encoding of the question Q, a bidirectional LSTM is used to output the encoding of each word,
Figure FDA00030197181300000311
which includes the recognized analysis-method vocabulary, denoted comp_init;
algorithm two is as follows, in which only the analysis-method vocabulary is excluded during calculation:
Figure FDA0003019718130000041
the database relation set link can help tables, columns and value references in the natural language problem align with the database relation set; alignment is largely divided into two parts: a name link and a value link;
name linking matches column or table names with natural language vocabulary; matching is divided into full matching and partial matching; specifically, the n-grams of length 1 to 5 in the natural language question are first computed, and it is then determined whether each n-gram fully matches a column name or table name or is a subsequence of a column name or table name, thereby obtaining 4 relationships: TEM, TPM, CEM and CPM;
the values in the database are a good source of background knowledge, so the question is matched against the values in the database, and this relationship is defined as CELLMATCH; value matching requires querying the database, so SQL clauses such as SELECT and LIKE are used to construct the query statements;
to capture this intuition in the model, relation-aware attention is used as a pointer mechanism between each element of y and all columns/tables to compute the column and table alignment matrices
Figure FDA0003019718130000042
Figure FDA0003019718130000043
Figure FDA0003019718130000044
Figure FDA0003019718130000045
the decoder, based on the abstract syntax tree, generates the result by depth-first traversal; at each step an LSTM outputs an action, where one kind of action, APPLYRULE, expands the most recently generated node into a grammar rule, and the other kind, SELECTCOL or SELECTTAB, selects a column or table from the object set; the process by which the decoder generates SQL is expressed as P(P | Y) = ∏_t p(a_t | a_{<t}, Y), where Y = f_enc(G_Q) is the final output of the encoder and a_{<t} denotes all actions before step t;
the analysis method function f is encoded with a tree-based LSTM to obtain f_emd = LSTM(f), and the encoder output is modified to:
Y = f_enc(G_Q, COMP).
5. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 3 or 4, wherein in the third step, the visualization scheme is extracted from the text as follows: the visualization scheme is simplified, first by handling only the visualization of two-dimensional data, and second by restricting the visualization control dimensions to the visualization type name, the color, and the axis position axis;
the problem is simplified to P((name, color, axis) | Q, P); the resulting SQL is taken as a condition because the question may contain no description of the visualization, in which case the default visualization scheme of the analysis method is used; the question Q is encoded with a bidirectional LSTM, and at the last step the discrete probability distributions p(name), p(color) and p(axis) of each dimension of the visualization scheme are output; when the maximum of a probability distribution is below a certain threshold, the default scheme is selected; when the threshold is exceeded, the value corresponding to the maximum is taken.
6. The intelligent data visualization-oriented conversational question answering implementation method as claimed in claim 5, wherein the fourth step is as follows:
evaluating the accuracy of the generated analytical SQL statements and the visualization scheme by automatic evaluation;
for human evaluation, evaluation is carried out from two angles, horizontal and vertical; horizontal evaluation: the completeness and accuracy of the data answers produced by different artificial intelligence algorithms are compared, with the answers scored manually; vertical evaluation: a set of data sets and analysis targets is fixed in advance, a number of business intelligence analysts are invited and divided into two groups, one group using a common business intelligence question-answering tool and the other group using the present method; the data of the two groups are recorded.
CN202110399195.1A 2021-04-14 2021-04-14 Intelligent data visualization oriented conversational question-answering implementation method Active CN113111158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110399195.1A CN113111158B (en) 2021-04-14 2021-04-14 Intelligent data visualization oriented conversational question-answering implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110399195.1A CN113111158B (en) 2021-04-14 2021-04-14 Intelligent data visualization oriented conversational question-answering implementation method

Publications (2)

Publication Number Publication Date
CN113111158A true CN113111158A (en) 2021-07-13
CN113111158B CN113111158B (en) 2022-05-10

Family

ID=76716746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110399195.1A Active CN113111158B (en) 2021-04-14 2021-04-14 Intelligent data visualization oriented conversational question-answering implementation method

Country Status (1)

Country Link
CN (1) CN113111158B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061809A1 (en) * 2015-01-30 2017-03-02 Xerox Corporation Method and system for importing hard copy assessments into an automatic educational system assessment
CN110209787A (en) * 2019-05-29 2019-09-06 袁琦 A kind of intelligent answer method and system based on pet knowledge mapping
CN111400469A (en) * 2020-03-12 2020-07-10 法雨科技(北京)有限责任公司 Intelligent generation system and method for voice question answering
CN111542813A (en) * 2017-10-09 2020-08-14 塔谱软件有限责任公司 Using object models of heterogeneous data to facilitate building data visualizations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曾帅等: "面向知识自动化的自动问答研究进展", 《自动化学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782122A (en) * 2022-03-15 2022-07-22 福建亿力电力科技有限责任公司 Automatic analysis method and system for bidder information in bidding material
CN116821168A (en) * 2023-08-24 2023-09-29 吉奥时空信息技术股份有限公司 Improved NL2SQL method based on large language model
CN116821168B (en) * 2023-08-24 2024-01-23 吉奥时空信息技术股份有限公司 Improved NL2SQL method based on large language model

Also Published As

Publication number Publication date
CN113111158B (en) 2022-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220824

Address after: Room 405, 6-8 Jiaogong Road, Xihu District, Hangzhou City, Zhejiang Province, 310013

Patentee after: Hangzhou Taoyi Data Technology Co.,Ltd.

Address before: 310018 no.1158, No.2 street, Baiyang street, Hangzhou Economic and Technological Development Zone, Zhejiang Province

Patentee before: HANGZHOU DIANZI University