CN113868493A - Visual chart recommendation method - Google Patents
Visual chart recommendation method
- Publication number
- CN113868493A CN113868493A CN202111065907.2A CN202111065907A CN113868493A CN 113868493 A CN113868493 A CN 113868493A CN 202111065907 A CN202111065907 A CN 202111065907A CN 113868493 A CN113868493 A CN 113868493A
- Authority
- CN
- China
- Prior art keywords
- visualization
- chart
- meaningful
- data
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
(All under G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING.)
- G06F16/904 — Browsing; Visualisation therefor
- G06F16/9035 — Querying; Filtering based on additional data, e.g. user or group profiles
- G06F16/906 — Clustering; Classification
- G06F18/254 — Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F18/24155 — Bayesian classification
- G06F18/24323 — Tree-organised classifiers
Abstract
The invention relates to the field of chart visualization and particularly provides a chart visualization recommendation method. Compared with the prior art, the method learns the most meaningful visualization results from numerous real-world visualization data sets, labels them and builds an index, and then finds meaningful visualization types by searching that index. This avoids the problems that visualization type operations are complex and numerous and that the large enumeration search space causes redundant visualization results. The method can also be integrated into data analysis software as a chart visualization recommendation engine, improving the usability of that software.
Description
Technical Field
The invention relates to the field of chart visualization, and particularly provides a chart visualization recommendation method.
Background
Data visualization is used by an increasing number of people as an important means of data analysis. However, data visualization presents real difficulties for most users, who are not specialists in visualization techniques. The goal of visualization recommendation is therefore to automatically generate candidate results, by technical means, for analysts to explore and select, lowering the barrier to visualization.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a chart visualization recommendation method with strong practicability.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A chart visualization recommendation method first extracts a number of data features and the corresponding meaningful visualization chart types from a real visualization data set; then trains a classification model with each of several classifiers, learning meaningful visualizations from the models and testing accuracy on a test set; and finally fuses the results of the classifiers to select the meaningful charts suited to the data set.
Furthermore, the classifiers are a decision tree, a support vector machine and naive Bayes. These three classifiers construct classification models from data gathered in visualization practice and divide visualization results into meaningful and meaningless. During a new visualization exploration, meaningless results are discarded, meaningful visualization results are retained, and the results are recommended to the user.
Further, the ID3 algorithm in the decision tree uses information gain as the attribute selection metric. Entropy measures the uncertainty of things: the more uncertain something is, the larger its entropy. The expected information needed to classify the tuples in D is given by the formula:

Info(D) = -Σ_{i=1}^{m} p_i · log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class C_i, and Info(D) is the average amount of information needed to identify the class label of a tuple in D. Suppose the tuples in D are partitioned on some attribute A, where A has v distinct values {a_1, a_2, …, a_v} corresponding to v outcomes on A; D is then divided by attribute A into v partitions or subsets {D_1, D_2, …, D_v}, where D_j contains those tuples in D whose value of A is a_j.

Since the partitions may contain tuples from different classes, more information is still needed for an exact classification:

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) · Info(D_j)

where |D_j| / |D| serves as the weight of the j-th partition. Info_A(D) is the expected information required to classify the tuples of D after partitioning on A; the smaller the required expected information, the higher the purity of the partitions.

The information gain is defined as the difference between the original information requirement and the new information requirement, i.e.

Gain(A) = Info(D) - Info_A(D)

The attribute A with the highest Gain(A) is selected as the splitting attribute for node N.
Further, SVM classification is a machine learning method built on statistical learning theory and is applicable to both linearly separable and linearly inseparable samples. Given samples (x_i, y_i), with x_i ∈ R^d, y_i ∈ {-1, +1}, i = 1, 2, …, n, where x_i is a feature vector and y_i is a class label: if the samples are linearly separable, the problem is converted into a convex quadratic optimization problem, giving the formula:

min (1/2)·||ω||² + C·Σ_{i=1}^{n} ξ_i, subject to y_i[(ω·x_i) + b] ≥ 1 - ξ_i, ξ_i ≥ 0

where ω is the weight vector, C is a penalty factor, ξ_i is a slack factor and b is an offset. A dual description of the optimization problem is obtained with Lagrange multipliers; under the constraint y_i[(ω·x_i) + b] ≥ 1, the classification decision function is obtained as:

f(x) = sgn( Σ_{i=1}^{n} α_i · y_i · (x_i · x) + b )

If the samples are linearly inseparable, the samples in the input space can be mapped by a nonlinear mapping into a high-dimensional, linearly separable feature space; using a kernel function K, the optimal classification decision function in the feature space is obtained as:

f(x) = sgn( Σ_{i=1}^{n} α_i · y_i · K(x_i, x) + b )
Further, the Bayesian classifier is a probabilistic classifier. When data are classified on several features, the features are assumed to be mutually independent; each class probability is then obtained by the multiplication rule for conditional probabilities, and the class with the largest probability is selected as the machine's judgment. Given a training data set {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, where m is the number of samples and each sample contains n features, i.e. x_i = (x_i1, x_i2, …, x_in), and the set of class labels is {y_1, y_2, …, y_k}, judging the class of a new sample amounts to solving for the maximum posterior probability argmax_y p(y | x):

p(y | x) = p(y) · Π_{j=1}^{n} p(x_j | y) / p(x)

Since the denominator of the formula is the same for every p(y = y_i | x), the final discrimination formula is:

y* = argmax_y { p(y) · Π_{j=1}^{n} p(x_j | y) }
further, when the data set is collected, the data with the best visualization experiment effect usually contains 10 attribute columns by using a BI analysis tool, and the graph with the better visualization effect is marked and displayed.
Further, when the data features are extracted from the training data set, the column attribute features are described as follows:
length is the number of data rows, and type is the data type, divided into categorical (C), quantitative (Q) and temporal (T);
for categorical data, the number of distinct values (count0), their ratio, entropy and Gini index (gini) are counted;
for numerical data, a number of statistical features are calculated, including the maximum, minimum, mean, median, mode, variance, standard deviation, median absolute deviation and distribution;
many of the paired-column features depend on the single-column types determined by single-column feature extraction.
Further, when determining meaningful visualization chart types, the bar, pie, line and scatter charts are labeled 0-3 respectively, and the four visualization chart types are enumerated for single columns and column pairs together with the corresponding meaningful visualization results. For a data set with m column attributes there are 4m single-column chart display types and 2m(m-1) possible chart display types between pairs of columns.
Furthermore, when the accuracy is tested with the test set, each data set is labeled according to visualization practice in the BI analysis software; all possible visualization results are enumerated, and labels are attached to all of them with the help of the BI software, finally yielding the meaningful visualization results;
the output of the three classifiers is a judgment on every possible visualization result, 1 denoting a "meaningful" visualization result and 0 a "meaningless" one; the indexes of the meaningful visualization results in the data set are marked, and the type of a meaningful visual representation chart can be located through the index.
Further, to improve the accuracy of finding meaningful charts suited to a data set, an ensemble learning method is adopted: a model combining the three simple classifiers is trained, and the class receiving the largest number of votes is marked as the output result by relative majority voting; if several class labels tie for the highest number of votes, one of them is selected at random as the output.
Compared with the prior art, the chart visualization recommendation method has the following outstanding beneficial effects:
The method learns the most meaningful visualization results from numerous real-world visualization data sets, marks them and builds an index; meaningful visualization types are found by searching the index, avoiding the problems that visualization type operations are complex and numerous and that the large enumeration search space leads to redundant visualization results.
The adopted classifiers effectively learn the meaningful visualization results and achieve good accuracy on the test set.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of an ensemble learning model in a chart visualization recommendation method.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A preferred embodiment is given below:
according to the chart visualization recommendation method, firstly, a plurality of data features and corresponding meaningful visualization chart types are extracted from a real visualization data set, then classifiers are used for training classification models respectively, meaningful visualization is learned from the classification models, a test set is used for accuracy test, and finally, a plurality of classifier results are fused to select the meaningful charts suitable for the data set.
The method comprises the following specific steps:
s1, when the data set is collected:
With the help of the BI analysis tools used in daily work, a large number of data visualization results have been accumulated through long-term visualization practice. The numbers of columns and rows of these data sets vary greatly: although some data sets contain hundreds of attribute columns, most have fewer than 25 columns, and the data sets with the best visualization results usually contain about 10 attribute columns; the charts with better visualization effects are marked and displayed. These data sets typically contain time-series attributes, categorical attributes and numerical attributes. Details of part of the data sets are shown in Table (1):
Table (1)
S2, when data features are extracted:
For the training data set, the descriptions of the column attribute features are listed in Table (2); the features fall into 6 classes:
length is the number of data rows.
type is the data type, divided into categorical (C), quantitative (Q) and temporal (T).
For categorical data, the number of distinct values (count0), their ratio, entropy and Gini index (gini) are counted.
For numerical data, a number of statistical features are calculated, including the maximum, minimum, mean, median, mode, variance, standard deviation, median absolute deviation and distribution.
Many of the paired-column features depend on the single-column types determined by single-column feature extraction. For example, the Pearson correlation coefficient requires two numerical columns, while the χ² test requires two categorical columns.
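As an illustration, the single-column statistics above for a categorical column can be sketched as follows (a minimal sketch only; the function name and dictionary keys are assumptions, since the patent does not publish its code):

```python
import math
from collections import Counter

def column_features(values):
    """Compute the single-column statistics described above for a
    categorical column: row count, distinct count (count0), ratio,
    entropy and Gini index."""
    n = len(values)
    counts = Counter(values)
    probs = [c / n for c in counts.values()]
    return {
        "length": n,                                    # number of data rows
        "count0": len(counts),                          # number of distinct values
        "ratio": len(counts) / n,                       # distinct values / rows
        "entropy": -sum(p * math.log2(p) for p in probs),
        "gini": 1 - sum(p * p for p in probs),
    }

feats = column_features(["a", "a", "b", "b"])
# two equally likely values: entropy = 1 bit, gini = 0.5
```

A numerical column would instead get the max/min/mean/median/mode/variance family of statistics listed above.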
S3, when determining the type of the meaningful visual chart:
In daily visualization practice, more than 85% of visualization results can be represented by a bar chart (bar), pie chart (pie), line chart (line) or scatter chart (scatter), so only the recommendation of these four visualization charts is considered here. The bar, pie, line and scatter charts are labeled 0-3 respectively, and the four visualization chart types are enumerated for single columns and column pairs together with the corresponding "meaningful" visualization results. For a data set with m column attributes there are 4m single-column chart display types and 2m(m-1) possible chart display types between pairs of columns.
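The candidate enumeration above can be sketched as follows (illustrative only; the column names and function name are assumptions for the example):

```python
from itertools import combinations

CHARTS = {0: "bar", 1: "pie", 2: "line", 3: "scatter"}  # labels 0-3 as in the text

def enumerate_candidates(columns):
    """Enumerate every single-column and two-column chart candidate.
    For m columns this yields 4m single-column candidates and
    4 * m(m-1)/2 = 2m(m-1) two-column candidates, matching the counts above."""
    single = [(c, chart) for c in columns for chart in CHARTS.values()]
    paired = [(a, b, chart)
              for a, b in combinations(columns, 2)   # m(m-1)/2 column pairs
              for chart in CHARTS.values()]          # 4 chart types per pair
    return single, paired

single, paired = enumerate_candidates(["year", "region", "sales"])
# m = 3: 4*3 = 12 single-column and 2*3*2 = 12 two-column candidates
```

It is this full candidate set that the classifiers later label as meaningful or meaningless.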
S4, when the accuracy test is carried out by using the test set:
Data sets from the fields of communications, automobiles, chemicals, transportation, sales and others were selected; each data set was labeled according to visualization practice in BI analysis software, all possible visualization results were enumerated, and all 21453 visualization results were labeled with the help of the BI software. Finally, 680 meaningful visualization results were obtained. The 21453 collected samples were then used for model training with a decision tree (DT), a support vector machine (SVM) and a Bayes classifier, and the 6 test sets (see Table (1)) were input to obtain the test accuracies shown in Table (3). The DT method has the highest accuracy on the 6 test data sets, with an average accuracy of 0.8609; Bayes has the lowest, with an average accuracy of 0.7223. The output of the three classifiers is a judgment on every possible visualization result, 1 denoting a "meaningful" visualization result and 0 a "meaningless" one; the indexes of the meaningful visualization results in the data set are marked, and the type of a meaningful visual representation chart can be located through the index.
Table (3)
The classifiers are a decision tree, a support vector machine and naive Bayes, which construct classification models from data gathered in visualization practice and divide visualization results into meaningful and meaningless; during a new visualization exploration, meaningless results are discarded, meaningful visualization results are retained, and the results are recommended to the user.
The ID3 algorithm in the decision tree uses information gain as the attribute selection metric. The method is based on the concept of entropy in information theory: entropy measures the uncertainty of things, and the more uncertain something is, the larger its entropy. The expected information needed to classify the tuples in D is given by the formula:

Info(D) = -Σ_{i=1}^{m} p_i · log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class C_i, and Info(D) is the average amount of information needed to identify the class label of a tuple in D. Suppose the tuples in D are partitioned on some attribute A, where A has v distinct values {a_1, a_2, …, a_v} corresponding to v outcomes on A; D is then divided by attribute A into v partitions or subsets {D_1, D_2, …, D_v}, where D_j contains those tuples in D whose value of A is a_j.

Since the partitions may contain tuples from different classes, more information is still needed for an exact classification:

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) · Info(D_j)

where |D_j| / |D| serves as the weight of the j-th partition. Info_A(D) is the expected information required to classify the tuples of D after partitioning on A; the smaller the required expected information, the higher the purity of the partitions.

The information gain is defined as the difference between the original information requirement and the new information requirement, i.e.

Gain(A) = Info(D) - Info_A(D)

The attribute A with the highest Gain(A) is selected as the splitting attribute for node N.
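The information-gain computation Gain(A) = Info(D) - Info_A(D) can be sketched as follows (an illustrative sketch, not the patent's implementation; the toy weather data is invented for the example):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D): total entropy minus the
    partition-weighted entropy after splitting on attribute A."""
    total = entropy(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    weighted = sum(len(part) / len(labels) * entropy(part)   # |D_j|/|D| * Info(D_j)
                   for part in partitions.values())
    return total - weighted

# toy data: the attribute perfectly separates the classes, so Gain(A) = Info(D) = 1 bit
rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = [1, 1, 0, 0]
gain = info_gain(rows, labels, 0)
```

ID3 would evaluate this gain for every candidate attribute and split node N on the maximizer.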
SVM classification is a machine learning method built on statistical learning theory and is applicable to both linearly separable and linearly inseparable samples. Given samples (x_i, y_i), with x_i ∈ R^d, y_i ∈ {-1, +1}, i = 1, 2, …, n, where x_i is a feature vector and y_i is a class label: if the samples are linearly separable, the problem is converted into a convex quadratic optimization problem, giving the formula:

min (1/2)·||ω||² + C·Σ_{i=1}^{n} ξ_i, subject to y_i[(ω·x_i) + b] ≥ 1 - ξ_i, ξ_i ≥ 0

where ω is the weight vector, C is a penalty factor, ξ_i is a slack factor and b is an offset. A dual description of the optimization problem is obtained with Lagrange multipliers; under the constraint y_i[(ω·x_i) + b] ≥ 1, the classification decision function is obtained as:

f(x) = sgn( Σ_{i=1}^{n} α_i · y_i · (x_i · x) + b )

If the samples are linearly inseparable, the samples in the input space can be mapped by a nonlinear mapping into a high-dimensional, linearly separable feature space; using a kernel function K, the optimal classification decision function in the feature space is obtained as:

f(x) = sgn( Σ_{i=1}^{n} α_i · y_i · K(x_i, x) + b )
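The kernelised decision function above can be sketched as follows. Note this only evaluates f(x) for given multipliers α_i; solving the dual quadratic programme that produces them is not shown, and the toy support vectors are invented for the example:

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decide(x, support, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b ),
    the kernelised classification decision function from the text."""
    s = sum(a * y * kernel(sv, x) for sv, a, y in zip(support, alphas, labels))
    return 1 if s + b >= 0 else -1

# toy model: one positive support vector at (1, 1), one negative at (-1, -1)
support = [(1.0, 1.0), (-1.0, -1.0)]
alphas, labels, b = [1.0, 1.0], [1, -1], 0.0
# points near (1, 1) get +1, points near (-1, -1) get -1
```

Replacing `rbf_kernel` with a plain dot product recovers the linear decision function from the separable case.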
The Bayes classifier is a probabilistic classifier. When data are classified on several features, the features are assumed to be mutually independent; each class probability is then obtained by the multiplication rule for conditional probabilities, and the class with the largest probability is selected as the machine's judgment. Given a training data set {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, where m is the number of samples and each sample contains n features, i.e. x_i = (x_i1, x_i2, …, x_in), and the set of class labels is {y_1, y_2, …, y_k}, judging the class of a new sample amounts to solving for the maximum posterior probability argmax_y p(y | x):

p(y | x) = p(y) · Π_{j=1}^{n} p(x_j | y) / p(x)

Since the denominator of the formula is the same for every p(y = y_i | x), the final discrimination formula is:

y* = argmax_y { p(y) · Π_{j=1}^{n} p(x_j | y) }
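The discrimination formula above can be sketched for categorical features as follows (a minimal sketch with maximum-likelihood estimates and no smoothing; the fruit data is invented for the example):

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Estimate the prior p(y) and the conditional counts for p(x_j | y)
    from categorical training data."""
    n = len(labels)
    prior = {y: c / n for y, c in Counter(labels).items()}
    cond = defaultdict(Counter)
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            cond[(y, j)][v] += 1
    return prior, cond, Counter(labels)

def predict_nb(x, prior, cond, class_counts):
    """Return argmax_y p(y) * prod_j p(x_j | y); the common denominator
    p(x) is dropped, as in the final discrimination formula."""
    best, best_score = None, -1.0
    for y, p in prior.items():
        score = p
        for j, v in enumerate(x):
            score *= cond[(y, j)][v] / class_counts[y]
        if score > best_score:
            best, best_score = y, score
    return best

samples = [("red", "round"), ("red", "round"), ("green", "long")]
labels = ["apple", "apple", "banana"]
prior, cond, counts = train_nb(samples, labels)
```

A production classifier would add Laplace smoothing and work in log-space to avoid zero probabilities and underflow.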
s5, improving accuracy by adopting the idea of ensemble learning:
As shown in FIG. 1, to further improve the accuracy, an ensemble learning method is adopted: a model combining the three simple classifiers is trained, and the class receiving the largest number of votes is labeled as the output result by relative majority (plurality) voting; if several class labels tie for the highest number of votes, one class label is selected at random as the output.
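The relative majority (plurality) vote with random tie-breaking can be sketched as follows (an illustrative sketch of the fusion rule described above, not the patent's code):

```python
import random
from collections import Counter

def plurality_vote(predictions, rng=random):
    """Fuse the outputs of several classifiers by relative majority
    (plurality) voting; exact ties are broken at random."""
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# three classifiers judge one candidate chart: 1 = meaningful, 0 = meaningless
fused = plurality_vote([1, 1, 0])   # two of three say meaningful -> 1
```

With three binary base classifiers a tie cannot occur, but the random tie-break matters when more classifiers or more class labels are fused.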
The models and their accuracies are shown in Table (4); as the table shows, the classification accuracy of the ensemble learning model is slightly higher than that of the DT model, the most accurate individual classifier.
Table (4)
The above embodiment is only one specific embodiment of the present invention; the scope of the present invention includes but is not limited to this embodiment, and any suitable changes or substitutions made in accordance with the claims of this chart visualization recommendation method by any person of ordinary skill in the art shall fall within the scope of protection of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A chart visualization recommendation method, characterized in that a number of data features and the corresponding meaningful visualization chart types are first extracted from a real visualization data set; classification models are then trained with several classifiers, meaningful visualizations are learned from the models, and accuracy is tested on a test set; finally, the results of the classifiers are fused to select the meaningful charts suited to the data set.
2. A chart visualization recommendation method according to claim 1, wherein the classifiers are a decision tree, a support vector machine and naive Bayes; the three classifiers construct classification models from data gathered in visualization practice and divide visualization results into meaningful and meaningless; during a new visualization exploration, meaningless results are discarded, meaningful visualization results are retained, and the results are recommended to the user.
3. A chart visualization recommendation method according to claim 2, characterized in that the ID3 algorithm in the decision tree uses information gain as the attribute selection metric; entropy measures the uncertainty of things, and the more uncertain something is, the larger its entropy; the expected information needed to classify the tuples in D is given by the formula:

Info(D) = -Σ_{i=1}^{m} p_i · log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class C_i, and Info(D) is the average amount of information needed to identify the class label of a tuple in D; the tuples in D are partitioned on some attribute A, where A has v distinct values {a_1, a_2, …, a_v} corresponding to v outcomes on A, so that D is divided by attribute A into v partitions or subsets {D_1, D_2, …, D_v}, where D_j contains those tuples in D whose value of A is a_j;

since the partitions may contain tuples from different classes, more information is still needed for an exact classification:

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) · Info(D_j)

where |D_j| / |D| serves as the weight of the j-th partition, and Info_A(D) is the expected information required to classify the tuples of D after partitioning on A; the smaller the required expected information, the higher the purity of the partitions;

the information gain is defined as the difference between the original information requirement and the new information requirement, i.e.

Gain(A) = Info(D) - Info_A(D)

The attribute A with the highest Gain(A) is selected as the splitting attribute for node N.
4. A chart visualization recommendation method according to claim 3, wherein the SVM classification is a machine learning method built on statistical learning theory, applicable to both linearly separable and linearly inseparable samples; given samples (x_i, y_i), with x_i ∈ R^d, y_i ∈ {-1, +1}, i = 1, 2, …, n, where x_i is a feature vector and y_i is a class label, if the samples are linearly separable the problem is converted into a convex quadratic optimization problem, giving the formula:

min (1/2)·||ω||² + C·Σ_{i=1}^{n} ξ_i, subject to y_i[(ω·x_i) + b] ≥ 1 - ξ_i, ξ_i ≥ 0

where ω is the weight vector, C is a penalty factor, ξ_i is a slack factor and b is an offset; a dual description of the optimization problem is obtained with Lagrange multipliers, and under the constraint y_i[(ω·x_i) + b] ≥ 1 the classification decision function is obtained as:

f(x) = sgn( Σ_{i=1}^{n} α_i · y_i · (x_i · x) + b )

if the samples are linearly inseparable, the samples in the input space can be mapped by a nonlinear mapping into a high-dimensional, linearly separable feature space, and using a kernel function K the optimal classification decision function in the feature space is obtained as:

f(x) = sgn( Σ_{i=1}^{n} α_i · y_i · K(x_i, x) + b )
5. A chart visualization recommendation method according to claim 4, wherein the Bayesian classifier is a probabilistic classifier; when data are classified on several features, the features are assumed to be mutually independent, each class probability is obtained by the multiplication rule for conditional probabilities, and the class with the largest probability is selected as the machine's judgment; given a training data set {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, where m is the number of samples and each sample contains n features, i.e. x_i = (x_i1, x_i2, …, x_in), and the set of class labels is {y_1, y_2, …, y_k}, judging the class of a new sample amounts to solving for the maximum posterior probability argmax_y p(y | x):

p(y | x) = p(y) · Π_{j=1}^{n} p(x_j | y) / p(x)

since the denominator of the formula is the same for every p(y = y_i | x), the final discrimination formula is:

y* = argmax_y { p(y) · Π_{j=1}^{n} p(x_j | y) }
6. A chart visualization recommendation method according to claim 5, wherein, when the data sets are collected with a BI analysis tool, the data sets with the best visualization results usually contain about 10 attribute columns, and the charts with better visualization effects are marked and displayed.
7. A chart visualization recommendation method according to claim 6, wherein, when the data features are extracted from the training data set, the column attribute features are described as follows:
length is the number of data rows, and type is the data type, divided into categorical (C), quantitative (Q) and temporal (T);
for categorical data, the number of distinct values (count0), their ratio, entropy and Gini index (gini) are counted;
for numerical data, a number of statistical features are calculated, including the maximum, minimum, mean, median, mode, variance, standard deviation, median absolute deviation and distribution;
many of the paired-column features depend on the single-column types determined by single-column feature extraction.
8. A chart visualization recommendation method according to claim 7, wherein, in determining the meaningful visualization chart types, the bar, pie, line and scatter charts are labeled 0-3 respectively, and the four visualization chart types are enumerated for single columns and column pairs together with the corresponding meaningful visualization results; for a data set with m column attributes there are 4m single-column chart display types and 2m(m-1) possible chart display types between pairs of columns.
9. The chart visualization recommendation method according to claim 8, wherein, when the accuracy test is performed with the test set, each data set is labeled according to visualization practice in the BI analysis software: all possible visualization results are enumerated, and each is labeled in combination with the BI analysis software, so that the meaningful visualization results are finally obtained;
the output of the three classifiers is then a judgment on every possible visualization result, where 1 denotes a "meaningful" and 0 a "meaningless" visualization result; the indices of the visualization results in the data set are recorded, and the meaningful chart types can be located through these indices.
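A minimal sketch of this index mapping (the function and variable names are hypothetical): the classifiers' 0/1 output vector is aligned position-by-position with the enumerated candidates, and the positions holding 1 locate the meaningful charts.

```python
def meaningful_charts(candidates, flags):
    """Given the enumerated chart candidates and the classifiers' 0/1
    judgments (1 = meaningful, 0 = meaningless), return the indices and
    candidates judged meaningful, as described in claim 9."""
    return [(i, candidates[i]) for i, flag in enumerate(flags) if flag == 1]
```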
10. A chart visualization recommendation method according to claim 9, wherein, to further improve the accuracy of finding meaningful charts for a data set, an ensemble learning method is adopted: a model combining the three simple classifiers is trained, and by relative majority voting the class receiving the most votes is labeled as the output result; if several class labels tie for the highest number of votes, one of them is selected at random as the output.
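The relative-majority voting rule of claim 10 can be sketched as follows (the function name and seeded tie-breaking generator are assumptions for reproducibility, not from the claims):

```python
import random
from collections import Counter

def relative_majority_vote(predictions, rng=None):
    """Combine base-classifier labels: the label with the most votes wins;
    ties between top-voted labels are broken by a random choice."""
    rng = rng or random.Random()
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)
```

With three base classifiers a tie can still occur when all three disagree, which is why the random tie-break branch is needed.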
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111065907.2A CN113868493A (en) | 2021-09-13 | 2021-09-13 | Visual chart recommendation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111065907.2A CN113868493A (en) | 2021-09-13 | 2021-09-13 | Visual chart recommendation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113868493A true CN113868493A (en) | 2021-12-31 |
Family
ID=78995333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111065907.2A Pending CN113868493A (en) | 2021-09-13 | 2021-09-13 | Visual chart recommendation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113868493A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117540107A (en) * | 2024-01-09 | 2024-02-09 | 浙江同花顺智能科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN117540107B (en) * | 2024-01-09 | 2024-05-07 | 浙江同花顺智能科技有限公司 | Data processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Waegeman et al. | ROC analysis in ordinal regression learning | |
CA2886581C (en) | Method and system for analysing sentiments | |
CN106570109B (en) | Method for automatically generating question bank knowledge points through text analysis | |
Govindasamy et al. | Analysis of student academic performance using clustering techniques | |
Yang et al. | Using weighted k-means to identify Chinese leading venture capital firms incorporating with centrality measures | |
CN115359873B (en) | Control method for operation quality | |
CN112347352A (en) | Course recommendation method and device and storage medium | |
CN115312183A (en) | Intelligent interpretation method and system for medical inspection report | |
Hric et al. | Stochastic block model reveals maps of citation patterns and their evolution in time | |
CN113868493A (en) | Visual chart recommendation method | |
CN114817454A (en) | NLP knowledge graph construction method combining information content and BERT-BilSTM-CRF | |
Wang et al. | SpecVAT: Enhanced visual cluster analysis | |
Nanayakkara et al. | Evaluation measure for group-based record linkage | |
Mazanec et al. | Usage patterns of advanced analytical methods in tourism research 1988–2008: A six journal survey | |
Elouataoui et al. | An End-to-End Big Data Deduplication Framework based on Online Continuous Learning | |
Suerdem | Multidimensional scaling of qualitative data | |
Papayiannis et al. | On clustering uncertain and structured data with Wasserstein barycenters and a geodesic criterion for the number of clusters | |
CN112488236B (en) | Integrated unsupervised student behavior clustering method | |
Abdelfattah | Variables Selection Procedure for the DEA Overall Efficiency Assessment Based Plithogenic Sets and Mathematical Programming | |
Cohen et al. | Factor analysis, cluster analysis and structural equation modelling | |
Dey | Prediction and analysis of student performance by data mining in WEKA | |
Ćurić et al. | Improvement of hierarchical clustering results by refinement of variable types and distance measures | |
An et al. | Multi-Attribute Classification of Text Documents as a Tool for Ranking and Categorization of Educational Innovation Projects | |
WO2024131524A1 (en) | Depression diet management method based on food image segmentation | |
Dobrska et al. | Ordinal regression with continuous pairwise preferences |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||