CN110175191B

CN110175191B - Modeling method for data filtering rule in data analysis

Info

Publication number: CN110175191B
Application number: CN201910401717.XA
Authority: CN
Inventors: 周鹏程; 荆一楠; 何震瀛; 王晓阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2023-06-27
Anticipated expiration: 2039-05-14
Also published as: CN110175191A

Abstract

The invention belongs to the technical field of data analysis, and particularly relates to a data filtering rule modeling method in data analysis. The method mainly comprises three parts: (1) data column analysis filtering; (2) data range analysis filtering; (3) automatic visualization of the result set. By setting the related rules appropriately, the invention solves the problem of how to apply data filtering rules in data analysis to establish an analysis filtering model, and uses the model to analyze, filter, and intuitively display the data. The invention helps users quickly screen data, find the data subsets of interest, and analyze and mine the relationships between data items.

Description

Modeling method for data filtering rule in data analysis

Technical Field

The invention belongs to the technical field of data analysis, and particularly relates to a data filtering rule modeling method in data analysis.

Background

In an age of ubiquitous data, users' decisions are increasingly driven by data. Differences in the results of data analysis can often significantly affect the decision-making process. Selecting improper data, whether intentionally or unintentionally, may result in erroneous, misleading, or "fragile" decisions. Especially for users with little or no data analysis experience, the results of such poor data analysis may lead to serious economic losses. Guiding users toward good data selection can therefore bring a better-quality data analysis and exploration experience.

To spare users without data analysis experience the error-prone data exploration process and the complicated setting of analysis filtering conditions as far as possible, good data analysis filtering results should be obtainable in a straightforward manner. It is therefore necessary to use a standardized process to determine how to perform filtering, analysis, and selection of the data, and how to model data filtering rules automatically according to the characteristics of the data.

Disclosure of Invention

The invention aims to provide a data filtering rule modeling method for an interactive data exploration scene, so that data on a data set can be quickly analyzed and mined, and a user can conveniently explore and analyze the data.

For recommendation rule modeling on a data set, the following characteristics are expected:

1. interpretability: recommendations should be generated in a way that fits properly within a visualization system;

2. feasibility: generated recommendations should have sufficient analytical significance to mine potential associations between data;

3. quality: given the nature of user exploration, construction of the model should be efficient and robust.

The data filtering rule modeling method provided by the invention comprises the following specific steps:

(1) Given a data set D composed of a large amount of data, the importance of a data column is calculated by adopting a random forest feature selection method according to whether key data are specified by a user or not. The specific flow is as follows:

(1.1) Let VIM denote the variable importance measure (importance score) and GI the Gini index. Assume there are m data columns $X_1, X_2, X_3, \ldots, X_m$. For each column $X_j$, the Gini index score $VIM_j^{(Gini)}$ is calculated, that is, the average change in node-splitting uncertainty attributable to the j-th column over all decision trees of the random forest (RF). The Gini index of node m (here m indexes a node) is:

$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^2$$

where K is the number of categories, $p_{mk}$ is the proportion of class k in node m, and $p_{mk'}$ is the proportion of any other class $k' \neq k$ in node m. Intuitively, $GI_m$ is the probability that two samples drawn at random from node m have different class labels.

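(1.2) The importance of data column $X_j$ at node m, i.e., the change in the Gini index before and after node m branches, is:

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$

where $GI_m$ is the Gini index of node m before branching, and $GI_l$ and $GI_r$ are, respectively, the Gini indexes of the two new nodes after branching.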
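The Gini index and the per-node importance of steps (1.1) and (1.2) can be sketched as follows. This is a minimal illustration, assuming the node importance is the unweighted difference GI_m - GI_l - GI_r between the parent node and its two children; many random forest implementations instead weight each child by its share of samples. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini index GI = 1 - sum_k p_k^2 of the class labels in one node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def node_importance(parent, left, right):
    """Gini change GI_m - GI_l - GI_r for one split (unweighted, an assumption)."""
    return gini_index(parent) - gini_index(left) - gini_index(right)


# Hypothetical node: 4 samples of class "a" and 4 of class "b", split perfectly.
parent = ["a"] * 4 + ["b"] * 4
left, right = ["a"] * 4, ["b"] * 4
print(gini_index(parent))                     # 0.5
print(node_importance(parent, left, right))   # 0.5
```

A perfectly mixed two-class node has Gini index 0.5, and a split that separates the classes completely removes all of that uncertainty.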

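(1.3) If the nodes at which data column $X_j$ appears in decision tree i form the set M, then the importance of $X_j$ in the i-th tree is:

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$

where $VIM_{jm}^{(Gini)}$ is the importance of $X_j$ at node m.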

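(1.4) If the random forest contains n trees, the importance of data column $X_j$ is:

$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$

where $VIM_{ij}^{(Gini)}$ is the importance of $X_j$ in the i-th tree.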

(1.5) The columns are ranked by the calculated importance, and the two most important columns, denoted A and B with A ranked higher than B, are returned to the user as the result of the data column analysis filtering.
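Steps (1.3) to (1.5) amount to summing the per-node importance of each column over all nodes and trees and then ranking the columns. The sketch below assumes the per-node values are already available as (column, value) pairs, one list per tree; this input format, the column names, and the numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def column_importance(forest_splits):
    """Sum per-node Gini changes over every node and tree for each column.

    forest_splits: one list per tree of (column_name, vim_jm) pairs, one pair
    per split node (a hypothetical input format for this sketch).
    """
    totals = defaultdict(float)
    for tree_splits in forest_splits:
        for column, vim_jm in tree_splits:
            totals[column] += vim_jm
    return dict(totals)


def top_two_columns(importance):
    """Return the two highest-scoring columns (A, B), with A ranked above B."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[0], ranked[1]


# Hypothetical split records from a two-tree forest.
forest_splits = [
    [("sales_date", 0.30), ("sales_price", 0.20), ("region", 0.05)],
    [("sales_date", 0.25), ("sales_price", 0.15)],
]
importance = column_importance(forest_splits)
col_a, col_b = top_two_columns(importance)
print(col_a, col_b)  # sales_date sales_price
```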

(2) Data range analysis filtering. Taking the two columns A and B as an example, the data range is analyzed and filtered as follows:

(2.1) The data types of columns A and B are first divided into three categories: numerical type N, discrete value type X, and time-series type T. For numerical type N, discretization is performed first, that is, the data are binned, each bin is denoted N′, and the count of each bin is calculated and denoted CNT(N′). For discrete value type X, the count of each discrete value is calculated and denoted CNT(X).

Because time-series data are often seasonal, the invention divides time-segment bins automatically according to the range of the time-series data in column T; the resulting time-series bins are denoted T′. For example: when the data in T range from 2017 to 2019, the bins T′ are divided by year; when the data in T fall only within 2019, the bins T′ are divided by month; and when the data in T fall only within January 2019, the bins T′ are divided by day.
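The choice of time-series bin unit in the example above can be sketched as a simple rule on the first and last dates of the column. The exact thresholds are not stated in the text, so the rule below (different years give yearly bins, different months give monthly bins, otherwise daily bins) is an assumption, and both function names are hypothetical.

```python
from datetime import date


def choose_time_unit(start, end):
    """Pick the bin unit from the span covered by a time-series column."""
    if start.year != end.year:
        return "year"
    if start.month != end.month:
        return "month"
    return "day"


def time_bin(value, unit):
    """Map a date to the label of its time-series bin."""
    if unit == "year":
        return f"{value.year:04d}"
    if unit == "month":
        return f"{value.year:04d}-{value.month:02d}"
    return value.isoformat()


print(choose_time_unit(date(2017, 3, 1), date(2019, 5, 1)))   # year
print(choose_time_unit(date(2019, 1, 5), date(2019, 11, 2)))  # month
print(choose_time_unit(date(2019, 1, 2), date(2019, 1, 30)))  # day
```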

(2.2) forming two data analysis filtering combination models according to three different data types, and performing data filtering analysis on the data set D (wherein all "/" means "or" and are not represented as division); the method comprises the following steps:

(2.2.1) A is time-series data, and B is of discrete value type or numerical type. For A, according to the unit of the time-series bins t′ obtained in step (2.1), a suitable recent period is selected as the first filtering condition $t_{recent}$ (e.g., the last three years, the last six months, or the last seven days; if the data are insufficient, this filter is not produced). The data set obtained after screening by the condition on column A is denoted $D^*$. Within $D^*$, column B becomes $B^*$, which yields either the discrete values $x_1^*, x_2^*, \ldots, x_k^*$ or, after re-binning a numerical column, the bins $(n_1^*)', (n_2^*)', \ldots, (n_k^*)'$, where k is the number of values or bins. The counts of $x^*$ / $(n^*)'$ are calculated, and the three highest counts, $CNT(x^*)_{top3}$ / $CNT((n^*)')_{top3}$, identify the three discrete values $x_{max}^*$ or the numerical ranges of the three bins $(n_{max}^*)'$ that serve as the second filtering condition. The intersection of the two filtering conditions, $t_{recent} \cap x_{max}^*$ / $(n_{max}^*)'$, is used as the analysis filtering condition of the analysis filtering combination model to perform data filtering analysis on the data set D;
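Case (2.2.1) can be sketched for a discrete column B as follows. This is an illustrative reading only: rows are plain dictionaries with keys "A" and "B", the recent window is passed in as a date range, and re-binning of a numerical B column is omitted. All names and data are hypothetical.

```python
from collections import Counter
from datetime import date


def filter_time_then_discrete(rows, recent_start, recent_end, top_n=3):
    """Case (2.2.1): A is a time-series column, B is a discrete column."""
    # First condition: keep rows whose A value lies in the recent window.
    d_star = [r for r in rows if recent_start <= r["A"] <= recent_end]
    # Second condition: the top_n most frequent B values inside the window.
    counts = Counter(r["B"] for r in d_star)
    top_values = {value for value, _ in counts.most_common(top_n)}
    # Intersection of both conditions, applied to the full data set.
    return [
        r for r in rows
        if recent_start <= r["A"] <= recent_end and r["B"] in top_values
    ]


rows = [
    {"A": date(2019, 1, 5), "B": "x"},
    {"A": date(2019, 1, 6), "B": "x"},
    {"A": date(2019, 1, 7), "B": "y"},
    {"A": date(2018, 6, 1), "B": "x"},
]
result = filter_time_then_discrete(
    rows, date(2019, 1, 1), date(2019, 1, 31), top_n=1
)
print(len(result))  # 2
```

With `top_n=1` only the most frequent value "x" inside January 2019 survives, so the 2018 row and the "y" row are both filtered out.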

(2.2.2) A is of discrete value type or numerical type, and B is time-series data. For A, the count CNT(x) of each discrete value or the count CNT(n′) of each bin is calculated, and the five discrete values $x_{top5}$ with the highest counts, or the numerical ranges of the five bins $(n_{top5})'$ with the highest counts, are selected as the first filtering condition (if there are too few discrete values or bins, this filter is not produced). The data set obtained after screening by the condition on column A is denoted $D^*$. The time range $t_{max}$ of column $B^*$ corresponding to the discrete value $x_{max}$ or bin $(n_{max})'$ with the highest count in A is selected as the second filtering condition. The intersection of the two filtering conditions, $x_{top5}$ / $(n_{top5})' \cap t_{max}$, is used as the analysis filtering condition of the analysis filtering combination model to perform data filtering analysis on the data set D.
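Case (2.2.2) can be sketched in the same style for a discrete column A. Two points are assumptions made for this illustration: the time range of the most frequent value is taken as the span from its earliest to its latest B value, and the filter is skipped (returns `None`) when A has fewer distinct values than requested. All names and data are hypothetical.

```python
from collections import Counter
from datetime import date


def filter_discrete_then_time(rows, top_n=5):
    """Case (2.2.2): A is a discrete column, B is a time-series column."""
    counts = Counter(r["A"] for r in rows)
    if len(counts) < top_n:
        return None  # too few distinct values: this filter is not produced
    ranked = [value for value, _ in counts.most_common(top_n)]
    top_values = set(ranked)  # first condition: the top_n values of A
    x_max = ranked[0]         # the single most frequent value of A
    # Second condition: the time span covered by rows where A == x_max.
    times = [r["B"] for r in rows if r["A"] == x_max]
    t_start, t_end = min(times), max(times)
    return [
        r for r in rows
        if r["A"] in top_values and t_start <= r["B"] <= t_end
    ]


rows = [
    {"A": "p", "B": date(2019, 1, 1)},
    {"A": "p", "B": date(2019, 1, 10)},
    {"A": "p", "B": date(2019, 1, 20)},
    {"A": "q", "B": date(2019, 1, 5)},
    {"A": "q", "B": date(2019, 2, 1)},
    {"A": "r", "B": date(2019, 1, 7)},
]
result = filter_discrete_then_time(rows, top_n=2)
print(len(result))  # 4
```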

(3) In order to present the analysis-filtered data to the user, the present invention automatically visualizes the resulting dataset obtained by the two-step analysis filtering of steps (1), (2). The specific flow is as follows:

(3.1) To visualize the result data set, the following are obtained for each column X: the cardinality d(X); the maximum value max(X) and minimum value min(X); the number of records |X|; the data type type(X); the binned data x′ and the count CNT(x′) of each bin (each discrete value of a discrete value column X can be regarded as a bin); and the correlation coefficient correlation(x′, CNT(x′)) between each bin x′ and its count CNT(x′).

(3.2) A set of pruning rules is defined according to the column type type(X) obtained in (3.1): when the data type of column X is time-series, the visualization chart may be a bar chart or a line chart; when the data type of column X is discrete value type or numerical type, the visualization chart may be a bar chart, a pie chart, or a scatter chart.
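The rule in (3.2) is essentially a lookup from column type to candidate chart types. A minimal sketch, treating the histogram and the bar chart as the same chart type (as step (3.3.1) does) and using hypothetical type and chart labels:

```python
# Candidate chart types per column data type, following step (3.2).
CANDIDATE_CHARTS = {
    "time_series": ("bar", "line"),
    "discrete": ("bar", "pie", "scatter"),
    "numerical": ("bar", "pie", "scatter"),
}


def candidate_charts(column_type):
    """Return the chart types that remain after pruning by column type."""
    return CANDIDATE_CHARTS[column_type]


print(candidate_charts("time_series"))  # ('bar', 'line')
```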

(3.3) The invention uses a data analysis measure, relative information entropy, to determine automatically how the result data set obtained from the analysis filtering of steps (1) and (2) is visualized. The core idea is to calculate, for each candidate chart type, the ratio of the information entropy of visualizing data column X as that chart to the normalized chart information entropy, denoted $C(X)_1, C(X)_2, \ldots, C(X)_k$. The relative information entropies are compared, and the chart type corresponding to the maximum $C(X)_{max}$ is the visualization type of data column X. The specific method is as follows:

(3.3.1) The bar chart is one of the charts most commonly used by analysts; it uses the difference in bar heights to make differences in the data easier for the user to recognize. The bar chart suits many scenarios and shows the details of the data better when the number of x′ elements (i.e., the number of bins) is large. The relative information entropy of the bar chart is calculated using the cardinality d(X) of column X, where |d(X)| denotes the value of the cardinality d(X) of column X;

(3.3.2) The pie chart can show multiple groups of data and represents the proportion of each group in the whole. In the pie chart, the values of CNT(x′) need to be differentiated to highlight the share of each part, so the Shannon entropy

$$\sum_{y \in CNT(x')} -P(y) \log P(y)$$

is introduced as part of the decision criterion, where y ranges over the values of CNT(x′), and P(y) is the proportion of y, i.e., the occurrence probability of y in CNT(x′);
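The Shannon entropy term used for the pie chart can be computed directly from the bin counts. The sketch below assumes that P(y) is the share of each bin count in the total count and that the logarithm is the natural logarithm; the text fixes neither choice. The function name is hypothetical.

```python
import math


def shannon_entropy(counts):
    """Sum of -P(y) * log P(y) over bin counts, with P(y) = y / total."""
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count > 0:  # empty bins contribute nothing
            p = count / total
            entropy -= p * math.log(p)
    return entropy


print(shannon_entropy([5, 5]) > shannon_entropy([9, 1]))  # True
```

Evenly sized slices give the largest entropy, while a single dominant slice drives it toward zero.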

(3.3.3) The advantage of the line chart is that it reflects how the same thing develops and changes over time. When the data CNT(x′) and x′ conform to a certain distribution (such as a linear, exponential, logarithmic, or power-law distribution), the expression of that distribution is denoted distribution(x′, CNT(x′)), and the information entropy C(X) is 1; otherwise, the information entropy C(X) is 0;

C（X）= distribution(x′,CNT（x′）)；

(3.3.4) The scatter chart represents the relationship between two variables by coordinate axes; its information entropy is calculated using the correlation coefficient correlation(x′, CNT(x′));

C（X）= correlation (x′,CNT(x′))。
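The text does not say which correlation coefficient is meant. The sketch below assumes the Pearson coefficient and assumes each bin x′ is represented by a number (for example a bin index or midpoint). The function name is hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence has no defined correlation
    return cov / math.sqrt(var_x * var_y)


bins = [1, 2, 3, 4]        # hypothetical numeric bin positions x'
counts = [10, 20, 30, 40]  # hypothetical counts CNT(x')
print(round(pearson_correlation(bins, counts), 6))  # 1.0
```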

(3.4) By comparing the relative information entropies of column X under the various visualization charts, the maximum relative information entropy $C(X)_{max}$ is obtained. The result data set obtained from the analysis filtering of steps (1) and (2) is then displayed visually using the chart type corresponding to $C(X)_{max}$.
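The final selection in (3.4) is an argmax over the per-chart scores. A minimal sketch, with hypothetical chart labels and made-up score values used purely for illustration:

```python
def choose_chart(scores):
    """Return the chart type whose relative information entropy is largest."""
    return max(scores, key=scores.get)


# Hypothetical relative information entropies C(X) for one column.
scores = {"bar": 0.42, "pie": 0.31, "scatter": 0.77}
print(choose_chart(scores))  # scatter
```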

By setting the related rules appropriately, the invention solves the problem of how to apply data filtering rules in data analysis to establish an analysis filtering model, and uses the model to analyze, filter, and intuitively display the data. The invention helps users quickly screen data, find the data subsets of interest, and analyze and mine the relationships between data items.

Drawings

FIG. 1 is a diagram of an example of data column analysis.

FIG. 2 is a process of data analysis filtering.

FIG. 3 is an example of data analysis filtering. Wherein (a) is a sales date filtering instance graph and (b) is a sales price filtering instance graph.

FIG. 4 is a comparison of visualizations of the result data set, wherein (a) is a bar chart display of the result data set and (b) is a line chart display of the result data set.

FIG. 5 is a flow chart of the method of the present invention.

Detailed Description

In this section we describe the invention by means of a specific data analysis system.

The data set selected by the invention comprises 33 columns and 344,355 records. Following the procedure described above, the data columns and data ranges are analyzed, and the resulting data are visualized and returned to the user for presentation. As shown in FIG. 1, the data column analysis method of the invention takes the profit column as the key column and analyzes all remaining data columns; the analysis shows that the sales date and the sales price have the highest importance.

The invention establishes a data filtering rule model based on the scheme provided in step (2), combining the target columns sales date and sales price as screening conditions. Based on this model, the data analysis system obtains the operation sequence for analyzing the data shown in FIG. 2, and thereby obtains the bin with the highest count for the sales price, with data range 0-57, among records whose sales date falls within the last month. Finally, the filtering result example shown in FIG. 3 is obtained and displayed by the system.

The invention adopts automated visualization: the result data set is analyzed autonomously and presented in a suitable visualization chart. As shown in FIG. 4, the bar chart on the left is less suitable for these data, whereas the line chart on the right makes the trend easier to see. The invention therefore uses the line chart on the right to display the sales price column.

Claims

1. A data filtering rule modeling method in data analysis comprises the following specific steps:

(1) Given a data set D formed by a large amount of data, calculating the importance of a data column according to whether key data are designated by a user or not by adopting a random forest feature selection method; the specific flow is as follows:

(1.1) the importance score is expressed as VIM and the Gini index as GI; assuming there are m data columns $X_1, X_2, X_3, \ldots, X_m$, the Gini index score $VIM_j^{(Gini)}$ of each column $X_j$ is calculated, that is, the average change in node-splitting uncertainty attributable to the j-th column over all decision trees in the random forest RF; the Gini index is:

$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^2$$

wherein K is the number of categories, $p_{mk}$ represents the proportion of class k in node m, and $p_{mk'}$ represents the proportion of any other class $k' \neq k$ in node m;

(1.2) the importance of data column $X_j$ at node m, i.e., the change in the Gini index before and after node m branches, is:

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$

wherein $GI_l$ and $GI_r$ respectively represent the Gini indexes of the two new nodes after branching;

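(1.4) with n trees in the random forest, the importance of data column $X_j$ is:

$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$

wherein $VIM_{ij}^{(Gini)}$ is the importance of $X_j$ in the i-th tree, i.e., the sum of $VIM_{jm}^{(Gini)}$ over the nodes m of decision tree i at which $X_j$ appears;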

(1.5) according to the calculated importance ranking, the two most important columns of data, denoted A and B with A ranked higher than B, are returned to the user as the analysis filtering result;

(2) Analyzing and filtering a data range; the specific flow is as follows:

(2.1) the data types of columns A and B are first divided into three categories: numerical type N, discrete value type X, and time-series type T; for numerical type N, discretization is performed first, that is, the data are binned, each bin is denoted N′, and the count of each bin is calculated and denoted CNT(N′); for discrete value type X, the count of each discrete value is calculated and denoted CNT(X);

for time-series type T, time-segment bins are divided according to the range of the time-series data in column T, and each resulting time-series bin is denoted T′;

(2.2) forming two data analysis and filtration combined modes according to three different data types, and carrying out data filtration analysis on the data set D; the method comprises the following steps:

(2.2.1) A is time-series data, and B is of discrete value type or numerical type; for A, according to the unit of the time-series bins t′ obtained in step (2.1), a suitable recent period is selected as the first filtering condition $t_{recent}$; the data set obtained after screening by the condition on column A is denoted $D^*$; within $D^*$, column B becomes $B^*$, which yields either the discrete values $x_1^*, x_2^*, \ldots, x_k^*$ or, after re-binning a numerical column, the bins $(n_1^*)', (n_2^*)', \ldots, (n_k^*)'$, wherein k is the number of values or bins; the counts of $x^*$ / $(n^*)'$ are calculated, and the three highest counts, $CNT(x^*)_{top3}$ / $CNT((n^*)')_{top3}$, identify the three discrete values $x_{max}^*$ or the numerical ranges of the three bins $(n_{max}^*)'$ that serve as the second filtering condition; the intersection of the two filtering conditions, $t_{recent} \cap x_{max}^*$ / $(n_{max}^*)'$, is used as the analysis filtering condition of the analysis filtering combination model to perform data filtering analysis on the data set D;

(2.2.2) A is of discrete value type or numerical type, and B is time-series data; for A, the count CNT(x) of each discrete value or the count CNT(n′) of each bin is calculated, and the five discrete values $x_{top5}$ with the highest counts, or the numerical ranges of the five bins $(n_{top5})'$ with the highest counts, are selected as the first filtering condition; the data set obtained after screening by the condition on column A is denoted $D^*$; the time range $t_{max}$ of column $B^*$ corresponding to the discrete value $x_{max}$ or bin $(n_{max})'$ with the highest count in A is selected as the second filtering condition; the intersection of the two filtering conditions, $x_{top5}$ / $(n_{top5})' \cap t_{max}$, is used as the analysis filtering condition of the analysis filtering combination model to perform data filtering analysis on the data set D;

(3) Automatically visualizing the resulting dataset resulting from the analysis filtering of steps (1), (2) for presenting the analysis filtered data to a user; the specific flow is as follows:

(3.1) to visualize the result data set, the following are obtained: the cardinality d(X) of column X, the maximum value max(X) and minimum value min(X) of column X, the number of records |X| of column X, the data type type(X) of column X, the count CNT(x′) corresponding to each bin x′, and the correlation coefficient correlation(x′, CNT(x′)) between each bin x′ and its count CNT(x′);

(3.2) a set of pruning rules is defined according to the column type type(X) obtained in (3.1); when the data type of column X is time-series, the visualization chart is a bar chart and a line chart; when the data type of column X is discrete value type or numerical type, the visualization chart is a bar chart, a pie chart and a scatter chart;

(3.3) a data analysis measure, relative information entropy, is adopted to determine how to automatically visualize the result data set obtained after the analysis filtering of steps (1) and (2); the core idea is to calculate, for each of the various charts, the ratio of the information entropy of visualizing data column X as that chart to the normalized chart information entropy, denoted $C(X)_1, C(X)_2, \ldots, C(X)_k$; the relative information entropies are compared, and the chart type corresponding to the maximum $C(X)_{max}$ is the visualization type of data column X; the method comprises the following steps:

(3.3.1) in the bar chart, the difference in bar heights is used to make differences in the data easier for the user to recognize; the relative information entropy of the bar chart is calculated using the cardinality d(X) of column X, where |d(X)| represents the value of the cardinality d(X) of column X:

(3.3.2) the pie chart can show multiple groups of data and represents the proportion of each group in the whole; in the pie chart, the values of CNT(x′) need to be differentiated to highlight the share of each part, for which the Shannon entropy $\sum_{y \in CNT(x')} -P(y) \log P(y)$ is introduced as part of the decision criterion; wherein y represents each value of CNT(x′), and P(y) represents the proportion of y, i.e., the occurrence probability of y in CNT(x′);

(3.3.3) the line chart reflects how the same thing develops and changes over time; when the data CNT(x′) and x′ conform to a certain distribution, namely a linear, exponential, logarithmic or power-law distribution, the expression of the distribution is denoted distribution(x′, CNT(x′)), and the information entropy C(X) is 1; otherwise, the information entropy C(X) is 0;

C(X)＝distribution(x′,CNT(x′))

(3.3.4) in the scatter chart, the relationship between two variables is represented by coordinate axes; the information entropy is calculated using the correlation coefficient correlation(x′, CNT(x′));

C(X)＝correlation(x′,CNT(x′))

(3.4) by comparing the relative information entropies of column X under the various visualization charts, the maximum relative information entropy $C(X)_{max}$ is obtained; the result data set obtained from the analysis filtering of steps (1) and (2) is displayed visually using the chart type corresponding to $C(X)_{max}$.