CN108960434B - Method and device for analyzing data based on machine learning model interpretation


Info

Publication number
CN108960434B
Authority
CN
China
Legal status: Active
Application number
CN201810683818.6A
Other languages
Chinese (zh)
Other versions
CN108960434A (en)
Inventor
方荣
李福龙
杨慧斌
詹镇江
Current Assignee
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201810683818.6A priority Critical patent/CN108960434B/en
Publication of CN108960434A publication Critical patent/CN108960434A/en
Application granted granted Critical
Publication of CN108960434B publication Critical patent/CN108960434B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a method and apparatus for analyzing data based on machine learning model interpretation. A method of analyzing data based on machine learning model interpretation includes: obtaining and displaying model interpretation content, wherein the model interpretation content comprises at least one of a model structure interpretation, model feature importance, and a model prediction interpretation; receiving a data analysis request made by a user for at least one of the feature names displayed in the model interpretation content; calculating, for each feature name of the at least one feature name, the data distribution in the training samples of all features under that feature name; and outputting the data distribution in a visualized manner.

Description

Method and device for analyzing data based on machine learning model interpretation
Technical Field
The invention relates to the field of machine learning, in particular to a method and a device for analyzing data based on machine learning model interpretation.
Background
The conventional machine learning modeling process requires a significant amount of time for data analysis in order to decide how to perform feature processing. In practical problems, however, the amount of data to be analyzed is often very large. Analyzing all of it indiscriminately is a cumbersome and time-consuming process, while analyzing it selectively may require the user to have considerable business experience.
In addition, if the effect of a model is known only through its evaluation metrics, the model is effectively a black box, and the user does not understand what the model actually expresses.
Therefore, in the prior art, there is a lack of a solution that can effectively analyze data and understand a model during machine learning.
Disclosure of Invention
In order to solve the above problems, the present invention proposes a method of analyzing data based on machine learning model interpretation.
According to the present invention, there is provided a method of analyzing data based on machine learning model interpretation, the method may comprise: obtaining and displaying model interpretation content, wherein the model interpretation content comprises at least one of model structure interpretation, model feature importance and model prediction interpretation; receiving a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content; respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and outputting the data distribution in a visual manner.
According to one embodiment of the invention, for discrete features, the data distribution may be a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels with respect to the occurrence of the feature under each group.
According to an embodiment of the present invention, for continuous features, the data distribution may be a mean-value chart, a line chart, and/or a kernel density estimation chart obtained by computing statistics of the features grouped by label.
According to an embodiment of the present invention, the model may be a logistic regression model, and the model structure interpretation may be displayed as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
According to one embodiment of the present invention, distribution information of non-zero weight values and/or all weight values of respective features under the same feature name may be represented by a box plot, wherein the box plot includes at least one of the following items: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
According to an embodiment of the invention, the dimension information may indicate at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
According to one embodiment of the invention, the model may be a logistic regression model, and the model prediction interpretation may be displayed as the features of the prediction samples and their corresponding weights; alternatively, the model may be a decision tree model, and the model prediction interpretation may be displayed as a decision path of the prediction sample and its corresponding weight.
According to an embodiment of the invention, the data analysis request may include: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
According to an embodiment of the invention, the method may further comprise: receiving a feature name search instruction; searching for a target feature name among the feature names displayed by the model interpretation content according to the feature name search instruction; and displaying the found target feature name and the corresponding data distribution.
According to an embodiment of the present invention, the step of visually outputting the data distribution may include: displaying the data distribution by establishing a correlation of model interpretation content with the data distribution.
According to an embodiment of the present invention, the step of visually outputting the data distribution may include: displaying the data distribution in the vicinity of the model interpretation content.
According to the present invention, there is provided an apparatus for analyzing data based on machine learning model interpretation, the apparatus may include: the model interpretation unit is used for acquiring and displaying model interpretation contents, and the model interpretation contents comprise at least one of model structure interpretation, model feature importance and model prediction interpretation; a receiving unit configured to receive a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content; the calculation unit is used for respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and an output unit for outputting the data distribution in a visualized manner.
According to one embodiment of the invention, for discrete features, the data distribution may be a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels with respect to the occurrence of the feature under each group.
According to an embodiment of the present invention, for continuous features, the data distribution may be a mean-value chart, a line chart, and/or a kernel density estimation chart obtained by computing statistics of the features grouped by label.
According to an embodiment of the present invention, the model may be a logistic regression model, and the model interpretation unit may interpret the model structure as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
According to an embodiment of the present invention, the model interpretation unit may represent distribution information of non-zero weight values and/or all weight values of respective features under the same feature name by box plots, respectively, wherein the box plots include at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
According to an embodiment of the present invention, the model interpretation unit may display the dimension information as at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
According to an embodiment of the present invention, the model may be a logistic regression model, and the model interpretation unit may display the model prediction interpretation as features of prediction samples and their corresponding weights; alternatively, the model may be a decision tree model, and the model interpretation unit may interpret the model prediction as a decision path of a prediction sample and its corresponding weight.
According to an embodiment of the invention, the data analysis request may include: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
According to an embodiment of the present invention, the apparatus may further include a search unit configured to receive a feature name search instruction and search for a target feature name among the feature names displayed by the model interpreted content according to the feature name search instruction, and the output unit may further display the searched target feature name and a corresponding data distribution.
According to an embodiment of the present invention, the output unit may display the data distribution by establishing a correlation of model interpretation contents with the data distribution.
According to an embodiment of the present invention, the output unit may display the data distribution near model interpretation content.
According to the present invention, there is provided a computer-readable storage medium having stored thereon a computer program for executing the method of analyzing data based on machine learning model interpretation described in any of the preceding embodiments.
According to the present invention, there is provided a computing device comprising a storage component and a processor, wherein the storage component has stored therein computer-executable instructions that, when executed by the processor, perform the method of analyzing data based on machine learning model interpretation as described in any of the preceding embodiments.
By adopting the invention, the user can select data from a large amount of training data in a targeted manner for analysis, and can effectively combine data analysis with model understanding.
Drawings
The above and/or other objects and advantages of the present invention will become more apparent by describing embodiments below with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a method of analyzing data based on machine learning model interpretation, according to an embodiment of the invention;
FIG. 2 illustrates an example of analyzing data based on model feature importance, according to an embodiment of the invention;
FIGS. 3A-3C illustrate examples of analyzing data based on model structure interpretation, according to embodiments of the invention;
FIG. 4 shows an example of model prediction result interpretation;
FIGS. 5-8 illustrate examples of the data distribution in the training samples of all features under a feature name;
FIG. 9 shows a block diagram of an apparatus for analyzing data based on machine learning model interpretation, according to an embodiment of the invention;
FIG. 10 shows a schematic block diagram of a computing device according to an embodiment of the invention.
Detailed Description
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.
A method and apparatus for analyzing data based on machine learning model interpretation according to an embodiment of the present invention will be described below with reference to the accompanying drawings.
First, a method of analyzing data based on machine learning model interpretation according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a flow diagram of a method of analyzing data based on machine learning model interpretation, according to an embodiment of the invention.
As shown in fig. 1, in step S101, model explanatory content may be acquired and displayed. The model interpretation content may include at least one of a model structure interpretation, a model feature importance, and a model prediction interpretation.
Regarding the model structure interpretation, the logistic regression model is taken as an example for explanation, and the step of obtaining and displaying the model structure interpretation of the logistic regression model comprises the following steps: obtaining model parameters of a logistic regression model, wherein the model parameters comprise all features in the logistic regression model and weight values of all the features; aggregating all the obtained characteristics in the model parameters according to the characteristic names; performing feature statistics on each feature name to obtain feature statistical information of each feature name, wherein the feature statistical information indicates distribution information of weight values of each feature under the same feature name and/or dimension information of each feature under the same feature name; and displaying the feature name and the corresponding feature statistical information through a graphical interface. The model structure interpretation of the logistic regression model may be displayed as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name. A model structure explanation of the logistic regression model will be described in detail below with reference to fig. 3A and 3B.
It should be noted that although the model structure interpretation is described using a logistic regression model as an example, the machine learning model may also be other models (such as a naive bayes model, a support vector machine model, a decision tree model, and the like).
Continuing with the example of the logistic regression model, a logistic regression model can be regarded as a series of features and their corresponding weights; accordingly, the model parameters of the logistic regression model can include the features of the model and the weight value of each feature in the model. As an example, during the training of a machine learning model a feature is typically represented as a hashed value, and accordingly the feature source data in the system may further include a mapping between the original value of a feature before hashing and its hashed value. The original value of each feature can therefore be recovered from the feature source data, together with the weight value of each feature.
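The hash-to-original mapping described above can be sketched as follows. The function name, dictionary layout, and hashed keys are illustrative assumptions, not the patent's actual storage format:

```python
# Hypothetical sketch: recovering original feature values and their weights
# from a hash-based model, assuming a stored hash -> original-value mapping.
def resolve_feature_weights(model_weights, hash_to_original):
    """model_weights: {hashed_feature: weight}; hash_to_original: {hashed: original}."""
    resolved = {}
    for hashed, weight in model_weights.items():
        # Fall back to the hashed value itself if no original is recorded.
        original = hash_to_original.get(hashed, hashed)
        resolved[original] = weight
    return resolved

weights = {0x1a2b: 0.73, 0x3c4d: -0.12}
mapping = {0x1a2b: "profession=technician", 0x3c4d: "age=30-40"}
print(resolve_feature_weights(weights, mapping))
# {'profession=technician': 0.73, 'age=30-40': -0.12}
```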
The feature name may correspond to the result of feature engineering of a field or fields of the original data sheet, which is intended to describe the behavior of an aspect of the training sample. The features of the machine learning samples are classified into discrete features and continuous features according to a feature processing method. For discrete features, each feature name corresponds to a set of features, each feature indicating a value of the feature name and corresponding to a weight value. For continuous features, there is only one feature under the feature name and its corresponding weight, and the value of the feature is continuously changed.
Because the feature scale of a logistic regression model is huge (feature dimensions in the tens of millions are common), in order to show more intuitively the different influences of different features on the model prediction result, all features in the model parameters can be aggregated according to the feature names to which they belong, so as to display the degree of difference between the weight values of the features under the same feature name; this degree of difference reflects the discriminative power of the corresponding feature name. For example, in one embodiment of the present invention, the model parameters of the logistic regression model may include the following features: profession=blue-collar worker, profession=technician, profession=service staff, profession=retiree, profession=manager; these features can be aggregated under the corresponding feature name "profession". As another example, the model parameters may include the features age=20-30, age=30-40, age=40-50, age=50-60, age=60 or above, and so on; these features can be aggregated under the feature name "age". The features included under each feature name may be counted to obtain feature statistics for each feature name of the logistic model. The distribution information of the weight values of the features under the same feature name may indicate how the weight values (all weight values, or only the non-zero ones) are distributed: in which ranges they mainly fall, whether they are concentrated or spread out, where the key value points are located, whether the distribution is uniform, and so on.
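The aggregation step can be sketched as below, assuming (purely for illustration) that features are serialized as `name=value` strings; the patent itself does not fix a serialization:

```python
from collections import defaultdict

# Sketch: group model features of the assumed form "feature_name=value"
# by their feature name, keeping each value's weight.
def aggregate_by_feature_name(feature_weights):
    groups = defaultdict(dict)
    for feature, weight in feature_weights.items():
        name, _, value = feature.partition("=")
        groups[name][value] = weight
    return dict(groups)

weights = {
    "profession=blue-collar": 0.4, "profession=technician": 0.1,
    "profession=retired": -0.2, "age=20-30": 0.3, "age=30-40": 0.0,
}
grouped = aggregate_by_feature_name(weights)
print(sorted(grouped["profession"]))  # ['blue-collar', 'retired', 'technician']
```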
The dimension information of the respective features under the same feature name may indicate the dimension information of all the features under each feature name, or the dimension information of a certain class of features (for example, valid features with a weight value different from zero) under each feature name. The dimension information can enable a user to know the effective feature dimension under the feature name, so that the user is helped to know the online performance of the model. The individual feature names and their feature statistics may be displayed via a graphical interface. Here, as an example, each feature name and its associated feature statistical information may be displayed via a chart, a graph, a character, or the like.
The dimension information of the respective features under the same feature name may indicate at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model. The feature with a non-zero weight value is also referred to as a valid feature, and accordingly, the ratio of the number of features with a non-zero weight value in the same feature name to the total number of features in the same feature name is also referred to as a valid feature ratio. For example, if 10 features are included in a feature name, wherein 3 features correspond to a weight value of 0, and the remaining 7 features correspond to weight values other than 0, the percentage of valid features of the feature name is 70%. In addition, if the total dimension number of all the features under all the feature names of one model is 100, and the dimension number of all the features under one feature name a is 20, the dimension number of all the features under the feature name a is 20, and the ratio of the dimension number to the total dimension number of the features of the model is 20%. 
In another example, if the total dimension number of all features under all feature names of one model is 100, wherein the total dimension number of features with non-zero weight values is 80 and the dimension number of features with non-zero weight values under one feature name B is 10, the dimension number of features with non-zero weight values under the feature name B is 10, and the proportion of that dimension number to the total dimension number of features with non-zero weight values of the model is 10/80 × 100% = 12.5%.
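The ratios in the two worked examples above can be checked with a short sketch; the weight list is made-up data that reproduces the stated counts:

```python
def valid_feature_ratio(weights):
    """Share of features under one feature name whose weight value is non-zero."""
    nonzero = sum(1 for w in weights if w != 0)
    return nonzero / len(weights)

# 10 features under one feature name, 3 with weight 0 -> 70% valid features
weights_a = [0.5, -0.1, 0.0, 0.2, 0.0, 0.3, -0.4, 0.1, 0.0, 0.6]
print(valid_feature_ratio(weights_a))  # 0.7

# Feature name A: 20 of the model's 100 total dimensions -> 20%
print(20 / 100)  # 0.2
# Feature name B: 10 of the model's 80 non-zero-weight dimensions -> 12.5%
print(10 / 80)   # 0.125
```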
Distribution information of non-zero weight values and/or all weight values of respective features under the same feature name may be represented by a box plot, respectively, wherein the box plot includes at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
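The five-number summary behind such a box plot can be computed as below. The quartile convention (`statistics.quantiles` with the inclusive method) is an assumption, since the patent does not specify one:

```python
import statistics

def five_number_summary(weights, nonzero_only=False):
    """Min, Q1, median, Q3, and max of the weight values under one feature name."""
    w = [x for x in weights if x != 0] if nonzero_only else list(weights)
    q1, median, q3 = statistics.quantiles(w, n=4, method="inclusive")
    return {"min": min(w), "q1": q1, "median": median, "q3": q3, "max": max(w)}

print(five_number_summary([1, 2, 3, 4, 5]))
# {'min': 1, 'q1': 2.0, 'median': 3.0, 'q3': 4.0, 'max': 5}
```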
The specific form of model feature importance is a feature together with its corresponding importance value. For a Logistic Regression (LR) model, the features are typically discrete, so the feature importance of a logistic regression model refers to the importance of a feature set. For a Gradient Boosting Decision Tree (GBDT) model, the features are generally continuous, so the feature importance of the GBDT model refers to the importance of individual features (for continuous features, the feature set and the feature can be understood as the same concept). There are various methods for calculating model feature importance. Both the logistic regression model and the GBDT model can compute feature importance from the model alone: the logistic regression model can use the variance of the weights as the measure of feature importance, and the GBDT model can sum all the gains obtained each time a feature is used for splitting as the measure of feature importance. In addition, both models can compute feature importance from the model and the samples together. For the logistic regression model, to calculate the importance of a feature, that feature can be shuffled across all samples, the area under the curve (AUC) can then be computed with the model, and the difference between this AUC and the AUC of the original model serves as the importance measure of the feature. For the GBDT model, feature importance can also be assessed from the out-of-bag error rate. Finally, model feature importance can be calculated from the samples alone: for example, an LR model is trained for each feature in the sample table, the AUC of each such model is obtained on the training data, and this AUC is used as the importance value of the feature.
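The shuffle-and-compare-AUC procedure for the logistic regression case can be sketched as follows; the toy linear scorer and the data are illustrative assumptions, not the patent's implementation:

```python
import random

def auc(scores, labels):
    """AUC as the probability a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(X, y, weights, col, seed=0):
    """AUC drop after shuffling column `col` of X; a larger drop means a more important feature."""
    score = lambda rows: [sum(w * v for w, v in zip(weights, r)) for r in rows]
    base = auc(score(X), y)
    shuffled = [r[col] for r in X]
    random.Random(seed).shuffle(shuffled)
    Xp = [r[:col] + [s] + r[col + 1:] for r, s in zip(X, shuffled)]
    return base - auc(score(Xp), y)

X = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.3], [0.2, 0.0]]
y = [1, 1, 0, 0]
print(auc([1.0, 0.9, 0.1, 0.2], y))  # 1.0: these scores separate the classes perfectly
```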
With respect to model prediction interpretation, if the model is a logistic regression model, the model prediction interpretation is displayed as the features of the prediction samples and their corresponding weights. If the model is a decision tree model, the model prediction interpretation is displayed as a decision path of the prediction sample and its corresponding weight.
In step S102, a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content may be received. The data analysis request may include a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content. In one example, the data analysis request may be a hover operation made for a model structure interpretation of the logistic regression model for at least one of the displayed feature names. In another example, the data analysis request may be a click operation made on at least one of the displayed feature names for model predictive interpretation of the decision tree model.
In step S103, the data distribution in the training samples of all features under each of the at least one feature name can be calculated respectively. For discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot of the occurrences of the features under each group after grouping by label. For continuous features, the data distribution is a mean-value chart, a line chart, and/or a kernel density estimation chart obtained by computing statistics of the features grouped by label. The label is illustrated with a binary classification problem: whether a person has diabetes can be determined by analyzing indicators (i.e., feature names) such as age, sex, body mass index, mean blood pressure, and disease index, where label = 0 indicates that the person is not diseased and label = 1 indicates that the person is diseased. For the two groups, non-diseased and diseased, the data distribution of each feature of the training data (e.g., a specific age, sex, body mass index, mean blood pressure, disease index, etc.) under the corresponding group can be counted separately. Various forms of data distribution will be described below with reference to fig. 5 to 8.
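The per-label counting for a discrete feature, i.e. the data behind a stacked or grouped histogram, can be sketched as below; the sex/label toy data are illustrative:

```python
from collections import Counter

# Sketch: count occurrences of a discrete feature's values separately for
# each label group (label=0 vs. label=1 in a binary classification problem).
def grouped_counts(values, labels):
    counts = {0: Counter(), 1: Counter()}
    for v, y in zip(values, labels):
        counts[y][v] += 1
    return counts

sex = ["M", "F", "F", "M", "F", "M"]
label = [0, 1, 1, 0, 0, 1]
print(grouped_counts(sex, label))
```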
In step S104, the data distribution may be output in a visualized manner. The step of visually outputting the data distribution may include: displaying the data distribution by establishing a correlation of model interpretation content with the data distribution. Additionally, the step of visually outputting the data distribution may include: and displaying the data distribution near the model interpretation content. In one example, the data distribution may be displayed in the vicinity of the model structure interpretation. In another example, the data distribution may be displayed near the importance of the model feature.
In addition, the method of analyzing data based on machine learning model interpretation according to the present invention may further include: receiving a feature name search instruction; searching for a target feature name among the feature names displayed by the model interpretation content according to the feature name search instruction; and displaying the found target feature name and the corresponding data distribution.
FIG. 2 illustrates an example of analyzing data based on model feature importance, according to an embodiment of the invention. Specifically, an example of calculating and displaying a kernel density estimation curve for a feature name based on the importance of the model feature is shown in fig. 2.
As shown in fig. 2, the kernel density estimation curve may be displayed below the model feature importance histogram. In the model feature importance histogram in fig. 2, the left column lists feature names (e.g., f_pRank, f_ussub, f_ussex, etc.), and the number marked next to each is the feature importance coefficient of the corresponding feature name; for example, the feature importance coefficient of the feature name f_pRank is 0.5299. When a hover operation or a click operation is made on the feature name f_pRank, the kernel density estimation graph in the lower half of fig. 2 may be displayed, in which the ordinate is the kernel density value and the abscissa is the respective feature values under the feature name f_pRank, and the kernel density function is shown for both target values 1 and 0.
The model feature importance provides a basis and guidance for the user in choosing which feature name to analyze, and displaying the data distribution in association with the model feature importance also helps the user understand it intuitively. The data distribution can be calculated in real time after the user selects a feature name, or the data distributions of all features in the training samples can be calculated in one pass. Model feature importance requires some secondary processing of the model. For example, for a tree model, one way to calculate the importance of a feature is to compute the reduction in weighted impurity when the feature splits at all non-leaf nodes; a larger reduction indicates a more important feature. For a linear model, one calculation method is to perturb the feature and compute the average drop between the classification accuracy after the perturbation and the classification accuracy before it.
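The weighted-impurity-reduction ingredient mentioned for tree models can be sketched as follows, using Gini impurity as an assumed purity measure for a binary label:

```python
# Sketch: the drop in weighted Gini impurity when a node splits into two
# children. Gini is one common purity measure; the patent does not name one.
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p1 = sum(labels) / len(labels)
    return 1 - p1 ** 2 - (1 - p1) ** 2

def impurity_decrease(parent, left, right):
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = [0, 0, 1, 1]
print(impurity_decrease(parent, [0, 0], [1, 1]))  # 0.5: a perfect split removes all impurity
```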
The range of the feature importance coefficient depends on the calculation method. In a logistic regression model, the feature importance coefficient obtained solely from the AUC on the training samples ranges from 0 to 1. When the logistic regression model uses the AUC-change calculation method, its feature importance coefficient ranges from -1 to 1. The GBDT model uses the summed split-gain method, whose feature importance coefficients range from the minimum split gain to positive infinity. Model feature importance is not an absolute concept; the specific value of the feature importance coefficient is therefore not important, and what matters is the difference between the feature importance coefficients of different features.
It should be noted that the method shown in fig. 2, namely calculating and displaying a kernel density estimation curve for a feature name based on the model feature importance, is only an example, and the present disclosure is not limited thereto. In another embodiment, when a hover or click operation is made on the feature name of a discrete feature, a stacked histogram, a grouped histogram, and/or a scatter diagram of the occurrence counts of the feature within each label group may be visually output. In another embodiment, when a hover or click operation is made on the feature name of a continuous feature, an average value map and/or a point-line map obtained by statistics over the feature grouped by label may be visually output. That is, the visualized form of the data distribution is not limited to the kernel density estimation curve, and more than one kind of data distribution may be output at the same time.
Fig. 3A to 3C illustrate examples of analyzing data based on model structure interpretation according to an embodiment of the present invention. In this example, based on the model structure interpretation of a logistic regression model, a grouped histogram of the occurrence counts of the feature within each label group is calculated and visually output.
As shown in fig. 3A and 3B, the model structure interpretation of the logistic regression model is displayed as feature names, distribution information of the weight values of the features under the same feature name, and dimension information of the features under the same feature name. The distribution information of the non-zero weight values and/or all weight values of the features under the same feature name is represented by a box plot, which includes at least one of the following: the minimum value, the first quartile, the median, the third quartile, and the maximum value. The dimension information indicates at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under that feature name; (2) the number of dimensions of all features under the same feature name and/or the ratio of that number to the total feature dimension of the model; (3) the number of dimensions of the features with non-zero weight values under the same feature name and/or the ratio of that number to the total number of non-zero-weight feature dimensions of the model.
Referring to fig. 3A to 3C, the feature name t_pRank is described as an example. From fig. 3A and 3B it can be seen that the number of feature names is 40, the total feature dimension is 119369, the feature dimension of the feature name t_pRank is 71, the valid-feature ratio of t_pRank (i.e., the ratio of the number of features with non-zero weight values under t_pRank to the total number of features under t_pRank) is 100%, and the feature dimension of t_pRank divided by the total feature dimension (i.e., 71/119369) is 0.06%.
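The two ratios reported for t_pRank can be reproduced from the figures' numbers directly; the arithmetic below only restates what is displayed in fig. 3A/3B:

```python
# Figures reported for the feature name t_pRank in fig. 3A/3B
total_feature_dimension = 119369   # total feature dimension of the model
t_pRank_dimension = 71             # feature dimension under the name t_pRank
t_pRank_nonzero = 71               # here every t_pRank weight is non-zero

# Valid-feature ratio: non-zero-weight features / all features under the name
valid_feature_ratio = t_pRank_nonzero / t_pRank_dimension

# Share of the model's total dimensions taken by this feature name
dimension_share = t_pRank_dimension / total_feature_dimension
```

The valid-feature ratio comes out to 100% and the dimension share to roughly 0.06%, matching the values shown in the figures.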
In one embodiment of the present invention, the box plot processing in fig. 3A and 3B may include: arranging all weight values, or the non-zero weight values, of the features under the same feature name in ascending or descending order; and drawing the box plot of the feature name using the first weight value, the last weight value, and/or the weight values at preset quantiles of the sorted sequence as key points. The preset quantiles may be specified as needed. For example, in one embodiment the preset quantiles may include the first quartile, the median, and the third quartile: the non-zero weights under each feature name are arranged from small to large, and the box plot is drawn from five points, namely the minimum value, the 1/4 quantile, the median, the 3/4 quantile, and the maximum value. Of course, in other embodiments these five points may be replaced by other quantiles, and the number of preset quantiles may be increased, for example by adding the 10% and 90% quantiles. The box plots show the distribution of the weight values of the features under each feature name, making the distribution easy to observe. Optionally, a reference line at weight value 0 may be drawn across the box plots of the feature names, so that whether the model prediction is biased toward a positive or a negative influence can be determined by observing the offset of each box plot relative to the 0 reference line.
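The five key points described above can be sketched as follows; the quantile interpolation rule and the example weights are assumptions for illustration, not the disclosed implementation:

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an ascending-sorted list."""
    idx = q * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def box_plot_points(weights, quantiles=(0.25, 0.5, 0.75)):
    """Key points for one feature name's box plot: min, preset quantiles, max.

    Only the non-zero weights are kept, sorted ascending, as in the text.
    """
    nonzero = sorted(w for w in weights if w != 0)
    return ([nonzero[0]]
            + [quantile(nonzero, q) for q in quantiles]
            + [nonzero[-1]])

# Hypothetical weight values under one feature name (zeros are dropped)
weights = [0.0, -0.4, 0.1, 0.3, 0.0, 0.7, -0.2]
points = box_plot_points(weights)
```

Adding the 10% and 90% quantiles is just a matter of passing `quantiles=(0.1, 0.25, 0.5, 0.75, 0.9)`; comparing `points` against 0 shows whether the weights skew positive or negative, as the reference-line discussion above suggests.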
In fig. 3A and 3B, the feature names may be sorted by weight variance (the box plots in the figure). When the user selects the feature name t_pRank by a hover or click operation, data analysis is performed on the relevant features of the feature group t_pRank in the training data. In this example, the user selects a feature group f_pRank whose feature engineering is f_pRank = discrete(pRank), i.e., pRank is discretized. As shown in fig. 3C, the data distribution may be displayed as a histogram; specifically, the features may be grouped by the value of the label, and then the number of occurrences of each feature in each group is counted. In addition, in one example, the data distribution may be displayed by establishing an association between the model structure interpretation and the data distribution, so that fig. 3C is displayed close to the model structure interpretation shown in fig. 3A or fig. 3B.
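The group-by-label-then-count step behind the fig. 3C histogram can be sketched as follows; the discretized f_pRank values and labels are invented for the example:

```python
from collections import Counter

def grouped_counts(feature_values, labels):
    """Count each discrete feature value separately within each label group."""
    groups = {}
    for v, l in zip(feature_values, labels):
        groups.setdefault(l, Counter())[v] += 1
    return groups

# Hypothetical discretized f_pRank values with binary labels
f_pRank = ["bin1", "bin2", "bin1", "bin3", "bin2", "bin1"]
label   = [0,      0,      1,      1,      1,      0]

hist = grouped_counts(f_pRank, label)
```

`hist[0]` and `hist[1]` hold one bar series each; drawing them side by side per feature value gives the grouped histogram, and stacking them gives the stacked variant mentioned earlier.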
In addition, to help the user quickly obtain the data distribution of interest, one embodiment of the present invention also provides a search function. Optionally, the search function may include searching by feature name. The specific steps may include: receiving a feature name search instruction; searching for the target feature name among the feature names displayed by the model interpretation content according to the instruction; and displaying the found target feature name and its corresponding data distribution. Referring to fig. 3A and 3B, the user may enter the target feature name in a search box; the system searches among all feature names of the logistic regression model, and once the target feature name is found, it is displayed together with the corresponding data distribution. It should be noted that the search strategy may be fuzzy search or exact search.
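A minimal sketch of the two search strategies, with substring matching standing in for "fuzzy" search (the disclosure does not specify the fuzzy-matching algorithm, so this is an assumption):

```python
def search_feature_names(names, query, fuzzy=True):
    """Exact match, or substring ('fuzzy') match, over displayed feature names."""
    if fuzzy:
        return [n for n in names if query in n]
    return [n for n in names if n == query]

# Hypothetical feature names from the model interpretation content
names = ["t_pRank", "f_ussub", "f_ussex", "t_pSub"]

fuzzy_hits = search_feature_names(names, "uss")             # substring match
exact_hits = search_feature_names(names, "f_ussub", fuzzy=False)
```

The matched names would then be displayed alongside their data distributions as described above.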
Fig. 4 shows an example of model prediction result interpretation, again described using the logistic regression model above. The prediction result interpretation of a logistic regression model may be expressed as features and their weights. In fig. 4, the non-zero features and their corresponding weights are listed; the weights may be sorted from high to low so that features with high weights are easy to find. A feature may then be selected, and the data distribution of that feature calculated and displayed using the methods described above. It should be noted that although the prediction interpretation is described using a logistic regression model as an example, the present disclosure is not limited to the model prediction interpretation shown in fig. 4; for other models, one skilled in the art may calculate and display data distributions based on other forms of model prediction interpretation.
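The listing of non-zero features with weights sorted by magnitude can be sketched as below; the feature names, weights, and sample are invented for the example:

```python
def prediction_interpretation(sample, weights):
    """Non-zero feature contributions of one prediction, largest |weight| first.

    Returns (feature name, weight, value * weight) triples.
    """
    contribs = [
        (name, weights[name], value * weights[name])
        for name, value in sample.items()
        if value != 0 and weights.get(name, 0) != 0
    ]
    return sorted(contribs, key=lambda t: abs(t[1]), reverse=True)

# Hypothetical logistic regression weights and one prediction sample
weights = {"f_a": 1.2, "f_b": -0.6, "f_c": 0.0, "f_d": 0.3}
sample  = {"f_a": 1, "f_b": 1, "f_c": 1, "f_d": 0}

top = prediction_interpretation(sample, weights)
```

Features that are zero in the sample or carry zero weight contribute nothing to the score and are omitted from the display, matching the fig. 4 listing.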
Fig. 5 to 8 show examples of the data distribution of all features in a training sample under a feature name. For discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot of the occurrence counts of the features within each label group. For continuous features, the data distribution is an average value map, a point-line map, and/or a kernel density estimation map obtained by statistics over the features grouped by label. Fig. 5 shows a stacked histogram for discrete features, where the occurrence count of each feature is tallied after grouping by label. Fig. 6 shows an average value map obtained for features grouped by label. Fig. 7 shows a point-line map obtained for features grouped by label. The scatter plot in fig. 8 represents the relationship between the features and the labels.
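The per-label average underlying a fig. 6 style average value map can be sketched as follows; the continuous feature values and labels are invented for the example:

```python
def group_means(values, labels):
    """Mean of a continuous feature within each label group."""
    sums, counts = {}, {}
    for v, l in zip(values, labels):
        sums[l] = sums.get(l, 0.0) + v
        counts[l] = counts.get(l, 0) + 1
    return {l: sums[l] / counts[l] for l in sums}

# Hypothetical continuous feature values with binary labels
values = [1.0, 3.0, 10.0, 14.0]
labels = [0, 0, 1, 1]

means = group_means(values, labels)
```

Plotting one bar (or point) per label from `means` gives the average value map; connecting the per-group statistics with line segments instead gives the point-line map of fig. 7.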
Fig. 9 illustrates a block diagram of an apparatus for analyzing data based on machine learning model interpretation according to an embodiment of the present invention.
As shown in fig. 9, an apparatus for analyzing data based on machine learning model interpretation according to an embodiment of the present invention may include: a model interpretation unit 201, configured to obtain and display model interpretation content, where the model interpretation content includes at least one of model structure interpretation, model feature importance, and model prediction interpretation; a receiving unit 202, configured to receive a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content; a calculating unit 203, configured to calculate data distributions of all features in the training sample under each feature name in the at least one feature name respectively; and an output unit 204 for outputting the data distribution in a visualized manner.
Wherein, for discrete features, the data distribution may be a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels with respect to the occurrence of the features under each grouping; for continuous features, the data distribution may be an average map, a point-line map, and/or a kernel density estimation map obtained by counting features grouped by labels.
For a logistic regression model, the model interpretation unit 201 may display the model structure interpretation as feature names, distribution information of the weight values of the features under the same feature name, and/or dimension information of the features under the same feature name. The model interpretation unit 201 may represent the distribution information of the non-zero weight values and/or all weight values of the features under the same feature name by box plots, where a box plot includes at least one of: the minimum value, the first quartile, the median, the third quartile, and the maximum value. The model interpretation unit 201 may display the dimension information as at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under that feature name; (2) the number of dimensions of all features under the same feature name and/or the ratio of that number to the total feature dimension of the model; (3) the number of dimensions of the features with non-zero weight values under the same feature name and/or the ratio of that number to the total number of non-zero-weight feature dimensions of the model. The model interpretation unit 201 may display the model prediction interpretation as the features of a prediction sample and their corresponding weights. For a decision tree model, the model interpretation unit 201 may display the model prediction interpretation as the decision path of a prediction sample and its corresponding weight.
The data analysis request includes: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
The apparatus may further include a search unit 205 configured to receive a feature name search instruction and to search for a target feature name among the feature names displayed by the model interpretation content according to that instruction; the output unit 204 may then display the found target feature name and its corresponding data distribution. The output unit 204 may display the data distribution by establishing an association between the model interpretation content and the data distribution. In one example, the output unit 204 may display the data distribution near the model interpretation content.
The specific operations described above in conjunction with figs. 1 to 8 may be performed by the corresponding units of the apparatus shown in fig. 9 and are not described again here.
FIG. 10 shows a schematic block diagram of a computing device according to an embodiment of the invention.
As shown in fig. 10, the computing apparatus 300 provided according to the embodiment of the present invention may include a processor 301 and a storage unit 302, and the storage unit 302 stores therein computer-executable instructions, which when executed by the processor 301, perform the method for analyzing data based on machine learning model interpretation according to any of the embodiments described above.
In addition, according to an embodiment of the present invention, there is also provided a computer-readable storage medium having stored thereon a computer program for executing the method for analyzing data based on machine learning model interpretation described in any of the foregoing embodiments.
By adopting the present invention, a user can select data of interest from a large amount of training data for targeted analysis, effectively combining data analysis with model understanding.
The processes, methods or algorithms disclosed herein may be delivered to or implemented by a processing device, controller or computer, which may include any existing programmable or dedicated electronic control unit. Similarly, the processes, methods or algorithms may be stored as data and instructions executable by a controller or computer in a variety of forms including, but not limited to, information permanently stored on non-writable storage media such as ROM devices and information variably stored on writable storage media such as floppy diskettes, magnetic tape, CDs, RAM devices, and other magnetic and optical media. The processes, methods, or algorithms may also be implemented in software executable objects. Alternatively, the processes, methods or algorithms may be implemented in whole or in part using suitable hardware components (such as ASICs, FPGAs, state machines, controllers or other hardware components or devices), or a combination of hardware, software and firmware components.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Furthermore, features of various implementing embodiments may be combined to form further embodiments of the invention.

Claims (24)

1. A method of analyzing data based on machine learning model interpretation, comprising:
obtaining and displaying model interpretation content, wherein the model interpretation content comprises at least one of model structure interpretation, model feature importance and model prediction interpretation;
receiving a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content;
respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and
outputting the data distribution in a visual manner.
2. The method of claim 1, wherein for discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels about feature occurrences under each grouping.
3. The method of claim 1, wherein for continuous features, the data distribution is an average value map, a point-line map, and/or a kernel density estimation map obtained by counting the features grouped by labels.
4. The method according to claim 1, wherein the model is a logistic regression model, and the model structure interpretation is displayed as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
5. The method according to claim 4, wherein distribution information of non-zero weight values and/or all weight values of respective features under the same feature name is represented by a box plot, respectively, wherein the box plot includes at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
6. The method of claim 4, wherein the dimension information indicates at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
7. The method of claim 1, wherein the model is a logistic regression model and the model prediction interpretation is displayed as features of the prediction samples and their corresponding weights; alternatively, the model is a decision tree model and the model prediction interpretation is displayed as a decision path of the prediction sample and its corresponding weight.
8. The method of claim 1, wherein the data analysis request comprises: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
9. The method of claim 1, further comprising:
receiving a characteristic name searching instruction;
searching a target characteristic name in the characteristic names displayed by the model interpretation content according to the characteristic name searching instruction;
and displaying the searched target characteristic name and the corresponding data distribution.
10. The method of claim 1, wherein visually outputting the data distribution comprises: displaying the data distribution by establishing a correlation of model interpretation content with the data distribution.
11. The method of claim 1, wherein visually outputting the data distribution comprises: and displaying the data distribution near the model interpretation content.
12. An apparatus for analyzing data based on machine learning model interpretation, comprising:
the model interpretation unit is used for acquiring and displaying model interpretation contents, and the model interpretation contents comprise at least one of model structure interpretation, model feature importance and model prediction interpretation;
a receiving unit configured to receive a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content;
the calculation unit is used for respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and
an output unit for outputting the data distribution in a visualized manner.
13. The apparatus of claim 12, wherein for discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels about feature occurrences under each grouping.
14. The apparatus of claim 12, wherein for continuous features, the data distribution is an average value map, a point-line map, and/or a kernel density estimation map obtained by counting the features grouped by labels.
15. The apparatus according to claim 12, wherein the model is a logistic regression model, and the model interpretation unit interprets the model structure as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
16. The apparatus of claim 15, wherein the model interpretation unit represents distribution information of non-zero weight values and/or all weight values of the respective features under the same feature name by box plots, respectively, wherein the box plots include at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
17. The apparatus of claim 15, wherein the model interpretation unit displays the dimensional information as at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
18. The apparatus of claim 12, wherein the model is a logistic regression model, and the model interpretation unit displays the model prediction interpretation as features of prediction samples and their corresponding weights; or, the model is a decision tree model, and the model interpretation unit interprets the model prediction as a decision path of a prediction sample and its corresponding weight.
19. The apparatus of claim 12, wherein the data analysis request comprises: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
20. The apparatus of claim 12, further comprising a search unit for receiving a feature name search instruction and searching for a target feature name among the feature names displayed by the model interpretation content according to the feature name search instruction, the output unit further displaying the searched target feature name and a corresponding data distribution.
21. The apparatus of claim 12, wherein the output unit displays the data distribution by establishing a correlation of model interpretation content with the data distribution.
22. The apparatus of claim 12, wherein the output unit displays the data distribution near model interpretation content.
23. A computer readable storage medium having stored thereon a computer program for executing the method of analyzing data based on machine learning model interpretation of any of claims 1 to 11.
24. A computing device comprising a storage component and a processor, wherein the storage component has stored therein computer-executable instructions that, when executed by the processor, perform a method of analyzing data based on machine learning model interpretation as claimed in any one of claims 1 to 11.
CN201810683818.6A 2018-06-28 2018-06-28 Method and device for analyzing data based on machine learning model interpretation Active CN108960434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810683818.6A CN108960434B (en) 2018-06-28 2018-06-28 Method and device for analyzing data based on machine learning model interpretation

Publications (2)

Publication Number Publication Date
CN108960434A CN108960434A (en) 2018-12-07
CN108960434B (en) 2021-07-20

Family

ID=64487649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810683818.6A Active CN108960434B (en) 2018-06-28 2018-06-28 Method and device for analyzing data based on machine learning model interpretation

Country Status (1)

Country Link
CN (1) CN108960434B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113302632B (en) * 2019-01-28 2024-06-14 三菱电机株式会社 Development assistance device, development assistance system, and development assistance method
CN110443346B (en) * 2019-08-12 2023-05-02 腾讯科技(深圳)有限公司 Model interpretation method and device based on importance of input features
CN110660485A (en) * 2019-08-20 2020-01-07 南京医渡云医学技术有限公司 Method and device for acquiring influence of clinical index
CN111144718A (en) * 2019-12-12 2020-05-12 支付宝(杭州)信息技术有限公司 Risk decision method, device, system and equipment based on private data protection
CN114548300B (en) * 2019-12-20 2024-05-28 支付宝(杭州)信息技术有限公司 Method and device for explaining service processing result of service processing model
CN111523677B (en) * 2020-04-17 2024-02-09 第四范式(北京)技术有限公司 Method and device for realizing interpretation of prediction result of machine learning model
CN112001442B (en) * 2020-08-24 2024-03-19 北京达佳互联信息技术有限公司 Feature detection method, device, computer equipment and storage medium
CN112101574B (en) * 2020-11-20 2021-03-02 成都数联铭品科技有限公司 Machine learning supervised model interpretation method, system and equipment
CN112766415B (en) * 2021-02-09 2023-01-24 第四范式(北京)技术有限公司 Method, device and system for explaining artificial intelligence model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760950A (en) * 2016-02-05 2016-07-13 北京物思创想科技有限公司 Method for providing or obtaining prediction result and device thereof and prediction system
CN105930934A (en) * 2016-04-27 2016-09-07 北京物思创想科技有限公司 Prediction model demonstration method and device and prediction model adjustment method and device
CN107644375A (en) * 2016-07-22 2018-01-30 花生米浙江数据信息服务股份有限公司 Small trade company's credit estimation method that a kind of expert model merges with machine learning model
CN108090032A (en) * 2018-01-03 2018-05-29 第四范式(北京)技术有限公司 The Visual Explanation method and device of Logic Regression Models

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595153B2 (en) * 2010-06-09 2013-11-26 Microsoft Corporation Exploring data using multiple machine-learning models
US9886670B2 (en) * 2014-06-30 2018-02-06 Amazon Technologies, Inc. Feature processing recipes for machine learning

Also Published As

Publication number Publication date
CN108960434A (en) 2018-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant