CN108960434B - Method and device for analyzing data based on machine learning model interpretation


Info

Publication number
CN108960434B
Authority
CN
China
Legal status: Active
Application number
CN201810683818.6A
Other languages
Chinese (zh)
Other versions
CN108960434A (en)
Inventor
方荣
李福龙
杨慧斌
詹镇江
Current Assignee
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201810683818.6A priority Critical patent/CN108960434B/en
Publication of CN108960434A publication Critical patent/CN108960434A/en
Application granted granted Critical
Publication of CN108960434B publication Critical patent/CN108960434B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a method and apparatus for analyzing data based on machine learning model interpretation. A method of analyzing data based on machine learning model interpretation includes: obtaining and displaying model interpretation content, wherein the model interpretation content comprises at least one of a model structure interpretation, model feature importance, and a model prediction interpretation; receiving a data analysis request made by a user for at least one of the feature names displayed in the model interpretation content; calculating, for each feature name of the at least one feature name, the data distribution in the training samples of all features under that feature name; and outputting the data distribution in a visualized manner.

Description

Method and device for analyzing data based on machine learning model interpretation
Technical Field
The invention relates to the field of machine learning, in particular to a method and a device for analyzing data based on machine learning model interpretation.
Background
The conventional machine learning modeling process requires a significant amount of time for data analysis in order to decide how to perform feature processing. In practical problems, however, the amount of data to be analyzed is often very large. Analyzing all of it indiscriminately is a cumbersome and time-consuming process, while analyzing it selectively may require the user to have considerable business experience.
In addition, if the effect of a model is known only through its evaluation metrics, the model is effectively a black box, and the user does not understand what the model actually expresses.
Therefore, in the prior art, there is a lack of a solution that can effectively analyze data and understand a model during machine learning.
Disclosure of Invention
In order to solve the above problems, the present invention proposes a method of analyzing data based on machine learning model interpretation.
According to the present invention, there is provided a method of analyzing data based on machine learning model interpretation, the method may comprise: obtaining and displaying model interpretation content, wherein the model interpretation content comprises at least one of model structure interpretation, model feature importance and model prediction interpretation; receiving a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content; respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and outputting the data distribution in a visual manner.
According to one embodiment of the invention, for discrete features, the data distribution may be a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels with respect to the occurrence of the feature under each group.
According to an embodiment of the present invention, for continuous features, the data distribution may be a mean-value chart, a line chart, and/or a kernel density estimation chart obtained by computing statistics of the features grouped by label.
According to an embodiment of the present invention, the model may be a logistic regression model, and the model structure interpretation may be displayed as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
According to one embodiment of the present invention, distribution information of non-zero weight values and/or all weight values of respective features under the same feature name may be represented by a box plot, wherein the box plot includes at least one of the following items: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
According to an embodiment of the invention, the dimension information may indicate at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
According to one embodiment of the invention, the model may be a logistic regression model, and the model prediction interpretation may be displayed as the features of the prediction samples and their corresponding weights; alternatively, the model may be a decision tree model, and the model prediction interpretation may be displayed as a decision path of the prediction sample and its corresponding weight.
According to an embodiment of the invention, the data analysis request may include: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
According to an embodiment of the invention, the method may further comprise: receiving a feature name search instruction; searching for a target feature name among the feature names displayed by the model interpretation content according to the feature name search instruction; and displaying the found target feature name and the corresponding data distribution.
According to an embodiment of the present invention, the step of visually outputting the data distribution may include: displaying the data distribution by establishing a correlation of model interpretation content with the data distribution.
According to an embodiment of the present invention, the step of visually outputting the data distribution may include: displaying the data distribution in the vicinity of the model interpretation content.
According to the present invention, there is provided an apparatus for analyzing data based on machine learning model interpretation, the apparatus may include: the model interpretation unit is used for acquiring and displaying model interpretation contents, and the model interpretation contents comprise at least one of model structure interpretation, model feature importance and model prediction interpretation; a receiving unit configured to receive a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content; the calculation unit is used for respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and an output unit for outputting the data distribution in a visualized manner.
According to one embodiment of the invention, for discrete features, the data distribution may be a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels with respect to the occurrence of the feature under each group.
According to an embodiment of the present invention, for continuous features, the data distribution may be a mean-value chart, a line chart, and/or a kernel density estimation chart obtained by computing statistics of the features grouped by label.
According to an embodiment of the present invention, the model may be a logistic regression model, and the model interpretation unit may interpret the model structure as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
According to an embodiment of the present invention, the model interpretation unit may represent distribution information of non-zero weight values and/or all weight values of respective features under the same feature name by box plots, respectively, wherein the box plots include at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
According to an embodiment of the present invention, the model interpretation unit may display the dimension information as at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
According to an embodiment of the present invention, the model may be a logistic regression model, and the model interpretation unit may display the model prediction interpretation as features of prediction samples and their corresponding weights; alternatively, the model may be a decision tree model, and the model interpretation unit may interpret the model prediction as a decision path of a prediction sample and its corresponding weight.
According to an embodiment of the invention, the data analysis request may include: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
According to an embodiment of the present invention, the apparatus may further include a search unit configured to receive a feature name search instruction and search for a target feature name among the feature names displayed by the model interpreted content according to the feature name search instruction, and the output unit may further display the searched target feature name and a corresponding data distribution.
According to an embodiment of the present invention, the output unit may display the data distribution by establishing a correlation of model interpretation contents with the data distribution.
According to an embodiment of the present invention, the output unit may display the data distribution near model interpretation content.
According to the present invention, there is provided a computer-readable storage medium having stored thereon a computer program for executing the method of analyzing data based on machine learning model interpretation described in any of the preceding embodiments.
According to the present invention, there is provided a computing device comprising a storage component and a processor, wherein the storage component has stored therein computer-executable instructions that, when executed by the processor, perform the method of analyzing data based on machine learning model interpretation as described in any of the preceding embodiments.
By adopting the invention, the user can select data from a large amount of training data in a targeted manner for analysis, and can effectively combine data analysis with model understanding.
Drawings
The above and/or other objects and advantages of the present invention will become more apparent by describing embodiments below with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a method of analyzing data based on machine learning model interpretation, according to an embodiment of the invention;
FIG. 2 illustrates an example of analyzing data based on model feature importance, according to an embodiment of the invention;
FIGS. 3A-3C illustrate examples of analyzing data based on model structure interpretation, according to embodiments of the invention;
FIG. 4 shows an example of model prediction result interpretation;
FIGS. 5-8 illustrate examples of the data distribution in the training samples of all features under a feature name;
FIG. 9 shows a block diagram of an apparatus for analyzing data based on machine learning model interpretation, according to an embodiment of the invention;
FIG. 10 shows a schematic block diagram of a computing device according to an embodiment of the invention.
Detailed Description
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.
A method and apparatus for analyzing data based on machine learning model interpretation according to an embodiment of the present invention will be described below with reference to the accompanying drawings.
First, a method of analyzing data based on machine learning model interpretation according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a flow diagram of a method of analyzing data based on machine learning model interpretation, according to an embodiment of the invention.
As shown in fig. 1, in step S101, model explanatory content may be acquired and displayed. The model interpretation content may include at least one of a model structure interpretation, a model feature importance, and a model prediction interpretation.
Regarding the model structure interpretation, the logistic regression model is taken as an example for explanation, and the step of obtaining and displaying the model structure interpretation of the logistic regression model comprises the following steps: obtaining model parameters of a logistic regression model, wherein the model parameters comprise all features in the logistic regression model and weight values of all the features; aggregating all the obtained characteristics in the model parameters according to the characteristic names; performing feature statistics on each feature name to obtain feature statistical information of each feature name, wherein the feature statistical information indicates distribution information of weight values of each feature under the same feature name and/or dimension information of each feature under the same feature name; and displaying the feature name and the corresponding feature statistical information through a graphical interface. The model structure interpretation of the logistic regression model may be displayed as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name. A model structure explanation of the logistic regression model will be described in detail below with reference to fig. 3A and 3B.
It should be noted that although the model structure interpretation is described using a logistic regression model as an example, the machine learning model may also be other models (such as a naive bayes model, a support vector machine model, a decision tree model, and the like).
Continuing with the example of the logistic regression model, a logistic regression model can be regarded as a series of features and their corresponding weights; accordingly, the model parameters of the logistic regression model can include the features of the model and the weight value of each feature in the model. As an example, during the training of a machine learning model a feature is typically represented as a hashed value, and accordingly the feature source data in the system may further include a mapping between the original value of a feature before hashing and its hashed value. The original value of each feature can therefore be recovered from the feature source data, together with the weight value of each feature.
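The hash-to-original mapping described above can be sketched as follows. The function name, dictionary layout, and hashed keys are illustrative assumptions, not the patent's actual storage format:

```python
# Hypothetical sketch: recovering original feature values and their weights
# from a hash-based model, assuming a stored hash -> original-value mapping.
def resolve_feature_weights(model_weights, hash_to_original):
    """model_weights: {hashed_feature: weight}; hash_to_original: {hashed: original}."""
    resolved = {}
    for hashed, weight in model_weights.items():
        # Fall back to the hashed value itself if no original is recorded.
        original = hash_to_original.get(hashed, hashed)
        resolved[original] = weight
    return resolved

weights = {0x1a2b: 0.73, 0x3c4d: -0.12}
mapping = {0x1a2b: "profession=technician", 0x3c4d: "age=30-40"}
print(resolve_feature_weights(weights, mapping))
# {'profession=technician': 0.73, 'age=30-40': -0.12}
```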
The feature name may correspond to the result of feature engineering of a field or fields of the original data sheet, which is intended to describe the behavior of an aspect of the training sample. The features of the machine learning samples are classified into discrete features and continuous features according to a feature processing method. For discrete features, each feature name corresponds to a set of features, each feature indicating a value of the feature name and corresponding to a weight value. For continuous features, there is only one feature under the feature name and its corresponding weight, and the value of the feature is continuously changed.
Because the feature scale of a logistic regression model is huge (feature dimensions in the tens of millions are common), in order to show more intuitively the different influences of different features on the model prediction result, all features in the model parameters can be aggregated according to the feature names to which they belong, so as to display the degree of difference between the weight values of the features under the same feature name; this degree of difference reflects the discriminative power of the corresponding feature name. For example, in one embodiment of the present invention, the model parameters of the logistic regression model may include the following features: profession=blue-collar worker, profession=technician, profession=service staff, profession=retiree, profession=manager; these features can be aggregated under the corresponding feature name "profession". As another example, the model parameters may include the features age=20-30, age=30-40, age=40-50, age=50-60, age=60 or above, and so on; these features can be aggregated under the feature name "age". The features included under each feature name may be counted to obtain feature statistics for each feature name of the logistic model. The distribution information of the weight values of the features under the same feature name may indicate how the weight values (all weight values, or only the non-zero ones) are distributed: in which ranges they mainly fall, whether they are concentrated or spread out, where the key value points are located, whether the distribution is uniform, and so on.
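The aggregation step can be sketched as below, assuming (purely for illustration) that features are serialized as `name=value` strings; the patent itself does not fix a serialization:

```python
from collections import defaultdict

# Sketch: group model features of the assumed form "feature_name=value"
# by their feature name, keeping each value's weight.
def aggregate_by_feature_name(feature_weights):
    groups = defaultdict(dict)
    for feature, weight in feature_weights.items():
        name, _, value = feature.partition("=")
        groups[name][value] = weight
    return dict(groups)

weights = {
    "profession=blue-collar": 0.4, "profession=technician": 0.1,
    "profession=retired": -0.2, "age=20-30": 0.3, "age=30-40": 0.0,
}
grouped = aggregate_by_feature_name(weights)
print(sorted(grouped["profession"]))  # ['blue-collar', 'retired', 'technician']
```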
The dimension information of the respective features under the same feature name may indicate the dimension information of all the features under each feature name, or the dimension information of a certain class of features (for example, valid features with a weight value different from zero) under each feature name. The dimension information can enable a user to know the effective feature dimension under the feature name, so that the user is helped to know the online performance of the model. The individual feature names and their feature statistics may be displayed via a graphical interface. Here, as an example, each feature name and its associated feature statistical information may be displayed via a chart, a graph, a character, or the like.
The dimension information of the respective features under the same feature name may indicate at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model. The feature with a non-zero weight value is also referred to as a valid feature, and accordingly, the ratio of the number of features with a non-zero weight value in the same feature name to the total number of features in the same feature name is also referred to as a valid feature ratio. For example, if 10 features are included in a feature name, wherein 3 features correspond to a weight value of 0, and the remaining 7 features correspond to weight values other than 0, the percentage of valid features of the feature name is 70%. In addition, if the total dimension number of all the features under all the feature names of one model is 100, and the dimension number of all the features under one feature name a is 20, the dimension number of all the features under the feature name a is 20, and the ratio of the dimension number to the total dimension number of the features of the model is 20%. 
In another example, if the total dimension number of all features under all feature names of one model is 100, wherein the total dimension number of features with non-zero weight values is 80 and the dimension number of features with non-zero weight values under one feature name B is 10, the dimension number of features with non-zero weight values under the feature name B is 10, and the proportion of that dimension number to the total dimension number of features with non-zero weight values of the model is 10/80 × 100% = 12.5%.
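The ratios in the two worked examples above can be checked with a short sketch; the weight list is made-up data that reproduces the stated counts:

```python
def valid_feature_ratio(weights):
    """Share of features under one feature name whose weight value is non-zero."""
    nonzero = sum(1 for w in weights if w != 0)
    return nonzero / len(weights)

# 10 features under one feature name, 3 with weight 0 -> 70% valid features
weights_a = [0.5, -0.1, 0.0, 0.2, 0.0, 0.3, -0.4, 0.1, 0.0, 0.6]
print(valid_feature_ratio(weights_a))  # 0.7

# Feature name A: 20 of the model's 100 total dimensions -> 20%
print(20 / 100)  # 0.2
# Feature name B: 10 of the model's 80 non-zero-weight dimensions -> 12.5%
print(10 / 80)   # 0.125
```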
Distribution information of non-zero weight values and/or all weight values of respective features under the same feature name may be represented by a box plot, respectively, wherein the box plot includes at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
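The five-number summary behind such a box plot can be computed as below. The quartile convention (`statistics.quantiles` with the inclusive method) is an assumption, since the patent does not specify one:

```python
import statistics

def five_number_summary(weights, nonzero_only=False):
    """Min, Q1, median, Q3, and max of the weight values under one feature name."""
    w = [x for x in weights if x != 0] if nonzero_only else list(weights)
    q1, median, q3 = statistics.quantiles(w, n=4, method="inclusive")
    return {"min": min(w), "q1": q1, "median": median, "q3": q3, "max": max(w)}

print(five_number_summary([1, 2, 3, 4, 5]))
# {'min': 1, 'q1': 2.0, 'median': 3.0, 'q3': 4.0, 'max': 5}
```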
The specific form of model feature importance is a feature together with its corresponding importance value. For a Logistic Regression (LR) model, the features are typically discrete, so the feature importance of a logistic regression model refers to the importance of a feature set. For a Gradient Boosting Decision Tree (GBDT) model, the features are generally continuous, so the feature importance of the GBDT model refers to the importance of individual features (for continuous features, the feature set and the feature can be understood as the same concept). There are various methods for calculating model feature importance. Both the logistic regression model and the GBDT model can compute feature importance from the model alone: the logistic regression model can use the variance of the weights as the measure of feature importance, and the GBDT model can sum all the gains obtained each time a feature is used for splitting as the measure of feature importance. In addition, both models can compute feature importance from the model and the samples together. For the logistic regression model, to calculate the importance of a feature, that feature can be shuffled across all samples, the area under the curve (AUC) can then be computed with the model, and the difference between this AUC and the AUC of the original model serves as the importance measure of the feature. For the GBDT model, feature importance can also be assessed from the out-of-bag error rate. Finally, model feature importance can be calculated from the samples alone: for example, an LR model is trained for each feature in the sample table, the AUC of each such model is obtained on the training data, and this AUC is used as the importance value of the feature.
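The shuffle-and-compare-AUC procedure for the logistic regression case can be sketched as follows; the toy linear scorer and the data are illustrative assumptions, not the patent's implementation:

```python
import random

def auc(scores, labels):
    """AUC as the probability a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(X, y, weights, col, seed=0):
    """AUC drop after shuffling column `col` of X; a larger drop means a more important feature."""
    score = lambda rows: [sum(w * v for w, v in zip(weights, r)) for r in rows]
    base = auc(score(X), y)
    shuffled = [r[col] for r in X]
    random.Random(seed).shuffle(shuffled)
    Xp = [r[:col] + [s] + r[col + 1:] for r, s in zip(X, shuffled)]
    return base - auc(score(Xp), y)

X = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.3], [0.2, 0.0]]
y = [1, 1, 0, 0]
print(auc([1.0, 0.9, 0.1, 0.2], y))  # 1.0: these scores separate the classes perfectly
```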
With respect to model prediction interpretation, if the model is a logistic regression model, the model prediction interpretation is displayed as the features of the prediction samples and their corresponding weights. If the model is a decision tree model, the model prediction interpretation is displayed as a decision path of the prediction sample and its corresponding weight.
In step S102, a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content may be received. The data analysis request may include a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content. In one example, the data analysis request may be a hover operation made for a model structure interpretation of the logistic regression model for at least one of the displayed feature names. In another example, the data analysis request may be a click operation made on at least one of the displayed feature names for model predictive interpretation of the decision tree model.
In step S103, the data distribution in the training samples of all features under each of the at least one feature name can be calculated respectively. For discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot of the occurrences of the features under each group after grouping by label. For continuous features, the data distribution is a mean-value chart, a line chart, and/or a kernel density estimation chart obtained by computing statistics of the features grouped by label. The label is illustrated with a binary classification problem: whether a person has diabetes can be determined by analyzing indicators (i.e., feature names) such as age, sex, body mass index, mean blood pressure, and disease index, where label = 0 indicates that the person is not diseased and label = 1 indicates that the person is diseased. For the two groups, non-diseased and diseased, the data distribution of each feature of the training data (e.g., a specific age, sex, body mass index, mean blood pressure, disease index, etc.) under the corresponding group can be counted separately. Various forms of data distribution will be described below with reference to fig. 5 to 8.
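The per-label counting for a discrete feature, i.e. the data behind a stacked or grouped histogram, can be sketched as below; the sex/label toy data are illustrative:

```python
from collections import Counter

# Sketch: count occurrences of a discrete feature's values separately for
# each label group (label=0 vs. label=1 in a binary classification problem).
def grouped_counts(values, labels):
    counts = {0: Counter(), 1: Counter()}
    for v, y in zip(values, labels):
        counts[y][v] += 1
    return counts

sex = ["M", "F", "F", "M", "F", "M"]
label = [0, 1, 1, 0, 0, 1]
print(grouped_counts(sex, label))
```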
In step S104, the data distribution may be output in a visualized manner. The step of visually outputting the data distribution may include: displaying the data distribution by establishing a correlation of model interpretation content with the data distribution. Additionally, the step of visually outputting the data distribution may include: and displaying the data distribution near the model interpretation content. In one example, the data distribution may be displayed in the vicinity of the model structure interpretation. In another example, the data distribution may be displayed near the importance of the model feature.
In addition, the method of analyzing data based on machine learning model interpretation according to the present invention may further include: receiving a feature name search instruction; searching for a target feature name among the feature names displayed by the model interpretation content according to the feature name search instruction; and displaying the found target feature name and the corresponding data distribution.
FIG. 2 illustrates an example of analyzing data based on model feature importance, according to an embodiment of the invention. Specifically, an example of calculating and displaying a kernel density estimation curve for a feature name based on the importance of the model feature is shown in fig. 2.
As shown in fig. 2, the kernel density estimation curve may be displayed below the model feature importance histogram. In the model feature importance histogram in fig. 2, the left column lists feature names (e.g., f_pRank, f_ussub, f_ussex, etc.), and the number marked next to each is the feature importance coefficient of the corresponding feature name; for example, the feature importance coefficient of the feature name f_pRank is 0.5299. When a hover operation or a click operation is made on the feature name f_pRank, the kernel density estimation graph in the lower half of fig. 2 may be displayed, in which the ordinate is the kernel density value and the abscissa is the respective feature values under the feature name f_pRank, and the kernel density function is shown for both target values 1 and 0.
The model feature importance provides a basis and guidance for the user in choosing which feature name to analyze, and displaying the data distribution in association with the model feature importance also helps the user understand it intuitively. The data distribution can be calculated in real time after the user selects a feature name, or the data distributions of all features in the training samples can be calculated in one pass. Model feature importance requires some secondary processing of the model. For example, for a tree model, one way to calculate the importance of a feature is to compute the reduction in weighted impurity when the feature splits at all non-leaf nodes; a larger reduction indicates a more important feature. For a linear model, one calculation method is to perturb the feature and compute the average drop between the classification accuracy after the perturbation and the classification accuracy before it.
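The weighted-impurity-reduction ingredient mentioned for tree models can be sketched as follows, using Gini impurity as an assumed purity measure for a binary label:

```python
# Sketch: the drop in weighted Gini impurity when a node splits into two
# children. Gini is one common purity measure; the patent does not name one.
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p1 = sum(labels) / len(labels)
    return 1 - p1 ** 2 - (1 - p1) ** 2

def impurity_decrease(parent, left, right):
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = [0, 0, 1, 1]
print(impurity_decrease(parent, [0, 0], [1, 1]))  # 0.5: a perfect split removes all impurity
```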
The range of the feature importance coefficient depends on the calculation method. In a logistic regression model, the feature importance coefficient obtained solely from the AUC on the training samples ranges from 0 to 1. When the logistic regression model uses the AUC-change calculation method, its feature importance coefficient ranges from -1 to 1. The GBDT model uses the summed split-gain method, whose feature importance coefficients range from the minimum split gain to positive infinity. Model feature importance is not an absolute concept; the specific value of the feature importance coefficient is therefore not important, and what matters is the difference between the feature importance coefficients of different features.
It should be noted that the method shown in fig. 2, namely calculating and displaying a kernel density estimation curve for a feature name based on the model feature importance, is only an example, and the present disclosure is not limited thereto. In another embodiment, when a hover or click operation is made on the feature name of a discrete feature, a stacked histogram, a grouped histogram, and/or a scatter diagram of the occurrence counts of the feature within each label group may be visually output. In another embodiment, when a hover or click operation is made on the feature name of a continuous feature, an average value map and/or a point-line map obtained by statistics over the feature grouped by label may be visually output. That is, the visualized form of the data distribution is not limited to the kernel density estimation curve, and more than one kind of data distribution may be output at the same time.
Fig. 3A to 3C illustrate examples of analyzing data based on model structure interpretation according to an embodiment of the present invention. In this example, based on the model structure interpretation of a logistic regression model, a grouped histogram of the occurrence counts of the feature within each label group is calculated and visually output.
As shown in fig. 3A and 3B, the model structure interpretation of the logistic regression model is displayed as feature names, distribution information of the weight values of the features under the same feature name, and dimension information of the features under the same feature name. The distribution information of the non-zero weight values and/or all weight values of the features under the same feature name is represented by a box plot, which includes at least one of the following: the minimum value, the first quartile, the median, the third quartile, and the maximum value. The dimension information indicates at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under that feature name; (2) the number of dimensions of all features under the same feature name and/or the ratio of that number to the total feature dimension of the model; (3) the number of dimensions of the features with non-zero weight values under the same feature name and/or the ratio of that number to the total number of non-zero-weight feature dimensions of the model.
Referring to fig. 3A to 3C, the feature name t_pRank is described as an example. From fig. 3A and 3B it can be seen that the number of feature names is 40, the total feature dimension is 119369, the feature dimension of the feature name t_pRank is 71, the valid-feature ratio of t_pRank (i.e., the ratio of the number of features with non-zero weight values under t_pRank to the total number of features under t_pRank) is 100%, and the feature dimension of t_pRank divided by the total feature dimension (i.e., 71/119369) is 0.06%.
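The two ratios reported for t_pRank can be reproduced from the figures' numbers directly; the arithmetic below only restates what is displayed in fig. 3A/3B:

```python
# Figures reported for the feature name t_pRank in fig. 3A/3B
total_feature_dimension = 119369   # total feature dimension of the model
t_pRank_dimension = 71             # feature dimension under the name t_pRank
t_pRank_nonzero = 71               # here every t_pRank weight is non-zero

# Valid-feature ratio: non-zero-weight features / all features under the name
valid_feature_ratio = t_pRank_nonzero / t_pRank_dimension

# Share of the model's total dimensions taken by this feature name
dimension_share = t_pRank_dimension / total_feature_dimension
```

The valid-feature ratio comes out to 100% and the dimension share to roughly 0.06%, matching the values shown in the figures.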
In one embodiment of the present invention, the box plot processing in fig. 3A and 3B may include: arranging all weight values, or the non-zero weight values, of the features under the same feature name in ascending or descending order; and drawing the box plot of the feature name using the first weight value, the last weight value, and/or the weight values at preset quantiles of the sorted sequence as key points. The preset quantiles may be specified as needed. For example, in one embodiment the preset quantiles may include the first quartile, the median, and the third quartile: the non-zero weights under each feature name are arranged from small to large, and the box plot is drawn from five points, namely the minimum value, the 1/4 quantile, the median, the 3/4 quantile, and the maximum value. Of course, in other embodiments these five points may be replaced by other quantiles, and the number of preset quantiles may be increased, for example by adding the 10% and 90% quantiles. The box plots show the distribution of the weight values of the features under each feature name, making the distribution easy to observe. Optionally, a reference line at weight value 0 may be drawn across the box plots of the feature names, so that whether the model prediction is biased toward a positive or a negative influence can be determined by observing the offset of each box plot relative to the 0 reference line.
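The five key points described above can be sketched as follows; the quantile interpolation rule and the example weights are assumptions for illustration, not the disclosed implementation:

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an ascending-sorted list."""
    idx = q * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def box_plot_points(weights, quantiles=(0.25, 0.5, 0.75)):
    """Key points for one feature name's box plot: min, preset quantiles, max.

    Only the non-zero weights are kept, sorted ascending, as in the text.
    """
    nonzero = sorted(w for w in weights if w != 0)
    return ([nonzero[0]]
            + [quantile(nonzero, q) for q in quantiles]
            + [nonzero[-1]])

# Hypothetical weight values under one feature name (zeros are dropped)
weights = [0.0, -0.4, 0.1, 0.3, 0.0, 0.7, -0.2]
points = box_plot_points(weights)
```

Adding the 10% and 90% quantiles is just a matter of passing `quantiles=(0.1, 0.25, 0.5, 0.75, 0.9)`; comparing `points` against 0 shows whether the weights skew positive or negative, as the reference-line discussion above suggests.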
In fig. 3A and 3B, the feature names may be sorted by weight variance (the box plots in the figure). When the user selects the feature name t_pRank by a hover or click operation, data analysis is performed on the relevant features of the feature group t_pRank in the training data. In this example, the user selects a feature group f_pRank whose feature engineering is f_pRank = discrete(pRank), i.e., pRank is discretized. As shown in fig. 3C, the data distribution may be displayed as a histogram; specifically, the features may be grouped by the value of the label, and then the number of occurrences of each feature in each group is counted. In addition, in one example, the data distribution may be displayed by establishing an association between the model structure interpretation and the data distribution, so that fig. 3C is displayed close to the model structure interpretation shown in fig. 3A or fig. 3B.
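The group-by-label-then-count step behind the fig. 3C histogram can be sketched as follows; the discretized f_pRank values and labels are invented for the example:

```python
from collections import Counter

def grouped_counts(feature_values, labels):
    """Count each discrete feature value separately within each label group."""
    groups = {}
    for v, l in zip(feature_values, labels):
        groups.setdefault(l, Counter())[v] += 1
    return groups

# Hypothetical discretized f_pRank values with binary labels
f_pRank = ["bin1", "bin2", "bin1", "bin3", "bin2", "bin1"]
label   = [0,      0,      1,      1,      1,      0]

hist = grouped_counts(f_pRank, label)
```

`hist[0]` and `hist[1]` hold one bar series each; drawing them side by side per feature value gives the grouped histogram, and stacking them gives the stacked variant mentioned earlier.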
In addition, to help the user quickly obtain the data distribution of interest, one embodiment of the present invention also provides a search function. Optionally, the search function may include searching by feature name. The specific steps may include: receiving a feature name search instruction; searching for the target feature name among the feature names displayed by the model interpretation content according to the instruction; and displaying the found target feature name and its corresponding data distribution. Referring to fig. 3A and 3B, the user may enter the target feature name in a search box; the system searches among all feature names of the logistic regression model, and once the target feature name is found, it is displayed together with the corresponding data distribution. It should be noted that the search strategy may be fuzzy search or exact search.
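A minimal sketch of the two search strategies, with substring matching standing in for "fuzzy" search (the disclosure does not specify the fuzzy-matching algorithm, so this is an assumption):

```python
def search_feature_names(names, query, fuzzy=True):
    """Exact match, or substring ('fuzzy') match, over displayed feature names."""
    if fuzzy:
        return [n for n in names if query in n]
    return [n for n in names if n == query]

# Hypothetical feature names from the model interpretation content
names = ["t_pRank", "f_ussub", "f_ussex", "t_pSub"]

fuzzy_hits = search_feature_names(names, "uss")             # substring match
exact_hits = search_feature_names(names, "f_ussub", fuzzy=False)
```

The matched names would then be displayed alongside their data distributions as described above.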
Fig. 4 shows an example of model prediction result interpretation, again described using the logistic regression model above. The prediction result interpretation of a logistic regression model may be expressed as features and their weights. In fig. 4, the non-zero features and their corresponding weights are listed; the weights may be sorted from high to low so that features with high weights are easy to find. A feature may then be selected, and the data distribution of that feature calculated and displayed using the methods described above. It should be noted that although the prediction interpretation is described using a logistic regression model as an example, the present disclosure is not limited to the model prediction interpretation shown in fig. 4; for other models, one skilled in the art may calculate and display data distributions based on other forms of model prediction interpretation.
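The listing of non-zero features with weights sorted by magnitude can be sketched as below; the feature names, weights, and sample are invented for the example:

```python
def prediction_interpretation(sample, weights):
    """Non-zero feature contributions of one prediction, largest |weight| first.

    Returns (feature name, weight, value * weight) triples.
    """
    contribs = [
        (name, weights[name], value * weights[name])
        for name, value in sample.items()
        if value != 0 and weights.get(name, 0) != 0
    ]
    return sorted(contribs, key=lambda t: abs(t[1]), reverse=True)

# Hypothetical logistic regression weights and one prediction sample
weights = {"f_a": 1.2, "f_b": -0.6, "f_c": 0.0, "f_d": 0.3}
sample  = {"f_a": 1, "f_b": 1, "f_c": 1, "f_d": 0}

top = prediction_interpretation(sample, weights)
```

Features that are zero in the sample or carry zero weight contribute nothing to the score and are omitted from the display, matching the fig. 4 listing.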
Fig. 5 to 8 show examples of the data distribution of all features in a training sample under a feature name. For discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot of the occurrence counts of the features within each label group. For continuous features, the data distribution is an average value map, a point-line map, and/or a kernel density estimation map obtained by statistics over the features grouped by label. Fig. 5 shows a stacked histogram for discrete features, where the occurrence count of each feature is tallied after grouping by label. Fig. 6 shows an average value map obtained for features grouped by label. Fig. 7 shows a point-line map obtained for features grouped by label. The scatter plot in fig. 8 represents the relationship between the features and the labels.
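The per-label average underlying a fig. 6 style average value map can be sketched as follows; the continuous feature values and labels are invented for the example:

```python
def group_means(values, labels):
    """Mean of a continuous feature within each label group."""
    sums, counts = {}, {}
    for v, l in zip(values, labels):
        sums[l] = sums.get(l, 0.0) + v
        counts[l] = counts.get(l, 0) + 1
    return {l: sums[l] / counts[l] for l in sums}

# Hypothetical continuous feature values with binary labels
values = [1.0, 3.0, 10.0, 14.0]
labels = [0, 0, 1, 1]

means = group_means(values, labels)
```

Plotting one bar (or point) per label from `means` gives the average value map; connecting the per-group statistics with line segments instead gives the point-line map of fig. 7.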
Fig. 9 illustrates a block diagram of an apparatus for analyzing data based on machine learning model interpretation according to an embodiment of the present invention.
As shown in fig. 9, an apparatus for analyzing data based on machine learning model interpretation according to an embodiment of the present invention may include: a model interpretation unit 201, configured to obtain and display model interpretation content, where the model interpretation content includes at least one of model structure interpretation, model feature importance, and model prediction interpretation; a receiving unit 202, configured to receive a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content; a calculating unit 203, configured to calculate data distributions of all features in the training sample under each feature name in the at least one feature name respectively; and an output unit 204 for outputting the data distribution in a visualized manner.
Wherein, for discrete features, the data distribution may be a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels with respect to the occurrence of the features under each grouping; for continuous features, the data distribution may be an average map, a point-line map, and/or a kernel density estimation map obtained by counting features grouped by labels.
For a logistic regression model, the model interpretation unit 201 may display the model structure interpretation as feature names, distribution information of the weight values of the features under the same feature name, and/or dimension information of the features under the same feature name. The model interpretation unit 201 may represent the distribution information of the non-zero weight values and/or all weight values of the features under the same feature name by box plots, where a box plot includes at least one of: the minimum value, the first quartile, the median, the third quartile, and the maximum value. The model interpretation unit 201 may display the dimension information as at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under that feature name; (2) the number of dimensions of all features under the same feature name and/or the ratio of that number to the total feature dimension of the model; (3) the number of dimensions of the features with non-zero weight values under the same feature name and/or the ratio of that number to the total number of non-zero-weight feature dimensions of the model. The model interpretation unit 201 may display the model prediction interpretation as the features of a prediction sample and their corresponding weights. For a decision tree model, the model interpretation unit 201 may display the model prediction interpretation as the decision path of a prediction sample and its corresponding weight.
The data analysis request includes: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
The apparatus may further include a search unit 205 configured to receive a feature name search instruction and to search for a target feature name among the feature names displayed by the model interpretation content according to that instruction; the output unit 204 may then display the found target feature name and its corresponding data distribution. The output unit 204 may display the data distribution by establishing an association between the model interpretation content and the data distribution. In one example, the output unit 204 may display the data distribution near the model interpretation content.
The specific operations described above in conjunction with figs. 1 to 8 may be performed by the corresponding units of the apparatus shown in fig. 9 and are not described again here.
FIG. 10 shows a schematic block diagram of a computing device according to an embodiment of the invention.
As shown in fig. 10, the computing apparatus 300 provided according to the embodiment of the present invention may include a processor 301 and a storage unit 302, and the storage unit 302 stores therein computer-executable instructions, which when executed by the processor 301, perform the method for analyzing data based on machine learning model interpretation according to any of the embodiments described above.
In addition, according to an embodiment of the present invention, there is also provided a computer-readable storage medium having stored thereon a computer program for executing the method for analyzing data based on machine learning model interpretation described in any of the foregoing embodiments.
By adopting the present invention, a user can select data of interest from a large amount of training data for targeted analysis, effectively combining data analysis with model understanding.
The processes, methods or algorithms disclosed herein may be delivered to or implemented by a processing device, controller or computer, which may include any existing programmable or dedicated electronic control unit. Similarly, the processes, methods or algorithms may be stored as data and instructions executable by a controller or computer in a variety of forms including, but not limited to, information permanently stored on non-writable storage media such as ROM devices and information variably stored on writable storage media such as floppy diskettes, magnetic tape, CDs, RAM devices, and other magnetic and optical media. The processes, methods, or algorithms may also be implemented in software executable objects. Alternatively, the processes, methods or algorithms may be implemented in whole or in part using suitable hardware components (such as ASICs, FPGAs, state machines, controllers or other hardware components or devices), or a combination of hardware, software and firmware components.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Furthermore, features of various implementing embodiments may be combined to form further embodiments of the invention.

Claims (24)

1. A method of analyzing data based on machine learning model interpretation, comprising:
obtaining and displaying model interpretation content, wherein the model interpretation content comprises at least one of model structure interpretation, model feature importance and model prediction interpretation;
receiving a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content;
respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and
outputting the data distribution in a visual manner.
2. The method of claim 1, wherein for discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels about feature occurrences under each grouping.
3. The method of claim 1, wherein for continuous features, the data distribution is an average value map, a point-line map, and/or a kernel density estimation map obtained by counting the features grouped by labels.
4. The method according to claim 1, wherein the model is a logistic regression model, and the model structure interpretation is displayed as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
5. The method according to claim 4, wherein distribution information of non-zero weight values and/or all weight values of respective features under the same feature name is represented by a box plot, respectively, wherein the box plot includes at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
6. The method of claim 4, wherein the dimension information indicates at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
7. The method of claim 1, wherein the model is a logistic regression model and the model prediction interpretation is displayed as features of the prediction samples and their corresponding weights; alternatively, the model is a decision tree model and the model prediction interpretation is displayed as a decision path of the prediction sample and its corresponding weight.
8. The method of claim 1, wherein the data analysis request comprises: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
9. The method of claim 1, further comprising:
receiving a characteristic name searching instruction;
searching a target characteristic name in the characteristic names displayed by the model interpretation content according to the characteristic name searching instruction;
and displaying the searched target characteristic name and the corresponding data distribution.
10. The method of claim 1, wherein visually outputting the data distribution comprises: displaying the data distribution by establishing a correlation of model interpretation content with the data distribution.
11. The method of claim 1, wherein visually outputting the data distribution comprises: and displaying the data distribution near the model interpretation content.
12. An apparatus for analyzing data based on machine learning model interpretation, comprising:
the model interpretation unit is used for acquiring and displaying model interpretation contents, and the model interpretation contents comprise at least one of model structure interpretation, model feature importance and model prediction interpretation;
a receiving unit configured to receive a data analysis request made by a user for at least one of the feature names displayed by the model interpretation content;
the calculation unit is used for respectively calculating the data distribution of all the features in the training sample under each feature name in the at least one feature name; and
an output unit for outputting the data distribution in a visualized manner.
13. The apparatus of claim 12, wherein for discrete features, the data distribution is a stacked histogram, a grouped histogram, and/or a scatter plot grouped by labels about feature occurrences under each grouping.
14. The apparatus of claim 12, wherein for continuous features, the data distribution is an average value map, a point-line map, and/or a kernel density estimation map obtained by counting the features grouped by labels.
15. The apparatus according to claim 12, wherein the model is a logistic regression model, and the model interpretation unit interprets the model structure as a feature name, distribution information of weight values of respective features under the same feature name, and/or dimension information of respective features under the same feature name.
16. The apparatus of claim 15, wherein the model interpretation unit represents distribution information of non-zero weight values and/or all weight values of the respective features under the same feature name by box plots, respectively, wherein the box plots include at least one of: a minimum value, a first quartile, a median, a third quartile, and a maximum value.
17. The apparatus of claim 15, wherein the model interpretation unit displays the dimensional information as at least one of: (1) the ratio of the number of features with non-zero weight values under the same feature name to the total number of features under the same feature name; (2) the dimension number of all the features under the same feature name and/or the ratio of the dimension number to the total dimension number of the features of the model; (3) the dimension number of the feature with the weight value of non-zero under the same feature name and/or the proportion of the dimension number to the total dimension number of the feature with the weight value of non-zero of the model.
18. The apparatus of claim 12, wherein the model is a logistic regression model, and the model interpretation unit displays the model prediction interpretation as features of prediction samples and their corresponding weights; or, the model is a decision tree model, and the model interpretation unit interprets the model prediction as a decision path of a prediction sample and its corresponding weight.
19. The apparatus of claim 12, wherein the data analysis request comprises: a hover operation or a click operation made for the at least one of the feature names displayed by the model interpretation content.
20. The apparatus of claim 12, further comprising a search unit for receiving a feature name search instruction and searching for a target feature name among the feature names displayed by the model interpretation content according to the feature name search instruction, the output unit further displaying the searched target feature name and a corresponding data distribution.
21. The apparatus of claim 12, wherein the output unit displays the data distribution by establishing a correlation of model interpretation content with the data distribution.
22. The apparatus of claim 12, wherein the output unit displays the data distribution near model interpretation content.
23. A computer readable storage medium having stored thereon a computer program for executing the method of analyzing data based on machine learning model interpretation of any of claims 1 to 11.
24. A computing device comprising a storage component and a processor, wherein the storage component has stored therein computer-executable instructions that, when executed by the processor, perform a method of analyzing data based on machine learning model interpretation as claimed in any one of claims 1 to 11.
CN201810683818.6A 2018-06-28 2018-06-28 Method and device for analyzing data based on machine learning model interpretation Active CN108960434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810683818.6A CN108960434B (en) 2018-06-28 2018-06-28 Method and device for analyzing data based on machine learning model interpretation

Publications (2)

Publication Number Publication Date
CN108960434A CN108960434A (en) 2018-12-07
CN108960434B (en) 2021-07-20

Family

ID=64487649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810683818.6A Active CN108960434B (en) 2018-06-28 2018-06-28 Method and device for analyzing data based on machine learning model interpretation

Country Status (1)

Country Link
CN (1) CN108960434B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113302632B (en) * 2019-01-28 2024-06-14 三菱电机株式会社 Development assistance device, development assistance system, and development assistance method
CN110443346B (en) * 2019-08-12 2023-05-02 腾讯科技(深圳)有限公司 Model interpretation method and device based on importance of input features
CN110660485A (en) * 2019-08-20 2020-01-07 南京医渡云医学技术有限公司 Method and device for acquiring influence of clinical index
CN111144718A (en) * 2019-12-12 2020-05-12 支付宝(杭州)信息技术有限公司 Risk decision method, device, system and equipment based on private data protection
CN114548300B (en) * 2019-12-20 2024-05-28 支付宝(杭州)信息技术有限公司 Method and device for explaining service processing result of service processing model
CN111523677B (en) * 2020-04-17 2024-02-09 第四范式(北京)技术有限公司 Method and device for realizing interpretation of prediction result of machine learning model
CN112001442B (en) * 2020-08-24 2024-03-19 北京达佳互联信息技术有限公司 Feature detection method, device, computer equipment and storage medium
CN112101574B (en) * 2020-11-20 2021-03-02 成都数联铭品科技有限公司 Machine learning supervised model interpretation method, system and equipment
CN112766415B (en) * 2021-02-09 2023-01-24 第四范式(北京)技术有限公司 Method, device and system for explaining artificial intelligence model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760950A (en) * 2016-02-05 2016-07-13 北京物思创想科技有限公司 Method for providing or obtaining prediction result and device thereof and prediction system
CN105930934A (en) * 2016-04-27 2016-09-07 北京物思创想科技有限公司 Prediction model demonstration method and device and prediction model adjustment method and device
CN107644375A (en) * 2016-07-22 2018-01-30 花生米浙江数据信息服务股份有限公司 Small trade company's credit estimation method that a kind of expert model merges with machine learning model
CN108090032A (en) * 2018-01-03 2018-05-29 第四范式(北京)技术有限公司 The Visual Explanation method and device of Logic Regression Models

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595153B2 (en) * 2010-06-09 2013-11-26 Microsoft Corporation Exploring data using multiple machine-learning models
US9886670B2 (en) * 2014-06-30 2018-02-06 Amazon Technologies, Inc. Feature processing recipes for machine learning

Also Published As

Publication number Publication date
CN108960434A (en) 2018-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant