CN113159449A - Structured data-based prediction method - Google Patents

Structured data-based prediction method

Info

Publication number
CN113159449A
CN113159449A (application CN202110521123.XA)
Authority
CN
China
Prior art keywords
feature
vector
attention
prediction
exponential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110521123.XA
Other languages
Chinese (zh)
Inventor
蔡少峰
郑凯平
陈刚
张美慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110521123.XA priority Critical patent/CN113159449A/en
Publication of CN113159449A publication Critical patent/CN113159449A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a prediction method based on structured data, belonging to the technical field of artificial-intelligence learning and prediction, and comprising the steps of: obtaining a structured data tuple x = <x_1, x_2, ..., x_j, ..., x_m>; converting each attribute value x_j into an embedded vector representation e_j; modeling the feature interactions of x based on the embedding vectors using a plurality of exponential neurons; aggregating all of the feature interactions to construct a feature vector for x; and performing classification prediction based on the feature vector. By modeling cross features with exponential neurons, the invention overcomes the limitation that the input of a logarithmic neuron must be positive, improves the flexibility and the applicable scenarios of the neuron, and improves the effectiveness of cross-feature modeling. The multi-head gated attention mechanism can dynamically and selectively model cross features of any order according to the input data, improving the accuracy and efficiency of feature modeling and, in turn, of target prediction. Dynamically capturing the interaction terms of input samples through the gating mechanism provides model-decision interpretability and new insights.

Description

Structured data-based prediction method
Technical Field
The invention relates to prediction, and in particular to a prediction method based on structured data, belonging to the technical field of artificial-intelligence learning and prediction.
Background
To date, most enterprises rely on structured data for data storage and predictive analysis. Relational database management systems (RDBMSs) have become the mainstream database systems employed in industry, and relational databases have become the de facto standard for storing and querying structured data, which is critical to the operation of most businesses. Structured data often contains a large amount of information that can be used to make data-driven decisions or to identify risks and opportunities. Extracting insights from such data for decision making requires advanced analysis, especially deep learning, which is far more complex than statistical aggregation.
Formally, structured data refers to the type of data that can be represented in a table. It can be seen as a logical table consisting of n rows (tuples/samples) and m columns (attributes/features), extracted from a relational database by core relational operations such as selection, projection and join. Predictive modeling is the learning of the functional dependence (prediction function) of the dependent attribute y on the decision attributes x, i.e., f: x → y, where x is commonly referred to as the feature vector and y is the prediction target. The main challenge in prediction over structured data is, in fact, how to model the dependencies and correlations between these attributes through cross features, the so-called feature interactions. These cross features create new features by capturing the interactions of the original input features. In particular, a cross feature may be defined as

$$\text{cross}(x) = \prod_{i=1}^{m} x_i^{w_i},$$

i.e., the product of the input features, each raised to its corresponding interaction weight. The weight w_i represents the contribution of the i-th feature to the cross feature; in the feature interaction, w_i = 0 corresponds to feature x_i being deactivated, and the interaction order of a cross feature means the number of its non-zero interaction weights w_i. This cross feature for relational modeling is the core of structured data learning, as it enables the learning model to represent more complex functions than a simple linear aggregation of the input features for predictive analysis.
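As an illustrative example (ours, not part of the original disclosure): with m = 3 input features and interaction weights w = (1, 0, 1), the cross feature reduces to a second-order term in which the second feature is deactivated:

$$x_1^{1} \cdot x_2^{0} \cdot x_3^{1} = x_1 x_3.$$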
The existing methods for relational modeling of data and for target prediction mainly fall into two classes: implicit modeling and explicit modeling. Typical implicit modeling methods are deep neural networks (DNNs), such as CNNs and LSTMs. DNNs are only suitable for certain specific data types, for example, CNNs for image applications and LSTMs for sequence data. However, applying DNNs to the structured data of relational tables may not produce meaningful results. In particular, there are inherent dependencies and correlations between the attribute values of structured data, and the interaction relationships between such attributes are essential for predictive analysis. Although in theory a DNN can approximate any objective function given sufficient data and capacity, conventional DNN network layers are additive in capturing interactions; modeling such multiplicative interactions therefore requires excessively large and increasingly hard-to-understand models, typically built from multiple layers with nonlinear activation functions between them. Previous studies also suggest that implicitly modeling such cross features with DNNs may require a large number of hidden units, which greatly increases the computational cost and also makes DNNs harder to interpret, as described in: Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. 2014. Learning Polynomials with Neural Networks. In Proceedings of the 31st International Conference on Machine Learning, ICML.
In relational analysis, a preferred alternative to DNNs is to explicitly model feature interactions to achieve better performance and interpretability in feature attribution. However, the number of possible feature interactions is combinatorially large. Thus, the core problem of explicit cross-feature modeling is how to identify the correct feature sets while determining the corresponding interaction weights. Most existing studies are limited to capturing cross features within a predefined range of interaction orders. However, as the maximum order increases, the number of cross features still grows nearly exponentially. AFN (Weiyu Cheng, Yanyan Shen, and Linpeng Huang. 2020. Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions. In 34th AAAI Conference on Artificial Intelligence.) goes further and models cross features using logarithmic neurons (J. Wesley Hines. 1996. A logarithmic neural network architecture for unbounded non-linear function approximation. In Proceedings of International Conference on Neural Networks (ICNN'96). IEEE, 1245-1250.). Each logarithmic neuron converts the features into logarithmic space, thereby turning the powers of the features into learnable coefficients; specifically, each logarithmic neuron computes

$$y = \exp\Big(\sum_{j=1}^{m} w_j \ln e_j\Big) = \prod_{j=1}^{m} e_j^{w_j}.$$

In this way, each logarithmic neuron can capture a specific feature interaction term of arbitrary order. But AFN has its inherent limitations: due to the use of the logarithmic transform, the input features of each interaction term are limited to positive values. In addition, the interaction order of each interaction term is unconstrained and remains static after training.
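The following minimal sketch (our own illustration in PyTorch, assuming a single scalar weight per feature) contrasts the two neuron types and shows why the logarithmic neuron requires positive inputs while an exponential neuron, as introduced below, does not:

```python
import torch

def log_neuron(e, w):
    # AFN-style logarithmic neuron: prod_j e_j^{w_j} = exp(sum_j w_j * ln e_j).
    # ln(e_j) is undefined for e_j <= 0, so inputs must be positive.
    return torch.exp((w * torch.log(e)).sum(dim=-1))

def exp_neuron(e, w):
    # Exponential neuron: prod_j exp(e_j)^{w_j} = exp(sum_j w_j * e_j).
    # exp(e_j) > 0 for any real e_j, so inputs may take any sign.
    return torch.exp((w * e).sum(dim=-1))

e = torch.tensor([0.5, -1.2, 2.0])   # embeddings can be negative
w = torch.tensor([1.0, 0.0, 1.0])    # w_j = 0 deactivates feature j
print(exp_neuron(e, w))              # well-defined: exp(0.5 + 2.0)
print(log_neuron(e.abs(), w))        # log neuron only works on positive inputs
```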
We believe that cross-features should only consider certain input features and that feature interactions should dynamically model a single input. The rationale is that not all input features are constructive to cross terms, and modeling with uncorrelated features may introduce noise, thereby reducing effectiveness and interpretability. In particular, the deployment of the learning model in practical applications not only emphasizes accuracy, but also emphasizes efficiency and interpretability. It is noteworthy that understanding the general behavior and overall logic of the learning model (global interpretability), and providing reasons for the particular decisions made (local interpretability), is crucial for critical decision making in high-risk applications, such as the healthcare or financial industry. Although many black-box models (e.g., DNNs) have strong predictive capabilities, they model the input in an implicit way that is confusing and sometimes may learn some unexpected patterns. In this regard, explicitly adaptively modeling feature relationships with a minimal component feature set provides reasonable a priori knowledge in terms of effectiveness, efficiency, and interpretability.
Disclosure of Invention
The present invention is directed to a prediction method based on structured data, which includes the following steps:
obtaining a structured data tuple x = <x_1, x_2, ..., x_j, ..., x_m>, where x_j represents the j-th attribute value and m represents the number of structured data attributes;

converting each attribute value x_j into an embedded vector representation e_j, j ∈ {1, 2, ..., m};

modeling the feature interactions of x based on the embedding vectors using a plurality of exponential neurons;

aggregating all of the feature interactions to construct a feature vector for x;

and performing classification prediction based on the feature vector.
Preferably, the process of converting the attribute value x_j into the embedded vector representation e_j is as follows: when x_j is numerical, its value is first scaled into the (0, 1] interval according to the attribute's value range and then multiplied by the pre-learned embedding vector; when x_j is categorical, the corresponding pre-learned embedding vector is directly indexed by its value.
Preferably, the interaction order is not fixed when modeling the feature interactions of x.
Preferably, the number of exponential neurons is K × o, where K denotes the number of attention heads, o denotes the number of exponential neurons per attention head, and K and o are both natural numbers; all exponential neurons of each attention head share the weight matrix W_att of their bilinear attention function.

The i-th exponential neuron y_i of each attention head is expressed as follows:

$$y_i = \exp\Big(\sum_{j=1}^{m} w_{ij}\, e_j\Big) = \prod_{j=1}^{m} \exp(e_j)^{w_{ij}}$$

$$\frac{\partial y_i}{\partial e_j} = w_{ij}\,\mathrm{diag}(y_i), \qquad \frac{\partial y_i}{\partial w_{ij}} = y_i \odot e_j$$

where ⊙ represents the Hadamard product, the exp(·) function and the corresponding exponent w_ij are applied element-wise, e_j ∈ R^{n_e} represents the embedding vector corresponding to the j-th attribute value of the structured data, i, j, m, n_e are natural numbers with 1 ≤ i ≤ o and 1 ≤ j ≤ m, m represents the number of structured data attributes, n_e represents the embedding size, ∂y_i/∂e_j represents the derivative of y_i with respect to e_j, ∂y_i/∂w_ij represents the derivative of y_i with respect to w_ij, and diag(·) is a diagonal matrix function;

w_i ∈ R^m represents the dynamic feature-interaction weight of y_i, which is obtained by the following formula:

$$w_i = z_i \odot v_i$$

where v_i ∈ R^m represents a learnable attention weight vector, and the gate z_i represents the attention recalibration weights, dynamically generated from the bilinear attention alignment scores as follows:

$$z_i = \alpha\text{-entmax}(s_i), \qquad s_{ij} = q_i^{\mathsf{T}}\, W_{att}\, e_j$$

where q_i ∈ R^{n_e} represents the learnable attention query vector, T denotes the transpose operation, W_att ∈ R^{n_e × n_e} represents the weight matrix of the bilinear attention function, and α-entmax(·) represents a sparse softmax whose sparsity increases with increasing α; α ≥ 1 is a hyper-parameter for controlling sparsity.
preferably, the aggregation is vector stitching.
Preferably, before performing classification prediction based on the feature vector, the nonlinear feature interactions of the elements are captured through a multi-layer perceptron (MLP) to obtain a vector representation h ∈ R^{n_h} encoding the relations:

$$h = \phi_{MLP}\big([y_1; y_2; \ldots; y_{K \cdot o}]\big)$$

where n_h denotes the size of the nonlinear feature interaction and is a natural number.
Preferably, the classification prediction is performed by the following formula:

$$\hat{y} = W h + b$$

where W ∈ R^{n_p × n_h} and b ∈ R^{n_p} represent the weight and the bias, respectively, and n_p represents the number of prediction targets.
Preferably, the target prediction is performed by combining the prediction method with a DNN.
Preferably, the v_i of the plurality of exponential neurons are summed and averaged, and the ranked result serves as the importance ranking of each attribute in the structured data for target prediction.

Preferably, the w_i of the plurality of exponential neurons are summed and averaged, and the ranked result serves as the importance ranking of each attribute value in the current tuple for the current target prediction result.
In another aspect, the present invention further provides an electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the structured-data-based prediction method described above.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the structured-data-based prediction method described above.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the aforementioned structured-data-based prediction method.
Advantageous effects
Compared with the prior art, the prediction method based on the structured data provided by the invention has the following characteristics:
1. Modeling cross features with exponential neurons overcomes the limitation that the input of a logarithmic neuron must be positive, broadening the applicable scenarios of the neuron;

2. The proposed exponential neuron can model cross features of any order, improving the effectiveness of cross-feature modeling;

3. Through the exponential neurons and the multi-head gated attention mechanism, cross features of any order can be modeled dynamically and selectively according to the input data, improving the accuracy and efficiency of feature modeling;

4. The cross-feature modeling method follows a white-box design, and the modeling process is more transparent, making the method more interpretable in relational analysis;

5. Through the gating mechanism of attention recalibration weights, the interaction terms corresponding to an input sample can be captured dynamically, which provides model-decision interpretability, earns user trust, offers new insights, and advances understanding in some fields;

6. Summing, averaging and ranking the global weights v_i of all exponential neurons deepens the understanding of the factors influencing decisions and their degrees of importance;

7. Summing, averaging and ranking the dynamic feature-interaction weights w_i of all exponential neurons deepens the understanding of the factors influencing the decision for the current input and their degrees of importance.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a preferred embodiment of the method according to the first embodiment of the present invention;
FIG. 3 shows the global feature attribution of Lime, Shap, and the method of the present invention on the datasets Frappe and Diabetes130, respectively;

FIG. 4 shows the ARM-Net (left) local feature attribution and the local feature importance weights given by Lime (top right) and Shap (bottom right) for a representative input instance on the Frappe dataset;

FIG. 5 shows the ARM-Net (left) local feature attribution and the local feature importance weights given by Lime (top right) and Shap (bottom right) for a representative input instance on the Diabetes130 dataset.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
For convenience of the following description, the structured data is represented as a logical table T comprising n rows and m columns, and each row may be expressed as a tuple (x, y) = (<x_1, x_2, ..., x_j, ..., x_m>, y), where y is the dependent attribute (prediction target), x = <x_1, x_2, ..., x_j, ..., x_m> is the decision attribute (feature vector), and x_j represents the j-th attribute value.
The embodiment of the invention realizes the prediction method based on structured data, which specifically comprises the following contents:

S1, obtaining the structured data tuple x = <x_1, x_2, ..., x_j, ..., x_m>, where x_j represents the j-th attribute value and m represents the number of structured data attributes;

for example, when a company wants to predict monthly sales, the provided x may include the attribute fields (month, regionID, storeID, productID); then m = 4, and the four attributes are the month, region ID, store ID, and product ID, respectively.

S2, converting each attribute value x_j into an embedded vector representation e_j, j ∈ {1, 2, ..., m};

any existing method may be used here to convert each attribute value of the current tuple into an embedding vector, such as an FM method, a bi-directional embedding method, etc.

Preferably, the numerical attributes and the categorical attributes of the structured data are processed separately. For the x above, the four attributes are all categorical, so the embedding vectors corresponding to each category value can be obtained through training, e.g., embedding vectors for months 1-12; when executing a prediction task, if month = 3, the embedding vector corresponding to March can be used directly.
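A minimal embedding sketch (our own illustration in PyTorch; class and parameter names are assumptions) covering both cases described above, i.e., direct lookup for categorical attributes and range scaling plus rescaling of a learned vector for numerical attributes:

```python
import torch
import torch.nn as nn

class FieldEmbedding(nn.Module):
    def __init__(self, cat_sizes, num_ranges, n_e):
        super().__init__()
        # one embedding table per categorical field, e.g. 12 entries for month
        self.cat_tables = nn.ModuleList(nn.Embedding(s, n_e) for s in cat_sizes)
        # one learnable vector per numerical field
        self.num_vecs = nn.Parameter(torch.randn(len(num_ranges), n_e))
        self.register_buffer("num_max", torch.tensor(num_ranges))

    def forward(self, x_cat, x_num):
        # categorical: directly index the pre-learned embedding by value
        e_cat = [tab(x_cat[:, j]) for j, tab in enumerate(self.cat_tables)]
        # numerical: scale into (0, 1] by the value range, then rescale
        scaled = x_num / self.num_max                 # (batch, m_num)
        e_num = scaled.unsqueeze(-1) * self.num_vecs  # (batch, m_num, n_e)
        return torch.cat([torch.stack(e_cat, dim=1), e_num], dim=1)

emb = FieldEmbedding(cat_sizes=[12, 50, 200, 1000], num_ranges=[31.0], n_e=16)
e = emb(torch.tensor([[2, 7, 42, 305]]), torch.tensor([[15.0]]))  # (1, 5, 16)
```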
S3, modeling the feature interaction of x based on the embedded vector using a plurality of exponential neurons;
exponential neurons, unlike logarithmic neurons, do not require that the input be positive, thereby relaxing the requirements on the input data, and one exponential neuron models one feature interaction, i.e., one cross feature.

Furthermore, the interaction order is not limited during modeling but is determined adaptively according to the current data, so the accuracy and efficiency of the captured feature interactions can be improved.
Further, the number of exponential neurons is set to K × o, where K represents the number of attention heads, o represents the number of exponential neurons per attention head, and K and o are natural numbers; all exponential neurons of each attention head share the weight matrix W_att of their bilinear attention function φ_att.

The i-th exponential neuron y_i of each attention head is expressed as follows:

$$y_i = \exp\Big(\sum_{j=1}^{m} w_{ij}\, e_j\Big) = \prod_{j=1}^{m} \exp(e_j)^{w_{ij}} \quad (1)$$

$$\frac{\partial y_i}{\partial e_j} = w_{ij}\,\mathrm{diag}(y_i), \qquad \frac{\partial y_i}{\partial w_{ij}} = y_i \odot e_j \quad (2)$$

where ⊙ represents the Hadamard product, the exp(·) function and the corresponding exponent w_ij are applied element-wise, e_j ∈ R^{n_e} represents the embedding vector corresponding to the j-th attribute value of the structured data, i, j, m, n_e are natural numbers with 1 ≤ i ≤ o and 1 ≤ j ≤ m, m represents the number of structured data attributes, n_e represents the embedding size, ∂y_i/∂e_j represents the derivative of y_i with respect to e_j, ∂y_i/∂w_ij represents the derivative of y_i with respect to w_ij, and diag(·) is a diagonal matrix function;

w_i ∈ R^m represents the dynamic feature-interaction weight of y_i, which is obtained by the following formula:

$$w_i = z_i \odot v_i \quad (3)$$

where v_i ∈ R^m represents a learnable attention value (weight) vector, and the gate z_i represents the attention recalibration weights, dynamically generated from the bilinear attention alignment scores as follows:

$$z_i = \alpha\text{-entmax}(s_i), \qquad s_{ij} = q_i^{\mathsf{T}}\, W_{att}\, e_j \quad (4)$$

where q_i ∈ R^{n_e} represents the learnable attention query vector, T denotes the transpose operation, W_att ∈ R^{n_e × n_e} represents the weight matrix of the bilinear attention function, and α-entmax(·) represents a sparse softmax whose sparsity increases with increasing α; α ≥ 1 is a hyper-parameter for controlling sparsity.
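A minimal PyTorch sketch (our own illustration; names and initializations are assumptions) of one attention head with o exponential neurons implementing equations (1), (3) and (4); a true α-entmax (e.g., from the third-party entmax package) should replace the softmax placeholder below to obtain genuinely sparse gates:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpNeuronHead(nn.Module):
    def __init__(self, m, n_e, o):
        super().__init__()
        self.W_att = nn.Parameter(torch.randn(n_e, n_e) * 0.01)  # shared per head
        self.q = nn.Parameter(torch.randn(o, n_e) * 0.01)        # query q_i per neuron
        self.v = nn.Parameter(torch.ones(o, m))                  # value weights v_i

    def forward(self, e):                        # e: (batch, m, n_e)
        # eq. (4): bilinear alignment scores s_ij = q_i^T W_att e_j
        s = torch.einsum("ok,kl,bml->bom", self.q, self.W_att, e)
        z = F.softmax(s, dim=-1)                 # placeholder for alpha-entmax
        w = z * self.v                           # eq. (3): w_i = z_i * v_i
        # eq. (1): y_i = exp(sum_j w_ij e_j), exp applied element-wise
        y = torch.exp(torch.einsum("bom,bml->bol", w, e))
        return y, w                              # y: (batch, o, n_e)

head = ExpNeuronHead(m=4, n_e=16, o=8)
y, w = head(torch.randn(2, 4, 16))               # y: (2, 8, 16), w: (2, 8, 4)
```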
s4, aggregating all the feature interactions to construct a feature vector of the x;
the polymerization can be carried out by various methods, such as addition and averaging, additionWeights, etc. the embodiment adopts a splicing method, namely, the feature interaction vectors output by all the exponential neurons are spliced to obtain a large vector, and for the exponential neurons, the obtained feature vector dimension is K.o.ne. The vector is too large, the nonlinear feature interaction of the vector can be further captured, and the vector dimension is reduced, for example, the vector representation h of the coding relation is obtained by using the nonlinear feature interaction of the multi-layer perceptron MLP capture element:
Figure BDA0003064014910000081
wherein n ishThe feature embedding size representing the nonlinear feature interaction is a natural number.
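A sketch of this aggregation step (our own illustration, building on the ExpNeuronHead sketch above with the same imports; layer sizes are assumptions), concatenating the K·o interaction vectors and compressing them with an MLP as in equation (5):

```python
class ARMAggregate(nn.Module):
    def __init__(self, m, K, o, n_e, n_h):
        super().__init__()
        self.heads = nn.ModuleList(ExpNeuronHead(m, n_e, o) for _ in range(K))
        self.mlp = nn.Sequential(                 # phi_MLP of eq. (5)
            nn.Linear(K * o * n_e, n_h), nn.ReLU(), nn.Linear(n_h, n_h))

    def forward(self, e):                         # e: (batch, m, n_e)
        ys = [head(e)[0] for head in self.heads]  # each (batch, o, n_e)
        cat = torch.cat(ys, dim=1).flatten(1)     # (batch, K*o*n_e)
        return self.mlp(cat)                      # h: (batch, n_h)
```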
And S5, performing classified prediction based on the feature vectors.
The classification prediction can be performed by the following formula:

$$\hat{y} = W h + b \quad (6)$$

where W ∈ R^{n_p × n_h} and b ∈ R^{n_p} represent the weight and the bias, respectively, and n_p represents the number of prediction targets. For the monthly-sales prediction task, the prediction target (total sales) can be set as a multi-class problem, for example by dividing the specific sales amount into 5 intervals. For other application scenarios, such as cancer prediction, the classification can be set as binary. That is, the number of classes (prediction targets) is set according to the specific application scenario. Taking a binary classification task as an example, the corresponding objective function is the binary cross entropy:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log \sigma(\hat{y}_i) + (1 - y_i)\log\big(1 - \sigma(\hat{y}_i)\big)\Big] \quad (7)$$

where ŷ_i and y_i are the predicted label and the true label, respectively, N is the number of training instances, i.e., training tuples, and σ(·) is the sigmoid function. With the objective function specified, popular gradient-based optimizers (e.g., SGD, or Adam (D. P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR.)) can be used to efficiently train the network of the present invention, such as the network shown in FIG. 2, and the trained network is then used to predict on input data tuples (instances).
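A minimal training sketch (our own illustration on synthetic data; the linear layer stands in for the full embedding, ARM module and prediction network described above): the binary cross entropy of equation (7) optimized with Adam:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                     # stand-in for the full network
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()             # sigmoid + binary cross entropy, eq. (7)

h = torch.randn(4096, 16)                    # stand-in feature vectors h
y = torch.randint(0, 2, (4096,)).float()     # binary labels
for step in range(100):
    loss = loss_fn(model(h).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```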
Furthermore, when the method is used in a given scenario and the network has been trained with the corresponding structured training data, the global weights v_i of all exponential neurons take definite values. The v_i of all exponential neurons are summed and averaged, and the resulting m elements are ranked; this ranking reflects the importance of each attribute for target prediction, i.e., global interpretability. Similarly, the attribute combinations involved in the feature interactions of all exponential neurons, i.e., the attribute combinations corresponding to the non-zero elements of z_i, are counted by occurrence frequency and output in order; this yields the high-frequency interaction terms (interacting attributes, frequency, and order) for the target-prediction dataset. The interacting attributes reflect attribute combinations with closely related influence, the frequency reflects the degree of influence of the corresponding high-frequency interaction term on target prediction, and the order reflects that attributes largely irrelevant to the interaction are automatically filtered out as noise, which effectively improves the efficiency of exponential-neuron interaction modeling.
Further, when the trained network is used for prediction, the gating mechanism filters noise from the input data, and the attributes attended to by each interaction (the attributes corresponding to the non-zero elements of z_i) and their proportional weights (the element values of the corresponding attributes in w_i) can be obtained. The w_i of all exponential neurons are summed and averaged, and the resulting m elements are ranked; this ranking reflects the degree of influence of each attribute value of the current input tuple on the current target prediction, i.e., local interpretability.
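The following sketch (our own illustration; tensor shapes are assumptions) shows the two aggregations just described, averaging the learned value vectors v_i for global importance and the dynamic weights w_i of one forward pass for local, per-tuple importance:

```python
import torch

def global_importance(v):            # v: (K*o, m) learned value vectors v_i
    return v.mean(dim=0)             # average over neurons -> rank attributes

def local_importance(w):             # w: (batch, K*o, m) dynamic weights w_i
    return w.mean(dim=1)             # average over neurons -> rank per tuple

imp = global_importance(torch.randn(32, 10))
print(imp.argsort(descending=True))  # global attribute importance ranking
```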
Furthermore, a deep neural network (DNN) with enough hidden units is a universal approximator and has a strong capability for capturing nonlinear feature interactions, so the method of the invention (ARM-Net for short) can be combined with a DNN for more effective prediction. In this case, the prediction result ŷ is:

$$\hat{y} = w_1\, \hat{y}_{ARM} + w_2\, \hat{y}_{DNN} + b \quad (8)$$

where w_1 and w_2 are the ensemble weights of ARM-Net and the DNN, respectively, b ∈ R^{n_p} is a bias, and n_p is the number of prediction targets of the learning task. The entire ensemble model can then easily be trained end-to-end by optimizing the objective function (e.g., equation (7) above). We denote the ensemble model of ARM-Net and DNN as ARM-Net+.
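A sketch of the ensemble of equation (8) (our own illustration; the learnable scalar ensemble weights and the two sub-networks are assumptions):

```python
import torch
import torch.nn as nn

class ARMNetPlus(nn.Module):
    def __init__(self, arm_net, dnn, n_p):
        super().__init__()
        self.arm_net, self.dnn = arm_net, dnn
        self.w1 = nn.Parameter(torch.ones(1))   # ensemble weight of ARM-Net
        self.w2 = nn.Parameter(torch.ones(1))   # ensemble weight of the DNN
        self.b = nn.Parameter(torch.zeros(n_p))

    def forward(self, x):
        # eq. (8): y = w1 * y_ARM + w2 * y_DNN + b, trainable end-to-end
        return self.w1 * self.arm_net(x) + self.w2 * self.dnn(x) + self.b
```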
The effectiveness, interpretability and efficiency of the structured data relation modeling are improved by the prediction method provided by the invention:
1. effectiveness of
Most existing feature-interaction modeling studies either statically capture the possible cross features up to a predefined maximum interaction order, or model the cross features in an implicit manner. However, for different input instances, different relationships should have different constituent attributes: some relationships are informative, while others may be nothing but noise. Therefore, modeling cross features in a static manner is not only parameter- and computation-inefficient, but may also be ineffective. In particular, the output of each exponential neuron, y_i = Π_j exp(e_j)^{w_ij}, captures one particular cross feature of arbitrary order and can represent any combination of interacting features by deactivating the others. By utilizing the proposed exponential neurons and the multi-head gated attention mechanism, the invention can model feature interactions adaptively, thereby obtaining better prediction performance.
2. Interpretability
Interpretability measures how well the decisions made by a model can be understood by humans, engendering user trust and providing new insights. There exist post-hoc interpretation methods to explain how a black-box model works, including perturbation-based methods, gradient-based methods, and attention-based methods. However, an interpretation given by another model is often unreliable and may be misleading. In contrast, the present invention follows a white-box design, and its modeling process is more transparent and thus more interpretable in relational analysis.
In particular, the interaction weight w_i of each feature-interaction term y_i is derived from an attention value vector v_i ∈ R^m that is globally shared across instances and is dynamically recalibrated by attention alignment for each instance. Thus, the shared attention value vectors encode the global interaction weights over the instance population, aligned per attribute field. Consequently, we can aggregate the value vectors v_i of all exponential neurons to obtain global interpretability: e.g., after the v_i of all exponential neurons are summed and averaged, the result indicates the general attention the invention pays to each attribute field over the population, i.e., the feature importance of the attribute field, and the ranking of this result indicates the importance ranking of the different attributes for the prediction target. At the same time, the proposed gated attention mechanism also provides local interpretability, i.e., feature attribution on a per-input basis. Notably, each exponential neuron specifies, dynamically through attention alignment, a sparse set of attribute fields to use. Thus, we can identify the dynamically captured cross features, and for each instance (i.e., one tuple of structured data) a relative feature-importance table can be obtained by aggregating the interaction weights of all exponential neurons. To understand the internal modeling process, a global/local analysis of the captured cross-feature terms can also be performed.
3. Efficiency
In addition to effectiveness and interpretability, model complexity is another important criterion for model deployment in practical applications. To simplify the analysis and reduce the number of hyper-parameters, we set the size of all embedding and attention vectors to n_e and denote the parameter scale of all MLPs in the ARM network by n_w. Recall that m, K, and o denote the number of attribute fields, of attention heads, and of exponential neurons per attention head, respectively. The embedding layer has O(M n_e) feature-embedding parameters, where M is the number of distinct features across all attribute fields, yet each instance is embedded using only its m attribute fields. Since m is usually small and vector embedding consists of simple embedding lookups and rescaling, its complexity is negligible.

For the ARM module, the K·o exponential neurons can be computed with complexity O(K o m n_e); the parameter size of the value/query vectors is O(K o n_e), and the computation of the bilinear attention alignment over all m input embeddings has complexity O(K o m n_e). For the prediction module, the complexity is O(n_w), contributed mainly by the nonlinear feature-interaction function φ_MLP of equation (5). Thus, the overall parameter size and the computational complexity for processing each input are O(m n_e + n_w) and O(K o m n_e + n_w), respectively. This is linear in the number of attribute fields and is therefore efficient and scalable.
Test results
The methods of the invention (ARM-Net, ARM-Net+) were compared with five classes of existing feature-interaction modeling methods using five real datasets: app recommendation (Frappe), movie recommendation (MovieLens), click-through-rate prediction (Avazu, Criteo), and healthcare (Diabetes130).
The statistics of the five datasets and the optimal hyper-parameters found for the ARM network of the method are shown in Table 1 (Dataset statistics and best ARM-Net configurations); the table gives the number of tuples (instances), attribute fields (Fields), and distinct features (Features) of the different datasets, together with the optimal hyper-parameters (ARM-Net hyper-parameters) of the network of the invention for each dataset.
Table 1: Dataset statistics and best ARM-Net configurations.
The five classes of feature-interaction modeling methods are as follows:

(1) Linear Regression (LR), which linearly aggregates the input attributes with their respective importance weights without considering feature interactions;

(2) methods modeling second-order feature interactions, i.e., FM and AFM;

(3) methods capturing higher-order feature interactions, i.e., HOFM, DCN, CIN and AFN;

(4) neural-network-based methods, i.e., DNN and the graph neural networks GCN and GAT;

(5) models integrating explicit cross-feature modeling with implicit feature-interaction modeling by DNNs, i.e., Wide & Deep, KPNN, NFM, DeepFM, DCN+, xDeepFM and AFN+.
AUC (area under the ROC curve; larger is better) and Logloss (cross entropy; smaller is better) are used as evaluation metrics. For AUC and Logloss, an improvement at the 0.001 level is considered significant on the benchmark datasets used. We split each dataset 8:1:1 for training, validation and testing, respectively, report the average of the evaluation metrics over five independent runs, and adopt early stopping on the validation set.
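A small sketch (our own illustration with toy values) of how the two metrics can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.65, 0.3, 0.9])     # sigmoid outputs of the model
print("AUC:    ", roc_auc_score(y_true, y_prob))  # larger is better
print("Logloss:", log_loss(y_true, y_prob))       # smaller is better
```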
In the tests, the Adam optimizer is adopted, the learning-rate search range is 0.1 to 1e-3, and the batch size for all models is 4096. In particular, we use a batch size of 1024 for the smaller dataset Diabetes130 and evaluate every 1000 training steps for the larger dataset Avazu. The experiments were performed on a server with a Xeon(R) Silver 4114 CPU @ 2.2 GHz (10 cores), 256 GB of memory, and a GeForce RTX 2080Ti. The models were implemented in PyTorch 1.6.0 with CUDA 10.2.
The results of the comparison are shown in Table 2 (Overall prediction performance with the same training settings).
As can be seen from Table 2:
1. Explicit interaction modeling with a single model.

ARM-Net is compared with baseline models of a single structure that explicitly capture first-, second- and higher-order cross features. Based on the results in Table 2, we make the following observations:

First, ARM-Net consistently outperforms the baseline models with explicit interaction modeling in AUC. The better prediction performance demonstrates the effectiveness of ARM-Net across datasets and domains, including app recommendation (Frappe), movie-tag recommendation (MovieLens), click-through-rate prediction (Avazu and Criteo), and medical readmission prediction (Diabetes130).

Second, higher-order models (e.g., HOFM and CIN) generally achieve better prediction performance than lower-order models (e.g., LR and FM), which confirms the importance of higher-order cross features for prediction; the absence of higher-order cross features can greatly reduce a model's modeling capability.

Third, both AFN and ARM-Net are significantly superior to the fixed-order baseline models, which confirms the effectiveness of modeling arbitrary-order feature interactions in an adaptive, data-driven manner.

Finally, the AUC of ARM-Net is significantly higher than that of AFN, the generally best-performing baseline model.
Table 2: Overall prediction performance with the same training settings.
The good performance of the ARM network is mainly attributed to the exponential neurons and the gated attention mechanism. Specifically, the positive-input restriction of the logarithmic transformation in AFN limits its representation, whereas ARM-Net avoids this problem by modeling feature interactions in the exponential space. Furthermore, the multi-head gated attention of ARM-Net does not model interactions statically as AFN does, but selectively filters noise features and dynamically generates interaction weights reflecting the characteristics of each input instance. ARM-Net can thus capture more effective cross features on a per-input basis to achieve better prediction performance, and thanks to this runtime flexibility its parameters are used more efficiently. As shown in Table 1, the best ARM-Net requires only tens to hundreds of exponential neurons for datasets of different sizes, while the best AFN typically requires more than a thousand neurons to achieve its best results; e.g., on the large dataset Avazu, the ARM network and AFN require 32 and 1600 neurons, respectively.
2. Neural-network-based models and ensemble models.

Based on the results in Table 2, we make the following observations:

(1) Although they do not explicitly model feature interactions, the best neural-network-based models generally have stronger prediction performance than the other single-structure baseline models. In particular, the attention-based graph network GAT obtains a significantly higher AUC on Avazu and Diabetes130 than the other single-structure models. However, its performance is not as stable as ARM-Net's and varies widely across datasets; e.g., GAT performs much worse than DNN and ARM-Net on Frappe and MovieLens.

(2) Ensembling with DNNs significantly improves the respective models' prediction performance. This can be observed consistently across the baseline models, e.g., DCN+, xDeepFM and AFN+, suggesting that the nonlinear interactions captured by DNNs are complementary to the explicitly captured interactions.

(3) ARM-Net achieves performance comparable to DNN, and ARM-Net+ further improves on it, achieving the best overall performance on all benchmark datasets.

Taken together, these results further demonstrate the effectiveness of ARM-Net in selectively and dynamically modeling arbitrary-order feature interactions.
Results of the interpretability tests
The present invention demonstrates the interpretability results of ARM-Net through app-usage prediction on Frappe and readmission prediction for diabetic patients on Diabetes130, two representative areas. Specifically, the learning task on Frappe is to predict the usage state of an application given a usage context. The context comprises 10 attribute fields, {user_id, item_id, daytime, weekday, weekend, location, is_free, weather, country, city}, which mainly describe the usage patterns of mobile users. For Diabetes130, the learning task is to predict the likelihood of readmission by analyzing factors and other information related to diabetic patients' readmission. There are 43 attribute fields for prediction, of which we show the 10 most important for illustration. The interpretations of the attribute fields of both datasets are public (Linas Baltrunas, Karen Church, Alexandros Karatzoglou, and Nuria Oliver. 2015. Frappe: Understanding the Usage and Perception of Mobile App Recommendations In-The-Wild. arXiv preprint arXiv:1505.03014 (2015); Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John Clore. 2014. Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records. BioMed Research International 2014 (2014)).

For both datasets, the global feature importance of the various attribute fields, obtained by aggregating the value vectors of the exponential neurons, is demonstrated first, and the global feature attribution of ARM-Net is compared with two widely adopted interpretation methods, Lime (Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD. ACM, 1135-1144.) and Shap (Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30. 4765-4774.). These two methods identify the feature importance of the model to be interpreted via input perturbation, based on linear regression and game theory, respectively. Specifically, the interpretation results of Lime and Shap on the Frappe and Diabetes130 datasets are based on the best-performing single-structure baseline models, DNN and GAT (Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR.), respectively, and the global feature importance given by the two methods is obtained by aggregating the local feature attributions over all instances of the test dataset. We then display the top-level interaction terms (Interaction Term) captured by ARM-Net with their corresponding frequency (Frequency) and order (Orders), which represent the average number of occurrences per instance and the number of features captured by each interaction term, respectively. We also illustrate local interpretation by showing the feature-interaction weights assigned by the ARM module, and again compare the ARM-Net local feature attribution results with Lime and Shap.
Global interpretability. We illustrate the global feature attribution in FIG. 3 and summarize the high-frequency interaction terms captured by ARM-Net on the two datasets in Table 3 and Table 4, respectively.
Table 3: Top Global Interaction Terms for Frappe.
Table 4: Top Global Interaction Terms for Diabetes130.
From FIG. 3, it can be seen that the most important features identified by ARM-Net on the Frappe dataset are {user_id, item_id, is_free}. The global attention to these attributes is justified, because user_id and item_id identify the user and the item, the two main features used in learning tasks such as collaborative filtering, and is_free indicates whether the user paid for the application, which is highly related to the user's preference for it. Similarly, on the Diabetes130 dataset, the most important features determined by ARM-Net include {emergency score, hospitalization score, number of diagnoses}, which is consistent with the attribute-field coefficients estimated for the logistic regression model in the literature (Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John Clore. 2014. Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records. BioMed Research International 2014 (2014)). We also note that the global feature importance provided by ARM-Net is consistent with the two common interpretation methods (i.e., Lime and Shap). At the same time, the global feature importance provided by ARM-Net is relatively more reliable, because ARM-Net inherently supports global feature attribution and its modeling process is more transparent, whereas Lime and Shap are generally used as a medium to interpret other "black-box" models by approximation.
From the top global interaction terms on the Frappe dataset in Table 3, it can be found that: First, the attribute fields most frequently modeled in the interaction terms include user_id, item_id, and is_free, which is consistent with the global feature importance in FIG. 3. Second, these interaction terms occur repeatedly in the interaction modeling; e.g., the frequencies of the interaction terms (weekday, location, is_free), (item_id, is_free, city) and (user_id, is_free) are 3.71, 3.36 and 2.88, respectively, indicating that these cross features are used multiple times (with different interaction weights) in each instance (note that the inference for each instance involves K·o interaction terms). Third, the orders of the interaction terms are mostly 2 and 3, which suggests that it is necessary to identify a suitable set of attributes for interaction modeling; capturing cross features by enumerating all possible feature combinations is extremely inefficient and ineffective and may introduce noise.
From the top-level global interaction terms listed in Table 4 for the Diabetes130 dataset, it can be observed that the attribute fields most commonly modeled in the interaction terms are quite diverse, indicating that different exponential neurons indeed capture different cross features, which is more parameter-efficient when modeling feature interactions. Furthermore, the order of the top-level interaction terms is less than 3, and there are many first-order terms, which indicates that for some datasets, such as Diabetes130, it may not be necessary to model high-order cross features.
Local interpretability. FIG. 4 shows the local feature attribution of ARM-Net for one representative input instance on the Frappe dataset, where the interaction weights of three representative exponential neurons and the average weights over all neurons are shown. We note that different exponential neurons selectively capture different cross features in a sparse manner. For example, Neuron3 captures the feature interaction term (item_id, weekend, country), which indicates that Neuron3 responds to these three attributes for this particular instance. In addition, the aggregated interaction weights show that item_id, is_free and user_id are the three most discriminative attributes of this instance, consistent with the global interpretation results in FIG. 3. We also show the local feature attribution given by Lime and Shap. While both Lime and Shap agree with ARM-Net in taking item_id, user_id and city as the three most important features, Lime also assigns large importance weights to other features, such as is_free and country. This indicates that external interpretation methods may be neither consistent nor necessarily reliable, as they are only approximations of the model to be interpreted.
FIG. 5 shows similar local feature attribution results on the Diabetes130 dataset. We can see that different exponential neurons focus on different cross features. Specifically, Neuron1 and Neuron2 focus more on emergency_score and diag_1_category, respectively, and Neuron3 focuses more on num_diagnoses. Additionally, for this particular diabetic patient, the last five features, namely emergency_score, inpatient_score, diag_1_category, num_diagnoses, and diabetes_med, are the most useful attributes for readmission prediction. With this local interpretation, ARM-Net can support more personalized analysis and management.
As machine learning models play increasingly important roles in fields such as healthcare, financial investment, and recommender systems, the demands on model transparency and interpretability grow ever higher; this helps in debugging learning models and also in their verification and improvement. Furthermore, an interpretable model can facilitate understanding in certain areas, so that trust can be placed in the analysis results.
A simple and effective approach to either global or local interpretability is feature attribution, which determines the feature importance for an input instance based on the weights and magnitudes of the features used. Notably, based on a game-theoretic model, the Shapley value assesses the importance of each feature in the prediction, and LIME locally approximates the model with a linear model via input perturbation, thereby providing a local interpretation that is not limited to a specific model. Grad-CAM provides, for CNN-based models, a visual interpretation based on gradient-weighted class-activation mapping to highlight local regions.
Meanwhile, model interpretation methods for specific fields have been proposed in combination with domain expertise. For example, in the fields of medical analysis and finance, deep models are increasingly employed to achieve high prediction performance; however, these critical and high-risk applications underscore the need for interpretability. In particular, attention mechanisms are widely employed to facilitate the interpretability of deep models by visualizing the attention weights. By integrating attention mechanisms into the model design, many studies have successfully achieved interpretable medical analysis. In particular, Dipole supports visit-level interpretation in diagnosis prediction with three attention mechanisms, and RETAIN and TRACER support interpretation at the visit level and the feature level. However, one inherent limitation of most existing methods is that their interpretability is based on single input features, ignoring the feature interactions necessary for relational analysis.
Feature interaction modeling. Cross features explicitly model feature interactions between attribute fields by multiplying the corresponding constituent features, which is important for predictive analysis in different applications, such as app recommendation and click prediction. Many existing efforts use DNNs to implicitly capture cross features. However, implicitly modeling multiplicative feature interactions with DNNs requires a large number of hidden units, which makes the modeling process inefficient and difficult to interpret in practice.
Many models have been proposed to explicitly capture cross features, which generally leads to better prediction performance. Among these studies, some models capture second-order feature interactions, and others model higher-order feature interactions within a predefined maximum order. The recent work AFN proposes modeling arbitrary-order cross features with logarithmic neurons, but it suffers from the input limitation of the logarithmic transformation and from limited runtime flexibility. The ARM-Net of the invention provides a method for adaptively modeling feature interactions with exponential neurons based on a gated multi-head attention mechanism; the model is accurate, efficient, and highly interpretable. The core idea is to selectively and dynamically model attribute dependencies and correlations through cross features. The input features are first converted into an exponential space, and then the interaction weight and interaction order of each cross feature are determined adaptively. To dynamically model arbitrary-order cross features and selectively filter noise features, we propose a new sparse attention mechanism to generate the interaction weights for a given input tuple. Therefore, ARM-Net can identify the most informative cross features in an input-aware manner, obtaining more accurate predictions and better interpretability during inference. Extensive experimental studies on real datasets confirm that, compared with existing models, ARM-Net consistently delivers superior prediction performance, global interpretability, and local interpretability for individual instances.
The units described in the embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. Where the name of an element does not constitute a limitation on the element itself.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A prediction method based on structured data, characterized by comprising:

obtaining a structured data tuple x = <x_1, x_2, ..., x_j, ..., x_m>, where x_j represents the j-th attribute value and m represents the number of structured data attributes;

converting each attribute value x_j into an embedded vector representation e_j, j ∈ {1, 2, ..., m};

modeling feature interactions of x based on the embedding vectors using a plurality of exponential neurons;

aggregating all of the feature interactions to construct a feature vector for x;

and performing classification prediction based on the feature vector.
2. The method of claim 1, wherein the process of converting the attribute value x_j into the embedded vector representation e_j is as follows: when x_j is numerical, its value is first scaled into the (0, 1] interval according to the attribute's value range and then multiplied by the pre-learned embedding vector; when x_j is categorical, the corresponding pre-learned embedding vector is directly indexed by its value.
3. The method of claim 1 or 2, wherein the order is not fixed when modeling the feature interactions of x.
4. The method of claim 3, wherein the number of exponential neurons is K × o, where K denotes the number of attention heads, o denotes the number of exponential neurons per attention head, and K and o are both natural numbers; all exponential neurons of each attention head share the weight matrix $W_{att}$ of their bilinear attention function.
The i-th exponential neuron $y_i$ of each attention head is expressed as follows:
$y_i = \exp\Big(\sum_{j=1}^{m} w_{ij}\, e_j\Big) = \bigodot_{j=1}^{m} \exp(e_j)^{w_{ij}}$
wherein $\odot$ denotes the Hadamard product, the $\exp(\cdot)$ function and the corresponding exponent $w_{ij}$ are applied element-wise, $e_j \in \mathbb{R}^{n_e}$ denotes the embedding vector corresponding to the j-th attribute value of the structured data, i, j, m and $n_e$ are natural numbers with 1 ≤ i ≤ o and 1 ≤ j ≤ m, m denotes the number of structured data attributes, and $n_e$ denotes the embedding size; the derivatives of $y_i$ with respect to $e_j$ and with respect to $w_{ij}$ are
$\frac{\partial y_i}{\partial e_j} = w_{ij}\,\mathrm{diag}(y_i), \qquad \frac{\partial y_i}{\partial w_{ij}} = y_i \odot e_j,$
where $\mathrm{diag}(\cdot)$ is a diagonal matrix function;
$w_i \in \mathbb{R}^{m}$ denotes the exponent vector of $y_i$ and is obtained by the following formula:
$w_i = z_i \odot v_i$
wherein $v_i \in \mathbb{R}^{m}$ denotes a learnable attention weight vector, and $z_i$, acting as a gate, denotes the attention re-alignment weights, dynamically generated from the bilinear attention alignment scores as follows:
$s_{ij} = q_i^{\top} W_{att}\, e_j, \qquad z_i = \alpha\text{-entmax}(s_i),$
wherein $q_i \in \mathbb{R}^{n_e}$ denotes a learnable attention query vector, $\top$ denotes the transpose operation, $W_{att} \in \mathbb{R}^{n_e \times n_e}$ denotes the weight matrix of the bilinear attention function, and $\alpha\text{-entmax}(\cdot)$ denotes a sparse softmax whose sparsity increases with increasing α, α ∈ [1, +∞) being a hyper-parameter for controlling the sparsity.
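A sketch of a single attention head of this claim, assuming PyTorch and hypothetical dimensions; ordinary softmax stands in for the claimed α-entmax (the sparse softmax variant), which would be swapped in to obtain true sparsity:

```python
import torch
import torch.nn as nn

class ExponentialAttentionHead(nn.Module):
    """One head with o exponential neurons over m attribute embeddings."""
    def __init__(self, m, n_e, o):
        super().__init__()
        self.q = nn.Parameter(torch.randn(o, n_e))        # query vectors q_i
        self.W_att = nn.Parameter(torch.randn(n_e, n_e))  # shared bilinear matrix
        self.v = nn.Parameter(torch.randn(o, m))          # value weight vectors v_i

    def forward(self, e):                 # e: (m, n_e) embedded tuple
        s = self.q @ self.W_att @ e.T     # (o, m) scores  s_ij = q_i^T W_att e_j
        z = torch.softmax(s, dim=-1)      # gate; the claim uses alpha-entmax here
        w = z * self.v                    # (o, m) exponents  w_i = z_i ⊙ v_i
        return torch.exp(w @ e)           # (o, n_e) neurons  y_i = exp(Σ_j w_ij e_j)
```

Swapping `torch.softmax` for an α-entmax implementation is the only change needed to recover the sparsity-controlled gating of the claim.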
5. The method of claim 4, wherein the aggregation is vector concatenation.
6. The method of claim 5, wherein, before the classification prediction is performed based on the feature vector, the nonlinear feature interactions of the tuple are captured by a multi-layer perceptron MLP, obtaining a vector representation h that encodes these relations:
$h = \mathrm{MLP}\big([y_1; y_2; \ldots; y_{K \times o}]\big), \qquad h \in \mathbb{R}^{n_h},$
wherein $n_h$ denotes the size of the nonlinear feature interaction and is a natural number.
7. The method of claim 6, wherein the classification prediction is performed by:
$\hat{y} = W h + b,$
wherein $W \in \mathbb{R}^{n_p \times n_h}$ and $b \in \mathbb{R}^{n_p}$ denote the weight and the bias, respectively, and $n_p$ denotes the number of prediction targets.
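Claims 6 and 7 together form the prediction head: the K·o neuron outputs are concatenated, encoded by an MLP into $h \in \mathbb{R}^{n_h}$, and mapped linearly to $n_p$ targets. A sketch under assumed sizes (all values hypothetical):

```python
import torch
import torch.nn as nn

K, o, n_e, n_h, n_p = 4, 8, 16, 64, 2   # hypothetical sizes
mlp = nn.Sequential(nn.Linear(K * o * n_e, n_h), nn.ReLU())  # claim 6
head = nn.Linear(n_h, n_p)                                   # W, b of claim 7

ys = torch.randn(K * o, n_e)    # stacked exponential-neuron outputs
h = mlp(ys.flatten())           # vector representation h of size n_h
logits = head(h)                # prediction over n_p targets
```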
8. The method of claim 7, wherein the method is combined with a DNN for target prediction.
9. The method of any one of claims 3-8, wherein the v_i of the plurality of exponential neurons are summed and averaged, and the resulting values are ranked as the degree of influence of each attribute in the structured data on the target prediction.
10. The method of any one of claims 3-8, wherein the w_i of the plurality of exponential neurons are summed and averaged, and the resulting values are ranked as the degree of influence of each attribute value in the current tuple on the target prediction result.
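A sketch of the interpretability readouts of claims 9 and 10: averaging the learnable value vectors v_i over all neurons gives a global, input-independent attribute ranking, while averaging the gated exponents w_i computed for one tuple gives a local ranking for that instance (shapes and names hypothetical):

```python
import torch

# v: (K*o, m) learnable value vectors; w: (K*o, m) gated exponents for one tuple
v = torch.randn(32, 10)
w = torch.randn(32, 10)

global_importance = v.mean(dim=0)   # claim 9: per-attribute average over neurons
local_importance = w.mean(dim=0)    # claim 10: per attribute value, this tuple only

print(torch.argsort(global_importance, descending=True))  # global attribute rank
print(torch.argsort(local_importance, descending=True))   # local rank for the tuple
```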
CN202110521123.XA 2021-05-13 2021-05-13 Structured data-based prediction method Pending CN113159449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110521123.XA CN113159449A (en) 2021-05-13 2021-05-13 Structured data-based prediction method

Publications (1)

Publication Number Publication Date
CN113159449A (en) 2021-07-23

Family

ID=76874739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110521123.XA Pending CN113159449A (en) 2021-05-13 2021-05-13 Structured data-based prediction method

Country Status (1)

Country Link
CN (1) CN113159449A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115203471A (en) * 2022-09-15 2022-10-18 山东宝盛鑫信息科技有限公司 Attention mechanism-based multimode fusion video recommendation method
CN115203471B (en) * 2022-09-15 2022-11-18 山东宝盛鑫信息科技有限公司 Attention mechanism-based multimode fusion video recommendation method
CN117555049A (en) * 2024-01-09 2024-02-13 成都师范学院 Lightning proximity forecasting method and device based on space-time attention gate control fusion network
CN117555049B (en) * 2024-01-09 2024-03-29 成都师范学院 Lightning proximity forecasting method and device based on space-time attention gate control fusion network

Similar Documents

Publication Publication Date Title
Divakaran et al. Temporal link prediction: A survey
Bacchi et al. Machine learning in the prediction of medical inpatient length of stay
Liu et al. Costco: A neural tensor completion model for sparse tensors
Cai et al. Arm-net: Adaptive relation modeling network for structured data
Buddhakulsomsiri et al. Association rule-generation algorithm for mining automotive warranty data
US11989667B2 (en) Interpretation of machine leaning results using feature analysis
US20040049473A1 (en) Information analytics systems and methods
CN113159450A (en) Prediction system based on structured data
US10956825B1 (en) Distributable event prediction and machine learning recognition system
CN112598111B (en) Abnormal data identification method and device
CA3080840A1 (en) System and method for diachronic machine learning architecture
CN113159449A (en) Structured data-based prediction method
Ye et al. Bug report classification using LSTM architecture for more accurate software defect locating
EP4437702A1 (en) System and methods for monitoring related metrics
Shi et al. Learned index benefits: Machine learning based index performance estimation
CN115080587B (en) Electronic component replacement method, device and medium based on knowledge graph
Zhao et al. AMEIR: Automatic behavior modeling, interaction exploration and MLP investigation in the recommender system
Strickland Data analytics using open-source tools
CN113191441A (en) Adaptive relation modeling method for structured data
CN110740111B (en) Data leakage prevention method and device and computer readable storage medium
Hanif Applications of data mining techniques for churn prediction and cross-selling in the telecommunications industry
Singh Learn PySpark: Build Python-based Machine Learning and Deep Learning Models
Alshara [Retracted] Multilayer Graph‐Based Deep Learning Approach for Stock Price Prediction
Xu et al. Dr. right!: Embedding-based adaptively-weighted mixture multi-classification model for finding right doctors with healthcare experience data
Sayeed et al. Smartic: A smart tool for Big Data analytics and IoT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination