CN116467930A - Transformer-based structured data general modeling method - Google Patents


Info

Publication number
CN116467930A
CN116467930A (application number CN202310239904.9A)
Authority
CN
China
Prior art keywords
features
neural network
mlp
layer
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310239904.9A
Other languages
Chinese (zh)
Inventor
郭颖
熊媛媛
李喜武
刁克红
孙广源
梁浩然
梁荣华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Original Assignee
Zhejiang University of Technology ZJUT
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT, Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310239904.9A
Publication of CN116467930A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a Transformer-based general modeling method for structured data. Irrelevant features are first removed from the original data, different embedding methods are then applied to the category features and the numerical features, and the embedded feature vectors are spliced and input into a Transformer+ neural network (an improved Transformer) and an MLP+ neural network, where the Transformer+ neural network is formed by adding a Leaky Gate before the original Transformer and an MLP+ neural network after it. Finally, different weights are assigned to the output values of the two modules. The invention is applicable to both binary and multi-class classification problems.

Description

Transformer-based structured data general modeling method
Technical Field
The invention belongs to the field of structured data processing, and particularly relates to a Transformer-based general modeling method for structured data.
Background
Tabular data is the most common form of data and is ubiquitous in a variety of applications, such as medical diagnosis based on medical records, predictive analysis in finance, and network security. At present, tree-based ensemble methods such as gradient boosted decision trees (GBDT) are generally used and work well on tabular data: they learn continuous numerical features effectively, can automatically select and combine useful numerical features, and build decision trees efficiently by computing information gain. However, because category features are usually converted into high-dimensional sparse one-hot codes, GBDT obtains very little information gain when processing such data and cannot learn these features effectively.
In recent years, Transformer-based methods have achieved great success in computer vision and natural language processing. In computer vision, the receptive field is limited by the size of the convolution kernel, so a network often needs many stacked layers before it can attend to the whole feature map; in natural language processing, RNNs and LSTMs accumulate information across time steps, so the longer the distance between two positions, the less likely their dependency is captured effectively. Self-attention in the Transformer, by contrast, captures global attention information. It also directly improves the parallelism of computation, which is a main reason Transformers are so widely used.
The multi-layer perceptron (MLP) is perhaps the simplest and most versatile neural network. MLPs typically learn parametric embeddings to encode categorical features, but because of their relatively shallow architecture and context-free embeddings they are not robust to missing and noisy data; most importantly, in most cases MLPs do not perform as well as tree-based models.
In view of the foregoing, learning tabular data effectively while overcoming the above problems is an urgent issue for applying deep learning in the tabular domain.
Disclosure of Invention
In order to overcome the shortcomings of existing tree-based ensemble methods in tabular prediction, the invention provides a Transformer-based general modeling method for structured data.
In order to solve the technical problems, the invention is realized by adopting the following technical scheme:
A Transformer-based general modeling method for structured data comprises the following steps:
(1) Feature processing of the input public data set: after the original data is obtained, irrelevant features are removed, category features in the data are encoded into a recognizable numeric form, and the numerical features are scaled by a standardization operation;
(2) Word Embedding of the processed feature vectors: before the data passes through the encoder of the Transformer+ neural network, word Embedding projects the high-dimensional discrete data of the numerical features and the category features into a low-dimensional dense d-dimensional space;
(3) Inputting the word-embedding vectors obtained in the previous step into the two branches of the model: the model is divided into a Transformer+ neural network branch and an MLP+ neural network branch. The feature vectors of the training data after word Embedding are input into the Transformer+ neural network for learning to obtain its raw output, and the same input is fed into the MLP+ neural network for modeling and learning to obtain a trained MLP+ neural network. The Transformer+ neural network and the MLP+ neural network are fused into one classification model, so that the two raw outputs are weighted and summed to form the overall output value of the model, and the overall prediction result of the classification model is then obtained through an activation function;
(4) Training is guided by Focal Loss as the objective function: the classification model is trained with the preprocessed training data, the training process is guided by Focal Loss as the objective function, and the optimal parameters are searched to obtain a trained classification model;
(5) Receiving other tabular data for prediction: the tabular data to be classified is preprocessed and input into the trained classification model for classification prediction. An illustrative end-to-end sketch of these five steps follows.
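By way of orientation only, the following is a minimal sketch of how these five steps might be wired together in PyTorch. The names run_pipeline, model, loss_fn and the loader layout are illustrative assumptions, not part of the invention; the concrete modules are sketched under the corresponding steps below.

import torch

def run_pipeline(model, train_loader, new_loader, loss_fn, epochs=50):
    # all names here are assumptions for exposition, not the patent's code
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                          # step (4): Focal-Loss-guided training
        for x_cat, x_num, y in train_loader:         # steps (1)-(2) are done upstream
            opt.zero_grad()
            loss = loss_fn(model(x_cat, x_num), y)   # step (3): two-branch model forward
            loss.backward()
            opt.step()
    model.eval()                                     # step (5): predict on new tabular data
    with torch.no_grad():
        return torch.cat([model(x_cat, x_num) for x_cat, x_num, _ in new_loader])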
Further, in the step (1), the method for processing the input features includes the following steps:
(1-1) Removing useless features: feature recognition is carried out on each data set according to prior knowledge, and useless features are eliminated;
(1-2) Processing continuous features: the continuous features are standardized with a standard scaler, which scales the numerical features;
(1-3) Processing category features: the category features are encoded into numeric form by a label encoder (LabelEncoder); one-hot encoding is not used, to avoid the extra computational cost that sparse coding would incur. A preprocessing sketch follows.
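A minimal preprocessing sketch for step (1), assuming the data arrives as a pandas DataFrame and that the useless, category, and numerical columns are known in advance; the function name and arguments are illustrative assumptions.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def preprocess(df: pd.DataFrame, useless_cols, cat_cols, num_cols):
    df = df.drop(columns=useless_cols)                 # (1-1) remove useless features
    scaler = StandardScaler()
    df[num_cols] = scaler.fit_transform(df[num_cols])  # (1-2) scale numerical features
    for col in cat_cols:                               # (1-3) integer-encode categories
        # LabelEncoder gives compact integer codes; no one-hot, avoiding sparsity
        df[col] = LabelEncoder().fit_transform(df[col])
    return df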
Further, in the step (2), word Embedding is a technique that maps feature vectors to low-dimensional space vectors, converting discrete feature vectors into continuous vector representations. Ordinary word Embedding is applied to the category features, while a separate fully connected layer is used for the numerical features, each with a ReLU nonlinearity, so that each 1-dimensional input is projected into d-dimensional space; the embeddings of the category features and the numerical features are then concatenated along the first (feature) dimension, as in the sketch below.
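The following is a sketch of such an embedding layer in PyTorch, assuming integer-coded category features and float numerical features; the dimensions and the class name FeatureEmbedding are assumptions.

import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    def __init__(self, cat_cardinalities, n_num_features, d=32):
        super().__init__()
        # one embedding table per categorical feature
        self.cat_embeds = nn.ModuleList(
            [nn.Embedding(card, d) for card in cat_cardinalities])
        # one fully connected layer with ReLU per numerical feature
        self.num_embeds = nn.ModuleList(
            [nn.Sequential(nn.Linear(1, d), nn.ReLU())
             for _ in range(n_num_features)])

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_cat) integer codes; x_num: (batch, n_num) floats
        cat = [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)]
        num = [emb(x_num[:, i:i + 1]) for i, emb in enumerate(self.num_embeds)]
        # concatenate along the feature (first non-batch) dimension
        return torch.stack(cat + num, dim=1)    # (batch, n_features, d)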
Further, in the step (3), the neural network model includes the following parts:
(3-1) The Transformer+ neural network improves on the original Transformer by adding a Leaky Gate before the original Transformer encoder and an MLP+ neural network after it; the Leaky Gate is a combination of two simple elements, namely an element-wise linear transformation and a LeakyReLU activation function;
(3-2) The MLP+ neural network improves on the multi-layer perceptron MLP: starting from an MLP sub-block, ordinary Batch Normalization (Batch Norm) is replaced with Ghost Batch Normalization (GBN), a linear skip layer is added on the right side of the sub-block (the skip layer is just a fully connected linear layer followed by a LeakyReLU activation function), and finally a Leaky Gate is added before both the MLP sub-block and the linear skip layer. GBN allows training with large batches of data, and a major motivation for using it in the present invention is to speed up training. A sketch of both building blocks follows.
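A sketch of the two building blocks described in (3-1) and (3-2); the ghost-batch size, hidden widths, and class names are assumptions rather than values fixed by the invention.

import torch
import torch.nn as nn

class LeakyGate(nn.Module):
    # element-wise linear transform followed by LeakyReLU, as in (3-1)
    def __init__(self, n_features):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(n_features))
        self.bias = nn.Parameter(torch.zeros(n_features))
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return self.act(x * self.weight + self.bias)

class GhostBatchNorm(nn.Module):
    # batch norm applied to small "ghost" chunks of the batch (GBN)
    def __init__(self, n_features, ghost_size=128):
        super().__init__()
        self.ghost_size = ghost_size
        self.bn = nn.BatchNorm1d(n_features)

    def forward(self, x):
        chunks = x.split(self.ghost_size, dim=0)
        return torch.cat([self.bn(c) for c in chunks], dim=0)

class MLPPlusBlock(nn.Module):
    # MLP sub-block with GBN plus a linear skip layer and LeakyReLU, as in (3-2)
    def __init__(self, d_in, d_out):
        super().__init__()
        self.gate = LeakyGate(d_in)               # Leaky Gate before both paths
        self.main = nn.Sequential(
            nn.Linear(d_in, d_out), GhostBatchNorm(d_out), nn.LeakyReLU())
        self.skip = nn.Sequential(nn.Linear(d_in, d_out), nn.LeakyReLU())

    def forward(self, x):
        x = self.gate(x)
        return self.main(x) + self.skip(x)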
The invention provides a Transformer-based general modeling method for structured data in which the Transformer processes category features and numerical features simultaneously. While fully retaining the performance of the Transformer model, the Transformer and the multi-layer perceptron MLP are fused into a single model, rather than producing separate category predictions followed by weighted voting, so the model can be optimized by a loss function in end-to-end training, effectively strengthening its recognition ability. Compared with the prior art, the invention has the following positive effects:
1. The invention proposes a data processing method in which category features and numerical features enter the Transformer together, which means that no information about the correlation between category features and numerical features is lost.
2. The invention provides a Transformer-based general modeling method for structured data that effectively fuses a simpler MLP neural network with a more complex attention-based Transformer neural network, thereby learning both category features and numerical features.
3. The invention evaluates the proposed model on seven public data sets, such as fault, blastchar and shrutime, and the experimental results show that the method of the invention outperforms other state-of-the-art methods in binary classification scenarios.
Drawings
FIG. 1 is an overall framework of the method of the present invention.
FIG. 2 is the processing flow of the MLP+ neural network.
Detailed description of the preferred embodiments
The technical scheme of the invention will be clearly and completely described below with reference to the drawings in the embodiments of the application. The specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
Fig. 1 is the overall architecture diagram of the invention, a Transformer-based general modeling method for structured data, which specifically includes the following steps:
Step (1): input feature processing;
in the data layer, public data sets such as an add, a blast, a spambise and the like are used, some data sets only have numerical characteristics, some data sets contain both numerical characteristics and category characteristics, and meanwhile, the data is divided into a training set and a testing set. For different data sets, we use a priori knowledge to cull out a portion of the useless features. Since most class features are in the form of strings, they are encoded into a digital (1, 2, 3) form that the model can recognize; for numerical features, scaling is performed by taking a normalization operation.
For the original data (comprising a training set and a test set), unnecessary features are removed, the category features are numerically encoded, and the numerical features are standardized, giving a data set D = {(x_i, y_i)}, y_i ∈ [0, classnum), i = 1, 2, 3, ..., N, where x_i is the feature vector of each sample, y_i is the label corresponding to x_i, classnum is the number of classes, and N is the number of samples. Different types of features are distinguished, dividing the data into category features x_cat and numerical features x_cont.
Step (2): embedding category features and numerical features;
the embedding layer E embeds each feature into d-dimensional space, and in order to effectively process table data, the invention discriminates discrete type features and continuous numerical features. The invention obtains a new category characteristic by word Embedding technologyAn embedded representation, a new embedded representation of the numerical feature is obtained by using the fully connected layer,is a single sample with class or numerical features, the embedding layer e uses different embedding functions for different types of features for a given +.>Obtain->Then splice in the feature dimension, E Φ (X) is the result of all features being represented by the embedding.
E Φ (x)={e Φ1 (x 1 ),...,e ΦN (x N )} (1)
Step (3): inputting the embedded feature vectors into the model;
(3-1) The feature vector output in the previous step first enters a Leaky Gate, a combination of two simple elements: an element-wise linear transformation followed by a LeakyReLU activation function, which lets any positive value pass unchanged and compresses any negative value to almost zero. In other words, if w_i and b_i are the linear-layer parameters of the i-th column, the Leaky Gate of the i-th column is:

LeakyGate_i(x_i) = LeakyReLU(w_i · x_i + b_i)   (2)
the Leaky Gate is intended to act as a simple filter with different behavior for each column, with or without masking or passing depending on each individual value.
The first Transformer layer takes the output of the Leaky Gate as input and passes its output to the second Transformer layer, and so on. As shown in FIG. 1, the output of the last Transformer layer is input directly to the MLP+ neural network (the improved multi-layer perceptron MLP, shown in FIG. 2), giving the output value y_Transformer+ of the model, where θ_1, θ_2 and θ_3 are the model parameters of the Leaky Gate, the Transformer, and the MLP+ neural network, respectively:

y_Transformer+(x) = M(f_transformer(G_Θ(E_Φ(x); θ_1); θ_2); θ_3)   (3)
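As a sketch of equation (3), the branch below chains the Leaky Gate, a stack of standard Transformer encoder layers, and an MLP+ head. It reuses the LeakyGate and MLPPlusBlock sketches given earlier; the depth, width, and head layout are assumptions.

import torch
import torch.nn as nn

class TransformerPlusBranch(nn.Module):
    def __init__(self, n_features, d=32, n_heads=4, n_layers=2, n_out=1):
        super().__init__()
        # the gate's (d,)-shaped parameters broadcast over the feature axis,
        # a simplification of the per-column gate in equation (2)
        self.gate = LeakyGate(d)                          # G_Θ(·; θ_1)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # f_transformer(·; θ_2)
        self.head = nn.Sequential(                        # M(·; θ_3): an MLP+ stand-in
            nn.Flatten(), MLPPlusBlock(n_features * d, 64), nn.Linear(64, n_out))

    def forward(self, e):
        # e = E_Φ(x), shape (batch, n_features, d)
        h = self.encoder(self.gate(e))
        return self.head(h)                               # raw output y_Transformer+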
(3-2) Similarly, the feature vector output in the previous step is input into the MLP+ neural network (the right branch of FIG. 1) to obtain the output value y_MLP+ of the model:

y_MLP+ = M(E_Φ(x); θ_1)   (4)
Step (4): fusing the left and right branches;
specifically, to combine the improved converter and the improved multi-layer perceptron MLP to obtain predictions of the overall model and perform end-to-end training, the present invention assigns different weights w to the output values of the two modules 1 And w 2 (the two weights can be obtained by back propagation training learning), and the prediction probability of the final model outputAs in equation (5), σ represents the activation function (two categories are sigmoid, multiple categories are softmax).
Step (5): training the classification model based on Focal Loss;
the preprocessed data is utilized to train the model, focal Loss is used as a Loss function to guide the training process, so that the model can pay more attention to few types of samples which are difficult to classify, and deviation caused by the majority of types is reduced.
According to equation (5), the loss of the model can be expressed as equation (6), where L(·,·) denotes the loss function and y is the true label of sample x:

loss = L(ŷ, y)   (6)
To address the problem of class imbalance, the invention adopts the idea of cost sensitivity and introduces Focal Loss as the loss function of the model. Focal Loss was originally used to address class imbalance in object detection tasks and is an improvement over the conventional cross-entropy loss; the invention brings it into the tabular classification field. For the binary classification problem, Focal Loss can be expressed in the form of equation (7), where ŷ_i is the probability prediction defined in equation (5), y_i is the label of the input sample, α is a balance factor, and γ ≥ 0 is called the focusing parameter:

FL(ŷ_i, y_i) = -α (1 - ŷ_i)^γ y_i log(ŷ_i) - (1 - α) ŷ_i^γ (1 - y_i) log(1 - ŷ_i)   (7)
For the multi-class problem, the one-vs-rest idea extends equation (7) to equation (8), where y is the one-hot representation of the class labels and ŷ is the probability output of shape (m, n) (m is the number of samples and n the number of classes):

FL = -(1/m) Σ_{i=1..m} Σ_{j=1..n} α (1 - ŷ_ij)^γ y_ij log(ŷ_ij)   (8)
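Below is a sketch of the Focal Loss of equations (7) and (8); the defaults α = 0.25 and γ = 2 come from the focal-loss literature and are not values fixed by the patent.

import torch

def focal_loss_binary(y_hat, y, alpha=0.25, gamma=2.0, eps=1e-8):
    # equation (7): y in {0,1}, y_hat = predicted probability of class 1
    pos = -alpha * (1 - y_hat).pow(gamma) * y * torch.log(y_hat + eps)
    neg = -(1 - alpha) * y_hat.pow(gamma) * (1 - y) * torch.log(1 - y_hat + eps)
    return (pos + neg).mean()

def focal_loss_multiclass(y_hat, y_onehot, alpha=0.25, gamma=2.0, eps=1e-8):
    # equation (8): one-vs-rest extension; y_hat and y_onehot have shape (m, n)
    loss = -alpha * (1 - y_hat).pow(gamma) * y_onehot * torch.log(y_hat + eps)
    return loss.sum(dim=1).mean()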
Based on the loss functions defined in equations (7) and (8), end-to-end model training can be performed with gradient-based optimization, and the model with minimum loss is selected.
Example 2
The invention provides a commodity recommendation method based on the Transformer structured-data general modeling method.
FIG. 1 is a diagram of the overall architecture of the present invention, and the method comprises the following specific steps:
and (3) inputting feature processing.
In a recommendation-system application scenario, taking a commodity recommendation system as an example, the Transformer-based general modeling method for structured data classifies users according to their behavior and recommends goods of the corresponding type. At the data level, the online_shoppers public data set is used, which contains both numerical features and category features; the data is divided into a training set and a test set. For this data set, prior knowledge is used to cull a portion of the useless features. Since most category features are strings, they are encoded into a numeric form (1, 2, 3, ...) that the model can recognize; the numerical features are scaled by a standardization operation.
For the original data (comprising a training set and a test set), unnecessary features are removed, the category features are numerically encoded, and the numerical features are standardized, giving a data set D = {(x_i, y_i)}, y_i ∈ [0, classnum), i = 1, 2, 3, ..., N, where x_i is the feature vector of each sample, y_i is the label corresponding to x_i, classnum is the number of classes, and N is the number of samples. Different types of features are distinguished, dividing the data into category features x_cat and numerical features x_cont.
Step (2): embedding category features and numerical features.
The embedding layer E embeds each feature into a d-dimensional space. To process tabular data effectively, the invention distinguishes discrete category features from continuous numerical features: a new embedded representation of the category features is obtained by the word Embedding technique, and a new embedded representation of the numerical features is obtained through a fully connected layer. Let x_i = [f_i^(1), f_i^(2), ..., f_i^(n)] be a single sample with category or numerical features. The embedding layer e uses a different embedding function for each type of feature: for a given f_i^(j) it produces e_Φj(f_i^(j)), and the results are then spliced along the feature dimension. E_Φ(x) is the embedded representation of all features:

E_Φ(x) = {e_Φ1(x_1), ..., e_ΦN(x_N)}   (1)
and (3) inputting the characteristic vector embedded into the model.
(3-1) The feature vector output in the previous step first enters a Leaky Gate, a combination of two simple elements: an element-wise linear transformation followed by a LeakyReLU activation function, which lets any positive value pass unchanged and compresses any negative value to almost zero. In other words, if w_i and b_i are the linear-layer parameters of the i-th column, the Leaky Gate of the i-th column is:

LeakyGate_i(x_i) = LeakyReLU(w_i · x_i + b_i)   (2)
the Leaky Gate is intended to act as a simple filter with different behavior for each column, with or without masking or passing depending on each individual value.
The first Transformer layer takes the output of the Leaky Gate as input and passes its output to the second Transformer layer, and so on. As shown in FIG. 1, the output of the last Transformer layer is input directly to the MLP+ neural network (the improved multi-layer perceptron MLP, shown in FIG. 2), giving the output value y_Transformer+ of the neural network, where θ_1, θ_2 and θ_3 are the model parameters of the Leaky Gate, the Transformer, and the MLP+ neural network, respectively:

y_Transformer+(x) = M(f_transformer(G_Θ(E_Φ(x); θ_1); θ_2); θ_3)   (3)
(3-2) Similarly, the feature vector output in the previous step is input into the MLP+ neural network (the right branch of FIG. 1) to obtain the output value y_MLP+ of the neural network:

y_MLP+ = M(E_Φ(x); θ_1)   (4)
Step (4): fusing the left and right branches.
Specifically, to combine the improved Transformer and the improved multi-layer perceptron MLP so as to obtain predictions from the overall model and perform end-to-end training, the invention assigns different weights w_1 and w_2 to the output values of the two modules (both weights are learned by back-propagation training). The prediction probability ŷ output by the final model is given by equation (5), where σ denotes the activation function (sigmoid for binary classification, softmax for multi-class classification):

ŷ = σ(w_1 · y_Transformer+(x) + w_2 · y_MLP+(x))   (5)
Step (5): training the classification model based on Focal Loss.
The preprocessed data is used to train the model, with Focal Loss guiding the training process as the loss function, so that the model pays more attention to the minority-class samples that are difficult to classify and the bias caused by majority classes is reduced.
According to equation (5), the loss of the model can be expressed as equation (6), where L(·,·) denotes the loss function and y is the true label of sample x:

loss = L(ŷ, y)   (6)
To address the problem of class imbalance, the invention adopts the idea of cost sensitivity and introduces Focal Loss as the loss function of the model. Focal Loss was originally used to address class imbalance in object detection tasks and is an improvement over the conventional cross-entropy loss; the invention brings it into the tabular classification field. For the binary classification problem, Focal Loss can be expressed in the form of equation (7), where ŷ_i is the probability prediction defined in equation (5), y_i is the label of the input sample, α is a balance factor, and γ ≥ 0 is called the focusing parameter:

FL(ŷ_i, y_i) = -α (1 - ŷ_i)^γ y_i log(ŷ_i) - (1 - α) ŷ_i^γ (1 - y_i) log(1 - ŷ_i)   (7)
For the multi-class problem, the one-vs-rest idea extends equation (7) to equation (8), where y is the one-hot representation of the class labels and ŷ is the probability output of shape (m, n) (m is the number of samples and n the number of classes):

FL = -(1/m) Σ_{i=1..m} Σ_{j=1..n} α (1 - ŷ_ij)^γ y_ij log(ŷ_ij)   (8)
Based on the loss functions defined in equations (7) and (8), end-to-end model training can be performed with gradient-based optimization, and the model with minimum loss is selected.
Step (6): inputting the user features into the model to realize commodity recommendation.
When the commodity recommendation system acquires new user behavior, or behavior is added or modified on the basis of the original behavior, the system inputs the newly constructed user behavior into the model to obtain a new classification result and then recommends the corresponding goods, as in the sketch below.
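A minimal inference sketch for this step, assuming the trained fused model plus the scaler and label encoders fitted during training; all names are illustrative assumptions.

import torch

def recommend(model, scaler, encoders, user_row, cat_cols, num_cols):
    # user_row: dict mapping feature names to the raw values of one user's behavior
    num_vals = [[user_row[c] for c in num_cols]]
    x_num = torch.tensor(scaler.transform(num_vals), dtype=torch.float)
    x_cat = torch.tensor(
        [[int(encoders[c].transform([user_row[c]])[0]) for c in cat_cols]])
    model.eval()
    with torch.no_grad():
        probs = model(x_cat, x_num)      # prediction probability of equation (5)
    return int(probs.argmax(dim=-1))     # predicted class -> goods type to recommend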
The numerical-feature and category-feature embedding module collects the embedded feature vectors of new user behavior for later model input.
The embedded-feature-vector input module feeds the new feature vectors into the model for parameter adjustment.
The Focal-Loss-based classification model training module trains a new model after the parameters change.
It will be appreciated by persons skilled in the art that the foregoing describes preferred embodiments of the invention and is not intended to limit it to the specific embodiments described; those skilled in the art may modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their elements. Modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within its scope.

Claims (7)

1. A Transformer-based general modeling method for structured data, characterized by comprising the following steps:
(1) Feature processing of the input public data set: after the original data is obtained, irrelevant features are removed, category features in the data are encoded into a recognizable numeric form, and the numerical features are scaled by a standardization operation;
(2) Word Embedding of the processed feature vectors: before the data passes through the encoder of the Transformer+ neural network, word Embedding projects the high-dimensional discrete data of the numerical features and the category features into a low-dimensional dense d-dimensional space;
(3) Inputting the word-embedding vectors obtained in step (2) into the two branches of the model: the model is divided into a Transformer+ neural network branch and an MLP+ neural network branch; the feature vectors of the training data after word Embedding are input into the Transformer+ neural network for learning to obtain its raw output, and the same input is fed into the MLP+ neural network for modeling and learning to obtain a trained MLP+ neural network; the Transformer+ neural network and the MLP+ neural network are fused into one classification model, so that the two raw outputs are weighted and summed to form the overall output value of the model, and the overall prediction result of the classification model is then obtained through an activation function;
(4) Training is guided by Focal Loss as the objective function: the classification model is trained with the preprocessed training data, the training process is guided by Focal Loss as the objective function, and the optimal parameters are searched to obtain a trained classification model;
(5) Receiving other tabular data for prediction: the tabular data to be classified is preprocessed and input into the trained classification model for classification prediction.
2. The method of claim 1, wherein the method of input feature processing of step (1) comprises the steps of:
(1-1) Removing useless features: feature recognition is carried out on each data set according to prior knowledge, and useless features are eliminated;
(1-2) Processing continuous features: the continuous features are standardized with a standard scaler, which scales the numerical features;
(1-3) Processing category features: the category features are encoded into numeric form by a label encoder (LabelEncoder); one-hot encoding is not used, to avoid the extra computational cost that sparse coding would incur.
3. The method of claim 1, wherein the word Embedding of step (2) is a technique that maps feature vectors to low-dimensional space vectors, converting discrete feature vectors into continuous vector representations; ordinary word Embedding is applied to the category features, while a separate fully connected layer with a ReLU nonlinearity is used for each numerical feature, projecting the 1-dimensional input into d-dimensional space; and the embeddings of the category features and the numerical features are concatenated along the first dimension.
4. A method as claimed in claim 3, wherein the concatenation of the category-feature and numerical-feature embeddings along the first dimension specifically comprises: the embedding layer E embeds each feature into a d-dimensional space; to process tabular data effectively, discrete category features are distinguished from continuous numerical features; a new embedded representation of the category features is obtained by the word Embedding technique, and a new embedded representation of the numerical features is obtained through a fully connected layer; x_i = [f_i^(1), f_i^(2), ..., f_i^(n)] is a single sample with category or numerical features; the embedding layer e uses a different embedding function for each type of feature, producing e_Φj(f_i^(j)) for a given f_i^(j), and the results are then spliced along the feature dimension; E_Φ(x) is the embedded representation of all features:

E_Φ(x) = {e_Φ1(x_1), ..., e_ΦN(x_N)}   (1)
5. The method of claim 1, wherein the model of step (3) comprises the following parts:
(3-1) the Transformer+ neural network improves on the Transformer by adding a Leaky Gate before the Transformer encoder and an MLP+ neural network after it, the Leaky Gate being a combination of two simple elements, namely an element-wise linear transformation and a LeakyReLU activation function;
(3-2) the MLP+ neural network improves on the multi-layer perceptron MLP: starting from an MLP sub-block, ordinary Batch Normalization (Batch Norm) is replaced with Ghost Batch Normalization (GBN), a linear skip layer is added on the right side of the sub-block, the skip layer being just a fully connected linear layer followed by a LeakyReLU activation function, and finally a Leaky Gate is added before both the MLP sub-block and the linear skip layer.
6. The method of claim 5, wherein the steps (3-1) and (3-2) specifically comprise: (3-1) the feature vector output in the previous step first enters a Leaky Gate, a combination of two simple elements: an element-wise linear transformation followed by a LeakyReLU activation function, which lets any positive value pass unchanged and compresses any negative value to almost zero; in other words, if w_i and b_i are the linear-layer parameters of the i-th column, the Leaky Gate of the i-th column is:

LeakyGate_i(x_i) = LeakyReLU(w_i · x_i + b_i)   (2)
the Leaky Gate is intended to act as a simple filter with a different behavior for each column, masking or passing each individual value;
the first Transformer layer takes the output of the Leaky Gate as input and passes its output to the second Transformer layer, and so on; as shown in FIG. 1, the output of the last Transformer layer is input directly to the MLP+ neural network (the improved multi-layer perceptron MLP, shown in FIG. 2) to obtain the model output value y_Transformer+, where θ_1, θ_2 and θ_3 are the model parameters of the Leaky Gate, the Transformer, and the MLP+ neural network:

y_Transformer+(x) = M(f_transformer(G_Θ(E_Φ(x); θ_1); θ_2); θ_3)   (3)
(3-2) similarly, the feature vector output in the previous step is input into the MLP+ neural network (the right branch of FIG. 1) to obtain the model output value y_MLP+:

y_MLP+ = M(E_Φ(x); θ_1)   (4)
7. The method of claim 1, wherein the step (4) specifically comprises: to combine the improved Transformer and the improved multi-layer perceptron MLP so as to obtain predictions from the overall model and perform end-to-end training, the output values of the two modules are assigned different weights w_1 and w_2 (both weights are learned by back-propagation training), and the prediction probability ŷ output by the final model is given by equation (5), where σ denotes the activation function, sigmoid for binary classification and softmax for multi-class classification:

ŷ = σ(w_1 · y_Transformer+(x) + w_2 · y_MLP+(x))   (5)
CN202310239904.9A 2023-03-07 2023-03-07 Transformer-based structured data general modeling method Pending CN116467930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310239904.9A CN116467930A (en) 2023-03-07 2023-03-07 Transformer-based structured data general modeling method


Publications (1)

Publication Number Publication Date
CN116467930A true CN116467930A (en) 2023-07-21

Family

ID=87183209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310239904.9A Pending CN116467930A (en) 2023-03-07 2023-03-07 Transformer-based structured data general modeling method

Country Status (1)

Country Link
CN (1) CN116467930A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663516A (en) * 2023-07-28 2023-08-29 深圳须弥云图空间科技有限公司 Table machine learning model training method and device, electronic equipment and storage medium
CN116663516B (en) * 2023-07-28 2024-02-20 深圳须弥云图空间科技有限公司 Table machine learning model training method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination