CN116127175A - Mobile application classification and recommendation method based on multi-modal feature fusion - Google Patents
- Publication number
- CN116127175A (application CN202210751368.6A)
- Authority
- CN
- China
- Prior art keywords
- mobile application
- layer
- features
- model
- embedded
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a mobile application classification and recommendation method based on multi-modal feature fusion, which comprises the following steps: (1) a mobile application feature extraction layer; (2) a mobile application classification layer; (3) a mobile application recommendation layer. The invention belongs to the technical field of computer networks, achieves better recommendation precision and quality, and outperforms other methods on indexes such as Macro F1, Accuracy, AUC and Logloss.
Description
Technical Field
The invention belongs to the technical field of computer networks, and particularly relates to a mobile application classification and recommendation method based on multi-modal feature fusion.
Background
According to Statista, the number of mobile applications in China approached 3.99 million by 2021, ranking first worldwide. Rich applications such as e-commerce, online take-out, games and self-media comprehensively affect people's clothing, food, housing and transportation, and have changed their way of life. In recent years, the number of mobile applications on the Internet has grown exponentially. Facing these massive mobile applications, although a large number of labeled samples are already available for training, problems such as cold start and data sparsity still arise when new data must be processed. The main problem in training a model with the existing large-scale classified data samples is selecting a suitable model. When a new mobile application appears, it contains information such as pictures, descriptions and publisher details. On the one hand, practitioners find it difficult to benchmark and analyze the mobile application market as a whole, so mobile applications need to be accurately classified to support subsequent tasks such as risk control and data analysis; on the other hand, users find it difficult to select mobile applications that suit their own personalized preferences and needs. Therefore, it is necessary to provide a high-quality mobile application recommendation mechanism to improve the user experience.
In traditional mobile application classification methods, such as the multi-layer perceptron and the support vector machine, the performance of most classification models depends on the quality of the labeled dataset, and acquiring high-quality labeled data requires considerable labor cost. Moreover, these methods depend on manual feature design, are affected by human factors, and generalize poorly: features that work well in one field do not necessarily work in others. Traditional mobile application recommendation methods, such as collaborative filtering and matrix factorization, generally convert the recommendation problem into a supervised learning problem. Essentially, such models first embed users and applications separately, and then use the interaction information between them to optimize the model and generate recommendations. These methods perform well in many recommendation and ranking tasks. However, they also have drawbacks: they are sensitive to sparse data, have limited predictive power for new users, and only learn linear interactions between users and services.
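The embed-then-interact pattern criticized above can be illustrated with a minimal matrix-factorization sketch; the class, names and dimensions below are illustrative assumptions, not part of the claimed method, and the dot product makes the "linear interactions only" limitation explicit:

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    """Illustrative embed-then-interact recommender: user and application
    embeddings whose dot product predicts a preference score."""
    def __init__(self, n_users: int, n_apps: int, k: int = 32):
        super().__init__()
        self.user = nn.Embedding(n_users, k)
        self.app = nn.Embedding(n_apps, k)

    def forward(self, u: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # Only a linear (dot-product) interaction between user and app.
        return (self.user(u) * self.app(a)).sum(dim=-1)
```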
With the growth of multimodal data on the Internet, content information from different modalities (visual, auditory, etc.) has recently been used to provide feature signals complementary to traditional text features. Most existing research in this area focuses on emotion classification in conversations. Specifically, Poria et al. proposed a multi-kernel learning method and an LSTM-based sequential architecture, in 2015 and 2017 respectively, to fuse text, visual and audio features. Building on this work, Zadeh et al. further designed tensor fusion networks and memory fusion networks to better capture interactions between different modalities. However, these approaches are designed for coarse-grained classification and may not be effective for our fine-grained, object-oriented mobile application classification.
Disclosure of Invention
In order to solve the above problems, the invention provides a mobile application classification and recommendation method based on multi-modal feature fusion, which achieves better recommendation precision and quality and outperforms other methods on indexes such as Macro F1, Accuracy, AUC and Logloss.
In order to realize the above functions, the technical scheme adopted by the invention is as follows: a mobile application classification and recommendation method based on multi-modal feature fusion, comprising the following steps:
(1) Mobile application feature extraction layer
Extracting a set of multimodal samples D from the mobile application dataset; each sample c ∈ D comprises a sentence S of n mobile application description words (w_1, …, w_n) and an associated mobile application image I; taking D as the training corpus, training and learning in the mobile application classifier so that the category labels of mobile applications in unseen samples are correctly predicted; after initial normalization and self-encoding tokenization preprocessing are completed, the feature extraction layer extracts the mobile application description features with a BERT model and extracts the image features with an involution-based residual network (RedNet);
(2) Mobile application classification layer
Distinguishing and fusing the feature importance of different modalities using the self-attention and multi-head attention mechanisms in a Transformer, and classifying the mobile application with a Softmax classifier according to the fused feature information;
(3) Mobile application recommendation layer
Inputting the classified data into a FiBiNet model according to its category, and dynamically learning the importance of the features by fitting, through weights, the relation between features and samples; more important features are given more weight, and the weight of non-critical features is weakened; bilinear operations are used to consider the importance of each dimension simultaneously so as to complete mobile application recommendation; the upper half of the FiBiNet model is the deep part, in which an MLP network concatenates the outputs of the bilinear interaction layer into a dense vector through a connection layer, then feeds the cross-combination features into a neural network to obtain a prediction score at the prediction layer; the lower, shallow part is the core of FiBiNet and mainly processes the input features.
Further, the description feature extraction in step 1 includes the following steps:
selecting a pre-trained bidirectional BERT as the initial model, and adjusting and learning its parameters by fine-tuning; using a multi-head self-attention layer, each position in the input sequence is converted into a weighted sum over the input layer; specifically, for the i-th attention head, the input layer $X \in \mathbb{R}^{d \times N}$ is transformed based on the scaled dot-product attention mechanism:

$$\mathrm{Att}_i(X) = (W_i^{V}X)\,\mathrm{softmax}\!\left(\frac{(W_i^{K}X)^{\top}(W_i^{Q}X)}{\sqrt{d/m}}\right)$$

where $\{W_i^{Q}, W_i^{K}, W_i^{V}\} \in \mathbb{R}^{d/m \times d}$ are learnable parameters corresponding to the query, key and value, respectively; the outputs of the m attention heads are then concatenated and linearly transformed;
characterizing the description information of each mobile application in self-encoded form and inputting it into the pre-trained BERT; besides the word tokens, a special classification token ([CLS]) is inserted at the beginning of each input sequence; the output of the last Transformer layer corresponding to the classification token aggregates the characterization information of the whole sequence, and the [CLS] vector together with the extracted semantic vector is kept as the output O to improve model accuracy:

$$O = [H_0, H_{[CLS]}]$$

the output O is then linearly transformed through a Softmax function to obtain the final d×N-dimensional text characterization vector $H_S$ of the mobile application description information.
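As a concrete illustration of this step, the following is a minimal sketch using the HuggingFace transformers package; the checkpoint name bert-base-chinese and the mean-pooled stand-in for the extracted semantic vector are assumptions, not specified by the patent:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def describe(text: str) -> torch.Tensor:
    """Return a text characterization built from the last Transformer layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    h_cls = out.last_hidden_state[:, 0, :]      # [CLS] hidden state
    h_tokens = out.last_hidden_state[:, 1:, :]  # remaining token states
    # Keep both the [CLS] vector and a pooled stand-in for the extracted
    # semantic vector, loosely mirroring O = [H_0, H_[CLS]] above.
    return torch.cat([h_cls, h_tokens.mean(dim=1)], dim=-1)

hs = describe("An example mobile application description")
```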
Further, the image feature extraction in step 1 includes the following steps:
the involution kernel $H_{i,j} \in \mathbb{R}^{K \times K \times G}$ is generated by a function $\phi$ conditioned on the single pixel at (i, j), after which the channels are rearranged to the spatial neighbourhood; the closed multiply-add operation is decomposed into two steps: the multiplication multiplies the tensors of the C channels with the kernel H, and the addition aggregates the elements within the kernel's spatial range; the kernel is customized by the pixel $X_{i,j}$ at the corresponding coordinate (i, j) but shared across channels, where G counts the groups of channels sharing one kernel; performing the multiply-add on the input with the kernel gives the characterization output of the involution module:

$$Y_{i,j,k} = \sum_{(u,v)\in\Delta_K} H_{i,j,\,u+\lfloor K/2\rfloor,\,v+\lfloor K/2\rfloor,\,\lceil kG/C\rceil}\; X_{i+u,\,j+v,\,k}$$

the kernel generation function is symbolized as $\phi$, and the function mapping for each location (i, j) is abstracted as:

$$H_{i,j} = \phi\!\left(X_{\Psi_{i,j}}\right)$$
inputting the mobile application image I in the dataset into the visual model RedNet-152 to obtain the output of its final convolution layer:

$$\mathrm{RedNet}(I) = \{\,r_j \mid r_j \in \mathbb{R}^{2048},\; j = 1, 2, \ldots, 49\,\}$$

the original mobile application image is segmented into 7×7 = 49 regions, each represented by a 2048-dimensional vector $r_j$; the mobile application visual features are projected into the same space as the text features using a linear transformation function $G = W_v\,\mathrm{RedNet}(I)$, where $W_v \in \mathbb{R}^{d \times 2048}$ is a learnable parameter; the output $\mathrm{RedNet}(I)$ is then linearly transformed through a Softmax function to obtain the final characterization vector G of the mobile application image information.
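For reference, a minimal involution layer can be sketched in PyTorch as follows; realizing the kernel-generation function φ with two 1×1 convolutions and channel reduction follows the RedNet paper, while all hyper-parameter values, names, and the omission of stride/normalization are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Involution(nn.Module):
    """Sketch of an involution layer: a K*K kernel is generated per pixel
    by phi and shared within each of G channel groups."""
    def __init__(self, channels: int, kernel_size: int = 7,
                 groups: int = 4, reduction: int = 4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # phi: two 1x1 convolutions generating K*K*G kernel values per pixel
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size**2 * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        kernel = self.span(self.reduce(x))                    # (B, K*K*G, H, W)
        kernel = kernel.view(b, self.g, 1, self.k**2, h, w)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k**2, h, w)
        out = (kernel * patches).sum(dim=3)   # multiply, then add over kernel range
        return out.view(b, c, h, w)
```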
Further, the Transformer fusion in step 2 stacks multimodal encoder layers:

$$H^{l} = \mathrm{TransformerLayer}\!\left(H^{l-1}\right),\quad l = 1, \ldots, L_m,\qquad H^{0} = \left[H_{[CLS]};\, H_S;\, G\right]$$

where $L_m$ is the number of layers of the multimodal encoder; the final hidden state of the "[CLS]" token is used for the mobile application classification task to effectively capture dynamic attention within and between the modalities of the mobile application.
The Softmax normalization function formula is as follows:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
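The fusion-and-classification step can be sketched as follows; the layer sizes, the learned [CLS] token, and the use of torch.nn.TransformerEncoder are illustrative assumptions standing in for the patent's multimodal encoder:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch: L_m Transformer layers over the [CLS] + text + image-region
    token sequence; the final [CLS] state feeds a Softmax classifier."""
    def __init__(self, d_model=768, n_heads=12, n_layers=6, n_classes=20):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # learned [CLS]
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, n, d); image_tokens: (B, 49, d), already projected
        b = text_tokens.size(0)
        seq = torch.cat([self.cls.expand(b, -1, -1), text_tokens, image_tokens], dim=1)
        hidden = self.encoder(seq)
        logits = self.head(hidden[:, 0])     # final hidden state of [CLS]
        return logits.softmax(dim=-1)        # class probabilities
```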
further, the shallow layer part in the step 3 includes the following steps:
the classified mobile application is input into an initial embedding layer in the FibiNet according to the category, sparse features can be embedded into low-dimensional continuous real-valued vectors, the sparse matrix is converted into a dense matrix through linear transformation, hidden features of the matrix are extracted, and generalization capability of the model is improved. The output of the embedded layer is expressed as follows:
E=[e 1 ,e 2 ,..,e i ,…,e]
and introducing a SENET network to perform training learning, obtaining the embedding weight and outputting a final embedding result. And performing dimension reduction operation on the embedded features obtained in the embedded layer to obtain global features. Then, the Sigmoid activating operation is carried out on the embedded weights, and the relation between each embedded weights are learned to obtain the embedded weights of different domains. And finally, multiplying the original embedding results to obtain a final embedding result.
Further, the dimension reduction includes the following steps:
compressing the original embedding E into a statistical vector $Z = [z_1, \ldots, z_i, \ldots, z_f]$ using an average pooling operation, where $z_i$ can be calculated by the following formula:

$$z_i = \frac{1}{k}\sum_{t=1}^{k} e_i^{(t)}$$

where $z_i$ is the global information about the i-th feature representation and k is the embedding size.
Further, the activation includes the following steps:
the embedding weights of each field are learned from the statistical vector Z using two fully connected layers. The first fully connected layer performs dimension reduction with parameter $W_1$, using $\sigma_1$ as a nonlinear function; the second fully connected layer restores the original dimension with parameter $W_2$. Formally, the field embedding weights can be calculated as follows:

$$A = \sigma_2\!\left(W_2\,\sigma_1(W_1 Z)\right)$$
Further, the re-weighting includes the following steps:
each field of the embedding layer is multiplied by its corresponding weight to obtain the final embedding result $V = \{v_1, \ldots, v_f\}$. The overall operation can be seen as learning the weight coefficient of each field embedding, which makes the model more discriminative with respect to each field embedding. The SENET mechanism increases the weight of important features and reduces the weight of features with insufficient information, yielding the output V of the SENET layer:

$$V = [a_1 \cdot e_1, \ldots, a_f \cdot e_f] = [v_1, \ldots, v_f]$$
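The three SENET steps (squeeze, excitation, re-weight) can be sketched jointly as below; the Sigmoid choice for both σ1 and σ2 follows the text above, while the class name, reduction ratio and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SenetLayer(nn.Module):
    """Sketch of SENET re-weighting over field embeddings:
    squeeze (mean-pool), excite (two FC layers), re-weight."""
    def __init__(self, num_fields: int, reduction: int = 3):
        super().__init__()
        mid = max(1, num_fields // reduction)
        self.w1 = nn.Linear(num_fields, mid, bias=False)  # dimension reduction
        self.w2 = nn.Linear(mid, num_fields, bias=False)  # dimension restore
        self.act = nn.Sigmoid()

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: (B, f, k) — f field embeddings of size k
        Z = E.mean(dim=2)                             # squeeze: z_i = mean of e_i
        A = self.act(self.w2(self.act(self.w1(Z))))   # A = sigma2(W2 sigma1(W1 Z))
        return E * A.unsqueeze(2)                     # V = [a_1*e_1, ..., a_f*e_f]
```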
after obtaining the mobile application characterization embeddings of the initial embedding layer and the SENET layer, second-order and higher-order feature interactions are performed on the sparse and dense features;
the interaction vectors p and q of the embedding-layer output E and the SENET-layer output V are obtained by calculation:

$$p_{ij} = v_i \cdot W_{ij} \odot v_j$$

$$p = [p_1, \ldots, p_i, \ldots, p_n]$$

$$q = [q_1, \ldots, q_i, \ldots, q_n]$$

the two obtained interaction vectors are concatenated and input into the deep part.
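A minimal sketch of the bilinear interaction $p_{ij} = (v_i W) \odot v_j$ follows; it uses the "field-all" variant (one shared W), whereas FiBiNet also defines per-field and per-pair variants, so this choice and the names are assumptions:

```python
import itertools
import torch
import torch.nn as nn

class BilinearInteraction(nn.Module):
    """Sketch of p_ij = (v_i W) ⊙ v_j over all field pairs ('field-all')."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(embed_dim, embed_dim) * 0.01)

    def forward(self, V: torch.Tensor) -> torch.Tensor:
        # V: (B, f, k); returns (B, f*(f-1)/2, k)
        f = V.size(1)
        pairs = [(V[:, i] @ self.W) * V[:, j]          # bilinear then Hadamard
                 for i, j in itertools.combinations(range(f), 2)]
        return torch.stack(pairs, dim=1)
```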
Further, the deep part calculation formula in step 3 is:

$$\hat{y} = \sigma\!\left(w_0 + \sum_{i=1}^{m} w_i x_i + y_d\right)$$

where $\hat{y} \in (0, 1)$ is the model's predicted value for mobile application recommendation, $\sigma$ is the sigmoid function, m is the feature size, $y_d$ is the output of the deep (MLP) part, and the remainder is the linear regression part;

Logloss is used as the model's recommendation optimization objective function:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right)$$

where $y_i$ is the actual label of the i-th mobile application, $\hat{y}_i$ is the corresponding predicted label, and N is the total number of mobile applications.
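The deep part and its Logloss objective can be sketched as below; the hidden sizes (400, 400) and the form of the linear part's input are assumptions:

```python
import torch
import torch.nn as nn

class FiBiNetHead(nn.Module):
    """Sketch of the deep part: the flattened interaction vectors p and q go
    through an MLP, the linear-regression part is added, and a sigmoid gives
    the prediction in (0, 1); trained with Logloss (binary cross-entropy)."""
    def __init__(self, interaction_dim: int, linear_dim: int, hidden=(400, 400)):
        super().__init__()
        layers, d = [], interaction_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        layers.append(nn.Linear(d, 1))
        self.mlp = nn.Sequential(*layers)
        self.linear = nn.Linear(linear_dim, 1)     # w_0 + sum(w_i * x_i)

    def forward(self, pq_flat: torch.Tensor, x_linear: torch.Tensor) -> torch.Tensor:
        logit = self.mlp(pq_flat) + self.linear(x_linear)
        return torch.sigmoid(logit).squeeze(-1)    # y_hat in (0, 1)

loss_fn = nn.BCELoss()  # the Logloss optimization objective
```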
The structure adopted by the invention yields the following beneficial effects:
1. An involution-based residual network is introduced into mobile application image feature extraction for the first time, which helps attend to local features in mobile application Logo images and improves image feature extraction performance;
2. An attention mechanism is used to learn the dynamic importance of features from different modalities and to learn feature interactions at fine granularity, improving the accuracy of service classification and recommendation;
3. The performance of the proposed method on Macro F1, Accuracy, AUC and Logloss is superior to that of all comparison models.
Drawings
FIG. 1 is a method framework diagram of a mobile application classification and recommendation method based on multi-modal feature fusion provided by the invention;
FIG. 2 is a diagram of a FiBiNet model of the mobile application classification and recommendation method based on multimodal feature fusion provided by the invention;
FIG. 3 is a mobile application classification Accuracy chart of the mobile application classification and recommendation method based on multi-modal feature fusion provided by the invention;
FIG. 4 is a mobile application classification Macro-F1 chart of the mobile application classification and recommendation method based on multi-modal feature fusion provided by the invention;
FIG. 5 is a mobile application recommendation Logloss chart of the mobile application classification and recommendation method based on multi-modal feature fusion provided by the invention;
fig. 6 is a mobile application recommendation AUC chart of the mobile application classification and recommendation method based on multi-modal feature fusion provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the mobile application classification and recommendation method based on multi-modal feature fusion provided by the invention mainly comprises three parts: (1) the mobile application feature extraction layer, which extracts features from the images and description information of each mobile application; (2) the mobile application classification layer, which uses the self-attention and multi-head attention mechanisms in a Transformer to distinguish and fuse the feature importance of different modalities, and uses a Softmax classifier to classify mobile applications according to the fused feature information; (3) the mobile application recommendation layer, which inputs the classified data into a FiBiNet model by category and dynamically learns feature importance by fitting, through weights, the relation between features and samples. More important features are given more weight, and the weight of non-critical features is weakened; bilinear operations are used to consider the importance of each dimension simultaneously so as to complete mobile application recommendation.
As shown in fig. 2, the upper half of the bilinear feature interaction model (FiBiNet) is the deep part: an MLP network concatenates the outputs of the bilinear interaction layer into a dense vector through a connection layer, then feeds the cross-combination features into the neural network to obtain a prediction score at the prediction layer; the lower, shallow part is the core of FiBiNet and mainly processes the input features. First, in the lower-left part of the figure, the high-dimensional sparse input features (sparse features of the APP) are mapped to low-dimensional dense vector representations by the initial embedding layer, and these embeddings pass through a SENET layer that dynamically learns feature importance, yielding the SENET-Like embeddings. Then, the initial characterization embeddings and the SENET-Like embeddings are each input into the bilinear interaction layer for feature crossing, and finally the output cross features are input into the MLP to complete mobile application recommendation.
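Wiring the sketches above together reproduces the Fig. 2 data flow end to end; SenetLayer, BilinearInteraction and FiBiNetHead are the illustrative classes defined earlier, and all shapes and placeholder inputs are assumptions:

```python
import torch
import torch.nn as nn

f, k = 10, 16                                  # assumed fields and embedding size
embed = nn.Embedding(10000, k)                 # initial embedding layer
senet = SenetLayer(num_fields=f)               # SENET layer (sketch above)
bilinear = BilinearInteraction(embed_dim=k)    # bilinear interaction (sketch above)
n_pairs = f * (f - 1) // 2
head = FiBiNetHead(interaction_dim=2 * n_pairs * k, linear_dim=f)

ids = torch.randint(0, 10000, (4, f))          # a batch of sparse feature ids
x_linear = torch.randn(4, f)                   # placeholder input for the linear part
E = embed(ids)                                 # initial embeddings (B, f, k)
V = senet(E)                                   # SENET-Like embeddings
p = bilinear(E).flatten(1)                     # cross features of original embeddings
q = bilinear(V).flatten(1)                     # cross features of SENET embeddings
y_hat = head(torch.cat([p, q], dim=1), x_linear)
```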
Specific example 1:
1. Experimental dataset
The top 5, 10, 15 and 20 categories with the largest numbers of mobile applications were selected as experimental data; the distribution of the top 20 categories is shown in Table 1. 60% of the experimental data was used as the training set, 20% as the validation set, and 20% as the test set.
Table 1 Kaggle dataset information
2. Mobile application classification experiment and analysis
(1) Evaluation index
To evaluate the effectiveness of mobile application classification, two commonly used evaluation criteria were used in the experiment, namely Macro F1 and Accuracy.
Accuracy: the ratio of the number of times of determination to the number of times of all determinations is shown. The number of times of correct judgment is the sum of the true example TP and the true negative example TN, the number of times of all judgment is the sum of four judgment possibilities (false positive example FP, false negative example FN, true example TP, true negative example TN), and the calculation formula of Accuracy is as follows:
macro F1: by calculating the recall rate (Rec i ) And accuracy (Pre) i ) Get the average recall of all N categories (Rec ma ) And average accuracy (Pre) ma ) Finally, macro F1 is calculated. Wherein, recall rate Rec i Describing the proportion of correctly classified mobile applications to all such mobile applications; accuracy Pre i The proportion of mobile applications that do belong to that class in the final classification result of the description model. Macro F1 is a harmonic mean of recall and accuracy, and the calculation formula is as follows:
(2) Comparison methods
TResBert: the character part is characterized by text features and position codes extracted by BERT, the picture part is input by adding corresponding position codes to picture region features extracted by original ResNet, vector splicing operation is carried out on the two characterization vectors, the two characterization vectors are input into an Encoder layer in a transducer, the weights among multiple modes and among the modes are dynamically distributed by using the attention mechanism of the characterization vectors, and finally the mobile application is classified according to the final characterization by a Softmax classifier.
Res-bert: the character part is characterized by text features and position codes extracted by BERT, the picture part is input by adopting picture region features extracted by original ResNet and corresponding position codes, the two characterization vectors are subjected to vector splicing operation only, and the two characterization vectors are directly input into a Softmax classifier to obtain mobile application classification.
Red-bert: the character part is characterized by text features and position codes extracted by BERT, the picture part is input by adopting picture region features extracted by an inner coil residual error network RedNet and corresponding position codes, the two characterization vectors are subjected to vector splicing operation only, and the two characterization vectors are directly input into a Softmax classifier to obtain mobile application classification.
Bert: mobile applications are classified only by the mobile application description features extracted by Bert.
(3) Experimental results and analysis
The relevant parameter settings include: the BatchSize was 32, the learning rate was 5e-5, and the hot start rate was 0.1. The experimental results of all methods are shown in tables 2 and 3 and fig. 3 and 4, and it can be found that:
when the data is preprocessed, the mobile application text data is not subjected to desensitization, and only the mobile application document is subjected to token and self-coding processing, so that the overall experimental precision is not high.
Among all the comparison methods, using Bert alone performs worst; that is, classifying mobile applications by text information alone yields the worst precision. Mining the correlated information between data of different modalities at finer granularity, such as image-text feature interaction, lets the model establish correlations between words and visual objects, so a multi-modal pre-training model obtains better precision than a single-modal model under the same experimental settings.
In most cases, models that extract mobile application image features with involution instead of an ordinary CNN achieve higher accuracy; this shows that attending to local image features helps distinguish the importance of different features when multi-modal features are fused.
Overall, TRedBert maintains the best performance. In particular, when the number of categories is 20, TRedBert improves Accuracy by 50.77%, 66.55%, 76.75% and 83.6% over TResBert, RedBert, ResBert and Bert, respectively. Compared with models that only use vector splicing, models that use a Transformer for feature fusion achieve higher precision; thus the attention mechanism better distinguishes the importance of different features during multi-modal feature fusion, and the fine-grained characterization is closer to the downstream task.
Table 2 Mobile application classification Accuracy
Table 3 Mobile application classification Macro-F1
3. Mobile application recommendation experiment and analysis
(1) Evaluation index
AUC: generally, for binary classification problems, we can set a threshold to classify the sample into positive and negative classes. And calculating corresponding coordinate points in the ROC according to different thresholds to form an ROC curve. AUC is the area under the ROC curve. When 0.5< auc <1, the model is superior to the random classifier. In particular, the closer the AUC is to 1.0, the higher the authenticity; when it is equal to 0.5, the authenticity is the lowest, and the calculation formula is as follows:
where fpr represents the false positive rate and tpr represents the true positive rate. In ROC space, the coordinate points describe the trade-off between FP (false positive case) and TP (true positive case).
Logloss: the accuracy of the classifier is measured by punishing the classification of errors. Minimizing the log loss is substantially equivalent to maximizing the accuracy of the classifier. The average deviation of the samples is reflected by the loglos, and is often used as a model to optimize the loss function, and the calculation formula is as follows:
(2) Comparison methods
MLR: LR is a regression analysis method that models the relationship between one or more independent variables and a dependent variable with a linear regression equation fitted by least squares. LR cannot fit nonlinear data; MLR fits nonlinear data by mixing multiple LR components.
FNN: the FNN model only comprises a deep part for mobile application of high-order feature extraction, and interaction between spliced features is achieved, low-order features cannot be fitted due to the lack of a low part (machine learning model), and a pre-training model is needed.
AFM: AFM introduces an attention mechanism into the factorization machine model, which can assign weights to different feature combinations. The overall idea is to give different attention to different combinations of mobile application features and to refine the processing of the cross features.
NFM: the neural factorization machine is a neural networking attempt of an FM model, and the expression capacity of the model is enhanced by taking a second-order cross term of FM as an input of the Deep model.
Deep fm: deep fm is divided into two parts, wide & Deep. The Wide part extracts low-order features from FM and the Deep part extracts high-order features from DNN. In the mobile application recommendation scenario, either the low-order combination feature or the high-order combination feature may have an influence on the final recommendation result. Therefore, it is most important to learn the feature combinations underlying the user's click behavior.
(3) Mobile application recommendation experimental results
The relevant parameter settings include: test_Size is 0.2, learning rate is 1e-5, and batch_Size is 32. The experimental results of all methods are shown in tables 4, 5, 6, and 7 and fig. 5 and 6, and it can be found that:
when the number of categories of the data set is increased and other experimental settings are unchanged, the overall recommendation performance is reduced along with the increase of the categories, particularly the FM-like model, and the feature interaction performance of the factorizer is also reduced along with the increase of the sparseness of the feature matrix, but the performance gap is not obvious when the number of the categories reaches more than 15 due to the larger data set.
Among all comparison methods, the performance of MLR and AFM is poor. This is because they cannot learn high-order interaction features, so the performance of mobile application recommendation suffers. The NFM and DeepFM models perform better overall, showing that learning both low-order and high-order feature interactions benefits recommendation quality.
The performance of depth models such as NFM and DeepFM is superior to that of MLR. When the input contains 20 categories, the performance of FNN and DeepFM improves by 15.88% and 13.72%, respectively. The results show that depth models can better model and mine effective information when features are sparse.
Overall, TRedBert+FiBiNet maintains the best performance. In particular, when the number of categories is 20, FiBiNet improves AUC by 166.55%, 20.83%, 26.75% and 113.6% over AFM, DeepFM, NFM and MLR, respectively. Distinguishing the importance of multidimensional mobile application features through the attention mechanism and learning fine-grained high- and low-order feature interactions thus allows a model that considers both to obtain better recommendation performance under the same experimental settings.
Table 4 Mobile application recommendation under five categories
Table 5 Mobile application recommendation under ten categories
Table 6 Mobile application recommendation under fifteen categories
Table 7 Mobile application recommendation under twenty categories
The invention and its embodiments have been described above without limitation, and the actual construction is not limited to the embodiments shown in the drawings. In summary, if one of ordinary skill in the art, informed by this disclosure, devises a structural manner or embodiment similar to the technical solution without creative effort, it shall fall within the scope of protection of the invention.
Claims (9)
1. A mobile application classification and recommendation method based on multi-modal feature fusion, characterized by comprising the following steps:
(1) Mobile application feature extraction layer
Extracting a set of multimodal samples D from the mobile application dataset; each sample c ∈ D comprises a sentence S of n mobile application description words (w_1, …, w_n) and an associated mobile application image I; taking D as the training corpus, training and learning in the mobile application classifier so that the category labels of mobile applications in unseen samples are correctly predicted; after initial normalization and self-encoding tokenization preprocessing are completed, the feature extraction layer extracts the mobile application description features with a BERT model and extracts the image features with an involution-based residual network (RedNet);
(2) Mobile application classification layer
Distinguishing and fusing the feature importance of different modalities using the self-attention and multi-head attention mechanisms in a Transformer, and classifying the mobile application with a Softmax classifier according to the fused feature information;
(3) Mobile application recommendation layer
Inputting the classified data into a FiBiNet model according to its category, and dynamically learning the importance of the features by fitting, through weights, the relation between features and samples; more important features are given more weight, and the weight of non-critical features is weakened; bilinear operations are used to consider the importance of each dimension simultaneously so as to complete mobile application recommendation; the upper half of the FiBiNet model is the deep part, in which an MLP network concatenates the outputs of the bilinear interaction layer into a dense vector through a connection layer, then feeds the cross-combination features into a neural network to obtain a prediction score at the prediction layer; the lower, shallow part is the core of FiBiNet and mainly processes the input features.
2. The mobile application classification and recommendation method based on multi-modal feature fusion according to claim 1, wherein the description feature extraction in step 1 includes the following steps:
selecting a pre-trained bidirectional BERT as the initial model, and adjusting and learning its parameters by fine-tuning; using a multi-head self-attention layer, each position in the input sequence is converted into a weighted sum over the input layer; specifically, for the i-th attention head, the input layer $X \in \mathbb{R}^{d \times N}$ is transformed based on the scaled dot-product attention mechanism:

$$\mathrm{Att}_i(X) = (W_i^{V}X)\,\mathrm{softmax}\!\left(\frac{(W_i^{K}X)^{\top}(W_i^{Q}X)}{\sqrt{d/m}}\right)$$

where $\{W_i^{Q}, W_i^{K}, W_i^{V}\} \in \mathbb{R}^{d/m \times d}$ are learnable parameters corresponding to the query, key and value, respectively; the outputs of the m attention heads are then concatenated and linearly transformed;
characterizing the description information of each mobile application in self-encoded form and inputting it into the pre-trained BERT; besides the word tokens, a special classification token ([CLS]) is inserted at the beginning of each input sequence; the output of the last Transformer layer corresponding to the classification token aggregates the characterization information of the whole sequence, and the [CLS] vector together with the extracted semantic vector is kept as the output O to improve model accuracy:

$$O = [H_0, H_{[CLS]}]$$

the output O is then linearly transformed through a Softmax function to obtain the final d×N-dimensional text characterization vector $H_S$ of the mobile application description information.
3. The mobile application classification and recommendation method based on multi-modal feature fusion according to claim 2, wherein the image feature extraction in step 1 comprises the steps of:
the involution kernel $H_{i,j} \in \mathbb{R}^{K \times K \times G}$ is generated by a function $\phi$ conditioned on the single pixel at (i, j), after which the channels are rearranged to the spatial neighbourhood; the closed multiply-add operation is decomposed into two steps: the multiplication multiplies the tensors of the C channels with the kernel H, and the addition aggregates the elements within the kernel's spatial range; the kernel is customized by the pixel $X_{i,j}$ at the corresponding coordinate (i, j) but shared across channels, where G counts the groups of channels sharing one kernel; performing the multiply-add on the input with the kernel gives the characterization output of the involution module:

$$Y_{i,j,k} = \sum_{(u,v)\in\Delta_K} H_{i,j,\,u+\lfloor K/2\rfloor,\,v+\lfloor K/2\rfloor,\,\lceil kG/C\rceil}\; X_{i+u,\,j+v,\,k}$$

the kernel generation function is symbolized as $\phi$, and the function mapping for each location (i, j) is abstracted as:

$$H_{i,j} = \phi\!\left(X_{\Psi_{i,j}}\right)$$
inputting the mobile application image I in the dataset into the visual model RedNet-152 to obtain the output of its final convolution layer:

$$\mathrm{RedNet}(I) = \{\,r_j \mid r_j \in \mathbb{R}^{2048},\; j = 1, 2, \ldots, 49\,\}$$

the original mobile application image is segmented into 7×7 = 49 regions, each represented by a 2048-dimensional vector $r_j$; the mobile application visual features are projected into the same space as the text features using a linear transformation function $G = W_v\,\mathrm{RedNet}(I)$, where $W_v \in \mathbb{R}^{d \times 2048}$ is a learnable parameter; the output $\mathrm{RedNet}(I)$ is then linearly transformed through a Softmax function to obtain the final characterization vector G of the mobile application image information.
4. The mobile application classification and recommendation method based on multi-modal feature fusion according to claim 3, wherein the Transformer fusion in step 2 stacks multimodal encoder layers:

$$H^{l} = \mathrm{TransformerLayer}\!\left(H^{l-1}\right),\quad l = 1, \ldots, L_m,\qquad H^{0} = \left[H_{[CLS]};\, H_S;\, G\right]$$

where $L_m$ is the number of layers of the multimodal encoder; the final hidden state of the "[CLS]" token is used for the mobile application classification task to effectively capture dynamic attention within and between the modalities of the mobile application;
the Softmax normalization function formula is as follows:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
5. the method for classifying and recommending mobile applications based on multi-modal feature fusion according to claim 4, wherein the shallow part in step 3 comprises the steps of:
the classified mobile applications are input by category into the initial embedding layer of FiBiNet, which can embed sparse features into low-dimensional continuous real-valued vectors; the sparse matrix is converted into a dense matrix through a linear transformation, the hidden features of the matrix are extracted, and the generalization capability of the model is improved; the output of the embedding layer is expressed as follows:

$$E = [e_1, e_2, \ldots, e_i, \ldots, e_f]$$

a SENET network is then introduced for training and learning, obtaining the embedding weights and outputting the final embedding result: a dimension-reduction operation is performed on the embedded features from the embedding layer to obtain global features; a Sigmoid activation operation is then applied, and the relations among the embedding weights are learned to obtain the embedding weights of the different fields; finally, these weights are multiplied with the original embeddings to obtain the final embedding result.
6. The mobile application classification and recommendation method based on multi-modal feature fusion of claim 5, wherein the dimension reduction comprises the steps of:
compressing the original embedding E into a statistical vector $Z = [z_1, \ldots, z_i, \ldots, z_f]$ using an average pooling operation, where $z_i$ can be calculated by the following formula:

$$z_i = \frac{1}{k}\sum_{t=1}^{k} e_i^{(t)}$$

where $z_i$ is the global information about the i-th feature representation and k is the embedding size.
7. The mobile application classification and recommendation method based on multimodal feature fusion as claimed in claim 6, wherein said activating comprises the steps of:
the embedding weights of each field are learned from the statistical vector Z using two fully connected layers. The first fully connected layer performs dimension reduction with parameter $W_1$, using $\sigma_1$ as a nonlinear function; the second fully connected layer restores the original dimension with parameter $W_2$. Formally, the field embedding weights can be calculated as follows:

$$A = \sigma_2\!\left(W_2\,\sigma_1(W_1 Z)\right)$$
8. The mobile application classification and recommendation method based on multi-modal feature fusion of claim 7, wherein the re-weighting comprises the steps of:
each field of the embedding layer is multiplied by its corresponding weight to obtain the final embedding result $V = \{v_1, \ldots, v_f\}$. The overall operation can be seen as learning the weight coefficient of each field embedding, which makes the model more discriminative with respect to each field embedding. The SENET mechanism increases the weight of important features and reduces the weight of features with insufficient information, yielding the output V of the SENET layer:

$$V = [a_1 \cdot e_1, \ldots, a_f \cdot e_f] = [v_1, \ldots, v_f]$$
after obtaining the mobile application characterization embeddings of the initial embedding layer and the SENET layer, second-order and higher-order feature interactions are performed on the sparse and dense features;
the interaction vectors p and q of the embedding-layer output E and the SENET-layer output V are obtained by calculation:

$$p_{ij} = v_i \cdot W_{ij} \odot v_j$$

$$p = [p_1, \ldots, p_i, \ldots, p_n]$$

$$q = [q_1, \ldots, q_i, \ldots, q_n]$$

the two obtained interaction vectors are concatenated and input into the deep part.
9. The method for classifying and recommending mobile applications based on multi-modal feature fusion according to claim 8, wherein the deep-layer part calculation formula in step 3 is:
$$\hat{y} = \sigma\!\left(w_0 + \sum_{i=1}^{m} w_i x_i + y_d\right)$$

where $\hat{y} \in (0, 1)$ is the model's predicted value for mobile application recommendation, $\sigma$ is the sigmoid function, m is the feature size, $y_d$ is the output of the deep (MLP) part, and the remainder is the linear regression part;

Logloss is used as the model's recommendation optimization objective function:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right)$$

where $y_i$ is the actual label of the i-th mobile application, $\hat{y}_i$ is the corresponding predicted label, and N is the total number of mobile applications.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210751368.6A CN116127175A (en) | 2022-06-28 | 2022-06-28 | Mobile application classification and recommendation method based on multi-modal feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210751368.6A CN116127175A (en) | 2022-06-28 | 2022-06-28 | Mobile application classification and recommendation method based on multi-modal feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116127175A (en) | 2023-05-16
Family
ID=86303206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210751368.6A Pending CN116127175A (en) | 2022-06-28 | 2022-06-28 | Mobile application classification and recommendation method based on multi-modal feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116127175A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116630726A (en) * | 2023-07-26 | 2023-08-22 | 成都大熊猫繁育研究基地 | Multi-mode-based bird classification method and system |
CN116630726B (en) * | 2023-07-26 | 2023-09-22 | 成都大熊猫繁育研究基地 | Multi-mode-based bird classification method and system |
CN117611954A (en) * | 2024-01-19 | 2024-02-27 | 湖北大学 | Method, device and storage device for evaluating effectiveness of infrared video image |
CN117611954B (en) * | 2024-01-19 | 2024-04-12 | 湖北大学 | Method, device and storage device for evaluating effectiveness of infrared video image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |