CN117271824A - User-aware multimodal cartoon illustration recommendation system with style features - Google Patents

User-aware multimodal cartoon illustration recommendation system with style features

Info

Publication number
CN117271824A
CN117271824A (Application CN202311171937.0A)
Authority
CN
China
Prior art keywords
user
feature
cartoon
features
style
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311171937.0A
Other languages
Chinese (zh)
Inventor
康雁
林豪
李卓伦
范宝辰
李天靖
杨明健
郑敬宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202311171937.0A priority Critical patent/CN117271824A/en
Publication of CN117271824A publication Critical patent/CN117271824A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a user-aware multimodal cartoon illustration recommendation system with style features, comprising the following steps: acquiring user-illustration data pairs; extracting semantic features with an illustration image encoder, extracting painting-style feature vectors, and adding multi-angle text semantic features that incorporate cartoon-domain knowledge; modeling the contribution of each modality with an attention mechanism to identify how the different modalities of a work influence each user's interactions; crossing the fused multimodal features with the user features to obtain a user-illustration representation; and finally outputting, through a DNN and a Sigmoid activation, the probability that the user will collect the illustration. By training an image encoder that outputs both style and semantic features and a text encoder that integrates domain knowledge, and by adding an attention-based user-aware fusion mechanism and a DCN-based multimodal feature-crossing module, the system builds a multimodal cartoon illustration recommendation model and improves recommendation performance.

Description

User-aware multimodal cartoon illustration recommendation system with style features
Technical Field
The invention relates to the field of recommendation algorithms, in particular to a user-aware multimodal cartoon illustration recommendation system with style features.
Background
With the vigorous development of the animation industry and users' growing enthusiasm for cartoon content, research related to cartoons has become a field of great interest and attracts the attention of a wide audience of fans. With the recent rise of generative models and the increasing maturity of their fine-tuning schemes, high-quality anime illustrations can now be produced almost without a barrier to entry, even without expert drawing knowledge, which will cause an exponential growth in the number of illustrations. Users therefore face a real challenge in finding, among a vast number of illustrations, those that match their interests and preferences.
With the development of the Web and storage systems, the amount of available information grows rapidly. To address information overload, recommendation systems have proven to be an effective solution in many fields such as e-commerce. Collaborative filtering (CF) is a method commonly used in recommendation systems that exploits users' past interaction behavior; however, because interaction data are very sparse, CF methods struggle to accurately capture user preferences and item attributes when there is little or no interaction. To alleviate this difficulty, many studies have exploited additional information about users and items, such as social networks and contextual features. In recent years, multimodal recommendation systems (MRS) have received widespread attention in academia and industry. A multimodal recommendation system uses the different modal features of an item (e.g., visual and textual modalities) together with the interaction information to better understand item properties not revealed by the interactions alone, thereby capturing user preferences more accurately. Much previous work has demonstrated the effectiveness of multimodal recommendation systems, for example in fashion, news, short video, music, and food recommendation.
Although recommendation systems have been applied in many fields, there is currently no effective illustration recommendation system. Cartoon illustrations are enormous in number and diverse in type; as an important component of cartoon works, they cover rich and varied styles and themes. Illustration recommendation must attend not only to textual features of the illustration entity, such as the title, text tags, and creator information, but also to the semantics and the style of the image itself. The invention therefore provides a user-aware multimodal cartoon illustration recommendation system with style features.
Disclosure of Invention
The aim of the invention is as follows: to address the problem that no effective illustration recommendation system currently exists to help users find illustrations among massive collections, a user-aware multimodal cartoon illustration recommendation system with style features is provided. It trains an image encoder that outputs both style and semantic features and a text encoder that integrates domain knowledge, and adds an attention-based user-aware fusion mechanism and a DCN-based multimodal feature-crossing module, thereby building a multimodal cartoon illustration recommendation model and improving recommendation performance.
The technical scheme of the invention is as follows:
a user perception multi-mode cartoon recommendation system with style characteristics is characterized by comprising the following steps:
user and illustration object input: acquiring a data set pair of a user and an illustration, and dividing the data set pair into a training set and a testing set;
multi-modal feature extraction: extracting semantic features by using an inserting image feature encoder, extracting painting style feature vectors, and adding multi-angle text semantic features fused with knowledge in the cartoon field;
user-aware multimodal feature contribution fusion mechanism: modeling the multi-modal contribution degree by using an attention mechanism, carrying out attention weighting on the user characteristic vector and the multi-modal characteristics of the artwork, and identifying the influence of the user on the interaction between different modalities of the work;
multimodal feature crossing: the fused multi-mode features and the user features are subjected to feature crossing through a DCN module, and user-painting characterization after feature crossing is obtained;
finally, activating and outputting the probability of collecting the picture-inserting object by the user through DNN and Sigmoid.
Further, the multimodal feature extraction specifically includes the following steps:
constructing a large cartoon-image multi-class, multi-label dataset by pairing each illustration image with its manually annotated text tags, and training on the tag-prediction task; that is, building a supervised ResNet classification model whose proxy task is multi-class, multi-label prediction on large-scale cartoon illustrations;
after the model converges, the model structure is modified by removing the original classification head and adding a semantic feature output and a style feature output.
Further, the semantic feature output includes: the output of a convolution layer near the end of the network is passed through a parameter-free global pooling layer and reduced to 1024 dimensions, which serves as the image semantic feature output;
the style feature output includes: starting from the input layer of the network, the activation outputs of the first few layers represent low-level characteristics such as edges and textures, and the image content is represented by the values of the feature maps;
a Gram matrix containing this information can be computed by taking the outer product of the feature vectors at each spatial position and averaging over all positions; for a given layer l, the Gram matrix is computed as:
G^l_{cd} = (1/IJ) * Σ_{i,j} F^l_{ijc} * F^l_{ijd},
where F^l_{ijc} denotes the value of feature map c at position (i, j) in layer l of the network, F^l_{ijd} denotes the value of feature map d at position (i, j) in the same layer, and IJ denotes the size of the feature map, i.e. the number of positions (i, j) on it.
Further, the multimodal feature extraction also includes the following steps:
multi-angle domain text pair construction: building a sentence-pair dataset from public information collected from websites including Bangumi, Pixiv, AnimeList, Wikipedia, and Moegirl; the sentence pairs cover multilingual correspondences of domain terms, domain terms with their definitions, and links between entities in the domain; the links between entities in the domain include work-character relations and character-character relations;
fine-tuned text encoder: the text encoder uses the sentence-transformers architecture with the pretrained weights distiluse-base-multilingual-cased-v2; it is fine-tuned on this dataset with a learning rate of 5e-5, and the loss function is MultipleNegativesRankingLoss, which pulls matching text pairs closer together.
Further, the user-aware multimodal feature contribution fusion mechanism specifically includes the following steps:
with the user feature vector v_user_final as the query, the four multimodal features of the illustration are scaled to the same dimension as the user feature vector and stacked into a multimodal feature list,
where f_label_semantics denotes the text semantic features, f_image_semantics denotes the image semantic features, f_image_style denotes the image style features, and Embedding(f_item_other) denotes the remaining features of the illustration;
the attention mechanism computes the influence of each modality f_m on each interaction between user u_i and work i_j as a_m = softmax((v_user_final · f_m) / √d),
where (·) and √d denote the dot product and the scaling factor, respectively;
then, by fusing the multimodal embedding list of i_j with these weights, the personalized embedding of i_j with respect to u_i is obtained as v_item_final = Σ_m a_m f_m.
Further, the multimodal feature crossing specifically includes the following steps:
after the illustration feature vector is obtained through the user-aware multimodal contribution measurement fusion mechanism, it is concatenated with the user feature vector and fed into a DCNv2-based feature-crossing module to obtain the user-illustration representation after n-order crossing:
v_0 = Concat(v_user_final, v_item_final), v_n = DCN(v_0),
where v_user_final denotes the user feature vector and v_item_final denotes the user-aware multimodal fusion feature vector of the illustration;
the (l+1)-th cross layer of DCNv2 is computed as:
v_{l+1} = v_0 ⊙ (W_l v_l + b_l) + v_l,
where v_0 is the base layer containing the first-order original features, v_l and v_{l+1} are respectively the input and output of the (l+1)-th cross layer, and W_l and b_l are the weight matrix and bias vector to be learned; for an l-layer cross network the highest polynomial order is l+1, and the network contains all feature crossings from first order up to the highest order.
Further, the training set includes n instances (x, y), representing n behavior records of users for illustrations, where x is a data record with m fields involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior: y = 1 indicates that the user browsed and collected the item, and y = 0 indicates that the user only browsed it without collecting it.
Compared with the prior art, the invention has the following beneficial effects:
1. The user-aware multimodal cartoon illustration recommendation system with style features trains an image encoder that outputs both style and semantic features and a text encoder that integrates domain knowledge, and adds an attention-based user-aware fusion mechanism together with a DCN-based multimodal feature-crossing module, thereby building a multimodal cartoon illustration recommendation model that provides users with personalized and accurate illustration recommendations;
2. The system adds a DCNv2 module to explicitly model bounded-order automatic crossings of the multimodal features, replacing manual feature crossing;
3. The system introduces an attention-based user-aware fusion mechanism together with the DCN-based multimodal feature-crossing module, jointly considering the user's preference behavior and the influence of each modality on that preference.
Drawings
FIG. 1 is a flow chart of the user-aware multimodal cartoon illustration recommendation system with style features.
FIG. 2 is a statistical plot of interaction frequency from the experimental validation of the user-aware multimodal cartoon illustration recommendation system with style features.
Detailed Description
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The features and capabilities of the present invention are described in further detail below in connection with examples.
Referring to FIGS. 1-2, a user-aware multimodal cartoon illustration recommendation system with style features comprises the following steps:
User and illustration input: acquiring user-illustration data pairs and dividing them into a training set and a test set. The training set includes n instances (x, y), representing n behavior records of users for illustrations, where x is a data record with m fields involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior: y = 1 indicates that the user browsed and collected the item, and y = 0 indicates that the user only browsed it without collecting it.
Multimodal feature extraction: extracting semantic features with the illustration image encoder, extracting painting-style feature vectors, and adding multi-angle text semantic features that incorporate cartoon-domain knowledge.
The multimodal feature extraction specifically includes the following steps:
constructing a large cartoon-image multi-class, multi-label dataset by pairing each illustration image with its manually annotated text tags, and training on the tag-prediction task; that is, building a supervised ResNet classification model whose proxy task is multi-class, multi-label prediction on large-scale cartoon illustrations. The tags mainly cover poses, characters, painting-style categories, physical attributes of the characters such as source work, clothing, and hair color, and descriptions of other image content.
After the model converges, the model structure is modified by removing the original classification head and adding a semantic feature output and a style feature output.
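A minimal Keras sketch of this two-output image encoder follows. It assumes a ResNet50 backbone and specific layer choices (conv4_block6_out for the 1024-dimensional semantic output, the three conv2 block outputs, each with 256 channels, for the style branch); the description only specifies "ResNet", a near-final 1024-dimensional pooled output, and the first few convolution layers, so these choices are assumptions.

```python
import tensorflow as tf

def build_illustration_encoder(num_tags):
    """ResNet trained on the multi-label tag-prediction proxy task, then re-exposed
    with a semantic output and the early feature maps used for the style branch."""
    backbone = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                              input_shape=(224, 224, 3))

    # Proxy task: supervised multi-class, multi-label tag prediction (one sigmoid per tag).
    pooled = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    tag_probs = tf.keras.layers.Dense(num_tags, activation="sigmoid")(pooled)
    classifier = tf.keras.Model(backbone.input, tag_probs)

    # After convergence, drop the classification head and expose:
    #  - a 1024-dimensional semantic output via parameter-free global pooling of a late conv layer,
    #  - early conv activations (256 channels each), later turned into the Gram-based style feature.
    semantic = tf.keras.layers.GlobalAveragePooling2D()(
        backbone.get_layer("conv4_block6_out").output)  # 1024 channels in ResNet50
    early = [backbone.get_layer(name).output
             for name in ("conv2_block1_out", "conv2_block2_out", "conv2_block3_out")]
    encoder = tf.keras.Model(backbone.input, [semantic] + early)
    return classifier, encoder
```

The classifier is trained on the tag dataset with a per-tag binary cross-entropy loss; once it converges, only the encoder outputs are used downstream.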
The semantic feature output includes: the output of a convolution layer near the end of the network is passed through a parameter-free global pooling layer and reduced to 1024 dimensions, which serves as the image semantic feature output.
The style feature output includes: starting from the input layer of the network, the activation outputs of the first few layers represent low-level characteristics such as edges and textures, and the image content is represented by the values of the feature maps.
A Gram matrix containing this information can be computed by taking the outer product of the feature vectors at each spatial position and averaging over all positions; for a given layer l, the Gram matrix is computed as:
G^l_{cd} = (1/IJ) * Σ_{i,j} F^l_{ijc} * F^l_{ijd},
where F^l_{ijc} denotes the value of feature map c at position (i, j) in layer l of the network, F^l_{ijd} denotes the value of feature map d at position (i, j) in the same layer, and IJ denotes the size of the feature map, i.e. the number of positions (i, j) on it. The purpose of the formula is to compute the correlation between feature maps c and d: it sums the element-wise products of the two feature maps over all positions and divides by the total number of positions IJ.
Specifically, in our model, we take the output feature maps of the first three convolution layers, compute a Gram matrix for each, max-pool each Gram matrix to a 256-dimensional output, stack the three 256-dimensional outputs, and apply max pooling again to obtain the final 256-dimensional image style feature.
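A minimal TensorFlow sketch of this style-feature computation follows; the exact way each Gram matrix is max-pooled down to 256 dimensions is not spelled out in the description, so the reduce_max over one Gram axis and over the three layers is an assumption.

```python
import tensorflow as tf

def gram_matrix(feature_map):
    """Gram matrix of a conv feature map of shape (batch, H, W, C):
    G[c, d] = (1/IJ) * sum over positions of F[i, j, c] * F[i, j, d]."""
    gram = tf.einsum("bijc,bijd->bcd", feature_map, feature_map)
    num_positions = tf.cast(tf.shape(feature_map)[1] * tf.shape(feature_map)[2], tf.float32)
    return gram / num_positions

def style_feature(early_feature_maps):
    """Combine the Gram statistics of the first three conv layers into one style vector."""
    pooled = [tf.reduce_max(gram_matrix(f), axis=-1) for f in early_feature_maps]  # three (batch, C)
    stacked = tf.stack(pooled, axis=1)     # (batch, 3, C)
    return tf.reduce_max(stacked, axis=1)  # (batch, C); 256-dimensional when C = 256
```

With the 256-channel conv2 block outputs from the encoder sketch above, this yields the 256-dimensional style feature described in the text.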
Multi-angle domain text pair construction: a sentence-pair dataset is built from public information collected from websites including Bangumi, Pixiv, AnimeList, Wikipedia, and Moegirl; the sentence pairs cover multilingual correspondences of domain terms, domain terms with their definitions, and links between entities in the domain; the links between entities in the domain include work-character relations and character-character relations.
Fine-tuned text encoder: the text encoder uses the sentence-transformers architecture with the pretrained weights distiluse-base-multilingual-cased-v2; it is fine-tuned on this dataset with a learning rate of 5e-5, and the loss function is MultipleNegativesRankingLoss, which pulls matching text pairs closer together.
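A minimal sketch of this fine-tuning step with the sentence-transformers library follows; the two example pairs are hypothetical placeholders for the collected domain text pairs.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical domain text pairs: multilingual term correspondences, term-definition pairs,
# and work-character / character-character links harvested from the listed websites.
train_examples = [
    InputExample(texts=["魔法少女まどか☆マギカ", "Puella Magi Madoka Magica"]),
    InputExample(texts=["Homura Akemi", "Puella Magi Madoka Magica"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune at the learning rate given in the description (5e-5).
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 5e-5},
)
```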
User-aware multimodal feature contribution fusion mechanism: modeling the contribution of each modality with an attention mechanism, attention-weighting the user feature vector against the multimodal features of the illustration, and identifying how the different modalities of a work influence the user's interactions.
Different modalities contribute differently to a user's preference behavior; for example, user u_i may like work i_j 80% because of its style, 10% because of its image semantic content, and 10% because of the semantic features of the manually annotated text tags of its image.
The user-aware multimodal feature contribution fusion mechanism specifically includes the following steps:
with the user feature vector v_user_final as the query, the four multimodal features of the illustration are scaled to the same dimension as the user feature vector and stacked into a multimodal feature list,
where f_label_semantics denotes the text semantic features, f_image_semantics denotes the image semantic features, f_image_style denotes the image style features, and Embedding(f_item_other) denotes the remaining features of the illustration;
the attention mechanism computes the influence of each modality f_m on each interaction between user u_i and work i_j as a_m = softmax((v_user_final · f_m) / √d),
where (·) and √d denote the dot product and the scaling factor, respectively;
then, by fusing the multimodal embedding list of i_j with these weights, the personalized embedding of i_j with respect to u_i is obtained as v_item_final = Σ_m a_m f_m.
Through this attention-based User-Aware Multimodal Contribution Measurement mechanism (UAMCM), the user feature vector is attention-weighted against the multimodal feature set of the illustration, finally yielding the user-aware multimodal fusion feature vector of the illustration, referred to below as the illustration feature vector:
v_item_final = UAMCM(v_user_final, v_item_multi_modal).
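A minimal TensorFlow sketch of the UAMCM fusion, assuming the softmax-normalized scaled dot-product form of attention described above:

```python
import tensorflow as tf

def uamcm(user_vec, modal_feats):
    """User-Aware Multimodal Contribution Measurement (sketch).

    user_vec:    (batch, dim) user feature vector, used as the attention query.
    modal_feats: list of the four modality features (text semantics, image semantics,
                 image style, remaining features), each already projected to (batch, dim).
    Returns the user-aware fused illustration feature vector of shape (batch, dim).
    """
    stacked = tf.stack(modal_feats, axis=1)                             # (batch, 4, dim)
    dim = tf.cast(tf.shape(user_vec)[-1], tf.float32)
    scores = tf.einsum("bd,bmd->bm", user_vec, stacked) / tf.sqrt(dim)  # scaled dot product
    weights = tf.nn.softmax(scores, axis=-1)                            # per-modality contribution
    return tf.einsum("bm,bmd->bd", weights, stacked)                    # attention-weighted fusion
```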
Modeling item features with user-specific weights alone is not sufficient, because the modalities do not act on a user's preference independently: assigning different weights to single modalities may still fail to express the user's preference well, whereas combinatorial crossings between multiple modalities can model it better. For example, a user may collect an illustration because the work carries both modality-A and modality-B semantics; even if the user's liking for A alone is weaker than for C, the user still shows preference behavior under the joint effect of the A-and-B combination, which suggests that second- and higher-order crossings of features such as style and content are an important component of preference behavior. The above only uses second-order crossings of the multimodal features to illustrate the necessity of multimodal feature-combination crossing in click-through-rate estimation; in real scenarios we need to model the joint influence of multiple features appearing simultaneously, and such patterns cannot be fully mined by expert hand-crafting, so it is essential for the model to perform automatic feature crossing.
Multimodal feature crossing: the fused multimodal features and the user features are passed through a DCN module for feature crossing, yielding the crossed user-illustration representation.
The multimodal feature crossing specifically includes the following steps:
after the illustration feature vector is obtained through the user-aware multimodal contribution measurement fusion mechanism, it is concatenated with the user feature vector and fed into a DCNv2-based feature-crossing module to obtain the user-illustration representation after n-order crossing:
v_0 = Concat(v_user_final, v_item_final), v_n = DCN(v_0),
where v_user_final denotes the user feature vector and v_item_final denotes the user-aware multimodal fusion feature vector of the illustration;
the (l+1)-th cross layer of DCNv2 is computed as:
v_{l+1} = v_0 ⊙ (W_l v_l + b_l) + v_l,
where v_0 is the base layer containing the first-order original features, v_l and v_{l+1} are respectively the input and output of the (l+1)-th cross layer, and W_l and b_l are the weight matrix and bias vector to be learned; for an l-layer cross network the highest polynomial order is l+1, and the network contains all feature crossings from first order up to the highest order.
Finally, the probability that the user will collect the illustration is output through a DNN and a Sigmoid activation.
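A minimal Keras sketch of this prediction head, covering the DCNv2 cross layers and the final DNN with Sigmoid output; the number of cross layers and the width of the DNN layer are illustrative assumptions.

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    """One DCNv2 cross layer: v_{l+1} = v_0 * (W_l v_l + b_l) + v_l."""
    def build(self, input_shape):
        dim = input_shape[0][-1]
        self.w = self.add_weight(name="w", shape=(dim, dim))
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")

    def call(self, inputs):
        v0, vl = inputs
        return v0 * (tf.matmul(vl, self.w) + self.b) + vl

def build_prediction_head(dim, num_cross_layers=3):
    """Cross the concatenated user / illustration vectors, then DNN + Sigmoid."""
    v_user = tf.keras.Input(shape=(dim,), name="v_user_final")
    v_item = tf.keras.Input(shape=(dim,), name="v_item_final")
    v0 = tf.keras.layers.Concatenate()([v_user, v_item])
    vl = v0
    for _ in range(num_cross_layers):
        vl = CrossLayer()([v0, vl])
    hidden = tf.keras.layers.Dense(128, activation="relu")(vl)     # DNN layer
    prob = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)  # P(user collects the illustration)
    return tf.keras.Model([v_user, v_item], prob)
```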
In another specific embodiment, the training dataset consists of n instances (x, y), representing n behavior records of users for illustrations. Here x is a data record with m fields, typically involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior (y = 1 means the user browsed and collected the item; y = 0 means the user only browsed it without collecting it).
Each instance is converted to (x, y), where the feature set x = [x_user_interest_label, f_image_semantics, ...] is provided as a dictionary in which each feature is a key-value pair of feature name and feature value, e.g. {feature_name_1: feature_value_1}. The specific inputs are shown in Table 1.
TABLE 1 input data
Table 2 input data processing
The above instances are converted into vectors acceptable to the model. The multimodal inputs of an illustration consist of the manually annotated illustration tags x_illustration_label, the illustration image x_illustration_image, and the remaining discrete illustration features x_illustration_other. x_illustration_image passes through the image encoder W_image, which outputs the image style feature f_image_style and the image semantic feature f_image_semantics; x_illustration_label passes through the text encoder W_text, which outputs the text semantic feature f_label_semantics; the remaining discrete features are likewise converted by an embedding layer into Embedding(f_item_other). In summary, these raw inputs are transformed as in Table 2 and Table 3 to obtain the final model input vector v_item_multi_modal.
TABLE 3 discrete input processing
The multimodal feature set of the final illustration is represented as v_item_multi_modal = {f_label_semantics, f_image_semantics, f_image_style, Embedding(f_item_other)}.
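The following sketch illustrates this input format and the assembly of v_item_multi_modal; since the bodies of Tables 1-3 are not reproduced here, the field names and placeholder values are hypothetical.

```python
# Hypothetical single training instance in the dictionary format described above.
instance_x = {
    "x_user_interest_label": ["watercolor", "fantasy"],    # user-side discrete features
    "x_illustration_label":  ["1girl", "school_uniform"],  # manual text tags of the illustration
    "x_illustration_image":  "illust_001.png",             # raw image, fed to the image encoder
    "x_illustration_other":  {"creator_id": 42},           # remaining discrete features
}
y = 1  # the user browsed and collected this illustration

def build_item_multi_modal(instance, image_encoder, text_encoder, other_embedding):
    """Assemble the multimodal feature set of one illustration from its raw fields."""
    f_image_semantics, f_image_style = image_encoder(instance["x_illustration_image"])
    f_label_semantics = text_encoder(instance["x_illustration_label"])
    return {
        "f_label_semantics": f_label_semantics,
        "f_image_semantics": f_image_semantics,
        "f_image_style":     f_image_style,
        "f_item_other":      other_embedding(instance["x_illustration_other"]),
    }
```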
The personalized recommendation problem is treated as a binary classification problem; the goal is to build a prediction function ŷ = f(x) that estimates the probability that a user will collect a particular illustration under the given feature environment.
Experiment verification
Experimental setup
Dataset: since there is currently no high-quality public cartoon illustration recommendation dataset containing multimodal information, we construct a cartoon illustration recommendation dataset from one year of log information (January 1, 2021 to January 1, 2022) and illustration metadata of a commercial cartoon illustration website. The dataset is split by time: the first 10 months are used as the training set and the last 2 months as the test set (note that no special processing is applied to guarantee that the users and illustrations in the test set also appear in the training set, i.e. the test set contains new users and new items). Some statistical analyses were performed on the dataset: Table 4 reports the numbers of users, illustrations, and interactions for the whole dataset, and FIG. 2 shows, from the user's perspective, the interaction counts and the number of users at each interaction count.
Table 4 dataset statistics
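A small pandas sketch of the time-based split described above; the log schema (column names, file name) is assumed, since only the date range and the 10-month / 2-month split are given.

```python
import pandas as pd

# Hypothetical interaction log covering 2021-01-01 to 2022-01-01.
logs = pd.read_csv("interactions.csv", parse_dates=["timestamp"])
logs = logs[(logs["timestamp"] >= "2021-01-01") & (logs["timestamp"] < "2022-01-01")]

# First 10 months for training, last 2 months for testing.
split_point = pd.Timestamp("2021-11-01")
train = logs[logs["timestamp"] < split_point]
test = logs[logs["timestamp"] >= split_point]  # may contain users and illustrations unseen in training
```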
The evaluation metrics for comparison are Area Under the Curve (AUC), Binary Cross-Entropy Loss (BCE Loss), and the number of model parameters.
All models are optimized with Adam (Adaptive Moment Estimation) at a learning rate of 1e-3. To avoid the one-epoch overfitting phenomenon common in recommendation systems, every model is trained for only 1 epoch on the training set before testing. All models use an embedding-layer output dimension of 8 and an L2 regularization coefficient of 1e-3; the remaining structural hyperparameters keep their default settings, and the parameter counts of all compared models are kept at the same order of magnitude.
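A minimal sketch of this training setup in Keras; `model`, `train_inputs`, and `train_labels` stand for the assembled recommendation model and the preprocessed training data.

```python
import tensorflow as tf

def compile_and_train(model, train_inputs, train_labels):
    """Training setup described above: Adam at 1e-3, binary cross-entropy loss, AUC metric."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    # A single epoch avoids the one-epoch overfitting phenomenon noted in the text.
    model.fit(train_inputs, train_labels, epochs=1)
    return model
```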
Since there is not yet much prior work on cartoon illustration recommendation, and we model the problem as a click-through-rate-style estimation problem, we compare our model on the above metrics against SOTA models for general click-through-rate estimation published in recent years.
Dual-stream networks that improve on traditional FM by combining it with deep learning:
1) xDeepFM [KDD 2018] proposes a Compressed Interaction Network (CIN) to learn high-order feature interactions. It can effectively learn bounded-degree feature interactions at the vector level.
2) DIFM [IJCAI 2020] adaptively learns flexible representations of a given feature from different input instances by means of a Dual Factor Estimating Network (Dual-FEN). It can efficiently learn input-aware factors (used to re-weight the original feature representations) at both the bit level and the vector level.
3) FinalMLP [AAAI 2023] proposes stream-specific feature gating and a multi-head bilinear fusion module to enhance the input differentiation of the two streams and enable stream-level interactions; its surprisingly strong results challenge the validity and necessity of existing explicit feature-interaction modeling studies.
Single-stream networks with automatic feature crossing:
1) DCN [ADKDD 2017] proposes a Cross Network with linear time and space complexity to replace the Wide part of the Wide & Deep model and cross features automatically, forming the Deep & Cross Network.
2) FiBiNET [RecSys 2019] proposes the SENET module to dynamically learn feature importance, and introduces three types of bilinear interaction layers to learn feature interactions instead of computing them with Hadamard or inner products.
3) DCN V2 [WWW 2021] proposes an expressive yet simple way to model explicit cross features. Observing the low-rank nature of the weight matrices in the cross network, it also proposes a mixture of low-rank DCNs (DCN-Mix) to achieve a better trade-off between model performance and latency.
We implement our model and the above comparison models based on the TensorFlow framework.
Comparison results
Table 5 baseline comparison
The results of the performance comparison experiments are summarized in Table 5. We find that, by integrating multimodal information, our proposed model clearly outperforms the other, non-multimodal recommendation systems on both the AUC and BCE Loss metrics, with improvements of 14.95% and 5.152%, respectively, over the best comparison model, which verifies the advantage of fusing different modalities. Information from different modalities yields a more comprehensive representation of an item, thereby enhancing the learning of user preferences and, to some extent, resisting the data sparsity caused by long-tailed data (as the dataset statistics show, the data follow a typical long-tailed distribution). Users' preferences for illustrations often draw on richer content information than just the tags they follow; the visual appeal of the illustration itself is an important part of preference behavior. The semantic and style information of the illustration image can therefore enrich the recommended illustration representation. Our method fuses textual and visual semantic and style information into illustration representation learning, models their internal correlations, improves the understanding of illustration content, achieves more accurate interest matching, and obtains better performance.
It is worth noting that the improvement in BCE Loss is much larger relative to the other models. We believe this is because the tendency to overfit is more pronounced in the recommendation field than in other fields: despite the addition of robust multimodal features, overfitting still occurs, which shows up as improvements in BCE Loss that are not matched by improvements of the same magnitude in AUC. This is not unique to our model; DIFM and DCN V2 show the same behavior relative to the other models.
Our model is not optimal in terms of parameter count, but it remains of the same order of magnitude as the other models: adding the multimodal features mainly increases the parameters of the embedding layers, which raises the overall parameter count relative to the other models; however, because the extracted features are dense and of limited dimensionality (the largest, the image semantic feature, is 1024-dimensional), this does not cause problems with model efficiency.
The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.

Claims (7)

1. A user-aware multimodal cartoon illustration recommendation system with style features, characterized by comprising the following steps:
user and illustration input: acquiring user-illustration data pairs and dividing them into a training set and a test set;
multimodal feature extraction: extracting semantic features with the illustration image encoder, extracting painting-style feature vectors, and adding multi-angle text semantic features that incorporate cartoon-domain knowledge;
user-aware multimodal feature contribution fusion mechanism: modeling the contribution of each modality with an attention mechanism, attention-weighting the user feature vector against the multimodal features of the illustration, and identifying how the different modalities of a work influence the user's interactions;
multimodal feature crossing: passing the fused multimodal features and the user features through a DCN module for feature crossing, obtaining the crossed user-illustration representation;
finally, outputting, through a DNN and a Sigmoid activation, the probability that the user will collect the illustration.
2. The user-aware multimodal cartoon illustration recommendation system with style features of claim 1, wherein the multimodal feature extraction specifically comprises the following steps:
constructing a large cartoon-image multi-class, multi-label dataset by pairing each illustration image with its manually annotated text tags, and training on the tag-prediction task; that is, building a supervised ResNet classification model whose proxy task is multi-class, multi-label prediction on large-scale cartoon illustrations;
after the model converges, modifying the model structure by removing the original classification head and adding a semantic feature output and a style feature output.
3. The user-aware multimodal cartoon illustration recommendation system with style features of claim 2, wherein the semantic feature output comprises: the output of a convolution layer near the end of the network is passed through a parameter-free global pooling layer and reduced to 1024 dimensions, which serves as the image semantic feature output;
the style feature output comprises: starting from the input layer of the network, the activation outputs of the first few layers represent low-level characteristics such as edges and textures, and the image content is represented by the values of the feature maps;
a Gram matrix containing this information can be computed by taking the outer product of the feature vectors at each spatial position and averaging over all positions; for a given layer l, the Gram matrix is computed as:
G^l_{cd} = (1/IJ) * Σ_{i,j} F^l_{ijc} * F^l_{ijd},
where F^l_{ijc} denotes the value of feature map c at position (i, j) in layer l of the network, F^l_{ijd} denotes the value of feature map d at position (i, j) in the same layer, and IJ denotes the size of the feature map, i.e. the number of positions (i, j) on it.
4. The user-aware multimodal cartoon illustration recommendation system with style features of claim 2 or 3, wherein the multimodal feature extraction further comprises the following steps:
multi-angle domain text pair construction: building a sentence-pair dataset from public information collected from websites including Bangumi, Pixiv, AnimeList, Wikipedia, and Moegirl; the sentence pairs cover multilingual correspondences of domain terms, domain terms with their definitions, and links between entities in the domain; the links between entities in the domain include work-character relations and character-character relations;
fine-tuned text encoder: the text encoder uses the sentence-transformers architecture with the pretrained weights distiluse-base-multilingual-cased-v2; it is fine-tuned on this dataset with a learning rate of 5e-5, and the loss function is MultipleNegativesRankingLoss, which pulls matching text pairs closer together.
5. The system of claim 1, wherein the user-aware multimodal feature contribution fusion mechanism specifically comprises the following steps:
with the user feature vector v_user_final as the query, the four multimodal features of the illustration are scaled to the same dimension as the user feature vector and stacked into a multimodal feature list,
where f_label_semantics denotes the text semantic features, f_image_semantics denotes the image semantic features, f_image_style denotes the image style features, and Embedding(f_item_other) denotes the remaining features of the illustration;
the attention mechanism computes the influence of each modality f_m on each interaction between user u_i and work i_j as a_m = softmax((v_user_final · f_m) / √d),
where (·) and √d denote the dot product and the scaling factor, respectively;
then, by fusing the multimodal embedding list of i_j with these weights, the personalized embedding of i_j with respect to u_i is obtained as v_item_final = Σ_m a_m f_m.
6. The user-aware multimodal cartoon illustration recommendation system with style features of claim 1, wherein the multimodal feature crossing specifically comprises the following steps:
after the illustration feature vector is obtained through the user-aware multimodal contribution measurement fusion mechanism, it is concatenated with the user feature vector and fed into a DCNv2-based feature-crossing module to obtain the user-illustration representation after n-order crossing:
v_0 = Concat(v_user_final, v_item_final), v_n = DCN(v_0),
where v_user_final denotes the user feature vector and v_item_final denotes the user-aware multimodal fusion feature vector of the illustration;
the (l+1)-th cross layer of DCNv2 is computed as:
v_{l+1} = v_0 ⊙ (W_l v_l + b_l) + v_l,
where v_0 is the base layer containing the first-order original features, v_l and v_{l+1} are respectively the input and output of the (l+1)-th cross layer, and W_l and b_l are the weight matrix and bias vector to be learned; for an l-layer cross network the highest polynomial order is l+1, and the network contains all feature crossings from first order up to the highest order.
7. The user-aware multimodal cartoon illustration recommendation system with style features of claim 1, wherein the training set includes n instances (x, y), representing n behavior records of users for illustrations, where x is a data record with m fields involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior: y = 1 indicates that the user browsed and collected the item, and y = 0 indicates that the user only browsed it without collecting it.
CN202311171937.0A 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features Pending CN117271824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311171937.0A CN117271824A (en) 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311171937.0A CN117271824A (en) 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features

Publications (1)

Publication Number Publication Date
CN117271824A true CN117271824A (en) 2023-12-22

Family

ID=89201887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311171937.0A Pending CN117271824A (en) 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features

Country Status (1)

Country Link
CN (1) CN117271824A (en)

Similar Documents

Publication Publication Date Title
WO2020207196A1 (en) Method and apparatus for generating user tag, storage medium and computer device
Raza et al. Progress in context-aware recommender systems—An overview
Khan et al. CNN with depthwise separable convolutions and combined kernels for rating prediction
Luo et al. Personalized recommendation by matrix co-factorization with tags and time information
CN108154395B (en) Big data-based customer network behavior portrait method
CN106599022B (en) User portrait forming method based on user access data
CN107220365B (en) Accurate recommendation system and method based on collaborative filtering and association rule parallel processing
EP3143523B1 (en) Visual interactive search
Ding et al. Learning topical translation model for microblog hashtag suggestion
US10606883B2 (en) Selection of initial document collection for visual interactive search
CN106599226A (en) Content recommendation method and content recommendation system
CN103544216A (en) Information recommendation method and system combining image content and keywords
JP2017054214A (en) Determination device, learning device, information distribution device, determination method, and determination program
Duan et al. A hybrid intelligent service recommendation by latent semantics and explicit ratings
CN116975615A (en) Task prediction method and device based on video multi-mode information
Xu et al. Do adjective features from user reviews address sparsity and transparency in recommender systems?
CN112749330A (en) Information pushing method and device, computer equipment and storage medium
Li et al. From edge data to recommendation: A double attention-based deformable convolutional network
JP2017201535A (en) Determination device, learning device, determination method, and determination program
CN112989182A (en) Information processing method, information processing apparatus, information processing device, and storage medium
WO2023185320A1 (en) Cold start object recommendation method and apparatus, computer device and storage medium
Chen et al. Exploiting visual contents in posters and still frames for movie recommendation
CN115269984A (en) Professional information recommendation method and system
CN115238191A (en) Object recommendation method and device
CN117271824A (en) User-aware multimodal cartoon illustration recommendation system with style features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination