CN117271824A - User-aware multimodal cartoon illustration recommendation system with style features - Google Patents

User-aware multimodal cartoon illustration recommendation system with style features

Info

Publication number
CN117271824A
CN117271824A (Application CN202311171937.0A)
Authority
CN
China
Prior art keywords
user
feature
cartoon
features
style
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311171937.0A
Other languages
Chinese (zh)
Inventor
康雁
林豪
李卓伦
范宝辰
李天靖
杨明健
郑敬宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202311171937.0A priority Critical patent/CN117271824A/en
Publication of CN117271824A publication Critical patent/CN117271824A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a user-aware multimodal cartoon illustration recommendation system with style features, comprising the following steps: acquiring user-illustration data pairs; extracting semantic features with an illustration image encoder, extracting painting-style feature vectors, and adding multi-angle text semantic features that incorporate cartoon-domain knowledge; modeling the contribution of each modality with an attention mechanism to identify how the different modalities of a work influence each user's interactions; crossing the fused multimodal features with the user features to obtain a user-illustration representation; and finally outputting, through a DNN and a Sigmoid activation, the probability that the user will collect the illustration. By training an image encoder that outputs both style and semantic features and a text encoder that integrates domain knowledge, and by adding an attention-based user-aware fusion mechanism and a DCN-based multimodal feature-crossing module, the system builds a multimodal cartoon illustration recommendation model and improves recommendation performance.

Description

User-aware multimodal cartoon illustration recommendation system with style features
Technical Field
The invention relates to the field of recommendation algorithms, in particular to a user-aware multimodal cartoon illustration recommendation system with style features.
Background
With the vigorous development of the animation industry and users' growing enthusiasm for cartoon content, research related to cartoons has become a field of great interest and attracts the attention of a wide audience of fans. With the recent rise of generative models and the increasing maturity of their fine-tuning schemes, high-quality anime illustrations can now be produced almost without a barrier to entry, even without expert drawing knowledge, which will cause an exponential growth in the number of illustrations. Users therefore face a real challenge in finding, among a vast number of illustrations, those that match their interests and preferences.
With the development of the Web and storage systems, the amount of available information grows rapidly. To address information overload, recommendation systems have proven to be an effective solution in many fields such as e-commerce. Collaborative filtering (CF) is a method commonly used in recommendation systems that exploits users' past interaction behavior; however, because interaction data are very sparse, CF methods struggle to accurately capture user preferences and item attributes when there is little or no interaction. To alleviate this difficulty, many studies have exploited additional information about users and items, such as social networks and contextual features. In recent years, multimodal recommendation systems (MRS) have received widespread attention in academia and industry. A multimodal recommendation system uses the different modal features of an item (e.g., visual and textual modalities) together with the interaction information to better understand item properties not revealed by the interactions alone, thereby capturing user preferences more accurately. Much previous work has demonstrated the effectiveness of multimodal recommendation systems, for example in fashion, news, short video, music, and food recommendation.
Although recommendation systems have been applied in many fields, there is currently no effective illustration recommendation system. Cartoon illustrations are enormous in number and diverse in type; as an important component of cartoon works, they cover rich and varied styles and themes. Illustration recommendation must attend not only to textual features of the illustration entity, such as the title, text tags, and creator information, but also to the semantics and the style of the image itself. The invention therefore provides a user-aware multimodal cartoon illustration recommendation system with style features.
Disclosure of Invention
The aim of the invention is as follows: to address the problem that no effective illustration recommendation system currently exists to help users find illustrations among massive collections, a user-aware multimodal cartoon illustration recommendation system with style features is provided. It trains an image encoder that outputs both style and semantic features and a text encoder that integrates domain knowledge, and adds an attention-based user-aware fusion mechanism and a DCN-based multimodal feature-crossing module, thereby building a multimodal cartoon illustration recommendation model and improving recommendation performance.
The technical scheme of the invention is as follows:
a user perception multi-mode cartoon recommendation system with style characteristics is characterized by comprising the following steps:
user and illustration object input: acquiring a data set pair of a user and an illustration, and dividing the data set pair into a training set and a testing set;
multi-modal feature extraction: extracting semantic features by using an inserting image feature encoder, extracting painting style feature vectors, and adding multi-angle text semantic features fused with knowledge in the cartoon field;
user-aware multimodal feature contribution fusion mechanism: modeling the multi-modal contribution degree by using an attention mechanism, carrying out attention weighting on the user characteristic vector and the multi-modal characteristics of the artwork, and identifying the influence of the user on the interaction between different modalities of the work;
multimodal feature crossing: the fused multi-mode features and the user features are subjected to feature crossing through a DCN module, and user-painting characterization after feature crossing is obtained;
finally, activating and outputting the probability of collecting the picture-inserting object by the user through DNN and Sigmoid.
Further, the multimodal feature extraction specifically includes the following steps:
constructing a large cartoon-image multi-class, multi-label dataset by pairing each illustration image with its manually annotated text tags, and training on the tag-prediction task; that is, building a supervised ResNet classification model whose proxy task is multi-class, multi-label prediction on large-scale cartoon illustrations;
after the model converges, the model structure is modified by removing the original classification head and adding a semantic feature output and a style feature output.
Further, the semantic feature output includes: the output of a convolution layer near the end of the network is passed through a parameter-free global pooling layer and reduced to 1024 dimensions, which serves as the image semantic feature output;
the style feature output includes: starting from the input layer of the network, the activation outputs of the first few layers represent low-level characteristics such as edges and textures, and the image content is represented by the values of the feature maps;
a Gram matrix containing this information can be computed by taking the outer product of the feature vectors at each spatial position and averaging over all positions; for a given layer l, the Gram matrix is computed as:
G^l_{cd} = (1/IJ) * Σ_{i,j} F^l_{ijc} * F^l_{ijd},
where F^l_{ijc} denotes the value of feature map c at position (i, j) in layer l of the network, F^l_{ijd} denotes the value of feature map d at position (i, j) in the same layer, and IJ denotes the size of the feature map, i.e. the number of positions (i, j) on it.
Further, the multimodal feature extraction also includes the following steps:
multi-angle domain text pair construction: building a sentence-pair dataset from public information collected from websites including Bangumi, Pixiv, AnimeList, Wikipedia, and Moegirl; the sentence pairs cover multilingual correspondences of domain terms, domain terms with their definitions, and links between entities in the domain; the links between entities in the domain include work-character relations and character-character relations;
fine-tuned text encoder: the text encoder uses the sentence-transformers architecture with the pretrained weights distiluse-base-multilingual-cased-v2; it is fine-tuned on this dataset with a learning rate of 5e-5, and the loss function is MultipleNegativesRankingLoss, which pulls matching text pairs closer together.
Further, the user-aware multimodal feature contribution fusion mechanism specifically includes the following steps:
with the user feature vector v_user_final as the query, the four multimodal features of the illustration are scaled to the same dimension as the user feature vector and stacked into a multimodal feature list,
where f_label_semantics denotes the text semantic features, f_image_semantics denotes the image semantic features, f_image_style denotes the image style features, and Embedding(f_item_other) denotes the remaining features of the illustration;
the attention mechanism computes the influence of each modality f_m on each interaction between user u_i and work i_j as a_m = softmax((v_user_final · f_m) / √d),
where (·) and √d denote the dot product and the scaling factor, respectively;
then, by fusing the multimodal embedding list of i_j with these weights, the personalized embedding of i_j with respect to u_i is obtained as v_item_final = Σ_m a_m f_m.
Further, the multimodal feature crossing specifically includes the following steps:
after the illustration feature vector is obtained through the user-aware multimodal contribution measurement fusion mechanism, it is concatenated with the user feature vector and fed into a DCNv2-based feature-crossing module to obtain the user-illustration representation after n-order crossing:
v_0 = Concat(v_user_final, v_item_final), v_n = DCN(v_0),
where v_user_final denotes the user feature vector and v_item_final denotes the user-aware multimodal fusion feature vector of the illustration;
the (l+1)-th cross layer of DCNv2 is computed as:
v_{l+1} = v_0 ⊙ (W_l v_l + b_l) + v_l,
where v_0 is the base layer containing the first-order original features, v_l and v_{l+1} are respectively the input and output of the (l+1)-th cross layer, and W_l and b_l are the weight matrix and bias vector to be learned; for an l-layer cross network the highest polynomial order is l+1, and the network contains all feature crossings from first order up to the highest order.
Further, the training set includes n instances (x, y), representing n behavior records of users for illustrations, where x is a data record with m fields involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior: y = 1 indicates that the user browsed and collected the item, and y = 0 indicates that the user only browsed it without collecting it.
Compared with the prior art, the invention has the following beneficial effects:
1. The user-aware multimodal cartoon illustration recommendation system with style features trains an image encoder that outputs both style and semantic features and a text encoder that integrates domain knowledge, and adds an attention-based user-aware fusion mechanism together with a DCN-based multimodal feature-crossing module, thereby building a multimodal cartoon illustration recommendation model that provides users with personalized and accurate illustration recommendations;
2. The system adds a DCNv2 module to explicitly model bounded-order automatic crossings of the multimodal features, replacing manual feature crossing;
3. The system introduces an attention-based user-aware fusion mechanism together with the DCN-based multimodal feature-crossing module, jointly considering the user's preference behavior and the influence of each modality on that preference.
Drawings
FIG. 1 is a flow chart of the user-aware multimodal cartoon illustration recommendation system with style features.
FIG. 2 is a statistical plot of interaction frequency from the experimental validation of the user-aware multimodal cartoon illustration recommendation system with style features.
Detailed Description
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The features and capabilities of the present invention are described in further detail below in connection with examples.
Referring to FIGS. 1-2, a user-aware multimodal cartoon illustration recommendation system with style features comprises the following steps:
User and illustration input: acquiring user-illustration data pairs and dividing them into a training set and a test set. The training set includes n instances (x, y), representing n behavior records of users for illustrations, where x is a data record with m fields involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior: y = 1 indicates that the user browsed and collected the item, and y = 0 indicates that the user only browsed it without collecting it.
Multimodal feature extraction: extracting semantic features with the illustration image encoder, extracting painting-style feature vectors, and adding multi-angle text semantic features that incorporate cartoon-domain knowledge.
The multimodal feature extraction specifically includes the following steps:
constructing a large cartoon-image multi-class, multi-label dataset by pairing each illustration image with its manually annotated text tags, and training on the tag-prediction task; that is, building a supervised ResNet classification model whose proxy task is multi-class, multi-label prediction on large-scale cartoon illustrations. The tags mainly cover poses, characters, painting-style categories, physical attributes of the characters such as source work, clothing, and hair color, and descriptions of other image content.
After the model converges, the model structure is modified by removing the original classification head and adding a semantic feature output and a style feature output.
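A minimal Keras sketch of this two-output image encoder follows. It assumes a ResNet50 backbone and specific layer choices (conv4_block6_out for the 1024-dimensional semantic output, the three conv2 block outputs, each with 256 channels, for the style branch); the description only specifies "ResNet", a near-final 1024-dimensional pooled output, and the first few convolution layers, so these choices are assumptions.

```python
import tensorflow as tf

def build_illustration_encoder(num_tags):
    """ResNet trained on the multi-label tag-prediction proxy task, then re-exposed
    with a semantic output and the early feature maps used for the style branch."""
    backbone = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                              input_shape=(224, 224, 3))

    # Proxy task: supervised multi-class, multi-label tag prediction (one sigmoid per tag).
    pooled = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    tag_probs = tf.keras.layers.Dense(num_tags, activation="sigmoid")(pooled)
    classifier = tf.keras.Model(backbone.input, tag_probs)

    # After convergence, drop the classification head and expose:
    #  - a 1024-dimensional semantic output via parameter-free global pooling of a late conv layer,
    #  - early conv activations (256 channels each), later turned into the Gram-based style feature.
    semantic = tf.keras.layers.GlobalAveragePooling2D()(
        backbone.get_layer("conv4_block6_out").output)  # 1024 channels in ResNet50
    early = [backbone.get_layer(name).output
             for name in ("conv2_block1_out", "conv2_block2_out", "conv2_block3_out")]
    encoder = tf.keras.Model(backbone.input, [semantic] + early)
    return classifier, encoder
```

The classifier is trained on the tag dataset with a per-tag binary cross-entropy loss; once it converges, only the encoder outputs are used downstream.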
The semantic feature output includes: the output of a convolution layer near the end of the network is passed through a parameter-free global pooling layer and reduced to 1024 dimensions, which serves as the image semantic feature output.
The style feature output includes: starting from the input layer of the network, the activation outputs of the first few layers represent low-level characteristics such as edges and textures, and the image content is represented by the values of the feature maps.
A Gram matrix containing this information can be computed by taking the outer product of the feature vectors at each spatial position and averaging over all positions; for a given layer l, the Gram matrix is computed as:
G^l_{cd} = (1/IJ) * Σ_{i,j} F^l_{ijc} * F^l_{ijd},
where F^l_{ijc} denotes the value of feature map c at position (i, j) in layer l of the network, F^l_{ijd} denotes the value of feature map d at position (i, j) in the same layer, and IJ denotes the size of the feature map, i.e. the number of positions (i, j) on it. The purpose of the formula is to compute the correlation between feature maps c and d: it sums the element-wise products of the two feature maps over all positions and divides by the total number of positions IJ.
Specifically, in our model, we take the output feature maps of the first three convolution layers, compute a Gram matrix for each, max-pool each Gram matrix to a 256-dimensional output, stack the three 256-dimensional outputs, and apply max pooling again to obtain the final 256-dimensional image style feature.
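A minimal TensorFlow sketch of this style-feature computation follows; the exact way each Gram matrix is max-pooled down to 256 dimensions is not spelled out in the description, so the reduce_max over one Gram axis and over the three layers is an assumption.

```python
import tensorflow as tf

def gram_matrix(feature_map):
    """Gram matrix of a conv feature map of shape (batch, H, W, C):
    G[c, d] = (1/IJ) * sum over positions of F[i, j, c] * F[i, j, d]."""
    gram = tf.einsum("bijc,bijd->bcd", feature_map, feature_map)
    num_positions = tf.cast(tf.shape(feature_map)[1] * tf.shape(feature_map)[2], tf.float32)
    return gram / num_positions

def style_feature(early_feature_maps):
    """Combine the Gram statistics of the first three conv layers into one style vector."""
    pooled = [tf.reduce_max(gram_matrix(f), axis=-1) for f in early_feature_maps]  # three (batch, C)
    stacked = tf.stack(pooled, axis=1)     # (batch, 3, C)
    return tf.reduce_max(stacked, axis=1)  # (batch, C); 256-dimensional when C = 256
```

With the 256-channel conv2 block outputs from the encoder sketch above, this yields the 256-dimensional style feature described in the text.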
Multi-angle domain text pair construction: a sentence-pair dataset is built from public information collected from websites including Bangumi, Pixiv, AnimeList, Wikipedia, and Moegirl; the sentence pairs cover multilingual correspondences of domain terms, domain terms with their definitions, and links between entities in the domain; the links between entities in the domain include work-character relations and character-character relations.
Fine-tuned text encoder: the text encoder uses the sentence-transformers architecture with the pretrained weights distiluse-base-multilingual-cased-v2; it is fine-tuned on this dataset with a learning rate of 5e-5, and the loss function is MultipleNegativesRankingLoss, which pulls matching text pairs closer together.
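A minimal sketch of this fine-tuning step with the sentence-transformers library follows; the two example pairs are hypothetical placeholders for the collected domain text pairs.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical domain text pairs: multilingual term correspondences, term-definition pairs,
# and work-character / character-character links harvested from the listed websites.
train_examples = [
    InputExample(texts=["魔法少女まどか☆マギカ", "Puella Magi Madoka Magica"]),
    InputExample(texts=["Homura Akemi", "Puella Magi Madoka Magica"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune at the learning rate given in the description (5e-5).
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 5e-5},
)
```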
User-aware multimodal feature contribution fusion mechanism: modeling the contribution of each modality with an attention mechanism, attention-weighting the user feature vector against the multimodal features of the illustration, and identifying how the different modalities of a work influence the user's interactions.
Different modalities contribute differently to a user's preference behavior; for example, user u_i may like work i_j 80% because of its style, 10% because of its image semantic content, and 10% because of the semantic features of the manually annotated text tags of its image.
The user-aware multimodal feature contribution fusion mechanism specifically includes the following steps:
with the user feature vector v_user_final as the query, the four multimodal features of the illustration are scaled to the same dimension as the user feature vector and stacked into a multimodal feature list,
where f_label_semantics denotes the text semantic features, f_image_semantics denotes the image semantic features, f_image_style denotes the image style features, and Embedding(f_item_other) denotes the remaining features of the illustration;
the attention mechanism computes the influence of each modality f_m on each interaction between user u_i and work i_j as a_m = softmax((v_user_final · f_m) / √d),
where (·) and √d denote the dot product and the scaling factor, respectively;
then, by fusing the multimodal embedding list of i_j with these weights, the personalized embedding of i_j with respect to u_i is obtained as v_item_final = Σ_m a_m f_m.
Through this attention-based User-Aware Multimodal Contribution Measurement mechanism (UAMCM), the user feature vector is attention-weighted against the multimodal feature set of the illustration, finally yielding the user-aware multimodal fusion feature vector of the illustration, referred to below as the illustration feature vector:
v_item_final = UAMCM(v_user_final, v_item_multi_modal).
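A minimal TensorFlow sketch of the UAMCM fusion, assuming the softmax-normalized scaled dot-product form of attention described above:

```python
import tensorflow as tf

def uamcm(user_vec, modal_feats):
    """User-Aware Multimodal Contribution Measurement (sketch).

    user_vec:    (batch, dim) user feature vector, used as the attention query.
    modal_feats: list of the four modality features (text semantics, image semantics,
                 image style, remaining features), each already projected to (batch, dim).
    Returns the user-aware fused illustration feature vector of shape (batch, dim).
    """
    stacked = tf.stack(modal_feats, axis=1)                             # (batch, 4, dim)
    dim = tf.cast(tf.shape(user_vec)[-1], tf.float32)
    scores = tf.einsum("bd,bmd->bm", user_vec, stacked) / tf.sqrt(dim)  # scaled dot product
    weights = tf.nn.softmax(scores, axis=-1)                            # per-modality contribution
    return tf.einsum("bm,bmd->bd", weights, stacked)                    # attention-weighted fusion
```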
Modeling item features with user-specific weights alone is not sufficient, because the modalities do not act on a user's preference independently: assigning different weights to single modalities may still fail to express the user's preference well, whereas combinatorial crossings between multiple modalities can model it better. For example, a user may collect an illustration because the work carries both modality-A and modality-B semantics; even if the user's liking for A alone is weaker than for C, the user still shows preference behavior under the joint effect of the A-and-B combination, which suggests that second- and higher-order crossings of features such as style and content are an important component of preference behavior. The above only uses second-order crossings of the multimodal features to illustrate the necessity of multimodal feature-combination crossing in click-through-rate estimation; in real scenarios we need to model the joint influence of multiple features appearing simultaneously, and such patterns cannot be fully mined by expert hand-crafting, so it is essential for the model to perform automatic feature crossing.
Multimodal feature crossing: the fused multimodal features and the user features are passed through a DCN module for feature crossing, yielding the crossed user-illustration representation.
The multimodal feature crossing specifically includes the following steps:
after the illustration feature vector is obtained through the user-aware multimodal contribution measurement fusion mechanism, it is concatenated with the user feature vector and fed into a DCNv2-based feature-crossing module to obtain the user-illustration representation after n-order crossing:
v_0 = Concat(v_user_final, v_item_final), v_n = DCN(v_0),
where v_user_final denotes the user feature vector and v_item_final denotes the user-aware multimodal fusion feature vector of the illustration;
the (l+1)-th cross layer of DCNv2 is computed as:
v_{l+1} = v_0 ⊙ (W_l v_l + b_l) + v_l,
where v_0 is the base layer containing the first-order original features, v_l and v_{l+1} are respectively the input and output of the (l+1)-th cross layer, and W_l and b_l are the weight matrix and bias vector to be learned; for an l-layer cross network the highest polynomial order is l+1, and the network contains all feature crossings from first order up to the highest order.
Finally, the probability that the user will collect the illustration is output through a DNN and a Sigmoid activation.
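A minimal Keras sketch of this prediction head, covering the DCNv2 cross layers and the final DNN with Sigmoid output; the number of cross layers and the width of the DNN layer are illustrative assumptions.

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    """One DCNv2 cross layer: v_{l+1} = v_0 * (W_l v_l + b_l) + v_l."""
    def build(self, input_shape):
        dim = input_shape[0][-1]
        self.w = self.add_weight(name="w", shape=(dim, dim))
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")

    def call(self, inputs):
        v0, vl = inputs
        return v0 * (tf.matmul(vl, self.w) + self.b) + vl

def build_prediction_head(dim, num_cross_layers=3):
    """Cross the concatenated user / illustration vectors, then DNN + Sigmoid."""
    v_user = tf.keras.Input(shape=(dim,), name="v_user_final")
    v_item = tf.keras.Input(shape=(dim,), name="v_item_final")
    v0 = tf.keras.layers.Concatenate()([v_user, v_item])
    vl = v0
    for _ in range(num_cross_layers):
        vl = CrossLayer()([v0, vl])
    hidden = tf.keras.layers.Dense(128, activation="relu")(vl)     # DNN layer
    prob = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)  # P(user collects the illustration)
    return tf.keras.Model([v_user, v_item], prob)
```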
In another specific embodiment, the training dataset consists of n instances (x, y), representing n behavior records of users for illustrations. Here x is a data record with m fields, typically involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior (y = 1 means the user browsed and collected the item; y = 0 means the user only browsed it without collecting it).
Each instance is converted to (x, y), where the feature set x = [x_user_interest_label, f_image_semantics, ...] is provided as a dictionary in which each feature is a key-value pair of feature name and feature value, e.g. {feature_name_1: feature_value_1}. The specific inputs are shown in Table 1.
TABLE 1 input data
Table 2 input data processing
The above instances are converted into vectors acceptable to the model. The multimodal inputs of an illustration consist of the manually annotated illustration tags x_illustration_label, the illustration image x_illustration_image, and the remaining discrete illustration features x_illustration_other. x_illustration_image passes through the image encoder W_image, which outputs the image style feature f_image_style and the image semantic feature f_image_semantics; x_illustration_label passes through the text encoder W_text, which outputs the text semantic feature f_label_semantics; the remaining discrete features are likewise converted by an embedding layer into Embedding(f_item_other). In summary, these raw inputs are transformed as in Table 2 and Table 3 to obtain the final model input vector v_item_multi_modal.
TABLE 3 discrete input processing
The multimodal feature set of the final illustration is represented as v_item_multi_modal = {f_label_semantics, f_image_semantics, f_image_style, Embedding(f_item_other)}.
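The following sketch illustrates this input format and the assembly of v_item_multi_modal; since the bodies of Tables 1-3 are not reproduced here, the field names and placeholder values are hypothetical.

```python
# Hypothetical single training instance in the dictionary format described above.
instance_x = {
    "x_user_interest_label": ["watercolor", "fantasy"],    # user-side discrete features
    "x_illustration_label":  ["1girl", "school_uniform"],  # manual text tags of the illustration
    "x_illustration_image":  "illust_001.png",             # raw image, fed to the image encoder
    "x_illustration_other":  {"creator_id": 42},           # remaining discrete features
}
y = 1  # the user browsed and collected this illustration

def build_item_multi_modal(instance, image_encoder, text_encoder, other_embedding):
    """Assemble the multimodal feature set of one illustration from its raw fields."""
    f_image_semantics, f_image_style = image_encoder(instance["x_illustration_image"])
    f_label_semantics = text_encoder(instance["x_illustration_label"])
    return {
        "f_label_semantics": f_label_semantics,
        "f_image_semantics": f_image_semantics,
        "f_image_style":     f_image_style,
        "f_item_other":      other_embedding(instance["x_illustration_other"]),
    }
```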
The personalized recommendation problem is treated as a binary classification problem; the goal is to build a prediction function ŷ = f(x) that estimates the probability that a user will collect a particular illustration under the given feature environment.
Experiment verification
Experimental setup
Dataset: since there is currently no high-quality public cartoon illustration recommendation dataset containing multimodal information, we construct a cartoon illustration recommendation dataset from one year of log information (January 1, 2021 to January 1, 2022) and illustration metadata of a commercial cartoon illustration website. The dataset is split by time: the first 10 months are used as the training set and the last 2 months as the test set (note that no special processing is applied to guarantee that the users and illustrations in the test set also appear in the training set, i.e. the test set contains new users and new items). Some statistical analyses were performed on the dataset: Table 4 reports the numbers of users, illustrations, and interactions for the whole dataset, and FIG. 2 shows, from the user's perspective, the interaction counts and the number of users at each interaction count.
Table 4 dataset statistics
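A small pandas sketch of the time-based split described above; the log schema (column names, file name) is assumed, since only the date range and the 10-month / 2-month split are given.

```python
import pandas as pd

# Hypothetical interaction log covering 2021-01-01 to 2022-01-01.
logs = pd.read_csv("interactions.csv", parse_dates=["timestamp"])
logs = logs[(logs["timestamp"] >= "2021-01-01") & (logs["timestamp"] < "2022-01-01")]

# First 10 months for training, last 2 months for testing.
split_point = pd.Timestamp("2021-11-01")
train = logs[logs["timestamp"] < split_point]
test = logs[logs["timestamp"] >= split_point]  # may contain users and illustrations unseen in training
```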
The evaluation metrics for comparison are Area Under the Curve (AUC), Binary Cross-Entropy Loss (BCE Loss), and the number of model parameters.
All models are optimized with Adam (Adaptive Moment Estimation) at a learning rate of 1e-3. To avoid the one-epoch overfitting phenomenon common in recommendation systems, every model is trained for only 1 epoch on the training set before testing. All models use an embedding-layer output dimension of 8 and an L2 regularization coefficient of 1e-3; the remaining structural hyperparameters keep their default settings, and the parameter counts of all compared models are kept at the same order of magnitude.
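A minimal sketch of this training setup in Keras; `model`, `train_inputs`, and `train_labels` stand for the assembled recommendation model and the preprocessed training data.

```python
import tensorflow as tf

def compile_and_train(model, train_inputs, train_labels):
    """Training setup described above: Adam at 1e-3, binary cross-entropy loss, AUC metric."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    # A single epoch avoids the one-epoch overfitting phenomenon noted in the text.
    model.fit(train_inputs, train_labels, epochs=1)
    return model
```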
Since there is not yet much prior work on cartoon illustration recommendation, and we model the problem as a click-through-rate-style estimation problem, we compare our model on the above metrics against SOTA models for general click-through-rate estimation published in recent years.
Dual-stream networks that improve on traditional FM by combining it with deep learning:
1) xDeepFM [KDD 2018] proposes a Compressed Interaction Network (CIN) to learn high-order feature interactions. It can effectively learn bounded-degree feature interactions at the vector level.
2) DIFM [IJCAI 2020] adaptively learns flexible representations of a given feature from different input instances by means of a Dual Factor Estimating Network (Dual-FEN). It can efficiently learn input-aware factors (used to re-weight the original feature representations) at both the bit level and the vector level.
3) FinalMLP [AAAI 2023] proposes stream-specific feature gating and a multi-head bilinear fusion module to enhance the input differentiation of the two streams and enable stream-level interactions; its surprisingly strong results challenge the validity and necessity of existing explicit feature-interaction modeling studies.
Single-stream networks with automatic feature crossing:
1) DCN [ADKDD 2017] proposes a Cross Network with linear time and space complexity to replace the Wide part of the Wide & Deep model and cross features automatically, forming the Deep & Cross Network.
2) FiBiNET [RecSys 2019] proposes the SENET module to dynamically learn feature importance, and introduces three types of bilinear interaction layers to learn feature interactions instead of computing them with Hadamard or inner products.
3) DCN V2 [WWW 2021] proposes an expressive yet simple way to model explicit cross features. Observing the low-rank nature of the weight matrices in the cross network, it also proposes a mixture of low-rank DCNs (DCN-Mix) to achieve a better trade-off between model performance and latency.
We implement our model and the above comparison models based on the TensorFlow framework.
Comparison results
Table 5 baseline comparison
The results of the performance comparison experiments are summarized in Table 5. We find that, by integrating multimodal information, our proposed model clearly outperforms the other, non-multimodal recommendation systems on both the AUC and BCE Loss metrics, with improvements of 14.95% and 5.152%, respectively, over the best comparison model, which verifies the advantage of fusing different modalities. Information from different modalities yields a more comprehensive representation of an item, thereby enhancing the learning of user preferences and, to some extent, resisting the data sparsity caused by long-tailed data (as the dataset statistics show, the data follow a typical long-tailed distribution). Users' preferences for illustrations often draw on richer content information than just the tags they follow; the visual appeal of the illustration itself is an important part of preference behavior. The semantic and style information of the illustration image can therefore enrich the recommended illustration representation. Our method fuses textual and visual semantic and style information into illustration representation learning, models their internal correlations, improves the understanding of illustration content, achieves more accurate interest matching, and obtains better performance.
It is worth noting that the improvement in BCE Loss is much larger relative to the other models. We believe this is because the tendency to overfit is more pronounced in the recommendation field than in other fields: despite the addition of robust multimodal features, overfitting still occurs, which shows up as improvements in BCE Loss that are not matched by improvements of the same magnitude in AUC. This is not unique to our model; DIFM and DCN V2 show the same behavior relative to the other models.
Our model is not optimal in terms of parameter count, but it remains of the same order of magnitude as the other models: adding the multimodal features mainly increases the parameters of the embedding layers, which raises the overall parameter count relative to the other models; however, because the extracted features are dense and of limited dimensionality (the largest, the image semantic feature, is 1024-dimensional), this does not cause problems with model efficiency.
The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.

Claims (7)

1. A user-aware multimodal cartoon illustration recommendation system with style features, characterized by comprising the following steps:
user and illustration input: acquiring user-illustration data pairs and dividing them into a training set and a test set;
multimodal feature extraction: extracting semantic features with the illustration image encoder, extracting painting-style feature vectors, and adding multi-angle text semantic features that incorporate cartoon-domain knowledge;
user-aware multimodal feature contribution fusion mechanism: modeling the contribution of each modality with an attention mechanism, attention-weighting the user feature vector against the multimodal features of the illustration, and identifying how the different modalities of a work influence the user's interactions;
multimodal feature crossing: passing the fused multimodal features and the user features through a DCN module for feature crossing, obtaining the crossed user-illustration representation;
finally, outputting, through a DNN and a Sigmoid activation, the probability that the user will collect the illustration.
2. The user-aware multimodal cartoon illustration recommendation system with style features of claim 1, wherein the multimodal feature extraction specifically comprises the following steps:
constructing a large cartoon-image multi-class, multi-label dataset by pairing each illustration image with its manually annotated text tags, and training on the tag-prediction task; that is, building a supervised ResNet classification model whose proxy task is multi-class, multi-label prediction on large-scale cartoon illustrations;
after the model converges, modifying the model structure by removing the original classification head and adding a semantic feature output and a style feature output.
3. The user-aware multimodal cartoon illustration recommendation system with style features of claim 2, wherein the semantic feature output comprises: the output of a convolution layer near the end of the network is passed through a parameter-free global pooling layer and reduced to 1024 dimensions, which serves as the image semantic feature output;
the style feature output comprises: starting from the input layer of the network, the activation outputs of the first few layers represent low-level characteristics such as edges and textures, and the image content is represented by the values of the feature maps;
a Gram matrix containing this information can be computed by taking the outer product of the feature vectors at each spatial position and averaging over all positions; for a given layer l, the Gram matrix is computed as:
G^l_{cd} = (1/IJ) * Σ_{i,j} F^l_{ijc} * F^l_{ijd},
where F^l_{ijc} denotes the value of feature map c at position (i, j) in layer l of the network, F^l_{ijd} denotes the value of feature map d at position (i, j) in the same layer, and IJ denotes the size of the feature map, i.e. the number of positions (i, j) on it.
4. The user-aware multimodal cartoon illustration recommendation system with style features of claim 2 or 3, wherein the multimodal feature extraction further comprises the following steps:
multi-angle domain text pair construction: building a sentence-pair dataset from public information collected from websites including Bangumi, Pixiv, AnimeList, Wikipedia, and Moegirl; the sentence pairs cover multilingual correspondences of domain terms, domain terms with their definitions, and links between entities in the domain; the links between entities in the domain include work-character relations and character-character relations;
fine-tuned text encoder: the text encoder uses the sentence-transformers architecture with the pretrained weights distiluse-base-multilingual-cased-v2; it is fine-tuned on this dataset with a learning rate of 5e-5, and the loss function is MultipleNegativesRankingLoss, which pulls matching text pairs closer together.
5. The system of claim 1, wherein the user-aware multimodal feature contribution fusion mechanism specifically comprises the following steps:
with the user feature vector v_user_final as the query, the four multimodal features of the illustration are scaled to the same dimension as the user feature vector and stacked into a multimodal feature list,
where f_label_semantics denotes the text semantic features, f_image_semantics denotes the image semantic features, f_image_style denotes the image style features, and Embedding(f_item_other) denotes the remaining features of the illustration;
the attention mechanism computes the influence of each modality f_m on each interaction between user u_i and work i_j as a_m = softmax((v_user_final · f_m) / √d),
where (·) and √d denote the dot product and the scaling factor, respectively;
then, by fusing the multimodal embedding list of i_j with these weights, the personalized embedding of i_j with respect to u_i is obtained as v_item_final = Σ_m a_m f_m.
6. The user-aware multimodal cartoon illustration recommendation system with style features of claim 1, wherein the multimodal feature crossing specifically comprises the following steps:
after the illustration feature vector is obtained through the user-aware multimodal contribution measurement fusion mechanism, it is concatenated with the user feature vector and fed into a DCNv2-based feature-crossing module to obtain the user-illustration representation after n-order crossing:
v_0 = Concat(v_user_final, v_item_final), v_n = DCN(v_0),
where v_user_final denotes the user feature vector and v_item_final denotes the user-aware multimodal fusion feature vector of the illustration;
the (l+1)-th cross layer of DCNv2 is computed as:
v_{l+1} = v_0 ⊙ (W_l v_l + b_l) + v_l,
where v_0 is the base layer containing the first-order original features, v_l and v_{l+1} are respectively the input and output of the (l+1)-th cross layer, and W_l and b_l are the weight matrix and bias vector to be learned; for an l-layer cross network the highest polynomial order is l+1, and the network contains all feature crossings from first order up to the highest order.
7. The user-aware multimodal cartoon illustration recommendation system with style features of claim 1, wherein the training set includes n instances (x, y), representing n behavior records of users for illustrations, where x is a data record with m fields involving a user-illustration pair, and y ∈ {0,1} is the label for the user's click behavior: y = 1 indicates that the user browsed and collected the item, and y = 0 indicates that the user only browsed it without collecting it.
CN202311171937.0A 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features Pending CN117271824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311171937.0A CN117271824A (en) 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311171937.0A CN117271824A (en) 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features

Publications (1)

Publication Number Publication Date
CN117271824A true CN117271824A (en) 2023-12-22

Family

ID=89201887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311171937.0A Pending CN117271824A (en) 2023-09-12 2023-09-12 User-aware multimodal cartoon illustration recommendation system with style features

Country Status (1)

Country Link
CN (1) CN117271824A (en)

Similar Documents

Publication Publication Date Title
WO2020207196A1 (en) Method and apparatus for generating user tag, storage medium and computer device
Raza et al. Progress in context-aware recommender systems—An overview
Khan et al. CNN with depthwise separable convolutions and combined kernels for rating prediction
Luo et al. Personalized recommendation by matrix co-factorization with tags and time information
CN108154395B (en) Big data-based customer network behavior portrait method
CN106599022B (en) User portrait forming method based on user access data
CN107220365B (en) Accurate recommendation system and method based on collaborative filtering and association rule parallel processing
EP3143523B1 (en) Visual interactive search
Ding et al. Learning topical translation model for microblog hashtag suggestion
US10606883B2 (en) Selection of initial document collection for visual interactive search
CN106599226A (en) Content recommendation method and content recommendation system
CN103544216A (en) Information recommendation method and system combining image content and keywords
JP2017054214A (en) Determination device, learning device, information distribution device, determination method, and determination program
Duan et al. A hybrid intelligent service recommendation by latent semantics and explicit ratings
CN116975615A (en) Task prediction method and device based on video multi-mode information
Xu et al. Do adjective features from user reviews address sparsity and transparency in recommender systems?
CN112749330A (en) Information pushing method and device, computer equipment and storage medium
Li et al. From edge data to recommendation: A double attention-based deformable convolutional network
JP2017201535A (en) Determination device, learning device, determination method, and determination program
CN112989182A (en) Information processing method, information processing apparatus, information processing device, and storage medium
WO2023185320A1 (en) Cold start object recommendation method and apparatus, computer device and storage medium
Chen et al. Exploiting visual contents in posters and still frames for movie recommendation
CN115269984A (en) Professional information recommendation method and system
CN115238191A (en) Object recommendation method and device
CN117271824A (en) User-aware multimodal cartoon illustration recommendation system with style features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination