CN112948708B - Short video recommendation method - Google Patents

Info

Publication number
CN112948708B
Authority
CN
China
Prior art keywords
short video
short
neural network
user
social
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110242999.0A
Other languages
Chinese (zh)
Other versions
CN112948708A (en
Inventor
肖喜
吴偌灏
夏树涛
毛科龙
江勇
王兴军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202110242999.0A
Publication of CN112948708A
Application granted
Publication of CN112948708B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

The invention provides a short video recommendation method comprising the following steps: constructing a multi-source heterogeneous graph and extracting data features from data of different modalities; constructing a hierarchical graph neural network model and inputting the data features into the model for training; and recommending short videos to target users using the trained hierarchical graph neural network model. The multi-source heterogeneous graph is constructed by combining the data features extracted from different modalities and taking the social network into account. The hierarchical graph neural network model built on this heterogeneous graph can capture the different connection and association relations among users, short videos and labels, and the customized graph neural network can learn high-level feature representations of short videos and users, giving stronger representational capability and benefiting short video recommendation.

Description

Short video recommendation method
Technical Field
The invention relates to the technical field of video recommendation, in particular to a short video recommendation method.
Background
With the increasing amount of information on the internet, recommendation systems have become an effective strategy for overcoming information overload. Recommendation systems are widely used in web applications to help users find items of interest (goods, information and the like) amid the information overload of the internet era, relieving users of the burden of excessive choice. The main idea of a recommendation system is to establish relationships between items and users and to generate the most suitable item list for a specific user, making full use of different information sources to provide accurate item prediction and recommendation.
Meanwhile, machine learning on graph structures is an emerging technology with great development prospects. Graph learning techniques can acquire the knowledge embedded in different graphs and have excelled in several fields. Most data in a recommendation system is inherently graph-structured: users, items, and users and items are connected to one another explicitly or implicitly. Because graph learning can model complex relational data, many recommendation systems based on graph learning have emerged. Graph learning techniques also help improve the interpretability of recommendation systems.
The diversity of objects in a recommendation system and the complex relationships between them give rise to a diversity of graph types. From simple tree graphs, to graphs of users alone or items alone, to bipartite graphs reflecting interaction information and multi-source heterogeneous graphs that fully incorporate auxiliary information, these graphs have injected new vitality into the development of recommendation systems. How to efficiently extract information from these graphs and use it for recommendation, however, is a new challenge.
In response to these challenges, various graph learning techniques have been proposed in recent years. Recommendation systems based on random walks capture the complex relationships among the various nodes of a graph well, but they are inefficient and hard to scale to large graphs. Moreover, unlike learning-based methods, random walks lack parameters that can be optimized toward a target objective, which greatly reduces recommendation performance. Recommendation systems based on graph factorization are simple and easy to understand but are susceptible to sparsity in the observed data. Benefiting from the rapid development of graph neural network models, a wide variety of effective recommendation systems based on graph neural networks have appeared. GCN-based recommendation systems, for example, learn user and item embeddings by using convolution and pooling operations to aggregate the neighborhood information of users or items, but GCN cannot aggregate this neighborhood information efficiently. GAT-based recommendation systems were then proposed, which combine the attention mechanism with GNNs to learn the different relevance and degree of influence of each neighbor of the target node. GAT, however, still performs poorly at learning the knowledge embedded in the graph.
In recent years, with the explosive growth of video as an information carrier, research on video recommendation has become increasingly important. Video recommendation differs from text or picture recommendation in that a video implicitly contains much information about the user's interests, which cannot be fully conveyed by the video title or cover picture alone. YouTube DNN recommends a small number of videos from a very large pool with the goal of maximizing expected watch time. Its overall structure is divided into two stages: candidate generation (recall) and ranking. In the recall stage, upload time is introduced into the input, which otherwise relates only to the user; in the ranking stage, user and video data are input together and expected watch time serves as the evaluation metric. Such methods are suited to long video recommendation. The prior art neither handles the complex social relations surrounding videos nor makes full use of multi-modal features.
The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.
Disclosure of Invention
The invention provides a short video recommendation method for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A short video recommendation method comprises the following steps: S1: constructing a multi-source heterogeneous graph and extracting data features from data of different modalities; S2: constructing a hierarchical graph neural network model, inputting the data features into the model for training, and recommending short videos to target users using the trained hierarchical graph neural network model.
Preferably, a multi-source heterogeneous graph is constructed according to the social network and the multi-source heterogeneous network information of the target user to extract the data features; the data features include textual features, visual and auditory features, social features, and relationships between short videos.
Preferably, extracting the text feature comprises: segmenting words of the title and the brief introduction of the short video to obtain segmented words; screening the words after word segmentation by using a common stop word bank to obtain words after word segmentation and screening; and the words after word segmentation screening and original labels provided by the short video author and/or the platform of the short video form text characteristic labels of the short video.
Preferably, the visual features are extracted by using Inception V3 and dimensionality reduction is performed by using a principal component analysis method; extracting the auditory features by using VGGish and reducing the dimension by adopting a principal component analysis method.
Preferably, obtaining the social features of the target user by using an attention-based social graph neural network specifically includes: projecting friend users of the target user into a new item space to distinguish the friend users from the target user; learning an influence weight of each of the friend users using the attention-based social graph neural network; weighting all the friend users according to the learned influence weights, and mapping the weighted friend users to a social space to obtain a social vector of the target user; and concatenating the social vector with the initial embedding vector of the target user to obtain the social feature.
Preferably, the relationship between the short videos is a relationship closeness degree between short videos of the same author, and a relationship closeness degree PMI (x, y) between two short videos is calculated by the following formula:
Figure BDA0002962985420000031
wherein, # (x, y) represents the number of times that the short videos x and y are watched together in a sliding play mode, that is, after the user finishes watching the x short video, the user continues to slide and finishes watching the y short video of the same author; # (x) represents the number of times the short video x has been viewed, and # (y) represents the number of times the short video y has been viewed;
and when PMI (x, y) is larger than a preset threshold value, the nodes of the two short videos are connected in the multi-source heterogeneous graph.
Preferably, the hierarchical graph neural network model is constructed on the basis of a message-passing graph convolutional network; the multi-source heterogeneous graph comprises three types of heterogeneous nodes: users, short videos and labels; message passing for a short video node is performed as follows:
$$m_{i' \to i} = \sigma\left(W\left(h_{i'} \odot h_i\right)\right)$$
Figure BDA0002962985420000032
where $h$ denotes the initial embedding, $m$ denotes the intermediate vector, and $h^{*}$ denotes the final embedding.
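The message rule above can be illustrated with a minimal Python sketch. Only the message formula comes from the text; the aggregation step is given in a formula image that is not reproduced here, so the mean-then-add aggregation below is purely an assumption, as are the choice of the logistic sigmoid for σ and the function names `message` and `aggregate`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def message(w, h_neighbor, h_target):
    """m_{i'->i} = sigma(W (h_{i'} * h_i)), with * the element-wise product.

    sigma is assumed to be the logistic sigmoid; w is a list of rows.
    """
    product = [a * b for a, b in zip(h_neighbor, h_target)]
    return [sigmoid(sum(w_rk * p for w_rk, p in zip(row, product))) for row in w]

def aggregate(w, h_target, neighbor_embeddings):
    """Hypothetical aggregation: mean of incoming messages added to h_i."""
    messages = [message(w, h_n, h_target) for h_n in neighbor_embeddings]
    mean_msg = [sum(col) / len(messages) for col in zip(*messages)]
    return [h + m for h, m in zip(h_target, mean_msg)]

# Toy example with a 2-dimensional embedding and an identity weight matrix.
w = [[1.0, 0.0], [0.0, 1.0]]
h_i = [1.0, 2.0]
neighbors = [[0.0, 0.0], [1.0, -1.0]]
h_star = aggregate(w, h_i, neighbors)
```

Because the message depends on the element-wise product of the two embeddings, a neighbor whose embedding agrees with the target in sign contributes a larger message than one that disagrees.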
Preferably, an attention mechanism and a gating mechanism are introduced to construct the message passing model; taking the node v of a short video and the node w of a label as an example:
Figure BDA0002962985420000033
Figure BDA0002962985420000034
where h represents the corresponding initial embedding, W represents the fully connected layer coefficients, α represents the attention coefficient, and z represents the true value.
Preferably, the method further comprises the following steps: and accelerating the convergence of the neural network model of the hierarchical graph by using a labeling task, and recommending a short video to the target user.
Preferably, accelerating convergence of the hierarchical neural network model using the tagging task comprises: describing a labeling task: given the characteristics of a short video i, obtaining a label distribution through two layers of full connection, wherein each value in the label distribution represents the possibility that a label at a corresponding position is a related label of the short video i, and the process is formulated as follows:
Figure BDA0002962985420000041
wherein $e_i$ represents the final feature of the short video $i$, the symbol in Figure BDA0002962985420000042 represents the label probability distribution of the short video $i$, $\sigma$ represents an activation function, and $\rho$ represents the softmax function;
correspondingly, the loss function is designed:
Figure BDA0002962985420000043
where $s$ denotes the set of short videos that have been labeled, $K$ denotes the size of the candidate label set, and $r_{ij}$ and the symbol in Figure BDA0002962985420000044 respectively represent the true value and the predicted value of the short video $i$ for the $j$-th label;
for the user-short video recommendation task, the correlation between a user and a short video is represented by the inner product of the user and the short video and is denoted $y_{ui}$, and the predicted value is denoted by the symbol in Figure BDA0002962985420000045; a cross-entropy loss function is introduced as
Figure BDA0002962985420000046
Figure BDA0002962985420000047
The final loss function is as follows:
Figure BDA0002962985420000048
wherein α and λ are hyperparameters.
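The loss formulas themselves are given as images that are not reproduced in this text, so the following Python sketch shows only one plausible form consistent with the surrounding definitions. It assumes ReLU for the activation σ, a cross entropy averaged over the labeled set for the tagging loss, a binary cross entropy on the sigmoid of the inner product for the recommendation loss, and a final loss in which α weights the tagging loss and λ weights an L2 penalty. The text only states that α and λ are hyperparameters, so these roles and all function names are assumptions.

```python
import math

def dense(weights, bias, vec):
    """Fully connected layer; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tags(e_i, w1, b1, w2, b2):
    """Two fully connected layers: sigma (assumed ReLU), then rho (softmax)."""
    hidden = [max(0.0, v) for v in dense(w1, b1, e_i)]
    return softmax(dense(w2, b2, hidden))

def tagging_loss(true_rows, pred_rows):
    """Assumed form: cross entropy averaged over the labeled short videos."""
    total = 0.0
    for r_i, r_hat_i in zip(true_rows, pred_rows):
        total -= sum(r * math.log(max(p, 1e-12)) for r, p in zip(r_i, r_hat_i))
    return total / len(true_rows)

def recommendation_loss(samples):
    """Assumed binary cross entropy; samples are (e_u, e_i, y_ui) triples."""
    total = 0.0
    for e_u, e_i, y_ui in samples:
        score = sum(a * b for a, b in zip(e_u, e_i))  # inner product of user and video
        y_hat = 1.0 / (1.0 + math.exp(-score))
        total -= y_ui * math.log(y_hat) + (1 - y_ui) * math.log(1.0 - y_hat)
    return total / len(samples)

def joint_loss(rec, tag, params, alpha, lam):
    """Assumed combination: rec + alpha * tag + lam * ||params||^2."""
    return rec + alpha * tag + lam * sum(p * p for p in params)

# Toy example with K = 3 candidate labels and 2-dimensional features.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]
r_hat = predict_tags([1.0, -1.0], w1, b1, w2, b2)
tag = tagging_loss([[1.0, 0.0, 0.0]], [r_hat])
rec = recommendation_loss([([1.0, 0.0], [1.0, 0.0], 1), ([1.0, 0.0], [-1.0, 0.0], 0)])
total = joint_loss(rec, tag, params=[0.5, -0.5], alpha=0.5, lam=0.01)
```

Training the tagging head jointly with the recommendation objective gives the short video embeddings an additional supervised signal, which is the mechanism by which the text expects convergence to be accelerated.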
The beneficial effects of the invention are as follows: a short video recommendation method is provided in which a multi-source heterogeneous graph is constructed by combining data features extracted from different modalities and taking the social network into account; the hierarchical graph neural network model built on the heterogeneous graph can capture the different connection relations among users, short videos and labels as well as the associations among those relations, and the customized graph neural network can learn high-level feature representations of short videos and users, giving stronger representational capability and benefiting short video recommendation.
Furthermore, the invention innovatively combines the graph learning technology and the multi-modal characteristics, makes full use of multi-source information existing in the short video, including the content of the short video, related texts, label information and the like, constructs a heterogeneous graph with rich information, and creates conditions for the application of the graph neural network.
Drawings
Fig. 1 is a schematic diagram of a short video recommendation method according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a short video recommendation method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a method for extracting the text feature in the embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating a method for obtaining the social characteristics of the target user by using an attention-based social graph neural network according to an embodiment of the present invention.
FIG. 5 is a diagram of a social graph neural network based on an attention mechanism according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Some prior-art short video recommendation methods use multi-modal information but ignore the relationships between short videos, so their recommendation quality is poor. Some do not effectively use the interaction between users and short videos and do not truly use the audio, relying only on its subtitle information, so they lack comprehensive consideration of the available information. Some use only a matrix-form interaction relation and therefore suffer an obvious cold-start problem, which is very unfriendly to new videos and new users. Some rely on click feedback, which is of little use in the short video field: users' viewing actions far outnumber their clicks, and viewing duration conveys more information than a single click. The prior art therefore lacks a short video recommendation method with good recommendation quality.
Existing recommendation methods mainly build a model on the features of a single modality, train it, and output a result. Although these methods provide preliminary recommendations, they perform poorly because they ignore other features. On the other hand, many proposals that use graph learning techniques to improve recommendation have emerged in recent years, but they simply build a social network graph, do not exploit multi-modal information, and exhibit a significant cold-start problem.
It is understood that a short video, as described below, refers to a video no longer than 5 minutes that is distributed over the internet and that includes various features such as a title, video, audio and labels.
As shown in fig. 1, the present invention provides a short video recommendation method, which includes the following steps:
S1: constructing a multi-source heterogeneous graph and extracting data features from data of different modalities;
S2: constructing a hierarchical graph neural network model and inputting the data features into the model for training; and recommending short videos to target users using the trained hierarchical graph neural network model.
Given the multi-modal nature of short videos, an ordinary homogeneous graph clearly cannot make use of all the available features, so the invention constructs a multi-source heterogeneous graph that can represent the features of different kinds of data. A simple graph neural network cannot meet the training requirements of such a graph, so the invention constructs a hierarchical graph neural network that can be trained on its different features. By fully considering the features of the different data in short videos and constructing a multi-layer graph neural network model, the method achieves efficient, high-quality short video recommendation.
The method combines data features extracted from different modalities and takes the social network into account to construct a multi-source heterogeneous graph; it further provides a hierarchical graph neural network model that makes full use of the data features to recommend short videos to users. The graph neural network built on the heterogeneous graph can capture the different connection and association relations among users, short videos and labels, and the customized hierarchical graph neural network model can learn high-level feature representations of short videos and users, giving stronger representational capability and benefiting short video recommendation. Furthermore, the invention combines graph learning with multi-modal features, making full use of the multi-source information present in short videos, including the video content, related text and label information, to construct an information-rich heterogeneous graph and create the conditions for applying graph neural networks.
The invention establishes connections among short videos and between short videos and text, enriching the interaction information in the heterogeneous graph; previous methods consider neither the connections between short videos and text labels nor the connections between short videos.
The invention performs the recommendation task in combination with the user's social network relations, which is more reasonable and feasible. Short videos have strong social attributes, and the short videos watched by related users are generally correlated; compared with prior recommendation methods that consider only the interaction between users and short videos, the method is better suited to the short video field.
Fig. 2 is a schematic flowchart of the short video recommendation method of the invention, which mainly comprises constructing the multi-source heterogeneous graph and constructing the multi-layer graph neural network model.
Before short video recommendation, the method firstly needs to construct a multi-source heterogeneous graph according to the social network and the multi-source heterogeneous network information of the user, and extracts data characteristics from data of different modalities to serve as input of a multi-layer graph neural network model.
In one embodiment of the invention, the data features include textual features, visual and auditory features, social features, and relationships between short videos. It is understood that the data features in actual use may also include tag features, duration features, etc., which may be added or deleted as the case may be.
The extraction method of each feature will be described separately below.
As shown in fig. 3, extracting the text feature includes:
segmenting words of the title and the brief introduction of the short video to obtain segmented words;
screening the words after word segmentation by using a common stop word bank to obtain words after word segmentation and screening;
and the words after word segmentation screening and original labels provided by the short video author and/or the platform of the short video form text characteristic labels of the short video.
Specifically, the title and brief introduction of a short video usually consist of sentences, so they must first be segmented into words. To avoid interference from function words such as 'the', the scheme filters the segmented words against a common stop-word list. The filtered words, together with a small number of original labels, form the labels (tags) of each video. The small number of original labels are the classification labels originally assigned by the video author and/or the platform.
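The text-feature steps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the stop-word set `STOP_WORDS`, the regex tokenizer and the function `build_text_tags` are hypothetical, and a dedicated word segmenter would be needed for languages such as Chinese that do not separate words with spaces.

```python
import re

# Hypothetical stop-word list; the text only refers to "a common stop word bank".
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (stand-in for a real segmenter)."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def build_text_tags(title, intro, original_tags):
    """Return the de-duplicated tags of one short video, preserving first-seen order."""
    words = tokenize(title) + tokenize(intro)
    kept = [w for w in words if w not in STOP_WORDS]
    tags = []
    for tag in kept + [t.lower() for t in original_tags]:
        if tag not in tags:
            tags.append(tag)
    return tags

tags = build_text_tags("The cat and the piano", "A cat plays piano for fun", ["Pets", "music"])
print(tags)
```

Each resulting tag would then become a label node in the graph, connected to the short video it was derived from.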
To make better use of semantic information, the invention uses a pre-trained BERT model to obtain a word vector for each word; the word vector contains the semantic information of the word. Because the pre-trained BERT model is used to extract semantic features, the resulting word vectors are usually 768-dimensional. To speed up subsequent model updates, PCA can be used to reduce the embedding of each word to 64 (or 128) dimensions; the result is called the semantic embedding vector. The scheme can also train a learnable ID embedding for each label and concatenate it with the semantic embedding vector, so that the label's feature vector is better suited to the recommendation task of the invention. The embedding dimensions are not limited to fixed values and should be adjusted according to the requirements of the specific task.
Second, regarding visual and auditory features: a short video typically contains a sequence of images and sound, and the invention extracts features from each separately. For images, convolutional neural networks are widely used; 3D CNNs, for example, are often used to extract video features. The scheme follows an approach similar to YouTube DNN, using the Inception V3 model provided by Google to extract features: one frame is sampled per second, a 2048-dimensional vector is extracted from each frame, and principal component analysis (PCA) is used for dimensionality reduction. Audio is particularly important because, in recent years, certain classic song clips have suddenly gone viral on short video platforms, and short videos whose soundtracks use the same clip are largely alike. The scheme therefore uses VGGish to extract auditory features and reduces them to 128 dimensions with PCA.
Furthermore, regarding social features, the invention adds an attention network to the aggregation of a user's social information to learn how strongly each of the user's friends influences the aggregation. The friends are weighted by their degree of influence to obtain the user's new aggregated social vector. In one embodiment of the invention, the friend users of a user may be provided by the platform.
As shown in fig. 4, acquiring the social characteristics of the target user by using an attention-based social graph neural network specifically includes:
projecting friend users of the target user into a new item space to distinguish the friend users from the target user;
learning an influence weight of each of the friend users using the attention-based social graph neural network;
weighting all the friend users according to the learned influence weight, and mapping the weighted friend users to a social space to obtain a social vector of the target user;
and splicing the social vector and the initial embedded vector of the target user to obtain the social characteristic.
FIG. 5 shows the attention-based social graph neural network (Social Attention Graph Neural Network) of the invention. To obtain the feature vector of a user in the social network, the invention first projects the vectors of the user's friends into a new item space (Item-space) to distinguish the friends from the target user, and then learns the influence weight of each friend using an attention mechanism. Finally, all friend vectors are weighted by the learned weights and mapped to a social space (Social-space) to obtain the user's social vector. The social vector is then concatenated with the user's initial embedding vector (ID Embedding) to obtain the user's final initial feature vector.
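The four steps of the social aggregation can be sketched in plain Python as below. The text does not specify the attention scoring function, so the dot product between the target user and each projected friend, followed by a softmax, is an assumption; the matrices `w_item` and `w_social` stand in for the learned projections into the item space and the social space, and all names are hypothetical.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def social_feature(user_vec, friend_vecs, w_item, w_social):
    # Step 1: project each friend vector into the item space.
    projected = [matvec(w_item, f) for f in friend_vecs]
    # Step 2: influence weights (assumed scoring: dot product with the target user).
    scores = [sum(u * p for u, p in zip(user_vec, proj)) for proj in projected]
    weights = softmax(scores)
    # Step 3: weighted sum of friends, then map into the social space.
    dim = len(projected[0])
    pooled = [sum(wt * proj[k] for wt, proj in zip(weights, projected)) for k in range(dim)]
    social_vec = matvec(w_social, pooled)
    # Step 4: concatenate the initial user embedding with the social vector.
    return user_vec + social_vec, weights

identity = [[1.0, 0.0], [0.0, 1.0]]
user = [1.0, 0.0]
friends = [[1.0, 0.0], [0.0, 1.0]]
feature, weights = social_feature(user, friends, identity, identity)
```

In this toy run the first friend, whose vector matches the target user, receives the larger influence weight, so the social vector leans toward that friend.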
Finally, the relationships between short videos. Generally speaking, short videos produced by the same author (or studio) are similar in style and audience. The invention therefore establishes relationships between short videos, specifically between short videos of the same author; the relationship describes how closely two short videos of the same author are related. In one embodiment, the invention computes a pointwise mutual information (PMI) value for short videos of the same author (or studio) to represent the closeness of the relationship between two videos; when the PMI value of two short videos exceeds a certain threshold, the two short video nodes are connected in the graph. The closeness PMI(x, y) between two short videos is calculated as follows:
Figure BDA0002962985420000091
where #(x, y) denotes the number of times short videos x and y are watched consecutively in swipe-to-play mode, that is, the user finishes watching short video x, swipes onward, and then finishes watching short video y of the same author; #(x) denotes the number of times short video x has been viewed, and #(y) denotes the number of times short video y has been viewed.
When PMI(x, y) is greater than a preset threshold, the nodes of the two short videos are connected in the multi-source heterogeneous graph.
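The thresholding step can be illustrated with a short Python sketch. The PMI formula itself appears only as an image in the source, so the sketch assumes the standard count-based form log(#(x, y) · N / (#(x) · #(y))), where the normalising constant N (`total`, the overall number of viewing events) is an assumption, as are the function names and the toy data.

```python
import math

def pmi(co_count, count_x, count_y, total):
    """Count-based PMI; `total` is an assumed normalising constant."""
    if co_count == 0 or count_x == 0 or count_y == 0:
        return float("-inf")  # never co-viewed: no evidence of a relationship
    return math.log((co_count * total) / (count_x * count_y))

def build_video_edges(view_counts, co_view_counts, authors, total, threshold):
    """Connect same-author videos whose PMI exceeds the threshold."""
    edges = []
    for (x, y), both in co_view_counts.items():
        if authors[x] != authors[y]:
            continue  # the relationship is only defined within one author
        if pmi(both, view_counts[x], view_counts[y], total) > threshold:
            edges.append((x, y))
    return edges

# Hypothetical counts for three videos by the same author.
view_counts = {"a": 20, "b": 20, "c": 50}
co_view_counts = {("a", "b"): 10, ("a", "c"): 2}
authors = {"a": "author_1", "b": "author_1", "c": "author_1"}
edges = build_video_edges(view_counts, co_view_counts, authors,
                          total=100, threshold=0.0)
```

With these counts, videos a and b are co-viewed more often than chance would predict and are connected, while a and c are not.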
As shown in fig. 2, the multi-source heterogeneous graph mainly involves three types of nodes: users (User), short videos (Item), and tags (Tag). The invention therefore computes an initial embedding for each of the three node types.
For a User, the embedding of the user's ID is typically used as the feature. If more precise user-profile features or social relationships are available, the invention can use them as attribute features and combine them with the user ID for feature extraction (for example, the social vector described above). Similarly, the initial embedding of an Item is the concatenation of its ID embedding with the visual and auditory features mentioned in the first part, and the initial embedding of a Tag is its semantic feature.
To extract the structural features of the User-Item-Tag heterogeneous graph, the invention builds a hierarchical graph neural network model on top of a message-passing graph convolution network.
Message passing among homogeneous nodes. To extract the structural features between nodes, the scheme takes message passing between Items as an example; message passing for short video nodes is performed as follows:
$m_{i' \to i} = \sigma\left(W\left(h_{i'} \odot h_i\right)\right)$
Figure BDA0002962985420000101
where h denotes the initial embedding, m denotes the intermediate vector, and $h^*$ denotes the final embedding.
It is understood that message passing among homogeneous nodes is not limited to Item nodes; it may also be applied to User and Tag nodes.
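The message formula for homogeneous nodes can be made concrete with a small Python sketch. The message function follows the formula stated in the text, σ(W(h_{i'} ⊙ h_i)); the choice of a sigmoid for σ and the aggregation rule (initial embedding plus the mean of incoming messages) are assumptions, because the aggregation formula is available only as an image in the source.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def message(W, h_src, h_dst):
    """m_{i'->i} = sigma(W (h_{i'} * h_i)), with * the elementwise product."""
    prod = [a * b for a, b in zip(h_src, h_dst)]
    return [sigmoid(sum(w * p for w, p in zip(row, prod))) for row in W]

def update(W, h, neighbors, i):
    """Assumed aggregation: initial embedding plus the mean incoming message."""
    msgs = [message(W, h[j], h[i]) for j in neighbors[i]]
    if not msgs:
        return list(h[i])  # isolated node keeps its initial embedding
    mean = [sum(col) / len(msgs) for col in zip(*msgs)]
    return [a + b for a, b in zip(h[i], mean)]

# Two connected short-video nodes with hypothetical 2-dimensional embeddings.
h = {"a": [1.0, 0.0], "c": [1.0, 1.0]}
neighbors = {"a": ["c"], "c": ["a"]}
W = [[1.0, 0.0], [0.0, 1.0]]
h_star_a = update(W, h, neighbors, "a")
```

The same routine applies unchanged to User or Tag nodes, since it depends only on the embeddings and the neighbor lists.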
Message passing between heterogeneous nodes, that is, structural feature extraction between nodes across layers, such as user-item and item-tag. Here the invention introduces an attention mechanism and a gating mechanism to construct the message-passing model. Taking node v of a short video and node w of a tag as an example:
Figure BDA0002962985420000102
Figure BDA0002962985420000103
where h denotes the corresponding initial embedding, W denotes the weights of the fully connected layer, α denotes the attention coefficient, and z denotes the true value.
On the basis of the above, the method of the invention further comprises:
accelerating the convergence of the hierarchical graph neural network model by using a tagging task, and recommending short videos to the target user.
To make the model converge quickly and improve the recommendation effect, the invention further introduces an auxiliary tagging task, which can greatly expand the small number of tags attached to a short video and enrich the tag information, thereby improving recommendation.
The invention introduces multi-task learning so that the tasks reinforce one another. In particular, the tagging task enriches the text information of short videos and helps improve recommendation accuracy. Short videos generally start with few tags; in this method, automatically tagging short videos and improving recommendation accuracy complement each other, so each task benefits the other.
For the recommendation task, the interactions between User and Item are the most valuable information in personalized recommendation and the core factor of the task. On the other hand, a short video may have only a few tags, yet these are an important text feature of the short video. The invention therefore introduces tagging as an auxiliary task to improve the representation of User and Item features. Previous approaches have shown that these two tasks complement and promote each other.
Specifically, accelerating the convergence of the hierarchical graph neural network model by using the tagging task comprises:
describing the tagging task: given the features of a short video i, a tag distribution is obtained through two fully connected layers, where each value in the distribution represents the probability that the tag at the corresponding position is a relevant tag of short video i; the process is formulated as follows:
Figure BDA0002962985420000111
where $e_i$ represents the final feature of short video i, and
Figure BDA0002962985420000112
represents the tag probability distribution of short video i, σ represents an activation function, and ρ represents the softmax function;
correspondingly, the loss function is designed as:
Figure BDA0002962985420000113
where S denotes the set of short videos that have already been tagged, K denotes the size of the candidate tag set, and $r_{ij}$ and
Figure BDA0002962985420000114
respectively represent the true value and the predicted value of short video i for the j-th tag.
It can be understood that the tagging task in the invention is not limited to short videos and can be adjusted according to the actual application.
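The tagging head described above (two fully connected layers followed by softmax, trained against the known tags) can be sketched as follows. The formulas are images in the source, so the concrete choices here are assumptions: ReLU as the activation σ, no bias terms, and a cross-entropy loss summed over the tagged videos and the K candidate tags. The names `predict_tags` and `tagging_loss` are hypothetical.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tags(e_i, W1, W2):
    """Tag distribution rho(W2 * sigma(W1 * e_i)); sigma assumed to be ReLU."""
    hidden = [max(0.0, x) for x in matvec(W1, e_i)]
    return softmax(matvec(W2, hidden))

def tagging_loss(features, true_tags, W1, W2):
    """Assumed cross-entropy over tagged videos and K candidate tags."""
    loss = 0.0
    for video_id, e_i in features.items():
        probs = predict_tags(e_i, W1, W2)
        for r_ij, p_ij in zip(true_tags[video_id], probs):
            loss -= r_ij * math.log(p_ij + 1e-12)  # epsilon avoids log(0)
    return loss

# One tagged video, a 2-dimensional feature, and K = 3 candidate tags.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
features = {"i": [1.0, 2.0]}
true_tags = {"i": [0.0, 1.0, 0.0]}
probs = predict_tags(features["i"], W1, W2)
loss = tagging_loss(features, true_tags, W1, W2)
```

Each entry of `probs` plays the role of the predicted value for one candidate tag, and the loss shrinks as the probability assigned to the true tag grows.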
For the user-short video recommendation task, the correlation between a user and a short video is represented by the inner product of the user and the short video and is denoted $y_{ui}$, and the predicted value is represented by
Figure BDA0002962985420000115
A cross-entropy loss function is introduced as
Figure BDA0002962985420000116
Figure BDA0002962985420000117
The final loss function is as follows:
Figure BDA0002962985420000118
where α and λ are hyperparameters.
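The way the two tasks are combined can be illustrated with a hedged Python sketch. The loss formulas are images in the source, so the forms used here are assumptions: the prediction is a sigmoid of the user-video inner product, the recommendation loss is binary cross-entropy, and the final loss adds the tagging loss weighted by α and an L2 penalty on the parameters weighted by λ. The function names and toy embeddings are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recommendation_loss(user_embs, video_embs, interactions, eps=1e-12):
    """Binary cross-entropy on sigmoid(inner product); items are (u, i, y_ui)."""
    loss = 0.0
    for u, i, y in interactions:
        score = sum(a * b for a, b in zip(user_embs[u], video_embs[i]))
        y_hat = sigmoid(score)
        loss -= y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps)
    return loss

def total_loss(rec_loss, tag_loss, params, alpha, lam):
    """Assumed combination: L = L_rec + alpha * L_tag + lam * ||params||^2."""
    l2 = sum(p * p for p in params)
    return rec_loss + alpha * tag_loss + lam * l2

# One user, one watched video (label 1) and one unwatched video (label 0).
user_embs = {"u": [1.0, 0.0]}
video_embs = {"a": [2.0, 0.0], "b": [-2.0, 0.0]}
interactions = [("u", "a", 1), ("u", "b", 0)]
rec = recommendation_loss(user_embs, video_embs, interactions)
final = total_loss(rec, tag_loss=1.0, params=[1.0, 2.0], alpha=0.5, lam=0.1)
```

Raising α gives the auxiliary tagging task more influence on the shared embeddings, while λ controls how strongly large parameter values are penalised.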
In a specific embodiment, a short video A can be watched by many users and a user can watch many short videos, so a user-short video graph can be constructed from the viewing relationships between short videos and users. Similarly, the relationships between short videos and tags can be represented as a graph: for example, if the tags of A are "funny" and "talk show", a short video-tag graph can be constructed. Users also have friend relationships with one another; when a user likes A, that user's friends are likely to like A as well. Integrating the users' social relationships yields a heterogeneous graph. Compared with an ordinary user-short video graph, the heterogeneous graph contains social relationships and tag information, both of which can improve accuracy.
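As a minimal sketch of this construction, the following Python builds one adjacency structure from the three kinds of relationship: user-video views, video-tag assignments, and user-user friendships. Typing each node as a `(node_type, node_id)` pair and storing undirected edges in a dictionary of sets are illustrative assumptions; the source does not prescribe a data structure.

```python
from collections import defaultdict

def build_hetero_graph(views, video_tags, friendships):
    """Return adjacency sets keyed by (node_type, node_id)."""
    adj = defaultdict(set)

    def connect(a, b):
        # Edges are stored in both directions (undirected graph).
        adj[a].add(b)
        adj[b].add(a)

    for user, video in views:
        connect(("user", user), ("video", video))
    for video, tag in video_tags:
        connect(("video", video), ("tag", tag))
    for user_1, user_2 in friendships:
        connect(("user", user_1), ("user", user_2))
    return adj

# Short video A watched by two users who are friends, with one tag.
graph = build_hetero_graph(
    views=[("u1", "A"), ("u2", "A")],
    video_tags=[("A", "talk show")],
    friendships=[("u1", "u2")],
)
```

Dropping the `friendships` and `video_tags` arguments would leave the plain user-short video graph, which makes the extra information carried by the heterogeneous graph easy to see.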
In an ordinary graph neural network, a short video node is represented only by an initial ID embedding vector. In the invention, the visual and auditory content of a short video is fed into the corresponding models to obtain visual and auditory feature vectors, which are concatenated with the initial ID embedding vector to jointly represent the short video node, giving the node a more accurate and richer vector representation.
In an ordinary graph neural network, a user node is represented only by an initial ID embedding vector. In the invention, the user's social relationships are fed into the attention-based social graph neural network to obtain a social feature vector, which is concatenated with the initial ID embedding vector to jointly represent the user node, so that the node's vector representation incorporates the social relationships.
The text of a short video is segmented into words and fed into a pretrained BERT model to obtain the corresponding semantic feature vector, which is concatenated with the initial ID embedding vector to jointly represent a tag node.
In an ordinary graph neural network, short video nodes are not connected to one another. The invention uses the PMI value to connect short videos by the same author: when a user likes A, the user is also likely to like B released by the same author, and these connections enrich the message-passing paths of short video nodes.
In an ordinary graph neural network, messages are passed only between user nodes and short video nodes. The invention additionally establishes message passing among homogeneous nodes, for example between users and between short videos.
Because word segmentation of short video text yields few tags, the invention introduces an auxiliary tagging task to enrich the tags and thereby improve the representation of users and short videos. Previous methods have shown that this improves recommendation accuracy.
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in the embodiments of the invention are intended to include, without being limited to, these and any other suitable types of memory.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided herein may be combined in any combination to arrive at a new method or apparatus embodiment without conflict.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all the properties or uses are considered to be within the scope of the invention.

Claims (8)

1. A short video recommendation method is characterized by comprising the following steps:
S1: constructing a multi-source heterogeneous graph, and extracting data features from data of different modalities;
S2: constructing a hierarchical graph neural network model, and inputting the data features into the hierarchical graph neural network model for training; recommending short videos to a target user by using the trained hierarchical graph neural network model;
further comprising:
accelerating the convergence of the hierarchical graph neural network model by using a tagging task, and recommending short videos to the target user;
wherein accelerating the convergence of the hierarchical graph neural network model by using the tagging task comprises:
describing the tagging task: given the features of a short video i, a tag distribution is obtained through two fully connected layers, where each value in the distribution represents the probability that the tag at the corresponding position is a relevant tag of short video i; the process is formulated as follows:
Figure FDA0003738096730000011
where $e_i$ represents the final feature of short video i, and
Figure FDA0003738096730000012
represents the tag probability distribution of short video i, σ represents an activation function, and ρ represents the softmax function;
correspondingly, the loss function is designed:
Figure FDA0003738096730000013
where S denotes the set of short videos that have already been tagged, K denotes the size of the candidate tag set, and $r_{ij}$ and
Figure FDA0003738096730000014
respectively represent the true value and the predicted value of short video i for the j-th tag;
for the user-short video recommendation task, the correlation between a user and a short video is represented by the inner product of the user and the short video and is denoted $y_{ui}$, and the predicted value is represented by
Figure FDA0003738096730000015
a cross-entropy loss function is introduced as
Figure FDA0003738096730000016
Figure FDA0003738096730000017
where σ represents an activation function;
the final loss function is as follows:
Figure FDA0003738096730000021
where α and λ are hyperparameters.
2. The short video recommendation method of claim 1, wherein a multi-source heterogeneous graph is constructed according to the social network and multi-source heterogeneous network information of the target user to extract the data features;
the data features include textual features, visual and auditory features, social features, and relationships between short videos.
3. The short video recommendation method of claim 2, wherein extracting the textual features comprises:
segmenting words from the title and the brief introduction of the short video to obtain segmented words;
screening the words after word segmentation by using a common stop word bank to obtain words after word segmentation and screening;
and the words after word segmentation screening and original labels provided by the short video author and/or the platform of the short video form text characteristic labels of the short video.
4. The short video recommendation method of claim 2, wherein the visual features are extracted using Inception V3 and reduced in dimensionality using principal component analysis;
extracting the auditory features by using VGGish and reducing the dimension by adopting a principal component analysis method.
5. The short video recommendation method of claim 2, wherein obtaining the social characteristics of the target user using an attention-based social graph neural network comprises:
projecting friend users of the target user into a new item space to distinguish the friend users from the target user;
learning an influence weight of each of the friend users using the attention-based social graph neural network;
weighting all the friend users according to the learned influence weight, and mapping the weighted friend users to a social space to obtain a social vector of the target user;
and splicing the social vector and the initial embedded vector of the target user to obtain the social characteristic.
6. The short video recommendation method of claim 2, wherein said relationship between short videos is a degree of closeness of relationship between short videos of a same author, and a degree of closeness of relationship PMI (x, y) between two of said short videos is calculated by:
Figure FDA0003738096730000022
where #(x, y) denotes the number of times short videos x and y are watched consecutively in swipe-to-play mode, that is, the user finishes watching short video x, swipes onward, and then finishes watching short video y of the same author; #(x) denotes the number of times short video x has been viewed, and #(y) denotes the number of times short video y has been viewed;
and when PMI(x, y) is greater than a preset threshold, the nodes of the two short videos are connected in the multi-source heterogeneous graph.
7. The short video recommendation method of claim 6, wherein said hierarchical graph neural network model is built based on a message-passing graph convolution network;
the multi-source heterogeneous graph comprises three types of heterogeneous nodes including a user, a short video and a label; the message transmission of the node of the short video adopts the following modes:
$m_{i' \to i} = \sigma\left(W\left(h_{i'} \odot h_i\right)\right)$
Figure FDA0003738096730000031
where h denotes the initial embedding, m denotes the intermediate vector, $h^*$ denotes the final embedding, σ denotes the activation function, W denotes the weights of the fully connected layer, and $h_i$ denotes the initial embedding of short video i.
8. The short video recommendation method of claim 7, wherein an attention mechanism and a gating mechanism are introduced to construct a message-passing model; taking node v of a short video and node w of a tag as an example:
Figure FDA0003738096730000032
Figure FDA0003738096730000033
where h denotes the corresponding initial embedding, W denotes the weights of the fully connected layer, α denotes the attention coefficient, and z denotes the true value.
CN202110242999.0A 2021-03-05 2021-03-05 Short video recommendation method Active CN112948708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110242999.0A CN112948708B (en) 2021-03-05 2021-03-05 Short video recommendation method

Publications (2)

Publication Number Publication Date
CN112948708A CN112948708A (en) 2021-06-11
CN112948708B true CN112948708B (en) 2022-08-12

Family

ID=76247774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110242999.0A Active CN112948708B (en) 2021-03-05 2021-03-05 Short video recommendation method

Country Status (1)

Country Link
CN (1) CN112948708B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114710322B (en) * 2022-03-15 2023-06-20 清华大学 Flow interaction graph-based method and device for detecting hidden malicious flow
CN114637888B (en) * 2022-05-18 2022-08-02 深圳市华曦达科技股份有限公司 Video pushing method and device
CN114866845B (en) * 2022-07-05 2022-09-23 长沙美哒网络科技有限公司 Information detection method and system based on short video release
CN115374369B (en) * 2022-10-20 2023-04-07 暨南大学 News diversity recommendation method and device based on graph neural network
CN116246214B (en) * 2023-05-08 2023-08-11 浪潮电子信息产业股份有限公司 Audio-visual event positioning method, model training method, device, equipment and medium
CN116932887A (en) * 2023-06-07 2023-10-24 哈尔滨工业大学(威海) Image recommendation system and method based on multi-modal image convolution
CN116996708B (en) * 2023-08-10 2024-02-09 广州阿凡提电子科技有限公司 Short video data tag recommendation method and system based on machine learning and cloud platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN111241311A (en) * 2020-01-09 2020-06-05 腾讯科技(深圳)有限公司 Media information recommendation method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090006368A1 (en) * 2007-06-29 2009-01-01 Microsoft Corporation Automatic Video Recommendation
US10467308B2 (en) * 2016-10-27 2019-11-05 Conduent Business Services, Llc Method and system for processing social media data for content recommendation
CN110807369B (en) * 2019-10-09 2024-02-20 南京航空航天大学 Short video content intelligent classification method based on deep learning and attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Graph Neural Networks for Social Recommendation;Wenqi Fan et al;《2019 IW3C2(International World Wide Web Conference Committee)》;20190517;第417-426页 *
Item Tagging for Information Retrieval: A Tripartite Graph Neural Network based Approach;Kelong Mao et al;《Proceedings of 43rd international ACM SIGIR Conference on Research and development in information Retrieval》;20200826;第1-10页 *

Also Published As

Publication number Publication date
CN112948708A (en) 2021-06-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant