CN111897996B - Topic label recommendation method, device, equipment and storage medium

Info

Publication number
CN111897996B
Authority
CN
China
Prior art keywords
video
image
user
topic
frame
Legal status
Active
Application number
CN202010797673.XA
Other languages
Chinese (zh)
Other versions
CN111897996A (en)
Inventor
吴翔宇
杨帆
王思博
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010797673.XA
Publication of CN111897996A
Application granted
Publication of CN111897996B

Classifications

    • G06F16/735 Information retrieval of video data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/7844 Retrieval of video data characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7867 Retrieval of video data characterised by metadata using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Abstract

The disclosure relates to a topic label recommendation method, device, equipment and storage medium, and belongs to the technical field of multimedia. Embodiments of the present disclosure provide a method of recommending a topic label based on features of a video in multiple modalities: a machine automatically generates the topic label from images in the video and from user features of the video producer, and recommends the topic label to the user. Because the recommended topic label matches the content of the video and also reflects information about the video producer, the match between the topic label and the video is well ensured, the accuracy of the topic label is improved, and the recommended topic label is closer to the user's intention.

Description

Topic label recommendation method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of multimedia, and in particular relates to a topic label recommendation method, device, equipment and storage medium.
Background
A topic label (hashtag) of a video is a piece of text prefixed with "#" that a user adds during the video production stage to describe the video content. Topic labels play an important role in content identification, aggregation, distribution and recommendation of videos.
In the related art, a topic label is added as follows: after shooting a video, the user clicks the topic option on the publication page, and the video client displays a search box. Once the user has thought of a topic label that matches the video, the user types the topic label into the search box and confirms the input, so that the topic label is added when the video is published.
With this approach, the user must manually determine a topic label that matches the video. Manual determination is subjective, however, so it is difficult to ensure that the determined topic label matches the video; the accuracy of topic labels is therefore poor, which in turn affects the accuracy of processes such as content identification, distribution and recommendation that are performed on the video according to its labels.
Disclosure of Invention
The disclosure provides a topic label recommendation method, device, equipment and storage medium, which can improve the accuracy of topic labels. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a topic tag recommendation method, including obtaining a video;
extracting at least one frame of image from the video;
acquiring user characteristics corresponding to the user account according to the user account uploading the video;
generating a topic label matched with the video according to the at least one frame of image and the user characteristics;
recommending the topic label to the user account.
Optionally, the generating a topic tag matched with the video according to the at least one frame of image and the user characteristic includes:
respectively extracting the characteristics of at least one frame of image to obtain the image characteristics of each frame of image;
fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fusion characteristics;
determining the probability of a plurality of candidate labels according to the fusion characteristics;
the topic tag is determined among the plurality of candidate tags according to the probability of each candidate tag.
Optionally, the fusing the image features of the at least one frame of image with the user features to obtain fused features includes:
and respectively carrying out self-attention operation on the image characteristics of the at least one frame of image and the user characteristics through a multi-head attention network to obtain the fusion characteristics.
Optionally, the feature extracting the at least one frame of image to obtain an image feature of each frame of image includes:
carrying out convolution processing on the at least one frame of image in parallel through at least one classification network, and outputting the image characteristics of each frame of image.
Optionally, the obtaining, according to the user account for uploading the video, a user feature corresponding to the user account includes:
acquiring user attributes corresponding to the user account;
and performing linear mapping and nonlinear mapping on the user attributes to obtain the user characteristics.
Optionally, the obtaining, according to the user account for uploading the video, a user feature corresponding to the user account includes:
acquiring historical behavior data corresponding to the user account;
and performing linear mapping and nonlinear mapping on the historical behavior data to obtain the user characteristics.
Optionally, after the obtaining of the video, the method further includes: obtaining geographic position information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises the following steps: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the geographic position information.
Optionally, after the obtaining of the video, the method further includes: acquiring time information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises the following steps: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the time information.
According to a second aspect of the embodiments of the present disclosure, there is provided a topic tag recommendation device, including:
a first acquisition unit configured to perform acquisition of a video;
an extraction unit configured to perform extraction of at least one frame image from the video;
a second acquisition unit configured to acquire, according to the user account uploading the video, user characteristics corresponding to the user account;
a generation unit configured to perform generation of a topic tag matching the video from the at least one frame of image and the user feature;
and the recommending unit is configured to execute the recommendation of the topic label to the user account.
Optionally, the generating unit is configured to perform feature extraction on the at least one frame of image respectively to obtain an image feature of each frame of image; fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fusion characteristics; determining the probability of a plurality of candidate labels according to the fusion characteristics; the topic tag is determined among the plurality of candidate tags according to the probability of each candidate tag.
Optionally, the generating unit is configured to perform self-attention operation on the image feature of the at least one frame of image and the user feature through a multi-head attention network, so as to obtain the fusion feature.
Optionally, the generating unit is configured to perform convolution processing on the at least one frame of image through at least one classification network in parallel, and output an image feature of each frame of image.
Optionally, the second obtaining unit is configured to obtain a user attribute corresponding to the user account; and performing linear mapping and nonlinear mapping on the user attributes to obtain the user characteristics.
Optionally, the second obtaining unit is configured to obtain historical behavior data corresponding to the user account; and performing linear mapping and nonlinear mapping on the historical behavior data to obtain the user characteristics.
Optionally, the apparatus further comprises: a third acquisition unit configured to perform acquisition of geographic position information corresponding to the video;
the generation unit is configured to generate a topic tag matching the video according to the at least one frame of image, the user feature and the geographical location information.
Optionally, the apparatus further comprises: a fourth acquisition unit configured to perform acquisition of time information corresponding to the video;
the generating unit is configured to generate a topic tag matched with the video according to the at least one frame of image, the user characteristic and the time information.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
one or more processors;
one or more memories for storing the processor-executable program code;
wherein the one or more processors are configured to execute the program code to implement the above-described topic tag recommendation method.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the above-described topic label recommendation method.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising one or more program codes which, when executed by a processor of an electronic device, enable the electronic device to perform the above-described topic label recommendation method.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
embodiments of the present disclosure provide a method of recommending a topic label based on features of the video in multiple modalities: a machine automatically generates the topic label from images in the video and from user characteristics of the video producer, and recommends the topic label to the user. On the one hand, features of the video in the image modality are used when generating the topic label, so the machine can understand the content of the video from a visual perspective according to the image-modality features, which ensures that the topic label matches the content of the video. On the other hand, features of the video in the user modality are used when generating the topic label, so the machine can learn information about the video producer according to the user-modality features, which ensures that the topic label reflects information about the video producer. Because the recommended topic label matches the content of the video and reflects information about the video producer, the match between the topic label and the video is well ensured, the accuracy of the topic label is improved, the recommended topic label is closer to the user's intention, and the accuracy of processes such as content identification, video distribution and video recommendation performed on the video according to its labels is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a block diagram of an implementation environment of a topic tag recommendation method, shown in accordance with an exemplary embodiment;
FIG. 2 is a block diagram illustrating a multimodal model for topic tag recommendation, according to an example embodiment;
FIG. 3 is a flowchart illustrating a method of topic tag recommendation, according to an example embodiment;
FIG. 4 is a flowchart illustrating a method of topic tag recommendation, according to an example embodiment;
FIG. 5 is a block diagram of a topic tag recommendation device, shown in accordance with an exemplary embodiment;
FIG. 6 is a block diagram of a terminal shown in accordance with an exemplary embodiment;
fig. 7 is a block diagram of a server, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The topic label recommendation method provided by the embodiments of the present application can be applied to the scenario of adding a topic label when a user publishes a work. For example, when a user shoots a short video and publishes it on a platform, the server can automatically generate a matching topic label for the short video using the topic label recommendation method provided by this embodiment and deliver it to the publication page, thereby offering the user a function of selecting the topic label. Next, topic labels and the related scenarios are briefly described.
A topic tag (HashTag) is a textual description of a video. For example, a topic tag is a piece of text prefixed with "#" that a user adds during video production when describing the video content. For example, topic labels include "# dance", "# fun", and the like. In the short-video publishing scenario, the platform can perform content aggregation and content distribution on a short video based on the topic labels associated with it, so accurate topic labels play a significantly positive role in the understanding, aggregation and distribution of short videos.
The topic tag generation process includes two possible implementations.
In one possible implementation, the short-video topic tag is typed in actively by the user. This reflects the user's real intention, but because the user has to manually determine the topic label, which is difficult, user participation is low. Sampling short videos on the platform shows that only about 15% of short videos contain a topic label. The volume of works currently tagged with topic labels is therefore relatively small, the exposure and usage of topic labels are insufficient, the utilization of the topic label service is affected, and the advantage of topic labels for content operation is far from being maximized.
In another possible implementation, the visual understanding is performed by extracting image frames from the video, so as to realize the understanding of the video content, and the short video topic label is obtained according to the understood video content. This approach places high demands on both the timeliness and accuracy of understanding the video content.
In view of the above needs of topic tag technology, some embodiments of the present application provide a topic tag recommendation technique based on multimodal learning: on the basis of visual understanding using images, a set of user features is maintained online, and a multimodal model is trained on image features and user features to infer topic tags that are closer to the author's actual intention. On the one hand, this spares the user from manually typing a topic label by letting the user select one instead. On the other hand, it can directly increase the exposure and usage of topic labels.
Fig. 1 is a block diagram illustrating an implementation environment of a topic tag recommendation method, according to an example embodiment. The implementation environment comprises: a terminal 101 and a video distribution platform 110.
The terminal 101 is connected to the video distribution platform 110 through a wireless network or a wired network. The terminal 101 may be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player and a laptop portable computer. The terminal 101 installs and runs an application supporting the distribution of video. The application may be a live application, a multimedia application, a short video application, etc. The terminal 101 is an exemplary terminal used by a user, and a user account is logged into an application running in the terminal 101.
The video distribution platform 110 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The video distribution platform 110 is used to provide background services for applications that support video distribution, tag addition, or video playback functions. Alternatively, the video distribution platform 110 and the terminal 101 may cooperate in performing the method embodiments described below. For example, the video distribution platform 110 takes on primary work and the terminal 101 takes on secondary work; alternatively, the video distribution platform 110 takes on secondary work and the terminal 101 takes on primary work; alternatively, the video distribution platform 110 or the terminal 101, respectively, may take on the work separately.
Optionally, the video distribution platform 110 includes: an access server, a service server 1101 and a database 1102. The access server is used for providing access services for the terminal 101. The service server 1101 is used to provide background services related to recommending topic tags, such as training multimodal models, extracting user features, acquiring sample videos, and so forth. There may be one or more service servers 1101. When there are multiple service servers 1101, at least two service servers 1101 provide different services, and/or at least two service servers 1101 provide the same service, for example in a load-balanced manner, which is not limited by the embodiments of the present disclosure. Database 1102 may be used to store videos, sample videos, or other data related to the method embodiments described below, and database 1102 may provide the stored data to the terminal 101 and the service server 1101 when needed.
The terminal 101 may refer broadly to one of a plurality of terminals, and the present embodiment is illustrated only with the terminal 101.
Those skilled in the art will appreciate that the number of terminals 101 may be greater or lesser. For example, the number of terminals 101 may be only one, or the number of terminals 101 may be tens or hundreds, or more, where the implementation environment includes other terminals. The embodiment of the present disclosure does not limit the number of terminals and the type of devices.
The following describes the model used for the topic label recommendation task in this embodiment.
Referring to fig. 2, fig. 2 shows an architecture diagram of a model implementing the topic tag recommendation function. The multimodal model 200 of fig. 2 is used to perform the topic tag recommendation task. The multimodal model 200 is, for example, a machine learning model. The multimodal model 200 is, for example, a deep learning model. The multimodal model 200 extracts features of the video in a plurality of modalities and generates topic tags matching the video according to the features of the plurality of modalities. Modalities refer to sources or forms of data, and different modalities of the same data can characterize the data from different aspects. Modalities of video include, but are not limited to, at least one of images in video, users of video (e.g., producer or consumer), geographic location, time, audio, text, semantics.
The input parameters of the multimodal model 200 include at least one frame of images, user characteristics, and other characteristics.
The at least one frame of image includes, for example, two frames of images extracted from the video. Each frame of image is also called one-way image input.
For example, referring to fig. 2, image 1 (image#1), image 2 (image#2), and image n (image#n) in fig. 2 are examples of multi-frame images in the input parameters. The ellipses (…) represent other images that are not shown in fig. 2 and that may also serve as input parameters. The present embodiment does not limit how many frames of images the multimodal model 200 supports to be input. For example, the number of images input by the multimodal model 200 is configured according to the demand for computational effort or the demand for time delay.
A user-feature is a feature of a producer of the video, for example, a feature of an author of the video. The user feature is, for example, a feature vector. The user characteristics include, for example, a plurality of dimensions. The user characteristics include, but are not limited to, at least one of user attribute characteristics or user behavior characteristics. The user attributes include, but are not limited to, at least one of age, height, ethnicity, gender, occupation, and face value of the user. The user behavior feature is used to describe the user's historical behavior. The user behavior features include, but are not limited to, at least one of a user's production behavior feature on the video or a user's consumption behavior feature on the video. In some embodiments, 64-dimensional features are generated as user features from the user's production and consumption behaviors by a user analysis department.
Other features refer to features of the video in modalities other than images and users. Other features include, but are not limited to, at least one of geographic location information and time information. For example, the geographic location information is used to indicate the geographic location where the video was shot, the location where the user published the video, or a location input by the user. For example, the time information is used to indicate the time at which the video was shot, or the period during which the user published the video. By inputting other features into the multimodal model 200 and providing a module for processing them in the multimodal model 200, the multimodal model 200 can be applied to more multimodal features, giving it stronger expandability and higher flexibility.
The multimodal model 200 includes at least one image processing module (image model), at least one feature processing module, a feature fusion layer (feature fusion layer), and a full connectivity layer (full connect layer).
The image processing module is used for extracting the characteristics of the image. The input parameters of each image processing module include a frame of image. The output parameters of each image processing module include image characteristics of the input image. The image features output by each image processing module are input to a feature fusion layer. For example, referring to fig. 2, the at least one image processing module includes an image processing module 211, an image processing module 212, and an image processing module 213. The image processing module 211 is used for extracting the features of the image 1, the image processing module 212 is used for extracting the features of the image 2, and the image processing module 213 is used for extracting the features of the image n. At least one image processing module is connected with the feature fusion layer.
In some embodiments, the image processing module is a classification network. The classification network is used for extracting image features and classifying according to the image features. In some embodiments, the classification network includes at least one convolution layer and a pooling layer. The convolution layers in the classification network are used to extract image features. Specifically, a first convolution layer in the classification network is used for carrying out convolution processing on an input image to obtain a feature map, and the feature map is output to a second convolution layer. The second convolution layer is used for carrying out convolution processing on the feature map output by the first convolution layer to obtain a feature map, and outputting the feature map to the third convolution layer. And by analogy, each convolution layer except the first convolution layer carries out convolution processing on the feature map output by the previous convolution layer, and the obtained feature map is output to the next convolution layer. The last convolution layer is used for carrying out convolution processing to obtain a feature map, and the feature map is output to the pooling layer. The pooling layer is used for pooling the feature map and outputting image features.
In some embodiments, the classification network is a ResNet-50 network. The ResNet-50 Network is a neural Network for image classification, specifically a Residual Network (ResNet). The ResNet-50 network performs a series of neural network operations, including 50 convolution operations, on the input image, outputting 512-dimensional image features. Referring to Table 1 below, table 1 is an illustration of the architecture of the ResNet-50 network and the operations performed by the various layers.
TABLE 1

Layer      Output Size   Configuration
Conv1      112x112       7x7, 64, stride 2
Conv2_x    56x56         3x3 max pool; [1x1, 64; 3x3, 64; 1x1, 256] x 3
Conv3_x    28x28         [1x1, 128; 3x3, 128; 1x1, 512] x 4
Conv4_x    14x14         [1x1, 256; 3x3, 256; 1x1, 1024] x 6
Conv5_x    7x7           [1x1, 512; 3x3, 512; 1x1, 2048] x 3
Pooling    512           outputs the 512-dimensional image feature
The ResNet-50 network shown in Table 1 includes convolutional layer Conv1, convolutional layer Conv2_x, convolutional layer Conv3_x, convolutional layer Conv4_x, convolutional layer Conv5_x, and Pooling layer Pooling.
The feature map size of the convolutional layer Conv1 output is 112x112. In "7x7, 64, stride 2" corresponding to the convolutional layer Conv1, the meaning of 7x7 is that the size of each convolution kernel in the convolutional layer Conv1 is 7x7; the meaning of 64 is that the convolutional layer Conv1 comprises 64 convolutional kernels; the meaning of stride 2 is that the stride (stride) of the convolutional layer Conv1 convolutional operation is 2.
The feature map size of the convolution layer Conv2_x output is 56x56. The convolution layer Conv2_x is also used to perform max pooling (max pool) with a window size of 3x3.
Corresponding to the convolution layer Conv2_x, [1x1, 64; 3x3, 64; 1x1, 256] x 3 means that the convolution layer Conv2_x comprises 64 convolution kernels of size 1x1, 64 convolution kernels of size 3x3 and 256 convolution kernels of size 1x1; the 3 in "x3" refers to the number of repetitions, i.e. this convolution sequence of the convolution layer Conv2_x is repeatedly performed 3 times.
The feature map size of the convolution layer Conv3_x output is 28x28. Corresponding to the convolution layer Conv3_x, [1x1, 128; 3x3, 128; 1x1, 512] x 4 means that the convolution layer Conv3_x contains 128 convolution kernels of size 1x1, 128 convolution kernels of size 3x3 and 512 convolution kernels of size 1x1, and that the convolution sequence of the convolution layer Conv3_x is repeatedly performed 4 times.
The feature map size of the convolution layer Conv4_x output is 14x14. Corresponding to the convolution layer Conv4_x, [1x1, 256; 3x3, 256; 1x1, 1024] x 6 means that the convolution layer Conv4_x comprises 256 convolution kernels of size 1x1, 256 convolution kernels of size 3x3 and 1024 convolution kernels of size 1x1, and that the convolution sequence of the convolution layer Conv4_x is repeatedly performed 6 times.
The feature map size of the convolution layer Conv5_x output is 7x7. Corresponding to the convolution layer Conv5_x, [1x1, 512; 3x3, 512; 1x1, 2048] x 3 means that the convolution layer Conv5_x contains 512 convolution kernels of size 1x1, 512 convolution kernels of size 3x3 and 2048 convolution kernels of size 1x1, and that the convolution sequence of the convolution layer Conv5_x is repeatedly performed 3 times.
Pooling layer Pooling outputs 512-dimensional image features.
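To make the image-processing path concrete, the following is a minimal PyTorch sketch of an image processing module built on a standard ResNet-50 backbone. The final linear projection to 512 dimensions is an assumption added here, since an off-the-shelf ResNet-50 pools to a 2048-dimensional vector while the description above calls for a 512-dimensional image feature; the class name and input size are likewise illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageFeatureExtractor(nn.Module):
    """One image processing module: maps a video frame to a 512-dimensional image feature."""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # Conv1 ... Conv5_x + pooling, randomly initialized
        # Keep everything except the final 1000-way classification layer.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        # Assumed projection: standard ResNet-50 pools to 2048 dims, the text describes 512.
        self.project = nn.Linear(2048, feature_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, 224, 224) preprocessed video frames
        pooled = self.backbone(frames).flatten(1)  # (batch, 2048)
        return self.project(pooled)                # (batch, 512)
```

In such a sketch, each extracted frame (image 1, image 2, ...) would be passed through a module of this kind to obtain its image feature before fusion.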
The at least one feature processing module is used to process the non-image features. The at least one feature processing module is connected with the feature fusion layer. For example, referring to FIG. 2, the at least one feature processing module includes feature processing module 214 and feature processing module 215.
Feature processing module 214 is configured to linearly map and non-linearly map user features to map user features from low-dimensional features to high-dimensional features. The input parameters of the feature processing module 214 include user features. The output parameters of feature processing module 214 include mapped user features. The number of dimensions of the mapped user features and image features are, for example, equal. The mapped user features output by the feature processing module 214 are input to the feature fusion layer.
The feature processing module 215 is configured to perform linear mapping and nonlinear mapping on other features, so as to map the other features from low-dimensional features to high-dimensional features. The input parameters of the feature processing module 215 include other features. The output parameters of feature processing module 215 include other features after mapping. The number of dimensions of the other features and the image features after mapping is for example equal. Other mapped features output by the feature processing module 215 are input to the feature fusion layer.
In some embodiments, the feature processing module is a multi-layer perceptron (Multilayer Perceptron, MLP), and the feature processing module is a multi-layer neural network. For example, the feature processing module is a fully-connected neural network, and the feature processing module is used for performing at least one full-connection operation on the user features and outputting the processed user features to the feature fusion layer. For example, referring to table 2 below, the feature processing module includes three full connection layers, each for performing a full connection operation, and performs three full connection operations on the input user feature, resulting in a 512-dimensional user feature. Specifically, the feature processing module includes a full connection layer full connect_1, a full connection layer full connect_2, and a full connection layer full connect_3. After the full connection layer fullconnect_1 performs full connection operation on the user feature, a 128-dimensional feature vector is output. After the full connection layer fullconnect_2 performs full connection operation on the 128-dimensional feature vector, a 256-dimensional feature vector is output. After the full connection layer fullconnect_3 performs full connection operation on the 256-dimensional feature vector, a 512-dimensional feature vector is output.
TABLE 2

Layer Name      Output Size
FullConnect_1   128
FullConnect_2   256
FullConnect_3   512
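A sketch of the feature processing module of Table 2 applied to the 64-dimensional user feature follows. The use of ReLU as the nonlinear mapping is an assumption, since the text only specifies "linear mapping and nonlinear mapping".

```python
import torch
import torch.nn as nn

class UserFeatureMLP(nn.Module):
    """Feature processing module: maps the 64-d user feature to a 512-d vector (Table 2)."""

    def __init__(self, in_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),  # FullConnect_1: linear mapping + nonlinear mapping
            nn.Linear(128, 256), nn.ReLU(),     # FullConnect_2
            nn.Linear(256, 512), nn.ReLU(),     # FullConnect_3
        )

    def forward(self, user_feature: torch.Tensor) -> torch.Tensor:
        # user_feature: (batch, 64) -> (batch, 512)
        return self.mlp(user_feature)
```

The feature processing module 215 for other features (geographic location, time) could reuse the same structure with a different input dimension.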
The feature fusion layer is used for fusing the image features of at least one frame of image with the user features to obtain fusion features. The input parameters of the feature fusion layer comprise the image features of each frame of image in at least one frame of image and the user features. The output parameters of the feature fusion layer include fusion features. The feature fusion layer is connected with the full connection layer. The fusion features output by the feature fusion layer are input to the full connection layer. For example, referring to FIG. 2, feature fusion layer 220 in FIG. 2 is an illustration of a feature fusion layer in a multimodal model. The feature fusion layer 220 fuses the image features of the image 1, the image features of the image 2 and the user features, and outputs a 2048-dimensional feature vector. In some embodiments, feature fusion layer 220 is a multi-head attention (multi-head attention) network. The multi-head attention network comprises h attention modules, wherein h is a positive integer greater than 1. Each attention module is used for carrying out self-attention operation. Each attention module includes a query weight matrix, a key weight matrix, a value weight matrix. Each attention module may implement a self-attention mechanism. The multi-head attention network is used for feature fusion through a multi-head attention mechanism.
Because the multi-head attention mechanism is used for feature fusion, the multimodal model can attend to information from different aspects and can therefore learn richer features. Experiments show that fusing the image features and the user features with a multi-head attention mechanism yields a gain of 1 percentage point on the training set compared with fusing them by simple feature concatenation.
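A sketch of how the feature fusion layer could be realized with multi-head self-attention over the per-modality 512-dimensional features is given below. The number of heads and the flattening of the attended features into a single vector (2048 dimensions for four modality inputs) are assumptions consistent with the figures described above, not details stated in the text.

```python
import torch
import torch.nn as nn

class FeatureFusionLayer(nn.Module):
    """Fuses per-modality 512-d features (images, user, other) with multi-head self-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, num_modalities, 512), e.g. [image 1, image 2, user, other]
        fused, _ = self.attn(modality_feats, modality_feats, modality_feats)  # self-attention
        return fused.flatten(1)  # (batch, num_modalities * 512), e.g. 2048 for four inputs
```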
The full-connection layer is used for classifying according to the fusion characteristics output by the characteristic fusion layer. In this embodiment, the task of predicting a topic label, such as predicting the probability that each candidate label matches a video, may be understood as an n-class task. Each category is a candidate tag. Specifically, the full connection layer is used for determining probabilities of a plurality of candidate labels according to the fusion characteristics. The input parameters of the full connection layer include fusion features. The output parameters of the full connection layer include a probability of each of the plurality of candidate tags. The number of dimensions of the output parameters of the fully connected layer is for example equal to the number of candidate labels. For example, referring to FIG. 2, the fully connected layer 230 of FIG. 2 is an illustration of a fully connected layer in a multimodal model. In some embodiments, the output parameter of the fully connected layer is Logits. Logits refers to the unnormalized probability. Optionally, the multimodal model 200 further includes a normalized index (Softmax) layer. The Softmax layer is connected to the fully-connected layer, and the Softmax layer is used for performing Softmax operation on the output of the fully-connected layer. The result of the Softmax operation is a real number between 0 and 1, and the result of the Softmax operation can be understood as a probability.
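A sketch of the full connection layer plus the Softmax operation described above; the size of the candidate-tag vocabulary is illustrative (the text only says the cleaned topic tag library contains tens of thousands of tags).

```python
import torch
import torch.nn as nn

class TagClassifier(nn.Module):
    """Full connection layer: maps the fused feature to one logit per candidate tag."""

    def __init__(self, fused_dim: int = 2048, num_candidate_tags: int = 50_000):
        super().__init__()
        self.fc = nn.Linear(fused_dim, num_candidate_tags)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.fc(fused)  # Logits (unnormalized probabilities)

# The Softmax layer turns the logits into a probability per candidate tag:
# probs = TagClassifier(2048, 50_000)(fused).softmax(dim=-1)
```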
The multimodal model 200 is trained from a plurality of sample videos and a topic tag that each sample video matches. For example, when training is performed according to a sample video a in a plurality of sample videos, at least one frame of image is extracted from the sample video a, a user feature corresponding to a user account uploading the sample video a is determined, a topic tag matched with the sample video a is determined from a topic tag library, and the image, the user feature and the topic tag of the sample video a are used as inputs of the multimodal model 200 for training.
In some embodiments, during the model training phase, the images input to the multimodal model 200 are two frames of images taken at equal intervals in the sample video. The user characteristics input to multimodal model 200 come from 64-dimensional characteristics produced by the user analysis department based on the user's production and consumption behaviors.
After the output result (such as the Logits) of the fully connected layer in the multimodal model 200 is obtained, a Softmax operation is performed on it. From the result of the Softmax operation and the topic label matched with the sample video, a loss value is calculated using a cross entropy loss (cross entropy loss) function. Parameters of the multimodal model 200 are adjusted through back propagation and gradient updates according to the loss value, so that the multimodal model 200 learns from the sample videos.
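A sketch of one training step as just described: forward pass, cross-entropy loss against the matched topic tag, back propagation and a gradient update. The `model(frames, user_feature)` signature and the choice of optimizer are assumptions used only to tie the step together.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, frames, user_feature, tag_index):
    """One training step on a batch of sample videos."""
    logits = model(frames, user_feature)  # output of the fully connected layer (Logits)
    # cross_entropy applies log-softmax internally, matching "Softmax + cross entropy loss".
    loss = nn.functional.cross_entropy(logits, tag_index)  # tag_index: id of the matched topic tag
    optimizer.zero_grad()
    loss.backward()        # back propagation
    optimizer.step()       # gradient update adjusts the parameters of the multimodal model
    return loss.item()
```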
In some embodiments, in the model prediction phase, classification is performed by the multimodal model 200 to recommend topic tags. For example, a topic tag library (HashTag word library) is cleaned, and each topic tag in the cleaned library is treated as one category to be identified by the multimodal model 200. Two frames extracted from a short video and the 64-dimensional user feature are input into the multimodal model 200, and the Softmax operation outputs a probability for each category (i.e., each candidate tag). The categories are sorted by probability, the top 10 categories are selected, the topic tags corresponding to those categories are taken as the topic tags to be recommended, and they are delivered to the user's editing interface. For the specific application of the multimodal model 200 to predict topic tags, refer to the embodiments shown in fig. 3 or fig. 4 below. In addition, the cleaned topic tag library contains tens of thousands of topic tags and covers most scenes of daily life, so classifying over it with the multimodal model 200 can meet the need of finding matching topic tags for a wide variety of scenes.
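And a sketch of the prediction phase: compute the Softmax probabilities over the cleaned topic tag library and keep the 10 highest-probability candidates as the topic tags to recommend. The model signature and the `tag_vocabulary` list are the same illustrative assumptions as above.

```python
import torch

@torch.no_grad()
def recommend_tags(model, frames, user_feature, tag_vocabulary, top_k: int = 10):
    """Returns the top-k candidate topic tags ranked by predicted probability."""
    probs = model(frames, user_feature).softmax(dim=-1)  # (1, num_candidate_tags)
    top_probs, top_ids = probs.topk(top_k, dim=-1)
    return [(tag_vocabulary[i], p.item()) for i, p in zip(top_ids[0].tolist(), top_probs[0])]
```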
FIG. 3 is a flowchart illustrating a method of topic tag recommendation, as shown in FIG. 3, for use in an electronic device, according to an exemplary embodiment, including the following steps.
In step S32, the electronic apparatus acquires a video.
In some embodiments, the video is a short video.
In step S34, the electronic device extracts at least one frame image from the video.
In step S36, the electronic device obtains a user characteristic corresponding to the user account according to the user account of the uploaded video.
It should be noted that, in this embodiment, the sequence of the step S34 and the step S36 is not limited. In some embodiments, step S34 and step S36 may be performed sequentially. For example, step S34 may be performed first, and then step S36 may be performed; step S36 may be performed first, and then step S34 may be performed. In other embodiments, the step S34 and the step S36 may be performed in parallel, that is, the step S34 and the step S36 may be performed simultaneously.
In step S38, the electronic device generates a topic tag matching the video according to at least one frame of image and the user characteristics.
In step S39, the electronic device recommends a topic label to the user account.
This embodiment provides a method of recommending a topic label based on features of the video in multiple modalities: a machine automatically generates the topic label from images in the video and from user characteristics of the video producer, and recommends the topic label to the user. On the one hand, features of the video in the image modality are used when generating the topic label, so the machine can understand the content of the video from a visual perspective according to the image-modality features, which ensures that the topic label matches the content of the video. On the other hand, features of the video in the user modality are used when generating the topic label, so the machine can learn information about the video producer according to the user-modality features, which ensures that the topic label reflects information about the video producer. Because the recommended topic label matches the content of the video and reflects information about the video producer, the match between the topic label and the video is well ensured, the accuracy of the topic label is improved, the recommended topic label is closer to the user's intention, and the accuracy of processes such as content identification, video distribution and video recommendation performed on the video according to its labels is improved.
Fig. 4 is a flowchart of a method for recommending a topic tag according to an exemplary embodiment, and as shown in fig. 4, an interaction body of the method for recommending a topic tag includes a terminal and a server, including the following steps.
In step S401, the terminal transmits a topic tag recommendation request to the server.
The topic tag recommendation request is for requesting the server to recommend a topic tag that matches the video. For example, the terminal displays a video distribution interface that includes topic options. The user triggers a click operation on the topic option. And the terminal responds to the clicking operation, generates a topic label recommendation request and sends the topic label recommendation request to the server.
In step S402, the server acquires a video.
The video is, for example, shot by the terminal. In some embodiments, when sending the topic tag recommendation request, the terminal sends the video to be published to the server together with the request, and the server receives the video sent by the terminal, thereby obtaining the video. In other embodiments, the server determines the user account logged in on the terminal and obtains a video historically published by the user account.
In step S403, the server extracts at least one frame image from the video.
Which frames are extracted from the video in order to generate the topic label covers a number of situations. In some embodiments, the images extracted from the video comprise multiple frames taken at equal intervals in the video. In some embodiments, the image extracted from the video includes a key frame of the video. In some embodiments, the image extracted from the video includes the video cover. In some embodiments, the image extracted from the video includes the video header.
In step S404, the server obtains a user characteristic corresponding to the user account according to the user account of the uploaded video.
In some embodiments, the user characteristics are obtained by mapping user attributes. For example, the server acquires a user attribute corresponding to the user account, and the server performs linear mapping and nonlinear mapping on the user attribute to obtain the user characteristics. Linear mapping includes, for example, multiplication by a weight matrix and addition of a bias. Nonlinear mapping includes, but is not limited to, taking a maximum, applying an activation function, and the like.
Because the user features are mapped by the user attributes, the user features can express information related to the user in terms of attributes. Thus, when generating a topic tag from a user feature, the machine is able to learn information of the user in terms of attributes from the user feature such that the generated topic tag embodies the information of the user in terms of attributes. Therefore, compared with a mode of generating the topic label by simply relying on the image, the topic label generated by utilizing the image and the user attribute is more refined, and the topic label issued to the user is ensured to be closer to the user intention. For example, the video that requires a recommended topic label is a female dancing video. If a purely visual scheme is used (i.e., only based on images and not based on user characteristics) to generate a topic label for a female dancing video, the generated topic label may be a more generalized label such as "dance", "national dance", etc. After the user characteristics mapped by the user attributes are superimposed on the basis of the image, the model can learn the nationality related information of the user according to the attributes of the user in the nationality dimension, so that the fine recommendation tags such as Dai nationality dance and the like are given, and the given tags are obviously closer to the intention of the user.
In some embodiments, the user characteristics are determined based on historical behavior of the user. For example, the server acquires historical behavior data corresponding to the user account; and the server performs linear mapping and nonlinear mapping on the historical behavior data to obtain user characteristics.
Because the user characteristics are mapped by the historical behavior data, the user characteristics can express information related to the historical behavior of the user or habit of the user. Therefore, when the topic label is generated according to the user characteristics, the machine can learn the information of the user in the aspect of the historical behaviors according to the user characteristics, so that the generated topic label reflects the information of the user in the aspect of the historical behaviors. Therefore, compared with a mode of generating the topic label by simply relying on the image, the topic label generated by utilizing the image and the historical behavior data is more refined, and the topic label issued to the user is ensured to be closer to the intention of the user.
In step S405, the server generates a topic tag matching the video according to at least one frame of image and the user characteristics.
In some embodiments, the topic labels are generated based on the multimodal model 200. For example, the server inputs at least one frame of image and user features into the multimodal model 200, processes the at least one frame of image and user features through the multimodal model 200, and outputs a topic tag.
In some embodiments, the generation process of the topic tag includes the following steps (1) to (4). For example, one or more of the following steps (1) through (4) are performed by corresponding modules in the multimodal model 200.
In the step (1), the server performs feature extraction on at least one frame of image respectively to obtain the image feature of each frame of image.
In some embodiments, the process of image feature extraction is implemented by an image processing module in the multimodal model 200. For example, the server inputs at least one frame of image into at least one image processing module respectively, and the at least one frame of image is subjected to feature extraction respectively through the at least one image processing module to obtain the image feature of each frame of image.
In the case where the image processing module is a classification network, in some embodiments, the server convolves at least one frame of image in parallel through at least one classification network, outputting image features for each frame of image. The parallel convolution processing is, for example, that a plurality of classification networks perform convolution processing simultaneously. For example, the server convolves the image 1 with the classification network 1 and convolves the image 2 with the classification network 2.
By using at least one classification network for parallel convolution processing, feature extraction for the multiple image inputs is parallelized, which accelerates feature extraction for the video in the image modality as a whole and improves the efficiency of predicting the topic label.
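One simple way to parallelize the per-frame convolution is sketched below: when the classification networks share weights, the extracted frames can be stacked into a single batch and pushed through one backbone in a single forward pass. This batching trick is a simplification of the text's description of several classification networks running simultaneously, used here only for illustration.

```python
import torch

def extract_image_features(extractor, frames):
    """Extracts image features for all frames in parallel.

    frames: list of (3, 224, 224) tensors, one per extracted frame.
    Returns a (num_frames, 512) tensor of image features.
    """
    batch = torch.stack(frames)   # (num_frames, 3, 224, 224): one batched forward
    return extractor(batch)       # convolves all frames together, e.g. with ImageFeatureExtractor
```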
In the step (2), the server fuses the image characteristics of at least one frame of image with the user characteristics to obtain fusion characteristics.
In some embodiments, the process of feature fusion is implemented through a feature fusion layer in the multimodal model 200. Specifically, the server inputs the image features and the user features of at least one frame of image to a feature fusion layer, and the feature fusion layer fuses the image features and the user features of at least one frame of image to obtain fusion features.
In the case that the feature fusion layer is a multi-head attention network, in some embodiments, the server performs self-attention computation on the image features and the user features of at least one frame of image through the multi-head attention network, so as to obtain fusion features. For example, the image features and the user features of at least one frame of image are respectively input into the attention modules corresponding to the multi-head attention network, and the self-attention operation is performed on the image features and the user features of at least one frame of image through a plurality of attention modules in parallel.
In step (3), the server determines probabilities of a plurality of candidate tags based on the fusion features.
The probability of the candidate tag is used to indicate the likelihood that the candidate tag is a topic tag that matches the video. The higher the probability of a candidate tag, the greater the likelihood that the candidate tag is a topic tag that matches the video. The probability determination process is implemented, for example, by a fully connected layer and Softmax operation in the multimodal model 200. For example, the server inputs the fusion features to the full-connection layer, maps the fusion features through the full-connection layer, and calculates the output of the full-connection layer through the Softmax function to obtain probabilities of a plurality of candidate labels.
In step (4), the server determines a topic tag among the plurality of candidate tags according to the probability of each candidate tag.
For example, the server ranks the plurality of candidate tags in descending order of probability, selects the candidate tags whose probabilities rank within a preset number of top positions, and takes each of those candidate tags as a topic tag to be recommended. The preset number is, for example, 10. Alternatively, the server selects the candidate tag with the highest probability among the plurality of candidate tags and uses the candidate tag with the highest probability as the topic tag to be recommended.
Through the steps (1) to (4), as the image features of the multi-frame images in the video are fused with the user features, the fused features not only comprise the visual features of the video but also comprise the user features. Therefore, the fusion feature not only expresses the content of the video, but also expresses the information of the producer of the video, and obviously, the expression capability of the fusion feature is stronger, so that the probability of each candidate label can be predicted more accurately by utilizing the fusion feature, thereby ensuring that the topic label determined in the candidate label is more accurate, and the recommended topic label is closer to the intention of the user.
In some embodiments, the server also predicts the topic tag using a geographic location modality. For example, a server acquires geographic position information corresponding to a video; the server generates a topic label matched with the video according to at least one frame of image, the user characteristics and the geographic position information. For example, referring to FIG. 2, the server inputs geographic location information as other features into the multimodal model 200. The server processes the geographic location information via feature processing module 215 in multimodal model 200 to output geographic location features. The server inputs the image features, the user features and the geographic position features of at least one frame of image to a feature fusion layer, and the image features, the user features and the geographic position features of at least one frame of image are fused through the feature fusion layer to obtain fusion features.
In this way, when the topic tag is generated, the features of the video in the geographic location modality are used in addition to its features in the image modality and the user modality. The generated topic tag can therefore reflect the geographic location associated with the video, which improves the match between the recommended topic tag and the video, makes the recommended tag more specific, and brings it closer to the user's intention.
In some embodiments, the server also uses a time modality to predict the topic tag. For example, the server acquires time information corresponding to the video and generates a topic tag matching the video from the at least one frame of image, the user features, and the time information. Referring to FIG. 2, the server inputs the time information as an additional feature into the multimodal model 200 and processes it through the feature processing module 215 to output a time feature. The server then inputs the image features of the at least one frame of image, the user features, and the time feature to the feature fusion layer, which fuses them to obtain the fusion feature.
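The time modality can be processed analogously; the cyclical hour-of-day encoding below is an assumption, chosen so that late-evening and early-morning times map to nearby features:

```python
import math
import torch
import torch.nn as nn

class TimeFeatureModule(nn.Module):
    """Sketch of a feature processing module for the time modality."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(2, dim)

    def forward(self, hour_of_day: torch.Tensor) -> torch.Tensor:
        angle = 2 * math.pi * hour_of_day / 24.0
        cyclic = torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)  # (batch, 2)
        return self.proj(cyclic)  # (batch, dim) time feature

time_mod = TimeFeatureModule()
t_feat = time_mod(torch.tensor([20.5]))  # a video uploaded around 20:30 -> (1, 512)
```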
In this way, when the topic tag is generated, the features of the video in the time modality are used in addition to its features in the image modality and the user modality. The generated topic tag can therefore reflect the time associated with the video, which improves the match between the recommended topic tag and the video, makes the recommended tag more specific, and brings it closer to the user's intention.
In step S406, the server recommends a topic label to the user account.
In step S407, the terminal displays a topic label.
Specifically, the server sends the topic tag to the terminal logged in with the user account. The terminal receives the topic tag and displays it in the topic tag adding interface; for example, the adding interface includes a search box in which the terminal displays the topic tag. When the user triggers a confirmation operation on the displayed topic tag, the terminal sends a topic tag addition request to the server, and the server adds the topic tag to the video in response to the request.
In the case where the server generates a plurality of topic tags, in some embodiments the terminal displays each of them. For example, the terminal displays a plurality of options, each containing one topic tag. The user triggers a confirmation operation on a target option among them; in response, the terminal determines the target topic tag in that option, carries it in a topic tag addition request, and sends the request to the server. The server, in response to the request, obtains the target topic tag from it and adds it to the video. In this way, recommended topic tags are not only delivered to the user but a tag selection function is also provided: the user can pick a preferred topic tag from several recommended ones, which extends the topic tag adding service and can increase topic tag usage.
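As an illustration of the exchange described above (the request format and field names are hypothetical; the patent does not define them):

```python
import json

def build_add_request(video_id: str, target_topic_tag: str) -> str:
    """Terminal side: wrap the user's confirmed target topic tag into an addition request."""
    return json.dumps({"video_id": video_id, "topic_tag": target_topic_tag})

def handle_add_request(payload: str, video_tags: dict) -> None:
    """Server side: read the target topic tag from the request and add it to the video."""
    request = json.loads(payload)
    video_tags.setdefault(request["video_id"], []).append(request["topic_tag"])

tags_by_video: dict = {}
handle_add_request(build_add_request("v123", "#citywalk"), tags_by_video)
print(tags_by_video)  # {'v123': ['#citywalk']}
```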
This embodiment provides a method for recommending topic tags based on the features of a video in multiple modalities: the machine automatically generates topic tags from the images in the video and the user features of the video's producer, and recommends them to the user. On the one hand, the features of the video in the image modality are used when generating the topic tag, so the machine can understand the content of the video from a visual perspective and ensure that the topic tag matches that content. On the other hand, the features of the video in the user modality are used, so the machine can learn information about the video's producer and ensure that the topic tag reflects that information. Because the recommended topic tag matches the content of the video and reflects information about its producer, the match between the recommended tag and the video is improved, the recommended tag is more specific and closer to the user's intention, and the accuracy of downstream processes such as content identification, distribution, and recommendation based on the video's tags is improved.
Fig. 5 is a block diagram illustrating a topic tag recommendation device, according to an example embodiment. Referring to fig. 5, the apparatus includes a first acquisition unit 501, an extraction unit 502, a second acquisition unit 503, a generation unit 504, and a recommendation unit 505.
A first acquisition unit 501 configured to perform acquisition of video;
an extraction unit 502 configured to perform extraction of at least one frame image from a video;
a second obtaining unit 503, configured to perform obtaining a user characteristic corresponding to a user account according to the user account of the uploaded video;
a generating unit 504 configured to perform generating a topic tag matching the video based on at least one frame of the image and the user characteristics;
a recommending unit 505 configured to perform recommending of the topic label to the user account.
This embodiment provides a device for recommending topic tags based on the features of a video in multiple modalities: the machine automatically generates topic tags from the images in the video and the user features of the video's producer, and recommends them to the user. On the one hand, the features of the video in the image modality are used when generating the topic tag, so the machine can understand the content of the video from a visual perspective and ensure that the topic tag matches that content. On the other hand, the features of the video in the user modality are used, so the machine can learn information about the video's producer and ensure that the topic tag reflects that information. Because the recommended topic tag matches the content of the video and reflects information about its producer, the match between the topic tag and the video is well ensured, the accuracy of the topic tag is improved, the recommended tag is closer to the user's intention, and the accuracy of processes such as content identification, video distribution, and video recommendation based on the video's tags is improved.
Optionally, the generating unit 504 is configured to perform feature extraction on at least one frame of image, so as to obtain an image feature of each frame of image; fusing the image characteristics of at least one frame of image with the user characteristics to obtain fusion characteristics; determining the probability of a plurality of candidate labels according to the fusion characteristics; a topic tag is determined among the plurality of candidate tags based on the probability of each candidate tag.
Optionally, the generating unit 504 is configured to perform self-attention computation on the image feature and the user feature of at least one frame of image through the multi-head attention network, so as to obtain a fusion feature.
Optionally, the generating unit 504 is configured to perform convolution processing on at least one frame of image in parallel through at least one classification network, and output an image feature of each frame of image.
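A sketch of such per-frame feature extraction (the tiny convolutional backbone and 512-dim output are stand-ins; in practice the classification network could be any image classifier):

```python
import torch
import torch.nn as nn

class FrameFeatureExtractor(nn.Module):
    """Sketch of a classification-network backbone that maps each frame to an image feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pooling over the spatial dimensions
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W); each frame is convolved independently
        x = self.conv(frames).flatten(1)  # (num_frames, 64)
        return self.proj(x)               # (num_frames, dim) image features

extractor = FrameFeatureExtractor()
frames = torch.randn(8, 3, 224, 224)      # 8 frames extracted from the video
image_feats = extractor(frames)           # (8, 512)
```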
Optionally, the second obtaining unit 503 is configured to obtain a user attribute corresponding to the user account; and performing linear mapping and nonlinear mapping on the user attributes to obtain user characteristics.
Optionally, the second obtaining unit 503 is configured to obtain historical behavior data corresponding to the user account; and performing linear mapping and nonlinear mapping on the historical behavior data to obtain user characteristics.
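The "linear mapping and nonlinear mapping" of user attributes or historical behavior data can be sketched as a small two-step module (the 16-dimensional input and the choice of ReLU are assumptions):

```python
import torch
import torch.nn as nn

class UserFeatureModule(nn.Module):
    """Sketch of mapping user attributes or historical behavior data to a user feature."""
    def __init__(self, in_dim: int = 16, dim: int = 512):
        super().__init__()
        self.linear = nn.Linear(in_dim, dim)  # linear mapping
        self.act = nn.ReLU()                  # nonlinear mapping

    def forward(self, user_data: torch.Tensor) -> torch.Tensor:
        return self.act(self.linear(user_data))  # (batch, dim) user feature

user_mod = UserFeatureModule()
behavior = torch.randn(1, 16)   # e.g. normalized counts of watched, liked, and uploaded topics
user_feat = user_mod(behavior)  # fed into the fusion layer together with the image features
```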
Optionally, the apparatus further comprises: a third acquisition unit configured to perform acquisition of geographic position information corresponding to the video;
a generating unit 504 configured to perform generating a topic tag matching the video based on the at least one frame of image, the user characteristics and the geographical location information.
Optionally, the apparatus further comprises: a fourth acquisition unit configured to perform acquisition of time information corresponding to the video;
a generating unit 504 configured to perform generating a topic tag matching the video based on at least one frame of image, the user characteristics and the time information.
The specific manner in which each unit performs its operations in the apparatus of the above embodiments has been described in detail in the corresponding method embodiments and will not be repeated here.
The electronic device in the above method embodiment may be implemented as a terminal or a server. For example, fig. 6 shows a block diagram of a terminal 600 provided in an exemplary embodiment of the present application. The terminal 600 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
In general, the terminal 600 includes: one or more processors 601 and one or more memories 602.
The processor 601 may include one or more processing cores, such as a 4-core or 8-core processor, and may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one program code for execution by processor 601 to implement the topic label recommendation method provided by the method embodiments of the present application.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603, and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 603 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 604, a display 605, a camera assembly 606, audio circuitry 607, a positioning assembly 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one Input/Output (I/O)-related peripheral to the processor 601 and the memory 602. In some embodiments, the processor 601, the memory 602, and the peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 604 is configured to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 604 communicates with communication networks and other communication devices via electromagnetic signals: it converts electrical signals into electromagnetic signals for transmission and converts received electromagnetic signals back into electrical signals. Optionally, the radio frequency circuit 604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may also include NFC (Near Field Communication) related circuits, which is not limited in the present application.
The display screen 605 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display 605 is a touch display, it can also collect touch signals at or above its surface; such a touch signal may be input to the processor 601 as a control signal for processing. In this case the display 605 may also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments there is one display 605, covering the front panel of the terminal 600; in other embodiments there are at least two displays 605, disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display disposed on a curved or folded surface of the terminal 600. The display 605 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera; typically the front camera is disposed on the front panel of the terminal and the rear camera on its rear surface. In some embodiments there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined for a background blurring function, and the main camera and the wide-angle camera can be combined for panoramic and VR (Virtual Reality) shooting or other combined shooting functions. In some embodiments, the camera assembly 606 may also include a flash, which can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash combines a warm-light flash with a cold-light flash and can be used for light compensation under different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing, or inputting the electric signals to the radio frequency circuit 604 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 607 may also include a headphone jack.
The positioning component 608 is used to locate the current geographic position of the terminal 600 to enable navigation or LBS (Location Based Services). The positioning component 608 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
A power supply 609 is used to power the various components in the terminal 600. The power source 609 may be alternating current, direct current, disposable battery or rechargeable battery. When the power source 609 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 further includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyroscope sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 611. The acceleration sensor 611 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 may collect a 3D motion of the user on the terminal 600 in cooperation with the acceleration sensor 611. The processor 601 may implement the following functions based on the data collected by the gyro sensor 612: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 613 may be disposed at a side frame of the terminal 600 and/or at a lower layer of the display 605. When the pressure sensor 613 is disposed at a side frame of the terminal 600, a grip signal of the terminal 600 by a user may be detected, and a left-right hand recognition or a shortcut operation may be performed by the processor 601 according to the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 605. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used to collect the user's fingerprint, and the processor 601 identifies the user based on the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 itself identifies the user from the collected fingerprint. When the user's identity is recognized as a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 614 may be provided on the front, back, or side of the terminal 600. When a physical key or vendor logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical key or vendor logo.
The optical sensor 615 is used to collect ambient light intensity. In one embodiment, processor 601 may control the display brightness of display 605 based on the intensity of ambient light collected by optical sensor 615. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 605 is turned up; when the ambient light intensity is low, the display brightness of the display screen 605 is turned down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 based on the ambient light intensity collected by the optical sensor 615.
A proximity sensor 616, also referred to as a distance sensor, is typically provided on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 616 detects a gradual decrease in the distance between the user and the front face of the terminal 600, the processor 601 controls the display 605 to switch from the bright screen state to the off screen state; when the proximity sensor 616 detects that the distance between the user and the front surface of the terminal 600 gradually increases, the processor 601 controls the display screen 605 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 6 is not limiting of the terminal 600 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
The electronic device in the above method embodiment may also be implemented as a server. For example, fig. 7 is a schematic structural diagram of a server provided in an embodiment of the present disclosure. The server 700 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 701 and one or more memories 702, where at least one program code is stored in the memories 702 and is loaded and executed by the processors 701 to implement the topic label recommendation method provided by the above method embodiments. Of course, the server may also have a wired or wireless network interface, an input/output interface, and other components for implementing device functions, which are not described here.
In an exemplary embodiment, a storage medium is also provided, e.g. a memory, comprising program code executable by a processor of an electronic device to perform the above-described topic label recommendation method. Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Read-Only optical disk (Compact Disc Read-Only Memory, CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
The user information referred to in the present disclosure may be information authorized by the user or sufficiently authorized by each party.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles, including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. The specification and examples are to be considered exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A method of topic tag recommendation, the method comprising:
acquiring a video;
extracting at least one frame of image from the video;
acquiring historical behavior data corresponding to a user account according to the user account uploading the video; performing linear mapping and nonlinear mapping on the historical behavior data to obtain user characteristics corresponding to the user account;
Respectively extracting the characteristics of at least one frame of image to obtain the image characteristics of each frame of image;
fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fusion characteristics;
determining the probability of a plurality of candidate labels according to the fusion characteristics;
determining a topic label matched with the video in the plurality of candidate labels according to the probability of each candidate label;
recommending the topic label to the user account.
2. The method of claim 1, wherein fusing the image features of the at least one frame of image with the user features to obtain fused features comprises:
and respectively carrying out self-attention operation on the image characteristics of the at least one frame of image and the user characteristics through a multi-head attention network to obtain the fusion characteristics.
3. The method of claim 1, wherein after the capturing the video, the method further comprises: obtaining geographic position information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises the following steps: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the geographic position information.
4. The method of claim 1, wherein after the capturing the video, the method further comprises: acquiring time information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises the following steps: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the time information.
5. A topic tag recommendation device, comprising:
a first acquisition unit configured to perform acquisition of a video;
an extraction unit configured to perform extraction of at least one frame image from the video;
the second acquisition unit is configured to acquire historical behavior data corresponding to a user account according to the user account uploading the video, and to perform linear mapping and nonlinear mapping on the historical behavior data to obtain user characteristics corresponding to the user account;
the generating unit is configured to perform feature extraction on the at least one frame of image respectively to obtain image features of each frame of image; fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fusion characteristics; determining the probability of a plurality of candidate labels according to the fusion characteristics; determining a topic label matched with the video in the plurality of candidate labels according to the probability of each candidate label;
And the recommending unit is configured to execute the recommendation of the topic label to the user account.
6. The topic tag recommendation device of claim 5, wherein the generating unit is configured to perform a self-attention operation on the image feature of the at least one frame of image and the user feature through a multi-head attention network to obtain the fusion feature.
7. An electronic device, comprising:
one or more processors;
one or more memories for storing the one or more processor-executable program codes;
wherein the one or more processors are configured to execute the program code to implement the topic label recommendation method as recited in any one of claims 1 to 4.
8. A storage medium, wherein the program code in the storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the topic label recommendation method of any one of claims 1 to 4.