CN111897996A - Topic label recommendation method, device, equipment and storage medium - Google Patents

Topic label recommendation method, device, equipment and storage medium


Publication number
CN111897996A
Authority
CN
China
Prior art keywords
image
video
user
topic
frame
Prior art date
Legal status
Granted
Application number
CN202010797673.XA
Other languages
Chinese (zh)
Other versions
CN111897996B (en)
Inventor
吴翔宇
杨帆
王思博
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010797673.XA
Publication of CN111897996A
Application granted
Publication of CN111897996B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval using metadata automatically derived from the content
    • G06F16/7844 Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7867 Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Abstract

The disclosure relates to a topic tag recommendation method, apparatus, device and storage medium, and belongs to the technical field of multimedia. The embodiments of the disclosure provide a method for recommending topic labels based on features of a video in multiple modalities: the topic labels are automatically generated by a machine from images in the video and the user characteristics of the video producer, and are then recommended to the user. Because the recommended topic tag both matches the content of the video and reflects information about the video producer, the degree of matching between the topic tag and the video is fully ensured, the accuracy of the topic tag is improved, and the recommended topic tag is closer to the user's intention.

Description

Topic label recommendation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of multimedia technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recommending a topic tag.
Background
A topic label (hashtag) of a video is a piece of text prefixed with "#" that the user attaches to the video content when describing it during video production. Topic tags play an important role in content identification, aggregation, distribution and recommendation of videos.
In the related art, a topic label is added as follows: after shooting a video, the user clicks the topic option on the publishing page, and the video client displays a search box. The user thinks of a topic tag that matches the video, types it into the search box, and then confirms the input, so that the entered topic tag is added when the video is published.
With this approach, the user has to determine the matching topic labels manually. Manual determination is subjective, so the match between the chosen topic labels and the video is hard to guarantee; the accuracy of the topic labels is therefore poor, which in turn degrades the accuracy of content identification, distribution, recommendation and other processes performed on the video according to its labels.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a device, and a storage medium for recommending a topic tag, which can improve the accuracy of the topic tag. The technical scheme of the disclosure is as follows:
According to a first aspect of the embodiments of the present disclosure, there is provided a topic tag recommendation method, including: acquiring a video;
extracting at least one frame of image from the video;
acquiring, according to a user account that uploads the video, user characteristics corresponding to the user account;
generating a topic label matched with the video according to the at least one frame of image and the user characteristics;
recommending the topic label to the user account.
Optionally, the generating a topic tag matched with the video according to the at least one frame of image and the user feature includes:
respectively extracting the features of the at least one frame of image to obtain the image features of each frame of image;
fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fused characteristics;
determining the probability of a plurality of candidate labels according to the fusion characteristics;
determining the topic tag among the plurality of candidate tags according to the probability of each candidate tag.
Optionally, the fusing the image feature of the at least one frame of image with the user feature to obtain a fused feature includes:
and respectively carrying out self-attention operation on the image characteristics of the at least one frame of image and the user characteristics through a multi-head attention network to obtain the fusion characteristics.
Optionally, the performing feature extraction on the at least one frame of image respectively to obtain image features of each frame of image includes:
and performing convolution processing on the at least one frame of image in parallel through at least one classification network, and outputting the image characteristics of each frame of image.
Optionally, the obtaining, according to the user account for uploading the video, the user characteristic corresponding to the user account includes:
acquiring a user attribute corresponding to the user account;
and performing linear mapping and nonlinear mapping on the user attributes to obtain the user characteristics.
Optionally, the obtaining, according to the user account for uploading the video, the user characteristic corresponding to the user account includes:
acquiring historical behavior data corresponding to the user account;
and performing linear mapping and nonlinear mapping on the historical behavior data to obtain the user characteristics.
Optionally, after the video is acquired, the method further includes: acquiring geographic position information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the geographic position information.
Optionally, after the video is acquired, the method further includes: acquiring time information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the time information.
According to a second aspect of the embodiments of the present disclosure, there is provided a topic tag recommendation apparatus including:
a first acquisition unit configured to perform acquisition of a video;
an extraction unit configured to perform extraction of at least one frame of image from the video;
the second acquisition unit is configured to execute user account uploading of the video and acquire user characteristics corresponding to the user account;
a generating unit configured to generate a topic label matched with the video according to the at least one frame of image and the user feature;
a recommending unit configured to perform recommending the topic tag to the user account.
Optionally, the generating unit is configured to perform feature extraction on the at least one frame of image respectively to obtain image features of each frame of image; fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fused characteristics; determining the probability of a plurality of candidate labels according to the fusion characteristics; determining the topic tag among the plurality of candidate tags according to the probability of each candidate tag.
Optionally, the generating unit is configured to perform a self-attention operation on the image feature of the at least one frame of image and the user feature through a multi-head attention network, so as to obtain the fusion feature.
Optionally, the generating unit is configured to perform convolution processing on the at least one frame of image through at least one classification network in parallel, and output the image features of each frame of image.
Optionally, the second obtaining unit is configured to perform obtaining of a user attribute corresponding to the user account; and performing linear mapping and nonlinear mapping on the user attributes to obtain the user characteristics.
Optionally, the second obtaining unit is configured to perform obtaining of historical behavior data corresponding to the user account; and performing linear mapping and nonlinear mapping on the historical behavior data to obtain the user characteristics.
Optionally, the apparatus further comprises: the third acquisition unit is configured to acquire the geographic position information corresponding to the video;
the generating unit is configured to execute generating a topic label matched with the video according to the at least one frame of image, the user feature and the geographic position information.
Optionally, the apparatus further comprises: a fourth obtaining unit configured to perform obtaining time information corresponding to the video;
the generating unit is configured to generate a topic label matched with the video according to the at least one frame of image, the user characteristic and the time information.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
one or more processors;
one or more memories for storing the processor-executable program code;
wherein the one or more processors are configured to execute the program code to implement the above-described topic tag recommendation method.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a storage medium, wherein when program codes in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the above-mentioned topic tag recommendation method.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising one or more program codes which, when executed by a processor of an electronic device, enable the electronic device to perform the above-mentioned topic tag recommendation method.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the embodiment of the disclosure provides a method for recommending topic labels based on the characteristics of videos in multiple modalities, wherein the topic labels are automatically generated by a machine through images in the videos and the user characteristics of a video producer, and are recommended to a user. On one hand, the characteristics of the video in the image modality are utilized when the topic tag is generated, so that the machine can understand the content of the video from the visual angle according to the characteristics of the image modality, and the topic tag is ensured to be matched with the content of the video. On the other hand, the characteristics of the video in the user mode are utilized when the topic tag is generated, so that the machine can learn the information of the video producer according to the characteristics of the user mode, and the topic tag can be ensured to reflect the information of the video producer. The recommended topic tag is matched with the content of the video, and the information of a video producer is reflected, so that the matching degree between the topic tag and the video is fully ensured, the accuracy of the topic tag is improved, the recommended topic tag is closer to the intention of a user, and the accuracy of the processes of content identification, video distribution, video recommendation and the like of the video according to the video tag is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is an architecture diagram illustrating an environment for implementing a method of topic tag recommendation in accordance with an exemplary embodiment;
FIG. 2 is a block diagram illustrating a multimodal model for topic tag recommendation in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of topic tag recommendation in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating a method of topic tag recommendation in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a topic tag recommendation apparatus in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a terminal in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating a server in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The topic tag recommendation method provided by the embodiments of the present application can be applied to a scenario in which a user adds a topic tag when publishing a work. For example, when a user shoots a short video and wants to publish it on a platform, the server can automatically generate matched topic tags for the short video through the topic tag recommendation method provided by this embodiment and deliver them to the publishing page, providing the user with the ability to simply select a topic tag. The topic labels and the related scenarios are briefly introduced below.
A topic label (HashTag) is a textual description of a video. For example, the topic label is a piece of text prefixed with "#" that the user attaches during video production when describing the video content, such as "#dance" or "#fun". In a short-video publishing scenario, the platform aggregates and distributes short videos based on the topic tags associated with them, so accurate topic tags play a significant positive role in the understanding, aggregation and distribution of short videos.
The topic tag generation process includes two possible implementations.
In one possible implementation, the short-video topic tag is typed in actively by the user. This mode reflects the user's real intention, but it requires the user to determine the topic label manually, which is difficult and suffers from low user initiative. Sampling of short videos on the platform shows that only about 15% of short videos contain a topic tag. As a result, few works currently carry topic labels, the exposure and usage of topic labels are insufficient, the utilization of the topic label service is affected, and the benefit of topic labels for content operation is far from maximized.
In another possible implementation, image frames are extracted from the video, visual understanding is performed on these frames to understand the video content, and the short-video topic tag is obtained from the understood content. This approach places high requirements on both the timeliness and the accuracy of video content understanding.
In view of the above requirements, some embodiments of the present application provide a topic label recommendation technique based on multi-modal learning: visual understanding is performed on images, a set of user features is maintained online, and a multi-modal model is trained and used for inference on both the image features and the user features, so that topic labels closer to the author's real intention are delivered. On the one hand, the user is spared the trouble of typing a topic tag manually and only needs to select one. On the other hand, the exposure and usage of topic labels can be directly increased.
Fig. 1 is an architecture diagram illustrating an implementation environment of a method for topic tag recommendation in accordance with an exemplary embodiment. The implementation environment includes: a terminal 101 and a video distribution platform 110.
The terminal 101 is connected to the video distribution platform 110 through a wireless network or a wired network. The terminal 101 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a laptop computer. An application program supporting video distribution is installed and runs on the terminal 101. The application may be a live-streaming application, a multimedia application, a short-video application, or the like. Illustratively, the terminal 101 is a terminal used by a user, and a user account is logged in to the application running on the terminal 101.
The terminal 101 is connected to the video distribution platform 110 through a wireless network or a wired network.
The video distribution platform 110 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The video publishing platform 110 is used to provide background services for applications that support video publishing, topic tag adding, or video playing functions. Optionally, the video distribution platform 110 and the terminal 101 may work in cooperation during execution of the method embodiments described below. For example, the video distribution platform 110 undertakes primary work, and the terminal 101 undertakes secondary work; or, the video distribution platform 110 undertakes the secondary work, and the terminal 101 undertakes the primary work; alternatively, the video distribution platform 110 or the terminal 101 may be respectively responsible for the work alone.
Optionally, the video distribution platform 110 includes: an access server, a traffic server 1101 and a database 1102. The access server is used to provide access services for the terminal 101. The business server 1101 is used to provide background services related to recommending topic labels, such as training multimodal models, extracting user features, collecting sample videos, and the like. The service server 1101 may be one or more. When the business servers 1101 are multiple, at least two business servers 1101 are present for providing different services, and/or at least two business servers 1101 are present for providing the same service, for example, providing the same service in a load balancing manner, which is not limited by the embodiment of the present disclosure. The database 1102 may be used to store videos, sample videos, or other data related to the method embodiments described below, etc., and the database 1102 may provide the stored data to the terminal 101 and the service server 1101 when needed.
The terminal 101 may be generally referred to as one of a plurality of terminals, and the embodiment is only illustrated by the terminal 101.
Those skilled in the art will appreciate that the number of terminals 101 may be greater or fewer. For example, the number of the terminal 101 may be only one, or the number of the terminal 101 may be tens or hundreds, or more, and in this case, the implementation environment further includes other terminals. The number of terminals and the type of the device are not limited in the embodiments of the present disclosure.
The following describes a model for a topic tag recommendation task according to this embodiment.
Referring to fig. 2, fig. 2 shows an architecture diagram of a model for implementing the hashtag recommendation function. The multimodal model 200 of FIG. 2 is used to perform a topic tag recommendation task. Multimodal model 200 is, for example, a machine learning model. The multimodal model 200 is, for example, a deep learning model. The multi-modal model 200 extracts features of the video in multiple modalities and generates topic labels matching the video according to the features of the multiple modalities. Modality refers to the source or form of data, and different modalities of the same data may characterize the data from different aspects. The modalities of the video include, without limitation, at least one of images in the video, users of the video (e.g., producers or consumers), geographic location, time, audio, text, semantics.
The input parameters of multimodal model 200 include at least one frame of image, user characteristics, and other characteristics.
The at least one frame of image includes, for example, two frames of images extracted from the video. Each frame image is also called a path of image input.
For example, referring to fig. 2, image 1 (image #1), image 2 (image #2), and image n (image #n) in fig. 2 are examples of multi-frame images in the input parameters. The ellipses (…) represent other images not shown in fig. 2 but which may also serve as input parameters. The present embodiment does not limit how many frames of images are supported by the multimodal model 200. For example, the number of images input by the multimodal model 200 can be configured according to computational requirements or latency requirements.
The user feature (user-feature) is a feature of the producer of the video, for example the author of the video. The user feature is, for example, a feature vector and includes, for example, multiple dimensions. The user characteristics include, without limitation, at least one of user attribute characteristics or user behavior characteristics. The user attributes include, without limitation, at least one of the user's age, height, ethnicity, gender, occupation, and appearance rating. The user behavior features describe the historical behavior of the user and include, without limitation, at least one of the user's production behavior features for videos or the user's consumption behavior features for videos. In some embodiments, 64-dimensional features generated by the user analysis department from the user's production behavior and consumption behavior are used as the user features.
Other features (other features) refer to features of other modalities of video besides images and users. Other features include, without limitation, at least one of geographic location information, time information. For example, the geographical location information is used to indicate the geographical location where the video was taken. For example, the geographical location information is used to indicate the location where the user published the video. For example, the geographical location information is used to indicate a location input by the user. For example, the time information is used to indicate the time at which the video was captured. For example, the time information is a time period when the user publishes the video. By inputting other characteristics into the multi-modal model 200 and setting modules for processing other characteristics in the multi-modal model 200, the multi-modal model 200 is suitable for processing more modal characteristics, and has stronger expandability and higher flexibility.
The multimodal model 200 includes at least one image processing module (image model), at least one feature processing module, a feature fusion layer, and a fully connected layer (full connect layer).
The image processing module is used for extracting the features of the image. The input parameters of each image processing module comprise a frame of image. The output parameters of each image processing module include image characteristics of the input image. The image features output by each image processing module are input into the feature fusion layer. For example, referring to fig. 2, the at least one image processing module includes an image processing module 211, an image processing module 212, and an image processing module 213. The image processing module 211 is used for extracting the features of the image 1, the image processing module 212 is used for extracting the features of the image 2, and the image processing module 213 is used for extracting the features of the image n. At least one image processing module is connected with the feature fusion layer.
In some embodiments, the image processing module is a classification network. The classification network is used for extracting image features and classifying according to the image features. In some embodiments, the classification network includes at least one convolutional layer and a pooling layer. The convolutional layers in the classification network are used to extract image features. Specifically, the first convolutional layer in the classification network is used for performing convolution processing on the input image to obtain a feature map, and the feature map is output to the second convolutional layer. The second convolutional layer is used for carrying out convolution processing on the feature map output by the first convolutional layer to obtain a feature map, and the feature map is output to the third convolutional layer. And by analogy, each convolutional layer except the first convolutional layer performs convolution processing on the feature map output by the previous convolutional layer, and outputs the obtained feature map to the next convolutional layer. And the last convolution layer is used for performing convolution processing to obtain a characteristic diagram, and the characteristic diagram is output to the pooling layer. The pooling layer is used for pooling the feature map and outputting image features.
In some embodiments, the classification network is a ResNet-50 network. ResNet-50 is a residual network (ResNet) used for image classification. The ResNet-50 network performs a series of neural network operations, including 50 convolution operations, on the input image and outputs a 512-dimensional image feature. Table 1 below illustrates the architecture of the ResNet-50 network and the operations performed by its layers.
TABLE 1

Layer Name   Output Size   Configuration
Conv1        112x112       7x7, 64, stride 2
Conv2_x      56x56         3x3 max pool; [1x1, 64 | 3x3, 64 | 1x1, 256] x3
Conv3_x      28x28         [1x1, 128 | 3x3, 128 | 1x1, 512] x4
Conv4_x      14x14         [1x1, 256 | 3x3, 256 | 1x1, 1024] x6
Conv5_x      7x7           [1x1, 512 | 3x3, 512 | 1x1, 2048] x3
Pooling      512           outputs a 512-dimensional image feature
The ResNet-50 network shown in Table 1 includes convolutional layer Conv1, convolutional layer Conv2_ x, convolutional layer Conv3_ x, convolutional layer Conv4_ x, convolutional layer Conv5_ x, and Pooling layer Pooling.
The feature map output by convolutional layer Conv1 is 112x112 in size. In the entry "7x7, 64, stride 2" corresponding to convolutional layer Conv1, 7x7 means that each convolution kernel in Conv1 has size 7x7; 64 means that Conv1 includes 64 convolution kernels; stride 2 means that the step size (stride) of the Conv1 convolution operation is 2.
The feature map output by convolutional layer Conv2_x is 56x56 in size. Convolutional layer Conv2_x also performs max pooling (max pool) with a window size of 3x3. In the entry [1x1, 64 | 3x3, 64 | 1x1, 256] x3 corresponding to convolutional layer Conv2_x, the bracketed part means that Conv2_x contains 64 convolution kernels of size 1x1, 64 convolution kernels of size 3x3, and 256 convolution kernels of size 1x1. The 3 in "x3" denotes the number of convolution repetitions, meaning that this group of convolutions in Conv2_x is performed 3 times.
The feature map output by convolutional layer Conv3_x is 28x28 in size. In the entry [1x1, 128 | 3x3, 128 | 1x1, 512] x4 corresponding to convolutional layer Conv3_x, the bracketed part means that Conv3_x contains 128 convolution kernels of size 1x1, 128 convolution kernels of size 3x3, and 512 convolution kernels of size 1x1. The 4 in "x4" denotes the number of convolution repetitions, meaning that this group of convolutions in Conv3_x is performed 4 times.
The feature map output by convolutional layer Conv4_x is 14x14 in size. In the entry [1x1, 256 | 3x3, 256 | 1x1, 1024] x6 corresponding to convolutional layer Conv4_x, the bracketed part means that Conv4_x contains 256 convolution kernels of size 1x1, 256 convolution kernels of size 3x3, and 1024 convolution kernels of size 1x1. The 6 in "x6" denotes the number of convolution repetitions, meaning that this group of convolutions in Conv4_x is performed 6 times.
The feature map output by convolutional layer Conv5_x is 7x7 in size. In the entry [1x1, 512 | 3x3, 512 | 1x1, 2048] x3 corresponding to convolutional layer Conv5_x, the bracketed part means that Conv5_x contains 512 convolution kernels of size 1x1, 512 convolution kernels of size 3x3, and 2048 convolution kernels of size 1x1. The 3 in "x3" denotes the number of convolution repetitions, meaning that this group of convolutions in Conv5_x is performed 3 times.
The Pooling layer Pooling outputs 512-dimensional image features.
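As a concrete illustration of such an image processing module, the following sketch (in Python, using PyTorch and torchvision) wraps a ResNet-50 backbone as a per-frame feature extractor. It is only an illustrative sketch rather than the patented implementation: the use of the torchvision ResNet-50 and the extra linear projection that maps the standard 2048-dimensional pooled output down to the 512-dimensional image feature described above are assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models

    class ImageFeatureExtractor(nn.Module):
        """One image processing module: ResNet-50 backbone plus a projection to 512 dimensions."""
        def __init__(self, out_dim=512):
            super().__init__()
            backbone = models.resnet50()  # Conv1 .. Conv5_x and the pooling layer; pretrained weights omitted for brevity
            # Drop the final classification head so the module outputs pooled features
            self.backbone = nn.Sequential(*list(backbone.children())[:-1])
            # Assumption: a linear projection maps the 2048-d pooled output to the 512-d image feature
            self.project = nn.Linear(2048, out_dim)

        def forward(self, image):              # image: (batch, 3, 224, 224) preprocessed frame
            feat = self.backbone(image)        # (batch, 2048, 1, 1) after the pooling layer
            feat = torch.flatten(feat, 1)      # (batch, 2048)
            return self.project(feat)          # (batch, 512) image feature

A preprocessed frame tensor then yields a 512-dimensional image feature that can be fed into the feature fusion layer.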
The at least one feature processing module is used for processing features and is connected to the feature fusion layer. For example, referring to fig. 2, the at least one feature processing module includes a feature processing module 214 and a feature processing module 215.
The feature processing module 214 is configured to perform linear mapping and non-linear mapping on the user features, so as to map the user features from low-dimensional features to high-dimensional features. The input parameters of the feature processing module 214 include user features. The output parameters of the feature processing module 214 include the mapped user features. The number of dimensions of the mapped user features and image features is, for example, equal. The mapped user features output by the feature processing module 214 are input to the feature fusion layer.
The feature processing module 215 is configured to perform linear mapping and non-linear mapping on the other features, so as to map the other features from low-dimensional features to high-dimensional features. The input parameters of the feature processing module 215 include other features. The output parameters of feature processing module 215 include the mapped other features. The number of dimensions of the mapped other features and the image features is, for example, equal. The mapped other features output by the feature processing module 215 are input to the feature fusion layer.
In some embodiments, the feature processing module is a multilayer perceptron (MLP), i.e. a multilayer neural network. For example, the feature processing module is a fully connected neural network that performs at least one fully connected operation on the user feature and outputs the processed user feature to the feature fusion layer. For example, referring to Table 2 below, the feature processing module includes three fully connected layers, each performing one fully connected operation, so the feature processing module performs three fully connected operations on the input user feature to obtain a 512-dimensional user feature. Specifically, the feature processing module includes fully connected layers FullConnect_1, FullConnect_2 and FullConnect_3. FullConnect_1 performs a fully connected operation on the user feature and outputs a 128-dimensional feature vector. FullConnect_2 performs a fully connected operation on the 128-dimensional feature vector and outputs a 256-dimensional feature vector. FullConnect_3 performs a fully connected operation on the 256-dimensional feature vector and outputs a 512-dimensional feature vector.
TABLE 2

Layer Name      Output Size
FullConnect_1   128
FullConnect_2   256
FullConnect_3   512
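A minimal sketch of such a feature processing module, following the layer sizes in Table 2; the ReLU activations between the fully connected layers are an assumption, since the description only states that linear and nonlinear mappings are performed.

    import torch.nn as nn

    class FeatureProcessingModule(nn.Module):
        """Maps a low-dimensional user (or other) feature to a 512-dimensional vector."""
        def __init__(self, in_dim=64):               # e.g. the 64-dimensional user feature
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(in_dim, 128), nn.ReLU(),   # FullConnect_1
                nn.Linear(128, 256), nn.ReLU(),      # FullConnect_2
                nn.Linear(256, 512),                 # FullConnect_3
            )

        def forward(self, x):                        # x: (batch, in_dim)
            return self.mlp(x)                       # (batch, 512) mapped feature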
The feature fusion layer is used to fuse the image features of the at least one frame of image with the user features to obtain a fused feature. The input parameters of the feature fusion layer include the image feature of each of the at least one frame of image and the user feature. The output parameter of the feature fusion layer is the fused feature. The feature fusion layer is connected to the fully connected layer, and the fused feature output by the feature fusion layer is input into the fully connected layer. For example, referring to FIG. 2, the feature fusion layer 220 in FIG. 2 illustrates a feature fusion layer in the multi-modal model. The feature fusion layer 220 fuses the image feature of image 1, the image feature of image 2 and the user feature, and outputs a 2048-dimensional feature vector. In some embodiments, the feature fusion layer 220 is a multi-head attention network. The multi-head attention network includes h attention modules, where h is a positive integer greater than 1. Each attention module performs a self-attention operation and includes a query weight matrix, a key weight matrix and a value weight matrix, thereby implementing a self-attention mechanism. The multi-head attention network performs feature fusion through the multi-head attention mechanism.
Because feature fusion through the multi-head attention mechanism lets the multi-modal model attend to information in different aspects, the model can learn richer features. Experiments show that fusing the image features and the user features with the multi-head attention mechanism yields a gain of 1 percentage point on the training set compared with simple feature concatenation.
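This fusion step can be sketched with PyTorch's nn.MultiheadAttention as below. Treating each 512-dimensional image feature and the mapped user feature as one token of a short sequence, running self-attention over those tokens, and flattening the attended tokens into one fused vector are assumptions; the description does not specify exactly how the 2048-dimensional fused vector is assembled, and the choice of 8 attention heads is illustrative.

    import torch
    import torch.nn as nn

    class FeatureFusionLayer(nn.Module):
        """Fuses per-frame image features and the user feature with multi-head self-attention."""
        def __init__(self, dim=512, heads=8):   # heads: the h attention modules (h = 8 is assumed)
            super().__init__()
            self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

        def forward(self, features):            # features: list of (batch, 512) tensors
            tokens = torch.stack(features, dim=1)            # (batch, num_tokens, 512), one token per input
            attended, _ = self.attn(tokens, tokens, tokens)  # self-attention: query = key = value
            return attended.flatten(start_dim=1)             # (batch, num_tokens * 512) fused feature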
The fully connected layer is used for classification according to the fused feature output by the feature fusion layer. In this embodiment, the task of predicting the topic label can be understood as an n-class classification task, where each class is a candidate label, and the prediction is, for example, the probability that each candidate label matches the video. Specifically, the fully connected layer determines the probabilities of a plurality of candidate tags from the fused feature. The input parameter of the fully connected layer is the fused feature; the output parameters include a score for each of the plurality of candidate tags, and the dimension of the output is, for example, equal to the number of candidate tags. For example, referring to FIG. 2, the fully connected layer 230 in FIG. 2 illustrates a fully connected layer in the multi-modal model. In some embodiments, the output parameters of the fully connected layer are logits, i.e. non-normalized probabilities. Optionally, the multimodal model 200 further includes a Softmax (normalized exponential) layer. The Softmax layer is connected to the fully connected layer and performs a Softmax operation on the output of the fully connected layer. The result of the Softmax operation is a real number between 0 and 1 and can be understood as a probability.
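A minimal sketch of this classification head; feeding the fused feature into a single linear layer follows the description above, while the candidate tag count and everything else are assumptions.

    import torch
    import torch.nn as nn

    class TagClassifier(nn.Module):
        """Fully connected layer plus Softmax over the candidate topic tags."""
        def __init__(self, fused_dim=2048, num_candidate_tags=50000):  # tag count is illustrative
            super().__init__()
            self.fc = nn.Linear(fused_dim, num_candidate_tags)

        def forward(self, fused):                  # fused: (batch, fused_dim) fusion feature
            logits = self.fc(fused)                # non-normalized scores, one per candidate tag
            return torch.softmax(logits, dim=-1)   # probabilities between 0 and 1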
The multimodal model 200 is trained from a plurality of sample videos and topic labels matched with each sample video. For example, when training is performed according to a sample video a in a plurality of sample videos, at least one frame of image is extracted from the sample video a, a user feature corresponding to a user account for uploading the sample video a is determined, a topic tag matched with the sample video a is determined from a topic tag library, and the image, the user feature and the topic tag of the sample video a are used as input of the multimodal model 200 for training.
In some embodiments, during the model training phase, the images input to multimodal model 200 are two frames of images taken at equal intervals in the sample video. The user characteristics input to the multimodal model 200 come from 64-dimensional characteristics generated by the user analysis department based on the user's production and consumption behaviors.
After the output (e.g. the logits) of the fully connected layer in the multimodal model 200 is obtained, a Softmax operation is applied to it. Based on the result of the Softmax operation and the topic label matched with the sample video, a loss value is calculated using a cross-entropy loss function. According to the loss value, the parameters of the multimodal model 200 are adjusted through back propagation and gradient updates, so that the multimodal model 200 learns from the sample video.
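A minimal sketch of one such training step, assuming a model that maps the extracted frames and the user feature to logits over the candidate tags; PyTorch's nn.CrossEntropyLoss applies the Softmax internally, so it is fed the raw logits, which is equivalent to computing the cross entropy from the Softmax result and the matched topic label.

    import torch.nn as nn

    def train_step(model, optimizer, frames, user_feature, tag_index):
        """One gradient update on a (sample video, matched topic tag) pair."""
        criterion = nn.CrossEntropyLoss()          # cross-entropy loss, as in the description
        logits = model(frames, user_feature)       # (batch, num_candidate_tags) pre-Softmax scores
        loss = criterion(logits, tag_index)        # tag_index: (batch,) ground-truth tag ids
        optimizer.zero_grad()
        loss.backward()                            # back propagation
        optimizer.step()                           # gradient update of the model parameters
        return loss.item()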
In some embodiments, in the model prediction stage, classification by the multimodal model 200 enables topic tag recommendation. For example, a topic tag library (HashTag thesaurus) is cleaned, and each topic tag in the cleaned library is used as a category to be recognized by the multimodal model 200. Two frames extracted from the short video and the 64-dimensional user feature are input into the multimodal model 200, and the probability of each category (i.e. each candidate tag) is obtained from the Softmax operation. The categories are sorted by probability, the 10 categories with the highest probability are determined, the topic labels corresponding to these categories are taken as the topic labels to be recommended, and they are delivered to the user's editing interface. For details on how the multimodal model 200 is applied to predict topic labels, reference is made to the embodiment shown in FIG. 3 or FIG. 4 below. In addition, the cleaned topic label library contains tens of thousands of topic labels covering most scenarios in daily life, so classification by the multimodal model 200 can meet the need of finding matched topic labels for various scenarios.
Fig. 3 is a flowchart illustrating a topic tag recommendation method according to an exemplary embodiment, where the topic tag recommendation method is used in an electronic device as shown in fig. 3, and includes the following steps.
In step S32, the electronic device acquires a video.
In some embodiments, the video is a short video.
In step S34, the electronic device extracts at least one frame of image from the video.
In step S36, the electronic device obtains, according to the user account of the uploaded video, a user characteristic corresponding to the user account.
It should be noted that, in this embodiment, the order of step S34 and step S36 is not limited. In some embodiments, steps S34 and S36 may be performed sequentially. For example, step S34 may be executed first, and then step S36 may be executed; step S36 may be executed first, and then step S34 may be executed. In other embodiments, step S34 and step S36 may be executed in parallel, that is, step S34 and step S36 may be executed simultaneously.
In step S38, the electronic device generates a topic label matching the video according to at least one frame of image and the user characteristics.
In step S39, the electronic device recommends a hashtag to the user account.
The embodiment provides a method for recommending topic labels based on the characteristics of videos in multiple modalities, wherein the topic labels are automatically generated by a machine through images in the videos and the user characteristics of video producers, and are recommended to users. On one hand, the characteristics of the video in the image modality are utilized when the topic tag is generated, so that the machine can understand the content of the video from the visual angle according to the characteristics of the image modality, and the topic tag is ensured to be matched with the content of the video. On the other hand, the characteristics of the video in the user mode are utilized when the topic tag is generated, so that the machine can learn the information of the video producer according to the characteristics of the user mode, and the topic tag can be ensured to reflect the information of the video producer. The recommended topic tag is matched with the content of the video, and the information of a video producer is reflected, so that the matching degree between the topic tag and the video is fully ensured, the accuracy of the topic tag is improved, the recommended topic tag is closer to the intention of a user, and the accuracy of the processes of content identification, video distribution, video recommendation and the like of the video according to the video tag is improved.
Fig. 4 is a flowchart illustrating a topic tag recommendation method according to an exemplary embodiment, where as shown in fig. 4, an interaction subject of the topic tag recommendation method includes a terminal and a server, and includes the following steps.
In step S401, the terminal transmits a topic tag recommendation request to the server.
The topic tag recommendation request is used to request the server to recommend topic tags matched with the video. For example, the terminal displays a video publishing interface that includes a topic option. The user triggers a click operation on the topic option. In response to the click operation, the terminal generates a topic tag recommendation request and sends it to the server.
In step S402, the server acquires a video.
The video is for example captured by a terminal. In some embodiments, when the terminal sends the topic tag recommendation request, the terminal sends videos to be published to the server together, and the server receives the videos sent by the terminal, so that the videos are obtained. In other embodiments, the server determines a user account logged in by the terminal, and obtains a video historically issued by the user account.
In step S403, the server extracts at least one frame of image from the video.
Specifically, there are many ways to extract frame images from the video for generating the topic tag. In some embodiments, the images extracted from the video are frames taken at equal intervals in the video. In some embodiments, the images extracted from the video include key frames of the video. In some embodiments, the image extracted from the video includes the video cover. In some embodiments, the image extracted from the video includes the video header.
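A minimal sketch of the equally spaced variant using OpenCV; the number of frames and the exact sampling positions are assumptions, and the other variants (key frames, cover, header) would simply replace the sampling logic.

    import cv2

    def extract_frames(video_path, num_frames=2):
        """Extracts num_frames roughly equally spaced frames from the video."""
        capture = cv2.VideoCapture(video_path)
        total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        step = max(total // (num_frames + 1), 1)            # equal spacing inside the video
        frames = []
        for i in range(1, num_frames + 1):
            capture.set(cv2.CAP_PROP_POS_FRAMES, i * step)  # jump to the i-th sampling position
            ok, frame = capture.read()
            if ok:
                frames.append(frame)                        # BGR image as a numpy array
        capture.release()
        return frames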
In step S404, the server obtains a user characteristic corresponding to the user account according to the user account for uploading the video.
In some embodiments, the user characteristics are mapped from user attributes. For example, the server acquires the user attributes corresponding to the user account, and performs linear mapping and nonlinear mapping on the user attributes to obtain the user characteristics. Linear mapping includes, for example, multiplying by a weight matrix and adding an offset. Nonlinear mapping includes, without limitation, taking a maximum, applying an activation function, and the like.
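A minimal sketch of this mapping, assuming the user attributes have already been encoded as a numeric vector (for example normalized age plus one-hot gender); a single linear layer followed by a ReLU is one concrete choice of multiplying by a weight matrix, adding an offset and applying an activation function.

    import torch.nn as nn

    class UserAttributeEncoder(nn.Module):
        """Linear mapping (weight matrix and offset) followed by a nonlinear activation."""
        def __init__(self, attr_dim=16, out_dim=64):    # dimensions are illustrative
            super().__init__()
            self.linear = nn.Linear(attr_dim, out_dim)  # y = W x + b
            self.activation = nn.ReLU()                 # nonlinear mapping via an activation function

        def forward(self, attributes):                  # attributes: (batch, attr_dim) numeric vector
            return self.activation(self.linear(attributes))  # (batch, out_dim) user characteristic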
Since the user characteristics are mapped from the user attributes, they can express attribute-related information about the user. When generating the topic label according to the user characteristics, the machine can therefore learn the user's attribute information from those characteristics, so that the generated topic label reflects this information. Compared with generating the topic label from the image alone, the topic label generated from the image and the user attributes is more refined, ensuring that the topic label delivered to the user is closer to the user's intention. For example, suppose the video that needs a recommended topic tag is a video of a woman dancing. If a purely visual scheme (i.e. using only the images and not the user characteristics) is used to generate topic labels for this video, the generated labels may be generalized ones such as "dance" or "folk dance". After the user characteristics mapped from the user attributes are superimposed on the image basis, the model can learn ethnicity-related information from the user's attribute in the ethnicity dimension, and thus provide refined recommended labels such as "Dai dance", which are clearly closer to the user's intention.
In some embodiments, the user characteristics are determined from historical behavior of the user. For example, the server acquires historical behavior data corresponding to a user account; and the server performs linear mapping and nonlinear mapping on the historical behavior data to obtain the user characteristics.
Since the user characteristics are mapped by the historical behavior data, the user characteristics can express the information related to the historical behavior of the user, or the habit of the user. Therefore, when generating the topic label according to the user characteristics, the machine can learn the information of the user in the historical behavior according to the user characteristics, so that the generated topic label embodies the information of the user in the historical behavior. Therefore, compared with a mode of generating the topic label by only depending on the image, the topic label generated by the image and the historical behavior data is more refined, and the topic label issued to the user is ensured to be closer to the intention of the user.
In step S405, the server generates a topic tag matching the video from at least one frame of image and the user characteristics.
In some embodiments, the topic tags are generated based on the multimodal model 200. For example, the server inputs at least one frame of image and user characteristics into the multimodal model 200, processes the at least one frame of image and user characteristics by the multimodal model 200, and outputs a topic label.
In some embodiments, the generation process of the topic tag includes the following steps (1) to (4). For example, one or more of the following steps (1) through (4) are performed by respective modules in the multimodal model 200.
In the step (1), the server respectively extracts the features of at least one frame of image to obtain the image features of each frame of image.
In some embodiments, the process of image feature extraction is implemented by an image processing module in multimodal model 200. For example, the server inputs at least one frame of image into at least one image processing module respectively, and performs feature extraction on at least one frame of image through at least one image processing module respectively to obtain image features of each frame of image.
In the case where the image processing module is a classification network, in some embodiments the server performs convolution processing on the at least one frame of image in parallel through at least one classification network and outputs the image features of each frame. Parallel convolution processing means, for example, that multiple classification networks perform convolution at the same time; for example, the server convolves image 1 through classification network 1 while convolving image 2 through classification network 2.
By using at least one classification network for parallel convolution processing, the feature extraction process of a plurality of paths of images is parallelized, so that the feature extraction process of the video in the whole image mode is accelerated, and the efficiency of predicting the topic label is improved.
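One simple way to realize this parallelism is sketched below. It assumes the same backbone weights are shared across the frames, so that the frames can be stacked into a single batch and processed in one forward pass; per-frame networks with separate weights would instead run their forward passes concurrently.

    import torch

    def extract_image_features(backbone, frames):
        """Runs feature extraction on all frames in parallel as a single batch."""
        batch = torch.stack(frames, dim=0)     # frames: list of preprocessed (3, H, W) tensors
        with torch.no_grad():                  # inference only; omit when training
            features = backbone(batch)         # (num_frames, 512), one image feature per frame
        return list(features)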
In the step (2), the server fuses the image features of at least one frame of image with the user features to obtain fusion features.
In some embodiments, the process of feature fusion is implemented through the feature fusion layer in the multimodal model 200. Specifically, the server inputs the image features of the at least one frame of image and the user features into the feature fusion layer, and fuses them through the feature fusion layer to obtain the fused features.
In some embodiments, the server performs self-attention operations on the image features and the user features of at least one frame of image through the multi-head attention network to obtain the fusion features. For example, the image features and the user features of at least one frame of image are respectively input into the attention modules corresponding to the multi-head attention network, and the image features and the user features of at least one frame of image are subjected to self-attention operation in parallel through the plurality of attention modules.
In step (3), the server determines the probability of a plurality of candidate tags according to the fusion features.
The probability of the candidate tag is used to indicate the likelihood that the candidate tag is a topic tag that matches the video. The higher the probability of the candidate tag, the greater the likelihood that the candidate tag is a topic tag that matches the video. The probability determination process is implemented, for example, by the fully connected layers in multimodal model 200 and by Softmax operations. For example, the server inputs the fusion features into the full-link layer, maps the fusion features through the full-link layer, and calculates the output of the full-link layer through a Softmax function to obtain the probabilities of the plurality of candidate tags.
In step (4), the server determines a topic tag among the plurality of candidate tags according to the probability of each candidate tag.
For example, the server sorts the plurality of candidate tags in order of decreasing probability according to the probability of each candidate tag, selects the candidate tag with the probability of the top preset number of bits from the plurality of candidate tags, and takes each candidate tag with the probability of the top preset number of bits as the topic tag to be recommended. The number of the preset bits is, for example, 10. For example, the server selects a candidate tag having the highest probability from a plurality of candidate tags, and sets the candidate tag having the highest probability as a topic tag to be recommended.
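A minimal sketch of steps (3) and (4), assuming the logits for a single video are available as a one-dimensional tensor and that the preset number of recommendations is 10.

    import torch

    def recommend_tags(logits, candidate_tags, k=10):
        """Selects the k candidate tags with the highest predicted probability."""
        probabilities = torch.softmax(logits, dim=-1)      # step (3): probability of each candidate tag
        top_prob, top_idx = torch.topk(probabilities, k)   # step (4): k most likely candidates, highest first
        return [(candidate_tags[i], float(p)) for i, p in zip(top_idx.tolist(), top_prob)]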
Through the steps (1) to (4), the image features of the multi-frame images in the video are fused with the user features, so that the fused features not only comprise the visual features of the video, but also comprise the user features. Therefore, the fusion features not only express the content of the video, but also express the producer information of the video, and obviously, the expression capability of the fusion features is stronger, so that the probability of each candidate label can be predicted more accurately by using the fusion features, thereby ensuring that the topic labels determined in the candidate labels are more accurate, and leading the recommended topic labels to be closer to the intention of the user.
In some embodiments, the server also predicts the topic tags using a geographic location modality. For example, the server acquires geographic location information corresponding to the video, and the server generates a topic label matched with the video according to the at least one frame of image, the user features, and the geographic location information. For example, referring to fig. 2, the server inputs the geographic location information as an additional feature into the multimodal model 200. The server processes the geographic location information through the feature processing module 215 in the multimodal model 200 and outputs the geographic location features. The server inputs the image features of the at least one frame of image, the user features, and the geographic location features into the feature fusion layer, and these features are fused through the feature fusion layer to obtain the fusion features.
In this way, when the topic tag is generated, not only the characteristics of the video in the image modality and the characteristics of the video in the user modality are utilized, but also the characteristics of the video in the geographic position modality are utilized, so that the generated topic tag is ensured to reflect the geographic position corresponding to the video, the matching degree between the recommended topic tag and the video is improved, the recommended topic tag is ensured to be more refined, and the recommended topic tag is closer to the intention of the user.
In some embodiments, the server also predicts topic tags using a time modality. For example, the server acquires time information corresponding to the video, and the server generates a topic label matched with the video according to the at least one frame of image, the user features, and the time information. For example, referring to fig. 2, the server inputs the time information as an additional feature into the multimodal model 200. The server processes the time information through the feature processing module 215 in the multimodal model 200 and outputs the time features. The server inputs the image features of the at least one frame of image, the user features, and the time features into the feature fusion layer, and these features are fused through the feature fusion layer to obtain the fusion features.
In this way, when the topic tag is generated, the characteristics of the video in the image modality and the characteristics of the video in the user modality are utilized, and the characteristics of the video in the time modality are also utilized, so that the generated topic tag is ensured to reflect the time corresponding to the video, the matching degree between the recommended topic tag and the video is improved, the recommended topic tag is ensured to be more refined, and the recommended topic tag is closer to the intention of the user.
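A hypothetical sketch of how the geographic location and time modalities could enter the same fusion layer as extra feature tokens; the raw encodings (a latitude/longitude pair and an hour-of-day value) and the linear projections standing in for the feature processing module 215 are assumptions for illustration only.

```python
# Hypothetical sketch: projecting geographic location and time information into the
# shared feature space and adding them as extra tokens before fusion. The raw
# encodings and linear projections are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 256
geo_projection = nn.Linear(2, d_model)    # stand-in for the feature processing module (geo)
time_projection = nn.Linear(1, d_model)   # stand-in for the feature processing module (time)

geo_token = geo_projection(torch.tensor([[39.9, 116.4]]))   # latitude, longitude
time_token = time_projection(torch.tensor([[20.0]]))        # e.g. hour of day

frame_features = torch.rand(1, 8, d_model)
user_feature = torch.rand(1, 1, d_model)
tokens = torch.cat(
    [frame_features, user_feature, geo_token.unsqueeze(1), time_token.unsqueeze(1)],
    dim=1,
)   # (1, 11, d_model) token sequence fed to the feature fusion layer
```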
In step S406, the server recommends a hashtag to the user account.
In step S407, the terminal displays the topic tag.
Specifically, the server transmits the topic tag to the terminal on which the user account is logged in. The terminal receives the topic tag and displays it in a topic tag adding interface. For example, the adding interface includes a search box, and the terminal displays the topic tag in the search box. The user triggers a confirmation operation on the topic tag displayed by the terminal, and the terminal sends a topic tag addition request to the server. The server adds the topic tag to the video in response to the topic tag addition request.
In the case where the server generates a plurality of topic tags, in some embodiments, the terminal displays each of the plurality of topic tags. For example, the terminal displays a plurality of options, each option including a topic tag. The user triggers a confirmation operation on a target option among the plurality of options. In response to the confirmation operation, the terminal determines the target topic tag in the target option, carries the target topic tag in a topic tag addition request, and sends the topic tag addition request to the server. The server, in response to the topic tag addition request, obtains the target topic tag from the request and adds it to the video. In this way, recommended topic tags are delivered to the user together with a selection function, so that the user can choose a preferred topic tag from the recommended ones; this extends the topic tag adding service and can increase the usage of topic tags.
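The terminal/server exchange described above can be sketched as follows; the request fields, identifiers, handler name, and in-memory video store are hypothetical and only illustrate the flow of carrying the user-confirmed target topic tag in a topic tag addition request.

```python
# Hypothetical sketch of the topic tag addition flow: the terminal sends a request
# carrying the user-confirmed target topic tag, and the server adds that tag to the
# video. Field names, identifiers, and storage are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VideoRecord:
    video_id: str
    topic_tags: List[str] = field(default_factory=list)

videos: Dict[str, VideoRecord] = {"v123": VideoRecord("v123")}

def handle_topic_tag_addition(request: dict) -> dict:
    """Server side: read the target topic tag from the request and add it to the video."""
    video = videos[request["video_id"]]
    target_tag = request["target_topic_tag"]
    if target_tag not in video.topic_tags:
        video.topic_tags.append(target_tag)
    return {"status": "ok", "topic_tags": video.topic_tags}

# Terminal side: the user confirmed one option among the recommended topic tags.
response = handle_topic_tag_addition({"video_id": "v123", "target_topic_tag": "#citywalk"})
```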
The embodiment provides a method for recommending topic labels based on the characteristics of videos in multiple modalities, wherein the topic labels are automatically generated by a machine through images in the videos and the user characteristics of video producers, and are recommended to users. On one hand, the characteristics of the video in the image modality are utilized when the topic tag is generated, so that the machine can understand the content of the video from the visual angle according to the characteristics of the image modality, and the topic tag is ensured to be matched with the content of the video. On the other hand, the characteristics of the video in the user mode are utilized when the topic tag is generated, so that the machine can learn the information of the video producer according to the characteristics of the user mode, and the topic tag can be ensured to reflect the information of the video producer. The recommended topic label is matched with the content of the video, and the information of a video producer is reflected, so that the matching degree between the recommended topic label and the video is improved, the recommended topic label is more refined, the recommended topic label is closer to the intention of a user, and the accuracy of the processes of identifying, distributing and recommending the content of the video according to the video label is improved.
Fig. 5 is a block diagram illustrating a topic tag recommendation apparatus in accordance with an exemplary embodiment. Referring to fig. 5, the apparatus includes a first acquisition unit 501, an extraction unit 502, a second acquisition unit 503, a generation unit 504, and a recommendation unit 505.
A first acquisition unit 501 configured to perform acquisition of a video;
an extracting unit 502 configured to perform extracting at least one frame of image from a video;
a second obtaining unit 503, configured to execute obtaining, according to the user account that uploaded the video, a user characteristic corresponding to the user account;
a generating unit 504 configured to generate a topic label matched with the video according to at least one frame of image and the user feature;
a recommending unit 505 configured to execute recommending the topic label to the user account.
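For orientation only, a stubbed sketch of how the five units of fig. 5 could cooperate as one pipeline; the class, method names, and placeholder return values are assumptions, not the actual apparatus implementation.

```python
# Hypothetical, stubbed sketch of the data flow among the five units of fig. 5:
# acquire video -> extract frames -> acquire user features -> generate tags -> recommend.
from typing import List

class TopicTagRecommender:
    def acquire_video(self, video_id: str) -> bytes:                 # first acquisition unit 501
        return b"...video bytes..."

    def extract_frames(self, video: bytes) -> List[bytes]:           # extraction unit 502
        return [b"frame-0", b"frame-1"]

    def acquire_user_features(self, account: str) -> List[float]:    # second acquisition unit 503
        return [0.1, 0.7, 0.2]

    def generate_tags(self, frames: List[bytes], user_features: List[float]) -> List[str]:
        return ["#travel", "#food"]                                  # generation unit 504 (stub)

    def recommend(self, account: str, tags: List[str]) -> None:      # recommendation unit 505
        print(f"recommend {tags} to {account}")

    def run(self, video_id: str, account: str) -> None:
        video = self.acquire_video(video_id)
        frames = self.extract_frames(video)
        user_features = self.acquire_user_features(account)
        tags = self.generate_tags(frames, user_features)
        self.recommend(account, tags)
```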
The embodiment provides a device for recommending the topic labels based on the characteristics of the video in multiple modalities, wherein the topic labels are automatically generated by a machine through images in the video and the user characteristics of a video producer, and are recommended to a user. On one hand, the characteristics of the video in the image modality are utilized when the topic tag is generated, so that the machine can understand the content of the video from the visual angle according to the characteristics of the image modality, and the topic tag is ensured to be matched with the content of the video. On the other hand, the characteristics of the video in the user mode are utilized when the topic tag is generated, so that the machine can learn the information of the video producer according to the characteristics of the user mode, and the topic tag can be ensured to reflect the information of the video producer. The recommended topic tag is matched with the content of the video, and the information of a video producer is reflected, so that the matching degree between the topic tag and the video is fully ensured, the accuracy of the topic tag is improved, the recommended topic tag is closer to the intention of a user, and the accuracy of the processes of content identification, video distribution, video recommendation and the like of the video according to the video tag is improved.
Optionally, the generating unit 504 is configured to perform feature extraction on at least one frame of image respectively, so as to obtain image features of each frame of image; fusing image characteristics of at least one frame of image with user characteristics to obtain fused characteristics; determining the probability of a plurality of candidate labels according to the fusion characteristics; determining a topic tag among the plurality of candidate tags according to the probability of each candidate tag.
Optionally, the generating unit 504 is configured to perform a self-attention operation on the image feature and the user feature of at least one frame of image through the multi-head attention network, so as to obtain a fusion feature.
Optionally, the generating unit 504 is configured to perform a convolution process on at least one frame of image through at least one classification network in parallel, and output an image feature of each frame of image.
Optionally, the second obtaining unit 503 is configured to perform obtaining of a user attribute corresponding to the user account; and carrying out linear mapping and nonlinear mapping on the user attributes to obtain the user characteristics.
Optionally, the second obtaining unit 503 is configured to perform obtaining of historical behavior data corresponding to the user account; and performing linear mapping and nonlinear mapping on the historical behavior data to obtain the user characteristics.
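A minimal sketch of the linear mapping followed by a nonlinear mapping mentioned in the two configurations above, applied here to user attributes; the attribute encoding, dimensions, and ReLU nonlinearity are illustrative assumptions, and the same mapping would apply analogously to historical behavior data.

```python
# Hypothetical sketch: user attributes -> linear mapping -> nonlinear mapping -> user feature.
# The attribute encoding, dimensions, and ReLU nonlinearity are illustrative assumptions.
import torch
import torch.nn as nn

user_attributes = torch.tensor([[25.0, 1.0, 3.0, 0.0]])   # e.g. age, gender flag, region id, ...
mapping = nn.Sequential(
    nn.Linear(4, 256),   # linear mapping
    nn.ReLU(),           # nonlinear mapping
)
user_feature = mapping(user_attributes)   # (1, 256) user feature vector
```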
Optionally, the apparatus further comprises: the third acquisition unit is configured to execute acquisition of geographic position information corresponding to the video;
a generating unit 504 configured to execute generating a topic label matched with the video according to at least one frame of image, the user feature and the geographic position information.
Optionally, the apparatus further comprises: a fourth acquiring unit configured to perform acquiring time information corresponding to the video;
a generating unit 504 configured to generate a topic label matching the video according to at least one frame of image, the user feature and the time information.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The electronic device in the above method embodiments may be implemented as a terminal or a server. For example, fig. 6 shows a block diagram of a terminal 600 provided in an exemplary embodiment of the present application. The terminal 600 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 600 includes: one or more processors 601 and one or more memories 602.
The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 602 is used to store at least one program code for execution by the processor 601 to implement the topic tag recommendation method provided by the method embodiments herein.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera assembly 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 605, providing the front panel of the terminal 600; in other embodiments, there may be at least two displays 605, respectively disposed on different surfaces of the terminal 600 or in a folded design; in other embodiments, the display 605 may be a flexible display disposed on a curved surface or a folded surface of the terminal 600. Furthermore, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display 605 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.
The positioning component 608 is used to locate the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 609 is used to provide power to the various components in terminal 600. The power supply 609 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 609 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 613 may be disposed on the side bezel of terminal 600 and/or underneath display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a user's holding signal of the terminal 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, processor 601 may control the display brightness of display screen 605 based on the ambient light intensity collected by optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is adjusted down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
The proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front face of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually decreases, the processor 601 controls the display 605 to switch from the bright-screen state to the screen-off state; when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually increases, the processor 601 controls the display 605 to switch from the screen-off state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is not intended to be limiting of terminal 600 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The electronic device in the above method embodiments may also be implemented as a server. For example, fig. 7 is a schematic structural diagram of a server provided by the present disclosure. The server 700 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 701 and one or more memories 702, where the memory 702 stores at least one program code that is loaded and executed by the processor 701 to implement the topic tag recommendation method provided by each of the above method embodiments. Of course, the server may also have a wired or wireless network interface, an input/output interface, and other components to facilitate input and output, and the server may further include other components for implementing the functions of the device, which are not described in detail here.
In an exemplary embodiment, there is also provided a storage medium, such as a memory, including program code executable by a processor of an electronic device to perform the above-described topic tag recommendation method. Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
The user information involved in the present disclosure may be information authorized by the user or fully authorized by all parties.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for topic tag recommendation, the method comprising:
acquiring a video;
extracting at least one frame of image from the video;
acquiring user characteristics corresponding to a user account according to the user account for uploading the video;
generating a topic label matched with the video according to the at least one frame of image and the user characteristics;
recommending the topic label to the user account.
2. The topic tag recommendation method according to claim 1, wherein the generating a topic tag matching the video according to the at least one frame of image and the user feature comprises:
respectively extracting the features of the at least one frame of image to obtain the image features of each frame of image;
fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fused characteristics;
determining the probability of a plurality of candidate labels according to the fusion characteristics;
determining the topic tag among the plurality of candidate tags according to the probability of each candidate tag.
3. The topic tag recommendation method according to claim 2, wherein the fusing the image feature of the at least one frame of image with the user feature to obtain a fused feature comprises:
and respectively carrying out self-attention operation on the image characteristics of the at least one frame of image and the user characteristics through a multi-head attention network to obtain the fusion characteristics.
4. The topic tag recommendation method of claim 1, wherein after the obtaining the video, the method further comprises: acquiring geographic position information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the geographic position information.
5. The topic tag recommendation method of claim 1, wherein after the obtaining the video, the method further comprises: acquiring time information corresponding to the video;
the generating a topic label matched with the video according to the at least one frame of image and the user characteristics comprises: and generating a topic label matched with the video according to the at least one frame of image, the user characteristics and the time information.
6. A topic tag recommendation apparatus, comprising:
a first acquisition unit configured to perform acquisition of a video;
an extraction unit configured to perform extraction of at least one frame of image from the video;
the second acquisition unit is configured to acquire, according to a user account uploading the video, a user characteristic corresponding to the user account;
a generating unit configured to generate a topic label matched with the video according to the at least one frame of image and the user feature;
a recommending unit configured to perform recommending the topic tag to the user account.
7. The topic tag recommendation device according to claim 6, wherein the generating unit is configured to perform feature extraction on the at least one frame of image respectively to obtain image features of each frame of image; fusing the image characteristics of the at least one frame of image with the user characteristics to obtain fused characteristics; determining the probability of a plurality of candidate labels according to the fusion characteristics; determining the topic tag among the plurality of candidate tags according to the probability of each candidate tag.
8. The topic tag recommendation device according to claim 7, wherein the generating unit is configured to perform a self-attention operation on the image feature of the at least one frame of image and the user feature through a multi-head attention network to obtain the fusion feature.
9. An electronic device, comprising:
one or more processors;
one or more memories for storing program code executable by the one or more processors;
wherein the one or more processors are configured to execute the program code to implement the topic tag recommendation method of any one of claims 1 to 5.
10. A storage medium characterized in that, when program code in the storage medium is executed by a processor of an electronic device, the electronic device is enabled to execute the topic tag recommendation method according to any one of claims 1 to 5.
CN202010797673.XA 2020-08-10 2020-08-10 Topic label recommendation method, device, equipment and storage medium Active CN111897996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010797673.XA CN111897996B (en) 2020-08-10 2020-08-10 Topic label recommendation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010797673.XA CN111897996B (en) 2020-08-10 2020-08-10 Topic label recommendation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111897996A true CN111897996A (en) 2020-11-06
CN111897996B CN111897996B (en) 2023-10-31

Family

ID=73246826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010797673.XA Active CN111897996B (en) 2020-08-10 2020-08-10 Topic label recommendation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111897996B (en)

Patent Citations (11)

Publication number Priority date Publication date Assignee Title
US8239418B1 (en) * 2007-11-01 2012-08-07 Google Inc. Video-related recommendations using link structure
CN106357517A (en) * 2016-09-27 2017-01-25 腾讯科技(北京)有限公司 Directional label generation method and device
US20190122260A1 (en) * 2016-09-27 2019-04-25 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating targeted label, and storage medium
CN109241344A (en) * 2018-08-31 2019-01-18 北京字节跳动网络技术有限公司 Method and apparatus for handling information
CN109614482A (en) * 2018-10-23 2019-04-12 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of label
US20210201147A1 (en) * 2018-11-28 2021-07-01 Tencent Technology (Shenzhen) Company Limited Model training method, machine translation method, computer device, and storage medium
CN111277485A (en) * 2018-12-05 2020-06-12 丰田自动车株式会社 Information processing device and information processing program
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN110990624A (en) * 2019-12-13 2020-04-10 上海喜马拉雅科技有限公司 Video recommendation method, device, equipment and storage medium
CN111259245A (en) * 2020-01-16 2020-06-09 腾讯音乐娱乐科技(深圳)有限公司 Work pushing method and device and storage medium
CN111400517A (en) * 2020-03-20 2020-07-10 北京字节跳动网络技术有限公司 Information pushing and information publishing method and device

Cited By (8)

Publication number Priority date Publication date Assignee Title
CN112364204A (en) * 2020-11-12 2021-02-12 北京达佳互联信息技术有限公司 Video searching method and device, computer equipment and storage medium
CN112364204B (en) * 2020-11-12 2024-03-12 北京达佳互联信息技术有限公司 Video searching method, device, computer equipment and storage medium
CN114578999A (en) * 2020-11-16 2022-06-03 深圳市万普拉斯科技有限公司 Image sharing method and device and terminal equipment
CN112612949A (en) * 2020-12-15 2021-04-06 北京达佳互联信息技术有限公司 Establishment method and device of recommended data set
CN112734035A (en) * 2020-12-31 2021-04-30 成都佳华物链云科技有限公司 Data processing method and device and readable storage medium
CN112734035B (en) * 2020-12-31 2023-10-27 成都佳华物链云科技有限公司 Data processing method and device and readable storage medium
CN113761226A (en) * 2021-11-10 2021-12-07 中国电子科技集团公司第二十八研究所 Ontology construction method of multi-modal airport data
CN114071185A (en) * 2021-11-16 2022-02-18 北京百度网讯科技有限公司 Video stream issuing method, related device and computer program product

Also Published As

Publication number Publication date
CN111897996B (en) 2023-10-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant